Neural Architecture Search
Design neural architecture search systems that automatically discover strong model architectures using dynamic programming and path optimization: the same principles behind grid path counting, scaled to exponential search spaces.
Problem Statement
Design a Neural Architecture Search (NAS) System that:
- Automatically discovers neural network architectures that outperform hand-designed models
- Searches efficiently through exponentially large search spaces
- Optimizes multiple objectives (accuracy, latency, memory, FLOPS)
- Scales to production (finds models deployable on mobile/edge devices)
- Supports different domains (vision, NLP, speech, multi-modal)
Functional Requirements
- Search space definition:
- Define architecture space (layers, operations, connections)
- Support modular search (cells, blocks, stages)
- Enable constrained search (max latency, max params)
- Search strategy:
- Reinforcement learning (controller RNN)
- Evolutionary algorithms (mutation, crossover)
- Gradient-based (DARTS, differentiable NAS)
- Random search (baseline)
- Bayesian optimization
- Performance estimation:
- Train architectures to evaluate quality
- Early stopping for bad candidates
- Weight sharing / one-shot models (ENAS, DARTS)
- Performance predictors (surrogate models)
- Multi-objective optimization:
- Accuracy vs latency
- Accuracy vs model size
- Accuracy vs FLOPS
- Pareto frontier identification
- Distributed search:
- Parallel architecture evaluation
- Distributed training of candidates
- Efficient resource allocation
- Transfer and reuse:
- Transfer architectures across tasks
- Re-use components from previous searches
- Meta-learning for search initialization
Non-Functional Requirements
- Efficiency: Find a good architecture in under 100 GPU days (versus months of manual design)
- Quality: Discovered models competitive with hand-designed ones
- Generalizability: Architectures transfer across datasets
- Interpretability: Understand why a discovered architecture works
- Reproducibility: Same search produces same results
Understanding the Requirements
Why NAS?
Manual architecture design is:
- Time-consuming (months of expert effort)
- Limited by human intuition and expertise
- Hard to optimize for specific constraints (mobile latency, memory)
- Difficult to explore unconventional designs
NAS automates:
- Architecture discovery
- Multi-objective optimization
- Hardware-aware design
- Cross-domain transfer
Scale of the Problem
Search space size:
- A simple NAS space with 6 layers, 5 operations per layer: 5^6 = 15,625 architectures
- NASNet search space: ~10^18 possible architectures
- Without smart search, it is infeasible to evaluate them all
Computational cost:
- Training one model: 1-10 GPU days
- Naive search (10K architectures): 10K-100K GPU days
- Smart search (NAS): 100-1000 GPU days
- Goal: Reduce by 10-100x through efficient search
The Path Optimization Connection
Just like Unique Paths counts paths through a grid using DP:
| Unique Paths | Neural Architecture Search | Speech Arch Search |
|---|---|---|
| m×n grid | Layer×operation space | Encoder×decoder configs |
| Count all paths | Count/evaluate architectures | Evaluate speech models |
| DP optimization | DP/RL search | DP search |
| O(m×n) vs O(2^(m+n)) | Smart search vs brute force | Efficient vs exhaustive |
| Path reconstruction | Architecture construction | Model construction |
Both use dynamic programming / smart search to navigate exponentially large spaces efficiently.
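To make the mapping concrete, here is a minimal, dependency-free sketch showing the same build-one-step-at-a-time recurrence in both settings (the function names are illustrative):

def unique_paths(m: int, n: int) -> int:
    """Count monotone grid paths with DP: O(m*n) instead of O(2^(m+n))."""
    dp = [[1] * n for _ in range(m)]
    for i in range(1, m):
        for j in range(1, n):
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1]
    return dp[-1][-1]

def count_sequential_architectures(num_layers: int, ops_per_layer: int) -> int:
    """Same layer-by-layer recurrence: ways(L) = ways(L - 1) * ops_per_layer."""
    ways = 1
    for _ in range(num_layers):
        ways *= ops_per_layer
    return ways

print(unique_paths(3, 7))                    # 28
print(count_sequential_architectures(6, 5))  # 15,625 -- the example above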
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Neural Architecture Search System │
└─────────────────────────────────────────────────────────────────┘
Search Controller
┌──────────────────────────────────┐
│ Strategy: RL / EA / DARTS │
│ - Propose architectures │
│ - Update based on performance │
└──────────────┬───────────────────┘
│
↓
┌────────────────┐
│ Search Space │
│ - Layers │
│ - Operations │
│ - Connections │
└────────┬───────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───────▼────────┐ ┌──────▼──────┐ ┌────────▼────────┐
│ Architecture │ │ Performance │ │ Multi-obj │
│ Evaluator │ │ Predictor │ │ Optimizer │
│ │ │ │ │ │
│ - Train model │ │ - Surrogate │ │ - Pareto │
│ - Measure acc │ │ - Skip bad │ │ - Constraints │
│ - Measure lat │ │ candidates│ │ - Trade-offs │
└───────┬────────┘ └──────┬──────┘ └────────┬────────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌────────▼────────┐
│ Distributed │
│ Training │
│ - Worker pool │
│ - GPU cluster │
└────────┬────────┘
│
┌────────▼────────┐
│ Results Store │
│ - Architectures│
│ - Metrics │
│ - Models │
└─────────────────┘
Key Components
- Search Controller: Proposes new architectures to try
- Search Space: Defines valid architecture configurations
- Architecture Evaluator: Trains and evaluates architectures
- Performance Predictor: Estimates performance without full training
- Multi-objective Optimizer: Balances accuracy, latency, size
- Distributed Training: Parallel evaluation of architectures
- Results Store: Tracks all evaluated architectures
Component Deep-Dives
1. Search Space Definition
Define what architectures are possible:
from typing import Dict, List, Optional
from dataclasses import dataclass
@dataclass
class Operation:
"""A single operation in the search space."""
name: str
params: Dict
@dataclass
class SearchSpace:
"""
NAS search space definition.
Similar to Unique Paths grid:
- Grid dimensions → num_layers, ops_per_layer
- Paths through grid → architectures through search space
"""
num_layers: int
operations: List[Operation]
connections: str # "sequential", "skip", "dense"
    def count_architectures(self) -> Optional[int]:
        """
        Count total possible architectures.

        Like counting paths in Unique Paths:
        - If sequential: ops_per_layer ^ num_layers
        - With skip connections: combinatorially larger
        """
        if self.connections == "sequential":
            return len(self.operations) ** self.num_layers
        # With skip connections the count explodes combinatorially;
        # return None rather than a misleading sentinel value
        return None
# Example search space
MOBILENET_SEARCH_SPACE = SearchSpace(
num_layers=20,
operations=[
Operation("conv3x3", {"kernel_size": 3, "stride": 1}),
Operation("conv5x5", {"kernel_size": 5, "stride": 1}),
Operation("depthwise_conv3x3", {"kernel_size": 3}),
Operation("depthwise_conv5x5", {"kernel_size": 5}),
Operation("maxpool3x3", {"kernel_size": 3, "stride": 1}),
Operation("skip", {}),
],
connections="skip"
)
def encode_architecture(arch_ops: List[str]) -> str:
    """
    Encode an architecture as a string (for hashing/caching results).

    Args:
        arch_ops: List of operation names, one per layer
    Returns:
        String encoding
    """
    return "-".join(arch_ops)
def decode_architecture(arch_string: str) -> List[str]:
"""Decode architecture string to operation list."""
return arch_string.split("-")
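A quick usage sketch tying these pieces together (the sequential variant is exactly countable; the skip-connected space above is not):

# Sequential variant of the mobile space: exactly countable
seq_space = SearchSpace(
    num_layers=6,
    operations=MOBILENET_SEARCH_SPACE.operations,
    connections="sequential",
)
print(seq_space.count_architectures())  # 6 ops ^ 6 layers = 46,656

arch = ["conv3x3", "skip", "depthwise_conv5x5", "conv3x3", "maxpool3x3", "skip"]
encoded = encode_architecture(arch)
assert decode_architecture(encoded) == arch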
2. Search Strategy - Reinforcement Learning
Use RL controller to generate architectures:
import torch
import torch.nn as nn
class NASController(nn.Module):
"""
RNN controller that generates architectures.
Similar to DP in Unique Paths:
- Build architecture layer-by-layer
- Use previous decisions to inform next
- Optimize for high reward (accuracy)
"""
def __init__(
self,
num_layers: int,
num_operations: int,
hidden_size: int = 100
):
super().__init__()
self.num_layers = num_layers
self.num_operations = num_operations
# RNN to track state across layers
self.rnn = nn.LSTM(
input_size=num_operations,
hidden_size=hidden_size,
num_layers=1
)
# Output layer: predict operation for next layer
self.fc = nn.Linear(hidden_size, num_operations)
def forward(self):
"""
Generate an architecture.
Returns:
architecture: List of operation indices
log_probs: Log probabilities for REINFORCE
"""
architecture = []
log_probs = []
# Initial input
inputs = torch.zeros(1, 1, self.num_operations)
hidden = None
# Generate layer-by-layer (like DP building solution)
for layer in range(self.num_layers):
# RNN step
output, hidden = self.rnn(inputs, hidden)
# Predict operation for this layer
logits = self.fc(output.squeeze(0))
probs = torch.softmax(logits, dim=-1)
# Sample operation
dist = torch.distributions.Categorical(probs)
action = dist.sample()
architecture.append(action.item())
log_probs.append(dist.log_prob(action))
# Next input is one-hot of chosen operation
inputs = torch.zeros(1, 1, self.num_operations)
inputs[0, 0, action] = 1.0
return architecture, log_probs
def update(self, log_probs: List[torch.Tensor], reward: float, optimizer):
"""
Update controller using REINFORCE.
Args:
log_probs: Log probabilities of sampled actions
reward: Accuracy of generated architecture
optimizer: Controller optimizer
"""
# REINFORCE loss: -sum(log_prob * reward)
policy_loss = []
for log_prob in log_probs:
policy_loss.append(-log_prob * reward)
loss = torch.stack(policy_loss).sum()
# Update controller
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Training loop
def train_nas_controller(
controller: NASController,
search_space: SearchSpace,
num_iterations: int = 1000
):
"""
Train NAS controller to generate good architectures.
"""
    optimizer = torch.optim.Adam(controller.parameters(), lr=0.001)
    best_reward = 0.0
    for iteration in range(num_iterations):
        # Generate architecture
        arch, log_probs = controller()
        # Evaluate architecture (evaluate_architecture is a stand-in that trains a small model)
        reward = evaluate_architecture(arch, search_space)
        best_reward = max(best_reward, reward)
        # Update controller
        controller.update(log_probs, reward, optimizer)
        if iteration % 100 == 0:
            print(f"Iteration {iteration}: reward = {reward:.3f}, best = {best_reward:.3f}")
3. Search Strategy - Differentiable NAS (DARTS)
DARTS makes architecture search differentiable:
class DARTSSearchSpace(nn.Module):
"""
Differentiable architecture search.
Key idea: Instead of discrete choice, use weighted combination.
Learn weights (architecture parameters) via gradient descent.
"""
    def __init__(self, num_layers: int, operations: List[nn.Module]):
        super().__init__()
        self.num_layers = num_layers
        # Note: this sketch shares one set of operation modules across all layers;
        # full DARTS instantiates a separate set of candidate ops per edge
        self.operations = nn.ModuleList(operations)
        # Architecture parameters (a learnable weight per operation, per layer)
        self.alpha = nn.Parameter(
            torch.randn(num_layers, len(operations))
        )
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Forward pass with weighted operations.
Each layer computes weighted sum of all operations.
"""
for layer in range(self.num_layers):
# Get architecture weights for this layer
weights = torch.softmax(self.alpha[layer], dim=0)
# Compute weighted sum of operations
layer_output = sum(
w * op(x)
for w, op in zip(weights, self.operations)
)
x = layer_output
return x
def get_best_architecture(self) -> List[int]:
"""
Extract discrete architecture from learned weights.
Choose operation with highest weight per layer.
"""
architecture = []
for layer in range(self.num_layers):
weights = torch.softmax(self.alpha[layer], dim=0)
best_op = torch.argmax(weights).item()
architecture.append(best_op)
return architecture
# DARTS training (bi-level optimization)
def train_darts(search_space: DARTSSearchSpace, train_data, val_data, epochs: int = 50):
"""
Train DARTS to find optimal architecture.
Bi-level optimization:
- Inner loop: optimize model weights
- Outer loop: optimize architecture parameters
"""
    # Model-weight optimizer: exclude the architecture parameters (alpha),
    # otherwise both optimizers would update alpha
    weight_params = [
        p for name, p in search_space.named_parameters() if name != "alpha"
    ]
    model_optimizer = torch.optim.SGD(
        weight_params,
        lr=0.025,
        momentum=0.9
    )
    # Architecture parameters optimizer
    arch_optimizer = torch.optim.Adam(
        [search_space.alpha],
        lr=0.001
    )
for epoch in range(epochs):
# Update model weights on train data
for batch in train_data:
model_optimizer.zero_grad()
loss = compute_loss(search_space(batch['x']), batch['y'])
loss.backward()
model_optimizer.step()
# Update architecture parameters on val data
for batch in val_data:
arch_optimizer.zero_grad()
loss = compute_loss(search_space(batch['x']), batch['y'])
loss.backward()
arch_optimizer.step()
# Extract final architecture
best_arch = search_space.get_best_architecture()
return best_arch
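A hypothetical instantiation with a few simple candidate ops (every op must map C channels to C channels at the same resolution, since the forward pass sums their outputs):

C = 16  # channel count preserved by every candidate op
candidate_ops = [
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.Conv2d(C, C, kernel_size=5, padding=2),
    nn.Identity(),  # skip connection
]
space = DARTSSearchSpace(num_layers=4, operations=candidate_ops)
x = torch.randn(2, C, 32, 32)
out = space(x)                         # weighted mixture of ops at every layer
print(space.get_best_architecture())   # e.g. [2, 0, 0, 1]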
4. Performance Estimation Strategies
Problem: Training every architecture fully is too expensive.
Solutions:
a) Early stopping:
def evaluate_with_early_stopping(
arch: List[int],
train_data,
val_data,
max_epochs: int = 50,
patience: int = 5
):
"""
Train architecture with early stopping.
Stop if validation accuracy doesn't improve for `patience` epochs.
"""
model = build_model_from_arch(arch)
optimizer = torch.optim.Adam(model.parameters())
best_val_acc = 0.0
epochs_without_improvement = 0
for epoch in range(max_epochs):
# Train
train_one_epoch(model, train_data, optimizer)
# Validate
val_acc = evaluate(model, val_data)
if val_acc > best_val_acc:
best_val_acc = val_acc
epochs_without_improvement = 0
else:
epochs_without_improvement += 1
# Early stop
if epochs_without_improvement >= patience:
break
return best_val_acc
b) Weight sharing (ENAS):
class SuperNet(nn.Module):
"""
Super-network containing all possible operations.
Different architectures share weights.
Train once, evaluate many architectures quickly.
"""
    def __init__(self, search_space: SearchSpace):
        super().__init__()
        # One instance of every candidate op per layer; architectures that pick
        # the same op at the same layer share those weights
        # (create_operation is an assumed factory: Operation -> nn.Module)
        self.ops = nn.ModuleList([
            nn.ModuleList([
                create_operation(op)
                for op in search_space.operations
            ])
            for _ in range(search_space.num_layers)
        ])

    def forward(self, x: torch.Tensor, architecture: List[int]) -> torch.Tensor:
        """
        Forward pass for a specific architecture.

        Args:
            x: Input batch
            architecture: List of operation indices, one per layer
        """
        for layer, op_idx in enumerate(architecture):
            x = self.ops[layer][op_idx](x)
        return x
def evaluate_architecture(self, arch: List[int], val_data) -> float:
"""
Evaluate architecture without training.
Uses shared weights - much faster than training from scratch.
"""
self.eval()
total_correct = 0
total_samples = 0
with torch.no_grad():
for batch in val_data:
outputs = self.forward(batch['x'], arch)
preds = outputs.argmax(dim=1)
total_correct += (preds == batch['y']).sum().item()
total_samples += len(batch['y'])
return total_correct / total_samples
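The supernet's shared weights must be trained before evaluate_architecture means anything. A common recipe is single-path uniform sampling (one-shot NAS style): each batch, sample a random architecture and update only the ops on that path. A minimal sketch, assuming batches shaped like the earlier blocks:

import random
import torch.nn.functional as F

def train_supernet(supernet, search_space, train_data, optimizer, epochs=50):
    """One-shot training: a random path per batch, so every op gets trained."""
    num_ops = len(search_space.operations)
    supernet.train()
    for _ in range(epochs):
        for batch in train_data:
            # Sample one architecture uniformly at random (single-path sampling)
            arch = [random.randrange(num_ops) for _ in range(search_space.num_layers)]
            optimizer.zero_grad()
            logits = supernet(batch['x'], arch)
            loss = F.cross_entropy(logits, batch['y'])
            loss.backward()
            optimizer.step()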
c) Performance prediction:
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor
class PerformancePredictor:
"""
Predict architecture performance without training.
Train a surrogate model: architecture features → accuracy.
"""
    def __init__(self, num_operations: int):
        self.model = RandomForestRegressor(n_estimators=100)
        self.num_operations = num_operations
        self.trained = False

    def extract_features(self, arch: List[int]) -> np.ndarray:
        """
        Extract a fixed-length feature vector from an architecture.

        Features:
        - Count of each operation type (fixed length so dimensions match across archs)
        - Depth (number of layers)
        (Estimated FLOPs and parameter counts would be natural additions.)
        """
        op_counts = Counter(arch)
        # Range over all operations, not max(arch), so every architecture
        # maps to the same feature dimension
        features = [op_counts.get(op_idx, 0) for op_idx in range(self.num_operations)]
        # Add depth
        features.append(len(arch))
        return np.array(features)
def train(self, architectures: List[List[int]], accuracies: List[float]):
"""Train predictor on evaluated architectures."""
X = np.array([self.extract_features(arch) for arch in architectures])
y = np.array(accuracies)
self.model.fit(X, y)
self.trained = True
def predict(self, arch: List[int]) -> float:
"""Predict accuracy for architecture."""
if not self.trained:
raise ValueError("Predictor not trained")
features = self.extract_features(arch).reshape(1, -1)
return self.model.predict(features)[0]
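Usage sketch: fit the predictor on architectures already evaluated, then use it to filter candidates before paying for real training (the numbers below are purely illustrative):

predictor = PerformancePredictor(num_operations=6)

# Architectures evaluated earlier in the search, with their measured accuracies
history_archs = [[0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 2], [5, 5, 5, 5, 5, 5]]
history_accs = [0.72, 0.68, 0.55]
predictor.train(history_archs, history_accs)

candidate = [0, 1, 1, 2, 3, 5]
if predictor.predict(candidate) > 0.65:  # only fully train promising candidates
    print("worth training")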
Data Flow
NAS Pipeline
1. Initialize search
└─> Define search space
└─> Initialize controller (RL/EA/DARTS)
└─> Set up distributed workers
2. Search loop (1000-10000 iterations)
└─> Controller proposes architecture
└─> (Optional) Performance predictor filters bad candidates
└─> Evaluate architecture:
- Train on subset of data
- Measure accuracy, latency, size
└─> Update controller based on reward
└─> Store results
3. Post-processing
   └─> Identify Pareto frontier (best accuracy-latency trade-offs; see the sketch after this pipeline)
└─> Retrain top candidates on full data
└─> Final evaluation on test set
4. Deployment
└─> Export best architecture
└─> Optimize for target hardware
└─> Deploy to production
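Step 3's Pareto-frontier identification is straightforward to implement directly. A sketch for the accuracy-vs-latency case (higher accuracy better, lower latency better), using result dictionaries shaped like those the evaluation workers below return:

from typing import Dict, List

def pareto_frontier(results: List[Dict]) -> List[Dict]:
    """Keep candidates not dominated by any other on (accuracy, latency)."""
    frontier = []
    for a in results:
        dominated = any(
            b["accuracy"] >= a["accuracy"]
            and b["latency_ms"] <= a["latency_ms"]
            and (b["accuracy"] > a["accuracy"] or b["latency_ms"] < a["latency_ms"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return sorted(frontier, key=lambda r: r["latency_ms"])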
Scaling Strategies
Distributed Architecture Evaluation
from typing import Dict, List

import ray
import torch
@ray.remote(num_gpus=1)
class ArchitectureWorker:
"""Worker that evaluates architectures on GPU."""
def __init__(self, search_space: SearchSpace):
self.search_space = search_space
def evaluate(self, arch: List[int], train_subset, val_subset) -> Dict:
"""
Evaluate architecture.
Returns:
Dictionary with accuracy, latency, params, flops
"""
# Build model
model = build_model_from_arch(arch, self.search_space)
# Train briefly
train_model(model, train_subset, epochs=10)
# Evaluate
accuracy = evaluate(model, val_subset)
latency = measure_latency(model)
params = count_parameters(model)
flops = estimate_flops(model)
return {
"accuracy": accuracy,
"latency_ms": latency,
"params": params,
"flops": flops
}
class DistributedNAS:
"""Distributed NAS system."""
def __init__(self, search_space: SearchSpace, num_workers: int = 8):
self.search_space = search_space
# Create worker pool
self.workers = [
ArchitectureWorker.remote(search_space)
for _ in range(num_workers)
]
self.num_workers = num_workers
def search(self, controller: NASController, num_iterations: int = 1000):
"""
Distributed NAS search.
Args:
controller: Architecture generator
num_iterations: Number of architectures to try
"""
        controller_optimizer = torch.optim.Adam(controller.parameters(), lr=0.001)
        results = []
# Process in batches (parallel evaluation)
batch_size = self.num_workers
for iteration in range(0, num_iterations, batch_size):
# Generate batch of architectures
architectures = []
log_probs_batch = []
for _ in range(batch_size):
arch, log_probs = controller()
architectures.append(arch)
log_probs_batch.append(log_probs)
# Evaluate in parallel
futures = [
self.workers[i % self.num_workers].evaluate.remote(
architectures[i],
get_train_subset(),
get_val_subset()
)
for i in range(batch_size)
]
eval_results = ray.get(futures)
# Update controller with rewards
for arch, log_probs, result in zip(architectures, log_probs_batch, eval_results):
reward = result['accuracy']
controller.update(log_probs, reward, controller_optimizer)
results.append((arch, result))
return results
Monitoring & Metrics
Key Metrics
Search Progress:
- Best accuracy found so far
- Number of architectures evaluated
- Search efficiency (good arch per GPU day)
- Diversity of architectures explored
Architecture Quality:
- Accuracy vs latency scatter plot
- Pareto frontier (optimal trade-offs)
- Architecture complexity distribution (params, FLOPs)
Resource Usage:
- GPU utilization
- Training time per architecture
- Total GPU hours consumed
Visualization
- Architecture topology graphs
- Performance over search iterations
- Pareto frontier (accuracy vs latency/size)
- Operation frequency (which ops are most common in good models)
Failure Modes & Mitigations
| Failure Mode | Impact | Mitigation |
|---|---|---|
| Search collapse | Controller generates same arch repeatedly | Entropy regularization (sketch below), exploration bonus |
| Overfitting to search | Arch good on val, bad on test | Proper val/test splits, cross-validation |
| Poor weight sharing | ENAS/supernet gives misleading results | Standalone training for top candidates |
| Hardware mismatch | Arch fast on A100, slow on mobile | Include target hardware in eval |
| Expensive search | 1000s of GPU days | Early stopping, predictor, weight sharing |
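The first mitigation in the table (entropy regularization) is a small change to the REINFORCE loss. A minimal sketch, assuming the controller's forward pass is extended to also return each step's dist.entropy():

def update_with_entropy(controller, log_probs, entropies, reward, optimizer,
                        entropy_weight=0.01):
    """REINFORCE update with an entropy bonus that discourages search collapse."""
    policy_loss = torch.stack([-lp * reward for lp in log_probs]).sum()
    entropy_bonus = torch.stack(entropies).sum()
    # Subtracting the bonus rewards high-entropy (exploratory) policies
    loss = policy_loss - entropy_weight * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()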
Real-World Case Study: Google’s EfficientNet
Google’s NAS Approach
Goal: Find architectures that are both accurate and efficient (mobile-friendly).
Method:
- Multi-objective NAS optimizing accuracy and FLOPS
- Compound scaling (depth, width, resolution)
- Progressive search (coarse → fine)
Architecture:
- Search space: MobileNetV2-based
- Search strategy: Reinforcement learning
- Evaluation: Early stopping + supernet
- Scale: 1000 architectures evaluated, 100 GPU days
Results:
- EfficientNet-B0: 77.1% top-1 on ImageNet, 390M FLOPs
- Up to ~10x more efficient (smaller and faster) than prior state of the art at comparable accuracy
- Transfer learning: Worked across domains (detection, segmentation)
Key Lessons
- Multi-objective is crucial: Accuracy alone isn’t enough
- Progressive search: Start coarse, refine best candidates
- Transfer across tasks: Good architecture for ImageNet → good for other vision tasks
- Hardware-aware: Include latency/FLOPS in objective
- Compound scaling: After finding the base architecture, scale depth/width/resolution systematically (see the sketch below)
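The compound-scaling rule scales depth, width, and resolution together: depth = α^φ, width = β^φ, resolution = γ^φ, with α·β²·γ² ≈ 2 so that FLOPs grow roughly 2^φ. A sketch using the coefficients reported in the EfficientNet paper (α=1.2, β=1.1, γ=1.15):

def compound_scale(phi: int, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet-style compound scaling; FLOPs grow roughly 2**phi."""
    return {
        "depth_multiplier": alpha ** phi,
        "width_multiplier": beta ** phi,
        "resolution_multiplier": gamma ** phi,
    }

print(compound_scale(1))  # ~B1: depth x1.2, width x1.1, resolution x1.15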
Cost Analysis
NAS vs Manual Design
| Approach | Time | GPU Cost | Quality | Notes |
|---|---|---|---|---|
| Manual design | 3-6 months | 100 GPU days | Good | Expert-dependent |
| Random search | N/A | 1000 GPU days | Poor | Baseline |
| RL-based NAS | 1 month | 200 GPU days | Better | EfficientNet-style |
| DARTS | 1 week | 4 GPU days | Good | Fast but less stable |
| Transfer + fine-tune | 1 week | 10 GPU days | Good | Use existing NAS results |
ROI Calculation:
- Manual design: 3 months engineer time ($60K) + 100 GPU days ($30K) = $90K
- NAS: 1 month engineer time ($20K) + 200 GPU days ($60K) = $80K
- Savings: $10K + better model
Advanced Topics
1. Once-For-All Networks
Train a single super-network that contains many sub-networks:
class OnceForAllNetwork:
"""
Train once, deploy many architectures.
Enables instant architecture selection without retraining.
"""
def __init__(self):
self.supernet = create_supernet()
self.trained = False
def train_supernet(self, train_data):
"""
Train supernet to support all sub-architectures.
Progressive shrinking strategy:
- Train largest network first
- Progressively train smaller sub-networks
"""
# Implementation details...
pass
def extract_subnet(self, target_latency_ms: float):
"""
Extract sub-network meeting latency constraint.
No training needed!
"""
# Search for subnet with latency < target
# Use efficiency predictor
pass
2. Hardware-Aware NAS
Include hardware metrics in search:
def hardware_aware_nas(search_space, target_hardware: str):
"""
Search for architectures optimized for specific hardware.
Args:
target_hardware: "mobile", "edge_tpu", "nvidia_t4", etc.
"""
# Measure latency on target hardware
def measure_latency_on_target(arch):
model = build_model(arch)
# Deploy to target, measure
return measure_on_hardware(model, target_hardware)
    # Multi-objective fitness: accuracy high, latency low
    def fitness(arch):
        acc = evaluate_accuracy(arch)
        lat = measure_latency_on_target(arch)
        # Combine (accuracy high, latency low)
        return acc - 0.01 * lat  # weighted latency penalty

    # Hand this fitness function to any search strategy above (RL, EA, random)
    return fitness
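Measuring on-device latency for every candidate is slow. A common shortcut (in the spirit of ProxylessNAS/FBNet) is a per-operation latency lookup table, benchmarked once per target device and summed per architecture. A sketch with hypothetical numbers:

from typing import List

# Hypothetical per-op latencies (ms), benchmarked once on the target device
LATENCY_LUT_MS = {
    "conv3x3": 1.8,
    "conv5x5": 3.1,
    "depthwise_conv3x3": 0.6,
    "depthwise_conv5x5": 0.9,
    "maxpool3x3": 0.3,
    "skip": 0.0,
}

def estimate_latency_ms(arch_ops: List[str]) -> float:
    """Approximate end-to-end latency as the sum of per-op measurements."""
    return sum(LATENCY_LUT_MS[op] for op in arch_ops)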
3. Transfer NAS
Transfer architectures across tasks:
def transfer_nas(source_task: str, target_task: str):
"""
Transfer NAS results from source to target task.
Example: ImageNet → COCO detection
"""
# Load architectures found on source task
source_archs = load_nas_results(source_task)
# Top-K from source
top_archs = sorted(source_archs, key=lambda x: x['accuracy'], reverse=True)[:10]
# Fine-tune on target task
target_results = []
for arch in top_archs:
# Build model with source architecture
model = build_model_from_arch(arch['architecture'])
# Fine-tune on target task
fine_tune(model, target_task_data)
# Evaluate
target_acc = evaluate(model, target_task_test_data)
target_results.append((arch, target_acc))
# Best transferred architecture
best = max(target_results, key=lambda x: x[1])
return best
Key Takeaways
✅ NAS automates architecture design - discovers models competitive with or better than hand-designed ones
✅ Search space is exponential - like paths in a grid, exponentially many architectures
✅ DP and smart search reduce complexity - from infeasible to practical
✅ Multiple search strategies - RL (flexible), DARTS (fast), evolutionary (robust)
✅ Weight sharing critical - enables evaluating 1000s of architectures efficiently
✅ Multi-objective optimization - accuracy vs latency vs size
✅ Hardware-aware NAS - optimize for target deployment platform
✅ Transfer learning works - architectures transfer across tasks
✅ Same DP principles as Unique Paths - break into subproblems, build optimal solution
✅ Production deployment - once-for-all networks, progressive search, cost-aware
Connection to Thematic Link: Dynamic Programming and Path Optimization
All three Day 21 topics use DP for path optimization:
DSA (Unique Paths):
- Count paths in m×n grid using DP
- Recurrence: paths(i,j) = paths(i-1,j) + paths(i,j-1)
- Reduces O(2^(m+n)) to O(m×n)
ML System Design (Neural Architecture Search):
- Search through exponential architecture space
- Use DP/RL/gradient-based methods to find optimal
- Build architectures from optimal sub-architectures
Speech Tech (Speech Architecture Search):
- Search encoder/decoder configurations
- Use DP to evaluate speech model paths
- Find optimal ASR/TTS architectures
The unifying principle: navigate exponentially large search spaces by breaking into subproblems, solving optimally, and building up the final solution—whether counting grid paths, finding neural architectures, or designing speech models.
Originally published at: arunbaby.com/ml-system-design/0021-neural-architecture-search
If you found this helpful, consider sharing it with others who might benefit.