Neural Architecture Search
Design neural architecture search systems that automatically discover strong model architectures using dynamic programming and path optimization: the same principles behind grid path counting, scaled to exponential search spaces.
Problem Statement
Design a Neural Architecture Search (NAS) System that:
- Automatically discovers neural network architectures that outperform hand-designed models
- Searches efficiently through exponentially large search spaces
- Optimizes multiple objectives (accuracy, latency, memory, FLOPS)
- Scales to production (finds models deployable on mobile/edge devices)
- Supports different domains (vision, NLP, speech, multi-modal)
Functional Requirements
- Search space definition:
- Define architecture space (layers, operations, connections)
- Support modular search (cells, blocks, stages)
- Enable constrained search (max latency, max params)
- Search strategy:
- Reinforcement learning (controller RNN)
- Evolutionary algorithms (mutation, crossover)
- Gradient-based (DARTS, differentiable NAS)
- Random search (baseline)
- Bayesian optimization
- Performance estimation:
- Train architectures to evaluate quality
- Early stopping for bad candidates
- Weight sharing / one-shot models (ENAS, DARTS)
- Performance predictors (surrogate models)
- Multi-objective optimization:
- Accuracy vs latency
- Accuracy vs model size
- Accuracy vs FLOPS
- Pareto frontier identification
- Distributed search:
- Parallel architecture evaluation
- Distributed training of candidates
- Efficient resource allocation
- Transfer and reuse:
- Transfer architectures across tasks
- Re-use components from previous searches
- Meta-learning for search initialization
Non-Functional Requirements
- Efficiency: Find a good architecture in under 100 GPU days (versus months of manual design)
- Quality: Discovered models competitive with hand-designed ones
- Generalizability: Architectures transfer across datasets
- Interpretability: Understand why a discovered architecture works
- Reproducibility: Same search produces same results
Understanding the Requirements
Why NAS?
Manual architecture design is:
- Time-consuming (months of expert effort)
- Limited by human intuition and expertise
- Hard to optimize for specific constraints (mobile latency, memory)
- Difficult to explore unconventional designs
NAS automates:
- Architecture discovery
- Multi-objective optimization
- Hardware-aware design
- Cross-domain transfer
Scale of the Problem
Search space size:
- A simple NAS space with 6 layers, 5 operations per layer: 5^6 = 15,625 architectures
- NASNet search space: ~10^18 possible architectures
- Without smart search, it is infeasible to evaluate them all
Computational cost:
- Training one model: 1-10 GPU days
- Naive search (10K architectures): 10K-100K GPU days
- Smart search (NAS): 100-1000 GPU days
- Goal: Reduce by 10-100x through efficient search
The Path Optimization Connection
Just like Unique Paths counts paths through a grid using DP:
| Unique Paths | Neural Architecture Search | Speech Arch Search |
|---|---|---|
| m×n grid | Layer×operation space | Encoder×decoder configs |
| Count all paths | Count/evaluate architectures | Evaluate speech models |
| DP optimization | DP/RL search | DP search |
| O(m×n) vs O(2^(m+n)) | Smart search vs brute force | Efficient vs exhaustive |
| Path reconstruction | Architecture construction | Model construction |
Both use dynamic programming / smart search to navigate exponentially large spaces efficiently.
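To make the mapping concrete, here is a minimal, dependency-free sketch showing the same build-one-step-at-a-time recurrence in both settings (the function names are illustrative):

def unique_paths(m: int, n: int) -> int:
    """Count monotone grid paths with DP: O(m*n) instead of O(2^(m+n))."""
    dp = [[1] * n for _ in range(m)]
    for i in range(1, m):
        for j in range(1, n):
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1]
    return dp[-1][-1]

def count_sequential_architectures(num_layers: int, ops_per_layer: int) -> int:
    """Same layer-by-layer recurrence: ways(L) = ways(L - 1) * ops_per_layer."""
    ways = 1
    for _ in range(num_layers):
        ways *= ops_per_layer
    return ways

print(unique_paths(3, 7))                    # 28
print(count_sequential_architectures(6, 5))  # 15,625 -- the example above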
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Neural Architecture Search System │
└─────────────────────────────────────────────────────────────────┘
Search Controller
┌──────────────────────────────────┐
│ Strategy: RL / EA / DARTS │
│ - Propose architectures │
│ - Update based on performance │
└──────────────┬───────────────────┘
│
↓
┌────────────────┐
│ Search Space │
│ - Layers │
│ - Operations │
│ - Connections │
└────────┬───────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───────▼────────┐ ┌──────▼──────┐ ┌────────▼────────┐
│ Architecture │ │ Performance │ │ Multi-obj │
│ Evaluator │ │ Predictor │ │ Optimizer │
│ │ │ │ │ │
│ - Train model │ │ - Surrogate │ │ - Pareto │
│ - Measure acc │ │ - Skip bad │ │ - Constraints │
│ - Measure lat │ │ candidates│ │ - Trade-offs │
└───────┬────────┘ └──────┬──────┘ └────────┬────────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌────────▼────────┐
│ Distributed │
│ Training │
│ - Worker pool │
│ - GPU cluster │
└────────┬────────┘
│
┌────────▼────────┐
│ Results Store │
│ - Architectures│
│ - Metrics │
│ - Models │
└─────────────────┘
Key Components
- Search Controller: Proposes new architectures to try
- Search Space: Defines valid architecture configurations
- Architecture Evaluator: Trains and evaluates architectures
- Performance Predictor: Estimates performance without full training
- Multi-objective Optimizer: Balances accuracy, latency, size
- Distributed Training: Parallel evaluation of architectures
- Results Store: Tracks all evaluated architectures
Component Deep-Dives
1. Search Space Definition
Define what architectures are possible:
from typing import Dict, List, Optional
from dataclasses import dataclass
@dataclass
class Operation:
"""A single operation in the search space."""
name: str
params: Dict
@dataclass
class SearchSpace:
"""
NAS search space definition.
Similar to Unique Paths grid:
- Grid dimensions → num_layers, ops_per_layer
- Paths through grid → architectures through search space
"""
num_layers: int
operations: List[Operation]
connections: str # "sequential", "skip", "dense"
    def count_architectures(self) -> Optional[int]:
        """
        Count total possible architectures.

        Like counting paths in Unique Paths:
        - If sequential: ops_per_layer ^ num_layers
        - With skip connections: combinatorially larger
        """
        if self.connections == "sequential":
            return len(self.operations) ** self.num_layers
        # With skip connections the count explodes combinatorially;
        # return None rather than a misleading sentinel value
        return None
# Example search space
MOBILENET_SEARCH_SPACE = SearchSpace(
num_layers=20,
operations=[
Operation("conv3x3", {"kernel_size": 3, "stride": 1}),
Operation("conv5x5", {"kernel_size": 5, "stride": 1}),
Operation("depthwise_conv3x3", {"kernel_size": 3}),
Operation("depthwise_conv5x5", {"kernel_size": 5}),
Operation("maxpool3x3", {"kernel_size": 3, "stride": 1}),
Operation("skip", {}),
],
connections="skip"
)
def encode_architecture(arch_ops: List[str]) -> str:
    """
    Encode an architecture as a string (for hashing/caching results).

    Args:
        arch_ops: List of operation names, one per layer
    Returns:
        String encoding
    """
    return "-".join(arch_ops)
def decode_architecture(arch_string: str) -> List[str]:
"""Decode architecture string to operation list."""
return arch_string.split("-")
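A quick usage sketch tying these pieces together (the sequential variant is exactly countable; the skip-connected space above is not):

# Sequential variant of the mobile space: exactly countable
seq_space = SearchSpace(
    num_layers=6,
    operations=MOBILENET_SEARCH_SPACE.operations,
    connections="sequential",
)
print(seq_space.count_architectures())  # 6 ops ^ 6 layers = 46,656

arch = ["conv3x3", "skip", "depthwise_conv5x5", "conv3x3", "maxpool3x3", "skip"]
encoded = encode_architecture(arch)
assert decode_architecture(encoded) == arch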
2. Search Strategy - Reinforcement Learning
Use RL controller to generate architectures:
import torch
import torch.nn as nn
class NASController(nn.Module):
"""
RNN controller that generates architectures.
Similar to DP in Unique Paths:
- Build architecture layer-by-layer
- Use previous decisions to inform next
- Optimize for high reward (accuracy)
"""
def __init__(
self,
num_layers: int,
num_operations: int,
hidden_size: int = 100
):
super().__init__()
self.num_layers = num_layers
self.num_operations = num_operations
# RNN to track state across layers
self.rnn = nn.LSTM(
input_size=num_operations,
hidden_size=hidden_size,
num_layers=1
)
# Output layer: predict operation for next layer
self.fc = nn.Linear(hidden_size, num_operations)
def forward(self):
"""
Generate an architecture.
Returns:
architecture: List of operation indices
log_probs: Log probabilities for REINFORCE
"""
architecture = []
log_probs = []
# Initial input
inputs = torch.zeros(1, 1, self.num_operations)
hidden = None
# Generate layer-by-layer (like DP building solution)
for layer in range(self.num_layers):
# RNN step
output, hidden = self.rnn(inputs, hidden)
# Predict operation for this layer
logits = self.fc(output.squeeze(0))
probs = torch.softmax(logits, dim=-1)
# Sample operation
dist = torch.distributions.Categorical(probs)
action = dist.sample()
architecture.append(action.item())
log_probs.append(dist.log_prob(action))
# Next input is one-hot of chosen operation
inputs = torch.zeros(1, 1, self.num_operations)
inputs[0, 0, action] = 1.0
return architecture, log_probs
def update(self, log_probs: List[torch.Tensor], reward: float, optimizer):
"""
Update controller using REINFORCE.
Args:
log_probs: Log probabilities of sampled actions
reward: Accuracy of generated architecture
optimizer: Controller optimizer
"""
# REINFORCE loss: -sum(log_prob * reward)
policy_loss = []
for log_prob in log_probs:
policy_loss.append(-log_prob * reward)
loss = torch.stack(policy_loss).sum()
# Update controller
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Training loop
def train_nas_controller(
controller: NASController,
search_space: SearchSpace,
num_iterations: int = 1000
):
"""
Train NAS controller to generate good architectures.
"""
    optimizer = torch.optim.Adam(controller.parameters(), lr=0.001)
    best_reward = 0.0
    for iteration in range(num_iterations):
        # Generate architecture
        arch, log_probs = controller()
        # Evaluate architecture (evaluate_architecture is a stand-in that trains a small model)
        reward = evaluate_architecture(arch, search_space)
        best_reward = max(best_reward, reward)
        # Update controller
        controller.update(log_probs, reward, optimizer)
        if iteration % 100 == 0:
            print(f"Iteration {iteration}: reward = {reward:.3f}, best = {best_reward:.3f}")
3. Search Strategy - Differentiable NAS (DARTS)
DARTS makes architecture search differentiable:
class DARTSSearchSpace(nn.Module):
"""
Differentiable architecture search.
Key idea: Instead of discrete choice, use weighted combination.
Learn weights (architecture parameters) via gradient descent.
"""
    def __init__(self, num_layers: int, operations: List[nn.Module]):
        super().__init__()
        self.num_layers = num_layers
        # Note: this sketch shares one set of operation modules across all layers;
        # full DARTS instantiates a separate set of candidate ops per edge
        self.operations = nn.ModuleList(operations)
        # Architecture parameters (a learnable weight per operation, per layer)
        self.alpha = nn.Parameter(
            torch.randn(num_layers, len(operations))
        )
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Forward pass with weighted operations.
Each layer computes weighted sum of all operations.
"""
for layer in range(self.num_layers):
# Get architecture weights for this layer
weights = torch.softmax(self.alpha[layer], dim=0)
# Compute weighted sum of operations
layer_output = sum(
w * op(x)
for w, op in zip(weights, self.operations)
)
x = layer_output
return x
def get_best_architecture(self) -> List[int]:
"""
Extract discrete architecture from learned weights.
Choose operation with highest weight per layer.
"""
architecture = []
for layer in range(self.num_layers):
weights = torch.softmax(self.alpha[layer], dim=0)
best_op = torch.argmax(weights).item()
architecture.append(best_op)
return architecture
# DARTS training (bi-level optimization)
def train_darts(search_space: DARTSSearchSpace, train_data, val_data, epochs: int = 50):
"""
Train DARTS to find optimal architecture.
Bi-level optimization:
- Inner loop: optimize model weights
- Outer loop: optimize architecture parameters
"""
    # Model-weight optimizer: exclude the architecture parameters (alpha),
    # otherwise both optimizers would update alpha
    weight_params = [
        p for name, p in search_space.named_parameters() if name != "alpha"
    ]
    model_optimizer = torch.optim.SGD(
        weight_params,
        lr=0.025,
        momentum=0.9
    )
    # Architecture parameters optimizer
    arch_optimizer = torch.optim.Adam(
        [search_space.alpha],
        lr=0.001
    )
for epoch in range(epochs):
# Update model weights on train data
for batch in train_data:
model_optimizer.zero_grad()
loss = compute_loss(search_space(batch['x']), batch['y'])
loss.backward()
model_optimizer.step()
# Update architecture parameters on val data
for batch in val_data:
arch_optimizer.zero_grad()
loss = compute_loss(search_space(batch['x']), batch['y'])
loss.backward()
arch_optimizer.step()
# Extract final architecture
best_arch = search_space.get_best_architecture()
return best_arch
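A hypothetical instantiation with a few simple candidate ops (every op must map C channels to C channels at the same resolution, since the forward pass sums their outputs):

C = 16  # channel count preserved by every candidate op
candidate_ops = [
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.Conv2d(C, C, kernel_size=5, padding=2),
    nn.Identity(),  # skip connection
]
space = DARTSSearchSpace(num_layers=4, operations=candidate_ops)
x = torch.randn(2, C, 32, 32)
out = space(x)                         # weighted mixture of ops at every layer
print(space.get_best_architecture())   # e.g. [2, 0, 0, 1]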
4. Performance Estimation Strategies
Problem: Training every architecture fully is too expensive.
Solutions:
a) Early stopping:
def evaluate_with_early_stopping(
arch: List[int],
train_data,
val_data,
max_epochs: int = 50,
patience: int = 5
):
"""
Train architecture with early stopping.
Stop if validation accuracy doesn't improve for `patience` epochs.
"""
model = build_model_from_arch(arch)
optimizer = torch.optim.Adam(model.parameters())
best_val_acc = 0.0
epochs_without_improvement = 0
for epoch in range(max_epochs):
# Train
train_one_epoch(model, train_data, optimizer)
# Validate
val_acc = evaluate(model, val_data)
if val_acc > best_val_acc:
best_val_acc = val_acc
epochs_without_improvement = 0
else:
epochs_without_improvement += 1
# Early stop
if epochs_without_improvement >= patience:
break
return best_val_acc
b) Weight sharing (ENAS):
class SuperNet(nn.Module):
"""
Super-network containing all possible operations.
Different architectures share weights.
Train once, evaluate many architectures quickly.
"""
    def __init__(self, search_space: SearchSpace):
        super().__init__()
        # One instance of every candidate op per layer; architectures that pick
        # the same op at the same layer share those weights
        # (create_operation is an assumed factory: Operation -> nn.Module)
        self.ops = nn.ModuleList([
            nn.ModuleList([
                create_operation(op)
                for op in search_space.operations
            ])
            for _ in range(search_space.num_layers)
        ])

    def forward(self, x: torch.Tensor, architecture: List[int]) -> torch.Tensor:
        """
        Forward pass for a specific architecture.

        Args:
            x: Input batch
            architecture: List of operation indices, one per layer
        """
        for layer, op_idx in enumerate(architecture):
            x = self.ops[layer][op_idx](x)
        return x
def evaluate_architecture(self, arch: List[int], val_data) -> float:
"""
Evaluate architecture without training.
Uses shared weights - much faster than training from scratch.
"""
self.eval()
total_correct = 0
total_samples = 0
with torch.no_grad():
for batch in val_data:
outputs = self.forward(batch['x'], arch)
preds = outputs.argmax(dim=1)
total_correct += (preds == batch['y']).sum().item()
total_samples += len(batch['y'])
return total_correct / total_samples
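The supernet's shared weights must be trained before evaluate_architecture means anything. A common recipe is single-path uniform sampling (one-shot NAS style): each batch, sample a random architecture and update only the ops on that path. A minimal sketch, assuming batches shaped like the earlier blocks:

import random
import torch.nn.functional as F

def train_supernet(supernet, search_space, train_data, optimizer, epochs=50):
    """One-shot training: a random path per batch, so every op gets trained."""
    num_ops = len(search_space.operations)
    supernet.train()
    for _ in range(epochs):
        for batch in train_data:
            # Sample one architecture uniformly at random (single-path sampling)
            arch = [random.randrange(num_ops) for _ in range(search_space.num_layers)]
            optimizer.zero_grad()
            logits = supernet(batch['x'], arch)
            loss = F.cross_entropy(logits, batch['y'])
            loss.backward()
            optimizer.step()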
c) Performance prediction:
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor
class PerformancePredictor:
"""
Predict architecture performance without training.
Train a surrogate model: architecture features → accuracy.
"""
    def __init__(self, num_operations: int):
        self.model = RandomForestRegressor(n_estimators=100)
        self.num_operations = num_operations
        self.trained = False

    def extract_features(self, arch: List[int]) -> np.ndarray:
        """
        Extract a fixed-length feature vector from an architecture.

        Features:
        - Count of each operation type (fixed length so dimensions match across archs)
        - Depth (number of layers)
        (Estimated FLOPs and parameter counts would be natural additions.)
        """
        op_counts = Counter(arch)
        # Range over all operations, not max(arch), so every architecture
        # maps to the same feature dimension
        features = [op_counts.get(op_idx, 0) for op_idx in range(self.num_operations)]
        # Add depth
        features.append(len(arch))
        return np.array(features)
def train(self, architectures: List[List[int]], accuracies: List[float]):
"""Train predictor on evaluated architectures."""
X = np.array([self.extract_features(arch) for arch in architectures])
y = np.array(accuracies)
self.model.fit(X, y)
self.trained = True
def predict(self, arch: List[int]) -> float:
"""Predict accuracy for architecture."""
if not self.trained:
raise ValueError("Predictor not trained")
features = self.extract_features(arch).reshape(1, -1)
return self.model.predict(features)[0]
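Usage sketch: fit the predictor on architectures already evaluated, then use it to filter candidates before paying for real training (the numbers below are purely illustrative):

predictor = PerformancePredictor(num_operations=6)

# Architectures evaluated earlier in the search, with their measured accuracies
history_archs = [[0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 2], [5, 5, 5, 5, 5, 5]]
history_accs = [0.72, 0.68, 0.55]
predictor.train(history_archs, history_accs)

candidate = [0, 1, 1, 2, 3, 5]
if predictor.predict(candidate) > 0.65:  # only fully train promising candidates
    print("worth training")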
Data Flow
NAS Pipeline
1. Initialize search
└─> Define search space
└─> Initialize controller (RL/EA/DARTS)
└─> Set up distributed workers
2. Search loop (1000-10000 iterations)
└─> Controller proposes architecture
└─> (Optional) Performance predictor filters bad candidates
└─> Evaluate architecture:
- Train on subset of data
- Measure accuracy, latency, size
└─> Update controller based on reward
└─> Store results
3. Post-processing
   └─> Identify Pareto frontier (best accuracy-latency trade-offs; see the sketch after this pipeline)
└─> Retrain top candidates on full data
└─> Final evaluation on test set
4. Deployment
└─> Export best architecture
└─> Optimize for target hardware
└─> Deploy to production
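Step 3's Pareto-frontier identification is straightforward to implement directly. A sketch for the accuracy-vs-latency case (higher accuracy better, lower latency better), using result dictionaries shaped like those the evaluation workers below return:

from typing import Dict, List

def pareto_frontier(results: List[Dict]) -> List[Dict]:
    """Keep candidates not dominated by any other on (accuracy, latency)."""
    frontier = []
    for a in results:
        dominated = any(
            b["accuracy"] >= a["accuracy"]
            and b["latency_ms"] <= a["latency_ms"]
            and (b["accuracy"] > a["accuracy"] or b["latency_ms"] < a["latency_ms"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return sorted(frontier, key=lambda r: r["latency_ms"])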
Scaling Strategies
Distributed Architecture Evaluation
from typing import Dict, List

import ray
import torch
@ray.remote(num_gpus=1)
class ArchitectureWorker:
"""Worker that evaluates architectures on GPU."""
def __init__(self, search_space: SearchSpace):
self.search_space = search_space
def evaluate(self, arch: List[int], train_subset, val_subset) -> Dict:
"""
Evaluate architecture.
Returns:
Dictionary with accuracy, latency, params, flops
"""
# Build model
model = build_model_from_arch(arch, self.search_space)
# Train briefly
train_model(model, train_subset, epochs=10)
# Evaluate
accuracy = evaluate(model, val_subset)
latency = measure_latency(model)
params = count_parameters(model)
flops = estimate_flops(model)
return {
"accuracy": accuracy,
"latency_ms": latency,
"params": params,
"flops": flops
}
class DistributedNAS:
"""Distributed NAS system."""
def __init__(self, search_space: SearchSpace, num_workers: int = 8):
self.search_space = search_space
# Create worker pool
self.workers = [
ArchitectureWorker.remote(search_space)
for _ in range(num_workers)
]
self.num_workers = num_workers
def search(self, controller: NASController, num_iterations: int = 1000):
"""
Distributed NAS search.
Args:
controller: Architecture generator
num_iterations: Number of architectures to try
"""
        controller_optimizer = torch.optim.Adam(controller.parameters(), lr=0.001)
        results = []
# Process in batches (parallel evaluation)
batch_size = self.num_workers
for iteration in range(0, num_iterations, batch_size):
# Generate batch of architectures
architectures = []
log_probs_batch = []
for _ in range(batch_size):
arch, log_probs = controller()
architectures.append(arch)
log_probs_batch.append(log_probs)
# Evaluate in parallel
futures = [
self.workers[i % self.num_workers].evaluate.remote(
architectures[i],
get_train_subset(),
get_val_subset()
)
for i in range(batch_size)
]
eval_results = ray.get(futures)
# Update controller with rewards
for arch, log_probs, result in zip(architectures, log_probs_batch, eval_results):
reward = result['accuracy']
controller.update(log_probs, reward, controller_optimizer)
results.append((arch, result))
return results
Monitoring & Metrics
Key Metrics
Search Progress:
- Best accuracy found so far
- Number of architectures evaluated
- Search efficiency (good arch per GPU day)
- Diversity of architectures explored
Architecture Quality:
- Accuracy vs latency scatter plot
- Pareto frontier (optimal trade-offs)
- Architecture complexity distribution (params, FLOPs)
Resource Usage:
- GPU utilization
- Training time per architecture
- Total GPU hours consumed
Visualization
- Architecture topology graphs
- Performance over search iterations
- Pareto frontier (accuracy vs latency/size)
- Operation frequency (which ops are most common in good models)
Failure Modes & Mitigations
| Failure Mode | Impact | Mitigation |
|---|---|---|
| Search collapse | Controller generates same arch repeatedly | Entropy regularization (sketch below), exploration bonus |
| Overfitting to search | Arch good on val, bad on test | Proper val/test splits, cross-validation |
| Poor weight sharing | ENAS/supernet gives misleading results | Standalone training for top candidates |
| Hardware mismatch | Arch fast on A100, slow on mobile | Include target hardware in eval |
| Expensive search | 1000s of GPU days | Early stopping, predictor, weight sharing |
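The first mitigation in the table (entropy regularization) is a small change to the REINFORCE loss. A minimal sketch, assuming the controller's forward pass is extended to also return each step's dist.entropy():

def update_with_entropy(controller, log_probs, entropies, reward, optimizer,
                        entropy_weight=0.01):
    """REINFORCE update with an entropy bonus that discourages search collapse."""
    policy_loss = torch.stack([-lp * reward for lp in log_probs]).sum()
    entropy_bonus = torch.stack(entropies).sum()
    # Subtracting the bonus rewards high-entropy (exploratory) policies
    loss = policy_loss - entropy_weight * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()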
Real-World Case Study: Google’s EfficientNet
Google’s NAS Approach
Goal: Find architectures that are both accurate and efficient (mobile-friendly).
Method:
- Multi-objective NAS optimizing accuracy and FLOPS
- Compound scaling (depth, width, resolution)
- Progressive search (coarse → fine)
Architecture:
- Search space: MobileNetV2-based
- Search strategy: Reinforcement learning
- Evaluation: Early stopping + supernet
- Scale: 1000 architectures evaluated, 100 GPU days
Results:
- EfficientNet-B0: 77.1% top-1 on ImageNet, 390M FLOPs
- Up to ~10x more efficient (smaller and faster) than prior state of the art at comparable accuracy
- Transfer learning: Worked across domains (detection, segmentation)
Key Lessons
- Multi-objective is crucial: Accuracy alone isn’t enough
- Progressive search: Start coarse, refine best candidates
- Transfer across tasks: Good architecture for ImageNet → good for other vision tasks
- Hardware-aware: Include latency/FLOPS in objective
- Compound scaling: After finding the base architecture, scale depth/width/resolution systematically (see the sketch below)
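The compound-scaling rule scales depth, width, and resolution together: depth = α^φ, width = β^φ, resolution = γ^φ, with α·β²·γ² ≈ 2 so that FLOPs grow roughly 2^φ. A sketch using the coefficients reported in the EfficientNet paper (α=1.2, β=1.1, γ=1.15):

def compound_scale(phi: int, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet-style compound scaling; FLOPs grow roughly 2**phi."""
    return {
        "depth_multiplier": alpha ** phi,
        "width_multiplier": beta ** phi,
        "resolution_multiplier": gamma ** phi,
    }

print(compound_scale(1))  # ~B1: depth x1.2, width x1.1, resolution x1.15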
Cost Analysis
NAS vs Manual Design
| Approach | Time | GPU Cost | Quality | Notes |
|---|---|---|---|---|
| Manual design | 3-6 months | 100 GPU days | Good | Expert-dependent |
| Random search | N/A | 1000 GPU days | Poor | Baseline |
| RL-based NAS | 1 month | 200 GPU days | Better | EfficientNet-style |
| DARTS | 1 week | 4 GPU days | Good | Fast but less stable |
| Transfer + fine-tune | 1 week | 10 GPU days | Good | Use existing NAS results |
ROI Calculation:
- Manual design: 3 months engineer time ($60K) + 100 GPU days ($30K) = $90K
- NAS: 1 month engineer time ($20K) + 200 GPU days ($60K) = $80K
- Savings: $10K + better model
Advanced Topics
1. Once-For-All Networks
Train a single super-network that contains many sub-networks:
class OnceForAllNetwork:
"""
Train once, deploy many architectures.
Enables instant architecture selection without retraining.
"""
def __init__(self):
self.supernet = create_supernet()
self.trained = False
def train_supernet(self, train_data):
"""
Train supernet to support all sub-architectures.
Progressive shrinking strategy:
- Train largest network first
- Progressively train smaller sub-networks
"""
# Implementation details...
pass
def extract_subnet(self, target_latency_ms: float):
"""
Extract sub-network meeting latency constraint.
No training needed!
"""
# Search for subnet with latency < target
# Use efficiency predictor
pass
2. Hardware-Aware NAS
Include hardware metrics in search:
def hardware_aware_nas(search_space, target_hardware: str):
"""
Search for architectures optimized for specific hardware.
Args:
target_hardware: "mobile", "edge_tpu", "nvidia_t4", etc.
"""
# Measure latency on target hardware
def measure_latency_on_target(arch):
model = build_model(arch)
# Deploy to target, measure
return measure_on_hardware(model, target_hardware)
    # Multi-objective fitness: accuracy high, latency low
    def fitness(arch):
        acc = evaluate_accuracy(arch)
        lat = measure_latency_on_target(arch)
        # Combine (accuracy high, latency low)
        return acc - 0.01 * lat  # weighted latency penalty

    # Hand this fitness function to any search strategy above (RL, EA, random)
    return fitness
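Measuring on-device latency for every candidate is slow. A common shortcut (in the spirit of ProxylessNAS/FBNet) is a per-operation latency lookup table, benchmarked once per target device and summed per architecture. A sketch with hypothetical numbers:

from typing import List

# Hypothetical per-op latencies (ms), benchmarked once on the target device
LATENCY_LUT_MS = {
    "conv3x3": 1.8,
    "conv5x5": 3.1,
    "depthwise_conv3x3": 0.6,
    "depthwise_conv5x5": 0.9,
    "maxpool3x3": 0.3,
    "skip": 0.0,
}

def estimate_latency_ms(arch_ops: List[str]) -> float:
    """Approximate end-to-end latency as the sum of per-op measurements."""
    return sum(LATENCY_LUT_MS[op] for op in arch_ops)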
3. Transfer NAS
Transfer architectures across tasks:
def transfer_nas(source_task: str, target_task: str):
"""
Transfer NAS results from source to target task.
Example: ImageNet → COCO detection
"""
# Load architectures found on source task
source_archs = load_nas_results(source_task)
# Top-K from source
top_archs = sorted(source_archs, key=lambda x: x['accuracy'], reverse=True)[:10]
# Fine-tune on target task
target_results = []
for arch in top_archs:
# Build model with source architecture
model = build_model_from_arch(arch['architecture'])
# Fine-tune on target task
fine_tune(model, target_task_data)
# Evaluate
target_acc = evaluate(model, target_task_test_data)
target_results.append((arch, target_acc))
# Best transferred architecture
best = max(target_results, key=lambda x: x[1])
return best
Key Takeaways
✅ NAS automates architecture design - discovers models competitive with or better than hand-designed ones
✅ Search space is exponential - like paths in a grid, exponentially many architectures
✅ DP and smart search reduce complexity - from infeasible to practical
✅ Multiple search strategies - RL (flexible), DARTS (fast), evolutionary (robust)
✅ Weight sharing critical - enables evaluating 1000s of architectures efficiently
✅ Multi-objective optimization - accuracy vs latency vs size
✅ Hardware-aware NAS - optimize for target deployment platform
✅ Transfer learning works - architectures transfer across tasks
✅ Same DP principles as Unique Paths - break into subproblems, build optimal solution
✅ Production deployment - once-for-all networks, progressive search, cost-aware
Connection to Thematic Link: Dynamic Programming and Path Optimization
All three Day 21 topics use DP for path optimization:
DSA (Unique Paths):
- Count paths in m×n grid using DP
- Recurrence: paths(i,j) = paths(i-1,j) + paths(i,j-1)
- Reduces O(2^(m+n)) to O(m×n)
ML System Design (Neural Architecture Search):
- Search through exponential architecture space
- Use DP/RL/gradient-based methods to find optimal
- Build architectures from optimal sub-architectures
Speech Tech (Speech Architecture Search):
- Search encoder/decoder configurations
- Use DP to evaluate speech model paths
- Find optimal ASR/TTS architectures
The unifying principle: navigate exponentially large search spaces by breaking into subproblems, solving optimally, and building up the final solution—whether counting grid paths, finding neural architectures, or designing speech models.
Originally published at: arunbaby.com/ml-system-design/0021-neural-architecture-search
If you found this helpful, consider sharing it with others who might benefit.