24 minute read

“Tuning speech models for peak performance.”

TL;DR

Speech models have unique hyperparameters spanning audio processing (sample rate, mel bins, window size), model architecture (encoder type, layers, attention heads), and training (SpecAugment, CTC weight, learning rate schedule). Because training is expensive, multi-fidelity strategies like Hyperband and Bayesian optimization via Optuna are essential for efficient search. Production pipelines decouple tuning on proxy datasets from full training, while tools like Ray Tune and MLflow track experiments at scale. For the models being tuned, see sequence-to-sequence speech and multi-task speech learning.

A vintage radio with multiple analog tuning dials and frequency knobs

1. Speech-Specific Hyperparameters

Speech models have unique hyperparameters beyond standard ML:

Audio Processing:

  • Sample Rate: 8kHz (telephony) vs. 16kHz (standard) vs. 48kHz (high-quality).
  • Window Size: 25ms? 50ms?
  • Hop Length: 10ms? 20ms?
  • Num Mel Bins: 40? 80? 128?

Model Architecture:

  • Encoder Type: LSTM? Transformer? Conformer?
  • Num Layers: 6? 12? 24?
  • Attention Heads: 4? 8? 16?

Training:

  • SpecAugment: Mask how many time/frequency bins?
  • CTC vs. Attention: Which loss weight?

2. The Cost Problem

Challenge: Speech models are expensive to train.

  • Whisper Large: 1 week on 256 GPUs.
  • Conformer-XXL: 3 days on 64 GPUs.

Implication: We can’t afford 100 trials. Need smart search.

3. Multi-Fidelity Tuning for ASR

Idea: Use smaller datasets/models as proxies.

Fidelity Levels:

  1. Low: Train on 1 hour of data, 3 layers, 1 epoch.
  2. Medium: Train on 10 hours, 6 layers, 5 epochs.
  3. High: Train on 100 hours, 12 layers, 50 epochs.

Hyperband Strategy:

  • Start 64 trials at low fidelity.
  • Promote top 16 to medium.
  • Promote top 4 to high.

4. Optuna for Speech

import optuna

def objective(trial):
    # Audio hyperparameters
    n_mels = trial.suggest_int('n_mels', 40, 128, step=8)
    win_length = trial.suggest_int('win_length', 20, 50, step=5)

    # Model hyperparameters
    num_layers = trial.suggest_int('num_layers', 4, 12)
    d_model = trial.suggest_categorical('d_model', [256, 512, 1024])

    # Training hyperparameters
    lr = trial.suggest_float('lr', 1e-5, 1e-3, log=True)

    # Build and train model
    model = build_asr_model(n_mels, win_length, num_layers, d_model)
    wer = train_and_evaluate(model, lr)

    return wer # Minimize WER

    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=50)

5. Neural Architecture Search (NAS)

Goal: Automatically find the best architecture.

Search Space:

  • Encoder: LSTM, GRU, Transformer, Conformer.
  • Decoder: CTC, Attention, Transducer.
  • Connections: Skip connections? Residual?

Search Algorithm:

  • ENAS (Efficient NAS): Share weights across architectures.
  • DARTS (Differentiable): Make architecture choices continuous, use gradient descent.

6. Case Study: ESPnet Tuning

ESPnet (End-to-End Speech Processing toolkit) has built-in tuning.

# Define search space in YAML
espnet_tune.py \
 --config conf/tuning.yaml \
 --n-trials 100 \
 --backend optuna

conf/tuning.yaml:

search_space:
 encoder_layers: [4, 6, 8, 12]
 attention_heads: [4, 8]
 dropout: [0.1, 0.2, 0.3]
 learning_rate: [1e-4, 5e-4, 1e-3]

7. Summary

Aspect Strategy
Search Bayesian Optimization (Optuna)
Fidelity Hyperband (small data first)
Architecture NAS (ENAS, DARTS)
Parallelization Ray Tune (multi-GPU)

8. Deep Dive: Audio Augmentation Hyperparameters

SpecAugment is crucial for speech models. But how much augmentation?

Hyperparameters:

  • Time Masking: How many time steps to mask? (10? 20? 50?)
  • Frequency Masking: How many mel bins? (5? 10? 20?)
  • Num Masks: How many masks per spectrogram? (1? 2? 3?)

Tuning Strategy:

def objective(trial):
    time_mask = trial.suggest_int('time_mask', 10, 100, step=10)
    freq_mask = trial.suggest_int('freq_mask', 5, 30, step=5)
    num_masks = trial.suggest_int('num_masks', 1, 3)

    augmenter = SpecAugment(time_mask, freq_mask, num_masks)
    model = train_with_augmentation(augmenter)

    return model.wer

Insight: More augmentation helps on small datasets, hurts on large ones.

Conformer is SOTA for ASR. But which variant?

Search Space:

  • Num Layers: 12? 18? 24?
  • d_model: 256? 512? 1024?
  • Conv Kernel Size: 15? 31? 63?
  • Attention Heads: 4? 8? 16?

Cost: Training a 24-layer Conformer takes 3 days on 8 GPUs.

Multi-Fidelity Strategy:

  1. Proxy: Train 6-layer model on 10 hours.
  2. Correlation: Check if proxy WER correlates with full model WER.
  3. Transfer: Top configs from proxy → Full training.

10. Deep Dive: Learning Rate Schedules

Speech models are sensitive to LR schedules.

Options:

  1. Warmup + Decay:
    • Warmup: Linear increase for 10k steps.
    • Decay: Cosine or exponential.
  2. Noam Scheduler (Transformer): \text{LR} = d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup^{-1.5})
  3. ReduceLROnPlateau: Reduce when validation loss plateaus.

Tuning:

def objective(trial):
    warmup_steps = trial.suggest_int('warmup_steps', 5000, 25000, step=5000)
    peak_lr = trial.suggest_float('peak_lr', 1e-4, 1e-3, log=True)

    scheduler = NoamScheduler(warmup_steps, peak_lr)
    model = train_with_scheduler(scheduler)

    return model.wer

11. System Design: Distributed Tuning for TTS

Scenario: Tune a multi-speaker TTS model.

Challenges:

  • Long Training: 1M steps = 1 week on 1 GPU.
  • Many Hyperparameters: 20+ (encoder, decoder, vocoder).

Solution:

  1. Stage 1: Tune encoder/decoder (freeze vocoder).
  2. Stage 2: Tune vocoder (freeze encoder/decoder).
  3. Parallelization: Ray Tune with 64 GPUs.

Code:

from ray import tune

def train_tts(config):
    model = build_tts(
    encoder_layers=config['encoder_layers'],
    decoder_layers=config['decoder_layers'],
    lr=config['lr']
    )

    for step in range(100000):
        loss = train_step(model)
        if step % 1000 == 0:
            tune.report(loss=loss)

            config = {
            'encoder_layers': tune.choice([4, 6, 8]),
            'decoder_layers': tune.choice([4, 6]),
            'lr': tune.loguniform(1e-5, 1e-3)
            }

            tune.run(train_tts, config=config, num_samples=50, resources_per_trial={'gpu': 1})

12. Deep Dive: Transfer Learning from Pre-Tuned Models

Idea: Start from a model that’s already tuned for a similar task.

Example:

  • Source: English ASR (tuned on LibriSpeech).
  • Target: Spanish ASR.
  • Transfer: Use English hyperparameters as starting point.

Fine-Tuning Search Space:

  • Keep architecture fixed.
  • Only tune learning rate and data augmentation.

Speedup: 5x fewer trials needed.

13. Production Tuning Workflow

Step 1: Baseline

  • Train with default hyperparameters.
  • Measure WER/MOS.

Step 2: Coarse Search

  • Use Random Search with 20 trials.
  • Identify promising regions.

Step 3: Fine Search

  • Use Bayesian Optimization with 30 trials.
  • Focus on promising region.

Step 4: Validation

  • Train best config 3 times (different seeds).
  • Report mean ± std.

Step 5: A/B Test

  • Deploy to 5% of users.
  • Monitor real-world metrics.

14. Deep Dive: Batch Size and Gradient Accumulation

Problem: Larger batch sizes improve training stability but require more GPU memory.

Hyperparameters:

  • Batch Size: 8? 16? 32? 64?
  • Gradient Accumulation Steps: 1? 2? 4? 8?

Effective Batch Size = batch_size × gradient_accumulation_steps × num_gpus

Tuning Strategy:

def objective(trial):
    batch_size = trial.suggest_categorical('batch_size', [8, 16, 32])
    grad_accum = trial.suggest_categorical('grad_accum', [1, 2, 4])

    effective_bs = batch_size * grad_accum

    # Adjust learning rate proportionally
    base_lr = 1e-4
    lr = base_lr * (effective_bs / 32)

    model = train(batch_size, grad_accum, lr)
    return model.wer

Insight: Effective batch size of 128-256 works best for most speech models.

15. Deep Dive: Optimizer Selection

Options:

  1. Adam: Default choice. Adaptive learning rates.
  2. AdamW: Adam with weight decay decoupling. Better generalization.
  3. SGD + Momentum: Simpler, sometimes better for very large models.
  4. Adafactor: Memory-efficient (no momentum buffer). Good for TPUs.

Hyperparameters:

  • Beta1, Beta2: Momentum parameters for Adam.
  • Weight Decay: L2 regularization strength.
  • Epsilon: Numerical stability constant.

Tuning:

def objective(trial):
    optimizer_name = trial.suggest_categorical('optimizer', ['adam', 'adamw', 'sgd'])

    if optimizer_name in ['adam', 'adamw']:
        beta1 = trial.suggest_float('beta1', 0.85, 0.95)
        beta2 = trial.suggest_float('beta2', 0.95, 0.999)
        weight_decay = trial.suggest_float('weight_decay', 1e-6, 1e-2, log=True)

        optimizer = AdamW(params, lr=lr, betas=(beta1, beta2), weight_decay=weight_decay)
    else:
        momentum = trial.suggest_float('momentum', 0.8, 0.99)
        optimizer = SGD(params, lr=lr, momentum=momentum)

        return train_with_optimizer(optimizer)

16. Case Study: Google’s Conformer Tuning

Background: Google trained Conformer models for production ASR.

Search Space:

  • 144 hyperparameter combinations.
  • Trained on 60,000 hours of audio.

Key Findings:

  1. Convolution Kernel Size: 31 was optimal (not 15 or 63).
  2. Dropout: 0.1 for large datasets, 0.3 for small.
  3. SpecAugment: Time mask 100, freq mask 27.

Cost: $500,000 in GPU hours.

Result: 5% relative WER improvement over baseline.

17. Case Study: Meta’s Wav2Vec 2.0 Self-Supervised Tuning

Challenge: Pre-training on 60,000 hours of unlabeled audio.

Hyperparameters Tuned:

  • Masking Probability: 0.065 (6.5% of time steps masked).
  • Mask Length: 10 time steps.
  • Contrastive Temperature: 0.1.
  • Quantizer Codebook Size: 320.

Search Method: Grid search with 20 configurations.

Key Insight: Masking probability is the most sensitive hyperparameter. 6.5% is optimal; 5% or 8% degrades performance significantly.

18. Deep Dive: Early Stopping Strategies

Problem: How do we know when to stop a trial?

Strategies:

  1. Validation Loss Plateau: Stop if loss doesn’t improve for N epochs.
  2. Hyperband: Stop bottom 50% of trials at each rung.
  3. Median Stopping: Stop if current performance is below median of all trials at this step.

Optuna Pruner:

import optuna

def objective(trial):
    model = build_model(trial.params)

    for epoch in range(100):
        val_loss = train_epoch(model)

        # Report intermediate value
        trial.report(val_loss, epoch)

        # Prune if not promising
        if trial.should_prune():
            raise optuna.TrialPruned()

            return val_loss

            study = optuna.create_study(
            pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=10)
            )
            study.optimize(objective, n_trials=100)

Speedup: 3-5x faster by killing bad trials early.

19. Deep Dive: Hyperparameter Importance Analysis

Question: Which hyperparameters matter most?

Optuna Importance:

import optuna.importance

# After study completes
importance = optuna.importance.get_param_importances(study)

for param, score in importance.items():
    print(f"{param}: {score:.3f}")

Example Output:

learning_rate: 0.45
num_layers: 0.25
dropout: 0.15
batch_size: 0.10
optimizer: 0.05

Insight: Focus future tuning on top 2-3 hyperparameters.

20. Production Deployment: Model Registry

Problem: Track 100s of tuning experiments.

Solution: MLflow Model Registry.

import mlflow

def objective(trial):
    params = {
    'lr': trial.suggest_float('lr', 1e-5, 1e-3, log=True),
    'num_layers': trial.suggest_int('num_layers', 4, 12)
    }

    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_params(params)

        # Train model
        model = train(params)
        wer = evaluate(model)

        # Log metrics
        mlflow.log_metric('wer', wer)

        # Log model
        mlflow.pytorch.log_model(model, 'model')

        return wer

Benefits:

  • Reproducibility: Every experiment is tracked.
  • Comparison: Compare trials in UI.
  • Deployment: Promote best model to production.

21. Advanced: Population-Based Training (PBT)

Idea: Evolve hyperparameters during training (like genetic algorithms).

Algorithm:

  1. Start with N models (population) with random hyperparameters.
  2. Train all models for T steps.
  3. Exploit: Replace worst 20% with copies of best 20%.
  4. Explore: Perturb hyperparameters of copied models (e.g., lr *= random.choice([0.8, 1.2])).
  5. Repeat steps 2-4.

Ray Tune PBT:

from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
time_attr='training_iteration',
metric='wer',
mode='min',
perturbation_interval=5,
hyperparam_mutations={
'lr': lambda: np.random.uniform(1e-5, 1e-3),
'dropout': lambda: np.random.uniform(0.1, 0.5)
}
)

tune.run(train_model, scheduler=pbt, num_samples=20)

Advantage: Adapts hyperparameters online. Can find schedules that static tuning misses.

22. Deep Dive: Handling Noisy Objectives

Problem: WER varies due to randomness (data shuffling, weight initialization).

Solution: Run each config multiple times, report mean.

def objective(trial):
    wers = []
    for seed in [42, 123, 456]:
        set_seed(seed)
        model = train(trial.params)
        wers.append(evaluate(model))

        return np.mean(wers)

Trade-off: 3x slower, but more reliable.

Alternative: Use larger validation set to reduce variance.

24. Deep Dive: Bayesian Optimization Internals for Speech

Speech hyperparameter spaces are often high-dimensional and continuous. Random search is inefficient. Bayesian Optimization (BO) builds a probabilistic model of the objective function f(x) (e.g., WER) and uses it to select the most promising hyperparameters to evaluate next.

1. Gaussian Processes (GP): BO typically uses a GP as a surrogate model. A GP defines a distribution over functions.

  • Prior: Before seeing any data, we assume f(x) follows a multivariate normal distribution.
  • Posterior: After observing data points D = \{(x_1, y_1), ..., (x_n, y_n)\}, we update the distribution.
  • Mean Function \mu(x): The expected value of WER at hyperparameter configuration x.
  • Covariance Function k(x, x'): Encodes assumptions about smoothness. If x and x' are similar, f(x) and f(x') should be similar.
  • Common kernel: Matern 5/2 (allows for some roughness, suitable for non-smooth deep learning landscapes).

2. Acquisition Functions: How do we choose the next x_{n+1}? We maximize an acquisition function \alpha(x).

  • Expected Improvement (EI): EI(x) = \mathbb{E}[\max(f(x^*) - f(x), 0)] where f(x^*) is the best WER observed so far. This balances exploring high-uncertainty regions and exploiting low-mean regions.
  • Upper Confidence Bound (UCB): UCB(x) = \mu(x) - \kappa \sigma(x) (Note: minus because we minimize WER). \kappa controls the exploration-exploitation trade-off.

3. Tree-Structured Parzen Estimator (TPE): Standard GPs scale cubically O(n^3) with the number of trials. TPE (used by Optuna) scales linearly.

  • Instead of modeling p(y|x), TPE models p(x|y) and p(y).
  • It defines two densities for hyperparameters x:
  • l(x) if y < y^* (promising configs)
  • g(x) if y \ge y^* (bad configs)
  • It chooses x to maximize the ratio l(x) / g(x).
  • Why it works for Speech: Speech pipelines have conditional hyperparameters (e.g., “If optimizer=SGD, tune momentum. If Adam, ignore momentum”). TPE handles this tree structure naturally.

25. Deep Dive: Ray Tune Architecture

When tuning massive speech models, we need distributed compute. Ray Tune is the industry standard.

Architecture:

  1. Driver: The script where you define the search space and tune.run().
  2. Trial Executor: Manages the lifecycle of trials.
  3. Search Algorithm: (e.g., Optuna, HyperOpt) Suggests new configurations.
  4. Scheduler: (e.g., ASHA, PBT) Decides whether to pause, stop, or resume trials based on intermediate results.
  5. Trainable (Actor): A Ray Actor (process) that runs the training loop.

Resource Management:

  • Ray abstracts resources (CPU, GPU).
  • resources_per_trial={"cpu": 4, "gpu": 1}.
  • If you have 8 GPUs, Ray Tune runs 8 concurrent trials.
  • Fractional GPUs: {"gpu": 0.5} allows running 2 small trials on one GPU (useful for small ASR models or proxy tasks).

Fault Tolerance:

  • Speech training takes days. Nodes fail.
  • Ray Tune automatically checkpoints trials.
  • If a node dies, Ray reschedules the trial on a healthy node and resumes from the last checkpoint.

Code Example: Custom Stopper

from ray.tune import Stopper

class WERPlateauStopper(Stopper):
    def __init__(self, patience=5, metric="wer"):
        self.patience = patience
        self.metric = metric
        self.best_wer = float("inf")
        self.no_improve_count = 0

    def __call__(self, trial_id, result):
        current_wer = result[self.metric]
        if current_wer < self.best_wer:
            self.best_wer = current_wer
            self.no_improve_count = 0
        else:
            self.no_improve_count += 1
            return self.no_improve_count >= self.patience

    def stop_all(self):
        return False

26. System Design: Auto-Tuning Pipeline for Production

Scenario: A company constantly ingests new audio data (call center logs). They need to retrain and retune models weekly.

Pipeline:

  1. Data Ingestion:
    • New audio lands in S3.
    • Airflow job triggers preprocessing (MFCC extraction, text normalization).
  2. Proxy Dataset Creation:
    • Randomly sample 5% of the new data (~100 hours) for hyperparameter tuning.
    • Full dataset (2000 hours) reserved for final training.
  3. Hyperparameter Search (Ray Tune + K8s):
    • Spin up ephemeral K8s cluster with Spot Instances (cheaper).
    • Run 50 trials of ASHA on the Proxy Dataset.
    • Search space: LR, SpecAugment, Dropout.
    • Output: Best configuration JSON.
  4. Full Training:
    • Launch a distributed training job (PyTorch DDP) on the full dataset using the Best Configuration.
    • No tuning here, just training.
  5. Evaluation & Gating:
    • Evaluate on a held-out Golden Set.
    • If WER < Current Production Model, promote to Staging.
  6. Deployment:
    • TorchServe loads the new model.
    • Canary deployment to 1% traffic.

Benefit: This decouples the expensive “Search” phase (done on small data) from the expensive “Train” phase (done once).

27. Deep Dive: Tuning for Edge Deployment

Deploying ASR on mobile devices (Android/iOS) introduces new constraints: Model Size and Latency.

Hyperparameters to Tune:

  1. Quantization Aware Training (QAT):
    • Bit Width: 8-bit vs 4-bit weights.
    • Observer Type: MinMax vs MovingAverage.
  2. Pruning:
    • Sparsity Level: 50%? 75%? 90%?
    • Pruning Schedule: Linear vs Cubic.
  3. Architecture:
    • Depth Multiplier: Scale down channel dimensions (MobileNet style).
    • Streaming Chunk Size: 100ms vs 400ms (Latency vs Accuracy trade-off).

Multi-Objective Optimization: We want to minimize WER and minimize Latency. Loss = WER + \lambda \times Latency

Pareto Frontier: Instead of a single best model, we want a set of models that represent optimal trade-offs.

  • Model A: WER 5%, Latency 200ms.
  • Model B: WER 6%, Latency 50ms.

Optuna Multi-Objective:

def objective(trial):
    # ... build and evaluate model ...
    wer = evaluate_wer(model)
    latency = measure_latency(model)
    return wer, latency

    study = optuna.create_study(directions=["minimize", "minimize"])
    study.optimize(objective, n_trials=100)

    # Plot Pareto Frontier
    optuna.visualization.plot_pareto_front(study)

28. Deep Dive: Hyperparameters for Low-Resource Speech

When you only have 10 hours of Swahili audio, tuning is different.

Key Hyperparameters:

  1. Dropout: Needs to be much higher (0.3 - 0.5) to prevent overfitting.
  2. SpecAugment: Aggressive masking helps significantly.
  3. Freezing Layers:
    • Start with a pre-trained English Wav2Vec 2.0.
    • Hyperparam: How many bottom layers to freeze? (Freeze 0? 6? 12?)
    • Tuning often shows freezing the feature extractor (CNN) is crucial, but fine-tuning top Transformer layers is necessary.
  4. Learning Rate: Needs to be smaller for fine-tuning (1e-5) compared to pre-training (1e-3).

Few-Shot Tuning:

  • Use MAML (Model-Agnostic Meta-Learning) to find initial hyperparameters that adapt quickly to new languages.

29. Case Study: Tuning Whisper for Code-Switching

Problem: “Hinglish” (Hindi + English) ASR. Base Model: OpenAI Whisper Large-v2.

Tuning Strategy:

  • LoRA (Low-Rank Adaptation): Fine-tuning 1.5B parameters is too slow. Tune low-rank matrices instead.
  • Hyperparameters:
  • Rank (r): 8? 16? 64? (Higher = more capacity, slower).
  • Alpha: Scaling factor.
  • Target Modules: Query/Value projections? Or all linear layers?

Results:

  • Tuning r=16 on q_proj and v_proj yielded best results.
  • Tuning all linear layers led to overfitting on the small Hinglish dataset.
  • Learning Rate: 1e-4 was optimal (standard fine-tuning uses 1e-5, but LoRA allows higher LR).

30. Advanced: Differentiable Architecture Search (DARTS) Math

NAS is usually discrete (try architecture A, then B). DARTS relaxes this to a continuous space.

Concept:

  • Construct a super-graph containing all possible operations (Conv3x3, Conv5x5, MaxPool, Identity) between nodes.
  • Assign a weight \alpha_o to each operation o.
  • The output of a node is a weighted sum: \bar{o}(x) = \sum_{o \in O} \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})} o(x).
  • We can now differentiate WER with respect to \alpha!

Bilevel Optimization: \min_\alpha \mathcal{L}_{val}(w^*(\alpha), \alpha) \text{s.t. } w^*(\alpha) = \text{argmin}_w \mathcal{L}_{train}(w, \alpha)

  • Inner loop: Train weights w (standard SGD).
  • Outer loop: Update architecture \alpha (gradient descent on validation loss).

Application to Speech:

  • Used to discover optimal Convolution cells for ASR encoders.
  • Result: Found architectures that outperform manually designed ResNets with fewer parameters.

31. Deep Dive: The Interaction of Hyperparameters

Hyperparameters are not independent.

  1. Batch Size & Learning Rate:
    • Linear Scaling Rule: If you double batch size, double learning rate.
    • Square Root Rule: Multiply LR by \sqrt{2}.
    • Tuning implication: Don’t tune them independently. Tune base_lr and scale it dynamically based on batch_size.
  2. Model Depth & Residual Scale:
    • Deeper models (50+ layers) are harder to train.
    • DeepNorm / ReZero: Scale weights by \frac{1}{\sqrt{2N}}.
    • Tuning: If num_layers is a hyperparameter, the initialization scale must also be a function of it.
  3. Regularization & Data Size:
    • If you increase SpecAugment, you might need to decrease Dropout. They both provide regularization. Too much leads to underfitting.

32. Best Practices & Pitfalls

Do’s:

  • Log Everything: Use W&B / MLflow. You will forget what Config #34 was.
  • Set Random Seeds: For reproducibility.
  • Use Log Scale: For LR and Weight Decay. 1e-4 to 1e-2 is a huge range linearly, but reasonable logarithmically.
  • Monitor Gradient Norm: If gradients explode, your LR is too high, regardless of what the tuner says.

Don’ts:

  • Don’t Tune on Test Set: The cardinal sin. You will overfit to the test set. Use a Validation set.
  • Don’t Grid Search: It’s a waste of compute. Random Search is better. Bayesian is best.
  • Don’t Ignore Defaults: Start with SOTA defaults (e.g., from ESPnet recipes). Tune around them.
  • Don’t Tune Everything: Focus on LR, Batch Size, Regularization. Architecture tuning yields diminishing returns compared to data cleaning.

33. Cost-Benefit Analysis of Tuning

Is it worth it?

Scenario:

  • Baseline Model: WER 10.0%. Training cost $100.
  • Tuned Model: WER 9.5%. Tuning cost $2000 (20 trials).

ROI Calculation:

  • If this is a hobby project: No.
  • If this is a call center transcribing 1M hours/year:
  • 0.5% WER reduction = 5% fewer human corrections.
  • Human correction cost = $100/hour.
  • Savings = Huge. Yes.

Green AI:

  • Hyperparameter tuning has a massive carbon footprint.
  • Mitigation: Use Transfer Learning, Multi-Fidelity tuning, and share best configs (Model Cards).

AutoML-Zero: Can we evolve the code of the algorithm?

LLM as Tuner:

  • Feed the training logs (loss curves) to GPT-4.
  • Ask: “The loss is oscillating. What should I change?”
  • GPT-4: “Decrease learning rate by half and increase beta2.”
  • Why it works: LLMs have read millions of ML papers and GitHub issues. They understand the physics of training dynamics better than random search.
  • OMNI (OpenAI): Future systems might just take data + metric and output a deployed API, handling all tuning internally.

35. Deep Dive: Troubleshooting Common Tuning Failures

Even with Optuna, things go wrong.

1. The “Flatline” Loss:

  • Symptom: Loss stays constant from epoch 0.
  • Cause: LR too high (gradients exploded) or too low (stuck in local minima).
  • Fix: Tune LR on a logarithmic scale from 1e-6 to 1e-1.

2. The “Divergence” Spike:

  • Symptom: Loss decreases, then suddenly shoots to NaN.
  • Cause: Batch size too small for the LR, or bad data batch.
  • Fix: Gradient Clipping (clip_grad_norm_). Tune clipping threshold (1.0 vs 5.0).

3. The “Overfitting” Gap:

  • Symptom: Train loss 0.1, Val loss 5.0.
  • Cause: Model too big, not enough regularization.
  • Fix: Increase Dropout, Weight Decay, and SpecAugment.

4. The “OOM” (Out of Memory):

  • Symptom: CUDA OOM error.
  • Cause: Batch size too large.
  • Fix: Prune trials that OOM. Ray Tune handles this by marking the trial as failed.

36. Deep Dive: The Mathematics of Learning Rate Schedules

Why do we need schedules?

SGD Update: w_{t+1} = w_t - \eta \nabla L(w_t)

The Landscape:

  • Early training: Landscape is rough. Large steps help escape local valleys.
  • Late training: We are near the minimum. Large steps oscillate. We need to decay \eta.

Cosine Annealing: \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))

  • Smooth decay. No sharp drops.
  • Hyperparams: T_{max} (cycle length), \eta_{min}.

Cyclic Learning Rates (CLR):

  • Oscillate LR between base and max.
  • Intuition: “Pop” the model out of sharp minima (poor generalization) into flat minima (good generalization).

  • Intuition: “Pop” the model out of sharp minima (poor generalization) into flat minima (good generalization).

37. Deep Dive: The Future - Quantum Hyperparameter Optimization

As classical computers hit Moore’s Law limits, Quantum Computing offers a new frontier.

Quantum Annealing:

  • D-Wave systems can solve optimization problems by finding the ground state of a Hamiltonian.
  • Application: Finding the optimal discrete architecture (NAS) can be mapped to a QUBO (Quadratic Unconstrained Binary Optimization) problem.
  • Speedup: Potentially exponential speedup for discrete search spaces.

Grover’s Search:

  • Can search an unstructured database of N items in O(\sqrt{N}) time.
  • Implication: Random search could become quadratically faster.

38. Ethical Considerations in Hyperparameter Tuning

Tuning is not value-neutral.

1. Bias Amplification:

  • If you tune for global WER, the model might sacrifice accuracy on minority accents to improve the majority.
  • Fix: Tune for Worst-Case WER across demographic groups, not Average WER.

2. Energy Consumption:

  • Training a large Transformer with NAS emits as much CO2 as 5 cars in their lifetime.
  • Responsibility: Report “CO2e” alongside WER in papers. Use Green algorithms.

3. Accessibility:

  • Only Big Tech has the compute to tune 100B parameter models.
  • Democratization: Release pre-tuned checkpoints and “recipes” so smaller labs don’t have to re-tune from scratch.
  • Transparency: Disclose the carbon footprint of your tuning process.

39. Further Reading

To dive deeper into the mathematics and systems of hyperparameter tuning, check out these seminal papers:

  1. “Algorithms for Hyper-Parameter Optimization” (Bergstra et al., 2011): The paper that introduced TPE and showed Random Search beats Grid Search.
  2. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization” (Li et al., 2018): The foundation of modern early-stopping strategies.
  3. “Ray Tune: A Framework for Distributed Hyperparameter Tuning” (Liaw et al., 2018): The system design behind scalable tuning.
  4. “Optuna: A Next-generation Hyperparameter Optimization Framework” (Akiba et al., 2019): Introduced the define-by-run API that we use today.
  5. “Neural Architecture Search with Reinforcement Learning” (Zoph & Le, 2017): The paper that started the NAS craze (and burned a lot of GPU hours).
  6. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition” (Park et al., 2019): Essential reading for speech regularization.

40. Summary

Aspect Strategy
Search Bayesian Optimization (Optuna)
Fidelity Hyperband (small data first)
Architecture NAS (ENAS, DARTS)
Parallelization Ray Tune (multi-GPU)
Transfer Use pre-tuned models
Early Stopping Median Pruner
Tracking MLflow Registry
Edge Multi-objective (WER + Latency)
Production Auto-tuning pipelines on K8s
Troubleshooting Log-scale LR, Gradient Clipping
Ethics Tune for Worst-Case WER (Fairness)

FAQ

Why is hyperparameter tuning especially challenging for speech models?

Speech models are expensive to train, with large Conformer models taking days on dozens of GPUs. This means we cannot afford hundreds of random trials. Smart search strategies like Bayesian optimization with Optuna and multi-fidelity methods like Hyperband that use smaller proxy datasets to eliminate bad configurations early are essential for staying within compute budgets.

How does multi-fidelity tuning work for speech?

Multi-fidelity tuning starts many trials at low fidelity using small datasets, few layers, and one epoch. It promotes only the top performers to medium fidelity, and the best of those to high fidelity with full data and training duration. This Hyperband strategy saves 3-5x compute by killing unpromising trials early before they waste GPU hours.

What are the most important hyperparameters to tune for ASR models?

Learning rate is typically the most impactful hyperparameter, followed by model depth, dropout, and SpecAugment settings. Batch size and optimizer choice have diminishing returns. Running importance analysis with Optuna after a study quantifies which parameters matter most for your specific setup, letting you focus future tuning effort on the top 2-3 parameters.

How should you approach tuning for edge deployment of speech models?

Edge deployment requires multi-objective optimization that simultaneously minimizes WER and inference latency. Key hyperparameters include quantization bit width, pruning sparsity level, and streaming chunk size. Optuna multi-objective studies generate a Pareto frontier showing optimal accuracy-latency tradeoffs so you can pick the right model for your device constraints.


Originally published at: arunbaby.com/speech-tech/0038-speech-hyperparameter-tuning

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch