Hyperparameter Optimization
“Finding the perfect knobs to turn.”
TL;DR
Hyperparameter optimization is the systematic search for the best model configuration, and doing it well can mean the difference between a mediocre model and a state-of-the-art one. This article covers everything from grid and random search through Bayesian optimization, Hyperband, and population-based training, with practical code examples using Optuna and Ray Tune. Understanding these techniques is essential for anyone working on model architecture design or experiment tracking systems.

1. The Problem: Too Many Knobs
Training a neural network involves many hyperparameters:
- Learning rate: 0.001? 0.01? 0.0001?
- Batch size: 32? 64? 128?
- Number of layers: 3? 5? 10?
- Dropout rate: 0.1? 0.3? 0.5?
- Optimizer: Adam? SGD? AdamW?
Challenge: The search space grows exponentially with the number of hyperparameters. For 10 hyperparameters with 5 values each, that’s 5^{10} = 9,765,625 (about 9.8 million) combinations!
2. Search Strategies
1. Grid Search
- Idea: Try all combinations.
- Pros: Exhaustive, guaranteed to find best in grid.
- Cons: Exponentially expensive.
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [32, 64, 128],
    'num_layers': [3, 5, 7],
}
# Total trials: 3 × 3 × 3 = 27
```
2. Random Search
- Idea: Sample random combinations.
- Pros: More efficient than grid search.
- Insight: Usually only a few hyperparameters matter much. With the same number of trials, random search tries more distinct values of each important hyperparameter than a grid does.
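```python
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'learning_rate': [0.0001, 0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64, 128, 256],
}
# The full grid has 4 × 5 = 20 combinations, so sample fewer than that
# (e.g. n_iter=10); sampling all 20 would just be grid search.
```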
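The insight about random search can be checked with a small experiment. The sketch below is illustrative only: the helper names (`grid_points`, `random_points`, `distinct_values`) are hypothetical, and the search space is assumed to be the unit square. It counts how many distinct values each strategy tries along one dimension for the same trial budget.

```python
import itertools
import random

def grid_points(values_per_dim, dims):
    """All grid combinations with evenly spaced values in [0, 1]."""
    axis = [i / (values_per_dim - 1) for i in range(values_per_dim)]
    return list(itertools.product(axis, repeat=dims))

def random_points(n, dims, seed=0):
    """n points sampled uniformly from [0, 1]^dims."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(dims)) for _ in range(n)]

def distinct_values(points, dim):
    """Number of distinct values tried along one dimension."""
    return len({p[dim] for p in points})

grid = grid_points(values_per_dim=3, dims=2)   # 3 x 3 = 9 trials
rand = random_points(n=len(grid), dims=2)      # same budget: 9 trials

print("grid:  ", len(grid), "trials,", distinct_values(grid, 0), "distinct values of dim 0")
print("random:", len(rand), "trials,", distinct_values(rand, 0), "distinct values of dim 0")
```

With nine trials each, the grid only ever tests three values of the first hyperparameter, while random search tests nine. If that hyperparameter is the one that matters, random search has explored it three times as finely.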
3. Bayesian Optimization
- Idea: Build a probabilistic model of the objective function.
- Acquisition Function: Decides where to sample next (balance exploration vs. exploitation).
- Pros: Sample-efficient. Converges faster than random search.
3. Bayesian Optimization Deep Dive
Algorithm:
- Surrogate Model: A Gaussian Process (GP) models f(\theta) \approx \text{validation accuracy}.
- Acquisition Function: Expected Improvement (EI) or Upper Confidence Bound (UCB).
\text{EI}(\theta) = \mathbb{E}[\max(f(\theta) - f(\theta^*), 0)]
where \theta^* is the current best.
- Optimize Acquisition: Find the \theta that maximizes EI.
- Evaluate: Train the model with \theta and observe accuracy.
- Update GP: Add the new observation and repeat.
Libraries:
- Optuna: Widely used in ML; define-by-run API with a TPE sampler by default.
- Hyperopt: Tree-structured Parzen Estimator (TPE).
- Ray Tune: Distributed tuning at scale.
4. Optuna Example
```python
import optuna

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64])
    dropout = trial.suggest_float('dropout', 0.1, 0.5)
    # Train model (build_model and train_and_evaluate are your own functions)
    model = build_model(lr, batch_size, dropout)
    val_acc = train_and_evaluate(model)
    return val_acc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f"Best params: {study.best_params}")
print(f"Best value: {study.best_value}")
```
5. Advanced: Multi-Fidelity Optimization
Problem: Evaluating each trial is expensive (train for 100 epochs).
Solution: Successive Halving (the building block of Hyperband).
- Start with many trials, train for 1 epoch.
- Keep top 50%, train for 2 epochs.
- Keep top 50%, train for 4 epochs.
- Repeat until 1 trial remains, train for 100 epochs.
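Speedup: Often cited as 10-100x faster than training every trial to completion; the actual gain depends on how early bad configurations can be identified.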
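To see where the savings come from, the halving schedule above can be tabulated. This is a minimal sketch with assumptions of its own: the function names are hypothetical, the budget doubles every round, and each round is costed as if trials were retrained from scratch (resuming from checkpoints would make it cheaper still).

```python
def successive_halving_schedule(n_trials, min_epochs=1, keep_fraction=0.5):
    """Return a list of (trials_alive, epochs_per_trial) rounds."""
    rounds = []
    alive, epochs = n_trials, min_epochs
    while True:
        rounds.append((alive, epochs))
        if alive == 1:
            break
        alive = max(1, int(alive * keep_fraction))  # keep the top fraction
        epochs *= 2                                 # survivors get twice the budget
    return rounds

def total_epochs(rounds):
    """Total epochs trained across all rounds."""
    return sum(alive * epochs for alive, epochs in rounds)

rounds = successive_halving_schedule(n_trials=64)
halving_cost = total_epochs(rounds)
full_cost = 64 * rounds[-1][1]   # every trial trained for the final epoch count

print(rounds)
print(f"halving: {halving_cost} epochs, full: {full_cost} epochs, "
      f"ratio: {full_cost / halving_cost:.1f}x")
```

For 64 trials this schedule spends 448 epochs in total, against 4096 epochs if all 64 trials ran for the final 64 epochs, roughly a 9x saving in this toy setting.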
6. Ray Tune for Distributed Tuning
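```python
from ray import tune

# Legacy function API (tune.run / tune.report(**metrics)); newer Ray releases
# use tune.Tuner and report a metrics dict instead.
def train_model(config):
    model = build_model(config['lr'], config['batch_size'])
    for epoch in range(10):
        loss = train_epoch(model)
        tune.report(loss=loss)   # report once per epoch

config = {
    'lr': tune.loguniform(1e-5, 1e-1),
    'batch_size': tune.choice([16, 32, 64])
}
analysis = tune.run(
    train_model,
    config=config,
    metric='loss',   # needed so best_config knows what to rank by
    mode='min',
    num_samples=100,
    resources_per_trial={'gpu': 1}
)
print(f"Best config: {analysis.best_config}")
```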
7. Summary
| Method | Trials Needed (rough) | Pros | Cons |
|---|---|---|---|
| Grid | O(k^n) | Exhaustive | Exponential |
| Random | ~100 | Simple | Ignores past results |
| Bayesian | ~50 | Sample-efficient | Complex |
| Hyperband | ~20 | Very fast | Needs early stopping |
8. Deep Dive: Acquisition Functions
Acquisition functions decide where to sample next in Bayesian Optimization.
1. Expected Improvement (EI)
\text{EI}(\theta) = \mathbb{E}[\max(f(\theta) - f(\theta^*), 0)]
- Intuition: How much better can we expect this point to be?
- Pros: Balances exploration (high uncertainty) and exploitation (high mean).
2. Upper Confidence Bound (UCB)
\text{UCB}(\theta) = \mu(\theta) + \kappa \sigma(\theta)
- \mu: Predicted mean.
- \sigma: Predicted std dev (uncertainty).
- \kappa: Exploration parameter (typically 2-3).
- Intuition: Optimistic estimate. “This could be really good!”
3. Probability of Improvement (PI)
\text{PI}(\theta) = P(f(\theta) > f(\theta^*))
- Intuition: What’s the chance this beats the current best?
- Cons: Too greedy, doesn’t care how much better.
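When the surrogate's prediction at a point is Gaussian, all three acquisition functions have closed forms. The sketch below is a stdlib-only illustration; the function names and the two example candidates are made up. It compares a "safe" candidate (slightly better mean, low uncertainty) with a "risky" one (worse mean, high uncertainty).

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for maximization under a Gaussian prediction N(mu, sigma^2)."""
    if sigma == 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    return mu + kappa * sigma

def probability_of_improvement(mu, sigma, best):
    if sigma == 0.0:
        return 1.0 if mu > best else 0.0
    return normal_cdf((mu - best) / sigma)

best = 0.90            # current best validation accuracy (hypothetical)
safe = (0.91, 0.01)    # (mu, sigma): slightly better mean, low uncertainty
risky = (0.88, 0.10)   # (mu, sigma): worse mean, high uncertainty

for name, (mu, sigma) in [("safe", safe), ("risky", risky)]:
    print(name,
          "EI=%.4f" % expected_improvement(mu, sigma, best),
          "UCB=%.4f" % upper_confidence_bound(mu, sigma),
          "PI=%.4f" % probability_of_improvement(mu, sigma, best))
```

In this example PI prefers the safe candidate, while EI and UCB prefer the risky one, which illustrates why PI is described as greedy: it ignores how large the improvement could be.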
9. Deep Dive: Hyperband Algorithm
Problem: Training to convergence is expensive. Can we stop bad trials early?
Hyperband (Successive Halving + Adaptive Resource Allocation):
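```python
import math

def hyperband(random_config, train, max_iter=81, eta=3):
    # random_config(): samples one configuration
    # train(config, epochs): returns a loss (lower is better)
    # max_iter: max epochs, eta: downsampling rate
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:   # integer log avoids float rounding
        s_max += 1
    B = (s_max + 1) * max_iter              # budget per bracket
    best_loss, best_config = float('inf'), None
    for s in reversed(range(s_max + 1)):
        n = math.ceil(B / max_iter * eta**s / (s + 1))   # initial number of configs
        r = max_iter / eta**s                            # initial epochs per config
        # Generate n random configurations
        configs = [random_config() for _ in range(n)]
        for i in range(s + 1):
            n_i = n // eta**i
            r_i = int(r * eta**i)
            # Train each config for r_i epochs
            losses = [train(c, r_i) for c in configs]
            ranked = sorted(zip(losses, configs), key=lambda pair: pair[0])
            if ranked[0][0] < best_loss:
                best_loss, best_config = ranked[0]
            # Keep top 1/eta
            configs = [c for _, c in ranked[:max(1, n_i // eta)]]
    return best_config
```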
Example: max_iter=81, eta=3 (the first and most aggressive bracket, s=4)
- Round 1: 81 configs, 1 epoch each.
- Round 2: 27 configs (top 1/3), 3 epochs each.
- Round 3: 9 configs, 9 epochs each.
- Round 4: 3 configs, 27 epochs each.
- Round 5: 1 config, 81 epochs.
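The rounds listed above are only one bracket. Hyperband also runs less aggressive brackets that start with fewer configurations and more epochs. A minimal sketch, assuming the same formulas for n and r as in the code above (the function name `hyperband_brackets` is hypothetical), tabulates every bracket:

```python
import math

def hyperband_brackets(max_iter=81, eta=3):
    """Return {s: [(n_i, r_i), ...]}: configs and epochs per round for each bracket."""
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:   # integer log_eta(max_iter)
        s_max += 1
    brackets = {}
    for s in reversed(range(s_max + 1)):
        n = math.ceil((s_max + 1) * eta**s / (s + 1))   # initial number of configs
        r = max_iter / eta**s                           # initial epochs per config
        brackets[s] = [(n // eta**i, int(r * eta**i)) for i in range(s + 1)]
    return brackets

brackets = hyperband_brackets()
for s, rounds in brackets.items():
    print(f"s={s}: {rounds}")
```

The bracket with s=4 reproduces the five rounds in the example; the bracket with s=0 is plain random search, where every configuration trains for the full 81 epochs. Running all brackets hedges against the case where early performance is a poor predictor of final performance.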
10. Deep Dive: Parallel Hyperparameter Tuning
Challenge: Bayesian Optimization is sequential (needs previous results to decide next point).
Solution 1: Batch Bayesian Optimization
- Use the acquisition function to select the top-k points.
- Evaluate them in parallel.
- Update the GP with all k results.
Solution 2: Asynchronous Successive Halving (ASHA)
- Don’t wait for all trials to finish.
- As soon as a trial completes an epoch, decide: promote or kill.
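```python
# Ray Tune with ASHA (reuses train_model and config from Section 6)
from ray import tune
from ray.tune.schedulers import ASHAScheduler

scheduler = ASHAScheduler(
    max_t=100,          # Max epochs (train_model must loop at least this long)
    grace_period=10,    # Min epochs before stopping
    reduction_factor=3
)
tune.run(
    train_model,
    config=config,
    metric='loss',      # the scheduler needs a metric and mode to rank trials
    mode='min',
    num_samples=100,
    scheduler=scheduler,
    resources_per_trial={'gpu': 1}
)
```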
11. System Design: Hyperparameter Tuning Platform
Scenario: Build a platform for 100 ML engineers to tune models.
Requirements:
- Scalability: 1000s of concurrent trials.
- Reproducibility: Track all experiments.
- Visualization: Compare trials easily.
Architecture:
- Scheduler: Ray Tune (distributed).
- Tracking: Weights & Biases (W&B) or MLflow.
- Storage: S3 for checkpoints.
- Compute: Kubernetes cluster with autoscaling.
Code:
```python
import wandb
from ray import tune

def train_with_logging(config):
    wandb.init(project='hyperparameter-tuning', config=config)
    model = build_model(config)
    for epoch in range(100):
        loss = train_epoch(model)
        wandb.log({'loss': loss, 'epoch': epoch})
        tune.report(loss=loss)
    wandb.finish()  # close the run so each trial is logged separately

tune.run(
    train_with_logging,
    config=search_space,   # search_space: a dict like `config` in Section 6
    num_samples=1000
)
```
12. Deep Dive: Transfer Learning for Hyperparameters
Idea: If we tuned hyperparameters for Task A, can we use them for Task B?
Meta-Learning Approach:
- Collect tuning history from many tasks.
- Train a model: f(\text{task features}) \rightarrow \text{good hyperparameters}.
- For a new task, predict a good starting point.
Example: Google Vizier uses this internally.
13. Production Considerations
- Cost: Each trial costs GPU hours. Set a budget.
- Reproducibility: Always set random seeds.
- Monitoring: Track resource usage (GPU util, memory).
- Checkpointing: Save model every N epochs (for Hyperband).
- Early Stopping: Don’t waste time on diverging models.
14. Deep Dive: Population-Based Training (PBT)
Origin: DeepMind (2017). Used to train AlphaStar and Waymo agents.
Concept:
- Combines Random Search (exploration) with Greedy Selection (exploitation).
- Instead of fixed hyperparameters, PBT evolves them during training.
Algorithm:
- Initialize: Start a population of N models with random hyperparameters.
- Train: Train all models for k steps.
- Eval: Evaluate performance.
- Exploit: Replace the bottom 20% of models with copies of the top 20%.
- Explore: Perturb the hyperparameters of the copied models (mutation), e.g. lr = lr * random.choice([0.8, 1.2]).
- Repeat: Continue training.
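The loop above fits in a few lines on a toy problem. In this sketch, "training" is assumed to be gradient descent on f(theta) = theta^2, the only hyperparameter is the learning rate, and all names (`train_steps`, `evaluate`, `pbt`) are hypothetical. It is meant to show the exploit/explore mechanics, not a production setup.

```python
import random

def train_steps(member, k):
    """Toy training: k gradient-descent steps on f(theta) = theta^2."""
    for _ in range(k):
        member["theta"] -= member["lr"] * 2.0 * member["theta"]

def evaluate(member):
    """Performance of a member; higher is better."""
    return -member["theta"] ** 2

def pbt(population_size=10, rounds=5, k=5, seed=0):
    rng = random.Random(seed)
    # Initialize: same starting weights, random log-uniform learning rates
    population = [{"theta": 5.0, "lr": 10 ** rng.uniform(-4, -1)}
                  for _ in range(population_size)]
    cutoff = max(1, population_size // 5)           # 20% of the population
    for _ in range(rounds):
        for member in population:                   # Train
            train_steps(member, k)
        population.sort(key=evaluate, reverse=True) # Eval
        top, bottom = population[:cutoff], population[-cutoff:]
        for loser in bottom:
            winner = rng.choice(top)
            loser["theta"] = winner["theta"]                     # Exploit: copy weights
            loser["lr"] = winner["lr"] * rng.choice([0.8, 1.2])  # Explore: mutate
    return max(population, key=evaluate)

best = pbt()
print(best, evaluate(best))
```

Note that the copied members inherit both the weights and a perturbed learning rate, so the learning rate of any surviving lineage can change from round to round. That is how PBT ends up with a schedule rather than a single fixed value.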
Benefits:
- Dynamic Schedules: Discovers complex schedules (e.g., “start with high LR, then decay, then spike”).
- Efficiency: Little compute is wasted on bad trials (they are replaced by copies of better ones).
- Single Run: You get a fully trained model at the end, not just a config.
15. Deep Dive: Neural Architecture Search (NAS)
Hyperparameters aren’t just numbers (LR, Batch Size). They can be the architecture itself.
Search Space:
- Number of layers.
- Operation type (Conv3x3, Conv5x5, MaxPool).
- Skip connections.
Algorithms:
- Reinforcement Learning (RL):
  - Controller (RNN) generates an architecture string.
  - Train child network, get accuracy (Reward).
  - Update Controller using Policy Gradient.
  - Cons: Extremely slow (NASNet is commonly cited at around 2000 GPU-days).
- Evolutionary Algorithms (EA):
  - Mutate architectures (add layer, change filter size).
  - Select best, repeat.
  - Example: AmoebaNet.
- Differentiable NAS (DARTS):
  - Relax discrete choices into continuous weights (softmax).
  - Train architecture weights \alpha and model weights w jointly using gradient descent.
  - Pros: Fast (a few GPU-days or less).
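The continuous relaxation in DARTS can be illustrated without a deep learning framework. In the sketch below the candidate operations are hypothetical scalar functions standing in for conv and pooling ops, and no gradient training is performed; it only shows how a softmax over architecture weights mixes the candidates and how the final architecture is read off.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations on a scalar input (stand-ins for conv/pool ops)
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alpha):
    """Softmax-weighted sum of all candidate ops (the continuous relaxation)."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alpha):
    """After the search, keep only the op with the largest architecture weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alpha)), key=lambda i: alpha[i])]

uniform_alpha = [0.0, 0.0, 0.0]            # equal mixture before any training
print(mixed_op(3.0, uniform_alpha))        # average of 3.0, 6.0 and 0.0
print(discretize([0.2, 1.5, -0.3]))        # op with the largest weight
```

Because `mixed_op` is differentiable in `alpha`, a framework with automatic differentiation can update the architecture weights with the same optimizer machinery used for the model weights.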
16. Deep Dive: The Math of Gaussian Processes (GP)
Bayesian Optimization relies on GPs. What are they?
Definition: A GP is a distribution over functions, defined by a mean function m(x) and a covariance function (kernel) k(x, x').
f(x) \sim GP(m(x), k(x, x'))
Kernels:
- RBF (Radial Basis Function): Smooth functions.
k(x, x') = \sigma^2 \exp(-\frac{||x - x'||^2}{2l^2})
- Matern: Rougher functions (better for deep learning landscapes).
Posterior Update:
Given observed data D = \{(x_i, y_i)\}, the predictive distribution for a new point x_* is Gaussian:
P(f_* | D, x_*) = \mathcal{N}(\mu_*, \Sigma_*)
\mu_* = K_*^T (K + \sigma_n^2 I)^{-1} y
\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*
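Here K is the kernel matrix over the observed points, K_* is the vector of kernel values between the observed points and x_*, K_{**} = k(x_*, x_*), and \sigma_n^2 is the observation noise variance.
- \mu_*: Predicted value (Exploitation).
- \Sigma_*: Uncertainty (Exploration).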
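The posterior equations can be evaluated by hand for a tiny dataset. The sketch below is a pure-Python illustration under stated assumptions: a zero-mean prior, an RBF kernel with unit variance and length scale, and a small hand-written linear solver (the helper names `rbf`, `solve`, and `gp_posterior` are hypothetical). A real implementation would use a Cholesky factorization from a numerical library.

```python
import math

def rbf(a, b, length_scale=1.0, variance=1.0):
    return variance * math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(X, y, x_star, noise=1e-6):
    """Posterior mean and variance at x_star for a zero-mean GP."""
    # K + sigma_n^2 I
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_star) for a in X]
    # mu_* = K_*^T (K + sigma_n^2 I)^{-1} y
    mean = sum(k * w for k, w in zip(k_star, solve(K, y)))
    # Sigma_* = K_** - K_*^T (K + sigma_n^2 I)^{-1} K_*
    var = rbf(x_star, x_star) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)

X, y = [-1.0, 0.0, 1.0], [0.5, 1.0, 0.5]
print(gp_posterior(X, y, 0.0))   # at an observed point
print(gp_posterior(X, y, 4.0))   # far away from all observations
```

At an observed point the mean reproduces the observation and the variance collapses towards zero; far from the data the mean falls back to the prior (zero) and the variance returns to the prior variance. That contrast is exactly what acquisition functions trade off.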
17. Deep Dive: Tree-Structured Parzen Estimator (TPE)
Optuna uses TPE by default. It’s faster than GPs for high dimensions.
Idea: Instead of modeling P(y|x) (GP), model P(x|y) and P(y).
- Split Data: Divide observations into two groups:
  - Top 20% (Good): l(x)
  - Bottom 80% (Bad): g(x)
- Density Estimation: Fit Kernel Density Estimators (KDE) to l(x) and g(x).
  - “What do good hyperparameters look like?”
  - “What do bad hyperparameters look like?”
- Acquisition: Maximize Expected Improvement, which simplifies to maximizing:
\frac{l(x)}{g(x)}
Intuition: Pick x that is highly likely under the “Good” distribution and unlikely under the “Bad” distribution.
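A one-dimensional version of this idea is short enough to write out. The sketch below makes several simplifying assumptions that real TPE implementations do not: a fixed KDE bandwidth, candidates taken from a fixed grid instead of being sampled from l(x), and a toy objective. The names `kde` and `tpe_suggest` are hypothetical.

```python
import math

def kde(samples, bandwidth):
    """Gaussian kernel density estimate built from 1-D samples."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
    return density

def tpe_suggest(observations, candidates, gamma=0.2, bandwidth=0.5):
    """Pick the candidate maximizing l(x) / g(x). observations: list of (x, score)."""
    ranked = sorted(observations, key=lambda o: o[1], reverse=True)  # higher score = better
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]      # top 20%
    bad = [x for x, _ in ranked[n_good:]]       # bottom 80%
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

def objective(x):
    return -(x - 2.0) ** 2                      # toy objective, maximum at x = 2

# 21 past trials spread evenly over [-5, 5]
observations = [(x, objective(x)) for x in [i * 0.5 for i in range(-10, 11)]]
candidates = [i / 10 for i in range(-50, 51)]

suggestion = tpe_suggest(observations, candidates)
print(suggestion)
```

The suggestion lands inside the cluster of good observations near the optimum, where l(x) is high and g(x) is low.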
18. Case Study: Tuning BERT for Production
Scenario: Fine-tuning BERT-Large for Sentiment Analysis.
Search Space:
- Learning Rate: 1e-5, 2e-5, 3e-5, 5e-5.
- Batch Size: 16, 32.
- Epochs: 2, 3, 4.
- Warmup Steps: 0, 100, 500.
Key Findings (RoBERTa paper and related fine-tuning studies):
- Batch Size: Larger is better (up to a point).
- Training Duration: Training longer with smaller LR is better than short/high LR.
- Layer-wise LR Decay: Lower layers (closer to input) capture general features and need a smaller LR. Higher layers need a larger LR.
\text{LR}_{layer} = \text{LR}_{base} \times \xi^{L - layer}
where \xi = 0.95 and L is the index of the top layer.
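The decay formula is easy to tabulate. The sketch below assumes layers are numbered from 1 (closest to the input) to L (the top layer) and uses a base learning rate of 2e-5 from the search space above; the function name is hypothetical.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.95):
    """LR for layers 1..num_layers; the top layer (layer == num_layers) gets base_lr."""
    return {layer: base_lr * decay ** (num_layers - layer)
            for layer in range(1, num_layers + 1)}

rates = layerwise_learning_rates(base_lr=2e-5, num_layers=24)  # BERT-Large has 24 layers
for layer in (1, 12, 24):
    print(f"layer {layer:2d}: lr = {rates[layer]:.2e}")
```

With 24 layers the bottom layer ends up with a learning rate of base_lr × 0.95^23, roughly a third of the top layer's rate, so the general-purpose lower layers move much less during fine-tuning.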
19. Case Study: AlphaGo Zero Tuning
Problem: Tuning Monte Carlo Tree Search (MCTS) + Neural Network.
Hyperparameters:
- c_{puct}: Exploration constant in MCTS.
- Dirichlet Noise \alpha: Noise added to root node for exploration.
- Self-play games: How many games before retraining?
Strategy:
- Self-Play Evaluation: New model plays 400 games against old model.
- Gating: Only promote if win rate > 55%.
- Massive Parallelism: Thousands of TPUs generating self-play data.
20. System Design: Scalable Tuning Infrastructure
Components:
- Experiment Manager (Katib / Ray Tune):
  - Stores search space config.
  - Generates trials.
- Trial Runner (Kubernetes Pods):
  - Pulls Docker image.
  - Runs training code.
  - Reports metrics to Manager.
- Database (MySQL / PostgreSQL):
  - Stores trial history (params, metrics).
- Dashboard (Vizier / W&B):
  - Visualizes parallel coordinate plots.
Scalability Challenges:
- Database Bottleneck: 1000 concurrent trials reporting metrics every second.
  - Fix: Buffer metrics in Redis, flush to DB periodically.
- Pod Startup Latency: K8s can take tens of seconds to start a pod (scheduling, image pull).
  - Fix: Use a pool of warm pods (Ray Actors).
21. Deep Dive: Multi-Objective Optimization
Real World: We don’t just want accuracy. We want:
- Maximize Accuracy.
- Minimize Latency.
- Minimize Model Size.
Pareto Frontier:
- A set of solutions where you cannot improve one objective without hurting another.
- Dominated Solution: Another solution is at least as good in every objective and strictly better in at least one.
- Non-Dominated Solution: No other solution dominates it.
Scalarization:
- Convert to single objective: L = w_1 \cdot Acc + w_2 \cdot \frac{1}{Lat}.
- Problem: Need to tune weights w.
NSGA-II (Non-dominated Sorting Genetic Algorithm):
- Available in Optuna (NSGAIISampler) for multi-objective search.
- Evolves a population toward the Pareto front using non-dominated sorting.
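Dominance and the Pareto frontier can be computed directly for a handful of trials. The sketch below assumes two objectives, accuracy (maximized) and latency in milliseconds (minimized); the trial results are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Each solution is (accuracy, latency_ms): accuracy is maximized, latency minimized.
    """
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return at_least_as_good and strictly_better

def pareto_front(solutions):
    """Keep every solution that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Hypothetical (accuracy, latency in ms) results from five trials
trials = [(0.90, 50), (0.92, 80), (0.88, 40), (0.91, 90), (0.85, 45)]
front = pareto_front(trials)
print(front)
```

Here (0.91, 90) is dominated by (0.92, 80), and (0.85, 45) is dominated by (0.88, 40); the remaining three trials form the frontier. Choosing among them is a product decision, not an optimization problem.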
22. Code: Implementing a Simple Bayesian Optimizer
Let’s build a toy BO from scratch using scikit-learn.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from scipy.stats import norm

class SimpleBayesianOptimizer:
    def __init__(self, objective_func, bounds):
        self.objective = objective_func
        self.bounds = bounds
        self.X = []   # evaluated points (floats)
        self.y = []   # observed objective values (floats)
        self.gp = GaussianProcessRegressor(kernel=Matern(nu=2.5))

    def expected_improvement(self, X_candidates):
        mu, sigma = self.gp.predict(X_candidates, return_std=True)
        mu_sample_opt = np.max(self.y)
        imp = mu - mu_sample_opt
        # sigma can be 0 at already-evaluated points; silence the warning
        # and set EI to 0 there
        with np.errstate(divide='ignore', invalid='ignore'):
            Z = imp / sigma
            ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
        ei[sigma == 0.0] = 0.0
        return ei

    def _evaluate(self, x):
        # Store plain floats so X and y stay flat lists of equal length
        self.X.append(float(x))
        self.y.append(float(self.objective(x)))

    def optimize(self, n_iters=10):
        # Initial random samples
        for _ in range(2):
            self._evaluate(np.random.uniform(self.bounds[0], self.bounds[1]))
        for i in range(n_iters):
            # Fit GP
            self.gp.fit(np.array(self.X).reshape(-1, 1), np.array(self.y))
            # Find point with max EI
            X_grid = np.linspace(self.bounds[0], self.bounds[1], 100).reshape(-1, 1)
            ei = self.expected_improvement(X_grid)
            # Evaluate
            self._evaluate(X_grid[np.argmax(ei), 0])
            print(f"Iter {i}: Best y = {np.max(self.y):.4f}")

# Usage
def objective(x):
    return -1 * (x - 2)**2 + 10  # Max at x=2

opt = SimpleBayesianOptimizer(objective, bounds=(-5, 5))
opt.optimize(n_iters=10)
```
23. Future Trends: AutoML-Zero
Goal: Evolve the algorithms themselves, not just parameters.
Method:
- Represent ML algorithms as a sequence of basic math operations (add, multiply, sin, cos).
- Use evolutionary algorithms to discover “Gradient Descent” or “Neural Networks” from scratch.
- Result: Rediscovered backpropagation and linear regression.
Implication: Future ML engineers might tune “Search Space Definitions” rather than models.
25. Deep Dive: Hyperparameter Importance Analysis
After running 100 trials, you want to know: Which knob actually mattered?
Methods:
- fANOVA (Functional Analysis of Variance):
  - Decomposes the variance of the objective function into additive components.
  - “60% of variance comes from Learning Rate, 10% from Batch Size, 5% from interaction between LR and Batch Size.”
  - Tool: optuna.importance.get_param_importances(study).
- SHAP (SHapley Additive exPlanations):
  - Treats hyperparameter values as “features” and the objective value as the “prediction”.
  - Calculates the marginal contribution of each hyperparameter.
- Parallel Coordinate Plots:
  - Visualizes the high-dimensional relationships.
  - Useful for spotting “bad regions” (e.g., “High LR + Low Batch Size always crashes”).
Actionable Insight:
- If num_layers has 1% importance, stop tuning it! Fix it to a reasonable default and save compute.
26. Deep Dive: Handling Categorical & Conditional Hyperparameters
Real-world search spaces are messy.
Categorical:
- optimizer: [“Adam”, “SGD”, “RMSprop”]
- Problem: GPs assume continuous distance. Distance(“Adam”, “SGD”) is undefined.
- Solution: One-hot encoding or using Tree-based models (Random Forests, TPE) which handle splits naturally.
Conditional (Nested):
- IF optimizer == "SGD" THEN tune momentum.
- IF optimizer == "Adam" THEN tune beta1, beta2.
- Problem: momentum is irrelevant if optimizer is Adam.
- Solution:
  - TPE: Handles this naturally by splitting the tree.
  - ConfigSpace: A library specifically for defining DAG-structured search spaces.
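A conditional search space is easiest to understand as a sampler that branches. The sketch below uses only the standard library; the parameter ranges are illustrative assumptions, and `sample_config` is a hypothetical helper, not part of any tuning library. Define-by-run frameworks express the same idea by calling their suggest functions inside an if/else.

```python
import random

def sample_config(rng):
    """Sample from a conditional space: child parameters depend on the optimizer."""
    config = {
        "optimizer": rng.choice(["Adam", "SGD"]),
        "lr": 10 ** rng.uniform(-5, -1),          # log-uniform in [1e-5, 1e-1]
    }
    if config["optimizer"] == "SGD":
        config["momentum"] = rng.uniform(0.0, 0.99)
    else:  # Adam
        config["beta1"] = rng.uniform(0.8, 0.99)
        config["beta2"] = rng.uniform(0.9, 0.9999)
    return config

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
for c in configs:
    print(c)
```

Each sampled configuration only contains the parameters that are active for its optimizer, so the tuner never wastes trials varying momentum for Adam.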
27. Deep Dive: Warm-Starting Optimization
Problem: Every time we tune a new model, we start from scratch (random sampling).
Reality: We have tuned 50 similar models before.
Strategies:
- Initial Points:
  - Instead of random initialization, seed the optimizer with the best configs from previous studies.
  - study.enqueue_trial({'lr': 1e-3, 'batch_size': 32}).
- Transfer Learning for GPs:
  - Use data from previous tasks to learn a “prior” for the GP mean function.
  - Multi-Task Bayesian Optimization: Model the correlation between Task A and Task B. If they are correlated, observations in A reduce uncertainty in B.
- Meta-Learning (Auto-Sklearn):
  - Compute meta-features of the dataset (num_rows, num_cols, class_balance).
  - Find nearest neighbors in the “dataset space”.
  - Reuse their best hyperparameters.
28. Case Study: Tuning XGBoost vs Neural Networks
XGBoost / LightGBM:
- Key Params: max_depth, learning_rate, subsample, colsample_bytree, min_child_weight.
- Landscape: Rugged but convex-ish locally.
- Strategy: Random Search is often “good enough”. TPE works very well.
- Cost: Fast to train (seconds/minutes). Can run 1000s of trials.
Neural Networks (ResNet/Transformer):
- Key Params: lr, batch_size, optimizer, scheduler.
- Landscape: Non-convex, saddle points, noise.
- Strategy: Must use Learning Rate Schedules. Tuning the schedule is more important than tuning the fixed LR.
- Cost: Slow (hours/days). Must use Early Stopping (Hyperband).
29. Ethical Considerations: The Carbon Footprint of Tuning
The Cost:
- Training a Transformer with NAS was estimated to emit about 626,000 lbs of CO2 (roughly five cars’ lifetime emissions; Strubell et al., 2019).
- “Red AI” (buying performance with massive compute) vs “Green AI” (efficiency).
Mitigation Strategies:
- Green NAS: Penalize energy consumption in the objective function.
L = \text{Error} + \lambda \cdot \text{Energy}
- Proxy Tasks: Tune on a subset of data (10%), then transfer to full data.
- Share Configs: Publish the best hyperparameters so others don’t have to re-tune. (Hugging Face Model Cards).
32. Deep Dive: The Future of Tuning - LLMs as Optimizers
OptiMus (2023):
- Uses an LLM (GPT-4) to suggest hyperparameters.
- Prompt: “I am training a ResNet-50. The loss is oscillating. Current LR is 0.1. What should I try next?”
- Response: “Try reducing LR to 0.01 and adding a scheduler.”
- Why it can work: LLMs are trained on large amounts of papers and GitHub issues, so they encode heuristics about training dynamics that Bayesian Optimization lacks.
OMNI (OpenAI):
- Future systems will likely abstract tuning away entirely. You provide data + metric, the system handles the rest.
33. Deep Dive: Tuning for Robustness and Fairness
Robustness (Adversarial Training):
- Hyperparams: Epsilon (perturbation size), Alpha (step size).
- Trade-off: Increasing robustness often decreases clean accuracy.
- Tuning Goal: Find the Pareto frontier between Accuracy and Robustness.
Fairness:
- Hyperparams: Regularization strength for fairness constraints (e.g., Equalized Odds).
- Objective: Minimize Error + \lambda \cdot \text{Disparity}.
- Tuning: We need to find the \lambda that satisfies legal/ethical requirements while maximizing utility.
34. Code: Grid Search from Scratch
To understand why Grid Search scales badly, let’s implement it.
```python
import itertools

def grid_search(objective, param_grid):
    keys = list(param_grid.keys())
    combinations = list(itertools.product(*param_grid.values()))
    best_score = -float('inf')
    best_params = None
    print(f"Total combinations: {len(combinations)}")
    for combo in combinations:
        params = dict(zip(keys, combo))
        score = objective(params)
        if score > best_score:
            best_score = score
            best_params = params
    return best_params, best_score

# Usage
grid = {
    'lr': [0.1, 0.01, 0.001],
    'batch_size': [32, 64, 128],
    'dropout': [0.1, 0.5]
}
# 3 * 3 * 2 = 18 trials.
# If we add one more parameter with 5 options -> 90 trials.
# Exponential explosion!
```
35. Production Checklist for Hyperparameter Tuning
Before you launch a tuning job:
- Define Metric: Is it Accuracy? F1? AUC? Latency?
- Define Budget: How many GPU hours can I afford?
- Choose Algorithm:
  - Fewer than 10 params: Bayesian Optimization (Optuna).
  - More than 10 params: Random Search or Hyperband.
  - Neural Net: Hyperband / ASHA.
- Set Search Space:
  - Use Log Scale for LR and Regularization.
  - Don’t tune things that don’t matter (e.g., random seed).
- Enable Early Stopping: Don’t waste compute.
- Log Everything: Use W&B / MLflow.
- Verify on Test Set: Evaluate the single best model on the held-out test set.
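The advice to use a log scale for the learning rate is worth a quick numerical check. The sketch below is a stdlib-only illustration with a hypothetical helper name; it samples the range [1e-5, 1e-1] both linearly and log-uniformly and counts how often each lands below 1e-3.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample so that every order of magnitude in [low, high] is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def fraction_below(samples, threshold):
    return sum(s < threshold for s in samples) / len(samples)

rng = random.Random(0)
n = 10_000
linear = [rng.uniform(1e-5, 1e-1) for _ in range(n)]
logscale = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(n)]

print(f"linear:    {fraction_below(linear, 1e-3):.3f} of samples below 1e-3")
print(f"log-scale: {fraction_below(logscale, 1e-3):.3f} of samples below 1e-3")
```

Linear sampling puts only about 1% of trials below 1e-3, even though that region covers two of the four orders of magnitude in the range. Log-uniform sampling spends about half of the trials there.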
36. Deep Dive: Bayesian Optimization Hyperband (BOHB)
Problem:
- Bayesian Optimization is great at finding good configs but slow (doesn’t kill bad trials).
- Hyperband is fast (kills bad trials) but random (doesn’t learn from history).
Solution: BOHB (2018)
- Combines the best of both.
- Uses Hyperband to determine how many resources (epochs) to allocate.
- Uses Bayesian Optimization (TPE) to select the configurations to run at each step.
- Result: Typically converges faster than either method alone; a strong default for many problems.
37. Deep Dive: The “No Free Lunch” Theorem in Tuning
Theorem: Averaged over all possible problems, every optimization algorithm performs equally well (same as random search).
Implication:
- There is no “Best Optimizer” for every problem.
- TPE might be best for XGBoost.
- CMA-ES might be best for Reinforcement Learning.
- Adam might be best for CNNs.
- Lesson: Try multiple optimizers if you are stuck.
38. Deep Dive: Tuning Generative Models (GANs / Diffusion)
Tuning GANs is notoriously hard.
Challenges:
- Mode Collapse: Generator produces only one image.
- Non-Convergence: Discriminator becomes too strong too fast.
Key Hyperparameters:
- Learning Rate Ratio: Often set with TTUR (Two-Time-Scale Update Rule), e.g. LR_{disc} = 4 \times LR_{gen}.
- Beta1: Momentum. Often set to 0.0 or 0.5 (instead of default 0.9).
- Gradient Penalty: Weight \lambda for WGAN-GP.
Diffusion Models:
- Noise Schedule: Linear vs Cosine.
- Timesteps: 1000? 4000?
- EMA Decay: Exponential Moving Average of weights (crucial for quality).
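The EMA of weights mentioned above is a one-line update. The sketch below uses a single scalar "weight" that oscillates around 1.0 as a stand-in for noisy training updates; the decay value and the toy history are illustrative assumptions, and `ema_update` is a hypothetical helper.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema = decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy "training": noisy weights oscillating around a true value of 1.0
weights_history = [[1.0 + (0.5 if step % 2 == 0 else -0.5)] for step in range(1000)]

ema = weights_history[0]
for weights in weights_history[1:]:
    ema = ema_update(ema, weights, decay=0.99)

print("last raw weights:", weights_history[-1])
print("EMA weights:     ", ema)
```

The raw weights end half a unit away from the true value, while the EMA stays close to 1.0. The decay controls the trade-off: higher values smooth more but react more slowly, which is why it is a hyperparameter worth tuning.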
39. Case Study: Tuning Stable Diffusion
Goal: Fine-tune Stable Diffusion on a specific style (e.g., “Disney Style”).
Method: Dreambooth / LoRA.
Hyperparameters:
- Learning Rate: Extremely sensitive. 1e-6 often works, while 1e-5 can destroy the model.
- Text Encoder Training: Train it or freeze it? (Training = better likeness, Freezing = better editing).
- Prior Preservation Loss: Weight of the class images (to prevent forgetting what a “dog” looks like).
40. The Psychology of Tuning
Why do humans struggle with tuning?
- Confirmation Bias: We try 3 things, one works, and we assume it’s the “Golden Config”. We stop searching.
- Sunk Cost Fallacy: “I spent 3 days tuning this ResNet. I can’t switch to EfficientNet now.”
- Dimensionality Curse: Humans can visualize 2D/3D. We cannot intuit 10D spaces. We miss interactions (e.g., “LR is only bad if Batch Size is small”).
Lesson: Trust the algorithm. Don’t “babysit” the tuner.
41. Checklist for Debugging Tuning Failures
If your tuner isn’t finding good results:
- Is the search space too big? Prune irrelevant parameters.
- Are the ranges correct? Is LR [1e-5, 1e-1] or [1, 10]? (Common bug).
- Is the metric noisy? If running the same config twice gives \pm 5\% accuracy, the tuner is confused. Fix the seed or average over runs.
- Is the budget too small? 10 trials is not enough for 10 parameters.
- Is the model broken? Does it train with default parameters? If not, fix the code first.
42. Summary
| Method | Trials Needed (rough) | Pros | Cons |
|---|---|---|---|
| Grid | O(k^n) | Exhaustive | Exponential |
| Random | ~100 | Simple | Ignores past results |
| Bayesian | ~50 | Sample-efficient | Complex |
| Hyperband | ~20 | Very fast | Needs early stopping |
| ASHA | ~20 | Parallel | Needs intermediate metrics |
| PBT | ~20 | Dynamic schedules | Complex setup |
| NAS | ~1000 | Finds architecture | Very expensive |
| BOHB | ~30 | Best of both worlds | Complex |
| LLM | ~10 | Uses “common sense” | New, experimental |
43. Further Reading
- “Algorithms for Hyper-Parameter Optimization” (Bergstra et al., 2011): The paper that introduced TPE.
- “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization” (Li et al., 2018): The standard for resource allocation.
- “Google Vizier: A Service for Black-Box Optimization” (Golovin et al., 2017): How Google does it at scale.
- “Neural Architecture Search with Reinforcement Learning” (Zoph & Le, 2017): The paper that started the NAS craze.
- “Optuna: A Next-generation Hyperparameter Optimization Framework” (Akiba et al., 2019): The define-by-run philosophy.
44. Conclusion
Hyperparameter optimization is no longer a “nice to have”; it is a critical component of the modern ML stack. As models grow larger and compute becomes more expensive, the ability to efficiently navigate the search space becomes a competitive advantage. Whether you are using simple Random Search for a baseline or deploying massive Population-Based Training on a Kubernetes cluster, the principles remain the same: explore the unknown, exploit the promising, and automate everything.
FAQ
When should you use Bayesian optimization vs random search for hyperparameter tuning?
Random search is a strong baseline when trials are cheap and can run in parallel. It covers the important dimensions efficiently because usually only a few hyperparameters matter. Bayesian optimization typically needs fewer trials (roughly half, as a rule of thumb) by building a probabilistic surrogate model of the objective function, but it adds complexity and works best with a modest number of hyperparameters. For neural networks where each trial takes hours, combining Bayesian optimization with early stopping via Hyperband (BOHB) is a common approach.
How does Hyperband reduce the cost of hyperparameter optimization?
Hyperband starts many trials with minimal training epochs, progressively eliminates the worst performers via successive halving, and allocates more epochs only to promising configurations. This is often cited as a 10-100x speedup over full evaluation because most bad configurations can be identified within the first few epochs. ASHA extends this to asynchronous parallel execution, making it practical on large clusters.
What is population-based training and when should you use it?
PBT trains a population of models simultaneously, periodically replacing poor performers with copies of top performers and mutating their hyperparameters. Unlike traditional tuning that finds a fixed configuration, PBT discovers dynamic schedules like learning rate warmup followed by decay. It was used by DeepMind for AlphaStar and works best for reinforcement learning and training scenarios with complex dynamics where static hyperparameters are suboptimal.
How do you determine which hyperparameters actually matter?
Use fANOVA (functional analysis of variance) to decompose objective variance into contributions from each hyperparameter. Optuna provides this via get_param_importances. If a hyperparameter contributes less than 5% of variance, fix it to a sensible default and remove it from the search space. Parallel coordinate plots help visualize interaction effects between hyperparameters that single-factor analysis misses.
Originally published at: arunbaby.com/ml-system-design/0038-hyperparameter-optimization