Speech Experiment Management
Design experiment management systems tailored for speech research—tracking audio data, models, metrics, and multi-dimensional experiments at scale.
Problem Statement
Design a Speech Experiment Management System that:
- Tracks speech-specific metadata: Audio datasets, speaker distributions, language configs, acoustic features
- Manages complex experiment spaces: Model architecture × training data × augmentation × decoding hyperparameters
- Enables systematic evaluation: WER/CER on multiple test sets, multi-lingual benchmarks, speaker-level analysis
- Supports reproducibility: Re-run experiments with exact data/model/environment state
- Integrates with speech toolkits: ESPnet, Kaldi, SpeechBrain, Fairseq
- Handles large-scale artifacts: Audio files, spectrograms, language models, acoustic models
Functional Requirements
- Experiment tracking:
- Log hyperparameters (learning rate, batch size, model architecture)
- Track data configs (train/val/test splits, languages, speaker sets)
- Log metrics (WER, CER, latency, RTF, MOS for TTS)
- Store artifacts (model checkpoints, attention plots, decoded outputs)
- Track augmentation policies (SpecAugment, noise, speed perturbation)
- Multi-dimensional organization:
- Organize by task (ASR, TTS, diarization, KWS)
- Group by language (en, zh, es, multi-lingual)
- Tag by domain (broadcast, conversational, read speech)
- Link related experiments (ablations, ensembles, fine-tuning)
- Evaluation and comparison:
- Compute WER/CER on multiple test sets
- Speaker-level and utterance-level breakdowns
- Compare across languages, domains, noise conditions
- Visualize attention maps, spectrograms, learning curves
- Data versioning:
- Track dataset versions (hashes, splits, preprocessing)
- Track audio feature configs (sample rate, n_mels, hop_length)
- Link experiments to specific data versions
- Model versioning:
- Track model checkpoints (epoch, step, metric value)
- Store encoder/decoder weights separately (for transfer learning)
- Link to pre-trained models (HuggingFace, ESPnet zoo)
- Collaboration and sharing:
- Share experiments with team
- Export to papers (LaTeX tables, plots)
- Integration with notebooks (Jupyter, Colab)
Non-Functional Requirements
- Scale: 1000s of experiments, 100K+ hours of audio, 10+ languages
- Performance: Fast queries (<1s), efficient artifact storage (deduplication)
- Reliability: No data loss, support for resuming failed experiments
- Integration: Minimal code changes to existing training scripts
- Cost efficiency: Optimize storage for large audio datasets and models
Understanding the Requirements
Why Speech Experiment Management Is Different
General ML experiment tracking (like MLflow) works, but speech has unique challenges:
- Audio data is large:
- Raw audio: ~2 MB/min at 16 kHz (16-bit mono PCM)
- Spectrograms: even larger for high-resolution features
- Solution: Track data by reference (paths/hashes), not by copying
- Multi-lingual and multi-domain:
- Same architecture, different languages/domains
- Need systematic organization and comparison across dimensions
- Complex evaluation:
- WER/CER on multiple test sets (clean, noisy, far-field, accented)
- Speaker-level and utterance-level analysis
- Not just a single “accuracy” metric
- Decoding hyperparameters matter:
- Beam width, language model weight, length penalty
- Often swept post-training → need to track separately
- Long training times:
- ASR models can train for days/weeks
- Need robust checkpointing and resumability
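Because a single ASR run can take days, resumability is worth designing in from the start. Below is a minimal sketch of a resume-aware training loop, assuming PyTorch-style state dicts; train_one_epoch is an assumed helper (the same name is used in the SDK example later in this section), and the checkpoint directory layout is just an illustration.
import os
import torch

def train_with_resume(model, optimizer, train_loader, ckpt_dir, max_epochs=100):
    """Resume from the latest checkpoint if one exists, then continue training."""
    os.makedirs(ckpt_dir, exist_ok=True)
    latest_path = os.path.join(ckpt_dir, "latest.pt")
    start_epoch = 0
    if os.path.exists(latest_path):
        state = torch.load(latest_path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1  # continue after the last completed epoch
    for epoch in range(start_epoch, max_epochs):
        train_one_epoch(model, train_loader)  # assumed helper, as in the SDK example below
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            latest_path,
        )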
The Systematic Iteration Connection
Just like Spiral Matrix systematically explores a 2D grid:
- Speech experiment management explores multi-dimensional spaces:
- Model (architecture, size) × Data (language, domain, augmentation) × Hyperparameters (LR, batch size) × Decoding (beam, LM weight)
- Both require state tracking:
- Spiral: track boundaries and current position
- Experiments: track which configs have been tried, which are running, which failed
- Both enable resumability:
- Spiral: pause and resume traversal from any boundary state
- Experiments: resume training from checkpoints, resume hyperparameter sweeps
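To make the analogy concrete, here is a minimal sketch of sweep bookkeeping: enumerate a multi-dimensional config grid deterministically and persist each config's status, so an interrupted sweep resumes exactly where it left off. The JSON status file is a stand-in for the metadata DB a real tracker would use; the grid contents are illustrative.
import itertools
import json
import os

def sweep_configs(grid):
    """Deterministically enumerate a dict-of-lists config grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def resume_sweep(grid, state_path="sweep_state.json"):
    """Yield only configs not yet completed; record status so the sweep is resumable."""
    state = json.load(open(state_path)) if os.path.exists(state_path) else {}
    for config in sweep_configs(grid):
        key = json.dumps(config, sort_keys=True)
        if state.get(key) == "completed":
            continue  # finished in a previous session
        yield config
        state[key] = "completed"
        with open(state_path, "w") as f:
            json.dump(state, f, indent=2)

# Example: model x learning rate x beam width
grid = {"model": ["conformer", "transformer"], "lr": [1e-3, 3e-4], "beam": [5, 10]}
for config in resume_sweep(grid):
    pass  # launch or run the experiment for this config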
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Speech Experiment Management System │
└─────────────────────────────────────────────────────────────────┘
Client Layer
┌────────────────────────────────────────────┐
│ Python SDK │ CLI │ Web UI │ API │
│ (ESPnet / │ │ │ │
│ SpeechBrain│ │ │ │
│ integration)│ │ │ │
└─────────────────────┬──────────────────────┘
│
API Gateway
┌──────────────┴──────────────┐
│ - Auth & access control │
│ - Request routing │
│ - Metrics & logging │
└──────────────┬──────────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌───────▼────────┐ ┌────▼─────┐ ┌───────▼────────┐
│ Metadata │ │ Metrics │ │ Artifact │
│ Service │ │ Service │ │ Service │
│ │ │ │ │ │
│ - Experiments │ │ - WER │ │ - Models │
│ - Runs │ │ - Loss │ │ - Checkpoints │
│ - Data configs │ │ - Curves │ │ - Spectrograms │
│ - Model configs│ │ - Tables │ │ - Decoded logs │
│ - Speaker info │ │ - Speaker│ │ - Audio files │
│ │ │ metrics│ │ (references) │
└───────┬────────┘ └────┬─────┘ └───────┬────────┘
│ │ │
┌───────▼────────┐ ┌────▼─────┐ ┌───────▼────────┐
│ SQL DB │ │ Time- │ │ Object Store │
│ (Postgres) │ │ Series │ │ + Data Lake │
│ │ │ DB │ │ │
│ - Experiments │ │ - Metrics│ │ - Models │
│ - Runs │ │ - Fast │ │ - Audio refs │
│ - Configs │ │ queries│ │ - Spectrograms │
└────────────────┘ └──────────┘ └────────────────┘
Key Components
- Metadata Service:
- Stores experiment metadata (model, data, hyperparameters)
- Speech-specific fields: language, domain, speaker count, sample rate
- Relational DB for structured queries
- Metrics Service:
- Stores training metrics (loss, learning rate schedule)
- Stores evaluation metrics (WER, CER, per-test-set, per-speaker)
- Time-series DB for efficient queries
- Artifact Service:
- Stores models, checkpoints, attention plots, spectrograms
- References to audio files (not copies—audio stays in data lake)
- Deduplication for repeated artifacts (e.g., pre-trained encoders)
- Data Lake / Audio Storage:
- Centralized storage for audio datasets
- Organized by language, domain, speaker
- Accessed via paths/URIs, not copied into experiment artifacts
- Web UI:
- Dashboard for experiments
- WER comparison tables across test sets
- Attention plot visualization
- Decoded output inspection
Component Deep-Dive
1. Speech-Specific Metadata Schema
Extend general experiment tracking schema with speech-specific fields:
CREATE TABLE speech_experiments (
experiment_id UUID PRIMARY KEY,
name VARCHAR(255),
task VARCHAR(50), -- 'asr', 'tts', 'diarization', 'kws'
description TEXT,
created_at TIMESTAMP,
user_id VARCHAR(255)
);
CREATE TABLE speech_runs (
run_id UUID PRIMARY KEY,
experiment_id UUID REFERENCES speech_experiments(experiment_id),
name VARCHAR(255),
status VARCHAR(50), -- 'running', 'completed', 'failed'
start_time TIMESTAMP,
end_time TIMESTAMP,
user_id VARCHAR(255),
-- Model config
model_type VARCHAR(100), -- 'conformer', 'transformer', 'rnn-t'
num_params BIGINT,
encoder_layers INT,
decoder_layers INT,
-- Data config
train_dataset VARCHAR(255),
train_hours FLOAT,
languages TEXT[], -- array of language codes
domains TEXT[], -- ['broadcast', 'conversational']
sample_rate INT, -- 16000, 8000, etc.
-- Feature config
feature_type VARCHAR(50), -- 'log-mel', 'mfcc', 'fbank'
n_mels INT,
hop_length INT,
win_length INT,
-- Augmentation config
augmentation_policy JSONB, -- SpecAugment, noise, speed
-- Training config
optimizer VARCHAR(50),
learning_rate FLOAT,
batch_size INT,
epochs INT,
-- Environment
git_commit VARCHAR(40),
docker_image VARCHAR(255),
num_gpus INT
);
CREATE TABLE speech_metrics (
run_id UUID REFERENCES speech_runs(run_id),
test_set VARCHAR(255), -- 'librispeech-test-clean', 'common-voice-en'
metric VARCHAR(50), -- 'wer', 'cer', 'rtf', 'latency'
value FLOAT,
speaker_id VARCHAR(255) NOT NULL DEFAULT '',   -- '' for aggregate metrics (PRIMARY KEY columns cannot be NULL)
utterance_id VARCHAR(255) NOT NULL DEFAULT '', -- '' for aggregate metrics, ID for per-utterance metrics
timestamp TIMESTAMP,
PRIMARY KEY (run_id, test_set, metric, speaker_id, utterance_id)
);
CREATE TABLE speech_artifacts (
artifact_id UUID PRIMARY KEY,
run_id UUID REFERENCES speech_runs(run_id),
type VARCHAR(50), -- 'model', 'checkpoint', 'attention_plot', 'decoded_text'
path VARCHAR(1024),
size_bytes BIGINT,
content_hash VARCHAR(64),
storage_uri TEXT,
epoch INT, -- optional, for checkpoints
step INT, -- optional, for checkpoints
created_at TIMESTAMP
);
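As a quick sanity check that this schema supports the structured queries we care about, the sketch below finds the best aggregate WER per language and test set. It assumes Postgres accessed via psycopg2 (any client works), a placeholder connection string, and the empty-string speaker_id/utterance_id convention noted in speech_metrics for aggregate rows.
import psycopg2  # assumption: any Postgres client works; psycopg2 shown here

BEST_WER_PER_LANGUAGE = """
SELECT lang, test_set, MIN(value) AS best_wer
FROM (
    SELECT unnest(r.languages) AS lang, m.test_set, m.value
    FROM speech_runs r
    JOIN speech_metrics m ON m.run_id = r.run_id
    WHERE m.metric = 'wer'
      AND m.speaker_id = ''     -- aggregate rows only
      AND m.utterance_id = ''
) AS per_language
GROUP BY lang, test_set
ORDER BY lang, best_wer;
"""

with psycopg2.connect("dbname=speech_tracking") as conn:  # placeholder connection string
    with conn.cursor() as cur:
        cur.execute(BEST_WER_PER_LANGUAGE)
        for lang, test_set, best_wer in cur.fetchall():
            print(f"{lang:5s} {test_set:30s} {best_wer:.2%}")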
2. Python SDK Integration (ESPnet Example)
import speech_experiment_tracker as spt  # aliasing as "set" would shadow the Python built-in
# Initialize client
client = spt.Client(api_url="https://speech-tracking.example.com", api_key="...")
# Create experiment
experiment = client.create_experiment(
name="Conformer ASR - LibriSpeech + Common Voice",
task="asr",
description="Multi-dataset training with SpecAugment"
)
# Start a run
run = experiment.start_run(
name="conformer_12layers_specaug",
tags={"language": "en", "domain": "read_speech"}
)
# Log model and data configs
run.log_config({
"model_type": "conformer",
"encoder_layers": 12,
"decoder_layers": 6,
"num_params": 120_000_000,
"train_dataset": "librispeech-960h + common-voice-en",
"train_hours": 1200,
"languages": ["en"],
"sample_rate": 16000,
"feature_type": "log-mel",
"n_mels": 80,
"hop_length": 160,
"augmentation_policy": {
"spec_augment": True,
"time_mask": 30,
"freq_mask": 13,
"speed_perturb": [0.9, 1.0, 1.1]
},
"optimizer": "adam",
"learning_rate": 0.001,
"batch_size": 32,
"epochs": 100
})
# Training loop
for epoch in range(100):
train_loss = train_one_epoch(model, train_loader)
val_wer = evaluate(model, val_loader)
# Log training metrics
run.log_metrics({
"train_loss": train_loss,
"val_wer": val_wer
}, step=epoch)
# Save checkpoint
if epoch % 10 == 0:
checkpoint_path = f"checkpoints/epoch{epoch}.pt"
save_checkpoint(model, optimizer, checkpoint_path)
run.log_artifact(checkpoint_path, type="checkpoint", epoch=epoch)
# Final evaluation on multiple test sets
test_sets = [
"librispeech-test-clean",
"librispeech-test-other",
"common-voice-en-test"
]
for test_set in test_sets:
wer, cer = evaluate_on_test_set(model, test_set)
run.log_metrics({
f"{test_set}_wer": wer,
f"{test_set}_cer": cer
})
# Save final model
run.log_artifact("final_model.pt", type="model")
# Mark run as complete
run.finish()
3. Multi-Test-Set Evaluation Tracking
A key speech-specific need: evaluate on multiple test sets and track per-test-set metrics.
def evaluate_and_log(model, test_sets, run):
"""Evaluate model on multiple test sets and log detailed metrics."""
results = {}
for test_set in test_sets:
loader = get_test_loader(test_set)
total_words = 0
total_errors = 0
utterance_results = []
for batch in loader:
hyps = model.decode(batch['audio'])
refs = batch['text']
for hyp, ref, utt_id in zip(hyps, refs, batch['utterance_ids']):
errors, words = compute_wer_details(hyp, ref)
total_errors += errors
total_words += words
utterance_results.append({
"utterance_id": utt_id,
"wer": errors / max(1, words),
"hyp": hyp,
"ref": ref
})
wer = total_errors / max(1, total_words)
results[test_set] = {
"wer": wer,
"utterances": utterance_results
}
# Log aggregate metric
run.log_metric(f"{test_set}_wer", wer)
# Log per-utterance results as artifact
run.log_json(f"results/{test_set}_utterances.json", utterance_results)
return results
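The snippets above assume a compute_wer_details(hyp, ref) helper returning (errors, reference_words). A minimal word-level Levenshtein sketch is below; in practice you would usually rely on a toolkit scorer (e.g., Kaldi's compute-wer or the jiwer package), which also handles text normalization.
def compute_wer_details(hyp: str, ref: str):
    """Return (edit_errors, num_reference_words) for one utterance.

    Word-level Levenshtein distance: substitutions + insertions + deletions.
    """
    hyp_words = hyp.split()
    ref_words = ref.split()
    # dp[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dp[i][0] = i
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub_cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,             # deletion (missed a reference word)
                dp[i][j - 1] + 1,             # insertion (extra hypothesis word)
                dp[i - 1][j - 1] + sub_cost,  # match or substitution
            )
    return dp[len(ref_words)][len(hyp_words)], len(ref_words)

# Example: "see you later" vs reference "see you all later" -> 1 deletion over 4 reference words
errors, words = compute_wer_details("see you later", "see you all later")
assert (errors, words) == (1, 4)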
4. Data Versioning for Speech
Track dataset versions by hashing audio file lists + preprocessing configs:
import glob
import hashlib
import json
def compute_dataset_hash(audio_file_list, preprocessing_config):
"""
Compute a deterministic hash for a dataset.
Args:
audio_file_list: List of audio file paths
preprocessing_config: Dict with sample_rate, n_mels, etc.
"""
# Sort file list for determinism
sorted_files = sorted(audio_file_list)
# Combine file list + config
content = {
"files": sorted_files,
"preprocessing": preprocessing_config
}
# Compute hash
content_str = json.dumps(content, sort_keys=True)
dataset_hash = hashlib.sha256(content_str.encode()).hexdigest()
return dataset_hash
# Usage
train_files = glob.glob("/data/librispeech/train-960h/**/*.flac", recursive=True)
preprocessing_config = {
"sample_rate": 16000,
"n_mels": 80,
"hop_length": 160,
"win_length": 400
}
dataset_hash = compute_dataset_hash(train_files, preprocessing_config)
run.log_config({
"train_dataset_hash": dataset_hash,
"train_dataset_files": len(train_files),
"preprocessing": preprocessing_config
})
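One caveat: the hash above changes when the file list or the preprocessing config changes, but not when a file is rewritten in place. If that can happen in your data lake, a stronger (and slower) variant also fingerprints the audio bytes, sketched below as a hypothetical companion to compute_dataset_hash.
import hashlib
import json

def compute_dataset_hash_with_contents(audio_file_list, preprocessing_config, chunk_size=1 << 20):
    """Like compute_dataset_hash, but also hashes each file's bytes (slower; cache the result)."""
    digest = hashlib.sha256()
    digest.update(json.dumps(preprocessing_config, sort_keys=True).encode())
    for path in sorted(audio_file_list):
        digest.update(path.encode())
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):  # stream, don't load whole files
                digest.update(chunk)
    return digest.hexdigest()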
5. Decoding Hyperparameter Sweeps
ASR decoding often involves sweeping beam width and language model weight:
def decode_sweep(model, test_set, run):
"""
Sweep decoding hyperparameters and log results.
"""
beam_widths = [1, 5, 10, 20]
lm_weights = [0.0, 0.3, 0.5, 0.7, 1.0]
results = []
for beam in beam_widths:
for lm_weight in lm_weights:
wer = evaluate_with_decoding_params(
model, test_set, beam_width=beam, lm_weight=lm_weight
)
results.append({
"beam_width": beam,
"lm_weight": lm_weight,
"wer": wer
})
# Log each config
run.log_metrics({
f"wer_beam{beam}_lm{lm_weight}": wer
})
# Find best config
best = min(results, key=lambda x: x['wer'])
run.log_config({"best_decoding_config": best})
# Log full sweep results as artifact
run.log_json("decoding_sweep.json", results)
Scaling Strategies
1. Efficient Audio Data Handling
Challenge: Copying audio files into each experiment is expensive and redundant.
Solution:
- Store audio in a centralized data lake (e.g., S3, HDFS).
- Track by reference: Experiments store paths/URIs, not copies.
- Use hashing for deduplication: If multiple experiments use the same dataset, hash it once.
# Don't do this:
run.log_artifact("train_audio.tar.gz") # 100 GB upload per experiment!
# Do this:
run.log_config({
"train_audio_path": "s3://speech-data/librispeech/train-960h/",
"train_audio_hash": "sha256:abc123..."
})
2. Checkpoint Deduplication
Challenge: Model checkpoints can be several gigabytes each, so saving every checkpoint is expensive.
Solution:
- Content-based deduplication: Hash checkpoint files (see the sketch after this list).
- Incremental checkpoints: Store only parameter diffs if possible.
- Tiered storage: Recent checkpoints on fast storage, old checkpoints on Glacier.
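A minimal sketch of the content-based deduplication idea: hash the checkpoint bytes and skip the upload if the artifact store already holds that content hash. The artifact_store.exists/upload calls are hypothetical placeholders for whatever storage client you use.
import hashlib

def upload_checkpoint_deduplicated(path, artifact_store, chunk_size=1 << 20):
    """Upload a checkpoint only if its content hash is not already stored."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    content_hash = digest.hexdigest()
    if artifact_store.exists(content_hash):        # hypothetical API
        return content_hash                        # reuse the existing object
    artifact_store.upload(path, key=content_hash)  # hypothetical API
    return content_hash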
3. Distributed Evaluation
For large-scale evaluation (100+ test sets, 10+ languages):
- Use a distributed evaluation service (Ray, Spark).
- Parallelize across test sets and languages.
- Aggregate results and log back to experiment tracker.
import ray
@ray.remote
def evaluate_test_set(model_path, test_set):
model = load_model(model_path)
wer = evaluate(model, test_set)
return test_set, wer
# Distribute evaluation
test_sets = ["test_clean", "test_other", "cv_en", "cv_es", ...]
futures = [evaluate_test_set.remote(model_path, ts) for ts in test_sets]
results = ray.get(futures)
# Log all results
for test_set, wer in results:
run.log_metric(f"{test_set}_wer", wer)
Monitoring & Observability
Key Metrics
System metrics:
- Request latency (API, artifact upload/download)
- Storage usage (models, audio references, metadata)
- Error rates (failed experiments, upload failures)
User metrics:
- Active experiments and runs
- Average experiment duration
- Most common model architectures, datasets, languages
- WER distribution across runs
Dashboards:
- Experiment dashboard (running/completed/failed, recent results)
- System health (latency, storage, errors)
- Cost dashboard (storage, compute, data transfer)
Alerts
- Experiment failed after >12 hours of training
- Storage usage >90% capacity
- API error rate >1%
- WER degradation on key test sets
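The last alert above (WER degradation on key test sets) can be a simple scheduled job against the metrics store: compare the newest aggregate WER per key test set with a rolling baseline and page when the relative regression exceeds a threshold. A sketch, where fetch_latest_wer, fetch_baseline_wer, and send_alert are hypothetical helpers:
KEY_TEST_SETS = ["librispeech-test-clean", "librispeech-test-other"]
MAX_RELATIVE_REGRESSION = 0.05  # alert if WER worsens by more than 5% relative

def check_wer_regressions():
    for test_set in KEY_TEST_SETS:
        latest = fetch_latest_wer(test_set)      # hypothetical: newest completed run
        baseline = fetch_baseline_wer(test_set)  # hypothetical: e.g., best WER over the last 30 days
        if baseline and latest > baseline * (1 + MAX_RELATIVE_REGRESSION):
            send_alert(                          # hypothetical: Slack/PagerDuty/etc.
                f"WER regression on {test_set}: {latest:.2%} vs baseline {baseline:.2%}"
            )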
Failure Modes & Mitigations
| Failure Mode | Impact | Mitigation |
|---|---|---|
| Training crash mid-run | Lost progress, wasted compute | Robust checkpointing, auto-resume |
| Artifact upload failure | Model/checkpoint not saved | Retry with exponential backoff, local backup |
| Dataset hash collision | Wrong dataset used | Use strong hash (SHA-256), validate file count |
| Metric not logged | Incomplete evaluation | Client-side buffering, fail-safe logging |
| Evaluation script bug | Wrong WER reported | Unit tests for evaluation, log decoded outputs |
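For the "metric not logged" row, a common mitigation is a client-side logger that spools metrics to local disk first and flushes to the tracking backend with exponential backoff, so a flaky tracker never crashes a multi-day training job. A sketch (client.post is a hypothetical SDK call):
import json
import time

class FailSafeMetricLogger:
    """Buffer metrics locally; flush with retries so training never blocks on the tracker."""

    def __init__(self, client, spool_path="metrics_spool.jsonl", max_retries=5):
        self.client = client
        self.spool_path = spool_path
        self.max_retries = max_retries
        self.buffer = []

    def log(self, metrics, step=None):
        record = {"metrics": metrics, "step": step, "ts": time.time()}
        self.buffer.append(record)
        # Persist locally first, so nothing is lost if the process dies.
        with open(self.spool_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def flush(self):
        for attempt in range(self.max_retries):
            try:
                self.client.post("/metrics/batch", self.buffer)  # hypothetical API
                self.buffer.clear()
                return True
            except Exception:
                time.sleep(2 ** attempt)  # exponential backoff
        return False  # keep the spool file for later replay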
Real-World Case Study: Multi-Lingual ASR Team
Scenario:
- Team of 20 researchers
- 10+ languages (en, zh, es, fr, de, ja, ko, ar, hi, ru)
- 100K+ hours of audio data
- 1000+ experiments over 2 years
Architecture:
- Metadata: PostgreSQL with indexes on language, domain, test_set
- Metrics: InfluxDB for fast time-series queries
- Artifacts: S3 with lifecycle policies (checkpoints >6 months → Glacier)
- Audio data: Centralized S3 bucket, organized by language/domain
- API: Kubernetes cluster with auto-scaling
Key optimizations:
- Audio stored by reference (saved ~10 TB of redundant uploads)
- Checkpoint deduplication (saved ~30% storage)
- Distributed evaluation (Ray cluster, 10x speedup on multi-test-set evaluation)
Outcomes:
- 99.9% uptime
- Median query latency: 150ms
- Complete audit trail for model releases
- Reproducibility: 95% of experiments re-runnable from metadata
Cost Analysis
Example: Medium-Sized Speech Team
Assumptions:
- 10 researchers
- 50 experiments/month, 500 runs/month
- Average run: 5 GB model, 10K metrics, 10 test sets
- Audio data: 50K hours, stored centrally (not per-experiment)
- Retention: 2 years
| Component | Cost/Month |
|---|---|
| Metadata DB (PostgreSQL RDS) | $300 |
| Metrics DB (InfluxDB) | $400 |
| Model storage (S3, 5 TB) | $115 |
| Audio data storage (S3, 50 TB, reference-only) | $1,150 |
| API compute (Kubernetes) | $600 |
| Data transfer | $150 |
| Total | $2,715 |
Cost savings from best practices:
- Audio by reference (vs copying per-experiment): -$50K/year
- Checkpoint deduplication: -$500/month
- Tiered storage (Glacier for old artifacts): -$200/month
Advanced Topics
1. Integration with ESPnet
ESPnet is a popular speech toolkit. Integrate experiment tracking:
# In ESPnet training script
import speech_experiment_tracker as spt  # avoid shadowing the Python built-in set()
# Initialize tracker
client = spt.Client(...)
run = client.start_run(...)
# Log ESPnet config
run.log_config(vars(args)) # args from argparse
# Hook into ESPnet trainer
class TrackerCallback:
def on_epoch_end(self, epoch, metrics):
run.log_metrics(metrics, step=epoch)
def on_checkpoint_save(self, epoch, checkpoint_path):
run.log_artifact(checkpoint_path, type="checkpoint", epoch=epoch)
trainer.add_callback(TrackerCallback())  # illustrative hook; adapt to the callback/plugin mechanism your trainer exposes
2. Attention Map Visualization
For Transformer-based ASR, log and visualize attention maps:
import matplotlib.pyplot as plt
def plot_attention(attention_weights, hyp_tokens, ref_tokens):
"""Plot attention matrix."""
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(attention_weights, cmap='Blues')
ax.set_xticks(range(len(ref_tokens)))
ax.set_xticklabels(ref_tokens, rotation=90)
ax.set_yticks(range(len(hyp_tokens)))
ax.set_yticklabels(hyp_tokens)
ax.set_xlabel("Reference Tokens")
ax.set_ylabel("Hypothesis Tokens")
return fig
# During evaluation
attention_weights = model.get_attention_weights(audio)
fig = plot_attention(attention_weights, hyp_tokens, ref_tokens)
run.log_figure("attention_plot.png", fig)
3. Speaker-Level Analysis
Track per-speaker WER to identify performance gaps:
import numpy as np

def speaker_level_analysis(model, test_set, run):
    """Compute and log per-speaker WER."""
    speaker_stats = {}
    test_loader = get_test_loader(test_set)  # same helper as in the multi-test-set evaluation example
    for batch in test_loader:
hyps = model.decode(batch['audio'])
refs = batch['text']
speakers = batch['speaker_ids']
for hyp, ref, speaker in zip(hyps, refs, speakers):
errors, words = compute_wer_details(hyp, ref)
if speaker not in speaker_stats:
speaker_stats[speaker] = {"errors": 0, "words": 0}
speaker_stats[speaker]["errors"] += errors
speaker_stats[speaker]["words"] += words
# Compute per-speaker WER
for speaker, stats in speaker_stats.items():
wer = stats["errors"] / max(1, stats["words"])
run.log_metric(f"speaker_{speaker}_wer", wer)
# Log summary statistics
wers = [stats["errors"] / max(1, stats["words"]) for stats in speaker_stats.values()]
run.log_metrics({
"avg_speaker_wer": np.mean(wers),
"median_speaker_wer": np.median(wers),
"worst_speaker_wer": np.max(wers),
"best_speaker_wer": np.min(wers)
})
Practical Operations Checklist
For Speech Researchers
- Always log data config: Dataset, hours, languages, speaker count.
- Track augmentation policies: SpecAugment, noise, speed perturbation.
- Evaluate on multiple test sets: Clean, noisy, accented, domain-specific.
- Log decoded outputs: For error analysis and debugging.
- Track decoding hyperparameters: Beam width, LM weight.
- Use descriptive run names: prefer conformer_12l_960h_specaug over run_42.
For Platform Engineers
- Monitor storage growth: Audio and models can grow quickly.
- Set up tiered storage: Move old checkpoints to Glacier.
- Implement checkpoint cleanup: Delete intermediate checkpoints after the final model is saved (a sketch follows this checklist).
- Monitor evaluation queue: Distributed eval should not be a bottleneck.
- Test disaster recovery: Can you restore experiments from backups?
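A sketch of the checkpoint-cleanup policy from the checklist above: keep the best-k checkpoints by validation WER plus the most recent one, and delete the rest. Each checkpoint record is assumed to be a dict carrying its path, epoch, and validation WER.
import os

def cleanup_checkpoints(checkpoints, keep_best=3):
    """Keep the best-`keep_best` checkpoints by val WER plus the most recent; delete the rest.

    `checkpoints` is a list of dicts: {"path": ..., "epoch": ..., "val_wer": ...}.
    """
    if not checkpoints:
        return
    by_wer = sorted(checkpoints, key=lambda c: c["val_wer"])[:keep_best]
    most_recent = max(checkpoints, key=lambda c: c["epoch"])
    keep_paths = {c["path"] for c in by_wer} | {most_recent["path"]}
    for ckpt in checkpoints:
        if ckpt["path"] not in keep_paths:
            os.remove(ckpt["path"])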
Key Takeaways
✅ Speech experiment management requires domain-specific extensions (audio data, WER/CER, multi-test-set evaluation).
✅ Store audio by reference, not by copying—saves massive storage costs.
✅ Track data versions (hashes, preprocessing configs) for reproducibility.
✅ Multi-dimensional evaluation (language, domain, noise condition) is critical for speech.
✅ Checkpoint deduplication and tiered storage reduce costs significantly.
✅ Systematic iteration through experiment spaces (model × data × hyperparameters) mirrors structured traversal patterns like Spiral Matrix.
✅ Integration with speech toolkits (ESPnet, Kaldi, SpeechBrain) is key for adoption.
Connection to Thematic Link: Systematic Iteration and State Tracking
All three Day 19 topics converge on systematic, stateful exploration:
DSA (Spiral Matrix):
- Layer-by-layer traversal with boundary tracking
- Explicit state management (top, bottom, left, right)
- Resume/pause friendly
ML System Design (Experiment Tracking Systems):
- Systematic exploration of hyperparameter/architecture spaces
- Track state of experiments (running, completed, failed)
- Resume from checkpoints, recover from failures
Speech Tech (Speech Experiment Management):
- Organize speech experiments across model × data × hyperparameters × decoding dimensions
- Track training state (checkpoints, metrics, evaluation results)
- Enable reproducibility and multi-test-set comparison
The unifying pattern: structured, stateful iteration through complex, multi-dimensional spaces with clear persistence and recoverability.
Originally published at: arunbaby.com/speech-tech/0019-speech-experiment-management