Audio Feature Extraction for Speech ML
How to transform raw audio waveforms into ML-ready features that capture speech characteristics for robust model training.
Introduction
Raw audio waveforms are high-dimensional, noisy, and difficult for ML models to learn from directly. Feature extraction transforms audio into compact, informative representations that:
- Capture important speech characteristics
- Reduce dimensionality (16kHz audio = 16,000 samples/sec → ~40 features per 10 ms frame)
- Provide invariance to irrelevant variations (volume, recording device)
- Enable efficient model training
Why it matters:
- Improves accuracy: Good features → better models
- Reduces compute: Lower dimensionality = faster training/inference
- Enables transfer learning: Pre-extracted features work across tasks
- Production efficiency: Feature extraction can be cached
What you’ll learn:
- Core audio features (MFCCs, spectrograms, mel-scale)
- Time-domain vs frequency-domain features
- Production-grade extraction pipelines
- Optimization for real-time processing
- Feature engineering for speech tasks
Problem Definition
Design a feature extraction pipeline for speech ML systems.
Functional Requirements
- Feature Types
- Time-domain features (energy, zero-crossing rate)
- Frequency-domain features (spectrograms, MFCCs)
- Temporal features (deltas, delta-deltas)
- Learned features (embeddings)
- Input Handling
- Support multiple sample rates (8kHz, 16kHz, 48kHz)
- Handle variable-length audio
- Process both mono and stereo
- Support batch processing
- Output Format
- Fixed-size feature vectors
- Variable-length sequences
- 2D/3D tensors for neural networks
Non-Functional Requirements
- Performance
- Real-time: Extract features in < 10 ms per 1 second of audio
- Batch: Process 10K files/hour on a single machine
- Memory: < 100MB RAM for streaming
- Quality
- Robust to noise
- Consistent across devices
- Reproducible (deterministic)
- Flexibility
- Configurable parameters
- Support multiple backends (librosa, torchaudio)
- Easy to extend with new features
Audio Basics
Waveform Representation
import numpy as np
import librosa
import matplotlib.pyplot as plt
# Load audio
audio, sr = librosa.load('speech.wav', sr=16000)
print(f"Sample rate: {sr} Hz")
print(f"Duration: {len(audio) / sr:.2f} seconds")
print(f"Shape: {audio.shape}")
print(f"Range: [{audio.min():.3f}, {audio.max():.3f}]")
# Visualize waveform
plt.figure(figsize=(12, 4))
time = np.arange(len(audio)) / sr
plt.plot(time, audio)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Audio Waveform')
plt.show()
Key properties:
- Sample rate (sr): Samples per second (e.g., 16000 Hz = 16000 samples/sec)
- Duration: len(audio) / sr seconds
- Amplitude: Typically normalized to [-1, 1]
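To satisfy the input-handling requirements above (multiple sample rates, mono and stereo), it helps to normalize every file to one target format at load time. Below is a minimal sketch using librosa; the 16 kHz mono target and the load_normalized helper name are illustrative choices for this post, not a fixed standard.
def load_normalized(path, target_sr=16000):
    """Load any audio file as mono at target_sr (librosa resamples and downmixes automatically)."""
    audio, sr = librosa.load(path, sr=target_sr, mono=True)
    return audio, sr

audio, sr = load_normalized('speech.wav')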
Feature 1: Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs are the most widely used features in speech recognition.
Why MFCCs?
- Mimic human hearing: Use the mel scale, a perceptual frequency scale (see the formula sketched below)
- Compact: Represent spectral envelope with 13-40 coefficients
- Robust: Less sensitive to pitch variations
- Proven: Gold standard for ASR for decades
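The mel scale referenced above maps physical frequency (Hz) onto a perceptual scale where equal steps sound roughly equally spaced. A minimal sketch of the widely used HTK formula:
import numpy as np

def hz_to_mel(f_hz):
    # Mel scale (HTK convention): roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000))  # ≈ 1000 mel
Because the scale is compressive at high frequencies, mel filterbanks allocate more bands to the low frequencies where speech carries most of its information.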
How MFCCs Work
Audio Waveform
↓
1. Pre-emphasis (boost high frequencies)
↓
2. Frame the signal (25ms windows, 10ms hop)
↓
3. Apply window function (Hamming)
↓
4. FFT (Fast Fourier Transform)
↓
5. Mel filterbank (map to mel scale)
↓
6. Log (compress dynamic range)
↓
7. DCT (Discrete Cosine Transform)
↓
MFCCs (13-40 coefficients per frame)
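For intuition, the same pipeline can be written out step by step. This is a rough sketch built from numpy/scipy/librosa primitives; librosa.feature.mfcc, used in the implementation below, wraps equivalent steps internally, and the 0.97 pre-emphasis coefficient and 13 output coefficients are conventional illustrative choices.
import numpy as np
import scipy.fftpack
import librosa

def mfcc_step_by_step(audio, sr=16000, n_fft=512, hop_length=160,
                      win_length=400, n_mels=40, n_mfcc=13):
    # 1. Pre-emphasis: boost high frequencies
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # 2-4. Frame (25 ms window, 10 ms hop), apply Hamming window, FFT -> power spectrum
    stft = librosa.stft(emphasized, n_fft=n_fft, hop_length=hop_length,
                        win_length=win_length, window='hamming')
    power_spec = np.abs(stft) ** 2
    # 5. Mel filterbank: map linear frequency bins to mel bands
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_basis @ power_spec
    # 6. Log: compress dynamic range
    log_mel = np.log(mel_energies + 1e-10)
    # 7. DCT: decorrelate and keep the first n_mfcc coefficients (spectral envelope)
    return scipy.fftpack.dct(log_mel, type=2, axis=0, norm='ortho')[:n_mfcc]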
Implementation
import librosa
import numpy as np
class MFCCExtractor:
"""
Extract MFCC features from audio
Standard configuration for speech recognition
"""
def __init__(
self,
sr=16000,
n_mfcc=40,
n_fft=512,
hop_length=160, # 10ms at 16kHz
n_mels=40,
fmin=20,
fmax=8000
):
self.sr = sr
self.n_mfcc = n_mfcc
self.n_fft = n_fft
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract MFCCs
Args:
audio: Audio waveform (1D array)
Returns:
MFCCs: (n_mfcc, time_steps)
"""
# Extract MFCCs
mfccs = librosa.feature.mfcc(
y=audio,
sr=self.sr,
n_mfcc=self.n_mfcc,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax
)
return mfccs # Shape: (n_mfcc, time)
def extract_with_deltas(self, audio: np.ndarray) -> np.ndarray:
"""
Extract MFCCs + deltas + delta-deltas
Deltas capture temporal dynamics
Returns:
Features: (n_mfcc * 3, time_steps)
"""
# MFCCs
mfccs = self.extract(audio)
# Delta (first derivative)
delta = librosa.feature.delta(mfccs)
# Delta-delta (second derivative)
delta2 = librosa.feature.delta(mfccs, order=2)
# Stack
features = np.vstack([mfccs, delta, delta2]) # (120, time)
return features
# Usage
extractor = MFCCExtractor()
mfccs = extractor.extract(audio)
print(f"MFCCs shape: {mfccs.shape}") # (40, time_steps)
# With deltas
features = extractor.extract_with_deltas(audio)
print(f"MFCCs+deltas shape: {features.shape}") # (120, time_steps)
Visualizing MFCCs
import matplotlib.pyplot as plt
def plot_mfccs(mfccs, sr, hop_length):
"""Visualize MFCC features"""
plt.figure(figsize=(12, 6))
# Convert frame indices to time
times = librosa.frames_to_time(
np.arange(mfccs.shape[1]),
sr=sr,
hop_length=hop_length
)
plt.imshow(
mfccs,
aspect='auto',
origin='lower',
extent=[times[0], times[-1], 0, mfccs.shape[0]],
cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.xlabel('Time (s)')
plt.ylabel('MFCC Coefficient')
plt.title('MFCC Features')
plt.tight_layout()
plt.show()
plot_mfccs(mfccs, sr=16000, hop_length=160)
Feature 2: Mel-Spectrograms
Mel-spectrograms preserve more spectral detail than MFCCs.
What is a Spectrogram?
A spectrogram shows how the frequency content of a signal changes over time.
- X-axis: Time
- Y-axis: Frequency
- Color: Magnitude (energy)
Mel-Spectrogram vs MFCC
Aspect | Mel-Spectrogram | MFCC |
---|---|---|
Dimensions | (n_mels, time) | (n_mfcc, time) |
Information | Full spectrum | Spectral envelope |
Size | 40-128 bins | 13-40 coefficients |
Use case | CNNs, deep learning | Traditional ASR |
Spectral detail | Full mel resolution | Envelope only (DCT discards fine detail) |
Implementation
class MelSpectrogramExtractor:
"""
Extract log mel-spectrogram features
Popular for deep learning models (CNNs, Transformers)
"""
def __init__(
self,
sr=16000,
n_fft=512,
hop_length=160,
n_mels=80,
fmin=0,
fmax=8000
):
self.sr = sr
self.n_fft = n_fft
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract log mel-spectrogram
Returns:
Log mel-spectrogram: (n_mels, time_steps)
"""
# Compute mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax
)
# Convert to log scale (dB)
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel # Shape: (n_mels, time)
def extract_normalized(self, audio: np.ndarray) -> np.ndarray:
"""
Extract and normalize to [0, 1]
Better for neural networks
"""
log_mel = self.extract(audio)
# Normalize to [0, 1]
log_mel_norm = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
return log_mel_norm
# Usage
mel_extractor = MelSpectrogramExtractor(n_mels=80)
mel_spec = mel_extractor.extract(audio)
print(f"Mel-spectrogram shape: {mel_spec.shape}") # (80, time_steps)
Visualizing Mel-Spectrogram
def plot_mel_spectrogram(mel_spec, sr, hop_length):
"""Visualize mel-spectrogram"""
plt.figure(figsize=(12, 6))
librosa.display.specshow(
mel_spec,
sr=sr,
hop_length=hop_length,
x_axis='time',
y_axis='mel',
cmap='viridis'
)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-Spectrogram')
plt.tight_layout()
plt.show()
plot_mel_spectrogram(mel_spec, sr=16000, hop_length=160)
Feature 3: Raw Spectrograms (STFT)
The Short-Time Fourier Transform (STFT) provides full linear-frequency resolution, before any mel-scale compression.
Implementation
class STFTExtractor:
"""
Extract raw STFT features
Used when you need full frequency resolution
"""
def __init__(
self,
n_fft=512,
hop_length=160,
win_length=400
):
self.n_fft = n_fft
self.hop_length = hop_length
self.win_length = win_length
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract magnitude spectrogram
Returns:
Spectrogram: (n_fft//2 + 1, time_steps)
"""
# Compute STFT
stft = librosa.stft(
audio,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=self.win_length
)
# Get magnitude
magnitude = np.abs(stft)
# Convert to dB
magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)
return magnitude_db # Shape: (n_fft//2 + 1, time)
def extract_with_phase(self, audio: np.ndarray):
"""
Extract magnitude and phase
Phase information useful for reconstruction
"""
stft = librosa.stft(
audio,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=self.win_length
)
magnitude = np.abs(stft)
phase = np.angle(stft)
return magnitude, phase
# Usage
stft_extractor = STFTExtractor()
spectrogram = stft_extractor.extract(audio)
print(f"Spectrogram shape: {spectrogram.shape}") # (257, time_steps)
Feature 4: Time-Domain Features
Simple but effective features computed directly from waveform.
Implementation
class TimeDomainExtractor:
"""
Extract time-domain features
Fast to compute, useful for simple tasks
"""
def extract_energy(self, audio: np.ndarray, frame_length=400, hop_length=160):
"""
Frame-wise energy (RMS)
Captures loudness/volume over time
"""
energy = librosa.feature.rms(
y=audio,
frame_length=frame_length,
hop_length=hop_length
)[0]
return energy
def extract_zero_crossing_rate(self, audio: np.ndarray, frame_length=400, hop_length=160):
"""
Zero-crossing rate
Measures how often signal crosses zero
High ZCR → noisy/unvoiced
Low ZCR → tonal/voiced
"""
zcr = librosa.feature.zero_crossing_rate(
audio,
frame_length=frame_length,
hop_length=hop_length
)[0]
return zcr
def extract_all(self, audio: np.ndarray):
"""Extract all time-domain features"""
energy = self.extract_energy(audio)
zcr = self.extract_zero_crossing_rate(audio)
# Stack features
features = np.vstack([energy, zcr]) # (2, time)
return features
# Usage
time_extractor = TimeDomainExtractor()
time_features = time_extractor.extract_all(audio)
print(f"Time-domain features shape: {time_features.shape}") # (2, time_steps)
Feature 5: Pitch & Formants
Pitch and formants are linguistically important speech features; this section focuses on pitch extraction.
Pitch Extraction
class PitchExtractor:
"""
Extract fundamental frequency (F0)
Important for:
- Speaker recognition
- Emotion detection
- Prosody modeling
"""
def __init__(self, sr=16000, fmin=80, fmax=400):
self.sr = sr
self.fmin = fmin # Lower bound of typical speech F0 (low male voices)
self.fmax = fmax # Upper bound of typical speech F0 (high female/child voices)
def extract_f0(self, audio: np.ndarray, hop_length=160):
"""
Extract pitch (fundamental frequency)
Returns:
f0: Pitch values (Hz) per frame
voiced_flag: Boolean array (voiced vs unvoiced)
"""
# Extract pitch using probabilistic YIN (pYIN), which also estimates voicing
f0, voiced_flag, _ = librosa.pyin(
audio,
fmin=self.fmin,
fmax=self.fmax,
sr=self.sr,
hop_length=hop_length
)
# Unvoiced frames come back as NaN; replace with 0 for downstream code
f0 = np.nan_to_num(f0)
return f0, voiced_flag
def extract_pitch_features(self, audio: np.ndarray):
"""
Extract pitch statistics
Useful for speaker/emotion recognition
"""
f0, voiced = self.extract_f0(audio)
# Statistics on voiced frames
voiced_f0 = f0[voiced]
if len(voiced_f0) > 0:
features = {
'mean_pitch': np.mean(voiced_f0),
'std_pitch': np.std(voiced_f0),
'min_pitch': np.min(voiced_f0),
'max_pitch': np.max(voiced_f0),
'pitch_range': np.max(voiced_f0) - np.min(voiced_f0),
'voiced_ratio': np.sum(voiced) / len(voiced)
}
else:
features = {k: 0.0 for k in ['mean_pitch', 'std_pitch', 'min_pitch', 'max_pitch', 'pitch_range', 'voiced_ratio']}
return features
# Usage
pitch_extractor = PitchExtractor()
f0, voiced = pitch_extractor.extract_f0(audio)
print(f"Pitch shape: {f0.shape}")
pitch_stats = pitch_extractor.extract_pitch_features(audio)
print(f"Pitch statistics: {pitch_stats}")
Production Feature Pipeline
Combine all features into a unified pipeline.
Unified Feature Extractor
from dataclasses import dataclass
from typing import Dict, List, Optional
import json
@dataclass
class FeatureConfig:
"""Configuration for feature extraction"""
sr: int = 16000
feature_types: Optional[List[str]] = None # e.g. ['mfcc', 'mel', 'pitch']
# MFCC config
n_mfcc: int = 40
# Mel-spectrogram config
n_mels: int = 80
# Common config
n_fft: int = 512
hop_length: int = 160 # 10ms
# Normalization
normalize: bool = True
def __post_init__(self):
if self.feature_types is None:
self.feature_types = ['mfcc']
class AudioFeatureExtractor:
"""
Production-grade audio feature extractor
Supports multiple feature types, caching, and batch processing
"""
def __init__(self, config: FeatureConfig):
self.config = config
# Initialize extractors
self.mfcc_extractor = MFCCExtractor(
sr=config.sr,
n_mfcc=config.n_mfcc,
n_fft=config.n_fft,
hop_length=config.hop_length
)
self.mel_extractor = MelSpectrogramExtractor(
sr=config.sr,
n_mels=config.n_mels,
n_fft=config.n_fft,
hop_length=config.hop_length
)
self.pitch_extractor = PitchExtractor(sr=config.sr)
self.time_extractor = TimeDomainExtractor()
def extract(self, audio: np.ndarray) -> Dict[str, np.ndarray]:
"""
Extract features based on config
Args:
audio: Audio waveform
Returns:
Dictionary of features
"""
features = {}
if 'mfcc' in self.config.feature_types:
mfccs = self.mfcc_extractor.extract_with_deltas(audio)
if self.config.normalize:
mfccs = self._normalize(mfccs)
features['mfcc'] = mfccs
if 'mel' in self.config.feature_types:
mel = self.mel_extractor.extract(audio)
if self.config.normalize:
mel = self._normalize(mel)
features['mel'] = mel
if 'pitch' in self.config.feature_types:
f0, voiced = self.pitch_extractor.extract_f0(audio, hop_length=self.config.hop_length)
features['pitch'] = f0
features['voiced'] = voiced.astype(np.float32)
if 'time' in self.config.feature_types:
time_feats = self.time_extractor.extract_all(audio)
if self.config.normalize:
time_feats = self._normalize(time_feats)
features['time'] = time_feats
return features
def _normalize(self, features: np.ndarray) -> np.ndarray:
"""
Normalize features (mean=0, std=1) per coefficient
"""
mean = np.mean(features, axis=1, keepdims=True)
std = np.std(features, axis=1, keepdims=True) + 1e-8
normalized = (features - mean) / std
return normalized
def extract_from_file(self, audio_path: str) -> Dict[str, np.ndarray]:
"""
Extract features from audio file
"""
audio, sr = librosa.load(audio_path, sr=self.config.sr)
return self.extract(audio)
def extract_batch(self, audio_list: List[np.ndarray]) -> List[Dict[str, np.ndarray]]:
"""
Extract features from batch of audio
"""
return [self.extract(audio) for audio in audio_list]
def save_config(self, path: str):
"""Save feature extraction config"""
with open(path, 'w') as f:
json.dump(self.config.__dict__, f, indent=2)
@staticmethod
def load_config(path: str) -> FeatureConfig:
"""Load feature extraction config"""
with open(path, 'r') as f:
config_dict = json.load(f)
return FeatureConfig(**config_dict)
# Usage
config = FeatureConfig(
feature_types=['mfcc', 'mel', 'pitch'],
n_mfcc=40,
n_mels=80,
normalize=True
)
extractor = AudioFeatureExtractor(config)
# Extract features
features = extractor.extract(audio)
print("Extracted features:", features.keys())
for name, feat in features.items():
print(f" {name}: {feat.shape}")
# Save config for reproducibility
extractor.save_config('feature_config.json')
Handling Variable-Length Audio
Audio clips vary in duration, so features must be brought into shapes that ML models can consume.
Strategy 1: Padding/Truncation
class VariableLengthHandler:
"""
Handle variable-length audio
"""
def pad_or_truncate(self, features: np.ndarray, target_length: int) -> np.ndarray:
"""
Pad or truncate features to fixed length
Args:
features: (n_features, time)
target_length: Target time dimension
Returns:
Fixed-length features: (n_features, target_length)
"""
current_length = features.shape[1]
if current_length < target_length:
# Pad with zeros
pad_width = ((0, 0), (0, target_length - current_length))
features = np.pad(features, pad_width, mode='constant')
elif current_length > target_length:
# Truncate (take first target_length frames)
features = features[:, :target_length]
return features
def create_mask(self, features: np.ndarray, target_length: int) -> np.ndarray:
"""
Create attention mask for padded features
Returns:
Mask: (target_length,) - 1 for real frames, 0 for padding
"""
current_length = features.shape[1]
mask = np.zeros(target_length)
mask[:min(current_length, target_length)] = 1
return mask
Strategy 2: Temporal Pooling
class TemporalPooler:
"""
Pool variable-length features to fixed size
"""
def mean_pool(self, features: np.ndarray) -> np.ndarray:
"""
Average pool over time
Args:
features: (n_features, time)
Returns:
Pooled: (n_features,)
"""
return np.mean(features, axis=1)
def max_pool(self, features: np.ndarray) -> np.ndarray:
"""Max pool over time"""
return np.max(features, axis=1)
def stats_pool(self, features: np.ndarray) -> np.ndarray:
"""
Statistical pooling: mean + std
Returns:
Pooled: (n_features * 2,)
"""
mean = np.mean(features, axis=1)
std = np.std(features, axis=1)
return np.concatenate([mean, std])
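A short usage sketch tying the two strategies together; the 500-frame target (about 5 seconds at a 10 ms hop) is an arbitrary choice for illustration:
handler = VariableLengthHandler()
pooler = TemporalPooler()

mfccs = MFCCExtractor().extract(audio)          # (40, time) - variable time
fixed = handler.pad_or_truncate(mfccs, 500)     # (40, 500) for batched models
mask = handler.create_mask(mfccs, 500)          # (500,) attention mask
utterance_vec = pooler.stats_pool(mfccs)        # (80,) fixed-size mean+std vector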
Real-Time Feature Extraction
Streaming applications require incremental feature extraction.
Streaming Feature Extractor
from collections import deque
from typing import Optional
class StreamingFeatureExtractor:
"""
Extract features from streaming audio
Use case: Real-time ASR, voice assistants
"""
def __init__(
self,
sr=16000,
frame_length_ms=25,
hop_length_ms=10,
buffer_duration_ms=500
):
self.sr = sr
self.frame_length = int(sr * frame_length_ms / 1000)
self.hop_length = int(sr * hop_length_ms / 1000)
self.buffer_length = int(sr * buffer_duration_ms / 1000)
# Circular buffer for audio
self.buffer = deque(maxlen=self.buffer_length)
# Feature extractor
self.extractor = MFCCExtractor(
sr=sr,
hop_length=self.hop_length
)
def add_audio_chunk(self, audio_chunk: np.ndarray):
"""
Add new audio chunk to buffer
Args:
audio_chunk: New audio samples
"""
self.buffer.extend(audio_chunk)
def extract_latest(self) -> Optional[np.ndarray]:
"""
Extract features from current buffer
Returns:
Features or None if buffer too small
"""
if len(self.buffer) < self.frame_length:
return None
# Convert buffer to array
audio = np.array(self.buffer)
# Extract features
features = self.extractor.extract(audio)
return features
def reset(self):
"""Clear buffer"""
self.buffer.clear()
# Usage
streaming_extractor = StreamingFeatureExtractor()
# Simulate streaming (100ms chunks)
chunk_size = 1600 # 100ms at 16kHz
for i in range(0, len(audio), chunk_size):
chunk = audio[i:i+chunk_size]
# Add to buffer
streaming_extractor.add_audio_chunk(chunk)
# Extract features
features = streaming_extractor.extract_latest()
if features is not None:
print(f"Chunk {i//chunk_size}: features shape = {features.shape}")
# Process features (send to model, etc.)
Performance Optimization
1. Caching Features
import os
import pickle
import hashlib
class CachedFeatureExtractor:
"""
Cache extracted features to disk
Avoid re-extracting for same audio
"""
def __init__(self, extractor: AudioFeatureExtractor, cache_dir='./feature_cache'):
self.extractor = extractor
self.cache_dir = cache_dir
os.makedirs(cache_dir, exist_ok=True)
def _get_cache_path(self, audio_path: str) -> str:
"""Generate cache file path based on audio path hash"""
path_hash = hashlib.md5(audio_path.encode()).hexdigest()
return os.path.join(self.cache_dir, f"{path_hash}.pkl")
def extract_from_file(self, audio_path: str, use_cache=True) -> Dict[str, np.ndarray]:
"""
Extract features with caching
"""
cache_path = self._get_cache_path(audio_path)
# Check cache
if use_cache and os.path.exists(cache_path):
with open(cache_path, 'rb') as f:
features = pickle.load(f)
return features
# Extract features
features = self.extractor.extract_from_file(audio_path)
# Save to cache
with open(cache_path, 'wb') as f:
pickle.dump(features, f)
return features
2. Parallel Processing
from multiprocessing import Pool
from functools import partial
class ParallelFeatureExtractor:
"""
Extract features from multiple files in parallel
"""
def __init__(self, extractor: AudioFeatureExtractor, n_workers=4):
self.extractor = extractor
self.n_workers = n_workers
def extract_from_files(self, audio_paths: List[str]) -> List[Dict[str, np.ndarray]]:
"""
Extract features from multiple files in parallel
"""
with Pool(self.n_workers) as pool:
features_list = pool.map(
self.extractor.extract_from_file,
audio_paths
)
return features_list
# Usage
parallel_extractor = ParallelFeatureExtractor(extractor, n_workers=8)
audio_files = ['file1.wav', 'file2.wav', ...] # 1000s of files
features = parallel_extractor.extract_from_files(audio_files)
Advanced Feature Types
1. Learned Features (Embeddings)
Instead of hand-crafted features, learn representations from data.
import torch
import torch.nn as nn
class AudioEmbeddingExtractor(nn.Module):
"""
Extract learned audio embeddings
Use pre-trained models (wav2vec, HuBERT) as feature extractors
"""
def __init__(self, model_name='facebook/wav2vec2-base'):
super().__init__()
from transformers import Wav2Vec2Model
# Load pre-trained model
self.model = Wav2Vec2Model.from_pretrained(model_name)
self.model.eval() # Freeze for feature extraction
def extract(self, audio: np.ndarray, sr=16000) -> np.ndarray:
"""
Extract contextualized embeddings
Returns:
Embeddings: (time_steps, hidden_dim)
typically (time, 768) for base model
"""
# Convert to tensor
audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)
# Extract features
with torch.no_grad():
outputs = self.model(audio_tensor)
embeddings = outputs.last_hidden_state[0] # (time, 768)
return embeddings.numpy()
# Usage - MUCH more powerful than MFCCs for transfer learning
embedding_extractor = AudioEmbeddingExtractor()
embeddings = embedding_extractor.extract(audio)
print(f"Embeddings shape: {embeddings.shape}") # (time, 768)
Comparison:
Feature Type | Dimension | Training Required | Transfer Learning | Accuracy |
---|---|---|---|---|
MFCCs | 40-120 | No | Poor | Baseline |
Mel-spectrogram | 80-128 | No | Good | +5-10% |
Wav2Vec embeddings | 768 | Yes (pre-trained) | Excellent | +15-25% |
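For transfer learning, a common pattern is to pool the frame-level embeddings into a single fixed-size utterance vector and feed it to a lightweight downstream classifier. A minimal sketch reusing the extractor above (mean pooling is one simple choice; attention pooling often works better):
embeddings = embedding_extractor.extract(audio)   # (time, 768)
utterance_vector = embeddings.mean(axis=0)        # (768,) utterance-level representation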
2. Filter Bank Features (FBank)
Alternative to MFCCs - skip the DCT step.
class FilterbankExtractor:
"""
Extract log mel-filterbank features
Similar to mel-spectrograms, popular in modern ASR
"""
def __init__(self, sr=16000, n_mels=80, n_fft=512, hop_length=160):
self.sr = sr
self.n_mels = n_mels
self.n_fft = n_fft
self.hop_length = hop_length
def extract(self, audio: np.ndarray) -> np.ndarray:
"""
Extract log filter bank energies
Returns:
FBank: (n_mels, time_steps)
"""
# Mel spectrogram
mel_spec = librosa.feature.melspectrogram(
y=audio,
sr=self.sr,
n_fft=self.n_fft,
hop_length=self.hop_length,
n_mels=self.n_mels
)
# Log
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
return log_mel
# FBank vs MFCC:
# - FBank: Keep all mel bins (80-128)
# - MFCC: Compress to 13-40 via DCT
#
# FBank often works better with neural networks
3. Prosodic Features
Capture rhythm, stress, and intonation.
class ProsodicFeatureExtractor:
"""
Extract prosodic features for emotion, speaker ID, etc.
"""
def extract_intensity_contour(self, audio, sr=16000, hop_length=160):
"""
Intensity (loudness) over time
"""
intensity = librosa.feature.rms(y=audio, hop_length=hop_length)[0]
# Convert to dB
intensity_db = librosa.amplitude_to_db(intensity, ref=np.max)
return intensity_db
def extract_speaking_rate(self, audio, sr=16000):
"""
Estimate speaking rate (syllables per second)
Approximation: count peaks in energy envelope
"""
# Energy envelope
energy = librosa.feature.rms(y=audio, hop_length=160)[0]
# Find peaks (local maxima)
from scipy.signal import find_peaks
peaks, _ = find_peaks(energy, distance=10, prominence=0.1)
# Speaking rate
duration = len(audio) / sr
syllables_per_sec = len(peaks) / duration
return syllables_per_sec
def extract_all_prosodic(self, audio, sr=16000):
"""Extract all prosodic features"""
# Pitch
pitch_extractor = PitchExtractor(sr=sr)
pitch_stats = pitch_extractor.extract_pitch_features(audio)
# Intensity
intensity = self.extract_intensity_contour(audio, sr)
# Speaking rate
speaking_rate = self.extract_speaking_rate(audio, sr)
return {
**pitch_stats,
'mean_intensity': np.mean(intensity),
'std_intensity': np.std(intensity),
'speaking_rate': speaking_rate
}
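A quick usage sketch of the prosodic extractor defined above:
# Usage
prosody_extractor = ProsodicFeatureExtractor()
prosodic_features = prosody_extractor.extract_all_prosodic(audio, sr=16000)
print(prosodic_features)  # pitch stats + intensity stats + speaking rate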
Feature Quality & Validation
Ensure extracted features are high quality.
Feature Quality Metrics
class FeatureQualityChecker:
"""
Validate quality of extracted features
"""
def check_for_nans(self, features: Dict[str, np.ndarray]) -> bool:
"""Check for NaN/Inf values"""
for name, feat in features.items():
if np.isnan(feat).any() or np.isinf(feat).any():
print(f"⚠️ {name} contains NaN/Inf")
return False
return True
def check_dynamic_range(self, features: Dict[str, np.ndarray]) -> Dict[str, float]:
"""
Check dynamic range of features
Low dynamic range → feature not informative
"""
ranges = {}
for name, feat in features.items():
feat_range = feat.max() - feat.min()
ranges[name] = feat_range
if feat_range < 1e-6:
print(f"⚠️ {name} has very low dynamic range: {feat_range}")
return ranges
def check_feature_statistics(self, features_batch: List[np.ndarray]):
"""
Check statistics across batch
Ensure features are properly normalized
"""
# Stack all features
all_features = np.concatenate(features_batch, axis=1) # (n_features, total_time)
# Per-feature statistics
mean_per_feature = np.mean(all_features, axis=1)
std_per_feature = np.std(all_features, axis=1)
print("Feature Statistics:")
print(f" Mean range: [{mean_per_feature.min():.3f}, {mean_per_feature.max():.3f}]")
print(f" Std range: [{std_per_feature.min():.3f}, {std_per_feature.max():.3f}]")
# Check if normalized
if np.abs(mean_per_feature).max() > 0.1:
print("⚠️ Features not centered (mean far from 0)")
if np.abs(std_per_feature - 1.0).max() > 0.2:
print("⚠️ Features not standardized (std far from 1)")
Connection to Data Preprocessing Pipeline
Feature extraction for speech is analogous to data preprocessing for ML systems (see Day 3 ML).
Parallel Concepts
Speech Feature Extraction | ML Data Preprocessing |
---|---|
Handle missing audio | Handle missing values |
Normalize features (mean=0, std=1) | Normalize numerical features |
Pad/truncate variable length | Handle variable-length sequences |
Validate audio quality | Schema validation |
Cache extracted features | Cache preprocessed data |
Batch processing | Distributed data processing |
Unified Preprocessing Framework
class UnifiedPreprocessor:
"""
Combined preprocessing for multimodal ML
Example: Speech + text + metadata
"""
def __init__(self):
# Audio features
self.audio_extractor = AudioFeatureExtractor(
FeatureConfig(feature_types=['mfcc', 'mel'])
)
# Text features (from transcripts)
from sklearn.feature_extraction.text import TfidfVectorizer
self.text_vectorizer = TfidfVectorizer(max_features=1000)
# Numerical features
from sklearn.preprocessing import StandardScaler
self.numerical_scaler = StandardScaler()
def preprocess_sample(self, audio, text, metadata):
"""
Preprocess multimodal sample
Args:
audio: Audio waveform
text: Transcript or description
metadata: User/item metadata (dict)
Returns:
Combined feature vector
"""
# Extract audio features
audio_features = self.audio_extractor.extract(audio)
audio_pooled = np.mean(audio_features['mfcc'], axis=1) # (120,) = MFCCs + deltas + delta-deltas
# Extract text features (assumes the vectorizer was fit on the training corpus beforehand)
text_features = self.text_vectorizer.transform([text]).toarray()[0] # (1000,)
# Process metadata (assumes categorical fields are already numerically encoded and the scaler is fit)
metadata_array = np.array([
metadata['user_age'],
metadata['user_gender'],
metadata['device_type']
])
metadata_scaled = self.numerical_scaler.transform([metadata_array])[0]
# Concatenate all features
combined = np.concatenate([
audio_pooled, # (120,)
text_features, # (1000,)
metadata_scaled # (3,)
]) # Total: (1123,)
return combined
Production Best Practices
1. Feature Versioning
Track feature extraction versions for reproducibility.
from datetime import datetime

class VersionedFeatureExtractor:
"""
Version feature extraction logic
Critical for:
- A/B testing different features
- Rollback if new features hurt performance
- Reproducibility
"""
VERSION = "1.2.0"
def __init__(self, config: FeatureConfig):
self.config = config
self.extractor = AudioFeatureExtractor(config)
def extract_with_metadata(self, audio_path: str):
"""
Extract features with version metadata
"""
features = self.extractor.extract_from_file(audio_path)
metadata = {
'version': self.VERSION,
'config': self.config.__dict__,
'timestamp': datetime.now().isoformat(),
'audio_path': audio_path
}
return {
'features': features,
'metadata': metadata
}
def save_features(self, features, output_path):
"""Save features with version info"""
np.savez_compressed(
output_path,
**features['features'],
metadata=json.dumps(features['metadata'])
)
2. Error Handling
Robust feature extraction handles failures gracefully.
import logging

logger = logging.getLogger(__name__)

class RobustFeatureExtractor:
"""
Feature extractor with error handling
"""
def __init__(self, extractor: AudioFeatureExtractor):
self.extractor = extractor
def extract_safe(self, audio_path: str) -> Optional[Dict]:
"""
Extract features with error handling
"""
try:
# Load audio
audio, sr = librosa.load(audio_path, sr=self.extractor.config.sr)
# Validate
if len(audio) == 0:
logger.warning(f"Empty audio: {audio_path}")
return None
if len(audio) < self.extractor.config.sr * 0.1: # < 100ms
logger.warning(f"Audio too short: {audio_path}")
return None
# Extract
features = self.extractor.extract(audio)
# Quality check
quality_checker = FeatureQualityChecker()
if not quality_checker.check_for_nans(features):
logger.error(f"Feature extraction failed (NaN): {audio_path}")
return None
return features
except Exception as e:
logger.error(f"Feature extraction error for {audio_path}: {e}")
return None
def extract_batch_robust(self, audio_paths: List[str]) -> List[Dict]:
"""
Extract from batch, skipping failures
"""
results = []
failures = []
for path in audio_paths:
features = self.extract_safe(path)
if features is not None:
results.append({'path': path, 'features': features})
else:
failures.append(path)
success_rate = len(results) / len(audio_paths)
logger.info(f"Feature extraction: {len(results)}/{len(audio_paths)} succeeded ({success_rate:.1%})")
if failures:
logger.warning(f"Failed files: {failures[:10]}") # Log first 10
return results
3. Monitoring Feature Quality
Track feature statistics over time to detect issues.
class FeatureMonitor:
"""
Monitor feature quality in production
"""
def __init__(self, expected_stats: Dict[str, Dict]):
"""
Args:
expected_stats: Expected statistics per feature type
{
'mfcc': {'mean_range': [-5, 5], 'std_range': [0.5, 2.0]},
'mel': {'mean_range': [-80, 0], 'std_range': [10, 30]}
}
"""
self.expected_stats = expected_stats
def validate_features(self, features: Dict[str, np.ndarray]) -> List[str]:
"""
Validate extracted features against expected statistics
Returns:
List of warnings
"""
warnings = []
for feat_name, feat_values in features.items():
if feat_name not in self.expected_stats:
continue
expected = self.expected_stats[feat_name]
# Check mean
actual_mean = np.mean(feat_values)
expected_mean_range = expected['mean_range']
if not (expected_mean_range[0] <= actual_mean <= expected_mean_range[1]):
warnings.append(
f"{feat_name}: mean {actual_mean:.2f} outside expected range {expected_mean_range}"
)
# Check std
actual_std = np.std(feat_values)
expected_std_range = expected['std_range']
if not (expected_std_range[0] <= actual_std <= expected_std_range[1]):
warnings.append(
f"{feat_name}: std {actual_std:.2f} outside expected range {expected_std_range}"
)
return warnings
def compute_statistics(self, features_batch: List[Dict[str, np.ndarray]]):
"""
Compute statistics across batch
Use to establish baseline expected_stats
"""
stats = {}
# Get feature names from first sample
feature_names = features_batch[0].keys()
for feat_name in feature_names:
# Collect all values
all_values = np.concatenate([
f[feat_name].flatten() for f in features_batch
])
stats[feat_name] = {
'mean': np.mean(all_values),
'std': np.std(all_values),
'min': np.min(all_values),
'max': np.max(all_values),
'percentiles': {
'25': np.percentile(all_values, 25),
'50': np.percentile(all_values, 50),
'75': np.percentile(all_values, 75),
'95': np.percentile(all_values, 95)
}
}
return stats
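A usage sketch; the expected ranges below are placeholders, and in practice you would derive them by running compute_statistics on a known-good baseline batch:
monitor = FeatureMonitor(expected_stats={
    'mfcc': {'mean_range': [-1.0, 1.0], 'std_range': [0.5, 2.0]}
})
warnings = monitor.validate_features(extractor.extract(audio))
for w in warnings:
    print(w)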
Data Augmentation in Feature Space
Augment features directly for training robustness.
SpecAugment
class SpecAugment:
"""
SpecAugment: Data augmentation on spectrograms
Proposed in "SpecAugment: A Simple Data Augmentation Method for ASR" (Google, 2019)
Substantially reduces word error rate on benchmarks such as LibriSpeech and Switchboard
"""
def __init__(
self,
time_mask_param=70,
freq_mask_param=15,
num_time_masks=2,
num_freq_masks=2
):
self.time_mask_param = time_mask_param
self.freq_mask_param = freq_mask_param
self.num_time_masks = num_time_masks
self.num_freq_masks = num_freq_masks
def time_mask(self, spec: np.ndarray) -> np.ndarray:
"""
Mask random time region
Sets random time frames to zero
"""
spec = spec.copy()
time_length = spec.shape[1]
for _ in range(self.num_time_masks):
t = np.random.randint(0, min(self.time_mask_param, time_length))
t0 = np.random.randint(0, time_length - t)
spec[:, t0:t0+t] = 0
return spec
def freq_mask(self, spec: np.ndarray) -> np.ndarray:
"""
Mask random frequency region
Sets random frequency bins to zero
"""
spec = spec.copy()
freq_length = spec.shape[0]
for _ in range(self.num_freq_masks):
f = np.random.randint(0, min(self.freq_mask_param, freq_length))
f0 = np.random.randint(0, freq_length - f)
spec[f0:f0+f, :] = 0
return spec
def augment(self, spec: np.ndarray) -> np.ndarray:
"""Apply both time and freq masking"""
spec = self.time_mask(spec)
spec = self.freq_mask(spec)
return spec
# Usage during training
augmenter = SpecAugment()
for audio, label in train_loader:
# Extract features
mel_spec = mel_extractor.extract(audio)
# Augment
mel_spec_aug = augmenter.augment(mel_spec)
# Train model
train_model(mel_spec_aug, label)
Batch Feature Extraction for Training
Extract features for an entire dataset efficiently, once, so training runs can reuse them.
Batch Extraction Pipeline
import os
import logging
from pathlib import Path
from tqdm import tqdm
import h5py

logger = logging.getLogger(__name__)
class BatchFeatureExtractor:
"""
Extract features for large audio datasets
Use case: Prepare training data
- Extract once, train many times
- Save features to disk (HDF5 format)
"""
def __init__(self, extractor: AudioFeatureExtractor, n_workers=8):
self.extractor = extractor
self.n_workers = n_workers
def extract_dataset(
self,
audio_dir: str,
output_path: str,
max_length_frames: int = 1000
):
"""
Extract features for all audio files in directory
Args:
audio_dir: Directory containing .wav files
output_path: HDF5 file to save features
max_length_frames: Pad/truncate to this length
"""
# Find all audio files
audio_files = list(Path(audio_dir).rglob('*.wav'))
print(f"Found {len(audio_files)} audio files")
# Create HDF5 file
with h5py.File(output_path, 'w') as hf:
# Pre-allocate datasets
# (We'll store features for each type)
feature_dim = self.extractor.config.n_mfcc * 3 # MFCCs + deltas + delta-deltas
features_dataset = hf.create_dataset(
'features',
shape=(len(audio_files), feature_dim, max_length_frames),
dtype='float32'
)
lengths_dataset = hf.create_dataset(
'lengths',
shape=(len(audio_files),),
dtype='int32'
)
# Store file paths
paths_dataset = hf.create_dataset(
'paths',
shape=(len(audio_files),),
dtype=h5py.string_dtype()
)
# Extract features
for idx, audio_path in enumerate(tqdm(audio_files)):
try:
# Load audio
audio, sr = librosa.load(str(audio_path), sr=self.extractor.config.sr)
# Extract features
features = self.extractor.extract(audio)
# Get MFCCs with deltas
mfcc_deltas = features['mfcc'] # (120, time)
# Pad or truncate
handler = VariableLengthHandler()
mfcc_fixed = handler.pad_or_truncate(mfcc_deltas, max_length_frames)
# Store
features_dataset[idx] = mfcc_fixed
lengths_dataset[idx] = min(mfcc_deltas.shape[1], max_length_frames)
paths_dataset[idx] = str(audio_path)
except Exception as e:
logger.error(f"Failed to process {audio_path}: {e}")
# Store zeros for failed files
features_dataset[idx] = np.zeros((feature_dim, max_length_frames))
lengths_dataset[idx] = 0
paths_dataset[idx] = str(audio_path)
print(f"Features saved to {output_path}")
# Usage
batch_extractor = BatchFeatureExtractor(extractor, n_workers=8)
batch_extractor.extract_dataset(
audio_dir='./data/train/',
output_path='./features/train_features.h5',
max_length_frames=1000
)
# Load for training
with h5py.File('./features/train_features.h5', 'r') as hf:
features = hf['features'][:] # (N, feature_dim, max_length)
lengths = hf['lengths'][:] # (N,)
paths = hf['paths'][:] # (N,)
Real-World Systems
Kaldi: Traditional ASR Feature Pipeline
Kaldi is the industry standard for traditional ASR.
Feature extraction:
# Kaldi feature extraction (MFCC + pitch)
compute-mfcc-feats --config=conf/mfcc.conf scp:wav.scp ark:mfcc.ark
compute-and-process-kaldi-pitch-feats scp:wav.scp ark:pitch.ark
# Combine features
paste-feats ark:mfcc.ark ark:pitch.ark ark:features.ark
Configuration (mfcc.conf):
--use-energy=true
--num-mel-bins=40
--num-ceps=40
--low-freq=20
--high-freq=8000
--sample-frequency=16000
PyTorch: Modern Deep Learning Pipeline
import torchaudio
import torch
from typing import Dict
class TorchAudioExtractor:
"""
Feature extraction using torchaudio
Benefits:
- GPU acceleration
- Differentiable (can backprop through features)
- Integrated with PyTorch training
"""
def __init__(self, sr=16000, n_mfcc=40, n_mels=80):
self.sr = sr
self.n_mfcc = n_mfcc
self.n_mels = n_mels
# Create transforms (can move to GPU)
self.mfcc_transform = torchaudio.transforms.MFCC(
sample_rate=sr,
n_mfcc=n_mfcc,
melkwargs={'n_mels': 40, 'n_fft': 512, 'hop_length': 160}
)
self.mel_transform = torchaudio.transforms.MelSpectrogram(
sample_rate=sr,
n_fft=512,
hop_length=160,
n_mels=n_mels
)
# Amplitude → dB conversion
self.db_transform = torchaudio.transforms.AmplitudeToDB()
def to(self, device):
"""
Move transforms to a device (CPU/GPU) and return self.
"""
self.mfcc_transform = self.mfcc_transform.to(device)
self.mel_transform = self.mel_transform.to(device)
self.db_transform = self.db_transform.to(device)
return self
def extract(self, audio: torch.Tensor) -> Dict[str, torch.Tensor]:
"""
Extract features (GPU-accelerated if audio on GPU)
Args:
audio: (batch, time) or (time,)
Returns:
Dictionary of features
"""
if audio.ndim == 1:
audio = audio.unsqueeze(0) # Add batch dimension
# Extract
mfccs = self.mfcc_transform(audio) # (batch, n_mfcc, time)
mel = self.mel_transform(audio) # (batch, n_mels, time)
mel_db = self.db_transform(mel)
return {
'mfcc': mfccs,
'mel': mel_db
}
# Usage with GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
extractor = TorchAudioExtractor().to(device)
# Load audio
audio, sr = torchaudio.load('speech.wav')
audio = audio.to(device)
# Extract (on GPU)
features = extractor.extract(audio)
Google: Production ASR Feature Extraction
Stack:
- Input: 16kHz audio
- Features: 80-bin log mel-filterbank
- Augmentation: SpecAugment
- Normalization: Per-utterance mean/variance normalization (sketched below)
- Model: Transformer encoder-decoder
Key optimizations:
- Precompute features for training data
- On-the-fly extraction for inference
- GPU-accelerated extraction for real-time systems
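As a minimal sketch of the per-utterance mean/variance normalization step described above (illustrative only, reusing the mel-spectrogram extractor from earlier, not Google's actual pipeline):
# Per-utterance mean/variance normalization (CMVN) over 80-bin log-mel features
fbank = MelSpectrogramExtractor(n_mels=80).extract(audio)   # (80, time)
mean = fbank.mean(axis=1, keepdims=True)
std = fbank.std(axis=1, keepdims=True) + 1e-8
fbank_cmvn = (fbank - mean) / std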
Choosing the Right Features
Different tasks need different features.
Feature Selection Guide
Task | Best Features | Why |
---|---|---|
ASR (traditional) | MFCCs + deltas | Captures phonetic content |
ASR (deep learning) | Mel-spectrograms | Works well with CNNs |
Speaker Recognition | MFCCs + pitch + prosody | Speaker identity in pitch/prosody |
Emotion Recognition | Prosodic + spectral | Emotion in prosody + voice quality |
Keyword Spotting | Mel-spectrograms | Simple, fast with CNNs |
Speech Enhancement | STFT magnitude + phase | Need phase for reconstruction |
Voice Activity Detection | Energy + ZCR | Simple features sufficient |
Combining Features
class MultiFeatureExtractor:
"""
Combine multiple feature types
Different features capture different aspects
"""
def __init__(self):
self.mfcc_ext = MFCCExtractor()
self.pitch_ext = PitchExtractor()
self.prosody_ext = ProsodicFeatureExtractor()
def extract_combined(self, audio):
"""
Extract and combine multiple feature types
"""
# MFCCs (40, time)
mfccs = self.mfcc_ext.extract(audio)
# Pitch (time,)
pitch, voiced = self.pitch_ext.extract_f0(audio)
pitch = pitch.reshape(1, -1) # (1, time)
# Energy (1, time)
energy = librosa.feature.rms(y=audio, hop_length=160)
# Align all features to same time dimension
min_time = min(mfccs.shape[1], pitch.shape[1], energy.shape[1])
mfccs = mfccs[:, :min_time]
pitch = pitch[:, :min_time]
energy = energy[:, :min_time]
# Stack
combined = np.vstack([mfccs, pitch, energy]) # (42, time)
return combined
Key Takeaways
✅ MFCCs are standard for speech recognition - compact and robust
✅ Mel-spectrograms work better with deep learning (CNNs, Transformers)
✅ Delta features capture temporal dynamics - critical for accuracy
✅ Normalize features for stable training (mean=0, std=1)
✅ Handle variable length with padding, pooling, or attention masks
✅ Cache features for repeated use - major speedup in training
✅ Streaming extraction possible with circular buffers
✅ Parallel processing speeds up batch feature extraction
✅ SpecAugment improves robustness through feature-space augmentation
✅ Monitor feature quality to detect pipeline issues early
✅ Version features for reproducibility and A/B testing
✅ Choose features based on task - no one-size-fits-all
Originally published at: arunbaby.com/speech-tech/0003-audio-feature-extraction
If you found this helpful, consider sharing it with others who might benefit.