29 minute read

How to transform raw audio waveforms into ML-ready features that capture speech characteristics for robust model training.

Introduction

Raw audio waveforms are high-dimensional, noisy, and difficult for ML models to learn from directly. Feature extraction transforms audio into compact, informative representations that:

  • Capture important speech characteristics
  • Reduce dimensionality (16kHz audio = 16,000 samples/sec → ~40 features per 10ms frame)
  • Provide invariance to irrelevant variations (volume, recording device)
  • Enable efficient model training

Why it matters:

  • Improves accuracy: Good features → better models
  • Reduces compute: Lower dimensionality = faster training/inference
  • Enables transfer learning: Pre-extracted features work across tasks
  • Production efficiency: Feature extraction can be cached

What you’ll learn:

  • Core audio features (MFCCs, spectrograms, mel-scale)
  • Time-domain vs frequency-domain features
  • Production-grade extraction pipelines
  • Optimization for real-time processing
  • Feature engineering for speech tasks

Problem Definition

Design a feature extraction pipeline for speech ML systems.

Functional Requirements

  1. Feature Types
    • Time-domain features (energy, zero-crossing rate)
    • Frequency-domain features (spectrograms, MFCCs)
    • Temporal features (deltas, delta-deltas)
    • Learned features (embeddings)
  2. Input Handling
    • Support multiple sample rates (8kHz, 16kHz, 48kHz)
    • Handle variable-length audio
    • Process both mono and stereo
    • Support batch processing
  3. Output Format
    • Fixed-size feature vectors
    • Variable-length sequences
    • 2D/3D tensors for neural networks

Non-Functional Requirements

  1. Performance
    • Real-time: Extract features in < 10ms per 1 sec of audio (see the timing sketch after this list)
    • Batch: Process 10K files/hour on single machine
    • Memory: < 100MB RAM for streaming
  2. Quality
    • Robust to noise
    • Consistent across devices
    • Reproducible (deterministic)
  3. Flexibility
    • Configurable parameters
    • Support multiple backends (librosa, torchaudio)
    • Easy to extend with new features
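
A quick way to sanity-check the real-time target above is to time a single extraction call. This is a minimal sketch with synthetic audio; actual numbers depend on hardware and library versions.

import time
import numpy as np
import librosa

# 1 second of synthetic 16kHz audio (stand-in for real speech)
audio = np.random.randn(16000).astype(np.float32)

start = time.perf_counter()
librosa.feature.mfcc(y=audio, sr=16000, n_mfcc=40, n_fft=512, hop_length=160)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"MFCC extraction: {elapsed_ms:.1f} ms for 1 sec of audio")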

Audio Basics

Waveform Representation

import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load audio
audio, sr = librosa.load('speech.wav', sr=16000)

print(f"Sample rate: {sr} Hz")
print(f"Duration: {len(audio) / sr:.2f} seconds")
print(f"Shape: {audio.shape}")
print(f"Range: [{audio.min():.3f}, {audio.max():.3f}]")

# Visualize waveform
plt.figure(figsize=(12, 4))
time = np.arange(len(audio)) / sr
plt.plot(time, audio)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Audio Waveform')
plt.show()

Key properties:

  • Sample rate (sr): Samples per second (e.g., 16000 Hz = 16000 samples/sec)
  • Duration: len(audio) / sr seconds
  • Amplitude: Typically normalized to [-1, 1]
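
The input-handling requirements above call for multiple sample rates and mono/stereo support. A minimal normalization sketch (the 48kHz filename is hypothetical):

# Normalize inputs to 16kHz mono before feature extraction
audio_raw, orig_sr = librosa.load('speech_48k.wav', sr=None, mono=False)

# Stereo → mono by averaging channels
if audio_raw.ndim == 2:
    audio_raw = audio_raw.mean(axis=0)

# Resample to the pipeline's target rate
audio_16k = librosa.resample(audio_raw, orig_sr=orig_sr, target_sr=16000)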

Feature 1: Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs are the most widely used features in speech recognition.

Why MFCCs?

  1. Mimic human hearing: Use the mel scale, a perceptual frequency scale (see the sketch after this list)
  2. Compact: Represent spectral envelope with 13-40 coefficients
  3. Robust: Less sensitive to pitch variations
  4. Proven: Gold standard for ASR for decades
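
The mel scale warps frequency so that equal steps sound roughly equally spaced to human listeners. A minimal sketch of the common HTK formula:

import numpy as np

def hz_to_mel(f_hz):
    """Convert Hz to mel (HTK formula)"""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Low frequencies get fine resolution; high frequencies are compressed
print(hz_to_mel(100))    # ~150 mel
print(hz_to_mel(1000))   # ~1000 mel
print(hz_to_mel(8000))   # ~2840 mel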

How MFCCs Work

Audio Waveform
    ↓
1. Pre-emphasis (boost high frequencies)
    ↓
2. Frame the signal (25ms windows, 10ms hop)
    ↓
3. Apply window function (Hamming)
    ↓
4. FFT (Fast Fourier Transform)
    ↓
5. Mel filterbank (map to mel scale)
    ↓
6. Log (compress dynamic range)
    ↓
7. DCT (Discrete Cosine Transform)
    ↓
MFCCs (13-40 coefficients per frame)
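
To make these steps concrete, here is a from-scratch sketch built on librosa's STFT and filterbank helpers (librosa windows with Hann by default rather than the Hamming listed above; the structure is what matters):

import numpy as np
import scipy.fftpack
import librosa

def mfcc_from_scratch(audio, sr=16000, n_fft=512, hop_length=160, n_mels=40, n_mfcc=13):
    # 1. Pre-emphasis: boost high frequencies
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    
    # 2-4. Frame + window + FFT (librosa.stft does all three)
    power_spec = np.abs(librosa.stft(emphasized, n_fft=n_fft, hop_length=hop_length)) ** 2
    
    # 5. Mel filterbank: map linear frequency bins to mel bins
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_fb @ power_spec
    
    # 6. Log compression of the dynamic range
    log_mel = np.log(mel_energies + 1e-10)
    
    # 7. DCT: decorrelate and keep the first n_mfcc coefficients
    return scipy.fftpack.dct(log_mel, axis=0, norm='ortho')[:n_mfcc]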

Implementation

import librosa
import numpy as np

class MFCCExtractor:
    """
    Extract MFCC features from audio
    
    Standard configuration for speech recognition
    """
    
    def __init__(
        self,
        sr=16000,
        n_mfcc=40,
        n_fft=512,
        hop_length=160,  # 10ms at 16kHz
        n_mels=40,
        fmin=20,
        fmax=8000
    ):
        self.sr = sr
        self.n_mfcc = n_mfcc
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.n_mels = n_mels
        self.fmin = fmin
        self.fmax = fmax
    
    def extract(self, audio: np.ndarray) -> np.ndarray:
        """
        Extract MFCCs
        
        Args:
            audio: Audio waveform (1D array)
        
        Returns:
            MFCCs: (n_mfcc, time_steps)
        """
        # Extract MFCCs
        mfccs = librosa.feature.mfcc(
            y=audio,
            sr=self.sr,
            n_mfcc=self.n_mfcc,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            n_mels=self.n_mels,
            fmin=self.fmin,
            fmax=self.fmax
        )
        
        return mfccs  # Shape: (n_mfcc, time)
    
    def extract_with_deltas(self, audio: np.ndarray) -> np.ndarray:
        """
        Extract MFCCs + deltas + delta-deltas
        
        Deltas capture temporal dynamics
        
        Returns:
            Features: (n_mfcc * 3, time_steps)
        """
        # MFCCs
        mfccs = self.extract(audio)
        
        # Delta (first derivative)
        delta = librosa.feature.delta(mfccs)
        
        # Delta-delta (second derivative)
        delta2 = librosa.feature.delta(mfccs, order=2)
        
        # Stack
        features = np.vstack([mfccs, delta, delta2])  # (120, time)
        
        return features

# Usage
extractor = MFCCExtractor()
mfccs = extractor.extract(audio)
print(f"MFCCs shape: {mfccs.shape}")  # (40, time_steps)

# With deltas
features = extractor.extract_with_deltas(audio)
print(f"MFCCs+deltas shape: {features.shape}")  # (120, time_steps)

Visualizing MFCCs

import matplotlib.pyplot as plt

def plot_mfccs(mfccs, sr, hop_length):
    """Visualize MFCC features"""
    plt.figure(figsize=(12, 6))
    
    # Convert frame indices to time
    times = librosa.frames_to_time(
        np.arange(mfccs.shape[1]),
        sr=sr,
        hop_length=hop_length
    )
    
    plt.imshow(
        mfccs,
        aspect='auto',
        origin='lower',
        extent=[times[0], times[-1], 0, mfccs.shape[0]],
        cmap='viridis'
    )
    
    plt.colorbar()  # MFCC values are unitless cepstral coefficients, not dB
    plt.xlabel('Time (s)')
    plt.ylabel('MFCC Coefficient')
    plt.title('MFCC Features')
    plt.tight_layout()
    plt.show()

plot_mfccs(mfccs, sr=16000, hop_length=160)

Feature 2: Mel-Spectrograms

Mel-spectrograms preserve more spectral detail than MFCCs.

What is a Spectrogram?

A spectrogram shows how the frequency content of a signal changes over time.

  • X-axis: Time
  • Y-axis: Frequency
  • Color: Magnitude (energy)

Mel-Spectrogram vs MFCC

| Aspect          | Mel-Spectrogram     | MFCC                       |
|-----------------|---------------------|----------------------------|
| Dimensions      | (n_mels, time)      | (n_mfcc, time)             |
| Information     | Full spectrum       | Spectral envelope          |
| Size            | 40-128 bins         | 13-40 coefficients         |
| Use case        | CNNs, deep learning | Traditional ASR            |
| Spectral detail | Higher              | Lower (DCT keeps envelope) |

Implementation

class MelSpectrogramExtractor:
    """
    Extract log mel-spectrogram features
    
    Popular for deep learning models (CNNs, Transformers)
    """
    
    def __init__(
        self,
        sr=16000,
        n_fft=512,
        hop_length=160,
        n_mels=80,
        fmin=0,
        fmax=8000
    ):
        self.sr = sr
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.n_mels = n_mels
        self.fmin = fmin
        self.fmax = fmax
    
    def extract(self, audio: np.ndarray) -> np.ndarray:
        """
        Extract log mel-spectrogram
        
        Returns:
            Log mel-spectrogram: (n_mels, time_steps)
        """
        # Compute mel spectrogram
        mel_spec = librosa.feature.melspectrogram(
            y=audio,
            sr=self.sr,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            n_mels=self.n_mels,
            fmin=self.fmin,
            fmax=self.fmax
        )
        
        # Convert to log scale (dB)
        log_mel = librosa.power_to_db(mel_spec, ref=np.max)
        
        return log_mel  # Shape: (n_mels, time)
    
    def extract_normalized(self, audio: np.ndarray) -> np.ndarray:
        """
        Extract and normalize to [0, 1]
        
        Better for neural networks
        """
        log_mel = self.extract(audio)
        
        # Normalize to [0, 1]
        log_mel_norm = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
        
        return log_mel_norm

# Usage
mel_extractor = MelSpectrogramExtractor(n_mels=80)
mel_spec = mel_extractor.extract(audio)
print(f"Mel-spectrogram shape: {mel_spec.shape}")  # (80, time_steps)

Visualizing Mel-Spectrogram

import librosa.display

def plot_mel_spectrogram(mel_spec, sr, hop_length):
    """Visualize mel-spectrogram"""
    plt.figure(figsize=(12, 6))
    
    librosa.display.specshow(
        mel_spec,
        sr=sr,
        hop_length=hop_length,
        x_axis='time',
        y_axis='mel',
        cmap='viridis'
    )
    
    plt.colorbar(format='%+2.0f dB')
    plt.title('Mel-Spectrogram')
    plt.tight_layout()
    plt.show()

plot_mel_spectrogram(mel_spec, sr=16000, hop_length=160)

Feature 3: Raw Spectrograms (STFT)

The Short-Time Fourier Transform (STFT) retains the full, unwarped frequency resolution of the FFT.

Implementation

class STFTExtractor:
    """
    Extract raw STFT features
    
    Used when you need full frequency resolution
    """
    
    def __init__(
        self,
        n_fft=512,
        hop_length=160,
        win_length=400
    ):
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
    
    def extract(self, audio: np.ndarray) -> np.ndarray:
        """
        Extract magnitude spectrogram
        
        Returns:
            Spectrogram: (n_fft//2 + 1, time_steps)
        """
        # Compute STFT
        stft = librosa.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length
        )
        
        # Get magnitude
        magnitude = np.abs(stft)
        
        # Convert to dB
        magnitude_db = librosa.amplitude_to_db(magnitude, ref=np.max)
        
        return magnitude_db  # Shape: (n_fft//2 + 1, time)
    
    def extract_with_phase(self, audio: np.ndarray):
        """
        Extract magnitude and phase
        
        Phase information useful for reconstruction
        """
        stft = librosa.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length
        )
        
        magnitude = np.abs(stft)
        phase = np.angle(stft)
        
        return magnitude, phase

# Usage
stft_extractor = STFTExtractor()
spectrogram = stft_extractor.extract(audio)
print(f"Spectrogram shape: {spectrogram.shape}")  # (257, time_steps)

Feature 4: Time-Domain Features

Simple but effective features computed directly from the waveform.

Implementation

class TimeDomainExtractor:
    """
    Extract time-domain features
    
    Fast to compute, useful for simple tasks
    """
    
    def extract_energy(self, audio: np.ndarray, frame_length=400, hop_length=160):
        """
        Frame-wise energy (RMS)
        
        Captures loudness/volume over time
        """
        energy = librosa.feature.rms(
            y=audio,
            frame_length=frame_length,
            hop_length=hop_length
        )[0]
        
        return energy
    
    def extract_zero_crossing_rate(self, audio: np.ndarray, frame_length=400, hop_length=160):
        """
        Zero-crossing rate
        
        Measures how often signal crosses zero
        High ZCR → noisy/unvoiced
        Low ZCR → tonal/voiced
        """
        zcr = librosa.feature.zero_crossing_rate(
            audio,
            frame_length=frame_length,
            hop_length=hop_length
        )[0]
        
        return zcr
    
    def extract_all(self, audio: np.ndarray):
        """Extract all time-domain features"""
        energy = self.extract_energy(audio)
        zcr = self.extract_zero_crossing_rate(audio)
        
        # Stack features
        features = np.vstack([energy, zcr])  # (2, time)
        
        return features

# Usage
time_extractor = TimeDomainExtractor()
time_features = time_extractor.extract_all(audio)
print(f"Time-domain features shape: {time_features.shape}")  # (2, time_steps)

Feature 5: Pitch & Formants

Pitch and formants carry important linguistic and speaker information; this section focuses on pitch (F0) extraction.

Pitch Extraction

class PitchExtractor:
    """
    Extract fundamental frequency (F0)
    
    Important for:
    - Speaker recognition
    - Emotion detection
    - Prosody modeling
    """
    
    def __init__(self, sr=16000, fmin=80, fmax=400):
        self.sr = sr
        self.fmin = fmin  # lower bound: low male pitch
        self.fmax = fmax  # upper bound: high female pitch
    
    def extract_f0(self, audio: np.ndarray, hop_length=160):
        """
        Extract pitch (fundamental frequency)
        
        Returns:
            f0: Pitch values (Hz) per frame
            voiced_flag: Boolean array (voiced vs unvoiced)
        """
        # Extract pitch using probabilistic YIN (pYIN), which also returns
        # a voiced/unvoiced decision per frame. (Plain YIN returns an
        # estimate for every frame, so thresholding f0 > 0 would mark
        # everything as voiced.)
        f0, voiced_flag, _ = librosa.pyin(
            audio,
            fmin=self.fmin,
            fmax=self.fmax,
            sr=self.sr,
            hop_length=hop_length
        )
        
        # Unvoiced frames come back as NaN; zero them for safe downstream math
        f0 = np.nan_to_num(f0)
        
        return f0, voiced_flag
    
    def extract_pitch_features(self, audio: np.ndarray):
        """
        Extract pitch statistics
        
        Useful for speaker/emotion recognition
        """
        f0, voiced = self.extract_f0(audio)
        
        # Statistics on voiced frames
        voiced_f0 = f0[voiced]
        
        if len(voiced_f0) > 0:
            features = {
                'mean_pitch': np.mean(voiced_f0),
                'std_pitch': np.std(voiced_f0),
                'min_pitch': np.min(voiced_f0),
                'max_pitch': np.max(voiced_f0),
                'pitch_range': np.max(voiced_f0) - np.min(voiced_f0),
                'voiced_ratio': np.sum(voiced) / len(voiced)
            }
        else:
            features = {k: 0.0 for k in ['mean_pitch', 'std_pitch', 'min_pitch', 'max_pitch', 'pitch_range', 'voiced_ratio']}
        
        return features

# Usage
pitch_extractor = PitchExtractor()
f0, voiced = pitch_extractor.extract_f0(audio)
print(f"Pitch shape: {f0.shape}")

pitch_stats = pitch_extractor.extract_pitch_features(audio)
print(f"Pitch statistics: {pitch_stats}")

Production Feature Pipeline

Combine all features into a unified pipeline.

Unified Feature Extractor

from dataclasses import dataclass
from typing import Dict, List, Optional
import json

@dataclass
class FeatureConfig:
    """Configuration for feature extraction"""
    sr: int = 16000
    feature_types: Optional[List[str]] = None  # e.g. ['mfcc', 'mel', 'pitch']
    
    # MFCC config
    n_mfcc: int = 40
    
    # Mel-spectrogram config
    n_mels: int = 80
    
    # Common config
    n_fft: int = 512
    hop_length: int = 160  # 10ms
    
    # Normalization
    normalize: bool = True
    
    def __post_init__(self):
        if self.feature_types is None:
            self.feature_types = ['mfcc']

class AudioFeatureExtractor:
    """
    Production-grade audio feature extractor
    
    Supports multiple feature types, caching, and batch processing
    """
    
    def __init__(self, config: FeatureConfig):
        self.config = config
        
        # Initialize extractors
        self.mfcc_extractor = MFCCExtractor(
            sr=config.sr,
            n_mfcc=config.n_mfcc,
            n_fft=config.n_fft,
            hop_length=config.hop_length
        )
        
        self.mel_extractor = MelSpectrogramExtractor(
            sr=config.sr,
            n_mels=config.n_mels,
            n_fft=config.n_fft,
            hop_length=config.hop_length
        )
        
        self.pitch_extractor = PitchExtractor(sr=config.sr)
        self.time_extractor = TimeDomainExtractor()
    
    def extract(self, audio: np.ndarray) -> Dict[str, np.ndarray]:
        """
        Extract features based on config
        
        Args:
            audio: Audio waveform
        
        Returns:
            Dictionary of features
        """
        features = {}
        
        if 'mfcc' in self.config.feature_types:
            mfccs = self.mfcc_extractor.extract_with_deltas(audio)
            if self.config.normalize:
                mfccs = self._normalize(mfccs)
            features['mfcc'] = mfccs
        
        if 'mel' in self.config.feature_types:
            mel = self.mel_extractor.extract(audio)
            if self.config.normalize:
                mel = self._normalize(mel)
            features['mel'] = mel
        
        if 'pitch' in self.config.feature_types:
            f0, voiced = self.pitch_extractor.extract_f0(audio, hop_length=self.config.hop_length)
            features['pitch'] = f0
            features['voiced'] = voiced.astype(np.float32)
        
        if 'time' in self.config.feature_types:
            time_feats = self.time_extractor.extract_all(audio)
            if self.config.normalize:
                time_feats = self._normalize(time_feats)
            features['time'] = time_feats
        
        return features
    
    def _normalize(self, features: np.ndarray) -> np.ndarray:
        """
        Normalize features (mean=0, std=1) per coefficient
        """
        mean = np.mean(features, axis=1, keepdims=True)
        std = np.std(features, axis=1, keepdims=True) + 1e-8
        
        normalized = (features - mean) / std
        
        return normalized
    
    def extract_from_file(self, audio_path: str) -> Dict[str, np.ndarray]:
        """
        Extract features from audio file
        """
        audio, sr = librosa.load(audio_path, sr=self.config.sr)
        return self.extract(audio)
    
    def extract_batch(self, audio_list: List[np.ndarray]) -> List[Dict[str, np.ndarray]]:
        """
        Extract features from batch of audio
        """
        return [self.extract(audio) for audio in audio_list]
    
    def save_config(self, path: str):
        """Save feature extraction config"""
        with open(path, 'w') as f:
            json.dump(self.config.__dict__, f, indent=2)
    
    @staticmethod
    def load_config(path: str) -> FeatureConfig:
        """Load feature extraction config"""
        with open(path, 'r') as f:
            config_dict = json.load(f)
        return FeatureConfig(**config_dict)

# Usage
config = FeatureConfig(
    feature_types=['mfcc', 'mel', 'pitch'],
    n_mfcc=40,
    n_mels=80,
    normalize=True
)

extractor = AudioFeatureExtractor(config)

# Extract features
features = extractor.extract(audio)
print("Extracted features:", features.keys())
for name, feat in features.items():
    print(f"  {name}: {feat.shape}")

# Save config for reproducibility
extractor.save_config('feature_config.json')

Handling Variable-Length Audio

Different audio clips have different durations, but most ML models expect fixed-shape inputs.

Strategy 1: Padding/Truncation

class VariableLengthHandler:
    """
    Handle variable-length audio
    """
    
    def pad_or_truncate(self, features: np.ndarray, target_length: int) -> np.ndarray:
        """
        Pad or truncate features to fixed length
        
        Args:
            features: (n_features, time)
            target_length: Target time dimension
        
        Returns:
            Fixed-length features: (n_features, target_length)
        """
        current_length = features.shape[1]
        
        if current_length < target_length:
            # Pad with zeros
            pad_width = ((0, 0), (0, target_length - current_length))
            features = np.pad(features, pad_width, mode='constant')
        elif current_length > target_length:
            # Truncate (take first target_length frames)
            features = features[:, :target_length]
        
        return features
    
    def create_mask(self, features: np.ndarray, target_length: int) -> np.ndarray:
        """
        Create attention mask for padded features
        
        Returns:
            Mask: (target_length,) - 1 for real frames, 0 for padding
        """
        current_length = features.shape[1]
        
        mask = np.zeros(target_length)
        mask[:min(current_length, target_length)] = 1
        
        return mask
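
A quick usage sketch with made-up shapes:

handler = VariableLengthHandler()

feats = np.random.randn(40, 312)  # stand-in for real MFCCs (40 coeffs, 312 frames)

padded = handler.pad_or_truncate(feats, target_length=500)
mask = handler.create_mask(feats, target_length=500)

print(padded.shape)     # (40, 500)
print(int(mask.sum()))  # 312 real frames; the rest is padding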

Strategy 2: Temporal Pooling

class TemporalPooler:
    """
    Pool variable-length features to fixed size
    """
    
    def mean_pool(self, features: np.ndarray) -> np.ndarray:
        """
        Average pool over time
        
        Args:
            features: (n_features, time)
        
        Returns:
            Pooled: (n_features,)
        """
        return np.mean(features, axis=1)
    
    def max_pool(self, features: np.ndarray) -> np.ndarray:
        """Max pool over time"""
        return np.max(features, axis=1)
    
    def stats_pool(self, features: np.ndarray) -> np.ndarray:
        """
        Statistical pooling: mean + std
        
        Returns:
            Pooled: (n_features * 2,)
        """
        mean = np.mean(features, axis=1)
        std = np.std(features, axis=1)
        
        return np.concatenate([mean, std])
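
For example, statistical pooling turns MFCC sequences of any length into a single fixed-size utterance vector (a common trick in speaker recognition):

pooler = TemporalPooler()
mfcc_extractor = MFCCExtractor()

mfccs = mfcc_extractor.extract(audio)     # (40, time) - any duration
utterance_vec = pooler.stats_pool(mfccs)  # (80,) - fixed size regardless of duration
print(utterance_vec.shape)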

Real-Time Feature Extraction

Streaming applications need incremental feature extraction.

Streaming Feature Extractor

from collections import deque
from typing import Optional

class StreamingFeatureExtractor:
    """
    Extract features from streaming audio
    
    Use case: Real-time ASR, voice assistants
    """
    
    def __init__(
        self,
        sr=16000,
        frame_length_ms=25,
        hop_length_ms=10,
        buffer_duration_ms=500
    ):
        self.sr = sr
        self.frame_length = int(sr * frame_length_ms / 1000)
        self.hop_length = int(sr * hop_length_ms / 1000)
        self.buffer_length = int(sr * buffer_duration_ms / 1000)
        
        # Circular buffer for audio
        self.buffer = deque(maxlen=self.buffer_length)
        
        # Feature extractor
        self.extractor = MFCCExtractor(
            sr=sr,
            hop_length=self.hop_length
        )
    
    def add_audio_chunk(self, audio_chunk: np.ndarray):
        """
        Add new audio chunk to buffer
        
        Args:
            audio_chunk: New audio samples
        """
        self.buffer.extend(audio_chunk)
    
    def extract_latest(self) -> Optional[np.ndarray]:
        """
        Extract features from current buffer
        
        Returns:
            Features or None if buffer too small
        """
        if len(self.buffer) < self.frame_length:
            return None
        
        # Convert buffer to array
        audio = np.array(self.buffer)
        
        # Extract features
        features = self.extractor.extract(audio)
        
        return features
    
    def reset(self):
        """Clear buffer"""
        self.buffer.clear()

# Usage
streaming_extractor = StreamingFeatureExtractor()

# Simulate streaming (100ms chunks)
chunk_size = 1600  # 100ms at 16kHz

for i in range(0, len(audio), chunk_size):
    chunk = audio[i:i+chunk_size]
    
    # Add to buffer
    streaming_extractor.add_audio_chunk(chunk)
    
    # Extract features
    features = streaming_extractor.extract_latest()
    
    if features is not None:
        print(f"Chunk {i//chunk_size}: features shape = {features.shape}")
        # Process features (send to model, etc.)

Performance Optimization

1. Caching Features

import os
import pickle
import hashlib

class CachedFeatureExtractor:
    """
    Cache extracted features to disk
    
    Avoid re-extracting for same audio
    """
    
    def __init__(self, extractor: AudioFeatureExtractor, cache_dir='./feature_cache'):
        self.extractor = extractor
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
    
    def _get_cache_path(self, audio_path: str) -> str:
        """Generate cache file path based on audio path hash"""
        path_hash = hashlib.md5(audio_path.encode()).hexdigest()
        return os.path.join(self.cache_dir, f"{path_hash}.pkl")
    
    def extract_from_file(self, audio_path: str, use_cache=True) -> Dict[str, np.ndarray]:
        """
        Extract features with caching
        """
        cache_path = self._get_cache_path(audio_path)
        
        # Check cache
        if use_cache and os.path.exists(cache_path):
            with open(cache_path, 'rb') as f:
                features = pickle.load(f)
            return features
        
        # Extract features
        features = self.extractor.extract_from_file(audio_path)
        
        # Save to cache
        with open(cache_path, 'wb') as f:
            pickle.dump(features, f)
        
        return features
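
Usage, with one caveat: the cache key hashes only the file path, so stale features will be served if the extraction config changes - include a config version in the key for production use.

cached = CachedFeatureExtractor(extractor)

features = cached.extract_from_file('speech.wav')        # extracts and writes cache
features_again = cached.extract_from_file('speech.wav')  # served from disk cache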

2. Parallel Processing

from multiprocessing import Pool

class ParallelFeatureExtractor:
    """
    Extract features from multiple files in parallel
    """
    
    def __init__(self, extractor: AudioFeatureExtractor, n_workers=4):
        self.extractor = extractor
        self.n_workers = n_workers
    
    def extract_from_files(self, audio_paths: List[str]) -> List[Dict[str, np.ndarray]]:
        """
        Extract features from multiple files in parallel
        """
        with Pool(self.n_workers) as pool:
            features_list = pool.map(
                self.extractor.extract_from_file,
                audio_paths
            )
        
        return features_list

# Usage
parallel_extractor = ParallelFeatureExtractor(extractor, n_workers=8)
audio_files = ['file1.wav', 'file2.wav', ...]  # 1000s of files
features = parallel_extractor.extract_from_files(audio_files)

Advanced Feature Types

1. Learned Features (Embeddings)

Instead of hand-crafted features, learn representations from data.

import torch
import torch.nn as nn

class AudioEmbeddingExtractor(nn.Module):
    """
    Extract learned audio embeddings
    
    Use pre-trained models (wav2vec, HuBERT) as feature extractors
    """
    
    def __init__(self, model_name='facebook/wav2vec2-base'):
        super().__init__()
        from transformers import Wav2Vec2Model
        
        # Load pre-trained model
        self.model = Wav2Vec2Model.from_pretrained(model_name)
        self.model.eval()  # Freeze for feature extraction
    
    def extract(self, audio: np.ndarray, sr=16000) -> np.ndarray:
        """
        Extract contextualized embeddings
        
        Returns:
            Embeddings: (time_steps, hidden_dim)
                typically (time, 768) for base model
        """
        # Convert to tensor
        audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)
        
        # Extract features
        with torch.no_grad():
            outputs = self.model(audio_tensor)
            embeddings = outputs.last_hidden_state[0]  # (time, 768)
        
        return embeddings.numpy()

# Usage: pre-trained embeddings typically transfer far better than MFCCs
embedding_extractor = AudioEmbeddingExtractor()
embeddings = embedding_extractor.extract(audio)
print(f"Embeddings shape: {embeddings.shape}")  # (time, 768)

Comparison:

| Feature Type       | Dimension | Training Required | Transfer Learning | Accuracy |
|--------------------|-----------|-------------------|-------------------|----------|
| MFCCs              | 40-120    | No                | Poor              | Baseline |
| Mel-spectrogram    | 80-128    | No                | Good              | +5-10%   |
| Wav2Vec embeddings | 768       | Yes (pre-trained) | Excellent         | +15-25%  |

2. Filter Bank Features (FBank)

Alternative to MFCCs - skip the DCT step.

class FilterbankExtractor:
    """
    Extract log mel-filterbank features
    
    Similar to mel-spectrograms, popular in modern ASR
    """
    
    def __init__(self, sr=16000, n_mels=80, n_fft=512, hop_length=160):
        self.sr = sr
        self.n_mels = n_mels
        self.n_fft = n_fft
        self.hop_length = hop_length
    
    def extract(self, audio: np.ndarray) -> np.ndarray:
        """
        Extract log filter bank energies
        
        Returns:
            FBank: (n_mels, time_steps)
        """
        # Mel spectrogram
        mel_spec = librosa.feature.melspectrogram(
            y=audio,
            sr=self.sr,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            n_mels=self.n_mels
        )
        
        # Log
        log_mel = librosa.power_to_db(mel_spec, ref=np.max)
        
        return log_mel

# FBank vs MFCC:
# - FBank: Keep all mel bins (80-128)
# - MFCC: Compress to 13-40 via DCT
# 
# FBank often works better with neural networks

3. Prosodic Features

Capture rhythm, stress, and intonation.

class ProsodicFeatureExtractor:
    """
    Extract prosodic features for emotion, speaker ID, etc.
    """
    
    def extract_intensity_contour(self, audio, sr=16000, hop_length=160):
        """
        Intensity (loudness) over time
        """
        intensity = librosa.feature.rms(y=audio, hop_length=hop_length)[0]
        
        # Convert to dB
        intensity_db = librosa.amplitude_to_db(intensity, ref=np.max)
        
        return intensity_db
    
    def extract_speaking_rate(self, audio, sr=16000):
        """
        Estimate speaking rate (syllables per second)
        
        Approximation: count peaks in energy envelope
        """
        # Energy envelope
        energy = librosa.feature.rms(y=audio, hop_length=160)[0]
        
        # Find peaks (local maxima)
        from scipy.signal import find_peaks
        
        peaks, _ = find_peaks(energy, distance=10, prominence=0.1)
        
        # Speaking rate
        duration = len(audio) / sr
        syllables_per_sec = len(peaks) / duration
        
        return syllables_per_sec
    
    def extract_all_prosodic(self, audio, sr=16000):
        """Extract all prosodic features"""
        
        # Pitch
        pitch_extractor = PitchExtractor(sr=sr)
        pitch_stats = pitch_extractor.extract_pitch_features(audio)
        
        # Intensity
        intensity = self.extract_intensity_contour(audio, sr)
        
        # Speaking rate
        speaking_rate = self.extract_speaking_rate(audio, sr)
        
        return {
            **pitch_stats,
            'mean_intensity': np.mean(intensity),
            'std_intensity': np.std(intensity),
            'speaking_rate': speaking_rate
        }
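
Usage:

prosody_extractor = ProsodicFeatureExtractor()
prosodic_features = prosody_extractor.extract_all_prosodic(audio, sr=16000)
print(prosodic_features)
# {'mean_pitch': ..., 'std_pitch': ..., 'mean_intensity': ..., 'speaking_rate': ...}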

Feature Quality & Validation

Ensure extracted features are high quality.

Feature Quality Metrics

class FeatureQualityChecker:
    """
    Validate quality of extracted features
    """
    
    def check_for_nans(self, features: Dict[str, np.ndarray]) -> bool:
        """Check for NaN/Inf values"""
        for name, feat in features.items():
            if np.isnan(feat).any() or np.isinf(feat).any():
                print(f"⚠️  {name} contains NaN/Inf")
                return False
        return True
    
    def check_dynamic_range(self, features: Dict[str, np.ndarray]) -> Dict[str, float]:
        """
        Check dynamic range of features
        
        Low dynamic range → feature not informative
        """
        ranges = {}
        
        for name, feat in features.items():
            feat_range = feat.max() - feat.min()
            ranges[name] = feat_range
            
            if feat_range < 1e-6:
                print(f"⚠️  {name} has very low dynamic range: {feat_range}")
        
        return ranges
    
    def check_feature_statistics(self, features_batch: List[np.ndarray]):
        """
        Check statistics across batch
        
        Ensure features are properly normalized
        """
        # Stack all features
        all_features = np.concatenate(features_batch, axis=1)  # (n_features, total_time)
        
        # Per-feature statistics
        mean_per_feature = np.mean(all_features, axis=1)
        std_per_feature = np.std(all_features, axis=1)
        
        print("Feature Statistics:")
        print(f"  Mean range: [{mean_per_feature.min():.3f}, {mean_per_feature.max():.3f}]")
        print(f"  Std range: [{std_per_feature.min():.3f}, {std_per_feature.max():.3f}]")
        
        # Check if normalized
        if np.abs(mean_per_feature).max() > 0.1:
            print("⚠️  Features not centered (mean far from 0)")
        
        if np.abs(std_per_feature - 1.0).max() > 0.2:
            print("⚠️  Features not standardized (std far from 1)")

Connection to Data Preprocessing Pipeline

Feature extraction for speech is analogous to data preprocessing for ML systems (see Day 3 ML).

Parallel Concepts

| Speech Feature Extraction          | ML Data Preprocessing            |
|------------------------------------|----------------------------------|
| Handle missing audio               | Handle missing values            |
| Normalize features (mean=0, std=1) | Normalize numerical features     |
| Pad/truncate variable length       | Handle variable-length sequences |
| Validate audio quality             | Schema validation                |
| Cache extracted features           | Cache preprocessed data          |
| Batch processing                   | Distributed data processing      |

Unified Preprocessing Framework

class UnifiedPreprocessor:
    """
    Combined preprocessing for multimodal ML
    
    Example: Speech + text + metadata
    """
    
    def __init__(self):
        # Audio features
        self.audio_extractor = AudioFeatureExtractor(
            FeatureConfig(feature_types=['mfcc', 'mel'])
        )
        
        # Text features (from transcripts)
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.text_vectorizer = TfidfVectorizer(max_features=1000)
        
        # Numerical features
        from sklearn.preprocessing import StandardScaler
        self.numerical_scaler = StandardScaler()
    
    def preprocess_sample(self, audio, text, metadata):
        """
        Preprocess multimodal sample
        
        Args:
            audio: Audio waveform
            text: Transcript or description
            metadata: User/item metadata (dict)
        
        Returns:
            Combined feature vector
        """
        # Extract audio features (extract() returns MFCCs with deltas: 120 dims)
        audio_features = self.audio_extractor.extract(audio)
        audio_pooled = np.mean(audio_features['mfcc'], axis=1)  # (120,)
        
        # Extract text features (vectorizer assumed fitted on the training corpus)
        text_features = self.text_vectorizer.transform([text]).toarray()[0]  # (1000,)
        
        # Process metadata (assumes values are already numeric and the
        # scaler was fitted on training data)
        metadata_array = np.array([
            metadata['user_age'],
            metadata['user_gender'],
            metadata['device_type']
        ])
        metadata_scaled = self.numerical_scaler.transform([metadata_array])[0]
        
        # Concatenate all features
        combined = np.concatenate([
            audio_pooled,      # (120,)
            text_features,     # (1000,)
            metadata_scaled    # (3,)
        ])  # Total: (1123,)
        
        return combined

Production Best Practices

1. Feature Versioning

Track feature extraction versions for reproducibility.

from datetime import datetime

class VersionedFeatureExtractor:
    """
    Version feature extraction logic
    
    Critical for:
    - A/B testing different features
    - Rollback if new features hurt performance
    - Reproducibility
    """
    
    VERSION = "1.2.0"
    
    def __init__(self, config: FeatureConfig):
        self.config = config
        self.extractor = AudioFeatureExtractor(config)
    
    def extract_with_metadata(self, audio_path: str):
        """
        Extract features with version metadata
        """
        features = self.extractor.extract_from_file(audio_path)
        
        metadata = {
            'version': self.VERSION,
            'config': self.config.__dict__,
            'timestamp': datetime.now().isoformat(),
            'audio_path': audio_path
        }
        
        return {
            'features': features,
            'metadata': metadata
        }
    
    def save_features(self, features, output_path):
        """Save features with version info"""
        np.savez_compressed(
            output_path,
            **features['features'],
            metadata=json.dumps(features['metadata'])
        )

2. Error Handling

Robust feature extraction handles failures gracefully.

import logging

logger = logging.getLogger(__name__)

class RobustFeatureExtractor:
    """
    Feature extractor with error handling
    """
    
    def __init__(self, extractor: AudioFeatureExtractor):
        self.extractor = extractor
    
    def extract_safe(self, audio_path: str) -> Optional[Dict]:
        """
        Extract features with error handling
        """
        try:
            # Load audio
            audio, sr = librosa.load(audio_path, sr=self.extractor.config.sr)
            
            # Validate
            if len(audio) == 0:
                logger.warning(f"Empty audio: {audio_path}")
                return None
            
            if len(audio) < self.extractor.config.sr * 0.1:  # < 100ms
                logger.warning(f"Audio too short: {audio_path}")
                return None
            
            # Extract
            features = self.extractor.extract(audio)
            
            # Quality check
            quality_checker = FeatureQualityChecker()
            if not quality_checker.check_for_nans(features):
                logger.error(f"Feature extraction failed (NaN): {audio_path}")
                return None
            
            return features
        
        except Exception as e:
            logger.error(f"Feature extraction error for {audio_path}: {e}")
            return None
    
    def extract_batch_robust(self, audio_paths: List[str]) -> List[Dict]:
        """
        Extract from batch, skipping failures
        """
        results = []
        failures = []
        
        for path in audio_paths:
            features = self.extract_safe(path)
            if features is not None:
                results.append({'path': path, 'features': features})
            else:
                failures.append(path)
        
        success_rate = len(results) / len(audio_paths)
        logger.info(f"Feature extraction: {len(results)}/{len(audio_paths)} succeeded ({success_rate:.1%})")
        
        if failures:
            logger.warning(f"Failed files: {failures[:10]}")  # Log first 10
        
        return results

3. Monitoring Feature Quality

Track feature statistics over time to detect issues.

class FeatureMonitor:
    """
    Monitor feature quality in production
    """
    
    def __init__(self, expected_stats: Dict[str, Dict]):
        """
        Args:
            expected_stats: Expected statistics per feature type
                {
                    'mfcc': {'mean_range': [-5, 5], 'std_range': [0.5, 2.0]},
                    'mel': {'mean_range': [-80, 0], 'std_range': [10, 30]}
                }
        """
        self.expected_stats = expected_stats
    
    def validate_features(self, features: Dict[str, np.ndarray]) -> List[str]:
        """
        Validate extracted features against expected statistics
        
        Returns:
            List of warnings
        """
        warnings = []
        
        for feat_name, feat_values in features.items():
            if feat_name not in self.expected_stats:
                continue
            
            expected = self.expected_stats[feat_name]
            
            # Check mean
            actual_mean = np.mean(feat_values)
            expected_mean_range = expected['mean_range']
            
            if not (expected_mean_range[0] <= actual_mean <= expected_mean_range[1]):
                warnings.append(
                    f"{feat_name}: mean {actual_mean:.2f} outside expected range {expected_mean_range}"
                )
            
            # Check std
            actual_std = np.std(feat_values)
            expected_std_range = expected['std_range']
            
            if not (expected_std_range[0] <= actual_std <= expected_std_range[1]):
                warnings.append(
                    f"{feat_name}: std {actual_std:.2f} outside expected range {expected_std_range}"
                )
        
        return warnings
    
    def compute_statistics(self, features_batch: List[Dict[str, np.ndarray]]):
        """
        Compute statistics across batch
        
        Use to establish baseline expected_stats
        """
        stats = {}
        
        # Get feature names from first sample
        feature_names = features_batch[0].keys()
        
        for feat_name in feature_names:
            # Collect all values
            all_values = np.concatenate([
                f[feat_name].flatten() for f in features_batch
            ])
            
            stats[feat_name] = {
                'mean': np.mean(all_values),
                'std': np.std(all_values),
                'min': np.min(all_values),
                'max': np.max(all_values),
                'percentiles': {
                    '25': np.percentile(all_values, 25),
                    '50': np.percentile(all_values, 50),
                    '75': np.percentile(all_values, 75),
                    '95': np.percentile(all_values, 95)
                }
            }
        
        return stats
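
A usage sketch (the expected ranges below are placeholders - derive real ones by running compute_statistics on a known-good baseline batch):

monitor = FeatureMonitor(expected_stats={
    'mfcc': {'mean_range': [-1.0, 1.0], 'std_range': [0.5, 2.0]},  # placeholder ranges
})

warnings = monitor.validate_features(features)
for w in warnings:
    print(w)  # log/alert in production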

Data Augmentation in Feature Space

Augment features directly for training robustness.

SpecAugment

class SpecAugment:
    """
    SpecAugment: Data augmentation on spectrograms
    
    Proposed in "SpecAugment: A Simple Data Augmentation Method for ASR" (Google, 2019)
    
    Improves ASR accuracy by 10-20% on many benchmarks
    """
    
    def __init__(
        self,
        time_mask_param=70,
        freq_mask_param=15,
        num_time_masks=2,
        num_freq_masks=2
    ):
        self.time_mask_param = time_mask_param
        self.freq_mask_param = freq_mask_param
        self.num_time_masks = num_time_masks
        self.num_freq_masks = num_freq_masks
    
    def time_mask(self, spec: np.ndarray) -> np.ndarray:
        """
        Mask random time region
        
        Sets random time frames to zero
        """
        spec = spec.copy()
        time_length = spec.shape[1]
        
        for _ in range(self.num_time_masks):
            t = np.random.randint(0, min(self.time_mask_param, time_length))
            t0 = np.random.randint(0, time_length - t)
            spec[:, t0:t0+t] = 0
        
        return spec
    
    def freq_mask(self, spec: np.ndarray) -> np.ndarray:
        """
        Mask random frequency region
        
        Sets random frequency bins to zero
        """
        spec = spec.copy()
        freq_length = spec.shape[0]
        
        for _ in range(self.num_freq_masks):
            f = np.random.randint(0, min(self.freq_mask_param, freq_length))
            f0 = np.random.randint(0, freq_length - f)
            spec[f0:f0+f, :] = 0
        
        return spec
    
    def augment(self, spec: np.ndarray) -> np.ndarray:
        """Apply both time and freq masking"""
        spec = self.time_mask(spec)
        spec = self.freq_mask(spec)
        return spec

# Usage during training (illustrative - train_loader and train_model are assumed defined elsewhere)
augmenter = SpecAugment()

for audio, label in train_loader:
    # Extract features
    mel_spec = mel_extractor.extract(audio)
    
    # Augment (training only - never at inference time)
    mel_spec_aug = augmenter.augment(mel_spec)
    
    # Train model
    train_model(mel_spec_aug, label)

Batch Feature Extraction for Training

Extract features for entire dataset efficiently.

Batch Extraction Pipeline

import logging
from pathlib import Path
from tqdm import tqdm
import h5py

logger = logging.getLogger(__name__)

class BatchFeatureExtractor:
    """
    Extract features for large audio datasets
    
    Use case: Prepare training data
    - Extract once, train many times
    - Save features to disk (HDF5 format)
    """
    
    def __init__(self, extractor: AudioFeatureExtractor, n_workers=8):
        self.extractor = extractor
        self.n_workers = n_workers  # reserved for a parallel variant; the loop below is sequential
    
    def extract_dataset(
        self,
        audio_dir: str,
        output_path: str,
        max_length_frames: int = 1000
    ):
        """
        Extract features for all audio files in directory
        
        Args:
            audio_dir: Directory containing .wav files
            output_path: HDF5 file to save features
            max_length_frames: Pad/truncate to this length
        """
        # Find all audio files
        audio_files = list(Path(audio_dir).rglob('*.wav'))
        print(f"Found {len(audio_files)} audio files")
        
        # Create HDF5 file
        with h5py.File(output_path, 'w') as hf:
            # Pre-allocate datasets
            # (here a single dataset holds the MFCC+delta features)
            feature_dim = self.extractor.config.n_mfcc * 3  # MFCCs + deltas
            
            features_dataset = hf.create_dataset(
                'features',
                shape=(len(audio_files), feature_dim, max_length_frames),
                dtype='float32'
            )
            
            lengths_dataset = hf.create_dataset(
                'lengths',
                shape=(len(audio_files),),
                dtype='int32'
            )
            
            # Store file paths
            paths_dataset = hf.create_dataset(
                'paths',
                shape=(len(audio_files),),
                dtype=h5py.string_dtype()
            )
            
            # Extract features
            for idx, audio_path in enumerate(tqdm(audio_files)):
                try:
                    # Load audio
                    audio, sr = librosa.load(str(audio_path), sr=self.extractor.config.sr)
                    
                    # Extract features
                    features = self.extractor.extract(audio)
                    
                    # Get MFCCs with deltas
                    mfcc_deltas = features['mfcc']  # (120, time)
                    
                    # Pad or truncate
                    handler = VariableLengthHandler()
                    mfcc_fixed = handler.pad_or_truncate(mfcc_deltas, max_length_frames)
                    
                    # Store
                    features_dataset[idx] = mfcc_fixed
                    lengths_dataset[idx] = min(mfcc_deltas.shape[1], max_length_frames)
                    paths_dataset[idx] = str(audio_path)
                
                except Exception as e:
                    logger.error(f"Failed to process {audio_path}: {e}")
                    # Store zeros for failed files
                    features_dataset[idx] = np.zeros((feature_dim, max_length_frames))
                    lengths_dataset[idx] = 0
                    paths_dataset[idx] = str(audio_path)
        
        print(f"Features saved to {output_path}")

# Usage
batch_extractor = BatchFeatureExtractor(extractor, n_workers=8)
batch_extractor.extract_dataset(
    audio_dir='./data/train/',
    output_path='./features/train_features.h5',
    max_length_frames=1000
)

# Load for training
with h5py.File('./features/train_features.h5', 'r') as hf:
    features = hf['features'][:]  # (N, feature_dim, max_length)
    lengths = hf['lengths'][:]    # (N,)
    paths = hf['paths'][:]        # (N,)

Real-World Systems

Kaldi: Traditional ASR Feature Pipeline

Kaldi is the industry standard for traditional ASR.

Feature extraction:

# Kaldi feature extraction (MFCC + pitch)
compute-mfcc-feats --config=conf/mfcc.conf scp:wav.scp ark:mfcc.ark
compute-and-process-kaldi-pitch-feats scp:wav.scp ark:pitch.ark

# Combine features
paste-feats ark:mfcc.ark ark:pitch.ark ark:features.ark

Configuration (mfcc.conf):

--use-energy=true
--num-mel-bins=40
--num-ceps=40
--low-freq=20
--high-freq=8000
--sample-frequency=16000

PyTorch: Modern Deep Learning Pipeline

import torchaudio
import torch

class TorchAudioExtractor:
    """
    Feature extraction using torchaudio
    
    Benefits:
    - GPU acceleration
    - Differentiable (can backprop through features)
    - Integrated with PyTorch training
    """
    
    def __init__(self, sr=16000, n_mfcc=40, n_mels=80):
        self.sr = sr
        self.n_mfcc = n_mfcc
        self.n_mels = n_mels
        
        # Create transforms (can move to GPU)
        self.mfcc_transform = torchaudio.transforms.MFCC(
            sample_rate=sr,
            n_mfcc=n_mfcc,
            melkwargs={'n_mels': 40, 'n_fft': 512, 'hop_length': 160}
        )
        
        self.mel_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr,
            n_fft=512,
            hop_length=160,
            n_mels=n_mels
        )
        
        # Amplitude → dB conversion
        self.db_transform = torchaudio.transforms.AmplitudeToDB()
    
    def to(self, device):
        """
        Move transforms to a device (CPU/GPU) and return self.
        """
        self.mfcc_transform = self.mfcc_transform.to(device)
        self.mel_transform = self.mel_transform.to(device)
        self.db_transform = self.db_transform.to(device)
        return self
    
    def extract(self, audio: torch.Tensor) -> Dict[str, torch.Tensor]:
        """
        Extract features (GPU-accelerated if audio on GPU)
        
        Args:
            audio: (batch, time) or (time,)
        
        Returns:
            Dictionary of features
        """
        if audio.ndim == 1:
            audio = audio.unsqueeze(0)  # Add batch dimension
        
        # Extract
        mfccs = self.mfcc_transform(audio)  # (batch, n_mfcc, time)
        mel = self.mel_transform(audio)     # (batch, n_mels, time)
        mel_db = self.db_transform(mel)
        
        return {
            'mfcc': mfccs,
            'mel': mel_db
        }

# Usage with GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

extractor = TorchAudioExtractor().to(device)

# Load audio (resample if the file isn't already 16kHz)
audio, sr = torchaudio.load('speech.wav')
if sr != 16000:
    audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)
audio = audio.to(device)

# Extract (on GPU)
features = extractor.extract(audio)

Google: Production ASR Feature Extraction

Stack:

  • Input: 16kHz audio
  • Features: 80-bin log mel-filterbank
  • Augmentation: SpecAugment
  • Normalization: Per-utterance mean/variance normalization
  • Model: Transformer encoder-decoder

Key optimizations:

  • Precompute features for training data
  • On-the-fly extraction for inference
  • GPU-accelerated extraction for real-time systems

Choosing the Right Features

Different tasks need different features.

Feature Selection Guide

| Task                     | Best Features           | Why                                 |
|--------------------------|-------------------------|-------------------------------------|
| ASR (traditional)        | MFCCs + deltas          | Capture phonetic content            |
| ASR (deep learning)      | Mel-spectrograms        | Work well with CNNs                 |
| Speaker recognition      | MFCCs + pitch + prosody | Speaker identity in pitch/prosody   |
| Emotion recognition      | Prosodic + spectral     | Emotion in prosody + voice quality  |
| Keyword spotting         | Mel-spectrograms        | Simple, fast with CNNs              |
| Speech enhancement       | STFT magnitude + phase  | Phase needed for reconstruction     |
| Voice activity detection | Energy + ZCR            | Simple features suffice             |

Combining Features

class MultiFeatureExtractor:
    """
    Combine multiple feature types
    
    Different features capture different aspects
    """
    
    def __init__(self):
        self.mfcc_ext = MFCCExtractor()
        self.pitch_ext = PitchExtractor()
        self.prosody_ext = ProsodicFeatureExtractor()
    
    def extract_combined(self, audio):
        """
        Extract and combine multiple feature types
        """
        # MFCCs (40, time)
        mfccs = self.mfcc_ext.extract(audio)
        
        # Pitch (time,)
        pitch, voiced = self.pitch_ext.extract_f0(audio)
        pitch = pitch.reshape(1, -1)  # (1, time)
        
        # Energy (1, time)
        energy = librosa.feature.rms(y=audio, hop_length=160)
        
        # Align all features to same time dimension
        min_time = min(mfccs.shape[1], pitch.shape[1], energy.shape[1])
        
        mfccs = mfccs[:, :min_time]
        pitch = pitch[:, :min_time]
        energy = energy[:, :min_time]
        
        # Stack
        combined = np.vstack([mfccs, pitch, energy])  # (42, time)
        
        return combined
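
Usage:

multi_extractor = MultiFeatureExtractor()
combined = multi_extractor.extract_combined(audio)
print(f"Combined features shape: {combined.shape}")  # (42, time)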

Key Takeaways

  • MFCCs are the standard for speech recognition - compact and robust
  • Mel-spectrograms work better with deep learning (CNNs, Transformers)
  • Delta features capture temporal dynamics - critical for accuracy
  • Normalize features for stable training (mean=0, std=1)
  • Handle variable length with padding, pooling, or attention masks
  • Cache features for repeated use - a major speedup in training
  • Streaming extraction is possible with circular buffers
  • Parallel processing speeds up batch feature extraction
  • SpecAugment improves robustness through feature-space augmentation
  • Monitor feature quality to detect pipeline issues early
  • Version features for reproducibility and A/B testing
  • Choose features based on the task - there is no one-size-fits-all


Originally published at: arunbaby.com/speech-tech/0003-audio-feature-extraction

If you found this helpful, consider sharing it with others who might benefit.