14 minute read

“Teaching machines to hear feelings.”

TL;DR

Speech Emotion Recognition (SER) identifies emotional states from voice using acoustic features like pitch, energy, and spectral patterns. Modern approaches fine-tune self-supervised models like Wav2Vec2 on emotion datasets, achieving strong results with limited labeled data. Multimodal fusion combining audio with text consistently outperforms single-modality approaches. Production deployment requires handling class imbalance (neutral dominates), speaker-independent evaluation, and real-time streaming with sub-500ms latency. IEMOCAP remains the gold standard dataset. SER connects to broader voice agent architectures where emotional awareness enables empathetic responses, and benefits from upstream speech enhancement in noisy environments.

Image: a spectrogram rendered as an emotional heat map, shading from cool blue to hot.

1. Introduction

Speech Emotion Recognition (SER) is the task of identifying the emotional state of a speaker from their voice.

Emotions Typically Recognized:

  • Basic: Happy, Sad, Angry, Fear, Disgust, Surprise, Neutral.
  • Dimensional: Valence (positive/negative), Arousal (activation level), Dominance.

Applications:

  • Customer Service: Detect frustrated callers, route to specialists.
  • Mental Health: Monitor emotional state over time.
  • Human-Robot Interaction: Empathetic responses.
  • Gaming: Adaptive game difficulty based on player emotion.
  • Automotive: Detect driver stress or drowsiness.

2. Challenges in SER

1. Subjectivity:

  • Same utterance can be perceived differently.
  • Cultural differences in emotional expression.

2. Speaker Variability:

  • Emotional expression varies by person.
  • Age, gender, and language effects.

3. Context Dependency:

  • “Really?” can be surprised, sarcastic, or angry.
  • Need context to disambiguate.

4. Data Scarcity:

  • Labeled emotional speech is expensive to collect.
  • Acted vs spontaneous speech differs.

5. Class Imbalance:

  • Neutral is often dominant.
  • Extreme emotions (rage, despair) are rare.

3. Acoustic Features for SER

3.1. Prosodic Features

Pitch (F0):

  • Higher pitch → excitement, anger.
  • Lower pitch → sadness, boredom.

Energy:

  • Higher energy → anger, happiness.
  • Lower energy → sadness.

Speaking Rate:

  • Faster → excitement, nervousness.
  • Slower → sadness, hesitation.
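
A minimal sketch of pulling these prosodic cues out with librosa; the pyin F0 statistics and RMS energy statistics below are rough utterance-level proxies, not a standardized feature set:

import librosa
import numpy as np

def prosodic_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)

    # F0 contour via probabilistic YIN; unvoiced frames come back as NaN
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr
    )
    f0 = f0[~np.isnan(f0)]

    # Frame-level energy
    rms = librosa.feature.rms(y=y)[0]

    return {
        "f0_mean": float(np.mean(f0)) if len(f0) else 0.0,
        "f0_std": float(np.std(f0)) if len(f0) else 0.0,
        "energy_mean": float(np.mean(rms)),
        "energy_std": float(np.std(rms)),
    }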

3.2. Spectral Features

MFCCs:

  • Standard speech features.
  • 13-40 coefficients + deltas.

Mel Spectrogram:

  • Raw input for CNNs.
  • Captures timbral qualities.

Formants:

  • Vowel quality changes with emotion.
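
A quick extraction sketch with librosa (40 coefficients plus first- and second-order deltas; the file path is a placeholder):

import librosa
import numpy as np

y, sr = librosa.load("audio.wav", sr=16000)

# 40 MFCCs per frame, plus delta and delta-delta -> 120-dim frame vectors
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.concatenate([mfcc, delta, delta2], axis=0)  # shape: (120, num_frames)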

3.3. Voice Quality Features

Jitter and Shimmer:

  • Irregularities in pitch and amplitude.
  • Higher in stressed/emotional speech.

Harmonic-to-Noise Ratio (HNR):

  • Clarity of voice.
  • Lower in breathy or tense speech.
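
Jitter and shimmer are usually computed with Praat-style algorithms (for example through the parselmouth package); the sketch below is a rough frame-level approximation built from the F0 and energy contours of the earlier prosodic snippet, not the Praat definition:

import numpy as np

def rough_jitter_shimmer(f0, rms):
    # f0: voiced-frame pitch values (Hz), rms: frame energies; assumes >= 2 frames each
    f0 = f0[~np.isnan(f0)]
    jitter = np.mean(np.abs(np.diff(f0))) / (np.mean(f0) + 1e-8)    # relative F0 perturbation
    shimmer = np.mean(np.abs(np.diff(rms))) / (np.mean(rms) + 1e-8) # relative amplitude perturbation
    return jitter, shimmer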

4. Traditional ML Approaches

4.1. Feature Extraction + Classifier

Pipeline:

  1. Extract hand-crafted features (openSMILE).
  2. Train SVM, Random Forest, or GMM.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Extract features (using openSMILE or librosa)
X_train = extract_features(train_audio)
X_test = extract_features(test_audio)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train SVM
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X_train, y_train)

# Predict
predictions = clf.predict(X_test)

4.2. openSMILE Features

openSMILE extracts thousands of features:

  • eGeMAPS: 88 features (standardized for emotion).
  • ComParE: 6373 features (comprehensive).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file('audio.wav')

5. Deep Learning Approaches

5.1. CNN on Spectrograms

Architecture:

  1. Convert audio to mel spectrogram.
  2. Treat as image, apply 2D CNN.
  3. Global pooling + dense layers.
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

5.2. LSTM/GRU on Sequences

Architecture:

  1. Extract frame-level features (MFCCs).
  2. Feed to bidirectional LSTM.
  3. Attention or pooling over time.
import torch
import torch.nn.functional as F

class EmotionLSTM(nn.Module):
    def __init__(self, input_dim=40, hidden_dim=128, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attention = nn.Linear(hidden_dim * 2, 1)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x: (batch, time, features)
        lstm_out, _ = self.lstm(x)

        # Attention
        attn_weights = F.softmax(self.attention(lstm_out), dim=1)
        context = torch.sum(attn_weights * lstm_out, dim=1)

        return self.fc(context)

5.3. Transformer-Based Models

Using Pretrained Models:

  • Wav2Vec 2.0: Self-supervised audio representations.
  • HuBERT: Hidden unit BERT for speech.
  • WavLM: Microsoft’s large speech model.
from transformers import Wav2Vec2Model, Wav2Vec2Processor

class EmotionWav2Vec(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.classifier = nn.Linear(768, num_classes)

    def forward(self, input_values):
        outputs = self.wav2vec(input_values)
        hidden = outputs.last_hidden_state.mean(dim=1)
        return self.classifier(hidden)

6. Datasets

6.1. IEMOCAP

  • 12 hours of audiovisual data.
  • 5 sessions, 10 actors.
  • Emotions: Angry, Happy, Sad, Neutral, Excited, Frustrated.
  • Gold standard for SER research.

6.2. RAVDESS

  • 24 actors (12 male, 12 female).
  • 7 emotions + calm.
  • Acted speech and song.
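
RAVDESS encodes the label in the third dash-separated field of each filename (per the dataset's naming convention), so extracting emotions is a one-liner; a small sketch:

# RAVDESS filename fields: modality-channel-emotion-intensity-statement-repetition-actor
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(filename):
    # e.g. "03-01-05-01-02-01-12.wav" -> "angry"
    return RAVDESS_EMOTIONS[filename.split("-")[2]]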

6.3. CREMA-D

  • 7,442 clips from 91 actors.
  • 6 emotions.
  • Diverse ethnic backgrounds.

6.4. CMU-MOSEI

  • 23,453 video clips.
  • Multimodal: text, audio, video.
  • Sentiment and emotion labels.

6.5. EmoDB (German)

  • 535 utterances.
  • 10 actors, 7 emotions.
  • Classic dataset for SER.

7. Evaluation Metrics

Classification Metrics:

  • Accuracy: Overall correct predictions.
  • Weighted F1: Accounts for class imbalance.
  • Unweighted Accuracy (UA): Average recall across classes.
  • Confusion Matrix: Understand per-class performance.
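
A minimal sketch of computing these with scikit-learn, assuming y_true and y_pred are integer emotion labels:

from sklearn.metrics import f1_score, recall_score, confusion_matrix

# y_true, y_pred: hypothetical arrays of per-utterance emotion labels
weighted_f1 = f1_score(y_true, y_pred, average='weighted')
ua = recall_score(y_true, y_pred, average='macro')  # unweighted accuracy = mean per-class recall
cm = confusion_matrix(y_true, y_pred)               # rows: true class, columns: predicted class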

For Dimensional Emotions:

  • CCC (Concordance Correlation Coefficient): Agreement measure.
  • MSE/MAE: For valence/arousal prediction.

8. System Design: Call Center Emotion Analytics

Scenario: Detect customer emotions during support calls.

Requirements:

  • Real-time analysis.
  • Handle noisy telephony audio.
  • Alert supervisors on negative emotions.

Architecture:

┌─────────────────┐
│ Phone Call │
│ (Audio Stream)│
└────────┬────────┘
 │
┌────────▼────────┐
│ Voice Activity │
│ Detection │
└────────┬────────┘
 │
┌────────▼────────┐
│ Speaker │
│ Diarization │
└────────┬────────┘
 │
┌────────▼────────┐
│ Emotion │
│ Recognition │
└────────┬────────┘
 │
 ├──────────────┐
 │ │
┌────────▼────────┐ ┌▼────────────────┐
│ Dashboard │ │ Alert System │
│ (Real-time) │ │ (Supervisor) │
└─────────────────┘ └─────────────────┘

Implementation Details:

  • Process in 3-second windows.
  • Apply noise reduction first.
  • Track emotion trajectory over call.
  • Trigger alert if anger/frustration persists.
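
A minimal sketch of the persistence-based alerting logic, assuming window-level emotion labels arrive from the recognizer (the window count and threshold are placeholder values):

from collections import deque

NEGATIVE = {"angry", "frustrated"}

class EmotionAlerter:
    def __init__(self, window_count=5, threshold=0.6):
        # Keep the most recent window-level predictions for one call
        self.history = deque(maxlen=window_count)
        self.threshold = threshold

    def update(self, emotion_label):
        self.history.append(emotion_label)
        negative_ratio = sum(e in NEGATIVE for e in self.history) / len(self.history)
        # Alert the supervisor only once the history is full and mostly negative
        return len(self.history) == self.history.maxlen and negative_ratio >= self.threshold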

9. Multimodal Emotion Recognition

Combine modalities for better accuracy:

  • Audio: Voice, prosody.
  • Text: Transcribed words, sentiment.
  • Video: Facial expressions, body language.

9.1. Early Fusion

Concatenate features before classification:

audio_features = audio_encoder(audio)
text_features = text_encoder(text)
combined = torch.cat([audio_features, text_features], dim=1)
output = classifier(combined)

9.2. Late Fusion

Combine predictions from each modality:

audio_pred = audio_model(audio)
text_pred = text_model(text)
combined_pred = (audio_pred + text_pred) / 2

9.3. Cross-Modal Attention

Let modalities attend to each other:

import math

class CrossModalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x1, x2):
        # x1 attends to x2
        q = self.query(x1)
        k = self.key(x2)
        v = self.value(x2)

        attn = F.softmax(torch.bmm(q, k.transpose(1, 2)) / math.sqrt(q.size(-1)), dim=-1)
        return torch.bmm(attn, v)

10. Real-Time Considerations

Latency Requirements:

  • Call center: <500ms per segment.
  • Gaming: <100ms for responsiveness.

Optimization Strategies:

  1. Streaming: Process overlapping windows.
  2. Model Pruning: Reduce model size.
  3. Quantization: INT8 inference.
  4. GPU Batching: Process multiple calls together.
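
As one example, PyTorch's dynamic quantization converts linear and recurrent layers to INT8 in a couple of lines; a sketch, assuming model is an already trained classifier (verify the accuracy/latency trade-off on your own data):

import torch
import torch.nn as nn

# Weights of Linear and LSTM layers are quantized to INT8; activations stay float
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)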

11. Interview Questions

  1. Features for SER: What acoustic features capture emotion?
  2. IEMOCAP: Describe the dataset and common practices.
  3. Class Imbalance: How do you handle it in SER?
  4. Multimodal Fusion: Early vs late vs attention fusion?
  5. Real-Time Design: Design an emotion detector for virtual meetings.

12. Common Mistakes

  • Ignoring Speaker Effects: Emotional expression is highly speaker-dependent; always evaluate with speaker-independent splits.
  • Leaking Speakers: Letting the same speaker appear in both train and test inflates results (see the split sketch after this list).
  • Wrong Metrics: Use weighted/unweighted accuracy for imbalanced data.
  • Acted vs Spontaneous: Models trained on acted data fail on real speech.
  • Ignoring Context: Sentence-level emotion misses conversational dynamics.
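
To avoid speaker leakage, group the train/test split by speaker ID; a sketch with scikit-learn, assuming audio_paths, labels, and speaker_ids are parallel lists:

from sklearn.model_selection import GroupShuffleSplit

# Each utterance carries a speaker ID; no speaker appears in both splits
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(audio_paths, labels, groups=speaker_ids))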

13. Future Directions

1. Self-Supervised Pretraining:

  • Wav2Vec, HuBERT for emotion.
  • Less labeled data needed.

2. Personalized Emotion Recognition:

  • Adapt to individual expression patterns.
  • Few-shot learning.

3. Continuous Emotion Tracking:

  • Not discrete labels, but continuous trajectories.
  • Valence-arousal-dominance space.

4. Explainable SER:

  • Which parts of audio indicate emotion.
  • Attention visualization.

14. Conclusion

Speech Emotion Recognition is a challenging but impactful task. It requires understanding of both speech processing and machine learning.

Key Takeaways:

  • Features: Prosody, spectral, voice quality.
  • Models: CNN on spectrograms, LSTM on sequences, Transformers.
  • Data: IEMOCAP is the gold standard.
  • Evaluation: Weighted F1 for imbalanced classes.
  • Multimodal: Combining audio + text improves accuracy.

As AI becomes more empathetic, SER will be central to human-computer interaction. Master it to build systems that truly understand their users.

15. Training Pipeline

15.1. Data Preprocessing

import librosa
import numpy as np

def preprocess_audio(audio_path, target_sr=16000, max_duration=10):
    # Load audio
    audio, sr = librosa.load(audio_path, sr=target_sr)

    # Trim silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Pad or truncate
    max_samples = target_sr * max_duration
    if len(audio) > max_samples:
        audio = audio[:max_samples]
    else:
        audio = np.pad(audio, (0, max_samples - len(audio)))

    # Compute log mel spectrogram
    mel = librosa.feature.melspectrogram(
        y=audio, sr=target_sr, n_mels=80, hop_length=160
    )
    log_mel = np.log(mel + 1e-8)

    return log_mel

15.2. Data Loading

from torch.utils.data import Dataset, DataLoader

class EmotionDataset(Dataset):
    def __init__(self, audio_paths, labels):
        self.audio_paths = audio_paths
        self.labels = labels

    def __len__(self):
        return len(self.audio_paths)

    def __getitem__(self, idx):
        mel = preprocess_audio(self.audio_paths[idx])
        label = self.labels[idx]
        return torch.tensor(mel, dtype=torch.float32).unsqueeze(0), label

# Create dataloaders (train_dataset / val_dataset are EmotionDataset instances)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

15.3. Training Loop

model = EmotionCNN(num_classes=7)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(100):
    model.train()
    for mel, labels in train_loader:
        outputs = model(mel)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation at the end of each epoch
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for mel, labels in val_loader:
            outputs = model(mel)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print(f"Epoch {epoch}, Val Accuracy: {100 * correct / total:.2f}%")

16. Data Augmentation

Audio Augmentations:

import audiomentations as A

augment = A.Compose([
    A.AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    A.TimeStretch(min_rate=0.8, max_rate=1.2, p=0.5),
    A.PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    A.Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
])

def augment_audio(audio, sr):
    return augment(samples=audio, sample_rate=sr)

SpecAugment:

def spec_augment(mel, freq_mask=10, time_mask=20):
    # Frequency masking
    f0 = np.random.randint(0, mel.shape[0] - freq_mask)
    mel[f0:f0+freq_mask, :] = 0

    # Time masking
    t0 = np.random.randint(0, mel.shape[1] - time_mask)
    mel[:, t0:t0+time_mask] = 0

    return mel

17. Handling Class Imbalance

Strategies:

  1. Weighted Loss (e.g. inverse-frequency weights via scikit-learn):
    from sklearn.utils.class_weight import compute_class_weight
    class_weights = compute_class_weight('balanced', classes=np.unique(labels), y=labels)
    criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float32))

  2. Oversampling:
    from imblearn.over_sampling import RandomOverSampler
    ros = RandomOverSampler()
    X_resampled, y_resampled = ros.fit_resample(X, y)

  3. Focal Loss (down-weights easy, well-classified examples):
    class FocalLoss(nn.Module):
        def __init__(self, gamma=2):
            super().__init__()
            self.gamma = gamma

        def forward(self, inputs, targets):
            ce_loss = F.cross_entropy(inputs, targets, reduction='none')
            pt = torch.exp(-ce_loss)
            focal_loss = (1 - pt) ** self.gamma * ce_loss
            return focal_loss.mean()

18. Dimensional Emotion Recognition

Valence-Arousal-Dominance (VAD) Model:

  • Valence: Positive (happy) to Negative (sad).
  • Arousal: Active (excited) to Passive (calm).
  • Dominance: Dominant to Submissive.

Regression Instead of Classification:

class EmotionVADRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.regressor = nn.Linear(768, 3)  # Predict V, A, D

    def forward(self, x):
        features = self.encoder(x).last_hidden_state.mean(dim=1)
        return self.regressor(features)

# Training with MSE loss (batch of one target for illustration)
criterion = nn.MSELoss()
output = model(audio)
loss = criterion(output, torch.tensor([[valence, arousal, dominance]]))

Evaluation Metric (CCC):

def concordance_correlation_coefficient(pred, target):
    mean_pred = pred.mean()
    mean_target = target.mean()
    var_pred = pred.var()
    var_target = target.var()
    covar = ((pred - mean_pred) * (target - mean_target)).mean()

    ccc = 2 * covar / (var_pred + var_target + (mean_pred - mean_target)**2)
    return ccc

19. Production Deployment

19.1. Model Export

# Export to ONNX
dummy_input = torch.randn(1, 1, 80, 400)
torch.onnx.export(model, dummy_input, "emotion_model.onnx")

# Or TorchScript
scripted = torch.jit.script(model)
scripted.save("emotion_model.pt")

19.2. Inference Service

from fastapi import FastAPI, UploadFile
import soundfile as sf

app = FastAPI()

@app.post("/predict")
async def predict_emotion(file: UploadFile):
    # Read audio
    audio, sr = sf.read(file.file)

    # Preprocess
    mel = preprocess_audio_from_array(audio, sr)

    # Predict (add batch and channel dims: (1, 1, n_mels, time))
    with torch.no_grad():
        output = model(torch.tensor(mel, dtype=torch.float32).unsqueeze(0).unsqueeze(0))
        emotion_idx = output.argmax().item()

    # Label order must match the label encoding used during training
    emotions = ["angry", "happy", "sad", "neutral", "fear", "disgust", "surprise"]
    return {"emotion": emotions[emotion_idx]}

19.3. Streaming Processing

class StreamingEmotionDetector:
    def __init__(self, model, window_size=3.0, hop_size=1.0, sr=16000):
        self.model = model
        self.window_samples = int(window_size * sr)
        self.hop_samples = int(hop_size * sr)
        self.buffer = []

    def process_chunk(self, audio_chunk):
        self.buffer.extend(audio_chunk)

        results = []
        while len(self.buffer) >= self.window_samples:
            window = self.buffer[:self.window_samples]
            emotion = self.predict(window)
            results.append(emotion)
            self.buffer = self.buffer[self.hop_samples:]

        return results

    def predict(self, audio):
        # compute_mel: the same log-mel front end used during training
        mel = compute_mel(audio)
        with torch.no_grad():
            # Add batch and channel dims: (1, 1, n_mels, time)
            output = self.model(torch.tensor(mel, dtype=torch.float32).unsqueeze(0).unsqueeze(0))
            return output.argmax().item()

20. Mastery Checklist

  • Extract prosodic features (F0, energy)
  • Extract spectral features (MFCC, mel spectrogram)
  • Train CNN on spectrograms
  • Train LSTM with attention
  • Fine-tune Wav2Vec2 for emotion
  • Handle class imbalance (weighted loss, oversampling)
  • Implement multimodal fusion
  • Evaluate with weighted F1 and UA
  • Deploy real-time emotion detector
  • Understand dimensional emotion models

21. Conclusion

Speech Emotion Recognition bridges the gap between AI and human emotional intelligence. It’s a challenging task that requires:

  • Domain Knowledge: Understanding how emotions manifest in speech.
  • ML Expertise: Selecting and training appropriate models.
  • Data Engineering: Handling imbalanced, subjective labels.
  • System Design: Building real-time, production-ready systems.

The Path Forward:

  1. Start with IEMOCAP and a CNN baseline.
  2. Upgrade to Wav2Vec2 for better features.
  3. Add multimodal (text) for improved accuracy.
  4. Deploy with streaming for real-time applications.

As AI assistants become more prevalent, emotional intelligence will be a key differentiator. Systems that understand and respond to human emotions will create more natural, empathetic interactions. Master SER to be at the forefront of this revolution.

FAQ

What acoustic features are most important for speech emotion recognition?

The most informative features span three categories: prosodic features (pitch/F0, energy, speaking rate) that capture intonation and rhythm, spectral features (MFCCs, mel spectrograms, formants) that capture timbral qualities, and voice quality features (jitter, shimmer, harmonic-to-noise ratio) that capture vocal irregularities. Higher pitch and energy tend to indicate excitement or anger, while lower values suggest sadness. Pretrained models like Wav2Vec2 learn to capture all these patterns automatically from raw audio.

How do you handle class imbalance in emotion recognition datasets?

Emotion datasets are inherently imbalanced because neutral speech dominates and extreme emotions are rare. Effective strategies include weighted cross-entropy loss with inverse-frequency class weights, oversampling minority emotion classes, focal loss that automatically down-weights well-classified examples, and data augmentation using pitch shifting, time stretching, and noise addition to increase diversity in underrepresented classes. Using unweighted accuracy (average recall across classes) as the evaluation metric also prevents the model from optimizing only for the majority class.

What is the difference between early fusion and late fusion in multimodal emotion recognition?

Early fusion concatenates feature vectors from different modalities (audio, text, video) before passing them to a shared classifier, enabling the model to learn cross-modal interactions but requiring aligned features. Late fusion trains separate models per modality and combines their prediction scores, which is simpler and allows modality-specific architectures but misses inter-modal dependencies. Cross-modal attention provides a middle ground by letting each modality attend to relevant parts of other modalities through learned attention weights.

Why is IEMOCAP considered the gold standard dataset for speech emotion recognition?

IEMOCAP contains 12 hours of audiovisual data from 10 actors across 5 dyadic sessions, combining both scripted and spontaneous interactions. It provides multiple annotation types (categorical emotions and dimensional VAD values), supports speaker-independent evaluation splits, and includes both audio and video modalities. Its moderate size and quality annotations make it the most widely used and comparable benchmark in SER research, though its acted nature means models may not transfer perfectly to spontaneous real-world speech.


Originally published at: arunbaby.com/speech-tech/0045-speech-emotion-recognition
