23 minute read

“From waveforms to words, and back again.”

TL;DR

Sequence-to-sequence models replaced hand-crafted speech pipelines with end-to-end neural networks for ASR (LAS, Whisper), TTS (Tacotron 2, FastSpeech 2), and speech translation (Translatotron). The evolution from HMM-GMM through CTC and RNN-Transducer to Conformer progressively solved streaming, alignment, and robustness challenges. Self-supervised pretraining with Wav2Vec 2.0 and HuBERT enables SOTA results with minimal labeled data, while audio-language models like AudioLM blur the boundary between speech and NLP. For how these models handle multiple tasks simultaneously, see the multi-task learning article.


1. The Seq2Seq Paradigm

Traditional speech systems were pipelines:

  • ASR: Acoustic Model + Language Model + Decoder.
  • TTS: Text Analysis + Acoustic Model + Vocoder.

Seq2Seq unifies this into a single neural network:

  • Input: Sequence (audio features or text).
  • Output: Sequence (text or audio features).
  • No hand-crafted features or rules.

2. ASR: Listen, Attend, and Spell (LAS)

Architecture:

  1. Listener (Encoder): Pyramid LSTM processes audio.
    • Reduces time resolution (e.g., 100 frames → 25 frames).
  2. Attender: Attention mechanism focuses on relevant audio frames.
  3. Speller (Decoder): LSTM generates characters/words.

Attention: e_{t,i} = \text{score}(s_{t-1}, h_i), \quad \alpha_t = \text{softmax}(e_t), \quad c_t = \sum_i \alpha_{t,i} h_i

Where:

  • s_{t-1}: Previous decoder state.
  • h_i: Encoder hidden state at time i.
  • c_t: Context vector (weighted sum of encoder states).
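
A minimal sketch of these equations in PyTorch, using an additive (Bahdanau-style) score; the tanh scoring form and all dimensions are assumptions, since the score function is left abstract above:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)   # projects previous decoder state s_{t-1}
        self.W_h = nn.Linear(enc_dim, attn_dim)   # projects encoder states h_i
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, s_prev, h):
        # s_prev: (B, dec_dim), h: (B, T, enc_dim)
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h))).squeeze(-1)  # e_{t,i}
        alpha = torch.softmax(e, dim=-1)                  # alpha_t
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # c_t = sum_i alpha_{t,i} h_i
        return c, alpha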

3. TTS: Tacotron 2

Goal: Text → Mel-Spectrogram → Waveform.

Architecture:

  1. Encoder: Character embeddings → LSTM.
  2. Attention: Decoder attends to encoder states.
  3. Decoder: Predicts mel-spectrogram frames.
  4. Vocoder (WaveNet/HiFi-GAN): Mel → Waveform.

Key Innovation: Predict multiple frames per step (faster).

4. Speech Translation: Direct S2ST

Traditional: Audio → ASR → Text → MT → Text → TTS → Audio. Direct: Audio → Audio (no text intermediate).

Advantages:

  • Preserves prosody (emotion, emphasis).
  • Works for unwritten languages.

Model: Translatotron (Google).

  • Encoder: Audio (Spanish).
  • Decoder: Spectrogram (English).

5. Attention Mechanisms

1. Content-Based Attention

  • Decoder decides where to attend based on content.
  • Problem: Can attend to same location twice (repetition).

2. Location-Aware Attention

  • Uses previous attention weights to guide current attention (sketched at the end of this section).
  • Encourages monotonic progression (left-to-right).

3. Monotonic Attention

  • Enforces strict left-to-right alignment.
  • Good for streaming ASR (can’t look ahead).
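
A minimal sketch of the location-aware scoring from item 2: the previous attention weights are convolved and added into the energies, nudging the alignment forward. The kernel size, filter count, and dimensions are illustrative choices:

import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim, n_filters=32, kernel_size=31):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)
        self.W_h = nn.Linear(enc_dim, attn_dim)
        self.loc_conv = nn.Conv1d(1, n_filters, kernel_size, padding=kernel_size // 2)
        self.W_f = nn.Linear(n_filters, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, s_prev, h, prev_alpha):
        # s_prev: (B, dec_dim), h: (B, T, enc_dim), prev_alpha: (B, T) from the last step
        f = self.loc_conv(prev_alpha.unsqueeze(1)).transpose(1, 2)        # (B, T, n_filters)
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1)
                              + self.W_h(h) + self.W_f(f))).squeeze(-1)   # (B, T)
        alpha = torch.softmax(e, dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)
        return context, alpha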

6. Challenges

1. Exposure Bias

  • During training, decoder sees ground truth.
  • During inference, decoder sees its own (possibly wrong) predictions.
  • Fix: Scheduled sampling (sketched at the end of this section).

2. Alignment Failures

  • Attention might skip words or repeat.
  • Fix: Guided attention (force diagonal alignment early in training).

3. Long Sequences

  • Attention is O(N^2) in memory.
  • Fix: Chunked attention, or use CTC for ASR.
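
To make the scheduled-sampling fix from challenge 1 concrete, a minimal sketch; the probability schedule and tensor layout are illustrative, not prescribed:

import random

def decoder_input_at_step(t, ground_truth, prev_prediction, teacher_forcing_prob=0.9):
    # With probability p, feed the ground-truth token (teacher forcing);
    # otherwise feed the model's own previous prediction.
    if random.random() < teacher_forcing_prob:
        return ground_truth[:, t]
    return prev_prediction

# In practice teacher_forcing_prob is decayed over training,
# e.g. p = max(0.5, 1.0 - 0.05 * epoch), so the model gradually
# learns to recover from its own mistakes.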

7. Modern Approach: Transformer-Based

Whisper (OpenAI):

  • Encoder-Decoder Transformer.
  • Trained on 680k hours.
  • Handles ASR, Translation, Language ID in one model.

Advantages:

  • Parallel training (unlike RNN).
  • Better long-range dependencies.

8. Summary

Task | Model         | Input      | Output
ASR  | LAS, Whisper  | Audio      | Text
TTS  | Tacotron 2    | Text       | Audio
ST   | Translatotron | Audio (L1) | Audio (L2)

9. Deep Dive: Attention Alignment Visualization

In ASR, attention should be monotonic (left-to-right).

Good Alignment:

Audio frames: [a1][a2][a3][a4][a5]
Text:         [ h ][ e ][ l ][ l ][ o ]
Attention:    ████
              ░░░░████
              ░░░░░░░░████
              ░░░░░░░░░░░░████
              ░░░░░░░░░░░░░░░░████

Bad Alignment (Skipping):

Audio frames: [a1][a2][a3][a4][a5]
Text:         [ h ][ e ][ l ][ l ][ o ]
Attention:    ████
              ░░░░████
              ░░░░░░░░░░░░████ ← Skipped a3!

Fix: Guided attention loss forces diagonal alignment early in training.

10. Deep Dive: Streaming ASR with Monotonic Attention

Problem: Standard attention looks at the entire input. Can’t stream.

Monotonic Chunkwise Attention (MoChA):

  1. At each decoder step, decide: “Should I move to the next audio chunk?”
  2. Use a sigmoid gate: p_{\text{move}} = \sigma(g(h_t, s_{t-1}))
  3. If yes, attend to next chunk. If no, stay.

Advantage: Can process audio in real-time (with bounded latency).

11. Deep Dive: Tacotron 2 Architecture Details

Encoder:

  • Character embeddings → 3 Conv layers → Bidirectional LSTM.
  • Output: Encoded representation of text.

Decoder:

  • Prenet: 2 FC layers with dropout (helps with generalization).
  • Attention RNN: 1-layer LSTM.
  • Decoder RNN: 2-layer LSTM.
  • Output: Predicts 80-dim mel-spectrogram frame.

Postnet:

  • 5 Conv layers.
  • Refines the mel-spectrogram (adds high-frequency details).

Stop Token:

  • Decoder also predicts “Should I stop?” at each step.
  • Training: Sigmoid BCE loss.

12. Deep Dive: Vocoder Evolution

Goal: Mel-Spectrogram → Waveform.

WaveNet (2016):

  • Autoregressive CNN.
  • Generates one sample at a time (16kHz = 16,000 samples/sec).
  • Slow: 1 second of audio takes 10 seconds to generate.

WaveGlow (2018):

  • Flow-based model.
  • Parallel generation (all samples at once).
  • Fast: Real-time on GPU.

HiFi-GAN (2020):

  • GAN-based.
  • Fastest: 167x faster than real-time on V100.
  • Quality: Indistinguishable from ground truth.

13. System Design: Low-Latency TTS

Scenario: Voice assistant (Alexa, Siri).

Requirements:

  • Latency: < 300ms from text to first audio chunk.
  • Quality: Natural, expressive.

Architecture:

  1. Text Normalization: “Dr.” → “Doctor”, “$100” → “one hundred dollars”.
  2. Grapheme-to-Phoneme (G2P): “read” → /riːd/ or /rɛd/ (context-dependent).
  3. Prosody Prediction: Predict pitch, duration, energy.
  4. Acoustic Model: FastSpeech 2 (non-autoregressive, parallel).
  5. Vocoder: HiFi-GAN.

Streaming:

  • Generate mel-spectrogram in chunks (e.g., 50ms).
  • Vocoder processes each chunk independently.

14. Deep Dive: Translatotron (Direct S2ST)

Challenge: No parallel S2ST data (Audio_Spanish → Audio_English).

Solution: Weak supervision.

  1. Use ASR to get Spanish text.
  2. Use MT to get English text.
  3. Use TTS to get English audio.
  4. Train Translatotron on (Audio_Spanish, Audio_English) pairs.

Architecture:

  • Encoder: Processes Spanish audio.
  • Decoder: Generates English mel-spectrogram.
  • Speaker Encoder: Preserves source speaker’s voice.

15. Code: Simple Seq2Seq ASR

import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqASR(nn.Module):
    def __init__(self, input_dim, hidden_dim, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

        # Encoder: Bi-LSTM
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers=3,
                               batch_first=True, bidirectional=True)

        # Attention (simplified: scores depend only on encoder states, not the decoder state)
        self.attention = nn.Linear(hidden_dim * 2, 1)

        # Decoder: LSTM
        self.decoder = nn.LSTM(vocab_size + hidden_dim * 2, hidden_dim,
                               num_layers=2, batch_first=True)

        # Output projection
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio_features, text_input):
        # Encode audio
        encoder_out, _ = self.encoder(audio_features)  # (B, T, 2*H)

        # Decode with teacher forcing
        outputs = []
        hidden = None

        for t in range(text_input.size(1)):
            # Compute attention over encoder states
            attn_scores = self.attention(encoder_out).squeeze(-1)        # (B, T)
            attn_weights = torch.softmax(attn_scores, dim=1)             # (B, T)
            context = torch.bmm(attn_weights.unsqueeze(1), encoder_out)  # (B, 1, 2*H)

            # Decoder input: previous char (one-hot) + context
            prev_char = F.one_hot(text_input[:, t:t+1], self.vocab_size).float()  # (B, 1, V)
            decoder_input = torch.cat([prev_char, context], dim=-1)

            # Decoder step
            decoder_out, hidden = self.decoder(decoder_input, hidden)

            # Predict next char
            logits = self.fc(decoder_out)
            outputs.append(logits)

        return torch.cat(outputs, dim=1)  # (B, L, V)

16. Production Considerations

  1. Model Size: Quantize to INT8 for mobile deployment.
  2. Latency: Use non-autoregressive models (FastSpeech, Conformer-CTC).
  3. Personalization: Fine-tune on user’s voice (few-shot learning).
  4. Multilingual: Train on 100+ languages (mSLAM, Whisper).

17. Deep Dive: The Evolution of Seq2Seq in Speech

To understand where we are, we must understand the journey.

1. HMM-GMM (1980s-2010):

  • Acoustic Model: Gaussian Mixture Models (GMM) modeled the probability of audio features given a phoneme state.
  • Sequence Model: Hidden Markov Models (HMM) modeled the transition between states.
  • Pros: Mathematically rigorous, worked on small data.
  • Cons: Assumed independence of frames (false), complex training pipeline.

2. DNN-HMM (2010-2015):

  • Replaced GMM with Deep Neural Networks (DNN).
  • Impact: Massive drop in WER (30% relative).
  • Cons: Still relied on HMM for alignment.

3. CTC (2006, popularized 2015):

  • Connectionist Temporal Classification.
  • First “End-to-End” loss.
  • Allowed predicting sequences shorter than input without explicit alignment.
  • Cons: Conditional independence assumption (output at time t depends only on input, not previous outputs).

4. LAS (2016):

  • Listen-Attend-Spell.
  • Introduced Attention to speech.
  • Pros: No independence assumption.
  • Cons: Not streaming (needs full audio to attend).

5. RNN-T (2012, popularized 2018):

  • RNN Transducer.
  • Combined CTC (streaming) with LAS (label dependency).
  • Result: The standard for streaming ASR (Pixel, Siri, Alexa).

6. Transformer & Conformer (2017-Present):

  • Self-attention replaces RNNs.
  • Conformer: Combines CNN (local patterns) with Transformer (global context).

18. Deep Dive: Connectionist Temporal Classification (CTC)

Problem: Audio has T frames (e.g., 1000). Text has L characters (e.g., 50). T \gg L. How do we map them?

CTC Solution:

  • Introduce a blank token \epsilon.
  • Output sequence length is T.
  • Collapse: Remove repeats and blanks.
  • h h e e l l l l o → helo (repeats collapse, so the double “l” is lost without a blank).
  • h h \epsilon e e l l \epsilon l l o → hello (the blank \epsilon keeps the two “l”s separate).

Loss Function: P(Y|X) = \sum_{A \in \mathcal{B}^{-1}(Y)} P(A|X)

  • Sum over all valid alignments A that collapse to Y.
  • Computed efficiently using Dynamic Programming (Forward-Backward algorithm).
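
A minimal sketch of computing this loss with PyTorch's built-in nn.CTCLoss; shapes and the vocabulary size are placeholders:

import torch
import torch.nn as nn

T, B, C, L = 100, 4, 30, 20                            # frames, batch, vocab (incl. blank), target length
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)   # acoustic model outputs, (T, B, C)
targets = torch.randint(1, C, (B, L))                  # index 0 is reserved for the blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)          # forward-backward runs inside
loss = ctc(log_probs, targets, input_lengths, target_lengths)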

Pros:

  • Fast inference (O(1) per step).
  • Streaming friendly.

Cons:

  • Can’t model language (e.g., “pair” vs “pear” sound the same). Needs an external Language Model.

19. Deep Dive: RNN-Transducer (RNN-T)

Architecture:

  1. Encoder (Audio): Processes audio frames x_t. Produces h_t^{enc}.
  2. Prediction Network (Text): Processes previous non-blank output y_{u-1}. Produces h_u^{pred}. (Like an LM).
  3. Joint Network: Combines them:
     z_{t,u} = \text{ReLU}(W_{enc} h_t^{enc} + W_{pred} h_u^{pred})
     P(y|t,u) = \text{softmax}(W_{out} z_{t,u})
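
A minimal sketch of the joint network; layer sizes and names are assumptions:

import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, joint_dim)
        self.w_pred = nn.Linear(pred_dim, joint_dim)
        self.w_out = nn.Linear(joint_dim, vocab_size)   # vocab includes the blank

    def forward(self, h_enc, h_pred):
        # h_enc: (B, T, enc_dim) from the audio encoder
        # h_pred: (B, U, pred_dim) from the prediction network
        z = torch.relu(self.w_enc(h_enc).unsqueeze(2) + self.w_pred(h_pred).unsqueeze(1))  # (B, T, U, joint_dim)
        return self.w_out(z)   # logits over the (T, U) lattice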

Inference (Greedy):

  • If output is non-blank (y), feed to Prediction Network, increment u. Stay at audio frame t.
  • If output is blank (\epsilon), move to next audio frame t+1.

Why it wins:

  • Streaming: Encoder is causal.
  • Accuracy: Prediction network models label dependencies (unlike CTC).
  • Latency: Controllable.

20. Deep Dive: Transformer vs Conformer

Transformer:

  • Great at global context (Self-Attention).
  • Bad at local fine-grained details (needs deep layers to see local patterns).
  • Positional encodings are brittle for varying audio lengths.

Conformer (Convolution-augmented Transformer):

  • Macaron Style: Feed-Forward -> Multi-Head Attention -> Conv Module -> Feed-Forward.
  • Conv Module: Pointwise Conv -> Gated Linear Unit (GLU) -> Depthwise Conv -> BatchNorm -> Swish -> Pointwise Conv.
  • Why:
    • CNNs capture local features (formants, transitions).
    • Transformers capture global semantics.
  • Result: SOTA on LibriSpeech.
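
A sketch of the convolution module described above; the kernel size and the 2x expansion before the GLU are typical choices, not prescribed here:

import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)   # expand, then GLU halves back
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                        # x: (B, T, d_model)
        y = self.norm(x).transpose(1, 2)         # (B, d_model, T) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.swish(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                             # residual connection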

21. Deep Dive: FastSpeech 2 (Non-Autoregressive TTS)

Problem with Tacotron 2:

  • Autoregressive (slow).
  • Unstable attention (skipping/repeating).
  • Hard to control prosody (speed, pitch).

FastSpeech 2 Architecture:

  1. Encoder: Feed-Forward Transformer.
  2. Variance Adaptor:
    • Duration Predictor: How many frames does this phoneme last? (Trained on forced alignment).
    • Pitch Predictor: Predicts F0 contour.
    • Energy Predictor: Predicts volume.
    • Adds embeddings of these predictions to the encoder output.
  3. Length Regulator: Expands hidden states based on duration (e.g., “a” lasts 5 frames -> repeat vector 5 times).
  4. Decoder: Feed-Forward Transformer.
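
A minimal sketch of the Length Regulator from step 3, using torch.repeat_interleave; shapes and the example durations are illustrative:

import torch

def length_regulate(hidden, durations):
    # hidden: (L, d_model) phoneme-level encoder outputs
    # durations: (L,) integer frame counts from the duration predictor
    return torch.repeat_interleave(hidden, durations, dim=0)   # (sum(durations), d_model)

h = torch.randn(3, 8)              # e.g. phonemes "k", "ae", "t"
d = torch.tensor([2, 5, 3])        # "ae" is predicted to last 5 frames
frames = length_regulate(h, d)     # (10, 8) frame-level sequence for the decoder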

Pros:

  • Fast: Parallel generation.
  • Robust: No attention failures.
  • Controllable: Can explicitly set “speak 1.2x faster” or “raise pitch”.

22. Deep Dive: VITS (Conditional Variational Autoencoder)

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is the current SOTA for E2E TTS.

Key Idea: Combine Acoustic Model and Vocoder into one flow-based model.

Architecture:

  • Posterior Encoder: Encodes linear spectrogram into latent z.
  • Prior Encoder: Predicts distribution of z from text (conditioned on alignment).
  • Decoder (Generator): Transforms z into waveform (HiFi-GAN style).
  • Stochastic Duration Predictor: Models duration uncertainty.

Training:

  • Reconstruction Loss: Mel-spectrogram L1 loss.
  • KL Divergence: Between Posterior and Prior.
  • Adversarial Loss: Discriminator tries to distinguish real vs generated audio.

Result: Extremely natural, high-fidelity speech.

23. System Design: Real-Time Speech Translation System

Scenario: “Universal Translator” device. English Audio -> Spanish Audio.

Constraints:

  • Latency: < 2 seconds lag.
  • Compute: Edge device (limited).

Architecture Choices:

Option A: Cascade (ASR -> MT -> TTS)

  • Pros: Modular. Can use SOTA for each.
  • Cons: Error propagation. Latency adds up. Loss of paralinguistics (tone).

Option B: Direct S2ST (Audio -> Audio)

  • Pros: Fast. Preserves tone.
  • Cons: Data scarcity. Hard to train.

Hybrid Design (Production):

  1. Streaming ASR (RNN-T): Generates English text stream.
  2. Streaming MT: Translates partial sentences. “Hello” -> “Hola”.
  3. Incremental TTS: Generates audio for “Hola” immediately.

Wait-k Policy:

  • MT model waits for k words before translating.
  • Balances context (accuracy) vs latency.
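
A minimal sketch of the wait-k read/write loop; translate_step is an assumed callback that returns the next target token given the source prefix and the target so far (or None when finished):

def wait_k_stream(source_tokens, translate_step, k=3):
    target = []
    for i in range(len(source_tokens)):
        if i + 1 < k:
            continue                                                    # still reading the first k words
        target.append(translate_step(source_tokens[:i + 1], target))    # write one, then read one
    while True:                                                         # source exhausted: flush the rest
        tok = translate_step(source_tokens, target)
        if tok is None:
            break
        target.append(tok)
    return target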

24. Case Study: OpenAI Whisper

Goal: Robust ASR that works on “in the wild” audio.

Data:

  • 680,000 hours of web audio.
  • Weak Supervision: Transcripts from ASR systems, subtitles (noisy).
  • Multitask:
    • English Transcription.
    • Any-to-English Translation.
    • Language Identification.
    • Voice Activity Detection.

Architecture:

  • Standard Encoder-Decoder Transformer.
  • Input: Log-Mel Spectrogram (30 seconds).
  • Output: Text tokens.

Key Features:

  • Task Tokens: <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>.
  • Long-form: Processes 30s chunks. Uses previous chunk’s text as prompt (context).

Impact:

  • Zero-shot performance on many datasets matches supervised models.
  • Proved that Data Scale > Model Architecture.
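
For reference, a minimal usage sketch assuming the open-source whisper Python package is installed; the file name and prompt string are placeholders:

import whisper

model = whisper.load_model("small")

# task="transcribe" keeps the source language; task="translate" outputs English text.
# initial_prompt carries context from the previous 30s chunk for long-form audio.
result = model.transcribe("meeting.wav", task="transcribe",
                          initial_prompt="Text of the previous chunk goes here.")
print(result["text"])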

25. Case Study: Google USM (Universal Speech Model)

Goal: One model for 300+ languages.

Architecture:

  • Conformer (2 Billion parameters).
  • MOST (Multi-Objective Supervised Training):
    • BEST-RQ: Self-supervised loss (BERT-style on audio).
    • Text-Injection: Train on text-only data (using shared encoder layers).
    • ASR: Supervised loss on labeled audio.

Result:

  • SOTA on 73 languages.
  • Enables ASR for languages with < 10 hours of data.

26. Deep Dive: Evaluation Metrics

ASR:

  • WER (Word Error Rate): \frac{S + D + I}{N}
    • S: Substitutions, D: Deletions, I: Insertions, N: Total words in the reference.
  • CER (Character Error Rate): For languages without spaces (Chinese).
  • RTF (Real-Time Factor): \frac{\text{Processing Time}}{\text{Audio Duration}}. RTF < 1 means the system keeps up with real time.
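
A minimal WER sketch via word-level edit distance (whitespace tokenization assumed):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])          # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)      # deletion, insertion
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))   # 1 insertion / 3 reference words ≈ 0.33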

TTS:

  • MOS (Mean Opinion Score): Humans rate naturalness 1-5.
  • MCD (Mel Cepstral Distortion): Objective distance between generated and ground truth spectrograms.

ST (Speech Translation):

  • BLEU: Standard MT metric.

27. Code: Implementing Beam Search for Seq2Seq

Greedy decoding (pick max prob) is suboptimal. Beam search explores k paths.

import torch

def beam_search_decoder(model, encoder_out, beam_width=3, max_len=50):
    # Start with [SOS] token
    # Beam: list of (sequence, score, hidden_state)
    start_token = model.vocab['<sos>']
    beam = [([start_token], 0.0, None)]

    for _ in range(max_len):
        candidates = []

        for seq, score, hidden in beam:
            # Finished hypotheses are carried over unchanged
            if seq[-1] == model.vocab['<eos>']:
                candidates.append((seq, score, hidden))
                continue

            # Predict next token
            last_token = torch.tensor([[seq[-1]]])
            output, new_hidden = model.decoder(last_token, hidden, encoder_out)
            log_probs = torch.log_softmax(output, dim=-1).view(-1)   # flatten to (vocab,)

            # Get top k extensions of this hypothesis
            topk_probs, topk_ids = log_probs.topk(beam_width)

            for i in range(beam_width):
                token = topk_ids[i].item()
                prob = topk_probs[i].item()
                candidates.append((seq + [token], score + prob, new_hidden))

        # Keep the k best candidates (highest cumulative log-probability)
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]

        # Stop early if every hypothesis has emitted <eos>
        if all(c[0][-1] == model.vocab['<eos>'] for c in beam):
            break

    return beam[0][0]  # Return best sequence

28. Production: Serving Speech Models with Triton

NVIDIA Triton Inference Server is standard for deploying speech models.

Pipeline:

  1. Feature Extractor (Python Backend): Audio -> Mel-Spec.
  2. Encoder (TensorRT): Mel-Spec -> Hidden States.
  3. Decoder (TensorRT): Hidden States -> Text (Beam Search).

Dynamic Batching:

  • Triton groups requests arriving within 5ms into a batch.
  • Increases GPU utilization significantly.

Ensemble Model:

  • Define a DAG of models.
  • Client sends audio, Triton handles the flow.

29. Deep Dive: End-to-End Speech-to-Speech Translation (S2ST)

Unit-Based S2ST:

  • Instead of spectrograms, predict discrete acoustic units (from HuBERT or k-means).
  • Advantage: Discrete tokens allow using standard Transformer (like NLP).
  • Vocoder: Unit HiFi-GAN converts discrete units to waveform.

SpeechMatrix:

  • Mining parallel speech from 100k hours of multilingual audio.
  • Uses LASER embeddings to find matching sentences in different languages.

30. Summary

Task      | Model         | Input        | Output
ASR       | LAS, Whisper  | Audio        | Text
TTS       | Tacotron 2    | Text         | Audio
ST        | Translatotron | Audio (L1)   | Audio (L2)
Vocoder   | HiFi-GAN      | Mel-Spec     | Waveform
Streaming | RNN-T         | Audio Stream | Text Stream
Fast TTS  | FastSpeech 2  | Text         | Mel-Spec

31. Deep Dive: Self-Supervised Learning (SSL) in Speech

Problem: Labeled data (audio + text) is expensive. Unlabeled audio is free.

Wav2Vec 2.0 (Meta):

  • Idea: Mask parts of the audio latent space and predict the quantized representation of the masked part.
  • Contrastive Loss: Identify the correct quantized vector among distractors.
  • Result: Can reach SOTA with only 10 minutes of labeled data (after pre-training on 53k hours).

HuBERT (Hidden Unit BERT):

  • Idea: Offline clustering (k-means) of MFCCs to generate pseudo-labels.
  • Masked Prediction: BERT-style objective. Predict the cluster ID of masked frames.
  • Result: More robust than Wav2Vec 2.0.

Impact on Seq2Seq:

  • Initialize the Encoder with Wav2Vec 2.0 / HuBERT.
  • Add a random Decoder.
  • Fine-tune on ASR task.
  • Benefit: Drastic reduction in required labeled data.
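
A minimal sketch of this recipe assuming the HuggingFace transformers package; the checkpoint name and output size are examples:

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")   # pre-trained SSL encoder
head = nn.Linear(encoder.config.hidden_size, 32)                    # randomly initialized output head

waveform = torch.zeros(1, 16000)                                     # 1 s of 16 kHz audio (placeholder)
hidden = encoder(waveform).last_hidden_state                         # (B, T', hidden_size)
logits = head(hidden)                                                # fine-tune both on labeled ASR data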

32. Deep Dive: Multilingual Seq2Seq

Goal: One model for 100 languages.

Challenges:

  • Data Imbalance: English has 1M hours, Swahili has 100.
  • Script Diversity: Latin, Cyrillic, Chinese, Arabic.
  • Phonetic Overlap: “P” in English \neq “P” in French.

Solutions:

  1. Language ID Token: <|en|>, <|fr|>.
  2. Shared Vocabulary: SentencePiece (BPE) trained on all languages.
  3. Balancing Sampling: Up-sample low-resource languages during training. p_l \propto (\frac{N_l}{N_{total}})^\alpha where \alpha < 1 (e.g., 0.3) flattens the distribution.
  4. Adapter Modules: Small, language-specific layers inserted into the frozen giant model.
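
A small sketch of the temperature-based sampling from item 3; the hour counts are made up:

def sampling_distribution(hours_per_language, alpha=0.3):
    # alpha < 1 flattens the distribution, up-weighting low-resource languages
    total = sum(hours_per_language.values())
    weights = {lang: (h / total) ** alpha for lang, h in hours_per_language.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

print(sampling_distribution({"en": 1_000_000, "sw": 100}))
# English still dominates, but Swahili's sampling share rises from ~0.01% to roughly 6%.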

33. Deep Dive: End-to-End SLU (Spoken Language Understanding)

Traditional: Audio → ASR → Text → NLP → Intent/Slots. E2E SLU: Audio → Intent/Slots.

Why E2E?

  • Error Robustness: ASR might transcribe “play jazz” as “play jas”. NLP fails. E2E model learns acoustic features of “jazz”.
  • Paralinguistics: “Yeah right” (sarcastic) -> Negative Sentiment. Text-only NLP misses this.

Architecture:

  • Encoder: Pre-trained Wav2Vec 2.0.
  • Decoder: Predicts semantic frame directly.
  • [INTENT: PlayMusic] [ARTIST: The Beatles] [GENRE: Rock]

Challenges:

  • Lack of labeled Audio-to-Semantics data.
  • Solution: Transfer learning from ASR models.

34. Deep Dive: Attention Visualization Code

Visualizing attention maps is the best way to debug Seq2Seq models.

import matplotlib.pyplot as plt
import seaborn as sns

def plot_attention(attention_matrix, input_text, output_text):
    """
    attention_matrix: (Output_Len, Input_Len) numpy array
    """
    plt.figure(figsize=(10, 8))
    sns.heatmap(attention_matrix, cmap='viridis',
                xticklabels=input_text, yticklabels=output_text)
    plt.xlabel('Encoder Input')
    plt.ylabel('Decoder Output')
    plt.show()

# Example usage
# attn = model.get_attention_weights(...)
# plot_attention(attn, audio_frames, predicted_words)

What to look for:

  • Diagonal: Good alignment.
  • Vertical Line: Decoder is stuck on one frame (repeating output).
  • Horizontal Line: Encoder frame is ignored.
  • Fuzzy: Weak attention (low confidence).

35. Deep Dive: Handling Long-Form Audio

Standard Transformers have O(N^2) attention complexity. 30s audio = 1500 frames. 1 hour = 180,000 frames.

Strategies:

  1. Chunking (Whisper):
    • Slice audio into 30s segments.
    • Transcribe independently.
    • Problem: Context cut off at boundaries.
    • Fix: Pass previous segment’s text as prompt.
  2. Streaming (RNN-T):
    • Process frame-by-frame. Infinite length.
    • Problem: No future context.
  3. Sparse Attention (BigBird / Longformer):
    • Attend only to local window + global tokens.
    • O(N) complexity.
  4. Block-Processing (Emformer):
    • Block-wise processing with memory bank for history.

36. Deep Dive: SpeechGPT / AudioLM

  • Treat audio tokens and text tokens as the same thing.
  • Tokenizer: SoundStream / EnCodec (Neural Audio Codec).
  • Model: Decoder-only Transformer (GPT).
  • Training:
    • Text-only data (Web).
    • Audio-only data (Radio).
    • Paired data (ASR).

Capabilities:

  • Speech-to-Speech Translation: “Translate this to French” (Audio input) -> Audio output.
  • Voice Continuation: Continue speaking in the user’s voice.
  • Zero-shot TTS: “Say ‘Hello’ in this voice: [Audio Prompt]”.

37. Ethical Considerations: Voice Cloning & Deepfakes

Seq2Seq TTS (Vall-E) can clone a voice with 3 seconds of audio.

Risks:

  • Fraud: Impersonating CEOs or relatives.
  • Disinformation: Fake speeches by politicians.
  • Harassment: Fake audio of individuals.

Mitigation:

  • Watermarking: Embed inaudible signals in generated audio.
  • Detection: Train classifiers to detect artifacts of synthesis.
  • Regulation: “Know Your Customer” for TTS APIs.

38. Deep Dive: The Mathematics of Transformers in Speech

Why did Transformers replace RNNs?

1. Self-Attention Complexity:

  • RNN: O(N) sequential operations. Cannot parallelize.
  • Transformer: O(1) sequential operations (parallelizable). O(N^2) memory.
  • Benefit: We can train on 1000 GPUs efficiently.

2. Positional Encodings in Speech:

  • Text uses absolute sinusoidal encodings.
  • Speech Problem: Audio length varies wildly. 10s vs 10m.
  • Solution: Relative Positional Encoding.
  • Instead of adding P_i to input X_i, add a bias b_{i-j} to the attention score A_{ij}.
  • Allows the model to generalize to audio lengths unseen during training.

3. Subsampling:

  • Audio is high-frequency (100 frames/sec). Text is low-frequency (3 chars/sec).
  • Conv Subsampling: First 2 layers of Encoder are strided Convolutions (stride 2x2 = 4x reduction).
  • Reduces sequence length N \to N/4. Reduces attention cost N^2 \to (N/4)^2 = N^2/16.
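
A sketch of the 4x convolutional subsampling front-end; kernel sizes and the final projection are typical choices rather than anything prescribed above:

import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    def __init__(self, d_model, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * ((n_mels + 3) // 4), d_model)

    def forward(self, x):                        # x: (B, T, n_mels) log-mel features
        y = self.conv(x.unsqueeze(1))            # (B, d_model, ~T/4, ~n_mels/4)
        b, c, t, f = y.shape
        return self.proj(y.permute(0, 2, 1, 3).reshape(b, t, c * f))   # (B, ~T/4, d_model)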

39. Deep Dive: Troubleshooting Seq2Seq Models

1. The “Hallucination” Problem:

  • Symptom: Model outputs “Thank you Thank you Thank you” during silence.
  • Cause: Decoder language model is too strong; it predicts likely text even without acoustic evidence.
  • Fix:
    • Voice Activity Detection (VAD): Filter out silence.
    • Coverage Penalty: Penalize attending to the same frames repeatedly.

2. The “NaN” Loss:

  • Symptom: Training crashes.
  • Cause: Exploding gradients in LSTM or division by zero in BatchNorm.
  • Fix:
    • Gradient Clipping (norm 1.0).
    • Warmup learning rate.
    • Check for empty transcripts in data.

3. The “Babble” Problem:

  • Symptom: Output is gibberish.
  • Cause: CTC alignment failed or Attention didn’t converge.
  • Fix:
    • Curriculum Learning: Train on short utterances first, then long.
    • Guided Attention Loss: Force diagonal alignment for first few epochs.

40. Deep Dive: The Future - Audio-Language Models

The boundary between Speech and NLP is blurring.

AudioLM (Google):

  • Treats audio as a sequence of discrete tokens (using SoundStream codec).
  • Uses a GPT-style decoder to generate audio tokens.
  • Capabilities:
    • Speech Continuation: Given 3s of speech, continue speaking in the same voice and style.
    • Zero-Shot TTS: Generate speech from text in a target voice without fine-tuning.

SpeechGPT (Fudan University):

  • Fine-tunes LLaMA on paired speech-text data.
  • Modality Adaptation: Teaches the LLM to understand discrete audio tokens.
  • Result: A chatbot you can talk to, which talks back with emotion and nuance.

Implication:

  • Seq2Seq models (Encoder-Decoder) might be replaced by Decoder-only LLMs that handle all modalities (Text, Audio, Image) in a unified token space.

41. Ethical Considerations in Seq2Seq Speech

1. Deepfakes & Voice Cloning:

  • Models like Vall-E can clone a voice from 3 seconds of audio.
  • Risk: Fraud (fake CEO calls), harassment, disinformation.
  • Mitigation:
    • Watermarking: Embed inaudible signals in generated audio.
    • Detection: Train classifiers to detect synthesis artifacts.

2. Bias in ASR:

  • Models trained on LibriSpeech (audiobooks) fail on AAVE (African American Vernacular English) or Indian accents.
  • Fix: Diverse training data. “Fairness-aware” loss functions that penalize disparity between groups.

3. Privacy:

  • Smart speakers listen constantly.
  • Fix: On-Device Processing. Run the Seq2Seq model on the phone’s NPU, never sending audio to the cloud.

42. Further Reading

To master Seq2Seq speech models, these papers are essential:

  1. “Listen, Attend and Spell” (Chan et al., 2016): The paper that introduced Attention to ASR.
  2. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions” (Shen et al., 2018): The Tacotron 2 paper.
  3. “Attention Is All You Need” (Vaswani et al., 2017): The Transformer paper (foundation of modern speech).
  4. “Conformer: Convolution-augmented Transformer for Speech Recognition” (Gulati et al., 2020): The current SOTA architecture for ASR.
  5. “Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (Baevski et al., 2020): The SSL revolution.
  6. “Whisper: Robust Speech Recognition via Large-Scale Weak Supervision” (Radford et al., 2022): How scale beats architecture.

43. Summary

Task            | Model                  | Input           | Output
ASR             | LAS, Whisper           | Audio           | Text
TTS             | Tacotron 2             | Text            | Audio
ST              | Translatotron          | Audio (L1)      | Audio (L2)
Vocoder         | HiFi-GAN               | Mel-Spec        | Waveform
Streaming       | RNN-T                  | Audio Stream    | Text Stream
Fast TTS        | FastSpeech 2           | Text            | Mel-Spec
SSL             | Wav2Vec 2.0            | Masked Audio    | Quantized Vector
E2E SLU         | SLU-BERT               | Audio           | Intent/Slots
Troubleshooting | VAD, Gradient Clipping | -               | -
Future          | AudioLM                | Discrete Tokens | Discrete Tokens

FAQ

What is the difference between CTC and attention-based ASR?

CTC enforces monotonic alignment between audio and text and enables streaming inference, but assumes conditional independence between output tokens at each time step. Attention-based models like LAS capture label dependencies and long-range context but require the full input before decoding. Hybrid CTC-Attention training combines both strengths for production systems.

Why did Conformer replace Transformers for speech recognition?

Conformer augments Transformer self-attention with convolution modules in a macaron-style architecture where each block contains Feed-Forward, Multi-Head Attention, Convolution, and Feed-Forward sublayers. CNNs capture local acoustic features like formants and transitions while Transformers handle global semantics, achieving state-of-the-art results that neither architecture achieves alone.

How does Tacotron 2 generate natural-sounding speech?

Tacotron 2 uses a character encoder followed by location-sensitive attention and an autoregressive LSTM decoder that predicts mel-spectrogram frames one at a time. A post-net refines the output, and a vocoder like HiFi-GAN converts the spectrogram to a waveform. Location-sensitive attention prevents word skipping and repeating by using previous alignment weights.

What is RNN-Transducer and why is it the standard for streaming ASR?

RNN-T combines a causal audio encoder with a prediction network that models label dependencies, joined by a network that produces output probabilities. Unlike attention-based models, it processes audio frame-by-frame with controllable latency, making it the standard for on-device streaming ASR in products like Google Pixel, Siri, and Alexa.


Originally published at: arunbaby.com/speech-tech/0037-sequence-to-sequence-speech

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch