23 minute read

“From waveforms to words, and back again.”

TL;DR

Sequence-to-sequence models replaced hand-crafted speech pipelines with end-to-end neural networks for ASR (LAS, Whisper), TTS (Tacotron 2, FastSpeech 2), and speech translation (Translatotron). The evolution from HMM-GMM through CTC and RNN-Transducer to Conformer progressively solved streaming, alignment, and robustness challenges. Self-supervised pretraining with Wav2Vec 2.0 and HuBERT enables SOTA results with minimal labeled data, while audio-language models like AudioLM blur the boundary between speech and NLP. For how these models handle multiple tasks simultaneously, see the multi-task learning article.


1. The Seq2Seq Paradigm

Traditional speech systems were pipelines:

  • ASR: Acoustic Model + Language Model + Decoder.
  • TTS: Text Analysis + Acoustic Model + Vocoder.

Seq2Seq unifies this into a single neural network:

  • Input: Sequence (audio features or text).
  • Output: Sequence (text or audio features).
  • No hand-crafted features or rules.

2. ASR: Listen, Attend, and Spell (LAS)

Architecture:

  1. Listener (Encoder): Pyramid LSTM processes audio.
    • Reduces time resolution (e.g., 100 frames → 25 frames).
  2. Attender: Attention mechanism focuses on relevant audio frames.
  3. Speller (Decoder): LSTM generates characters/words.

Attention: e_{t,i} = \text{score}(s_{t-1}, h_i), \quad \alpha_t = \text{softmax}(e_t), \quad c_t = \sum_i \alpha_{t,i} h_i

Where:

  • s_{t-1}: Previous decoder state.
  • h_i: Encoder hidden state at time i.
  • c_t: Context vector (weighted sum of encoder states).
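
A minimal sketch of these equations in PyTorch, using an additive (Bahdanau-style) score; the tanh scoring form and all dimensions are assumptions, since the score function is left abstract above:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)   # projects previous decoder state s_{t-1}
        self.W_h = nn.Linear(enc_dim, attn_dim)   # projects encoder states h_i
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, s_prev, h):
        # s_prev: (B, dec_dim), h: (B, T, enc_dim)
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h))).squeeze(-1)  # e_{t,i}
        alpha = torch.softmax(e, dim=-1)                  # alpha_t
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # c_t = sum_i alpha_{t,i} h_i
        return c, alpha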

3. TTS: Tacotron 2

Goal: Text → Mel-Spectrogram → Waveform.

Architecture:

  1. Encoder: Character embeddings → LSTM.
  2. Attention: Decoder attends to encoder states.
  3. Decoder: Predicts mel-spectrogram frames.
  4. Vocoder (WaveNet/HiFi-GAN): Mel → Waveform.

Key Innovation: Predict multiple frames per step (faster).

4. Speech Translation: Direct S2ST

Traditional: Audio → ASR → Text → MT → Text → TTS → Audio. Direct: Audio → Audio (no text intermediate).

Advantages:

  • Preserves prosody (emotion, emphasis).
  • Works for unwritten languages.

Model: Translatotron (Google).

  • Encoder: Audio (Spanish).
  • Decoder: Spectrogram (English).

5. Attention Mechanisms

1. Content-Based Attention

  • Decoder decides where to attend based on content.
  • Problem: Can attend to same location twice (repetition).

2. Location-Aware Attention

  • Uses previous attention weights to guide current attention (sketched at the end of this section).
  • Encourages monotonic progression (left-to-right).

3. Monotonic Attention

  • Enforces strict left-to-right alignment.
  • Good for streaming ASR (can’t look ahead).
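
A minimal sketch of the location-aware scoring from item 2: the previous attention weights are convolved and added into the energies, nudging the alignment forward. The kernel size, filter count, and dimensions are illustrative choices:

import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim, n_filters=32, kernel_size=31):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)
        self.W_h = nn.Linear(enc_dim, attn_dim)
        self.loc_conv = nn.Conv1d(1, n_filters, kernel_size, padding=kernel_size // 2)
        self.W_f = nn.Linear(n_filters, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, s_prev, h, prev_alpha):
        # s_prev: (B, dec_dim), h: (B, T, enc_dim), prev_alpha: (B, T) from the last step
        f = self.loc_conv(prev_alpha.unsqueeze(1)).transpose(1, 2)        # (B, T, n_filters)
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1)
                              + self.W_h(h) + self.W_f(f))).squeeze(-1)   # (B, T)
        alpha = torch.softmax(e, dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)
        return context, alpha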

6. Challenges

1. Exposure Bias

  • During training, decoder sees ground truth.
  • During inference, decoder sees its own (possibly wrong) predictions.
  • Fix: Scheduled sampling (sketched at the end of this section).

2. Alignment Failures

  • Attention might skip words or repeat.
  • Fix: Guided attention (force diagonal alignment early in training).

3. Long Sequences

  • Attention is O(N^2) in memory.
  • Fix: Chunked attention, or use CTC for ASR.
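
To make the scheduled-sampling fix from challenge 1 concrete, a minimal sketch; the probability schedule and tensor layout are illustrative, not prescribed:

import random

def decoder_input_at_step(t, ground_truth, prev_prediction, teacher_forcing_prob=0.9):
    # With probability p, feed the ground-truth token (teacher forcing);
    # otherwise feed the model's own previous prediction.
    if random.random() < teacher_forcing_prob:
        return ground_truth[:, t]
    return prev_prediction

# In practice teacher_forcing_prob is decayed over training,
# e.g. p = max(0.5, 1.0 - 0.05 * epoch), so the model gradually
# learns to recover from its own mistakes.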

7. Modern Approach: Transformer-Based

Whisper (OpenAI):

  • Encoder-Decoder Transformer.
  • Trained on 680k hours.
  • Handles ASR, Translation, Language ID in one model.

Advantages:

  • Parallel training (unlike RNN).
  • Better long-range dependencies.

8. Summary

Task | Model         | Input      | Output
ASR  | LAS, Whisper  | Audio      | Text
TTS  | Tacotron 2    | Text       | Audio
ST   | Translatotron | Audio (L1) | Audio (L2)

9. Deep Dive: Attention Alignment Visualization

In ASR, attention should be monotonic (left-to-right).

Good Alignment:

Audio frames: [a1][a2][a3][a4][a5]
Text:         [ h ][ e ][ l ][ l ][ o ]
Attention:    ████
              ░░░░████
              ░░░░░░░░████
              ░░░░░░░░░░░░████
              ░░░░░░░░░░░░░░░░████

Bad Alignment (Skipping):

Audio frames: [a1][a2][a3][a4][a5]
Text:         [ h ][ e ][ l ][ l ][ o ]
Attention:    ████
              ░░░░████
              ░░░░░░░░░░░░████ ← Skipped a3!

Fix: Guided attention loss forces diagonal alignment early in training.

10. Deep Dive: Streaming ASR with Monotonic Attention

Problem: Standard attention looks at the entire input. Can’t stream.

Monotonic Chunkwise Attention (MoChA):

  1. At each decoder step, decide: “Should I move to the next audio chunk?”
  2. Use a sigmoid gate: p_{\text{move}} = \sigma(g(h_t, s_{t-1}))
  3. If yes, attend to next chunk. If no, stay.

Advantage: Can process audio in real-time (with bounded latency).

11. Deep Dive: Tacotron 2 Architecture Details

Encoder:

  • Character embeddings → 3 Conv layers → Bidirectional LSTM.
  • Output: Encoded representation of text.

Decoder:

  • Prenet: 2 FC layers with dropout (helps with generalization).
  • Attention RNN: 1-layer LSTM.
  • Decoder RNN: 2-layer LSTM.
  • Output: Predicts 80-dim mel-spectrogram frame.

Postnet:

  • 5 Conv layers.
  • Refines the mel-spectrogram (adds high-frequency details).

Stop Token:

  • Decoder also predicts “Should I stop?” at each step.
  • Training: Sigmoid BCE loss.

12. Deep Dive: Vocoder Evolution

Goal: Mel-Spectrogram → Waveform.

WaveNet (2016):

  • Autoregressive CNN.
  • Generates one sample at a time (16kHz = 16,000 samples/sec).
  • Slow: 1 second of audio takes 10 seconds to generate.

WaveGlow (2018):

  • Flow-based model.
  • Parallel generation (all samples at once).
  • Fast: Real-time on GPU.

HiFi-GAN (2020):

  • GAN-based.
  • Fastest: 167x faster than real-time on V100.
  • Quality: Indistinguishable from ground truth.

13. System Design: Low-Latency TTS

Scenario: Voice assistant (Alexa, Siri).

Requirements:

  • Latency: < 300ms from text to first audio chunk.
  • Quality: Natural, expressive.

Architecture:

  1. Text Normalization: “Dr.” → “Doctor”, “$100” → “one hundred dollars”.
  2. Grapheme-to-Phoneme (G2P): “read” → /riːd/ or /rɛd/ (context-dependent).
  3. Prosody Prediction: Predict pitch, duration, energy.
  4. Acoustic Model: FastSpeech 2 (non-autoregressive, parallel).
  5. Vocoder: HiFi-GAN.

Streaming:

  • Generate mel-spectrogram in chunks (e.g., 50ms).
  • Vocoder processes each chunk independently.

14. Deep Dive: Translatotron (Direct S2ST)

Challenge: No parallel S2ST data (Audio_Spanish → Audio_English).

Solution: Weak supervision.

  1. Use ASR to get Spanish text.
  2. Use MT to get English text.
  3. Use TTS to get English audio.
  4. Train Translatotron on (Audio_Spanish, Audio_English) pairs.

Architecture:

  • Encoder: Processes Spanish audio.
  • Decoder: Generates English mel-spectrogram.
  • Speaker Encoder: Preserves source speaker’s voice.

15. Code: Simple Seq2Seq ASR

import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqASR(nn.Module):
    def __init__(self, input_dim, hidden_dim, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

        # Encoder: Bi-LSTM
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers=3,
                               batch_first=True, bidirectional=True)

        # Attention (simplified: scores depend only on encoder states, not the decoder state)
        self.attention = nn.Linear(hidden_dim * 2, 1)

        # Decoder: LSTM
        self.decoder = nn.LSTM(vocab_size + hidden_dim * 2, hidden_dim,
                               num_layers=2, batch_first=True)

        # Output projection
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio_features, text_input):
        # Encode audio
        encoder_out, _ = self.encoder(audio_features)  # (B, T, 2*H)

        # Decode with teacher forcing
        outputs = []
        hidden = None

        for t in range(text_input.size(1)):
            # Compute attention over encoder states
            attn_scores = self.attention(encoder_out).squeeze(-1)        # (B, T)
            attn_weights = torch.softmax(attn_scores, dim=1)             # (B, T)
            context = torch.bmm(attn_weights.unsqueeze(1), encoder_out)  # (B, 1, 2*H)

            # Decoder input: previous char (one-hot) + context
            prev_char = F.one_hot(text_input[:, t:t+1], self.vocab_size).float()  # (B, 1, V)
            decoder_input = torch.cat([prev_char, context], dim=-1)

            # Decoder step
            decoder_out, hidden = self.decoder(decoder_input, hidden)

            # Predict next char
            logits = self.fc(decoder_out)
            outputs.append(logits)

        return torch.cat(outputs, dim=1)  # (B, L, V)

16. Production Considerations

  1. Model Size: Quantize to INT8 for mobile deployment.
  2. Latency: Use non-autoregressive models (FastSpeech, Conformer-CTC).
  3. Personalization: Fine-tune on user’s voice (few-shot learning).
  4. Multilingual: Train on 100+ languages (mSLAM, Whisper).

17. Deep Dive: The Evolution of Seq2Seq in Speech

To understand where we are, we must understand the journey.

1. HMM-GMM (1980s-2010):

  • Acoustic Model: Gaussian Mixture Models (GMM) modeled the probability of audio features given a phoneme state.
  • Sequence Model: Hidden Markov Models (HMM) modeled the transition between states.
  • Pros: Mathematically rigorous, worked on small data.
  • Cons: Assumed independence of frames (false), complex training pipeline.

2. DNN-HMM (2010-2015):

  • Replaced GMM with Deep Neural Networks (DNN).
  • Impact: Massive drop in WER (30% relative).
  • Cons: Still relied on HMM for alignment.

3. CTC (2006, popularized 2015):

  • Connectionist Temporal Classification.
  • First “End-to-End” loss.
  • Allowed predicting sequences shorter than input without explicit alignment.
  • Cons: Conditional independence assumption (output at time t depends only on input, not previous outputs).

4. LAS (2016):

  • Listen-Attend-Spell.
  • Introduced Attention to speech.
  • Pros: No independence assumption.
  • Cons: Not streaming (needs full audio to attend).

5. RNN-T (2012, popularized 2018):

  • RNN Transducer.
  • Combined CTC (streaming) with LAS (label dependency).
  • Result: The standard for streaming ASR (Pixel, Siri, Alexa).

6. Transformer & Conformer (2017-Present):

  • Self-attention replaces RNNs.
  • Conformer: Combines CNN (local patterns) with Transformer (global context).

18. Deep Dive: Connectionist Temporal Classification (CTC)

Problem: Audio has T frames (e.g., 1000). Text has L characters (e.g., 50). T \gg L. How do we map them?

CTC Solution:

  • Introduce a blank token \epsilon.
  • Output sequence length is T.
  • Collapse: Remove repeats and blanks.
  • h h e e l l l l o → helo (repeats collapse, so the double “l” is lost without a blank).
  • h h \epsilon e e l l \epsilon l l o → hello (the blank \epsilon keeps the two “l”s separate).

Loss Function: P(Y|X) = \sum_{A \in \mathcal{B}^{-1}(Y)} P(A|X)

  • Sum over all valid alignments A that collapse to Y.
  • Computed efficiently using Dynamic Programming (Forward-Backward algorithm).
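
A minimal sketch of computing this loss with PyTorch's built-in nn.CTCLoss; shapes and the vocabulary size are placeholders:

import torch
import torch.nn as nn

T, B, C, L = 100, 4, 30, 20                            # frames, batch, vocab (incl. blank), target length
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)   # acoustic model outputs, (T, B, C)
targets = torch.randint(1, C, (B, L))                  # index 0 is reserved for the blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)          # forward-backward runs inside
loss = ctc(log_probs, targets, input_lengths, target_lengths)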

Pros:

  • Fast inference (O(1) per step).
  • Streaming friendly.

Cons:

  • Can’t model language (e.g., “pair” vs “pear” sound the same). Needs an external Language Model.

19. Deep Dive: RNN-Transducer (RNN-T)

Architecture:

  1. Encoder (Audio): Processes audio frames x_t. Produces h_t^{enc}.
  2. Prediction Network (Text): Processes previous non-blank output y_{u-1}. Produces h_u^{pred}. (Like an LM).
  3. Joint Network: Combines them:
     z_{t,u} = \text{ReLU}(W_{enc} h_t^{enc} + W_{pred} h_u^{pred})
     P(y|t,u) = \text{softmax}(W_{out} z_{t,u})
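
A minimal sketch of the joint network; layer sizes and names are assumptions:

import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, joint_dim)
        self.w_pred = nn.Linear(pred_dim, joint_dim)
        self.w_out = nn.Linear(joint_dim, vocab_size)   # vocab includes the blank

    def forward(self, h_enc, h_pred):
        # h_enc: (B, T, enc_dim) from the audio encoder
        # h_pred: (B, U, pred_dim) from the prediction network
        z = torch.relu(self.w_enc(h_enc).unsqueeze(2) + self.w_pred(h_pred).unsqueeze(1))  # (B, T, U, joint_dim)
        return self.w_out(z)   # logits over the (T, U) lattice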

Inference (Greedy):

  • If output is non-blank (y), feed to Prediction Network, increment u. Stay at audio frame t.
  • If output is blank (\epsilon), move to next audio frame t+1.

Why it wins:

  • Streaming: Encoder is causal.
  • Accuracy: Prediction network models label dependencies (unlike CTC).
  • Latency: Controllable.

20. Deep Dive: Transformer vs Conformer

Transformer:

  • Great at global context (Self-Attention).
  • Bad at local fine-grained details (needs deep layers to see local patterns).
  • Positional encodings are brittle for varying audio lengths.

Conformer (Convolution-augmented Transformer):

  • Macaron Style: Feed-Forward -> Multi-Head Attention -> Conv Module -> Feed-Forward.
  • Conv Module: Pointwise Conv -> Gated Linear Unit (GLU) -> Depthwise Conv -> BatchNorm -> Swish -> Pointwise Conv.
  • Why:
    • CNNs capture local features (formants, transitions).
    • Transformers capture global semantics.
  • Result: SOTA on LibriSpeech.
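
A sketch of the convolution module described above; the kernel size and the 2x expansion before the GLU are typical choices, not prescribed here:

import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)   # expand, then GLU halves back
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                        # x: (B, T, d_model)
        y = self.norm(x).transpose(1, 2)         # (B, d_model, T) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.swish(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                             # residual connection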

21. Deep Dive: FastSpeech 2 (Non-Autoregressive TTS)

Problem with Tacotron 2:

  • Autoregressive (slow).
  • Unstable attention (skipping/repeating).
  • Hard to control prosody (speed, pitch).

FastSpeech 2 Architecture:

  1. Encoder: Feed-Forward Transformer.
  2. Variance Adaptor:
    • Duration Predictor: How many frames does this phoneme last? (Trained on forced alignment).
    • Pitch Predictor: Predicts F0 contour.
    • Energy Predictor: Predicts volume.
    • Adds embeddings of these predictions to the encoder output.
  3. Length Regulator: Expands hidden states based on duration (e.g., “a” lasts 5 frames -> repeat vector 5 times).
  4. Decoder: Feed-Forward Transformer.
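
A minimal sketch of the Length Regulator from step 3, using torch.repeat_interleave; shapes and the example durations are illustrative:

import torch

def length_regulate(hidden, durations):
    # hidden: (L, d_model) phoneme-level encoder outputs
    # durations: (L,) integer frame counts from the duration predictor
    return torch.repeat_interleave(hidden, durations, dim=0)   # (sum(durations), d_model)

h = torch.randn(3, 8)              # e.g. phonemes "k", "ae", "t"
d = torch.tensor([2, 5, 3])        # "ae" is predicted to last 5 frames
frames = length_regulate(h, d)     # (10, 8) frame-level sequence for the decoder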

Pros:

  • Fast: Parallel generation.
  • Robust: No attention failures.
  • Controllable: Can explicitly set “speak 1.2x faster” or “raise pitch”.

22. Deep Dive: VITS (Conditional Variational Autoencoder)

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is the current SOTA for E2E TTS.

Key Idea: Combine Acoustic Model and Vocoder into one flow-based model.

Architecture:

  • Posterior Encoder: Encodes linear spectrogram into latent z.
  • Prior Encoder: Predicts distribution of z from text (conditioned on alignment).
  • Decoder (Generator): Transforms z into waveform (HiFi-GAN style).
  • Stochastic Duration Predictor: Models duration uncertainty.

Training:

  • Reconstruction Loss: Mel-spectrogram L1 loss.
  • KL Divergence: Between Posterior and Prior.
  • Adversarial Loss: Discriminator tries to distinguish real vs generated audio.

Result: Extremely natural, high-fidelity speech.

23. System Design: Real-Time Speech Translation System

Scenario: “Universal Translator” device. English Audio -> Spanish Audio.

Constraints:

  • Latency: < 2 seconds lag.
  • Compute: Edge device (limited).

Architecture Choices:

Option A: Cascade (ASR -> MT -> TTS)

  • Pros: Modular. Can use SOTA for each.
  • Cons: Error propagation. Latency adds up. Loss of paralinguistics (tone).

Option B: Direct S2ST (Audio -> Audio)

  • Pros: Fast. Preserves tone.
  • Cons: Data scarcity. Hard to train.

Hybrid Design (Production):

  1. Streaming ASR (RNN-T): Generates English text stream.
  2. Streaming MT: Translates partial sentences. “Hello” -> “Hola”.
  3. Incremental TTS: Generates audio for “Hola” immediately.

Wait-k Policy:

  • MT model waits for k words before translating.
  • Balances context (accuracy) vs latency.
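
A minimal sketch of the wait-k read/write loop; translate_step is an assumed callback that returns the next target token given the source prefix and the target so far (or None when finished):

def wait_k_stream(source_tokens, translate_step, k=3):
    target = []
    for i in range(len(source_tokens)):
        if i + 1 < k:
            continue                                                    # still reading the first k words
        target.append(translate_step(source_tokens[:i + 1], target))    # write one, then read one
    while True:                                                         # source exhausted: flush the rest
        tok = translate_step(source_tokens, target)
        if tok is None:
            break
        target.append(tok)
    return target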

24. Case Study: OpenAI Whisper

Goal: Robust ASR that works on “in the wild” audio.

Data:

  • 680,000 hours of web audio.
  • Weak Supervision: Transcripts from ASR systems, subtitles (noisy).
  • Multitask:
    • English Transcription.
    • Any-to-English Translation.
    • Language Identification.
    • Voice Activity Detection.

Architecture:

  • Standard Encoder-Decoder Transformer.
  • Input: Log-Mel Spectrogram (30 seconds).
  • Output: Text tokens.

Key Features:

  • Task Tokens: <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>.
  • Long-form: Processes 30s chunks. Uses previous chunk’s text as prompt (context).

Impact:

  • Zero-shot performance on many datasets matches supervised models.
  • Proved that Data Scale > Model Architecture.
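
For reference, a minimal usage sketch assuming the open-source whisper Python package is installed; the file name and prompt string are placeholders:

import whisper

model = whisper.load_model("small")

# task="transcribe" keeps the source language; task="translate" outputs English text.
# initial_prompt carries context from the previous 30s chunk for long-form audio.
result = model.transcribe("meeting.wav", task="transcribe",
                          initial_prompt="Text of the previous chunk goes here.")
print(result["text"])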

25. Case Study: Google USM (Universal Speech Model)

Goal: One model for 300+ languages.

Architecture:

  • Conformer (2 Billion parameters).
  • MOST (Multi-Objective Supervised Training):
    • BEST-RQ: Self-supervised loss (BERT-style on audio).
    • Text-Injection: Train on text-only data (using shared encoder layers).
    • ASR: Supervised loss on labeled audio.

Result:

  • SOTA on 73 languages.
  • Enables ASR for languages with < 10 hours of data.

26. Deep Dive: Evaluation Metrics

ASR:

  • WER (Word Error Rate): \frac{S + D + I}{N}
    • S: Substitutions, D: Deletions, I: Insertions, N: Total words in the reference.
  • CER (Character Error Rate): For languages without spaces (Chinese).
  • RTF (Real-Time Factor): \frac{\text{Processing Time}}{\text{Audio Duration}}. RTF < 1 means the system keeps up with real time.
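
A minimal WER sketch via word-level edit distance (whitespace tokenization assumed):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])          # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)      # deletion, insertion
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))   # 1 insertion / 3 reference words ≈ 0.33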

TTS:

  • MOS (Mean Opinion Score): Humans rate naturalness 1-5.
  • MCD (Mel Cepstral Distortion): Objective distance between generated and ground truth spectrograms.

ST (Speech Translation):

  • BLEU: Standard MT metric.

27. Code: Implementing Beam Search for Seq2Seq

Greedy decoding (pick max prob) is suboptimal. Beam search explores k paths.

import torch

def beam_search_decoder(model, encoder_out, beam_width=3, max_len=50):
    # Start with [SOS] token
    # Beam: list of (sequence, score, hidden_state)
    start_token = model.vocab['<sos>']
    beam = [([start_token], 0.0, None)]

    for _ in range(max_len):
        candidates = []

        for seq, score, hidden in beam:
            # Finished hypotheses are carried over unchanged
            if seq[-1] == model.vocab['<eos>']:
                candidates.append((seq, score, hidden))
                continue

            # Predict next token
            last_token = torch.tensor([[seq[-1]]])
            output, new_hidden = model.decoder(last_token, hidden, encoder_out)
            log_probs = torch.log_softmax(output, dim=-1).view(-1)   # flatten to (vocab,)

            # Get top k extensions of this hypothesis
            topk_probs, topk_ids = log_probs.topk(beam_width)

            for i in range(beam_width):
                token = topk_ids[i].item()
                prob = topk_probs[i].item()
                candidates.append((seq + [token], score + prob, new_hidden))

        # Keep the k best candidates (highest cumulative log-probability)
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]

        # Stop early if every hypothesis has emitted <eos>
        if all(c[0][-1] == model.vocab['<eos>'] for c in beam):
            break

    return beam[0][0]  # Return best sequence

28. Production: Serving Speech Models with Triton

NVIDIA Triton Inference Server is standard for deploying speech models.

Pipeline:

  1. Feature Extractor (Python Backend): Audio -> Mel-Spec.
  2. Encoder (TensorRT): Mel-Spec -> Hidden States.
  3. Decoder (TensorRT): Hidden States -> Text (Beam Search).

Dynamic Batching:

  • Triton groups requests arriving within 5ms into a batch.
  • Increases GPU utilization significantly.

Ensemble Model:

  • Define a DAG of models.
  • Client sends audio, Triton handles the flow.

29. Deep Dive: End-to-End Speech-to-Speech Translation (S2ST)

Unit-Based S2ST:

  • Instead of spectrograms, predict discrete acoustic units (from HuBERT or k-means).
  • Advantage: Discrete tokens allow using standard Transformer (like NLP).
  • Vocoder: Unit HiFi-GAN converts discrete units to waveform.

SpeechMatrix:

  • Mining parallel speech from 100k hours of multilingual audio.
  • Uses LASER embeddings to find matching sentences in different languages.

30. Summary

Task      | Model         | Input        | Output
ASR       | LAS, Whisper  | Audio        | Text
TTS       | Tacotron 2    | Text         | Audio
ST        | Translatotron | Audio (L1)   | Audio (L2)
Vocoder   | HiFi-GAN      | Mel-Spec     | Waveform
Streaming | RNN-T         | Audio Stream | Text Stream
Fast TTS  | FastSpeech 2  | Text         | Mel-Spec

31. Deep Dive: Self-Supervised Learning (SSL) in Speech

Problem: Labeled data (audio + text) is expensive. Unlabeled audio is free.

Wav2Vec 2.0 (Meta):

  • Idea: Mask parts of the audio latent space and predict the quantized representation of the masked part.
  • Contrastive Loss: Identify the correct quantized vector among distractors.
  • Result: Can reach SOTA with only 10 minutes of labeled data (after pre-training on 53k hours).

HuBERT (Hidden Unit BERT):

  • Idea: Offline clustering (k-means) of MFCCs to generate pseudo-labels.
  • Masked Prediction: BERT-style objective. Predict the cluster ID of masked frames.
  • Result: More robust than Wav2Vec 2.0.

Impact on Seq2Seq:

  • Initialize the Encoder with Wav2Vec 2.0 / HuBERT.
  • Add a random Decoder.
  • Fine-tune on ASR task.
  • Benefit: Drastic reduction in required labeled data.
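
A minimal sketch of this recipe assuming the HuggingFace transformers package; the checkpoint name and output size are examples:

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")   # pre-trained SSL encoder
head = nn.Linear(encoder.config.hidden_size, 32)                    # randomly initialized output head

waveform = torch.zeros(1, 16000)                                     # 1 s of 16 kHz audio (placeholder)
hidden = encoder(waveform).last_hidden_state                         # (B, T', hidden_size)
logits = head(hidden)                                                # fine-tune both on labeled ASR data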

32. Deep Dive: Multilingual Seq2Seq

Goal: One model for 100 languages.

Challenges:

  • Data Imbalance: English has 1M hours, Swahili has 100.
  • Script Diversity: Latin, Cyrillic, Chinese, Arabic.
  • Phonetic Overlap: “P” in English \neq “P” in French.

Solutions:

  1. Language ID Token: <|en|>, <|fr|>.
  2. Shared Vocabulary: SentencePiece (BPE) trained on all languages.
  3. Balancing Sampling: Up-sample low-resource languages during training. p_l \propto (\frac{N_l}{N_{total}})^\alpha where \alpha < 1 (e.g., 0.3) flattens the distribution.
  4. Adapter Modules: Small, language-specific layers inserted into the frozen giant model.
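
A small sketch of the temperature-based sampling from item 3; the hour counts are made up:

def sampling_distribution(hours_per_language, alpha=0.3):
    # alpha < 1 flattens the distribution, up-weighting low-resource languages
    total = sum(hours_per_language.values())
    weights = {lang: (h / total) ** alpha for lang, h in hours_per_language.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

print(sampling_distribution({"en": 1_000_000, "sw": 100}))
# English still dominates, but Swahili's sampling share rises from ~0.01% to roughly 6%.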

33. Deep Dive: End-to-End SLU (Spoken Language Understanding)

Traditional: Audio → ASR → Text → NLP → Intent/Slots. E2E SLU: Audio → Intent/Slots.

Why E2E?

  • Error Robustness: ASR might transcribe “play jazz” as “play jas”. NLP fails. E2E model learns acoustic features of “jazz”.
  • Paralinguistics: “Yeah right” (sarcastic) -> Negative Sentiment. Text-only NLP misses this.

Architecture:

  • Encoder: Pre-trained Wav2Vec 2.0.
  • Decoder: Predicts semantic frame directly.
  • [INTENT: PlayMusic] [ARTIST: The Beatles] [GENRE: Rock]

Challenges:

  • Lack of labeled Audio-to-Semantics data.
  • Solution: Transfer learning from ASR models.

34. Deep Dive: Attention Visualization Code

Visualizing attention maps is the best way to debug Seq2Seq models.

import matplotlib.pyplot as plt
import seaborn as sns

def plot_attention(attention_matrix, input_text, output_text):
    """
    attention_matrix: (Output_Len, Input_Len) numpy array
    """
    plt.figure(figsize=(10, 8))
    sns.heatmap(attention_matrix, cmap='viridis',
                xticklabels=input_text, yticklabels=output_text)
    plt.xlabel('Encoder Input')
    plt.ylabel('Decoder Output')
    plt.show()

# Example usage
# attn = model.get_attention_weights(...)
# plot_attention(attn, audio_frames, predicted_words)

What to look for:

  • Diagonal: Good alignment.
  • Vertical Line: Decoder is stuck on one frame (repeating output).
  • Horizontal Line: Encoder frame is ignored.
  • Fuzzy: Weak attention (low confidence).

35. Deep Dive: Handling Long-Form Audio

Standard Transformers have O(N^2) attention complexity. 30s audio = 1500 frames. 1 hour = 180,000 frames.

Strategies:

  1. Chunking (Whisper):
    • Slice audio into 30s segments.
    • Transcribe independently.
    • Problem: Context cut off at boundaries.
    • Fix: Pass previous segment’s text as prompt.
  2. Streaming (RNN-T):
    • Process frame-by-frame. Infinite length.
    • Problem: No future context.
  3. Sparse Attention (BigBird / Longformer):
    • Attend only to local window + global tokens.
    • O(N) complexity.
  4. Block-Processing (Emformer):
    • Block-wise processing with memory bank for history.

36. Deep Dive: SpeechGPT / AudioLM

  • Treat audio tokens and text tokens as the same thing.
  • Tokenizer: SoundStream / EnCodec (Neural Audio Codec).
  • Model: Decoder-only Transformer (GPT).
  • Training:
    • Text-only data (Web).
    • Audio-only data (Radio).
    • Paired data (ASR).

Capabilities:

  • Speech-to-Speech Translation: “Translate this to French” (Audio input) -> Audio output.
  • Voice Continuation: Continue speaking in the user’s voice.
  • Zero-shot TTS: “Say ‘Hello’ in this voice: [Audio Prompt]”.

37. Ethical Considerations: Voice Cloning & Deepfakes

Seq2Seq TTS (Vall-E) can clone a voice with 3 seconds of audio.

Risks:

  • Fraud: Impersonating CEOs or relatives.
  • Disinformation: Fake speeches by politicians.
  • Harassment: Fake audio of individuals.

Mitigation:

  • Watermarking: Embed inaudible signals in generated audio.
  • Detection: Train classifiers to detect artifacts of synthesis.
  • Regulation: “Know Your Customer” for TTS APIs.

38. Deep Dive: The Mathematics of Transformers in Speech

Why did Transformers replace RNNs?

1. Self-Attention Complexity:

  • RNN: O(N) sequential operations. Cannot parallelize.
  • Transformer: O(1) sequential operations (parallelizable). O(N^2) memory.
  • Benefit: We can train on 1000 GPUs efficiently.

2. Positional Encodings in Speech:

  • Text uses absolute sinusoidal encodings.
  • Speech Problem: Audio length varies wildly. 10s vs 10m.
  • Solution: Relative Positional Encoding.
  • Instead of adding P_i to input X_i, add a bias b_{i-j} to the attention score A_{ij}.
  • Allows the model to generalize to audio lengths unseen during training.

3. Subsampling:

  • Audio is high-frequency (100 frames/sec). Text is low-frequency (3 chars/sec).
  • Conv Subsampling: First 2 layers of Encoder are strided Convolutions (stride 2x2 = 4x reduction).
  • Reduces sequence length N \to N/4. Reduces attention cost N^2 \to (N/4)^2 = N^2/16.
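
A sketch of the 4x convolutional subsampling front-end; kernel sizes and the final projection are typical choices rather than anything prescribed above:

import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    def __init__(self, d_model, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * ((n_mels + 3) // 4), d_model)

    def forward(self, x):                        # x: (B, T, n_mels) log-mel features
        y = self.conv(x.unsqueeze(1))            # (B, d_model, ~T/4, ~n_mels/4)
        b, c, t, f = y.shape
        return self.proj(y.permute(0, 2, 1, 3).reshape(b, t, c * f))   # (B, ~T/4, d_model)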

39. Deep Dive: Troubleshooting Seq2Seq Models

1. The “Hallucination” Problem:

  • Symptom: Model outputs “Thank you Thank you Thank you” during silence.
  • Cause: Decoder language model is too strong; it predicts likely text even without acoustic evidence.
  • Fix:
    • Voice Activity Detection (VAD): Filter out silence.
    • Coverage Penalty: Penalize attending to the same frames repeatedly.

2. The “NaN” Loss:

  • Symptom: Training crashes.
  • Cause: Exploding gradients in LSTM or division by zero in BatchNorm.
  • Fix:
    • Gradient Clipping (norm 1.0).
    • Warmup learning rate.
    • Check for empty transcripts in data.

3. The “Babble” Problem:

  • Symptom: Output is gibberish.
  • Cause: CTC alignment failed or Attention didn’t converge.
  • Fix:
    • Curriculum Learning: Train on short utterances first, then long.
    • Guided Attention Loss: Force diagonal alignment for first few epochs.

40. Deep Dive: The Future - Audio-Language Models

The boundary between Speech and NLP is blurring.

AudioLM (Google):

  • Treats audio as a sequence of discrete tokens (using SoundStream codec).
  • Uses a GPT-style decoder to generate audio tokens.
  • Capabilities:
    • Speech Continuation: Given 3s of speech, continue speaking in the same voice and style.
    • Zero-Shot TTS: Generate speech from text in a target voice without fine-tuning.

SpeechGPT (Fudan University):

  • Fine-tunes LLaMA on paired speech-text data.
  • Modality Adaptation: Teaches the LLM to understand discrete audio tokens.
  • Result: A chatbot you can talk to, which talks back with emotion and nuance.

Implication:

  • Seq2Seq models (Encoder-Decoder) might be replaced by Decoder-only LLMs that handle all modalities (Text, Audio, Image) in a unified token space.

41. Ethical Considerations in Seq2Seq Speech

1. Deepfakes & Voice Cloning:

  • Models like Vall-E can clone a voice from 3 seconds of audio.
  • Risk: Fraud (fake CEO calls), harassment, disinformation.
  • Mitigation:
    • Watermarking: Embed inaudible signals in generated audio.
    • Detection: Train classifiers to detect synthesis artifacts.

2. Bias in ASR:

  • Models trained on LibriSpeech (audiobooks) fail on AAVE (African American Vernacular English) or Indian accents.
  • Fix: Diverse training data. “Fairness-aware” loss functions that penalize disparity between groups.

3. Privacy:

  • Smart speakers listen constantly.
  • Fix: On-Device Processing. Run the Seq2Seq model on the phone’s NPU, never sending audio to the cloud.

42. Further Reading

To master Seq2Seq speech models, these papers are essential:

  1. “Listen, Attend and Spell” (Chan et al., 2016): The paper that introduced Attention to ASR.
  2. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions” (Shen et al., 2018): The Tacotron 2 paper.
  3. “Attention Is All You Need” (Vaswani et al., 2017): The Transformer paper (foundation of modern speech).
  4. “Conformer: Convolution-augmented Transformer for Speech Recognition” (Gulati et al., 2020): The current SOTA architecture for ASR.
  5. “Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (Baevski et al., 2020): The SSL revolution.
  6. “Whisper: Robust Speech Recognition via Large-Scale Weak Supervision” (Radford et al., 2022): How scale beats architecture.

43. Summary

Task            | Model                  | Input           | Output
ASR             | LAS, Whisper           | Audio           | Text
TTS             | Tacotron 2             | Text            | Audio
ST              | Translatotron          | Audio (L1)      | Audio (L2)
Vocoder         | HiFi-GAN               | Mel-Spec        | Waveform
Streaming       | RNN-T                  | Audio Stream    | Text Stream
Fast TTS        | FastSpeech 2           | Text            | Mel-Spec
SSL             | Wav2Vec 2.0            | Masked Audio    | Quantized Vector
E2E SLU         | SLU-BERT               | Audio           | Intent/Slots
Troubleshooting | VAD, Gradient Clipping | -               | -
Future          | AudioLM                | Discrete Tokens | Discrete Tokens

FAQ

What is the difference between CTC and attention-based ASR?

CTC enforces monotonic alignment between audio and text and enables streaming inference, but assumes conditional independence between output tokens at each time step. Attention-based models like LAS capture label dependencies and long-range context but require the full input before decoding. Hybrid CTC-Attention training combines both strengths for production systems.

Why did Conformer replace Transformers for speech recognition?

Conformer augments Transformer self-attention with convolution modules in a macaron-style architecture where each block contains Feed-Forward, Multi-Head Attention, Convolution, and Feed-Forward sublayers. CNNs capture local acoustic features like formants and transitions while Transformers handle global semantics, achieving state-of-the-art results that neither architecture achieves alone.

How does Tacotron 2 generate natural-sounding speech?

Tacotron 2 uses a character encoder followed by location-sensitive attention and an autoregressive LSTM decoder that predicts mel-spectrogram frames one at a time. A post-net refines the output, and a vocoder like HiFi-GAN converts the spectrogram to a waveform. Location-sensitive attention prevents word skipping and repeating by using previous alignment weights.

What is RNN-Transducer and why is it the standard for streaming ASR?

RNN-T combines a causal audio encoder with a prediction network that models label dependencies, joined by a network that produces output probabilities. Unlike attention-based models, it processes audio frame-by-frame with controllable latency, making it the standard for on-device streaming ASR in products like Google Pixel, Siri, and Alexa.


Originally published at: arunbaby.com/speech-tech/0037-sequence-to-sequence-speech

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch