Speech Pipeline Dependencies
“Garbage in, Garbage out. Silence in, Hallucination out.”
1. Problem Statement
A modern “Voice Assistant” is not one model. It is a Cascade of Models.
- VAD: Is someone speaking?
- Diarization: Who is speaking?
- ASR: What did they say?
- NLP/Entity: What does it mean?
The Problem: These models depend on each other. If VAD cuts off the first 50ms of a word, ASR fails. If ASR makes a typo, NLP fails. How do we orchestrate this pipeline efficiently?
2. Fundamentals: The Tightly Coupled Chain
Unlike generic microservices, where Service A calls Service B via HTTP, speech pipelines share Time Alignment.
- VAD output: `User spoke [0.5s - 2.5s]`.
- Diarization output: `Speaker A at [0.4s - 2.6s]`.
- The timestamps must match perfectly.
- Error Propagation: A 10% error in VAD can cause a 50% error in Diarization (if it misses the speaker change).
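To make the coupling concrete, here is a minimal sketch (the helper name and segment format are assumptions, not from the original) that reconciles VAD and diarization output on the shared time axis. Note how a small boundary disagreement is silently clipped away:

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def intersect(vad: List[Segment],
              diar: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Clip diarization segments to VAD speech regions on the shared time axis."""
    out = []
    for speaker, d_start, d_end in diar:
        for v_start, v_end in vad:
            start, end = max(d_start, v_start), min(d_end, v_end)
            if start < end:  # only a non-empty overlap survives
                out.append((speaker, start, end))
    return out

# VAD says speech in [0.5, 2.5]; diarization says Speaker A in [0.4, 2.6].
# The 0.1s disagreement at each edge is clipped away.
print(intersect([(0.5, 2.5)], [("A", 0.4, 2.6)]))  # [('A', 0.5, 2.5)]
```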
3. High-Level Architecture
We can view this as a Stream Processing DAG.
```mermaid
graph LR
    A[Microphone Stream] --> B{VAD}
    B -- Silence --> C[Discard / Noise Profile Update]
    B -- Speech --> D[Buffer]
    D --> E["Speaker ID (Embedding)"]
    D --> F["ASR (Transcription)"]
    E --> G[Meeting Transcript Builder]
    F --> G
```
4. Component Deep-Dives
4.1 Voice Activity Detection (VAD)
The Gatekeeper.
- Role: Save compute by ignoring silence. Prevent ASR from hallucinating on noise.
- Latency: Must be < 10ms.
- Models: Silero VAD (RNN), WebRTC VAD (GMM).
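For a sense of what fits in that 10ms budget, here is a minimal sketch using the `webrtcvad` package (the frame-iteration helper is an assumption; the library itself only accepts 10/20/30ms frames of 16-bit mono PCM):

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (permissive) .. 3 (aggressive)

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad: 10/20/30 ms only
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM

def speech_frames(pcm: bytes):
    """Yield (offset_sec, is_speech) for each 30 ms frame of raw PCM."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        yield i / 2 / SAMPLE_RATE, vad.is_speech(frame, SAMPLE_RATE)
```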
4.2 Speaker Diarization
The labeler.
- Role: Assign “Speaker 1” vs “Speaker 2”.
- Complexity: $O(N^2)$ (Clustering). Hard to do streaming.
- Solution: “Online Diarization” keeps a centroid for active speakers and assigns new frames to nearest centroid.
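A minimal sketch of that online scheme (the 0.7 threshold is an assumption; tune it per embedding model): each speaker is a running centroid of unit-normalized embeddings, and a new frame either joins the nearest centroid or spawns a new speaker. This is O(S) per frame for S speakers, instead of re-clustering everything.

```python
import numpy as np

class OnlineDiarizer:
    """Assign each new speaker embedding to the nearest running centroid."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.sums: list[np.ndarray] = []  # running sum of unit embeddings per speaker

    def assign(self, emb: np.ndarray) -> int:
        emb = emb / np.linalg.norm(emb)
        if self.sums:
            # Cosine similarity to each centroid (the normalized running sum)
            sims = [float(s @ emb) / np.linalg.norm(s) for s in self.sums]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                self.sums[best] += emb  # O(1) online centroid update
                return best
        self.sums.append(emb.copy())    # nothing close enough -> new speaker
        return len(self.sums) - 1
```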
4.3 ASR
The Heavy Lifter.
- Takes the buffered audio from VAD.
- Returns text + timestamps.
5. Data Flow: The “Turn” Concept
Processing audio byte-by-byte is inefficient for ASR. We group audio into Turns (Utterances).
- VAD detects `Speech Start`.
- Pipeline starts Accumulating audio into RAM.
- VAD detects `Speech End` (trailing silence > 500ms).
- Trigger: Send the accumulated buffer to Diarization and ASR in parallel.
- Merge: Combine `Speaker=John` and `Text="Hello"`.
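A sketch of the end-of-turn logic with the 500ms trailing-silence rule (frame counting and the 30ms frame size are assumptions; real systems also cap maximum turn length):

```python
class TurnDetector:
    """Declare end-of-turn after more than 500 ms of trailing silence."""

    def __init__(self, frame_ms: int = 30, silence_ms: int = 500):
        self.silence_frames_needed = silence_ms // frame_ms + 1
        self.silent_run = 0
        self.in_turn = False

    def update(self, is_speech: bool) -> str | None:
        """Feed one VAD decision per frame; returns 'start', 'end', or None."""
        if is_speech:
            self.silent_run = 0
            if not self.in_turn:
                self.in_turn = True
                return "start"
        elif self.in_turn:
            self.silent_run += 1
            if self.silent_run >= self.silence_frames_needed:
                # Trailing silence exceeded the budget -> close the turn
                self.in_turn = False
                self.silent_run = 0
                return "end"
        return None
```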
6. Model Selection & Trade-offs
| Stage | Model Option A (Fast) | Model Option B (Accurate) | Selection Logic |
|---|---|---|---|
| VAD | WebRTC (CPU) | Silero (NN) | Use Silero. Accuracy is vital. Cost is low. |
| Diarization | Speaker Embedding (ResNet) | End-to-End (EEND) | Use Embedding. EEND is too slow for real-time. |
| ASR | Whisper-Tiny | Whisper-Large | Use Tiny for streaming, Large for final correction. |
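The ASR row implies a two-pass design. A sketch assuming the openai-whisper package (the two-pass split itself, not the exact API, is the point):

```python
import whisper

fast = whisper.load_model("tiny")   # streaming partials: low latency, rough text
slow = whisper.load_model("large")  # final pass: slow, accurate

def partial_hypothesis(audio):
    """Cheap pass on the growing buffer, shown to the user immediately."""
    return fast.transcribe(audio)["text"]

def final_transcript(audio):
    """Accurate pass once the turn is complete; overwrites the partial."""
    return slow.transcribe(audio)["text"]
```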
7. Implementation: The Pipeline Class
```python
import torch
import whisper                           # openai-whisper
from silero_vad import load_silero_vad   # pip install silero-vad

class SpeechPipeline:
    def __init__(self):
        self.vad_model = load_silero_vad()
        self.asr_model = whisper.load_model("tiny")
        self.buffer = []

    def process_frame(self, audio_chunk: torch.Tensor):
        # 1. Filter Silence: Silero returns P(speech) for this chunk
        speech_prob = self.vad_model(audio_chunk, 16000).item()
        if speech_prob > 0.5:
            self.buffer.append(audio_chunk)
        elif self.buffer:
            # Trailing silence detected -> End of Turn
            full_audio = torch.cat(self.buffer)
            self.buffer = []  # Reset for the next turn
            # 2. Trigger ASR (The Dependency)
            text = self.asr_model.transcribe(full_audio)["text"]
            print(f"Turn Complete: {text}")
```
8. Streaming Implications
In a true streaming pipeline, we cannot wait for “End of Turn”. We use Speculative Execution.
- ASR runs continuously on the partial buffer: `H -> He -> Hel -> Hello`.
- Diarization runs every 1 second: `Speaker A`.
- Correction: If Diarization changes its mind (`Speaker A` -> `Speaker B`), we send a “Correction Event” to the UI to overwrite the previous line.
9. Quality Metrics
- DER (Diarization Error Rate): `False Alarm + Missed Detection + Confusion`, measured as durations over total speech time.
- cpWER (Concatenated minimum-Permutation WER): WER calculated per speaker, then concatenated. Reveals whether the model is biased against, say, Speaker B.
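Spelled out, each DER term is a duration, normalized by the total amount of reference speech:

$$
\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed}} + T_{\text{confusion}}}{T_{\text{total speech}}}
$$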
10. Common Failure Modes
- The “Schrödinger’s Word”: A word at the boundary of a VAD cut.
- User: “Important.”
- VAD cuts at 0.1s.
- Audio: “…portant.”
- ASR: “Portent.”
- Fix: Padding. Always keep 200ms of history before the VAD trigger (see the sketch after this list).
- Overlapping Speech: Two people talk at once.
- Standard ASR fails.
- Standard Diarization fails.
- Fix: Source Separation models (rarely used in production due to cost).
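A sketch of the padding fix from the first failure mode (the chunk size and exact pre-roll length are assumptions): a small ring buffer of recent audio is prepended when a turn starts, so ASR hears “Important”, not “portant”.

```python
from collections import deque

import torch

class PaddedBuffer:
    """Keep ~200 ms of pre-roll so a VAD trigger doesn't clip word onsets."""

    def __init__(self, sample_rate: int = 16000, pad_ms: int = 200,
                 chunk_samples: int = 512):
        n_chunks = (pad_ms * sample_rate // 1000) // chunk_samples + 1
        self.history = deque(maxlen=n_chunks)   # rolling pre-speech audio
        self.buffer: list[torch.Tensor] = []

    def push(self, chunk: torch.Tensor, is_speech: bool):
        self.history.append(chunk)  # history always rolls forward
        if is_speech:
            if not self.buffer:
                # Turn start: prepend the pre-roll (everything before this chunk)
                self.buffer.extend(list(self.history)[:-1])
            self.buffer.append(chunk)
```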
11. State-of-the-Art
Joint Models (Transducers).
Instead of VAD -> ASR -> NLP, train one massive Transformer:
Input: Audio.
Output: `<speaker:1> Hello <speaker:2> Hi there <sentiment:pos>`.
This removes the pipeline latency but makes modular upgrades impossible.
12. Key Takeaways
- VAD is critical: It is the “Trigger” for the whole DAG. If it’s flawed, the system is flawed.
- Padding saves lives: Never feed exact VAD boundaries to ASR. Add context.
- Latency Budget: If Total Latency limit is 500ms, and ASR takes 400ms, VAD+Diarization must happen in 100ms.
- Async Design: Run ASR and Diarization in parallel threads, not sequentially, to minimize wall-clock time (see the sketch below).
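A sketch of that last point with `asyncio` (`diarizer.identify` is a hypothetical speaker-ID call; `asr_model` is the Whisper model from Section 7):

```python
import asyncio

async def process_turn(audio, asr_model, diarizer):
    """Run ASR and speaker ID on the same finished turn concurrently."""
    # Both are blocking model calls, so push them onto worker threads;
    # wall-clock cost becomes max(ASR, diarization) instead of the sum.
    result, speaker = await asyncio.gather(
        asyncio.to_thread(asr_model.transcribe, audio),
        asyncio.to_thread(diarizer.identify, audio),  # hypothetical call
    )
    return {"speaker": speaker, "text": result["text"]}
```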
Originally published at: arunbaby.com/speech-tech/0049-speech-pipeline-dependencies