When self-hosted TTS beats the API: the 2026 cost cliff
“Every month, your TTS vendor sends an invoice measured in characters. The same characters you could process on a GPU that rents for $619 a month.”
TL;DR
At 100K utterances per month, self-hosting Qwen3-TTS on burst GPU pricing costs $24, versus $2,220 for ElevenLabs Flash or $375 for OpenAI TTS-1. A dedicated L40S at $619 per month breaks even against ElevenLabs Flash at roughly 29,000 utterances. The quality gap has narrowed but not closed: commercial APIs still lead on naturalness (MOS ~4.7 versus ~4.5 for the best open-source models), while self-hosted models win on voice cloning, latency control, and data privacy. This post runs the numbers, compares four open-source TTS models against five commercial APIs, and gives you a decision matrix for production. For the voice agent architectures these TTS models plug into, see voice agent frameworks.

Why are voice teams reconsidering TTS APIs in 2026?
Because the price gap between commercial TTS APIs and self-hosted models becomes striking at scale. ElevenLabs Multilingual v2 on the Scale plan ($330/month) includes 2,000 minutes of audio, with overage at $0.18 per minute. At 100K utterances (25,000 minutes), that adds up to $4,470 per month. A single L40S GPU running Qwen3-TTS costs $0.86 per hour on RunPod — and at burst pricing, the same 100K utterances cost just $24.
The trigger is Qwen3-TTS (arXiv:2601.15621), released by Alibaba’s Qwen team in January 2026. It’s a 1.7B parameter model with 3-second voice cloning, 10-language support, and 101ms time-to-first-audio. Apache 2.0 licensed. It runs on an RTX 4090 at 5.4 GB VRAM with FlashAttention 2.
Qwen3-TTS isn’t alone. Kani-TTS-2 (400M params, 3GB VRAM), Kyutai Pocket TTS (100M params, CPU-only, 6x real-time on two CPU cores), and Kokoro (82M params, MOS 4.5 on CodeSOTA) all shipped in the same window. The open-source TTS space went from “interesting but not production-ready” to “legitimate alternative” in about three months.
The question: at what volume does self-hosting make financial sense, and what quality do you give up?
How does Qwen3-TTS work under the hood?
Most TTS systems chain a text encoder, an acoustic model, and a vocoder as separate stages. Qwen3-TTS collapses this into a single autoregressive language model: text tokens and acoustic tokens are concatenated along the channel axis, and the model predicts the next acoustic token directly from the text context, with no intermediate mel-spectrogram step.
```mermaid
graph LR
A[Input text] --> B[Text tokenizer]
B --> C[Unified LM\n1.7B params]
G[3-sec reference\naudio] --> H[Speaker\nembedding]
H --> C
C --> D[Multi-token\nprediction]
D --> E[Speech tokenizer\ndecoder · 12.5Hz]
E --> F[Streamed audio]
```
The speech tokenizer emits 16 codebooks at 12.5 Hz with a codebook size of 2,048. A Multi-Token Prediction (MTP) module generates the residual codebooks from the zeroth codebook, building up hierarchical acoustic detail without a separate diffusion step.
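To make the MTP idea concrete, here is a toy PyTorch sketch of one 12.5 Hz frame: the LM's hidden state and the already-sampled zeroth-codebook token feed parallel heads for the 15 residual codebooks. Dimensions and wiring are illustrative assumptions on my part, not the released Qwen3-TTS code.

```python
import torch
import torch.nn as nn

# Toy sketch of multi-token prediction over residual codebooks.
# Dimensions and wiring are illustrative assumptions, not the
# released Qwen3-TTS implementation.
N_CODEBOOKS, CODEBOOK_SIZE, D_MODEL = 16, 2048, 512

class MTPHead(nn.Module):
    """Predict the 15 residual codebooks from the LM hidden state
    plus the embedding of the already-sampled zeroth codebook."""
    def __init__(self):
        super().__init__()
        self.cb0_embed = nn.Embedding(CODEBOOK_SIZE, D_MODEL)
        self.heads = nn.ModuleList(
            nn.Linear(D_MODEL, CODEBOOK_SIZE) for _ in range(N_CODEBOOKS - 1)
        )

    def forward(self, hidden, cb0_token):
        # hidden: (batch, d_model), the LM state at this 12.5 Hz frame
        h = hidden + self.cb0_embed(cb0_token)
        # One classification head per residual codebook, all in parallel:
        # no extra autoregressive steps, no diffusion stage.
        return torch.stack([head(h) for head in self.heads], dim=1)

hidden = torch.randn(1, D_MODEL)   # stand-in for the unified LM state
cb0 = torch.tensor([123])          # sampled zeroth-codebook token
logits = MTPHead()(hidden, cb0)    # shape: (1, 15, 2048)
```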
This architecture differs from CosyVoice 3 (also from Alibaba, but a different internal team). CosyVoice 3 uses flow matching with a diffusion transformer for higher peak quality at the expense of higher latency. Qwen3-TTS trades that peak quality for speed: 101ms first-packet for the 1.7B model versus CosyVoice 3’s substantially longer warm-up. On cross-lingual synthesis (Chinese to Korean), Qwen3-TTS achieves a 4.82% error rate versus CosyVoice 3’s 14.4% (arXiv:2601.15621).
For TTS fundamentals including mel-spectrograms, vocoders, and the traditional cascade architecture, see TTS system fundamentals.
What does self-hosting cost versus API fees at 100K utterances per month?
Here are the production numbers. I’m assuming 100,000 utterances per month at ~250 characters per utterance (a typical 15-second voice agent response — roughly 37 words at 150 words per minute). That’s 25 million characters and 25,000 minutes of audio per month.
Commercial API costs
| Provider | Model | Pricing | Monthly (100K utterances) |
|---|---|---|---|
| ElevenLabs | Multilingual v2 | $0.18/min over† | $4,470 |
| ElevenLabs | Flash v2.5 | $0.09/min over† | $2,220 |
| Cartesia | Sonic (Scale plan) | $37/M chars | $925 |
| MiniMax | Speech-02-Turbo | $30/M chars | $750 |
| OpenAI | TTS-1-HD | $30/M chars | $750 |
| Google Cloud | Neural2 | $16/M chars | $400 |
| Amazon Polly | Neural | $16/M chars | $400 |
| OpenAI | TTS-1 | $15/M chars | $375 |
| Inworld AI | TTS-1-Max (8.8B) | $10/M chars | $250 |
†ElevenLabs prices by minute of generated audio, not by character. The Scale plan ($330/month) includes 2,000 Multilingual v2 minutes or 4,000 Flash minutes, with overage billed per minute. The Business plan ($1,320/month) includes 11,000/22,000 minutes at $0.12/$0.06 per extra minute. Character-based providers are calculated at 25M characters (100K utterances × ~250 chars).
Sources: ElevenLabs pricing page (March 2026), OpenAI API pricing, Google Cloud TTS pricing, AWS Polly pricing, MiniMax API pricing (Replicate), Cartesia pricing (eesel.ai), Inworld TTS comparison (inworld.ai).
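If you want to sanity-check the table, the arithmetic fits in a few lines of Python. Plan parameters come from the footnote above; the workload assumptions are this post's (100K utterances, ~250 characters, 15 seconds each):

```python
# Reproduce the table from the workload assumptions:
# 100K utterances x ~250 chars = 25M characters; 15 s each = 25,000 minutes.
UTTERANCES = 100_000
M_CHARS = UTTERANCES * 250 / 1e6        # 25.0 million characters
MINUTES = UTTERANCES * 15 / 60          # 25,000 minutes of audio

def elevenlabs(base, included_minutes, per_minute):
    # Minute-based billing: flat plan fee plus per-minute overage.
    return base + max(0, MINUTES - included_minutes) * per_minute

print(f"ElevenLabs MV2 (Scale):   ${elevenlabs(330, 2_000, 0.18):,.0f}")  # $4,470
print(f"ElevenLabs Flash (Scale): ${elevenlabs(330, 4_000, 0.09):,.0f}")  # $2,220
print(f"Cartesia Sonic:           ${M_CHARS * 37:,.0f}")                  # $925
print(f"OpenAI TTS-1:             ${M_CHARS * 15:,.0f}")                  # $375
```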
Self-hosted costs
| Setup | GPU | Monthly cost | Handles 100K? |
|---|---|---|---|
| L40S dedicated 24/7 (RunPod) | L40S 48GB | $619 | Yes, massive headroom |
| A10G dedicated (AWS) | A10G 24GB | $734 | Yes |
| L40S burst (RunPod) | L40S | ~$24 | Yes (28 GPU-hrs needed) |
| Own RTX 4090 | RTX 4090 24GB | ~$30 electricity | Yes, 15-20 concurrent streams |
The burst pricing is the real story. Each utterance takes roughly 1 second of GPU time. At 100K utterances, that’s ~28 GPU-hours. At $0.86/hour on an L40S, the compute cost is $24 per month.
Break-even
Against ElevenLabs Flash v2.5 on the Scale plan ($330 base plus $0.09 per additional minute, with 4,000 minutes included), a dedicated L40S ($619/month) breaks even at roughly 29,000 utterances per month. Below that, the API is cheaper. Above it, the GPU’s fixed cost amortizes quickly — at 100K utterances, self-hosting costs $619 versus $2,220.
Against budget APIs like OpenAI TTS-1 ($15/M chars), a dedicated GPU only breaks even around 165,000 utterances per month on pure cost. At 100K utterances, OpenAI TTS-1 costs $375 — less than the $619 L40S. The self-hosting case against budget APIs is feature-driven: voice cloning, sub-100ms latency control, and keeping audio data on your own infrastructure. Or use burst pricing ($24 for 100K utterances) and beat everything on cost.
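The break-even claims reduce to the same model: a sketch under this post's assumptions of 0.25 minutes of audio and ~1 GPU-second per utterance.

```python
# Break-even sketch: where does a dedicated L40S ($619/month) match the
# API bill? Assumes 0.25 min of audio and ~1 s of GPU time per utterance.
GPU_DEDICATED = 619          # L40S, RunPod, $/month
GPU_BURST_HR = 0.86          # L40S burst, $/hour

def flash(utts):             # ElevenLabs Flash v2.5, Scale plan
    minutes = utts * 0.25
    return 330 + max(0, minutes - 4_000) * 0.09

def openai_tts1(utts):       # $15 per million characters, ~250 chars each
    return utts * 250 / 1e6 * 15

def burst(utts):             # ~1 GPU-second per utterance
    return utts / 3_600 * GPU_BURST_HR

for utts in (29_000, 100_000, 165_000):
    print(f"{utts:>7,}: Flash ${flash(utts):,.0f}  "
          f"TTS-1 ${openai_tts1(utts):,.0f}  burst ${burst(utts):,.2f}")
# 29,000 utterances: Flash is ~$623, right at the $619 dedicated L40S.
# 165,000 utterances: TTS-1 reaches ~$619, the budget-API break-even.
```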
How does quality compare to ElevenLabs and Cartesia?
Commercial APIs still win on raw naturalness. The CodeSOTA speech leaderboard (March 2026) puts the field like this:
| Model | MOS (1-5) | Type |
|---|---|---|
| ElevenLabs Turbo v2.5 | 4.7 | Commercial |
| Sesame CSM | 4.7 | Open-source |
| OpenAI TTS HD | 4.7 | Commercial |
| Cartesia Sonic 2 | 4.7 | Commercial |
| ElevenLabs Flash v2.5 | 4.6 | Commercial |
| Kokoro v1.0 | 4.5 | Open-source |
| XTTS v2 | 4.5 | Open-source |
| Fish Speech 1.5 | 4.4 | Open-source |
| Dia 1.6B | 4.3 | Open-source |
Source: CodeSOTA Speech Leaderboard, March 2026 (editorial aggregation by a single maintainer; methodology not published). Third-party benchmarks report ElevenLabs Turbo v2.5 at MOS 4.72 (WaveSpeedAI). Qwen3-TTS is not yet on this leaderboard. Treat these scores as directional rather than rigorous head-to-head evaluations.
What Qwen3-TTS does report (self-benchmarked, arXiv:2601.15621):
- English word error rate: 1.24% (state-of-the-art on Seed-TTS benchmark)
- Chinese word error rate: 0.77%
- Speaker similarity for voice cloning: 0.789 (WavLM-based)
- Best WER in 6 of 10 tested languages versus MiniMax-Speech and ElevenLabs
The Inworld AI team ran blind human preference tests (400 votes, 20 annotators) for their TTS-1-Max (8.8B): 59.1% win rate versus ElevenLabs, 60.9% versus Cartesia Sonic 2, 60.7% versus OpenAI TTS-1-HD (arXiv:2507.21138). These are self-conducted evaluations, so treat them as directional.
My honest assessment: for voice agents where the goal is accurate information delivery, open-source quality is production-ready today. For consumer products where voice is the brand experience (audiobooks, meditation apps, premium assistants), commercial APIs still justify their premium.
Which open-source TTS model fits your deployment?
Four models cover four distinct hardware targets:
| Model | Params | VRAM | Runs on | Latency | Languages | Voice clone | Best for |
|---|---|---|---|---|---|---|---|
| Qwen3-TTS 1.7B | 1.7B | 5-6 GB | GPU (Ampere+) | 101ms TTFA | 10 | Yes (3 sec) | Max quality self-hosted |
| Kani-TTS-2 | 400M | 3 GB | GPU | RTF ~0.2 | TBD | Yes | Edge GPU |
| Kyutai Pocket | 100M | None | CPU only | ~200ms TTFA | English | No | No-GPU environments |
| Kokoro | 82M | Minimal | CPU/GPU | Fast | English | No | Best MOS per param |
```mermaid
graph TD
A{Need voice cloning?} -->|Yes| B{GPU available?}
A -->|No| C{Need multilingual?}
B -->|6+ GB VRAM| D[Qwen3-TTS 1.7B]
B -->|3 GB VRAM| E[Kani-TTS-2]
B -->|No GPU| F[Stay on API]
C -->|Yes| D
C -->|No, English only| G{GPU available?}
G -->|Yes| H[Kokoro · 82M · MOS 4.5]
G -->|No| I[Kyutai Pocket · CPU]
```
Qwen3-TTS 1.7B is the clear pick for multilingual production workloads with voice cloning. Avoid the 0.6B variant. In long-form generation tests, the 0.6B produced 106 pauses longer than 1.5 seconds (including silences of 27 and 23 seconds) versus only 2 pauses for the 1.7B on identical text (Gary Stafford, Medium, 2026).
Kyutai Pocket TTS removes the GPU requirement entirely. At 100M parameters, it runs 6x real-time on two CPU cores of a MacBook Air M4 (Kyutai technical report). English only. Distilled from 24 layers to 6, trained on 88,000 hours.
Kokoro at 82M parameters scores MOS 4.5 on CodeSOTA, the highest quality-to-size ratio in the open-source field. English-only, no voice cloning, but for pure naturalness on minimal hardware, nothing else comes close at this scale.
What breaks when you self-host TTS in production?
The README won’t tell you this. Here’s what GitHub issues and community reports reveal about Qwen3-TTS specifically.
Streaming wasn’t actually streaming at launch. The initial Python tooling returned fully generated audio despite the paper’s streaming claims. True chunk-by-chunk output required community workarounds: buffering the first ~38 tokens before sending audio to avoid voice-clone drift (GitHub issues #10, #77, HuggingFace discussions #3, #4). Patches exist now, but this pattern repeats across open-source TTS releases.
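A minimal sketch of that buffering pattern follows. Both `generate_tokens` and `decode_to_pcm` are stand-ins for the model's real streaming interface (which varies by patch), and the decoder is assumed to handle incremental token chunks:

```python
# Sketch of the community workaround: hold back the first ~38 acoustic
# tokens so the cloned voice stabilizes before any audio is emitted.
# `generate_tokens` and `decode_to_pcm` are assumed interfaces, not the
# actual Qwen3-TTS API.
WARMUP_TOKENS = 38

def stream_audio(generate_tokens, decode_to_pcm, text, ref_audio):
    buffer, warmed_up = [], False
    for token in generate_tokens(text, ref_audio):
        buffer.append(token)
        if not warmed_up and len(buffer) < WARMUP_TOKENS:
            continue                 # warming up: emit no audio yet
        warmed_up = True
        yield decode_to_pcm(buffer)  # flush everything buffered so far
        buffer = []
    if buffer:                       # tail flush (covers very short texts)
        yield decode_to_pcm(buffer)
```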
Speaking rate drifts on long texts. For inputs exceeding 100 characters, speaking rate gradually accelerates toward the end (GitHub issue #239). The fix is chunking long texts at sentence boundaries before synthesis.
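A minimal chunker in that spirit is below. The regex sentence split and the 100-character ceiling are simplifying assumptions (the ceiling mirrors the reported drift threshold); production text needs abbreviation and number handling:

```python
import re

# Chunk text at sentence boundaries to work around speaking-rate drift
# on long inputs (GitHub issue #239). Tune MAX_CHARS for your texts.
MAX_CHARS = 100

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Synthesize each chunk separately, then concatenate the audio.
```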
No native voice agent framework integration. There is no official LiveKit Agents plugin or Pipecat integration for Qwen3-TTS as of March 2026. Community repos exist (faster-qwen3-tts, Qwen3-TTS-streaming), but you’re wiring this yourself. Compare this to ElevenLabs or Cartesia, where LiveKit and Pipecat have first-party plugins with documented latency guarantees.
FlashAttention 2 is required for reasonable VRAM. Without it, the 1.7B model needs 14-16 GB in float32. With it, 5-6 GB. FA2 requires Ampere or newer GPUs (RTX 30/40 series, A-series, H-series). RTX 20-series hardware won’t work.
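A hedged loading sketch: the dtype and attention flags below are the standard Hugging Face transformers mechanism for enabling FA2, but the model ID and loader class are my assumptions; check the Qwen3-TTS model card for the actual entry point.

```python
import torch
from transformers import AutoModelForCausalLM

# These two flags are what cut VRAM from ~14-16 GB (float32) to 5-6 GB.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-TTS-1.7B",                    # hypothetical model ID
    torch_dtype=torch.bfloat16,               # halves float32 memory
    attn_implementation="flash_attention_2",  # requires Ampere or newer
    device_map="cuda",
)
```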
vLLM online serving is still in progress. The vLLM-Omni project provides batch inference optimization, but real-time online serving support is not complete. Budget for serving infrastructure engineering beyond model benchmarking.
For security considerations when self-hosting voice inference (data retention, endpoint protection, voice data privacy), see security for voicebots.
When should you self-host versus stay on the API?
Self-host when:
- Volume exceeds ~29,000 utterances per month (the ElevenLabs Flash break-even on dedicated GPU) — or any volume on burst GPU pricing
- You need voice cloning without per-clone API charges
- Sub-100ms TTFA with full control over the serving stack matters
- Audio data must stay on your infrastructure (healthcare, finance, legal)
- You want to fine-tune or customize the model’s behavior
Stay on the API when:
- Volume is under ~5,000 utterances per month, where even burst-GPU savings are too small to repay the integration effort
- Maximum naturalness matters more than cost (premium consumer products)
- You need turnkey LiveKit or Pipecat integration today
- Your team doesn’t have GPU infrastructure or MLOps capacity
The hybrid play: Run Qwen3-TTS for high-volume, latency-sensitive paths (real-time voice agent responses) and route quality-sensitive paths (marketing content, onboarding flows) through ElevenLabs or Cartesia. At high volume, the self-hosting savings fund both.
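As a sketch, the routing rule is a few lines. Path names and backend labels here are placeholders, not a real framework API:

```python
# Hybrid routing sketch: latency-sensitive agent turns go to the
# self-hosted endpoint, brand-sensitive content to a commercial API.
QUALITY_PATHS = {"marketing", "onboarding"}

def pick_tts_backend(path: str) -> str:
    if path in QUALITY_PATHS:
        return "elevenlabs"            # premium naturalness, per-minute billing
    return "qwen3-tts-selfhosted"      # high volume, ~101 ms TTFA

assert pick_tts_backend("agent_response") == "qwen3-tts-selfhosted"
assert pick_tts_backend("marketing") == "elevenlabs"
```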
Takeaways
- The break-even for self-hosted TTS against ElevenLabs Flash is ~29,000 utterances/month on a dedicated L40S. For burst workloads using serverless GPU ($24 for 100K utterances), self-hosting wins at any volume. Against budget APIs like OpenAI TTS-1, the dedicated GPU is more expensive — the case there is features, not price.
- Use the 1.7B model, not the 0.6B. The stability difference in production is not subtle.
- Quality gap is real but narrowing: MOS 4.5-4.7 open-source versus ~4.7 commercial. For voice agents delivering information, this gap rarely matters. For consumer audio products, it does.
- Budget engineering time for integration. No first-party voice agent framework plugins exist yet. Streaming required community patches.
- The hybrid approach (self-host high-volume, API for quality-sensitive) captures the best of both economics.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch