When self-hosted TTS beats the API: the 2026 cost cliff
“Every month, your TTS vendor sends an invoice measured in characters. The same characters you could process on a GPU that rents for $619 a month.”
TL;DR
At 100K utterances per month, self-hosting Qwen3-TTS on burst GPU pricing costs $24, versus $2,220 for ElevenLabs Flash or $375 for OpenAI TTS-1. A dedicated L40S at $619 per month breaks even against ElevenLabs Flash at roughly 29,000 utterances. The quality gap has narrowed but not closed: commercial APIs still lead on naturalness (MOS ~4.7 versus ~4.5 for the best open-source models), while self-hosted models win on voice cloning, latency control, and data privacy. This post runs the numbers, compares four open-source TTS models against five commercial APIs, and gives you a decision matrix for production. For the voice agent architectures these TTS models plug into, see voice agent frameworks.

Why are voice teams reconsidering TTS APIs in 2026?
Because the price gap between commercial TTS APIs and self-hosted models becomes striking at scale. ElevenLabs Multilingual v2 on the Scale plan ($330/month) includes 2,000 minutes of audio, with overage at $0.18 per minute. At 100K utterances (25,000 minutes), that adds up to $4,470 per month. A single L40S GPU running Qwen3-TTS costs $0.86 per hour on RunPod — and at burst pricing, the same 100K utterances cost just $24.
The trigger is Qwen3-TTS (arXiv:2601.15621), released by Alibaba’s Qwen team in January 2026. It’s a 1.7B parameter model with 3-second voice cloning, 10-language support, and 101ms time-to-first-audio. Apache 2.0 licensed. It runs on an RTX 4090 at 5.4 GB VRAM with FlashAttention 2.
Qwen3-TTS isn’t alone. Kani-TTS-2 (400M params, 3GB VRAM), Kyutai Pocket TTS (100M params, CPU-only, 6x real-time on two CPU cores), and Kokoro (82M params, MOS 4.5 on CodeSOTA) all shipped in the same window. The open-source TTS space went from “interesting but not production-ready” to “legitimate alternative” in about three months.
The question: at what volume does self-hosting make financial sense, and what quality do you give up?
How does Qwen3-TTS work under the hood?
Most TTS systems chain a text encoder, an acoustic model, and a vocoder as separate stages. Qwen3-TTS collapses this into a single autoregressive language model: text tokens and acoustic tokens are concatenated along the channel axis, and the model predicts the next acoustic token directly from the text context, with no intermediate mel-spectrogram step.
```mermaid
graph LR
A[Input text] --> B[Text tokenizer]
B --> C[Unified LM\n1.7B params]
G[3-sec reference\naudio] --> H[Speaker\nembedding]
H --> C
C --> D[Multi-token\nprediction]
D --> E[Speech tokenizer\ndecoder · 12.5Hz]
E --> F[Streamed audio]
```
The speech tokenizer emits 16 codebooks at 12.5 Hz with a codebook size of 2,048. A Multi-Token Prediction (MTP) module generates the residual codebooks from the zeroth codebook, building up hierarchical acoustic detail without a separate diffusion step.
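To make the MTP idea concrete, here is a toy PyTorch sketch of one 12.5 Hz frame: the LM's hidden state and the already-sampled zeroth-codebook token feed parallel heads for the 15 residual codebooks. Dimensions and wiring are illustrative assumptions on my part, not the released Qwen3-TTS code.

```python
import torch
import torch.nn as nn

# Toy sketch of multi-token prediction over residual codebooks.
# Dimensions and wiring are illustrative assumptions, not the
# released Qwen3-TTS implementation.
N_CODEBOOKS, CODEBOOK_SIZE, D_MODEL = 16, 2048, 512

class MTPHead(nn.Module):
    """Predict the 15 residual codebooks from the LM hidden state
    plus the embedding of the already-sampled zeroth codebook."""
    def __init__(self):
        super().__init__()
        self.cb0_embed = nn.Embedding(CODEBOOK_SIZE, D_MODEL)
        self.heads = nn.ModuleList(
            nn.Linear(D_MODEL, CODEBOOK_SIZE) for _ in range(N_CODEBOOKS - 1)
        )

    def forward(self, hidden, cb0_token):
        # hidden: (batch, d_model), the LM state at this 12.5 Hz frame
        h = hidden + self.cb0_embed(cb0_token)
        # One classification head per residual codebook, all in parallel:
        # no extra autoregressive steps, no diffusion stage.
        return torch.stack([head(h) for head in self.heads], dim=1)

hidden = torch.randn(1, D_MODEL)   # stand-in for the unified LM state
cb0 = torch.tensor([123])          # sampled zeroth-codebook token
logits = MTPHead()(hidden, cb0)    # shape: (1, 15, 2048)
```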
This architecture differs from CosyVoice 3 (also from Alibaba, but a different internal team). CosyVoice 3 uses flow matching with a diffusion transformer for higher peak quality at the expense of higher latency. Qwen3-TTS trades that peak quality for speed: 101ms first-packet for the 1.7B model versus CosyVoice 3’s substantially longer warm-up. On cross-lingual synthesis (Chinese to Korean), Qwen3-TTS achieves a 4.82% error rate versus CosyVoice 3’s 14.4% (arXiv:2601.15621).
For TTS fundamentals including mel-spectrograms, vocoders, and the traditional cascade architecture, see TTS system fundamentals.
What does self-hosting cost versus API fees at 100K utterances per month?
Here are the production numbers. I’m assuming 100,000 utterances per month at ~250 characters per utterance (a typical 15-second voice agent response — roughly 37 words at 150 words per minute). That’s 25 million characters and 25,000 minutes of audio per month.
Commercial API costs
| Provider | Model | Pricing | Monthly (100K utterances) |
|---|---|---|---|
| ElevenLabs | Multilingual v2 | $0.18/min over† | $4,470 |
| ElevenLabs | Flash v2.5 | $0.09/min over† | $2,220 |
| Cartesia | Sonic (Scale plan) | $37/M chars | $925 |
| MiniMax | Speech-02-Turbo | $30/M chars | $750 |
| OpenAI | TTS-1-HD | $30/M chars | $750 |
| Google Cloud | Neural2 | $16/M chars | $400 |
| Amazon Polly | Neural | $16/M chars | $400 |
| OpenAI | TTS-1 | $15/M chars | $375 |
| Inworld AI | TTS-1-Max (8.8B) | $10/M chars | $250 |
†ElevenLabs prices by minute of generated audio, not by character. The Scale plan ($330/month) includes 2,000 Multilingual v2 minutes or 4,000 Flash minutes, with overage billed per minute. The Business plan ($1,320/month) includes 11,000/22,000 minutes at $0.12/$0.06 per extra minute. Character-based providers are calculated at 25M characters (100K utterances × ~250 chars).
Sources: ElevenLabs pricing page (March 2026), OpenAI API pricing, Google Cloud TTS pricing, AWS Polly pricing, MiniMax API pricing (Replicate), Cartesia pricing (eesel.ai), Inworld TTS comparison (inworld.ai).
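If you want to sanity-check the table, the arithmetic fits in a few lines of Python. Plan parameters come from the footnote above; the workload assumptions are this post's (100K utterances, ~250 characters, 15 seconds each):

```python
# Reproduce the table from the workload assumptions:
# 100K utterances x ~250 chars = 25M characters; 15 s each = 25,000 minutes.
UTTERANCES = 100_000
M_CHARS = UTTERANCES * 250 / 1e6        # 25.0 million characters
MINUTES = UTTERANCES * 15 / 60          # 25,000 minutes of audio

def elevenlabs(base, included_minutes, per_minute):
    # Minute-based billing: flat plan fee plus per-minute overage.
    return base + max(0, MINUTES - included_minutes) * per_minute

print(f"ElevenLabs MV2 (Scale):   ${elevenlabs(330, 2_000, 0.18):,.0f}")  # $4,470
print(f"ElevenLabs Flash (Scale): ${elevenlabs(330, 4_000, 0.09):,.0f}")  # $2,220
print(f"Cartesia Sonic:           ${M_CHARS * 37:,.0f}")                  # $925
print(f"OpenAI TTS-1:             ${M_CHARS * 15:,.0f}")                  # $375
```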
Self-hosted costs
| Setup | GPU | Monthly cost | Handles 100K? |
|---|---|---|---|
| L40S dedicated 24/7 (RunPod) | L40S 48GB | $619 | Yes, massive headroom |
| A10G dedicated (AWS) | A10G 24GB | $734 | Yes |
| L40S burst (RunPod) | L40S | ~$24 | Yes (28 GPU-hrs needed) |
| Own RTX 4090 | RTX 4090 24GB | ~$30 electricity | Yes, 15-20 concurrent streams |
The burst pricing is the real story. Each utterance takes roughly 1 second of GPU time. At 100K utterances, that’s ~28 GPU-hours. At $0.86/hour on an L40S, the compute cost is $24 per month.
Break-even
Against ElevenLabs Flash v2.5 on the Scale plan ($330 base plus $0.09 per additional minute, with 4,000 minutes included), a dedicated L40S ($619/month) breaks even at roughly 29,000 utterances per month. Below that, the API is cheaper. Above it, the GPU’s fixed cost amortizes quickly — at 100K utterances, self-hosting costs $619 versus $2,220.
Against budget APIs like OpenAI TTS-1 ($15/M chars), a dedicated GPU only breaks even around 165,000 utterances per month on pure cost. At 100K utterances, OpenAI TTS-1 costs $375 — less than the $619 L40S. The self-hosting case against budget APIs is feature-driven: voice cloning, sub-100ms latency control, and keeping audio data on your own infrastructure. Or use burst pricing ($24 for 100K utterances) and beat everything on cost.
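The break-even claims reduce to the same model: a sketch under this post's assumptions of 0.25 minutes of audio and ~1 GPU-second per utterance.

```python
# Break-even sketch: where does a dedicated L40S ($619/month) match the
# API bill? Assumes 0.25 min of audio and ~1 s of GPU time per utterance.
GPU_DEDICATED = 619          # L40S, RunPod, $/month
GPU_BURST_HR = 0.86          # L40S burst, $/hour

def flash(utts):             # ElevenLabs Flash v2.5, Scale plan
    minutes = utts * 0.25
    return 330 + max(0, minutes - 4_000) * 0.09

def openai_tts1(utts):       # $15 per million characters, ~250 chars each
    return utts * 250 / 1e6 * 15

def burst(utts):             # ~1 GPU-second per utterance
    return utts / 3_600 * GPU_BURST_HR

for utts in (29_000, 100_000, 165_000):
    print(f"{utts:>7,}: Flash ${flash(utts):,.0f}  "
          f"TTS-1 ${openai_tts1(utts):,.0f}  burst ${burst(utts):,.2f}")
# 29,000 utterances: Flash is ~$623, right at the $619 dedicated L40S.
# 165,000 utterances: TTS-1 reaches ~$619, the budget-API break-even.
```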
How does quality compare to ElevenLabs and Cartesia?
Commercial APIs still win on raw naturalness. The CodeSOTA speech leaderboard (March 2026) puts the field like this:
| Model | MOS (1-5) | Type |
|---|---|---|
| ElevenLabs Turbo v2.5 | 4.7 | Commercial |
| Sesame CSM | 4.7 | Open-source |
| OpenAI TTS HD | 4.7 | Commercial |
| Cartesia Sonic 2 | 4.7 | Commercial |
| ElevenLabs Flash v2.5 | 4.6 | Commercial |
| Kokoro v1.0 | 4.5 | Open-source |
| XTTS v2 | 4.5 | Open-source |
| Fish Speech 1.5 | 4.4 | Open-source |
| Dia 1.6B | 4.3 | Open-source |
Source: CodeSOTA Speech Leaderboard, March 2026 (editorial aggregation by a single maintainer; methodology not published). Third-party benchmarks report ElevenLabs Turbo v2.5 at MOS 4.72 (WaveSpeedAI). Qwen3-TTS is not yet on this leaderboard. Treat these scores as directional rather than rigorous head-to-head evaluations.
What Qwen3-TTS does report (self-benchmarked, arXiv:2601.15621):
- English word error rate: 1.24% (state-of-the-art on Seed-TTS benchmark)
- Chinese word error rate: 0.77%
- Speaker similarity for voice cloning: 0.789 (WavLM-based)
- Best WER in 6 of 10 tested languages versus MiniMax-Speech and ElevenLabs
The Inworld AI team ran blind human preference tests (400 votes, 20 annotators) for their TTS-1-Max (8.8B): 59.1% win rate versus ElevenLabs, 60.9% versus Cartesia Sonic 2, 60.7% versus OpenAI TTS-1-HD (arXiv:2507.21138). These are self-conducted evaluations, so treat them as directional.
My honest assessment: for voice agents where the goal is accurate information delivery, open-source quality is production-ready today. For consumer products where voice is the brand experience (audiobooks, meditation apps, premium assistants), commercial APIs still justify their premium.
Which open-source TTS model fits your deployment?
Four models cover four distinct hardware targets:
| Model | Params | VRAM | Runs on | Latency | Languages | Voice clone | Best for |
|---|---|---|---|---|---|---|---|
| Qwen3-TTS 1.7B | 1.7B | 5-6 GB | GPU (Ampere+) | 101ms TTFA | 10 | Yes (3 sec) | Max quality self-hosted |
| Kani-TTS-2 | 400M | 3 GB | GPU | RTF ~0.2 | TBD | Yes | Edge GPU |
| Kyutai Pocket | 100M | None | CPU only | ~200ms TTFA | English | No | No-GPU environments |
| Kokoro | 82M | Minimal | CPU/GPU | Fast | English | No | Best MOS per param |
```mermaid
graph TD
A{Need voice cloning?} -->|Yes| B{GPU available?}
A -->|No| C{Need multilingual?}
B -->|6+ GB VRAM| D[Qwen3-TTS 1.7B]
B -->|3 GB VRAM| E[Kani-TTS-2]
B -->|No GPU| F[Stay on API]
C -->|Yes| D
C -->|No, English only| G{GPU available?}
G -->|Yes| H[Kokoro · 82M · MOS 4.5]
G -->|No| I[Kyutai Pocket · CPU]
```
Qwen3-TTS 1.7B is the clear pick for multilingual production workloads with voice cloning. Avoid the 0.6B variant. In long-form generation tests, the 0.6B produced 106 pauses longer than 1.5 seconds (including silences of 27 and 23 seconds) versus only 2 pauses for the 1.7B on identical text (Gary Stafford, Medium, 2026).
Kyutai Pocket TTS removes the GPU requirement entirely. At 100M parameters, it runs 6x real-time on two CPU cores of a MacBook Air M4 (Kyutai technical report). English only. Distilled from 24 layers to 6, trained on 88,000 hours.
Kokoro at 82M parameters scores MOS 4.5 on CodeSOTA, the highest quality-to-size ratio in the open-source field. English-only, no voice cloning, but for pure naturalness on minimal hardware, nothing else comes close at this scale.
What breaks when you self-host TTS in production?
The README won’t tell you this. Here’s what GitHub issues and community reports reveal about Qwen3-TTS specifically.
Streaming wasn’t actually streaming at launch. The initial Python tooling returned fully generated audio despite the paper’s streaming claims. True chunk-by-chunk output required community workarounds: buffering the first ~38 tokens before sending audio to avoid voice-clone drift (GitHub issues #10, #77, HuggingFace discussions #3, #4). Patches exist now, but this pattern repeats across open-source TTS releases.
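A minimal sketch of that buffering pattern follows. Both `generate_tokens` and `decode_to_pcm` are stand-ins for the model's real streaming interface (which varies by patch), and the decoder is assumed to handle incremental token chunks:

```python
# Sketch of the community workaround: hold back the first ~38 acoustic
# tokens so the cloned voice stabilizes before any audio is emitted.
# `generate_tokens` and `decode_to_pcm` are assumed interfaces, not the
# actual Qwen3-TTS API.
WARMUP_TOKENS = 38

def stream_audio(generate_tokens, decode_to_pcm, text, ref_audio):
    buffer, warmed_up = [], False
    for token in generate_tokens(text, ref_audio):
        buffer.append(token)
        if not warmed_up and len(buffer) < WARMUP_TOKENS:
            continue                 # warming up: emit no audio yet
        warmed_up = True
        yield decode_to_pcm(buffer)  # flush everything buffered so far
        buffer = []
    if buffer:                       # tail flush (covers very short texts)
        yield decode_to_pcm(buffer)
```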
Speaking rate drifts on long texts. For inputs exceeding 100 characters, speaking rate gradually accelerates toward the end (GitHub issue #239). The fix is chunking long texts at sentence boundaries before synthesis.
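A minimal chunker in that spirit is below. The regex sentence split and the 100-character ceiling are simplifying assumptions (the ceiling mirrors the reported drift threshold); production text needs abbreviation and number handling:

```python
import re

# Chunk text at sentence boundaries to work around speaking-rate drift
# on long inputs (GitHub issue #239). Tune MAX_CHARS for your texts.
MAX_CHARS = 100

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Synthesize each chunk separately, then concatenate the audio.
```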
No native voice agent framework integration. There is no official LiveKit Agents plugin or Pipecat integration for Qwen3-TTS as of March 2026. Community repos exist (faster-qwen3-tts, Qwen3-TTS-streaming), but you’re wiring this yourself. Compare this to ElevenLabs or Cartesia, where LiveKit and Pipecat have first-party plugins with documented latency guarantees.
FlashAttention 2 is required for reasonable VRAM. Without it, the 1.7B model needs 14-16 GB in float32. With it, 5-6 GB. FA2 requires Ampere or newer GPUs (RTX 30/40 series, A-series, H-series). RTX 20-series hardware won’t work.
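A hedged loading sketch: the dtype and attention flags below are the standard Hugging Face transformers mechanism for enabling FA2, but the model ID and loader class are my assumptions; check the Qwen3-TTS model card for the actual entry point.

```python
import torch
from transformers import AutoModelForCausalLM

# These two flags are what cut VRAM from ~14-16 GB (float32) to 5-6 GB.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-TTS-1.7B",                    # hypothetical model ID
    torch_dtype=torch.bfloat16,               # halves float32 memory
    attn_implementation="flash_attention_2",  # requires Ampere or newer
    device_map="cuda",
)
```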
vLLM online serving is still in progress. The vLLM-Omni project provides batch inference optimization, but real-time online serving support is not complete. Budget for serving infrastructure engineering beyond model benchmarking.
For security considerations when self-hosting voice inference (data retention, endpoint protection, voice data privacy), see security for voicebots.
When should you self-host versus stay on the API?
Self-host when:
- Volume exceeds ~29,000 utterances per month (the ElevenLabs Flash break-even on dedicated GPU) — or any volume on burst GPU pricing
- You need voice cloning without per-clone API charges
- Sub-100ms TTFA with full control over the serving stack matters
- Audio data must stay on your infrastructure (healthcare, finance, legal)
- You want to fine-tune or customize the model’s behavior
Stay on the API when:
- Volume is under ~5,000 utterances per month, where even burst-GPU savings are too small to repay the integration effort
- Maximum naturalness matters more than cost (premium consumer products)
- You need turnkey LiveKit or Pipecat integration today
- Your team doesn’t have GPU infrastructure or MLOps capacity
The hybrid play: Run Qwen3-TTS for high-volume, latency-sensitive paths (real-time voice agent responses) and route quality-sensitive paths (marketing content, onboarding flows) through ElevenLabs or Cartesia. At high volume, the self-hosting savings fund both.
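As a sketch, the routing rule is a few lines. Path names and backend labels here are placeholders, not a real framework API:

```python
# Hybrid routing sketch: latency-sensitive agent turns go to the
# self-hosted endpoint, brand-sensitive content to a commercial API.
QUALITY_PATHS = {"marketing", "onboarding"}

def pick_tts_backend(path: str) -> str:
    if path in QUALITY_PATHS:
        return "elevenlabs"            # premium naturalness, per-minute billing
    return "qwen3-tts-selfhosted"      # high volume, ~101 ms TTFA

assert pick_tts_backend("agent_response") == "qwen3-tts-selfhosted"
assert pick_tts_backend("marketing") == "elevenlabs"
```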
Takeaways
- The break-even for self-hosted TTS against ElevenLabs Flash is ~29,000 utterances/month on a dedicated L40S. For burst workloads using serverless GPU ($24 for 100K utterances), self-hosting wins at any volume. Against budget APIs like OpenAI TTS-1, the dedicated GPU is more expensive — the case there is features, not price.
- Use the 1.7B model, not the 0.6B. The stability difference in production is not subtle.
- Quality gap is real but narrowing: MOS 4.5-4.7 open-source versus ~4.7 commercial. For voice agents delivering information, this gap rarely matters. For consumer audio products, it does.
- Budget engineering time for integration. No first-party voice agent framework plugins exist yet. Streaming required community patches.
- The hybrid approach (self-host high-volume, API for quality-sensitive) captures the best of both economics.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch