
TL;DR

Full-Duplex-Bench-v3 (arXiv:2604.04847) tests six voice LLMs on real human audio annotated for fillers, pauses, hesitations, false starts, and self-corrections. Every configuration (GPT-Realtime, Gemini Live 2.5/3.1, Grok, Ultravox, a cascaded pipeline) collapses on self-correction and hard multi-step reasoning. That pattern is architectural, not an ASR bug.

[Figure: oscilloscope traces showing a clean speech waveform next to one interrupted by a self-correction, representing disfluency failure in voice agents]

Why teams keep misdiagnosing disfluency

Most voice agent teams blame ASR when users restart sentences and the agent goes off the rails. The fix, they assume, is a better transcription model or a stricter VAD. That’s the wrong diagnosis.

The cascaded Whisper + GPT-4o + TTS baseline in Full-Duplex-Bench-v3 has access to mature, well-funded ASR. On the self-correction category, it scores 17.6%, the worst of all six systems tested. End-to-end voice LLMs score 29 to 59%. The correlation between ASR quality and self-correction survival is not just weak, it inverts.

I have seen three production teams go through the same cycle. They ship a demo that sounds great in the office, hit 30 to 40% failure rates in real calls, and spend a quarter tuning Whisper. The ASR benchmark gap explains why WER is not the bottleneck; this post explains what is. Before Full-Duplex-Bench-v3, the vocabulary for this failure mode didn’t exist. Now it does.

What the five disfluency categories actually test

Full-Duplex-Bench-v3 annotates every query with one of five speech phenomena, each targeting a different layer of the voice agent stack.

| Category | What users do | What it stresses |
|---|---|---|
| Filler | “um”, “uh” inserted mid-sentence | ASR normalization, backchannel handling |
| Pause | Silence between words | End-of-turn detection, VAD |
| Hesitation | Filler + word repetition (“I, I want to…”) | Parsing under repetition |
| False start | Abandon initial intent, start new one | Context discard, intent switching |
| Self-correction | Update a parameter mid-sentence | State rollback on in-flight tool calls |

The categories are ordered roughly by how deep in the pipeline the failure lands. Fillers and pauses are surface-level audio features. False starts and self-corrections are reasoning and state problems.

The bench’s dataset consists entirely of real human recordings in uncontrolled environments, not synthesized disfluency, not scripted readings. Every scenario pairs a disfluent query with a chained tool-use task across Travel & Identity, Finance & Billing, Housing & Location, and E-Commerce Support. The evaluation measures Pass@1 on task completion, first-word latency, tool-call latency, and turn-taking rate in one pass.
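The annotation and metric set described above can be sketched as a record schema. This is my own illustration of the shape of the data, not the benchmark's released format; all field names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical schema for one benchmark scenario and its result.
# Category and domain names come from the paper's description.
DISFLUENCY_TYPES = {"filler", "pause", "hesitation", "false_start", "self_correction"}
DOMAINS = {"travel_identity", "finance_billing", "housing_location", "ecommerce_support"}

@dataclass
class Scenario:
    audio_path: str            # real human recording, uncontrolled environment
    disfluency: str            # one of DISFLUENCY_TYPES
    domain: str                # one of DOMAINS
    expected_tool_calls: list  # chained tool-use ground truth

@dataclass
class Result:
    passed: bool               # Pass@1 on task completion
    first_word_latency_s: float
    tool_call_latency_s: float
    turn_taken: bool           # did the agent take its turn at all
```

One evaluation pass fills a `Result` per `Scenario`, which is enough to aggregate every headline metric in the paper: Pass@1, both latencies, and turn-taking rate.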

Which six voice LLMs got graded

Six configurations: four proprietary, one open source, and one cascaded pipeline for comparison:

  • GPT-Realtime (OpenAI)
  • Gemini Live 2.5 (Google)
  • Gemini Live 3.1 (Google)
  • Grok (xAI)
  • Ultravox v0.7 (Fixie)
  • Cascaded baseline: Whisper → GPT-4o → TTS

The overall Pass@1 leaderboard:

| Model | Pass@1 |
|---|---|
| GPT-Realtime | 0.600 |
| Gemini Live 3.1 | 0.540 |
| Gemini Live 2.5 | 0.490 |
| Cascaded | 0.450 |
| Grok | 0.430 |
| Ultravox v0.7 | 0.410 |

GPT-Realtime leads. No system clears 60% on tool-use scenarios built from real human speech. For a capability most vendors demo as solved, the ceiling is sobering.

Self-correction is the architectural cliff

The single most important finding in the paper is that self-correction fails everywhere, for different reasons, and the reasons reveal the architecture.

Per-category Pass@1:

| Category | GPT-RT | Gem 2.5 | Gem 3.1 | Grok | Ultravox | Cascaded |
|---|---|---|---|---|---|---|
| Filler | 0.621 | 0.621 | 0.586 | 0.483 | 0.414 | 0.448 |
| Pause | 0.556 | 0.444 | 0.500 | 0.333 | 0.333 | 0.444 |
| Hesitation | 0.700 | 0.600 | 0.600 | 0.500 | 0.500 | 0.600 |
| False start | 0.667 | 0.417 | 0.583 | 0.583 | 0.250 | 0.500 |
| Self-correction | 0.588 | 0.471 | 0.353 | 0.294 | 0.353 | 0.176 |

Every model scores worst on self-correction. The cascaded pipeline collapses to 17.6%, which is not a small margin. It’s a phase change. Whisper commits a transcript before the correction arrives. The downstream LLM sees “book flight to Boston” and fires the tool call. When the user says “actually, Chicago,” the rollback never happens, because there’s no mechanism for rollback in the architecture.

End-to-end models do better (29 to 59%) but still fail most hard scenarios. They can observe the correction in audio, but they don’t consistently revise an in-flight plan. The pattern is architectural: current voice agents treat the user’s utterance as append-only, and real speech is not append-only.

Contrast that with hesitation, where every system scores 50 to 70%. Word repetition is a parsing nuisance; everyone handles it reasonably. The gap between hesitation and self-correction for the cascaded baseline is 42 percentage points. That gap is the cost of missing state rollback.

The easy/medium/hard breakdown is the real story

Full-Duplex-Bench-v3 also grades scenarios by difficulty. The degradation curve is the clearest signal in the paper:

| Tier | GPT-RT | Gem 2.5 | Gem 3.1 | Grok | Ultravox | Cascaded |
|---|---|---|---|---|---|---|
| Easy | 0.750 | 0.667 | 0.694 | 0.583 | 0.556 | 0.639 |
| Medium | 0.588 | 0.500 | 0.588 | 0.471 | 0.382 | 0.441 |
| Hard | 0.433 | 0.267 | 0.300 | 0.200 | 0.267 | 0.233 |

Hard scenarios chain multiple tool calls with disfluent arguments. GPT-Realtime drops from 75% on easy to 43.3% on hard, a 32-point fall. Gemini Live 2.5 falls 40 points. The top model roughly halves its success rate when you ask it to reason over two steps while a user is still restructuring the request.

This is why demos mislead. Demos are easy-tier tasks: single tool, clean phrasing, maybe one filler. Production is medium-and-hard: multi-step, disfluent, with a correction thrown in. The benchmark shows this gap numerically for the first time.

Latency-versus-robustness is the real tradeoff

Full-duplex designs each pick a point on a tradeoff curve. The latency table makes the choices visible:

| Model | First word | Tool call | Task complete |
|---|---|---|---|
| Gemini Live 3.1 | 3.95s | 2.21s | 4.25s |
| Grok | 5.97s | 0.63s | 6.65s |
| GPT-Realtime | 6.36s | 3.89s | 6.89s |
| Gemini Live 2.5 | 7.03s | 4.61s | 7.26s |
| Ultravox | 3.88s | 6.01s | 8.40s |
| Cascaded | 8.78s | 3.15s | 10.12s |

Gemini Live 3.1 wins on speed at 4.25 seconds to task completion, but its turn-take rate is only 78%: it misses more than one in five user turns. GPT-Realtime keeps a 13.5% interruption rate with 96% turn-take, paying 2.64 seconds of extra task-completion latency for that robustness. Grok trades the other way: aggressive pre-emptive tool calling (41.6%) at a higher interruption rate (25.5%), and a weaker Pass@1 of 0.43.

Gemini Live 3.1 also produces no speech in 22% of examples while executing correct tool calls in 86% of those silent cases, a reasoning-to-speech disconnect that shows up on no other model. That’s a warning to anyone treating “fastest” as “best.”

The voice agent latency budget post covers the general shape of this tradeoff. Full-Duplex-Bench-v3 gives you the data to decide where to sit on it.

What’s architectural versus fixable with prompting

Teams ask me this constantly: which of these failures can I paper over, and which require rebuilding the stack?

```mermaid
flowchart LR
    A["Filler / pause handling"] -->|prompting + VAD tuning| B["Fixable"]
    C["Hesitation parsing"] -->|system prompt| B
    D["False start / intent switch"] -->|explicit 'ignore earlier' prompt| B
    E["Self-correction mid-tool-call"] -->|needs rollback mechanism| F["Architectural"]
    G["Multi-step reasoning under interruption"] -->|needs interruptible planning| F

    style B fill:#e8f5e9
    style F fill:#ffebee
```

Fillers, pauses, hesitations, and most false starts sit on the fixable side. A system prompt that says “ignore filler words and treat repeated phrases as one,” plus a slightly longer end-of-turn timeout, gets you 80% of the way on those categories. The benchmark shows all systems scoring roughly 33 to 70% on them without special handling, which means the headroom from prompt tuning is real.
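As a concrete starting point, the fixable side can be two config values. The prompt wording and the 1.2-second threshold below are my suggestions, not values from the benchmark; tune them against your own recordings.

```python
# Sketch of the "fixable" side: a disfluency-tolerant system prompt plus a
# longer end-of-turn silence threshold. Both values are illustrative defaults.
DISFLUENCY_PROMPT = (
    "The user speaks naturally and may include filler words ('um', 'uh'), "
    "pauses, and repeated words or phrases. Ignore fillers, treat repeated "
    "phrases as a single mention, and never treat a pause as a new request."
)

# Many half-duplex stacks cut the turn at ~0.5 s of silence; stretching the
# end-of-turn timeout gives pause-heavy speakers room to finish a clause.
VAD_CONFIG = {
    "end_of_turn_silence_s": 1.2,  # up from a common ~0.5 s default
    "min_speech_s": 0.2,           # ignore micro-bursts of noise
}
```

The cost of the longer timeout is first-word latency, which is exactly the latency-versus-robustness tradeoff the benchmark quantifies.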

Self-correction and hard multi-step reasoning are architectural. You cannot prompt a system into capabilities it doesn’t have. The cascaded pipeline cannot roll back a committed transcript. End-to-end voice LLMs don’t have an explicit planner you can interrupt. Until somebody ships a voice agent with a tool-call staging layer (a tool call is tentative until the user finishes speaking, and any correction rewinds the stage), self-correction will remain the visible failure mode.
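A staging layer of this kind does not ship in any of the six systems tested. Here is a minimal sketch of what one could look like; every name in it is hypothetical.

```python
# Sketch of a tool-call staging layer: a planned call stays tentative while
# the user is still speaking, a detected self-correction rewrites the staged
# arguments in place, and only end-of-turn commits the call for execution.
class ToolCallStage:
    def __init__(self):
        self._staged = None  # (tool_name, args) awaiting end of turn

    def stage(self, tool_name, args):
        """Tentatively plan a call; nothing executes yet."""
        self._staged = (tool_name, dict(args))

    def correct(self, arg_updates):
        """A mid-utterance correction rewinds the staged args in place."""
        if self._staged is not None:
            self._staged[1].update(arg_updates)

    def commit(self):
        """User finished speaking: the staged call becomes final."""
        staged, self._staged = self._staged, None
        return staged

stage = ToolCallStage()
stage.stage("book_flight", {"destination": "Boston", "day": "Friday"})
stage.correct({"destination": "Chicago"})  # user says "actually, Chicago"
final = stage.commit()                     # one call, with the corrected args
```

The hard part is not this buffer; it is detecting that "actually, Chicago" is a correction of a staged argument rather than a new request, which is exactly what current models fail at.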

One practical guardrail that actually works today: add an explicit confirm-before-commit step on any tool call with more than one mutable argument. “Booking flight to Chicago on Friday, confirm?” is a speech bump that buys you the rollback window. Users hate it slightly less than they hate the wrong flight.
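The guardrail can live in a few lines of the tool-dispatch path. A sketch, assuming you maintain a per-tool set of mutable arguments; the tool names and argument sets here are illustrative.

```python
# Confirm-before-commit: any tool call touching more than one mutable
# argument gets read back to the user before execution. The registry below
# is a made-up example; populate it from your real tool schemas.
MUTABLE_ARGS = {
    "book_flight": {"destination", "day"},
    "check_balance": {"account_id"},
}

def needs_confirmation(tool_name, args):
    """True when the call sets >1 mutable argument, i.e. >1 thing to get wrong."""
    mutable = MUTABLE_ARGS.get(tool_name, set())
    return len(set(args) & mutable) > 1

def confirmation_line(tool_name, args):
    """The speech bump that buys the rollback window."""
    detail = ", ".join(f"{k}={v}" for k, v in sorted(args.items()))
    return f"About to run {tool_name} ({detail}). Confirm?"
```

Single-argument calls skip the bump, so the latency cost only lands where a silent wrong commit would be most expensive.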

A pre-deploy testing framework

The benchmark is not the only way to get this signal. You can run a lightweight version of it against your own agent before shipping.

1. Build 50 tool-use scenarios per domain. Each scenario should require at least one tool call. Half should require two or more chained calls. Use your actual production tools, not toy ones.

2. Record five disfluency variants per scenario: clean, filler-heavy, pause-heavy, false-start, and self-correction mid-argument. Use real human recordings, not synthesized audio, and ask three people to do the recording so accents vary.

3. Measure Pass@1 per disfluency category separately. Do not average across categories. The whole point of Full-Duplex-Bench-v3 is that self-correction is a different failure from fillers, and you need to see them separated.

4. Set a gate on self-correction. If self-correction scores more than 20 points below clean, don’t ship the mid-turn correction UX. Either route self-correction through an explicit “actually, let me change that” keyword, or add confirm-before-commit on every tool call.

5. Log the easy/medium/hard breakdown. If your hard-tier score is less than half your easy-tier score, the agent will fall off a cliff in production even if the overall number looks fine.

6. Track turn-take rate and interruption rate together. A turn-take rate above 95% with interruption rate below 15% is the benchmark leader’s profile. Below 85% turn-take means you’re missing users; above 25% interruption means you’re talking over them. Both are dealbreakers for different reasons.

This protocol takes roughly a week for a small team. It is the cheapest insurance you can buy before a full-duplex launch.
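The six steps above reduce to a small scorer. This is a sketch under assumptions: your test runner emits a `(category, passed, turn_taken, interrupted)` tuple per run, and the thresholds are the ones from steps 4 and 6.

```python
from collections import defaultdict

def score(results):
    """Per-category Pass@1 plus the ship/no-ship gates from the protocol.

    results: iterable of (category, passed, turn_taken, interrupted) tuples.
    """
    by_cat = defaultdict(list)
    for cat, passed, _, _ in results:
        by_cat[cat].append(passed)
    pass_at_1 = {c: sum(v) / len(v) for c, v in by_cat.items()}

    turn_take = sum(t for _, _, t, _ in results) / len(results)
    interrupt = sum(i for _, _, _, i in results) / len(results)

    gates = {
        # Step 4: self-correction within 20 points of clean
        "self_correction": pass_at_1.get("self_correction", 0.0)
                           >= pass_at_1.get("clean", 0.0) - 0.20,
        # Step 6: turn-taking profile
        "turn_take": turn_take >= 0.85,
        "interruption": interrupt <= 0.25,
    }
    return pass_at_1, gates

# Toy run: 10 clean scenarios (8 pass), 10 self-correction scenarios (7 pass).
results = (
    [("clean", True, True, False)] * 8 + [("clean", False, True, False)] * 2
    + [("self_correction", True, True, False)] * 7
    + [("self_correction", False, True, False)] * 3
)
scores, gates = score(results)
```

If any gate is False, the protocol says to change the UX (keyword-routed corrections, confirm-before-commit) rather than ship and tune later.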

Key takeaways

  • Full-Duplex-Bench-v3 (arXiv:2604.04847, NTU + NVIDIA, April 2026) is the first benchmark to grade voice LLMs on real human disfluency paired with multi-step tool use across four domains.
  • Self-correction is the consistent failure mode. Every model scores worst on it: GPT-Realtime 58.8% down to the cascaded pipeline at 17.6%. That 41-point spread is not noise, it’s architecture.
  • Hard multi-step reasoning is the second consistent failure. Top models drop 32 to 40 points from easy to hard scenarios. Demos hide this because demos are easy-tier.
  • Better ASR does not help. The cascaded pipeline has the best ASR in the test and the worst self-correction score. The bottleneck is state rollback, not transcription.
  • Fillers, pauses, and hesitations are fixable with prompting and VAD tuning. Self-correction and hard reasoning are not; they need architectural changes.
  • Run the five-disfluency test before you ship. Fifty scenarios per domain, five variants each, Pass@1 per category. If self-correction scores 20 points below clean, add confirm-before-commit.

FAQ

What is Full-Duplex-Bench-v3? Full-Duplex-Bench-v3 (arXiv:2604.04847, NTU and NVIDIA, April 2026) evaluates six full-duplex voice LLM configurations on real human audio annotated for five disfluency categories across four tool-use domains. It scores Pass@1 task completion, latency, and turn-taking simultaneously. The best system (GPT-Realtime) completes 60% of tasks overall, dropping to 43% on hard scenarios and 58.8% on self-correction.

Which disfluency hurts voice agents most? Self-correction. On Full-Duplex-Bench-v3, every configuration scores worst on the self-correction category: GPT-Realtime 58.8%, Gemini Live 2.5 47.1%, Gemini Live 3.1 35.3%, Ultravox 35.3%, Grok 29.4%, and the cascaded Whisper+GPT-4o pipeline 17.6%. Mid-utterance intent changes force a state rollback no current voice agent reliably performs.

Is disfluency failure an ASR problem or an architecture problem? Architecture. The cascaded Whisper + GPT-4o + TTS baseline has the best ASR in the test and still scores 17.6% on self-correction — the lowest of all six systems. Whisper finalizes the pre-correction transcript before the correction arrives, and the downstream LLM has no mechanism to revise a committed tool call. End-to-end voice LLMs fail the same test for a different reason: they lack explicit state rollback.

How does Full-Duplex-Bench-v3 differ from τ-Voice? τ-Voice (arXiv:2603.13686) measures end-to-end task completion under realistic audio degradation across customer-service domains. Full-Duplex-Bench-v3 isolates disfluency as the stress test and binds tool use to each query. τ-Voice tells you how robust an agent is to noise and accents; Full-Duplex-Bench-v3 tells you how it handles natural speech pathologies. See the τ-Voice analysis for the complementary view.

What should practitioners test before deploying full-duplex? Run fifty tool-use scenarios per domain with five disfluency profiles: clean, filler-heavy, pause-heavy, false-start, and self-correction mid-argument. Measure Pass@1 per disfluency category separately. If self-correction scores more than 20 points below clean, disable mid-turn corrections in the UX or add an explicit confirm-before-commit step on tool calls. That is the single highest-leverage guardrail from the benchmark data.



Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch