
“The next breakthrough in AI reasoning won’t be models that think harder. It will be models that stop thinking in English.”

TL;DR

Every thinking model you use today – o1, o3, DeepSeek-R1, Claude’s extended thinking – reasons by writing out its thoughts in human-readable tokens. That’s expensive, sequential, and surprisingly wasteful. Latent-space reasoning flips this: models think in continuous vector space, exploring multiple paths at once, using 70%+ fewer tokens on some benchmarks. Meta’s Coconut framework scored 97% on planning tasks where chain-of-thought managed 77.5%. The catch is you can’t read what the model is thinking anymore. No frontier lab has shipped this in production yet, but the research from 2025 is moving fast enough that the next generation of reasoning models will probably think in ways we literally cannot read. For context on how current LLM reasoning capabilities work, see LLM Capabilities for Agents.


A glass prism on an optical bench with a white light beam entering one face and a single transformed colored beam exiting the opposite face, the interior of the prism completely dark and opaque, revealing nothing about the internal transformation process.

How did we get from “predict the next word” to “think before you speak”?

Five years of compounding breakthroughs brought us here, and each one looked like a dead end until the next one arrived.

Stage 1: Raw pretraining (GPT-3, 2020). Train a model on the internet, get a pattern completer. Ask it a question and it autocompletes a statistically plausible answer. No reasoning. No following instructions. Just vibes and probability distributions. If you prompted “Do not output the capital of France,” it might still say “Paris” because that’s what the token statistics demanded.

Stage 2: RLHF (InstructGPT/ChatGPT, 2022). Reinforcement Learning from Human Feedback taught models to prioritize instructions over raw probability. The model learned to be helpful, harmless, and honest (or at least to approximate those things). This was the breakthrough that made LLMs useful for actual work, but it didn’t make them think. They got better at saying the right thing without understanding why it was right.

Stage 3: RL on reasoning (DeepSeek-R1, January 2025). DeepSeek demonstrated something genuinely surprising. Using Group Relative Policy Optimization (GRPO), they trained a model with pure RL where the only reward signal was whether the final answer was correct. No human-labeled reasoning trajectories. No curated chain-of-thought examples. And the model spontaneously developed verification, reflection, and exploration of alternative approaches (DeepSeek, arXiv 2501.12948). The reasoning emerged from the optimization pressure alone. There was even what the team called an “aha moment” where the model began re-evaluating its initial approach mid-reasoning.

Stage 4: Visible thinking (o1/o3, DeepSeek-R1, Claude extended thinking, 2024-2025). This is where we are now. Models generate explicit “thinking tokens” before answering. You can read them (sometimes). The model reasons in English, step by step, and the quality of reasoning scales with the number of thinking tokens. OpenAI’s o1 uses reinforcement learning to decide when and how deeply to think. The results are impressive – these models solve problems that stumped every previous generation.

Stage 5: Latent reasoning (2024-present). What if the model could skip the English and think directly in the space where computation actually happens?

That’s the frontier.


What’s wrong with thinking out loud?

I want to be clear: chain-of-thought reasoning works. It’s the biggest practical advance in LLM capability since RLHF. But it has structural problems that no amount of scaling will fix.

The token tax is brutal. OpenAI’s o1 charges $15 per million input tokens and $60 per million output tokens. A single complex query can generate 10,000+ reasoning tokens just for the “thinking” phase. That’s $0.60 of output tokens before the model even starts writing its answer. For applications running millions of queries, the reasoning tax dwarfs the actual response cost.
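Back-of-the-envelope arithmetic on that tax, using the pricing above (a rough sketch; actual prices and token counts vary by model and query):

```python
# Rough cost of the "thinking" phase alone, at $60 per million output tokens.
price_per_output_token = 60 / 1_000_000          # dollars per token (o1-style pricing)
reasoning_tokens_per_query = 10_000              # thinking tokens for one complex query

cost_per_query = reasoning_tokens_per_query * price_per_output_token
print(f"reasoning cost per query: ${cost_per_query:.2f}")                   # $0.60

# At a million such queries, the reasoning tax alone is:
print(f"reasoning cost at 1M queries: ${cost_per_query * 1_000_000:,.0f}")  # $600,000
```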

Most reasoning tokens carry no reasoning. A natural language token carries roughly 15 bits of information. But in a chain-of-thought trace, the vast majority of tokens serve grammatical structure, not logical content. When a model writes “Let me reconsider the second condition in the problem, which states that…” the actual reasoning content is maybe three tokens. The rest is linguistic scaffolding. The reasoning signal is sparse in token space.

Chain-of-thought is stuck in single-file traffic. When you reason in language, you commit to one path at a time. Write a sentence, see where it goes, backtrack if wrong. This is depth-first search – you pick a branch and follow it until it fails. If the first branch was wrong, you’ve burned tokens for nothing and have to start over. For problems with many possible approaches, this sequential exploration is deeply inefficient. See Multi-Step Reasoning for the full picture on how CoT, Tree of Thoughts, and self-consistency address this, but they’re all still constrained by token-level serialization.

The context window is a shared resource. Every thinking token occupies space in the context window. Long reasoning chains compete with the actual problem context, retrieved documents, and conversation history. A model that thinks for 10,000 tokens has 10,000 fewer tokens available for everything else.

None of these problems are bugs. They’re architectural constraints of reasoning in token space.


What if models could think without words?

In December 2024, a team at Meta FAIR published a paper that reframed the entire question. They called it Coconut, short for Chain of Continuous Thought (Hao et al., arXiv 2412.06769).

The idea is deceptively simple. In a standard transformer, the model processes input, generates a hidden state, decodes that hidden state into a token, and feeds that token back as input for the next step. Coconut removes the middle step. Instead of decoding to a token and re-encoding, it takes the last hidden state and feeds it directly back as the next input embedding. The model reasons in continuous vector space, and intermediate thoughts never become words.
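A minimal sketch of that feedback loop, using toy PyTorch components (the `embed`, `block`, and `lm_head` names and shapes are illustrative assumptions, not Meta's implementation):

```python
import torch
import torch.nn as nn

# Toy stand-ins for a real LLM (hypothetical names and sizes).
d_model, vocab = 64, 1000
embed = nn.Embedding(vocab, d_model)           # token id -> embedding
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab)            # hidden state -> token logits

def standard_cot_step(input_embs):
    """Chain-of-thought: decode the hidden state to a token, then re-embed it."""
    h = block(input_embs)                      # (batch, seq, d_model)
    next_token = lm_head(h[:, -1]).argmax(-1)  # collapse to one discrete token
    return torch.cat([input_embs, embed(next_token).unsqueeze(1)], dim=1)

def coconut_step(input_embs):
    """Continuous thought: feed the last hidden state straight back as the next input."""
    h = block(input_embs)
    latent_thought = h[:, -1:]                 # never decoded into a word
    return torch.cat([input_embs, latent_thought], dim=1)

prompt = embed(torch.randint(0, vocab, (1, 8)))  # fake prompt embeddings
seq = prompt
for _ in range(6):                               # six latent reasoning steps
    seq = coconut_step(seq)
```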

Why does this matter? Because of what happens in that continuous space.

A single thought vector can hold multiple ideas at once. This is the insight that makes latent reasoning fundamentally different from chain-of-thought. In language, you pick one idea per sentence. In continuous space, a vector can encode a superposition of multiple reasoning paths simultaneously. Where CoT commits to “let me try approach A” and must backtrack if A fails, a latent thought vector can explore approaches A, B, and C in parallel.

This isn’t metaphorical. The NeurIPS 2025 paper “Reasoning by Superposition” (Zhu et al.) proved it formally: a two-layer transformer with D continuous thought steps can solve directed graph reachability where D equals the graph diameter. Discrete chain-of-thought needs O(n^2) decoding steps for the same problem. The superposition mechanism (encoding multiple search frontiers simultaneously) emerges automatically during training without anyone designing it.
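A toy numerical illustration of that intuition (my own example, not the paper's construction): if the latent state is read as a multi-hot vector over graph nodes, one linear step with the adjacency matrix expands every frontier node at once, which is breadth-first search inside a single "thought":

```python
import numpy as np

# Directed graph on 6 nodes: 0->1, 0->2, 1->3, 2->4, 3->5 (hypothetical example).
A = np.zeros((6, 6))
for src, dst in [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5)]:
    A[src, dst] = 1

# "Latent thought" = superposition of all nodes reached so far (multi-hot vector).
frontier = np.zeros(6)
frontier[0] = 1                       # start at node 0

for step in range(3):                 # D steps, where D = graph diameter
    frontier = np.clip(frontier + frontier @ A, 0, 1)   # expand ALL frontier nodes at once
    print(f"step {step + 1}: reachable = {np.flatnonzero(frontier).tolist()}")

# step 1: reachable = [0, 1, 2]
# step 2: reachable = [0, 1, 2, 3, 4]
# step 3: reachable = [0, 1, 2, 3, 4, 5]
# A token-level chain of thought would instead name and expand one node per decoding step.
```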

The benchmark numbers back this up. On ProsQA (a graph-based planning benchmark), Coconut scored 97.0% where chain-of-thought managed 77.5%. And it did it in 14.2 latent steps versus 49.4 tokens. That’s 71% fewer steps with a 19.5 percentage point accuracy advantage (Coconut paper, 2024).

But here’s what bugs me. On GSM8k (grade school math), Coconut scored 34.1% versus CoT’s 42.9%. Latent reasoning lost, and it lost on a benchmark that isn’t even that hard. The explanation is elegant but uncomfortable: continuous state representations accumulate noise through sequential operations. Each latent step introduces small floating-point perturbations. For tasks requiring high-fidelity symbolic computation – step-by-step arithmetic where carrying the 1 matters – this noise compounds. The very thing that makes latent reasoning powerful (continuous superposition) makes it unreliable at precise calculation.

So we have a tool that’s spectacular at exploration and planning but stumbles at arithmetic. That’s a weird combination, and I’m not entirely sure what to make of it yet.


Who else is building latent reasoners?

The Coconut paper landed in December 2024. By mid-2025, an entire research subfield had formed. Here’s what came next.

Huginn-3.5B (February 2025) took a different architectural approach. Instead of feeding hidden states back through the sequence, Huginn uses depth recurrence: a looped processing unit that iterates over its hidden state multiple times before producing output. An 8-layer physical model can behave like a virtual 132-layer network when its recurrent core loops 32 times. At 3.5B parameters, it outperformed Pythia-6.9B and Pythia-12B on ARC and GSM8k reasoning benchmarks (MarkTechPost, February 2025). Performance scales with recurrence depth: going from 4 to 32+ iterations consistently improved results.
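A minimal sketch of the depth-recurrence idea (toy PyTorch with a single shared block; not Huginn's actual architecture):

```python
import torch
import torch.nn as nn

d_model = 64
# One physical block whose weights are reused on every iteration.
recurrent_block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

def depth_recurrent_forward(x, num_iterations=32):
    """Iterate the same block over the hidden state. More iterations means more
    'virtual depth' at test time, with no extra parameters."""
    h = x
    for _ in range(num_iterations):
        h = recurrent_block(h)            # same weights on every pass
    return h

x = torch.randn(1, 16, d_model)           # (batch, seq, d_model) dummy input
out = depth_recurrent_forward(x, num_iterations=32)
```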

CoLaR (NeurIPS 2025), Compressed Latent Reasoning, attacked the efficiency angle directly. Using a two-stage approach (supervised fine-tuning followed by reinforcement learning), CoLaR achieved a 53.3% reduction in reasoning chain length with only 4.8% performance loss versus explicit CoT. On the MATH dataset specifically, RL boosted performance by 5.36% while reducing reasoning chains by 82.8%. That’s not a typo. Eighty-two percent shorter thinking for better results.

Token Assorted (February 2025) proposed the pragmatic middle ground: mix latent and text tokens in the same reasoning chain. Use latent tokens for exploration and planning, surface to text tokens for precision-critical steps. This hybrid approach achieved a 17% reduction in reasoning trace length while improving performance over pure CoT. I suspect this is closer to what actually ships in production first.
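One way such a hybrid decode loop could be wired, reusing the same toy components as the Coconut sketch above (the `needs_precision` gate and the continuous feedback are illustrative assumptions based on the description here, not the paper's actual mechanism):

```python
import torch
import torch.nn as nn

d_model, vocab = 64, 1000
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab)

def needs_precision(step: int) -> bool:
    """Hypothetical gate: surface to text tokens on precision-critical steps.
    A real system might learn this, or trigger on arithmetic/code spans."""
    return step % 4 == 3

seq = embed(torch.randint(0, vocab, (1, 8)))      # fake prompt embeddings
for step in range(8):
    h = block(seq)
    if needs_precision(step):
        token = lm_head(h[:, -1]).argmax(-1)      # discrete, auditable text token
        next_emb = embed(token).unsqueeze(1)
    else:
        next_emb = h[:, -1:]                      # continuous latent thought
    seq = torch.cat([seq, next_emb], dim=1)
```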

Latent Thinking Optimization (September 2025) discovered something that partially addresses the explainability concern. The researchers found that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns. A classifier trained on latent states can reliably predict whether the final answer will be correct. The reward signal is secretly encoded in the latent representations (Du, Dong, Ning; arXiv 2509.26314). You can’t read the thinking, but you can tell if it’s going well.
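A sketch of what such a latent auditor might look like: a small probe trained to predict answer correctness from latent thought vectors (synthetic stand-in data and a scikit-learn classifier, purely for illustration; not the LTO method's specifics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: latent thought vectors with a correctness label. In practice these
# would come from running the latent reasoner on problems with known answers;
# here they are synthetic so the example runs on its own.
d_latent, n = 2560, 500
latents = rng.normal(size=(n, d_latent))
labels = (latents[:, 0] + 0.5 * latents[:, 1] > 0).astype(int)   # fake "correct" signal

probe = LogisticRegression(max_iter=1000).fit(latents[:400], labels[:400])
print("held-out accuracy:", probe.score(latents[400:], labels[400:]))

# At inference time: flag queries whose latent trace the probe scores as likely wrong,
# and route them to a slower, token-level (human-auditable) reasoning pass.
```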

Here’s a summary of where things stand:

| Approach | Date | Key Result | Tradeoff |
| --- | --- | --- | --- |
| Coconut (Meta FAIR) | Dec 2024 | 97% on ProsQA vs 77.5% CoT; 71% fewer tokens | Weak on arithmetic (GSM8k 34% vs 43%) |
| Huginn-3.5B | Feb 2025 | 3.5B params outperforms 12B on reasoning | Modest gains vs explicit CoT on most benchmarks |
| CoLaR | NeurIPS 2025 | 82.8% shorter reasoning, 5.36% better on MATH | Two-stage training complexity |
| Token Assorted | Feb 2025 | 17% shorter traces, better accuracy | Hybrid approach, not pure latent |
| LTO | Sep 2025 | Latent states predict correctness | Computational auditing, not human-readable |

The Latent-SFT framework (2025-2026) unified several of these approaches, achieving 2.7x to 5.5x reduction in reasoning chain length while matching explicit supervised fine-tuning on GSM8k. System-1.5 Reasoning delivered over 20x inference speedups by dynamically allocating computation only to steps that need deep reasoning.


The explainability problem everyone’s tiptoeing around (and the partial fix)

Let me state the uncomfortable thing directly: latent reasoning means you cannot read what the model is thinking.

With chain-of-thought, the thinking is right there in the output. When Claude or o1 reasons through a problem, you can trace the logic, spot the errors, understand where it went wrong. This isn’t just convenient. It’s a safety property. OpenAI partially hides o1’s reasoning tokens, but they exist as text and could in principle be audited. The transparency isn’t perfect, but it’s real.

Latent reasoning produces opaque vector states. The model processes information, updates its hidden representations, and arrives at an answer. The intermediate states are 2560-dimensional floating-point vectors. There is no log, no trace, no reasoning chain to inspect.

The safety community noticed. The LessWrong community raised concerns about deceptive alignment: a model that reasons in interpretable language at least could reveal problematic reasoning patterns. A model reasoning in latent space provides no such window. You can’t catch a model plotting something if its plots are high-dimensional vectors (LessWrong, “Worries about latent reasoning in LLMs,” 2025). The 2026 International AI Safety Report, backed by 30+ countries and 100+ AI experts, warned that reliable safety testing has become harder as models learn to distinguish between test environments and real deployment. Latent reasoning compounds this problem.

But there are partial fixes. The LTO work from September 2025 showed that latent representations aren’t as opaque as they seem. Correct and incorrect reasoning produce measurably different latent patterns. A trained classifier can audit latent thoughts computationally, even without human-readable output. It’s not the same as reading the reasoning, but it’s not flying completely blind either.

Mechanistic interpretability, recognized as one of MIT Technology Review’s “10 Breakthrough Technologies 2026,” offers another path. Instead of reading output text, researchers map computational pathways through the network’s weights and activations. If we can understand what circuits a model activates during latent reasoning, we might achieve a form of interpretability that’s actually deeper than reading chain-of-thought text – which, after all, is just the model’s self-report of its own reasoning, and self-reports aren’t always accurate.

I keep coming back to an analogy. I can’t read my own neural firing patterns either. The words I use to describe my thinking are a lossy compression of whatever my neurons actually did. Chain-of-thought reasoning is the model’s lossy self-report. Latent reasoning is the raw computation. Both are interpretable with the right tools. Just different tools.


Will the next frontier model think in latent space?

As of March 2026, no frontier lab has shipped a production model that uses latent reasoning as its primary thinking mechanism. Every deployed thinking model (o1, o3, DeepSeek-R1, Claude’s extended thinking, Gemini’s thinking mode) reasons in human-readable tokens. Latent reasoning is a research direction, not a deployed capability.

But the research velocity is hard to ignore. In fifteen months, the field went from a proof-of-concept paper to unified training frameworks matching CoT performance with 5x fewer reasoning steps. That’s fast, even by AI research standards.

The hybrid approach ships first. Token Assorted’s mix of latent and text tokens is the pragmatic path. Use latent reasoning for the exploration phase – broad search across approaches, planning, constraint satisfaction. Surface to text tokens for the precision phase – arithmetic, code generation, structured output. This gives you the efficiency of latent reasoning where it excels and the reliability of CoT where latent reasoning struggles. I’d be surprised if some version of this isn’t in a frontier model within the next twelve months.

The economics are too compelling to ignore. If latent reasoning achieves general parity with token-based reasoning, the entire business model of charging per-output-token for “thinking” breaks down. A model that reasons in 14 latent steps instead of 50 tokens doesn’t just save compute – it fundamentally changes the cost structure. For companies running reasoning-heavy workloads (coding agents, research assistants, complex analysis), the savings could be an order of magnitude. This creates strong incentive for every major lab to invest in the approach, regardless of the interpretability concerns. For more on how training approaches like RL and fine-tuning shape model behavior, see Fine-Tuning for Agent Tasks.

The interpretability gap will slow deployment but not stop it. Safety-conscious labs will hesitate, rightly, to ship opaque reasoning. But the LTO and mechanistic interpretability results suggest the gap is narrower than it looks. And there’s a pragmatic argument: if a latent reasoner consistently produces better, more reliable answers than a token-based reasoner on the same tasks, the interpretability concern becomes an engineering problem to solve, not a blocker.

My bet. The next major reasoning breakthrough won’t be announced as “latent reasoning.” It will be announced as “2x faster inference” or “90% reduction in reasoning cost” or “new reasoning architecture.” The latent part will be an implementation detail that most users never think about, the same way most people don’t think about whether their CPU uses branch prediction. The thinking moves underground, and we judge the model by its outputs.

We’ve been here before. The transition from hand-crafted features to learned representations in computer vision felt like a loss of interpretability too. Researchers who tuned SIFT descriptors by hand were uncomfortable with neural networks that learned their own features in opaque latent spaces. Twenty years later, nobody wants to go back to hand-crafted features. The learned representations were just better.

I suspect the same thing happens with reasoning. The models that think in ways we can’t read will outperform the models that think in ways we can. And we’ll adapt.


FAQ

What tasks is latent reasoning better at than chain-of-thought?

Latent reasoning excels at tasks requiring exploration of multiple paths: planning, constraint satisfaction, graph traversal, and strategy selection. Coconut’s 97% vs 77.5% advantage on ProsQA (graph-based planning) is the clearest example. It struggles with tasks requiring precise sequential symbolic computation like multi-step arithmetic, where continuous-space noise accumulation hurts accuracy.

How much cheaper could latent reasoning make inference?

The upper bound is significant. Coconut used 71% fewer steps than CoT on planning tasks. CoLaR achieved 82.8% shorter reasoning chains on MATH. If you’re paying $60/1M output tokens for o1-level reasoning, eliminating 70-80% of reasoning tokens changes the economics fundamentally. The actual savings depend on how much of a model’s reasoning can move to latent space versus how much needs to stay in token space for precision.

Can you combine latent reasoning with chain-of-thought?

Yes, and this is probably what ships first. Token Assorted (February 2025) mixes latent and text tokens in the same reasoning chain, achieving 17% shorter traces with better accuracy than pure CoT. The practical architecture likely uses latent reasoning for exploration and planning phases, then surfaces to text tokens for precision-critical steps like arithmetic or code generation.

When will a production model use latent reasoning?

No frontier lab had shipped one as of March 2026. But the research trajectory is steep: fifteen months from proof-of-concept to frameworks matching CoT performance with 5x fewer steps. A hybrid approach (latent + text tokens) in a frontier model within the next twelve months seems likely. Pure latent reasoning in production probably takes longer, pending progress on interpretability and safety auditing.


Key takeaways

  • LLM reasoning evolved through five stages: pretraining, RLHF, RL-driven reasoning, visible chain-of-thought, and now latent-space reasoning
  • Chain-of-thought wastes tokens on linguistic scaffolding: the reasoning signal is sparse in language but dense in latent space
  • Coconut (Meta FAIR, 2024) proved the concept: 97% accuracy on planning tasks vs 77.5% for CoT, using 71% fewer steps
  • Latent reasoning enables breadth-first search by encoding multiple reasoning paths in a single vector, a capability that emerges automatically during training
  • The weakness is precise arithmetic: continuous space accumulates noise through sequential symbolic operations
  • The research field exploded in 2025 (Huginn, CoLaR, Token Assorted, LTO, Reasoning by Superposition), with frameworks now matching CoT performance at 2-5x fewer reasoning steps
  • The explainability tradeoff is real but partially addressable through latent classifiers and mechanistic interpretability
  • The hybrid approach (latent for exploration, tokens for precision) is the most likely path to production deployment
  • The economic incentive is strong: eliminating 70-80% of reasoning tokens fundamentally changes inference cost structure

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch