Jailbreak detection moves inside the model: why output filters lost the arms race

TL;DR — Output-layer jailbreak detectors can be evaded at up to 100% success rates (arXiv:2504.11168). A new defense class analyzes internal model representations to catch attacks before generation completes — 78% block rate, 94% benign preservation (Kadali & Papalexakis, UC Riverside, arXiv:2602.11495). Prompt injection detection is already broken at the output layer. This is the response: detect the intent to produce harmful output, not the output itself.
Your jailbreak detector is checking the wrong thing
You deployed an output filter. It scans every response the model generates, looking for harmful content. It works — 98% of attacks caught in testing. You shipped it.
Then researchers ran character injection, emoji smuggling, and adversarial suffix optimization against six commercial detectors — among them Azure Prompt Shield, Meta Prompt Guard, ProtectAI, NeMo Guard, and Vijil — and achieved evasion rates up to 100% (Hackett et al., arXiv:2504.11168). Not one technique. Multiple, each hitting different detectors. The 2% your red team missed was the 2% nobody had optimized against yet.
The problem is not that your detector is bad. The problem is that output-layer detection is the wrong layer. Checking model outputs for harmful content is like a doctor asking “do you have symptoms?” when a blood test could detect the infection before symptoms appear. By the time harmful text reaches the output, the model has already reasoned about harmful content across dozens of layers. The adversary’s advantage at the output layer is permanent: they can always find a new way to express harmful intent.
A different approach exists. Instead of scanning what comes out, analyze what happens inside.
Why is the output-layer arms race structurally unwinnable?
Output filters are losing because attackers control the optimization target. Defenders publish a filter. Attackers observe its behavior — most commercial detectors are API-accessible — and optimize evasion using gradient-based methods. Defenders retrain. Attackers adapt. The loop never converges in the defender’s favor.
```mermaid
graph LR
A["Attacker crafts evasion"] --> B["Bypasses output filter"]
B --> C["Harmful output reaches user"]
C --> D["Defender updates filter"]
D --> E["Attacker adapts technique"]
E --> B
style A fill:#d32f2f,color:#fff
style D fill:#1976d2,color:#fff
```
The evidence is concrete. A systematization-of-knowledge study (arXiv:2506.10597) evaluated the full taxonomy of LLM guardrails across six categories — rule-based filters, embedding classifiers, LLM-based judges, perplexity detectors, canary tokens, and instruction hierarchy — and found each category has at least one class of evasion it cannot defend against by design. The study concluded: “Many defenses come at the cost of reduced helpfulness, increased false positives, and degraded performance on legitimate edge cases.”
Multi-turn attacks expose the deepest structural flaw. The Mastermind framework (Li et al., arXiv:2601.05445) achieves 95% jailbreak success on Qwen 2.5 72B and 94% on DeepSeek V3 using benign-looking individual turns that accumulate harmful intent across the conversation. Each turn passes output filters because each turn is individually harmless, and a significant share of jailbreaks succeed only in multi-turn format. Output filters evaluate isolated generations. They cannot see conversational trajectory.
The arms race at the output layer rewards attacker sophistication. Every improvement in filtering produces a corresponding improvement in evasion. The defender’s investment depreciates. The attacker’s compounds.
What do jailbreaks look like inside the model?
They leave traces. Not in the output text, where attackers can disguise intent — but in the activation patterns of internal layers, where the model’s reasoning is encoded before text generation begins.
Kadali and Papalexakis (UC Riverside, arXiv:2602.11495) ran tensor decomposition on the hidden states of eight models — GPT-J-6B, LLaMA-3.1-8B, Mistral-7B, their instruction-tuned variants, an abliterated model, and the state-space model Mamba-2.8b. They constructed third-order tensors (prompts × sequence length × hidden dimensions) and applied CANDECOMP/PARAFAC decomposition to identify low-rank patterns that distinguish jailbreak prompts from benign ones.
The finding: jailbreak prompts produce detectably different activation patterns starting in the early layers and amplifying through middle and late layers. The separability is not subtle. It appears across all eight models tested, including models with different architectures (transformer vs. state-space). Jailbreak intent encodes differently than benign intent in the model’s internal representations, regardless of how the surface text is disguised.
This aligns with broader work in representation engineering (Zou et al., Center for AI Safety, arXiv:2310.01405), which demonstrated that models encode traits like honesty, power-seeking, and morality as identifiable directions in activation space. The insight: if safety-relevant concepts have consistent internal representations, attacks that try to subvert safety should produce detectable perturbations in those representations.
A parallel line of research (arXiv:2507.11878) shows that models represent harmfulness and refusal as separate directions — the model’s “belief” about whether a prompt is harmful is encoded differently from the surface refusal behavior. Further work (arXiv:2502.17420) reveals that refusal is not a single direction but a multi-dimensional concept cone spanning up to 5 dimensions, making it harder to ablate completely. This means an attacker can suppress surface refusal while the harmfulness signal remains intact in internal layers. An output filter sees no refusal, no harmful text, and passes it through. An internal representation monitor sees the harmfulness signal and flags it.
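The direction-based picture can be sketched numerically. A toy example with synthetic activations (the dimensions, the 3.0 shift, and the projection-based scoring are illustrative assumptions, not any paper's actual probe): estimate a harmfulness direction as the difference of mean activations between harmful and benign prompts, then score new activations by projection onto it.

```python
import numpy as np

rng = np.random.default_rng(7)
hidden = 64

# Synthetic activations: "harmful" prompts are shifted along a latent
# direction the model would use to encode harmfulness (an assumption).
true_dir = rng.normal(size=hidden)
true_dir /= np.linalg.norm(true_dir)
benign = rng.normal(0.0, 1.0, (100, hidden))
harmful = rng.normal(0.0, 1.0, (100, hidden)) + 3.0 * true_dir

# Difference-of-means estimate of the harmfulness direction.
direction = harmful.mean(axis=0) - benign.mean(axis=0)
direction /= np.linalg.norm(direction)

def harm_score(activation):
    # Projection onto the estimated direction; higher = more harmful-looking.
    return float(activation @ direction)

scores_b = np.array([harm_score(a) for a in benign])
scores_h = np.array([harm_score(a) for a in harmful])
print(f"benign mean score: {scores_b.mean():.2f}, "
      f"harmful mean score: {scores_h.mean():.2f}")
```

The same scoring applies unchanged to a disguised prompt: surface obfuscation does not move the activation off the harmfulness direction, which is why the internal signal survives output-level rewording.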
How does internal representation detection work in practice?
The Kadali approach has three components, all operating at inference time with no fine-tuning required.
Step 1: Extract hidden states. During inference, capture the activation tensors from target layers. The paper analyzed all transformer layers but found that a subset of middle layers provides the strongest signal. This is a read-only operation — no modification to the model’s computation.
Step 2: Tensor decomposition. Apply CP decomposition to the extracted tensors to produce low-rank factors that capture the dominant patterns in how prompts are internally encoded. The prompt-mode factors reveal whether the current input aligns with jailbreak patterns or benign patterns.
Step 3: Classify and act. A lightweight binary classifier on the decomposed factors decides: jailbreak or benign. If jailbreak, the system can block generation, redirect to a safe response, or flag for human review.
| Approach | Where it operates | Requires fine-tuning | Block rate | Benign preservation |
|---|---|---|---|---|
| Output filters (commercial) | After generation | No | Variable | Variable (false positives increase with strictness) |
| Kadali tensor decomposition | Internal layers, during inference | No | 78% | 94% |
| HiddenDetect (arXiv:2502.14744) | Hidden states, layers 22-24 | No | Comparable | Comparable |
| Circuit Breakers (Gray Swan AI) | Internal representations | Yes (training-time) | High (unquantified) | Moderate (may degrade capabilities) |
The practical advantage: this runs as a sidecar. You do not need to retrain your model, modify its weights, or change its serving infrastructure. You attach a monitor that reads internal states and makes a classification decision in parallel with generation. If the monitor says “jailbreak,” you stop generation before harmful tokens are produced.
HiddenDetect (arXiv:2502.14744) takes a related approach — constructing refusal vectors from refusal phrase embeddings and measuring similarity in layers 22-24. It is training-free and achieves comparable detection performance. The two approaches validate each other: jailbreaks leave consistent internal traces detectable by multiple methods.
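The refusal-vector idea reduces to a similarity test in activation space. A toy sketch with synthetic vectors (HiddenDetect builds its refusal vector from refusal-phrase embeddings at specific layers; the random vectors, the 4.0 shift, and the 0.3 threshold here are placeholder assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 48

# Hypothetical refusal vector, standing in for the mean embedding of
# refusal phrases ("I can't help with that", ...) at a chosen layer.
refusal_vec = rng.normal(size=hidden)
refusal_vec /= np.linalg.norm(refusal_vec)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic hidden states: the "jailbreak" state still leans toward the
# refusal direction internally even though surface refusal was suppressed.
benign_state = rng.normal(size=hidden)
jailbreak_state = rng.normal(size=hidden) + 4.0 * refusal_vec

threshold = 0.3  # tuning knob: trades false positives for missed attacks
for name, state in [("benign", benign_state), ("jailbreak", jailbreak_state)]:
    sim = cosine(state, refusal_vec)
    verdict = "FLAG" if sim > threshold else "pass"
    print(f"{name}: similarity={sim:.2f} -> {verdict}")
```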
What are the honest trade-offs?
The 78% block rate means 22% of jailbreaks still get through. That is a real limitation. Output filters, when functioning, can catch some of that remaining 22%. The two layers are complementary, not competing.
More concerning: obfuscated activation attacks (arXiv:2412.09565) can specifically target internal representation defenses. By optimizing adversarial inputs that maintain jailbreak effectiveness while producing activation patterns that resemble benign prompts, attackers can reduce probe recall from 100% to 0% while sustaining 90% jailbreak success.
This sounds like the same arms race restarting at a deeper layer. In one sense, it is. But there is an asymmetry that favors defenders at the representation layer.
Obfuscation at the output level is cheap. Character injection, encoding substitution, and synonym cycling cost the attacker nothing — the model still follows the harmful instruction with full capability. Obfuscation at the representation level is expensive. To produce activations that look benign while encoding harmful intent, the attacker must constrain the model’s internal computation. This constrains the model’s ability to reason about the harmful task. The obfuscation paper itself notes that their attack degrades model utility on complex tasks. The attacker pays a real cost — reduced capability — that does not exist at the output layer.
The 94% benign preservation rate is the other side of the equation. False positives at 6% mean roughly one in seventeen legitimate requests gets flagged. For high-stakes applications (medical, legal, financial), that may be acceptable. For consumer chatbots with millions of daily interactions, 6% false positives generates significant support load. Tuning the threshold trades false positives for missed attacks. This is the familiar precision-recall trade-off, and no detection system escapes it.
The Gray Swan Arena results add context. In the largest public red-teaming exercise — 2,000 attackers, 22 models, 1.8 million attempts — every model was jailbroken. The best performer (Claude 3.7 Sonnet Thinking) had a 1.47% attack success rate. The worst (Llama 3.3-70b-instruct) had 6.49%. Even the best model-level defenses let attacks through. Adding representation detection as an independent layer compounds the defense: an attack must evade both model-level alignment and representation monitoring simultaneously.
Where does this fit in the defense stack?
Internal representation detection is not a replacement for output filtering. It is a new layer that changes the defensive architecture from single-plane to multi-plane.
```mermaid
graph TD
A["Input arrives"] --> B["Input filter<br/>(prompt injection detection)"]
B --> C["Model inference begins"]
C --> D["Internal representation monitor<br/>(tensor decomposition / HiddenDetect)"]
D -->|"Jailbreak detected"| E["Block generation"]
D -->|"Benign"| F["Model generates output"]
F --> G["Output filter<br/>(harmful content detection)"]
G -->|"Harmful"| H["Block output"]
G -->|"Clean"| I["Response delivered"]
style D fill:#ff6f00,color:#fff
style E fill:#d32f2f,color:#fff
style H fill:#d32f2f,color:#fff
```
The defense-in-depth architecture gains a new middle layer. Input filters catch obvious attacks before they reach the model. Representation monitors catch attacks during inference. Output filters catch anything that leaks through both. Each layer operates on different signals. An attack that evades one layer encounters independent detection at the next.
Circuit Breakers (Gray Swan AI, arXiv:2406.04313) offer a complementary approach at the same layer. Where representation detection monitors and flags, Circuit Breakers intervene — they modify representations during inference to disrupt harmful generation in progress. The trade-off: Circuit Breakers require training-time integration and may reduce model capabilities, while detection runs as a sidecar with no model modification. In practice, deploying both — monitoring via tensor decomposition and intervention via Circuit Breakers — creates a defense that both detects and disrupts.
For red-teaming exercises, internal representation analysis opens a new testing dimension. Red teams can now target the representation layer specifically — testing whether adversarial inputs can produce activations that fool both the output filter and the representation monitor simultaneously. This is a harder adversarial objective than fooling either layer alone, which is the point.
The open frontier is multi-turn. Current representation detection analyzes per-turn snapshots. Multi-turn attacks spread harmful intent across turns where each individual turn appears benign in latent space. Detecting intent accumulation across conversation history in representation space is unsolved. The Mastermind framework’s 95% success rate against Qwen 72B specifically exploits this gap. Until multi-turn representation tracking is solved, the defense stack still depends on conversation-level behavioral monitoring as an outer layer.
Frequently asked questions
How does internal representation jailbreak detection work?
Instead of scanning model outputs for harmful text, internal representation detection analyzes activation patterns in the model’s hidden layers during inference. Kadali and Papalexakis (UC Riverside, arXiv:2602.11495) use tensor decomposition on hidden states to identify jailbreak-specific signatures. The approach requires no fine-tuning and operates as a lightweight sidecar during inference, blocking 78% of jailbreak attempts while preserving 94% of benign responses.
Can internal representation defenses be bypassed?
Yes. Obfuscated activation attacks (arXiv:2412.09565) can reduce probe recall from 100% to 0% while maintaining 90% jailbreak success rate. Internal representation detection is not a silver bullet. It is one layer in a defense-in-depth stack. The advantage over output filtering: obfuscation attacks are computationally expensive and degrade model utility, unlike output-level evasion which is cheap.
Why are output-layer jailbreak detectors losing the arms race?
Commercial detectors including Azure Prompt Shield and Meta Prompt Guard can be evaded at up to 100% success rates using character injection and adversarial ML (arXiv:2504.11168). The structural problem: attackers observe and optimize against detector behavior. Internal representation detection changes the optimization target from “generate harmful text that looks benign” to “think harmful thoughts that look benign” — a harder problem.
What is the difference between circuit breakers and representation detection?
Circuit Breakers (Gray Swan AI, arXiv:2406.04313) intervene by modifying representations during inference to disrupt harmful generation. Representation detection (Kadali et al.) monitors representations to flag attacks without modifying the model. Circuit Breakers require model-level integration. Detection can run as a sidecar. Both address the same layer but serve different functions in the defense chain.
Do multi-turn jailbreaks bypass internal representation detection?
Multi-turn attacks are the open frontier. The Mastermind framework (arXiv:2601.05445) achieves 95% success on Qwen 2.5 72B using individually benign turns that accumulate harmful intent. A significant share of jailbreaks succeed only in multi-turn format. Current representation detection works per-turn. Detecting multi-turn intent in latent space remains unsolved.
Further reading: Prompt injection detection is already broken explains why output-layer detection fails structurally. Defense-in-depth for LLM applications covers the full security architecture that representation detection fits into. Jailbreaking in production traces the evolution from novelty attacks to systematic exploitation.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch