On April 7, 2026, Anthropic did something no frontier lab had done before: it announced its most capable model and simultaneously told the world it would not sell it.

TL;DR

Anthropic’s Claude Mythos Preview finds zero-days in every major OS and browser, reproduces exploits on first attempt 83.1% of the time, and autonomously discovered a 17-year-old FreeBSD RCE (CVE-2026-4747). Instead of releasing it, Anthropic created Project Glasswing: $100M in credits distributed to 50+ organizations for defensive use only. This is not alignment research and it is not access control. It is capability bounding – a new architectural pattern where deployment constraints, usage monitoring, and behavioral conditioning restrict what a model can do in practice without reducing what it can do in principle. The contrarian read: capability bounding creates premium pricing power ($25/$125 per million tokens) and a competitive moat that safety alone does not explain. For context on the broader threat landscape these capabilities reshape, see The AI threat model every security team is missing.

A laser beam refracting through a glass prism into controlled directional beams in a dark laboratory, representing capability bounding of frontier AI


What did Claude Mythos Preview actually find?

The numbers need to land before the architecture discussion makes sense.

Mythos Preview is a general-purpose frontier model that turns out to be terrifyingly effective at computer security. On CyberGym vulnerability reproduction, it scored 83.1% versus Opus 4.6’s 66.6%. On SWE-bench Verified, it hit 93.9%. On the OSS-Fuzz corpus, it produced 595 tier 1-2 crashes compared to 150-175 for Opus 4.6, and achieved full control flow hijack on ten separate, fully patched targets.

The raw vulnerability count tells part of the story: thousands of zero-days across every major operating system and every major web browser. Over 99% remain unpatched as of this writing. But specific findings tell the rest:

  • FreeBSD NFS (17 years old, CVE-2026-4747): Mythos autonomously identified and exploited a remote code execution vulnerability that grants root access to any machine running NFS from an unauthenticated remote position. The exploit chains RPCSEC_GSS authentication bypass with ROP gadget chaining split across six sequential packets, finishing with SSH key injection into root’s authorized_keys.
  • OpenBSD TCP/SACK (27 years old): A remote denial-of-service vulnerability exploiting signed integer overflow in sequence number comparisons. Discovered for under $50 in compute cost in a single run.
  • FFmpeg H.264 codec (16 years old): A sentinel collision vulnerability in slice tracking that automated tools missed 5 million times.
  • Linux kernel privilege escalation chains: Multiple chained exploits combining read and write primitives to bypass KASLR.
  • Firefox JavaScript engine: Where Opus 4.6 developed working exploits twice out of several hundred attempts, Mythos Preview developed working exploits 181 times and achieved register control on 29 more.
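The OpenBSD finding above is an instance of a classic bug class: comparing 32-bit wrap-around TCP sequence numbers with plain ordering instead of modular "serial number" arithmetic. A minimal Python sketch (my illustration of the bug class, not the actual OpenBSD code) shows why the naive comparison fails across the wrap:

```python
# Illustrative only: TCP sequence numbers live in a 32-bit space that wraps,
# so they must be compared with modular (serial-number) arithmetic.

MASK = 0xFFFFFFFF  # 32-bit sequence space

def naive_lt(a, b):
    """Plain comparison: breaks once the sequence space wraps."""
    return a < b

def seq_lt(a, b):
    """Serial-number comparison: interpret (a - b) mod 2^32 as signed."""
    diff = (a - b) & MASK
    if diff & 0x80000000:   # sign bit set -> a precedes b
        diff -= 1 << 32
    return diff < 0

old = 0xFFFFFFF0            # just before the wrap
new = (old + 0x20) & MASK   # 0x00000010, logically *after* old

print(naive_lt(old, new))   # False -- misorders across the wrap
print(seq_lt(old, new))     # True  -- correct modular ordering
```

Mishandling exactly this kind of arithmetic, at the boundary where the signed interpretation overflows, is the shape of bug that can survive decades of review.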

Human security researchers working with Mythos reported that tasks requiring weeks to develop now take hours with model assistance. This is not incremental improvement. It is a functional difference in kind.

What is capability bounding and how does it differ from alignment?

The industry language around this is a mess. Three distinct mechanisms exist for controlling what an AI system does, and most commentary mashes all three together.

Alignment operates on the model itself. Through training, RLHF, or constitutional methods, you change the model’s weights so it refuses harmful requests or follows specific behavioral norms. The capability is reduced at the parameter level. The model genuinely cannot (or strongly resists) producing certain outputs.

Access control operates on the API layer. Rate limiting, authentication, role-based permissions, content filtering on inputs and outputs. The model retains full capability, but the serving infrastructure gates who can reach it and what passes through.

Capability bounding operates at the deployment layer. The model retains its full weights and full capability, and no single API gate defines the approach. Instead, a combination of deployment constraints (who gets access, and for which use cases), usage monitoring (what partners do with it, backed by mandatory reporting), and behavioral conditioning (system prompts and fine-tuning that steer the model toward defensive patterns without reducing raw capability) creates an envelope around the model's practical use.

```mermaid
graph TB
    subgraph "Three Control Mechanisms"
        A["Alignment<br/><i>Change the model</i>"] --> W["Modify weights via RLHF,<br/>constitutional AI, training"]
        B["Access Control<br/><i>Gate the API</i>"] --> X["Auth, rate limits,<br/>content filters"]
        C["Capability Bounding<br/><i>Constrain deployment</i>"] --> Y["Use-case restrictions,<br/>monitoring, reporting,<br/>behavioral conditioning"]
    end

    subgraph "What Changes"
        W --> W1["Model capability reduced"]
        X --> X1["Model access restricted"]
        Y --> Y1["Model use directed"]
    end

    style A fill:#e74c3c,color:#fff
    style B fill:#f39c12,color:#fff
    style C fill:#2ecc71,color:#fff
```

The distinction matters because alignment has costs. Reducing a model’s capability to prevent misuse also reduces its capability for legitimate use. Access control is binary – you are either in or out, with limited granularity. Capability bounding attempts to preserve the full capability for approved purposes while restricting the deployment envelope.
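A toy wrapper makes the deployment-layer distinction concrete. Everything here is hypothetical (the partner IDs, the prompt text, the model stub are mine, not Anthropic's API); the point is that the constraint lives around the model call, not inside the weights:

```python
# Hypothetical sketch of a deployment-layer bound. The model call itself is
# unchanged and fully capable; the restrictions live in the wrapper.

import time

APPROVED_PARTNERS = {"example-ciso-team"}   # deployment constraint: who gets access
DEFENSIVE_PROMPT = "Operate in defensive-security mode: report findings, do not weaponize."
AUDIT_LOG = []                              # usage monitoring: every call is recorded

def bounded_call(partner_id, user_prompt, model_fn):
    if partner_id not in APPROVED_PARTNERS:
        raise PermissionError(f"{partner_id} is not an approved partner")
    AUDIT_LOG.append({"partner": partner_id, "prompt": user_prompt, "ts": time.time()})
    # behavioral conditioning: steer the full-capability model, don't lobotomize it
    return model_fn(DEFENSIVE_PROMPT + "\n\n" + user_prompt)

def echo_model(prompt):
    """Stub standing in for the full-capability model."""
    return f"[model saw {len(prompt)} chars]"

print(bounded_call("example-ciso-team", "scan libfoo for memory bugs", echo_model))
```

Alignment would change `model_fn` itself; access control would stop at the permission check. Capability bounding is the whole wrapper, including the audit trail and the steering prompt.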

Project Glasswing is the first public implementation of capability bounding at the frontier. Mythos Preview is not aligned away from finding exploits – that is its purpose. It is not merely behind an API gate – the 50+ organizations have API access. It is bounded: defensive use only, 90-day vulnerability reporting requirements, cryptographic hashing of undisclosed vulnerabilities pending patches, and a Cyber Verification Program for security professionals needing safeguard exceptions.

How does Project Glasswing enforce capability bounding in practice?

The enforcement architecture has four layers, each reinforcing the others.

Layer 1: Partner vetting. Access requires organizational approval. Launch partners – AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks – are not random API customers. They are organizations with existing security infrastructure, incident response teams, and reputational incentives to comply. The additional 40+ organizations maintain critical software infrastructure.

Layer 2: Contractual constraints. Partners commit to defensive-only use with mandatory 90-day reporting on vulnerabilities discovered and patches applied. This creates a paper trail and legal liability that pure access control does not provide.

Layer 3: Technical monitoring. Anthropic coordinates usage monitoring across the partner network. Undisclosed vulnerabilities get cryptographic hashes – creating a tamper-evident record of what was found and when. This is not just logging API calls. It is building an audit chain for vulnerability disclosure.

Layer 4: Behavioral steering. The model’s system-level configuration directs it toward defensive workflows: scanning for vulnerabilities, generating patches, producing disclosure reports. Safeguards “detect and block the model’s most dangerous outputs.” This is lighter than alignment – the weights are not changed to remove exploit capability – but heavier than a content filter.

```mermaid
graph LR
    subgraph "Capability Bounding Stack"
        L1["Partner Vetting<br/>Organizational approval"] --> L2["Contractual Constraints<br/>Defensive-only, 90-day reporting"]
        L2 --> L3["Technical Monitoring<br/>Crypto hashes, audit chains"]
        L3 --> L4["Behavioral Steering<br/>Defensive system config,<br/>output safeguards"]
    end

    subgraph "Enforcement Properties"
        L1 -.-> P1["Reputational incentive"]
        L2 -.-> P2["Legal liability"]
        L3 -.-> P3["Tamper evidence"]
        L4 -.-> P4["Operational direction"]
    end

    style L1 fill:#3498db,color:#fff
    style L2 fill:#2980b9,color:#fff
    style L3 fill:#1abc9c,color:#fff
    style L4 fill:#16a085,color:#fff
```

Each layer alone is insufficient. A vetted partner could still misuse the model. Contractual constraints are enforceable only after violation. Technical monitoring catches misuse only if you know what to monitor for. Behavioral steering can be circumvented by sophisticated users. Together, they create enough friction that the expected cost of misuse exceeds the expected benefit for any partner organization.
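Layer 3's hash commitments can be sketched in a few lines. This is my construction of the general pattern (a salted SHA-256 commitment published now, with the report revealed after the patch ships), not Anthropic's published scheme:

```python
# Sketch of a tamper-evident vulnerability commitment: publish the hash
# immediately, reveal the report only after the patch ships. Anyone can
# then verify what was known and when, and backdating is detectable.

import hashlib
import json

def commit(report: dict, salt: str) -> str:
    """Hash a report with a private salt so the hash itself leaks nothing."""
    payload = json.dumps(report, sort_keys=True) + salt
    return hashlib.sha256(payload.encode()).hexdigest()

def verify(report: dict, salt: str, published_hash: str) -> bool:
    """At disclosure time, check the revealed report matches the old commitment."""
    return commit(report, salt) == published_hash

finding = {"target": "example-nfs-daemon", "class": "RCE", "found": "2026-04-07"}
salt = "per-report-random-nonce"   # kept private until disclosure

h = commit(finding, salt)          # published on discovery day
assert verify(finding, salt, h)    # holds when the report is revealed
assert not verify({**finding, "found": "2026-06-01"}, salt, h)  # backdating fails
```

The salt matters: without it, an adversary who guesses the report's contents could confirm the guess against the public hash before disclosure.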

Why did Anthropic choose capability bounding over alignment for Mythos?

Because alignment would destroy the product.

Mythos Preview’s value proposition is finding and exploiting vulnerabilities. An aligned model that refuses to produce exploit code is useless for defensive security research. You cannot scan a codebase for vulnerabilities if the model has been trained to avoid discussing vulnerabilities. The defensive and offensive capabilities are the same capabilities applied to different targets.

This is the core tension that makes capability bounding architecturally interesting: the capability you want to restrict for safety is the same capability you want to preserve for value. Alignment cannot thread this needle. Access control is too coarse. Capability bounding attempts to solve it at the deployment layer.

Anthropic plans to “launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.” The strategy is explicit: use a less capable model to iterate on safety mechanisms, then apply those mechanisms to the frontier model. This is capability bounding as an iterative engineering process, not a one-time policy decision.

Is capability bounding actually a competitive moat disguised as safety?

Here is the contrarian read that the safety narrative obscures.

Claude Mythos Preview is priced at $25 per million input tokens and $125 per million output tokens. For reference, Opus 4.6 sits at $5/$25. That is a 5x premium on both input and output. The pricing alone signals that this is not charity work.

Restricted access creates scarcity. Scarcity creates premium pricing power. Premium pricing funds the infrastructure to maintain restricted access. The flywheel is self-reinforcing: build the most capable model in a high-stakes domain, demonstrate the capabilities loudly (thousands of zero-days, 27-year-old bugs, 83.1% exploit rate), restrict access to vetted organizations, charge a premium justified by the capability gap, then use the safety argument to prevent competitors from undercutting through open release.

OpenAI noticed. After GPT-5.3 Codex demonstrated similar (though less dramatic) cybersecurity capabilities, they launched the “Trusted Access for Cyber” pilot program – the same restricted-access pattern. GPT-5.4 is the first OpenAI model with cybersecurity mitigations built in, rated “High” under their Preparedness Framework.

The safety rationale is genuine. Mythos found vulnerabilities that survived 27 years of expert human review and millions of automated scans. Releasing that capability to anyone with an API key would be reckless. But the business rationale is equally real. As Zvi Mowshowitz observed, restricting access to a company that acted responsibly “while other labs may obtain similar capabilities risks pushing capability advantages toward less safety-conscious actors.” The responsible actor gets the moat. The irresponsible actor gets the market.

Simon Willison put it more simply: “I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.” He is right. But trade-offs that happen to generate premium revenue tend to stick around longer than trade-offs that do not.

What precedents exist for capability bounding in AI?

Capability bounding is not entirely new, but Mythos is the first implementation at this scale with this level of public architecture.

GPT-4 staged release (2023): OpenAI released GPT-4 without full details of training data, architecture, or compute. This was information bounding, not capability bounding – the model itself was generally available.

DALL-E content restrictions (2022-present): Image generation models with filters preventing certain content categories. This is closer to alignment (the model is modified) plus access control (content filters at the API layer) than capability bounding.

Nuclear technology export controls: The closest physical analogy. Centrifuge technology is not classified or destroyed. It is bounded: restricted to approved nations, monitored by the IAEA, subject to inspection regimes. The capability exists. The deployment envelope is constrained. The enforcement relies on layered mechanisms (treaties, inspections, sanctions) rather than any single control.

Pharmaceutical scheduling: Fentanyl is not banned from production. It is capability-bounded: manufactured by approved facilities, distributed through controlled channels, prescribed by licensed practitioners, monitored through prescription databases. The capability (pain relief) and the risk (addiction, death) are the same molecule.

Same logic every time: when capability and risk are inseparable, you bound the deployment rather than destroying the capability.

What does capability bounding mean for security teams?

If you run a security program, capability bounding changes your planning in three ways.

First, vulnerability discovery velocity is now asymmetric. Mythos can find vulnerabilities in hours that took human researchers weeks. If your organization is a Glasswing partner, you have access to this velocity for defense. If you are not, attackers may gain similar capabilities before you do. The question is not whether AI-discovered zero-days will affect your infrastructure. It is whether you will find them first. For a framework to evaluate this, see Building an AI security program: from ad-hoc to governance.

Second, your threat model needs a new actor. Autonomous AI vulnerability discovery is no longer theoretical. Mythos Preview found a 17-year-old RCE in FreeBSD with no human involvement after the initial request. Your threat model should now include “AI-automated vulnerability discovery and exploitation” as a distinct capability tier, separate from human-assisted and fully manual. For a structured approach to this, see AI security assessments: evaluating systems you did not build.

Third, patching cadence assumptions are obsolete. If an AI model can find thousands of zero-days in weeks, the assumption that you have months between vulnerability discovery and exploitation collapses. The 90-day responsible disclosure window that Glasswing enforces is optimistic. Defenders need to plan for compressed timelines.

What happens when capability bounding fails?

Capability bounding is not permanent. Three failure modes are predictable.

Capability diffusion. The exploit generation capabilities emerged in Mythos “as a downstream consequence of general improvements in code, reasoning, and autonomy.” They were not explicitly trained. This means any sufficiently capable frontier model may develop similar capabilities as a side effect of general improvement. Anthropic cannot bound capabilities that emerge in competitor models.

Enforcement decay. Contracts work until someone breaks them. A partner organization could be compromised. An employee could exfiltrate techniques. The 90-day reporting window creates a known delay that sophisticated adversaries could exploit. Cryptographic hashing proves what was known when, but does not prevent the knowledge from spreading.

Economic pressure. The $100M in credits is a subsidy. At $25/$125 per million tokens, sustained scanning of large codebases is expensive. When the credits run out, partner organizations face a choice: pay premium rates for defensive scanning, or reduce coverage. Economic pressure will eventually thin the defensive network that capability bounding created.
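The economics are easy to sanity-check. The token counts below are my assumptions for illustration; only the $25/$125 per-million-token rates come from the published pricing:

```python
# Back-of-envelope cost of sustained defensive scanning at Mythos pricing.
# Token counts per scan are assumed, not published figures.

INPUT_RATE = 25 / 1_000_000     # $ per input token
OUTPUT_RATE = 125 / 1_000_000   # $ per output token

def scan_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# assume one deep pass over a large codebase: 500M tokens in, 20M tokens out
per_pass = scan_cost(500_000_000, 20_000_000)
print(f"${per_pass:,.0f} per pass")               # $15,000 per pass
print(f"${per_pass * 52:,.0f} per year, weekly")  # $780,000 per year, weekly
```

Under those assumptions, a weekly deep scan of one large codebase runs to roughly $780K a year at list price. $100M in credits covers a lot of scanning across 50+ partners, but not indefinitely.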

None of these failure modes hit tomorrow. But they are structural, and they compound. Capability bounding buys time, not permanence. The question is whether the time it buys is enough to develop better mechanisms – which is exactly what Anthropic says it is doing with the upcoming Claude Opus safeguard iteration.

Where does this leave the industry?

Capability bounding is now an established pattern. Anthropic built it. OpenAI copied it. The next frontier model with dual-use capabilities will face the same architectural decision: align it (and lose the capability), release it (and accept the risk), or bound it (and capture the premium).

From a safety perspective, this is pragmatic in a way that alignment research is not. You do not need to solve the alignment problem (still unsolved) or trust access control alone (insufficient). You build a deployment architecture that preserves capability for legitimate use while making misuse expensive and traceable.

From a business perspective, pay attention. Restricted access to the most capable model in a high-stakes domain, premium pricing, and a safety narrative that shields you from competitive pressure to open up? That is a business model, not just a safety policy. It works because the safety argument is real – but the revenue is real too.

The honest assessment: capability bounding works today because frontier cybersecurity capabilities exist at only two or three labs. When those capabilities diffuse – and they will, because they emerge from general capability improvement – the deployment constraints that define capability bounding will need to evolve into something more durable. What that looks like is the open question.


Key takeaways

  • Capability bounding is distinct from alignment and access control. It operates at the deployment layer, preserving full model capability while constraining its practical use through partner vetting, contractual obligations, technical monitoring, and behavioral steering.
  • Mythos Preview’s findings are real and unprecedented. Thousands of zero-days across every major OS and browser, 83.1% first-attempt exploit rate, autonomously exploiting vulnerabilities that survived decades of human review.
  • Project Glasswing is the first public implementation of capability bounding at the frontier: $100M in credits, 50+ partner organizations, defensive-only use, 90-day reporting requirements.
  • The safety argument and the business argument are the same argument. Restricted access creates both responsible deployment and premium pricing power.
  • Capability bounding buys time, not permanence. As capabilities diffuse to other labs, the enforcement mechanisms must evolve.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch