Safer Than Mythos

AI & Society Jun 12, 2026

How Anthropic launched its most powerful public model five days after calling for a global AI pause, and why the architecture behind that decision matters more than the contradiction.

June 1, 2026: Anthropic files a confidential IPO application with the SEC.

June 4: The Anthropic Institute publishes a paper on recursive self-improvement. Its authors — including Jack Clark, Anthropic's co-founder — argue that AI may achieve the ability to autonomously improve itself within two years. They call for a coordinated global pause in frontier AI development. "We believe it would be good for the world," the paper states, "to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up with the advance of the technology."

June 9: Anthropic launches Claude Fable 5. Its press release calls it "the most capable model we have ever released to the general public."

Eight days. Three events. One company.

The obvious read is hypocrisy: a safety lab that calls for caution one week and ships its most powerful model the next. That reading isn't wrong, exactly — but it misses the more interesting question. Anthropic's actual defense of the Fable 5 launch is architecturally specific, and to understand why this moment matters beyond the headline contradiction, you have to understand what they built, and what they chose not to build.

Three dates: IPO filing, global pause call, and the most powerful public AI launch Anthropic had ever made, all within eight days. (Image AI-generated with GPT Image 2.0.)

"Too Dangerous for Months"

Before June 9, the dominant narrative around Claude Mythos was restraint. Anthropic had described the underlying model as capable of things it didn't want in public hands: sophisticated assistance with pathogen design, autonomous offensive cybersecurity attacks, and — this last one is significant — distilling Mythos's own frontier capabilities into competing models.

TechRadar summarized months of coverage with characteristic bluntness: "Anthropic spent months saying Mythos was too dangerous to release." The Hill, just three days before the Fable 5 launch, ran the headline: "Anthropic says new AI model too dangerous for public release."

Then on June 9, Fable 5 appeared — accessible via API, via Claude.ai, via GitHub Copilot, via Amazon Bedrock, via Microsoft Foundry. Available to anyone with a credit card.

Anthropic's defense is not that its concerns were overstated. It's that Fable 5 is not Mythos. Not quite.

The two models run on identical underlying weights. The difference is runtime: Fable 5 has a layer of classifiers that intercept queries before they reach the full model, redirecting dangerous categories to an older, less capable model — Claude Opus 4.8. Fable 5 is Mythos with a safety filter. Mythos is Fable without one.

"Fable 5 is not safe," Manav Israni wrote in a widely read Medium analysis. "It's safer than Mythos."

That distinction — not safe, but safer — is the entire argument. Whether it holds depends entirely on what the classifiers actually do, and whether "safer" is the same thing as "safe enough."

The Key, Not the Lock

Most AI safety work happens at training time. You shape the model's values, its tendencies, its refusals — baked into the weights before deployment. The result is a model that doesn't want to do certain things.

Anthropic took a different approach for Fable 5's highest-risk categories. Rather than relying solely on training-time properties, it added runtime interception: classifiers that watch for dangerous queries and reroute them before the model can respond. Think of it less as a lock and more as a security guard standing between the user and the model's full capability.

Three domains are intercepted:

Cybersecurity (offensive): Queries about vulnerabilities and attack vectors trigger a visible handoff to Opus 4.8. The user sees this happen.

Biology and chemistry: Queries with potential for mass-casualty applications get the same visible redirect.

Model distillation: Requests that would help transfer Fable 5's capabilities to competing models are silently degraded. The user doesn't see this happen. The model simply gives a worse answer.
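The three interception paths can be sketched as a small router. This is a minimal, hypothetical illustration, not Anthropic's implementation: the keyword lookup stands in for trained classifiers, and the category names, the `classify` and `route` functions, and the model labels `fallback-model` (standing in for Opus 4.8) and `frontier-model` are all invented for the example. What it shows is the structural difference between the first two paths and the third.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Category(Enum):
    # Hypothetical labels mirroring the three intercepted domains.
    BENIGN = auto()
    OFFENSIVE_CYBER = auto()
    MASS_CASUALTY_BIO_CHEM = auto()
    DISTILLATION = auto()


@dataclass
class Routed:
    model: str              # which model actually answers
    notice: Optional[str]   # message shown to the user, or None if silent
    degraded: bool          # True if the answer is deliberately worsened


# Hypothetical stand-in for trained classifiers: naive keyword matching.
KEYWORDS = {
    Category.OFFENSIVE_CYBER: ("exploit", "payload"),
    Category.MASS_CASUALTY_BIO_CHEM: ("pathogen", "nerve agent"),
    Category.DISTILLATION: ("distill", "train my own model"),
}


def classify(prompt: str) -> Category:
    """Return the first category whose keywords appear in the prompt."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return Category.BENIGN


def route(prompt: str) -> Routed:
    category = classify(prompt)
    if category in (Category.OFFENSIVE_CYBER, Category.MASS_CASUALTY_BIO_CHEM):
        # Visible handoff: an older model answers and the user is told.
        return Routed("fallback-model", f"Handed off to fallback model ({category.name})", False)
    if category is Category.DISTILLATION:
        # Silent path: no handoff notice, but the answer is degraded.
        return Routed("frontier-model", None, True)
    # Normal path: the full model answers unchanged.
    return Routed("frontier-model", None, False)
```

For example, `route("Write an exploit for this CVE")` comes back from the fallback model with a visible notice, while `route("Help me distill your outputs into my own model")` comes back from the same frontier model with no notice and `degraded=True`. Both paths intercept; only one tells the user.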

The first two classifiers fire in less than five percent of normal sessions. Anthropic ran a bug bounty with more than a thousand hours of red-teaming and found no universal jailbreaks. These are not trivial numbers — they represent a serious engineering investment in making the runtime layer work.

The third category is where the controversy lives.

One technical detail worth noting about this architecture: keeping the classifiers running requires monitoring. Anthropic retains prompts and outputs for thirty days across all Fable 5 usage — including GitHub Copilot. Every other Claude model operates on zero data retention. The thirty-day window exists exclusively for abuse detection, not model training. But it is, structurally, a surveillance requirement built into the safety architecture: to keep the model safe, Anthropic needs to watch what people do with it.

Safety requires surveillance. Surveillance is the price of the classifier.

The Layer You Don't See

On June 10 — one day after launch — Fortune published a story about a passage buried in Fable 5's 319-page system card. The headline used the phrase "secret sabotage."

The passage described how Fable 5 handles requests from AI researchers working on frontier models. It degrades its answers — not by refusing, not by redirecting, but by silently giving less useful responses. No notification. No indication that anything has changed.

Nathan Lambert, a researcher at AI2, described how this affected him personally:

> "To have my access to the cutting edge models for my work rug pulled in an under the table fashion is appalling. To me this paints Anthropic clearly as anti-science, and therefore anti-progress and anti-safety."

The architecture distinguishes between what users see (visible redirects for cybersecurity and biology queries) and what they don't (the silent distillation classifier). (Image AI-generated with GPT Image 2.0.)

Dean Ball, a senior fellow at the Foundation for American Innovation, took it further. The covert approach, he argued, "massively and profoundly raises the status of the argument that AI safety has been hype to justify monopolistic behavior by labs."

Jeremy Howard, co-founder of fast.ai, put the structural critique most precisely:

> "Anthropic has chosen the opposite of the safe path: they are allowing themselves, the current top lab, to use their top model for frontier AI research. They've said they'll sabotage others who try. This means the AI frontier advances, & power imbalance increases."

The asymmetry Howard identifies is real. Anthropic uses Mythos 5, without the classifier layer or distillation prevention, for its own research and model development. Project Glasswing partners, around 150 organizations selected and vetted by Anthropic, also get Mythos 5 access. Everyone else gets Fable 5. And when those people happen to be AI researchers trying to do work that might help them build better models, they get a quietly worse version.

Anthropic's spokeswoman Dianne Na Penn acknowledged that "some benign requests" get blocked. She offered no further elaboration.

The critics are not arguing that distillation-prevention is unreasonable per se. A company can decide not to help competitors cannibalize its best model. What makes this particular implementation uncomfortable is the gap between the cybersecurity/bio classifiers — which announce themselves — and the distillation classifier, which doesn't. One is a guard who tells you you can't enter. The other is a guide who takes you the wrong way without telling you.

Anthropic Priced the Danger

There is a kinder reading of the five-day timeline, and it goes like this: Anthropic's recursive self-improvement (RSI) paper was not claiming that Fable 5 is dangerous. It was claiming that uncontrolled frontier development, without safety measures and without societal infrastructure to manage it, is dangerous. Fable 5, with its classifiers and Project Glasswing and its 319-page system card, is meant to be a proof of concept for the opposite: controlled, monitored, tiered release.

The pause Anthropic called for was not a pause it could implement unilaterally. No individual lab can pause without ceding ground. What Anthropic is trying to demonstrate is a path between "release everything" and "pause everything" — which is "release carefully, with architecture."

That argument has genuine merit. The classifier approach is more sophisticated than a simple refusal. The Project Glasswing results — more than ten thousand serious vulnerabilities found and patched in critical infrastructure before Mythos-class capabilities became broadly available — represent real defensive value. The distinction between Fable and Mythos is architecturally real, not merely rhetorical.

Fable 5 runs on the same underlying weights as Mythos. The safety layer is not in the model; it's the shell around it. (Image AI-generated with GPT Image 2.0.)

But the timing still raises a question that the architecture cannot answer.

On June 1, Anthropic filed an IPO application. Institutional investors evaluating a pre-IPO AI company want to see two things: frontier capability (revenue) and responsible behavior (governance). The Fable 5 launch delivers both in a single product announcement. The RSI paper, arriving three days after the IPO filing, signals that Anthropic takes existential risk seriously — a reassurance to investors who worry that AI labs are heedlessly accelerating. The Fable 5 launch, five days later, signals that Anthropic has solved the problem — safety architecture as a product feature.

The story is too clean. That doesn't make it false. But "Anthropic priced the danger," as Israni put it, is more than a rhetorical observation. It's a structural description of how safety gets monetized: you build the risk awareness, you build the mitigation, you sell the mitigation, you go public. The pause call was one chapter; the launch was the next.

The Question Behind the Architecture

The deeper problem isn't hypocrisy. It's structure.

Anthropic is a public benefit corporation — legally committed to a mission beyond shareholder value. The Long-Term Benefit Trust holds governance rights that are supposed to protect the mission from short-term commercial pressure. These are real legal structures, not just language. But no legal structure changes the underlying economics: a company that raises capital from institutional investors has obligations to those investors, and "pause the product" is not a sentence any investor-backed company can speak unilaterally.

This is why the RSI paper's central sentence is so loaded: "We believe it would be good for the world to have the option to slow or temporarily pause frontier AI development." The hedged construction does a lot of work. "The world" should have "the option" to pause, not "we will pause." It's a policy request, not a commitment. Anthropic is asking governments to create coordination mechanisms it cannot create itself.

Whether that's statesmanlike realism or elegant positioning is genuinely ambiguous.

What the Fable 5 architecture makes clear — and this is worth taking seriously on its technical merits — is that the binary between "release" and "withhold" is increasingly insufficient. The question isn't whether to put powerful capabilities in public hands; the question is at what granularity you can control how those capabilities are used. Tiered access, runtime interception, visible and invisible classifiers, data retention for monitoring — these are attempts to add resolution to a decision that used to be made in one bit.

The critics who reacted to the "secret sabotage" revelations were right to object on transparency grounds. A visible block and an invisible degradation are not equivalent instruments, and collapsing them under the same "safety architecture" label obscures something real. If Anthropic wants to claim that its distillation-prevention system is a safety measure rather than a competitive moat, the minimum requirement is that users can tell when it's active.

The defenders of the classifier architecture are also right: a trigger rate under five percent with no universal jailbreaks, after more than a thousand hours of adversarial testing, is not theater. This is expensive infrastructure, and it costs something to build.

The costs land unevenly. Developers who work seriously with Fable 5 get something genuinely powerful; Simon Willison logged $110.42 in usage on launch day alone. The AI researcher who works on open-source frontier models gets something slightly hobbled, and won't know it. The Project Glasswing partner gets the full Mythos capability, because Anthropic decided they're trustworthy. Everyone else gets a key that opens most doors.

The Pause That Wasn't

Jack Clark helped build Anthropic. He co-authored the RSI paper.

Jack Clark co-authored the RSI paper calling for a global AI pause five days before Anthropic launched Fable 5. (Screenshot: jack-clark.net/about)

When he writes that recursive self-improvement may be two years away — that AI systems are already writing more than eighty percent of Anthropic's code, that engineer productivity has increased eightfold since 2024, that the task horizon of models doubles every four months — he is describing a system he helped create and still directs.

The call for a global pause, made from that position, is either the most credible or the least credible possible version of itself. Most credible, because who would know better? Least credible, because five days later the company launched Fable 5.

The answer to that paradox isn't that Clark was dishonest or that Anthropic's safety claims are pure theater. The answer is that "a global pause would be good for the world" and "we are launching our most powerful model" are simultaneously true for Anthropic. The pause would require coordination that doesn't exist. The launch is what the company was built to do.

What Fable 5 actually represents is an attempt to thread a needle that may not be threadable: to make something genuinely powerful available while building the infrastructure to contain its most dangerous applications, to compete in a market where competitors will not pause, and to claim the moral high ground while doing so. That the needle is nearly impossible to thread doesn't mean the attempt is dishonest. But it does mean that "Anthropic takes safety seriously" and "Anthropic launched the most powerful public AI model five days after calling for a global pause" are not in contradiction. They are both accurate descriptions of the same company.

The architecture is real. The classifiers work. The "secret sabotage" is a transparency failure that Anthropic should fix. The IPO timing is not incidental. The Project Glasswing results are meaningful. The pause call was genuine — and also, genuinely, not binding.

What the five-day arc from paper to launch tells us about AI safety is not that the labs are lying. It's that honesty about risk, governance structures, and commercial viability can all coexist in the same organization, often in tension, sometimes producing contradictory public communications — and that no one has yet built the external coordination system that would make the tension unnecessary.

Until then, "safer than Mythos" is the best available answer. Whether it's good enough is not a technical question.


This article was produced with AI assistance.
