For twelve days, Claude Opus 4.8 was the best AI model in the world. Released May 28, 2026, it dethroned GPT-5.5 on the Artificial Analysis Intelligence Index and led on coding benchmarks by wide margins. Then on June 9, Anthropic released Claude Fable 5, its first publicly available Mythos-class model, and Opus 4.8 became the former king overnight. But the story isn't just "new model beats old model." Fable 5 represents a fundamental shift in Anthropic's philosophy: from calibrated, cautious honesty to raw, knowledge-driven power. And it costs twice as much. Here's the complete data-driven comparison: 30 benchmarks, pricing analysis, the honesty tradeoff, and when Opus 4.8 is still the smarter choice. Both models are available on CodingFleet.

TL;DR: Claude Fable 5 leads on nearly every benchmark where both models have published scores: SWE-bench Pro (+11.1), Terminal-Bench 2.1 (+5.3), AA-Omniscience Index (+12.6), GDPval-AA Elo (+42). But it costs 2× more ($10/$50 vs $5/$25 per 1M tokens) and represents a strategic shift: Fable 5 prioritizes raw accuracy (61% on AA-Omniscience) over calibrated honesty (it hallucinates more than Opus 4.8, whose rate is 35.9%). Fable 5 also includes safety classifiers that fall back to Opus 4.8 on ~5–9% of queries. For unattended coding agents where silent errors cost money, Opus 4.8's calibrated caution may still be safer.

Want to test both models side by side? Claude Fable 5 and Opus 4.8 are available on CodingFleet, where you can compare them on your own code.

Release Context: 12 Days That Reshaped Anthropic's Lineup

Detail | Claude Fable 5 | Claude Opus 4.8
Release Date | June 9, 2026 | May 28, 2026
Model Class | Mythos (above Opus) | Opus (flagship)
Days as #1 | Current | 12 days (May 28 – June 9)
Predecessor | Claude Mythos 5 (restricted) | Claude Opus 4.7 (April 16)
AA Intelligence Index | 64.9 (#1) | 61.4 (#2)

Artificial Analysis notes Fable 5's AA-Omniscience accuracy suggests it could be the largest public Anthropic model to date.

Opus 4.8 had the shortest reign of any Claude flagship. It brought Dynamic Workflows (hundreds of parallel subagents), a 4× honesty improvement, and the #1 Intelligence Index position. In any other month, it would have been the story of the summer. Instead, it became the bridge between the Opus era and the Mythos era.

Fable 5 is the same underlying model as Claude Mythos 5 — Anthropic's restricted model available only to US government partners through Project Glasswing. Fable 5 adds safety classifiers for cybersecurity, biology, chemistry, and distillation topics, routing flagged queries to Opus 4.8. Anthropic says this fallback occurs in fewer than 5% of sessions; Artificial Analysis measured it at ~8–9% across their benchmark suite.

Head-to-Head Benchmark Comparison

Below is a side-by-side comparison of published scores as of June 11, 2026. The Winner column names the leader. "—" means no published score.

Agentic Coding Benchmarks

Benchmark | Claude Fable 5 | Claude Opus 4.8 | Winner
SWE-bench Verified | 95.0% (Vals.ai); 93.9% (Anthropic) | 88.6% | Fable 5 (+6.4)
SWE-bench Pro | 80.3% | 69.2% | Fable 5 (+11.1)
SWE-bench Multilingual | ~87% | 84.4% | Fable 5 (~+2.6)
Terminal-Bench 2.1 | 88.0% | 82.7% | Fable 5 (+5.3)
FrontierCode Diamond | 29.3% | 13.4% | Fable 5 (+15.9)
CursorBench 3.1 | 72.9% | — | Fable 5

Browser, Tools & Computer Use

Benchmark | Claude Fable 5 | Claude Opus 4.8 | Winner
OSWorld-Verified | 85.0% | 83.4% | Fable 5 (+1.6)
MCP Atlas | — | 82.2% | Opus 4.8 (only score)
Online-Mind2Web | — | 84% | Opus 4.8 (only score)
BrowseComp (multi-agent) | — | 88.5% | Opus 4.8 (only score)
Toolathlon | — | 59.9% | Opus 4.8 (only score)

Knowledge Work & Professional Benchmarks

Benchmark | Claude Fable 5 | Claude Opus 4.8 | Winner
GDPval-AA (Elo) | 1932 | 1890 | Fable 5 (+42)
GDP.pdf (no tools) | 29.8% | 22.5% | Fable 5 (+7.3)
Finance Agent v2 | 56.3% | 53.9% | Fable 5 (+2.4)
AutomationBench | 17.4% | 15.5% | Fable 5 (+1.9)
HealthBench Professional | 66.0% | 56.9% | Fable 5 (+9.1)

Reasoning & Academic

Benchmark | Claude Fable 5 | Claude Opus 4.8 | Winner
GPQA Diamond | ~93% (estimated, tied) | 93.6% | Tie
HLE (no tools) | 59.0% | 49.8% | Fable 5 (+9.2)
HLE (with tools) | 64.5% | 57.9% | Fable 5 (+6.6)
Blueprint-Bench 2 (spatial reasoning) | 38.6% | 14.5% | Fable 5 (+24.1)
USAMO 2026 | — | 96.7% | Opus 4.8 (only score)
ARC-AGI-2 (High) | — | 72.1% | Opus 4.8 (only score)

Intelligence & Honesty

Benchmark | Claude Fable 5 | Claude Opus 4.8 | Winner
AA Intelligence Index | 64.9 (#1) | 61.4 (#2) | Fable 5 (+3.5)
AA-Omniscience Index | 40.0 (#1) | 27.4 (#3) | Fable 5 (+12.6)
AA-Omniscience Accuracy | 61.0% (#1) | 46.6% | Fable 5 (+14.4pp)
AA-Omniscience Hallucination | Higher than 35.9%* | 35.9% | Opus 4.8 (more honest)
GraphWalks BFS 1M | 80.0% | 68.1% | Fable 5 (+11.9)
Legal Agent Benchmark | 13.3% | 10.4% | Fable 5 (+2.9)

*AA states Fable 5's Omniscience is "driven by leading accuracy, rather than low hallucinations." Exact Fable 5 hallucination rate unpublished. Fallback to Opus 4.8 on 9% of Omniscience questions.

Biology & Health

Benchmark | Claude Fable 5 | Claude Opus 4.8 | Winner
BioMysteryBench (hard) | 46.1% | 40.0% | Fable 5 (+6.1)
BioMysteryBench (human solved) | 83.9% | 80.4% | Fable 5 (+3.5)

Cybersecurity

Benchmark | Claude Fable 5 | Claude Opus 4.8 | Winner
ExploitBench | 78.0% (Mythos 5; Fable 5 falls back on cyber) | 40.0% | Fable 5 (+38.0)*
CyberGym | (falls back to Opus 4.8) | 78.8% | Opus 4.8 (effective)

*Fable 5's safety classifiers reroute cybersecurity queries to Opus 4.8 — effective performance is closer to Opus 4.8 levels for most users.

[Figure: Claude Fable 5 vs Opus 4.8 benchmark comparison bar chart]

What Each Benchmark Actually Means

SWE-bench Pro — The 11.1-Point Leap

SWE-bench Pro tests real GitHub issue resolution: reading codebases, making multi-file changes, passing test suites. Fable 5's 80.3% vs Opus 4.8's 69.2% is an 11.1-point gap — larger than Opus 4.8's lead over GPT-5.5 (10.6 points). On SWE-bench Verified, Fable 5 hits 95.0% per Vals.ai independent testing — approaching the benchmark ceiling. The gap widens on harder tasks: Fable 5 scores 93%+ on 1–4 hour tasks while competing models drop sharply.

FrontierCode Diamond — The 15.9-Point Explosion

FrontierCode Diamond tests production-quality code for complex, novel problems. Fable 5: 29.3%. Opus 4.8: 13.4%. GPT-5.5: 5.7%. This is the largest gap among the coding benchmarks both models share. Fable 5 more than doubles Opus 4.8's score on original problem decomposition, the kind of coding that has no Stack Overflow answer.

Terminal-Bench — The CLI Takeover

Fable 5: 88.0%. Opus 4.8: 82.7%. A 5.3-point gap. Fable 5 is now the #1 model on Terminal-Bench 2.1, dethroning GPT-5.5 (83.4%). For DevOps automation and CLI agent coding, Fable 5 is the best model from any provider.

AA-Omniscience — The Honesty Tradeoff

This is where the comparison gets philosophically interesting. Fable 5 scores 40 (#1). Opus 4.8 scores 27.4 (#3). But the drivers are completely different:

  • Fable 5: 61% accuracy (highest ever). It knows more than any model, but it hallucinates more than Opus 4.8: when Fable 5 doesn't know, it's more likely to fabricate.
  • Opus 4.8: 46.6% accuracy, so it knows less. But its hallucination rate is 35.9%: when it doesn't know, it is more likely to say so than to guess.

Fable 5 knows more but lies more. Opus 4.8 knows less but is more calibrated about its uncertainty. For unattended agents where silent errors cost money, Opus 4.8 is safer. For research where you'll verify output, Fable 5 is transformative.
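The relationship between these numbers can be sketched in a few lines. The sketch rests on an assumption that is not stated in the figures above: that the Omniscience Index is the share of correct answers minus the share of incorrect answers, and that the hallucination rate is the share of incorrect answers among all responses that were not correct. That assumption does reproduce Opus 4.8's published 27.4 from its 46.6% accuracy and 35.9% hallucination rate. The function names are illustrative.

```python
def omniscience_index(accuracy_pct: float, hallucination_pct: float) -> float:
    """Index = correct share minus incorrect share (assumed definition)."""
    not_correct = 100.0 - accuracy_pct
    incorrect = not_correct * hallucination_pct / 100.0
    return accuracy_pct - incorrect


def implied_hallucination_pct(accuracy_pct: float, index: float) -> float:
    """Invert the assumed definition to back out a hallucination rate."""
    incorrect = accuracy_pct - index
    return 100.0 * incorrect / (100.0 - accuracy_pct)


# Opus 4.8: published accuracy and hallucination rate.
opus_index = omniscience_index(46.6, 35.9)

# Fable 5: published accuracy and index; the hallucination rate is unpublished.
fable_implied = implied_hallucination_pct(61.0, 40.0)

print(f"Opus 4.8 index under the assumed formula: {opus_index:.1f}")
print(f"Fable 5 implied hallucination rate: {fable_implied:.1f}%")
```

Under this assumption the Opus 4.8 index comes out at 27.4 and Fable 5's implied rate lands near 54%, which is consistent with "higher than 35.9%". Treat it as a back-of-envelope figure, not a published one: it also folds in the questions where Fable 5 fell back to Opus 4.8.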

GraphWalks — Long-Context Retrieval

Fable 5: 80.0% at BFS 1M. Opus 4.8: 68.1%. An 11.9-point gap. For reasoning over entire codebases, Fable 5 retrieves significantly more reliably at the upper end of the context window.

HLE — Fable 5 Takes the Lead

Fable 5: 59.0% (no tools), 64.5% (with tools). Opus 4.8: 49.8%, 57.9%. Gaps of 9.2 and 6.6 points respectively. Fable 5 leads decisively on Humanity's Last Exam, the hardest academic benchmark. Note: Artificial Analysis independently measured Fable 5 at 53.0% with tools, lower than Anthropic's 64.5%. Part of the difference is that AA's measurement includes the 9% of HLE tasks where Fable 5's safety classifiers triggered fallback to Opus 4.8, while Anthropic's system card score represents Fable 5's native capability without fallback.
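A rough way to gauge how much the fallback could move the HLE score is a simple mixture estimate. The sketch below assumes that fallback tasks are no harder than average and that scores combine as a weighted average. Both are simplifying assumptions, and `blended_score` is an illustrative name.

```python
def blended_score(native_pct: float, fallback_pct: float, fallback_rate: float) -> float:
    """Weighted average of native and fallback scores (simple mixture assumption)."""
    return (1.0 - fallback_rate) * native_pct + fallback_rate * fallback_pct


# HLE (with tools) scores from above; 9% fallback rate as measured by AA.
with_opus_fallback = blended_score(64.5, 57.9, 0.09)

# Pessimistic bound: every task that triggered fallback is scored as wrong.
worst_case = blended_score(64.5, 0.0, 0.09)

print(f"Blend with Opus 4.8 on fallback tasks: {with_opus_fallback:.1f}%")
print(f"Blend if all fallback tasks fail:      {worst_case:.1f}%")
```

Both estimates (about 63.9% and 58.7%) sit above AA's 53.0%, so under this simple model the fallback accounts for only part of the gap. Differences in tooling or evaluation setup may explain the rest.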

Blueprint-Bench 2 — The 24.1-Point Spatial Explosion

Fable 5: 38.6%. Opus 4.8: 14.5%. This is the largest relative gap on any shared benchmark: a 24.1-point difference, or about 2.7× Opus 4.8's score. Blueprint-Bench 2 tests spatial reasoning: understanding blueprints, floor plans, and 3D spatial relationships. Anthropic reports this represents a "nearly 3× improvement" in spatial reasoning over Opus 4.8. For architecture, engineering, and any domain requiring spatial intelligence, Fable 5 is in a different league.
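Point gaps and relative gaps can rank benchmarks differently, so it helps to compute both. The sketch below does this for the benchmarks discussed in this section, with scores copied from the tables above. It covers only that subset, and the helper name `gaps` is illustrative.

```python
# (Fable 5, Opus 4.8) scores in percent, copied from the tables above.
SCORES = {
    "SWE-bench Pro": (80.3, 69.2),
    "Terminal-Bench 2.1": (88.0, 82.7),
    "FrontierCode Diamond": (29.3, 13.4),
    "HLE (no tools)": (59.0, 49.8),
    "Blueprint-Bench 2": (38.6, 14.5),
    "GraphWalks BFS 1M": (80.0, 68.1),
}


def gaps(scores):
    """Return {benchmark: (point_gap, ratio)} for each pair of scores."""
    return {
        name: (fable - opus, fable / opus)
        for name, (fable, opus) in scores.items()
    }


results = gaps(SCORES)
largest_by_points = max(results, key=lambda name: results[name][0])
largest_by_ratio = max(results, key=lambda name: results[name][1])

# Print benchmarks from largest to smallest relative gap.
for name, (points, ratio) in sorted(results.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:22s} +{points:4.1f} pts  {ratio:.2f}x")
```

In this subset Blueprint-Bench 2 leads on both measures (24.1 points, about 2.66×), with FrontierCode Diamond second on relative gap at about 2.19×.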

Legal Agent Benchmark

Opus 4.8 was the first model to break 10% on the Legal Agent Benchmark (10.4%). Fable 5 extends the lead to 13.3%, a 2.9-point gain. Both models far outpace GPT-5.5 (2.1%) and Gemini 3.1 Pro (0.0%). For legal coding and compliance workflows, either Claude model is the clear choice, but Fable 5 is measurably better.

[Figure: Performance radar, Fable 5 vs Opus 4.8 across 8 dimensions]

Pricing: The 2× Decision

Detail | Claude Fable 5 | Claude Opus 4.8
Input (per 1M tokens) | $10.00 | $5.00
Output (per 1M tokens) | $50.00 | $25.00
Cache write (per 1M tokens) | $12.50 | $6.25
Cache read (per 1M tokens) | $1.00 | $0.50
Context window | 1M input | 1M input / 128K output
Batch/Flex discount | ✅ 50% off | ✅ 50% off
Subscription access | Included until June 22, then credits | Included in all paid plans
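To see what the 2× difference means for a concrete job, the prices above can be plugged into a small estimator. This is a minimal sketch: the model keys, the `request_cost` helper, and the 200K-input / 20K-output workload are hypothetical, and cache pricing is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Prices from the table above; the dictionary keys are illustrative labels.
PRICES = {
    "fable-5": Pricing(10.00, 50.00),
    "opus-4.8": Pricing(5.00, 25.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 batch: bool = False) -> float:
    """Estimate USD cost of one request; batch applies the 50% discount."""
    p = PRICES[model]
    cost = (input_tokens / 1e6) * p.input_per_m + (output_tokens / 1e6) * p.output_per_m
    return cost * 0.5 if batch else cost


# Hypothetical task: 200K input tokens, 20K output tokens.
fable_cost = request_cost("fable-5", 200_000, 20_000)
opus_cost = request_cost("opus-4.8", 200_000, 20_000)
opus_batch_cost = request_cost("opus-4.8", 200_000, 20_000, batch=True)

print(f"Fable 5:          ${fable_cost:.2f}")
print(f"Opus 4.8:         ${opus_cost:.2f}")
print(f"Opus 4.8 (batch): ${opus_batch_cost:.2f}")
```

For this hypothetical task the estimate is $3.00 on Fable 5, $1.50 on Opus 4.8, and $0.75 on Opus 4.8 with the batch discount.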

Cost per SWE-bench Pro point (output price per 1M tokens divided by SWE-bench Pro score): Fable 5: $0.62/point. Opus 4.8: $0.36/point. Fable 5 gives you 16% more Pro performance for 100% more cost. For teams fixing production bugs where an incorrect fix costs thousands, the premium is trivial. For high-volume generation, Opus 4.8 is better value.
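The per-point figures can be reproduced with a short calculation. The sketch assumes the metric is the output price per 1M tokens divided by the SWE-bench Pro score, which matches the quoted $0.62 and $0.36. The variable and function names are illustrative.

```python
# Output price (USD per 1M tokens) and SWE-bench Pro score (%) from the tables above.
OUTPUT_PRICE_PER_M = {"Fable 5": 50.00, "Opus 4.8": 25.00}
SWE_BENCH_PRO = {"Fable 5": 80.3, "Opus 4.8": 69.2}


def cost_per_point(model: str) -> float:
    """Output price per 1M tokens divided by SWE-bench Pro score (assumed metric)."""
    return OUTPUT_PRICE_PER_M[model] / SWE_BENCH_PRO[model]


# Relative improvement in score and in price when moving from Opus 4.8 to Fable 5.
performance_gain = SWE_BENCH_PRO["Fable 5"] / SWE_BENCH_PRO["Opus 4.8"] - 1.0
price_increase = OUTPUT_PRICE_PER_M["Fable 5"] / OUTPUT_PRICE_PER_M["Opus 4.8"] - 1.0

for model in OUTPUT_PRICE_PER_M:
    print(f"{model}: ${cost_per_point(model):.2f} per point")
print(f"Performance gain: {performance_gain:.0%}, price increase: {price_increase:.0%}")
```

The same pattern works for any other benchmark: swap in a different score table to see how the value ranking shifts.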

[Figure: API pricing comparison, Fable 5 vs Opus 4.8]
[Figure: Cost per SWE-bench Pro point]

The Safety Architecture: Fable 5's Hidden Opus 4.8

Fable 5's most unusual feature is invisible to most users. Safety classifiers automatically reroute queries on cybersecurity, biology, chemistry, and AI distillation topics to Opus 4.8. When this happens, latency and cost profiles change mid-request. Users report classifiers being "trigger-happy" — a medical physicist wrote: "I genuinely can't use Fable. I'm a medical physicist. I use the word nuclear a lot." If your work touches these domains, using Opus 4.8 directly may be more predictable — and cheaper.

Unique Features: What Each Model Brings

Fable 5 Advantages

  • Mythos-class intelligence: Possibly the largest public Anthropic model to date, per Artificial Analysis. Highest accuracy on AA-Omniscience (61%).
  • Token efficiency on complex tasks: Frontier physics research requires 3× fewer tokens vs GPT-5.5.
  • Stripe's 50M-line migration in a day: Estimated at 2+ months for a human team.
  • #1 on Terminal-Bench, SWE-bench Pro, FrontierCode Diamond, AA-Omniscience, GraphWalks.

Opus 4.8 Advantages

  • 35.9% hallucination rate: Best calibration of any flagship. When it doesn't know, it says so.
  • Dynamic Workflows in Claude Code: Hundreds of parallel subagents with adversarial verification.
  • Fast Mode: 2.5× speed at $10/$50. Fable 5 has no Fast mode equivalent.
  • 96.7% USAMO 2026: Highest published math score of any model.
  • No safety fallbacks: You always get Opus 4.8 — not a downgraded model on certain topics.
  • Stable subscription access: Included in all plans with no credit expiration looming.
  • Half the cost: $5/$25 vs $10/$50 per 1M tokens. Better value for high-volume use.

Which Model Should You Use?

Use Case | Better Model | Why
Complex bug fixing (multi-file, 4+ files) | Fable 5 | 80.3% Pro vs 69.2%. Gap widens on harder tasks.
CLI / DevOps automation | Fable 5 | 88.0% Terminal-Bench. #1 overall.
Codebase-scale migrations | Fable 5 | FrontierCode 29.3%. Stripe 50M-line proof.
Knowledge work (research, finance, legal) | Fable 5 | 1932 GDPval-AA. 61% accuracy. 74% win rate vs Opus.
Long-context reasoning (500K–1M) | Fable 5 | 80.0% GraphWalks vs 68.1%.
Computer-use / browser agents | Fable 5 | 85.0% OSWorld vs 83.4%.
Unattended agents (no human review) | Opus 4.8 | 35.9% hallucination. Less likely to silently ship broken code.
Scientific / academic reasoning | Fable 5 | 59.0% HLE (no tools) vs 49.8%. 64.5% with tools. Fable leads decisively.
Cybersecurity work | Opus 4.8 | Fable 5 reroutes cyber queries to Opus 4.8 anyway.
Cost-sensitive high-volume production | Opus 4.8 | $25 vs $50/1M output. 2× cheaper.
Biology / chemistry / medical research | Opus 4.8 | Fable 5 classifiers flag these domains. Despite higher BioMysteryBench scores, fallback risk makes Opus more predictable.
Math competition / theorem proving | Opus 4.8 | 96.7% USAMO. Fable 5 score not published.
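For teams that route work between the two models programmatically, the table above can be encoded as a plain lookup. This is a minimal sketch: the shortened keys and the `recommend` helper are illustrative, and the values are display names, not API model identifiers.

```python
# Recommendations transcribed from the table above; keys are shortened labels.
RECOMMENDATIONS = {
    "complex bug fixing": "Fable 5",
    "cli / devops automation": "Fable 5",
    "codebase-scale migrations": "Fable 5",
    "knowledge work": "Fable 5",
    "long-context reasoning": "Fable 5",
    "computer-use / browser agents": "Fable 5",
    "unattended agents": "Opus 4.8",
    "scientific / academic reasoning": "Fable 5",
    "cybersecurity work": "Opus 4.8",
    "cost-sensitive high-volume production": "Opus 4.8",
    "biology / chemistry / medical research": "Opus 4.8",
    "math competition / theorem proving": "Opus 4.8",
}


def recommend(use_case: str) -> str:
    """Return the recommended model for a use case, ignoring case and padding."""
    key = use_case.strip().lower()
    if key not in RECOMMENDATIONS:
        raise KeyError(f"No recommendation recorded for {use_case!r}")
    return RECOMMENDATIONS[key]


print(recommend("Cybersecurity work"))
print(recommend("CLI / DevOps automation"))
```

Raising on unknown use cases, instead of silently defaulting to one model, keeps the routing decision explicit for anything the table does not cover.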

Conclusion: Two Kings, Different Thrones

Claude Fable 5 is the most capable AI model Anthropic has ever made publicly available. The 11.1-point SWE-bench Pro gap, the 15.9-point FrontierCode gap, and the 12.6-point AA-Omniscience gap are generational — not marginal. For complex engineering work, codebase migrations, CLI automation, and knowledge-heavy tasks, Fable 5 is unmatched.

But calling it a pure upgrade misses the point. These two models represent different philosophies within the same company:

  • Fable 5 bets on raw capability — it knows more, codes better, reasons deeper. But hallucinates more, costs 2× more, and sometimes silently downgrades to Opus 4.8.
  • Opus 4.8 bets on calibrated trust — it knows less but lies less (35.9% hallucination). Costs half. Has no safety fallbacks. Holds the published math crown (96.7% USAMO). Better for unattended agents and cost-sensitive pipelines.

For most developers, the practical answer is both. Use Fable 5 for complex engineering where you'll verify output. Use Opus 4.8 for unattended agents, cost-sensitive pipelines, cybersecurity, and biology, chemistry, or medical research. Both are available side by side on CodingFleet, so you don't have to choose one forever.

And if you can only pick one? Fable 5 is the better model on nearly every benchmark both models share. But the question isn't just "which is better." It's "which is better for your specific work, and which tradeoffs can you live with?" For the first time, Anthropic is asking users to make that call between two very different versions of excellence.

Sources: Anthropic — Fable 5 & Mythos 5 System Card | Anthropic — Opus 4.8 System Card | Artificial Analysis — Fable 5 Intelligence Index | AA-Omniscience Leaderboard | Vals.ai — SWE-bench Verified | Vellum — Opus 4.8 Benchmarks | Finout — Fable 5 Pricing. All benchmark scores vendor-reported unless otherwise noted. Pricing as of June 11, 2026.