AI Coding Progress Tracker: 27 Months of Benchmark Scores (2024–2026)

🆕 Updated June 9, 2026: Claude Fable 5 launched — 93.9% SWE-bench Verified, 80.3% SWE-bench Pro, 88.0% Terminal-Bench 2.1. The first Mythos-class model available to everyone. First model to break 90% Verified. First to break 80% Pro. First to break 85% Terminal-Bench — taking the crown from GPT-5.5, which had dominated CLI coding at 83.4%. $10/$50 per 1M tokens. This is the story of the most intense period of progress in AI history. Test all models on CodingFleet.

🆕 Claude Fable 5 — The Mythos Milestone

June 9, 2026: Anthropic's first publicly available Mythos-class model, built on the same underlying model as the restricted Claude Mythos 5. It scores 93.9% on SWE-bench Verified (+5.3 over Opus 4.8), 80.3% on SWE-bench Pro (+11.1), and 88.0% on Terminal-Bench 2.1 (+4.6 over GPT-5.5, the former CLI leader). It is the first model above 90% on Verified and the first above 80% on Pro. Price: $10/$50 per 1M tokens (input/output). Queries on cyber, bio, and chemistry topics that trip the safety classifiers fall back to Opus 4.8, which affects roughly 5% of sessions.

In March 2024, Claude 3 Opus scored 33.4% on SWE-bench Verified, the best result available, at $75 per million output tokens. Twenty-seven months later, Claude Fable 5 scores 93.9%, a 60.5-point gain. Pricing, however, split three ways: OpenAI's GPT-5.5 doubled to $30, Anthropic launched Fable 5 at $50, and DeepSeek V4 Pro fell to $0.87. The most expensive frontier model now costs about 57× the cheapest ($50 vs $0.87 per million output tokens).

📊 Key Milestones

Mar 2024: Claude 3 Opus scores 33.4% on SWE-bench Verified at $75/1M output. The starting line.
Oct 2025: Claude Opus 4.5 breaks 80% for the first time (80.9%). Anthropic cuts Opus price from $75 to $25.
Feb 2026: OpenAI stops reporting Verified scores — contamination confirmed. SWE-bench Pro becomes the trusted benchmark.
Apr 2026: Claude Opus 4.7 at 87.6%. GPT-5.5 launches and dominates Terminal-Bench at 83.4% — a record that stands until Fable 5.
May 2026: DeepSeek makes 75% discount permanent: $0.87/1M. Claude Opus 4.8 at 88.6% Verified, 69.2% Pro.
🆕 Jun 9, 2026: Claude Fable 5 — first model above 90% Verified (93.9%), first above 80% Pro (80.3%), first above 85% Terminal-Bench (88.0%), dethroning GPT-5.5 as CLI king. 57× spread vs DeepSeek V4 Pro.

The SWE-bench Verified Ascent: From 33.4% to 93.9%

SWE-bench Verified — 500 real GitHub issues from 12 Python repositories — has been the industry's north star since August 2024. Claude has held the record at all 8 competitive checkpoints:


Date	Record-Setting Model	SWE-bench Verified	Significance
Mar 2024	Claude 3 Opus	33.4%	Starting line.
Jun 2024	Claude 3.5 Sonnet	49.0%	First near 50%.
Feb 2025	Claude 3.7 Sonnet	62.3%	First to break 60%.
Oct 2025	Claude Opus 4.5	80.9%	First to break 80%. Anthropic cuts the Opus price by 67%.
Feb 2026	Claude Opus 4.6	80.8%	Verified saturating. OpenAI withdraws.
Apr 2026	Claude Opus 4.7	87.6%	Up 6.8 points on Opus 4.6.
May 2026	Claude Opus 4.8	88.6%	Former record.
🆕 Jun 2026	Claude Fable 5	93.9%	First above 90%. Mythos-class.
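
The step between each checkpoint can be recomputed from the table itself. The following is a minimal sketch in Python, with the scores copied from the rows above; the names `checkpoints` and `point_gains` are illustrative and not part of any benchmark tooling.

```python
# SWE-bench Verified scores (percent), copied from the table above.
checkpoints = [
    ("Mar 2024", "Claude 3 Opus", 33.4),
    ("Jun 2024", "Claude 3.5 Sonnet", 49.0),
    ("Feb 2025", "Claude 3.7 Sonnet", 62.3),
    ("Oct 2025", "Claude Opus 4.5", 80.9),
    ("Feb 2026", "Claude Opus 4.6", 80.8),
    ("Apr 2026", "Claude Opus 4.7", 87.6),
    ("May 2026", "Claude Opus 4.8", 88.6),
    ("Jun 2026", "Claude Fable 5", 93.9),
]


def point_gains(rows):
    """Return (model, gain) pairs: each score minus the previous checkpoint's score."""
    gains = []
    for (_, _, prev), (_, model, cur) in zip(rows, rows[1:]):
        # Round to one decimal to hide floating-point noise in the subtraction.
        gains.append((model, round(cur - prev, 1)))
    return gains


gains = point_gains(checkpoints)
total = round(checkpoints[-1][2] - checkpoints[0][2], 1)

for model, gain in gains:
    print(f"{model}: {gain:+.1f} points")
print(f"Total since Mar 2024: {total:+.1f} points")
```

The total reproduces the 60.5-point gain quoted in the introduction, and the per-model steps make it easy to compare any two checkpoints.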

The Three-Lane Race: Claude vs GPT-5.5

By late 2025, a single benchmark was no longer sufficient. SWE-bench Pro launched as the contamination-resistant successor. Terminal-Bench emerged as the standard for CLI coding. GPT-5.5 dominated Terminal-Bench from April 2026 at 83.4% — until Fable 5 arrived at 88.0% and took the crown. Here's the full timeline:

In the timeline chart, Claude owns Verified (yellow) and Pro (green). GPT-5.5 (♦ markers) entered in April 2026 and immediately led Terminal-Bench at 83.4%, holding that position for 47 days until Fable 5 reached 88.0% in June. Fable 5 now leads all three lanes.

Fable 5 vs Previous Champions

Benchmark	Claude Fable 5	Previous Best	Gap
SWE-bench Pro	80.3%	Opus 4.8 (69.2%)	+11.1
Terminal-Bench 2.1	88.0%	GPT-5.5 (83.4%)	+4.6
SWE-bench Verified	93.9%	Opus 4.8 (88.6%)	+5.3
GPQA Diamond	94.5%	Gemini 3.1 Pro (94.3%)	+0.2
HLE (no tools)	56.8%	Gemini 3.1 Pro (44.4%)	+12.4
FrontierCode Diamond	29.3%	Opus 4.8 (13.4%)	+15.9
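
The Gap column is a plain difference in percentage points, so it can be checked against the two score columns. A small sketch, assuming the rows are transcribed exactly as shown above (the `rows` layout and the `gap` helper are hypothetical names used only for this check):

```python
# (benchmark, Fable 5 score, previous best model, previous best score, published gap)
rows = [
    ("SWE-bench Pro", 80.3, "Opus 4.8", 69.2, 11.1),
    ("Terminal-Bench 2.1", 88.0, "GPT-5.5", 83.4, 4.6),
    ("SWE-bench Verified", 93.9, "Opus 4.8", 88.6, 5.3),
    ("GPQA Diamond", 94.5, "Gemini 3.1 Pro", 94.3, 0.2),
    ("HLE (no tools)", 56.8, "Gemini 3.1 Pro", 44.4, 12.4),
    ("FrontierCode Diamond", 29.3, "Opus 4.8", 13.4, 15.9),
]


def gap(new_score, old_score):
    """Percentage-point difference, rounded to one decimal place."""
    return round(new_score - old_score, 1)


# Collect any row whose published gap disagrees with the recomputed one.
mismatches = [
    (name, published, gap(new, old))
    for name, new, _, old, published in rows
    if gap(new, old) != published
]

print("all gaps consistent" if not mismatches else mismatches)
```

Note that these are absolute point gaps; a 0.2-point lead on GPQA Diamond and a 15.9-point lead on FrontierCode Diamond sit on very different baselines.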

Pricing: Premium $50, Stable $30, Freefall $0.87

Three diverging strategies: OpenAI holding at $30, Anthropic launching a premium $50 Mythos tier, open-weight collapsing below $1:

Date	Highest-Priced Model	Output $/1M	Lowest-Priced (Frontier)	Output $/1M	Spread
Mar 2024	Claude 3 Opus	$75.00	—	—	—
Jan 2025	Claude Opus 4.1	$75.00	DeepSeek V3	$1.10	68×
Oct 2025	Claude Opus 4.5	$25.00	DeepSeek V3.2	$0.42	60×
May 2026	GPT-5.5	$30.00	DeepSeek V4 Pro	$0.87	34×
🆕 Jun 2026	Claude Fable 5	$50.00	DeepSeek V4 Pro	$0.87	57×
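
The Spread column is the ratio of the two prices in each row. Here is a minimal sketch of that arithmetic, assuming both price columns are output-token prices in USD per 1M tokens (as the $75 and $50 figures elsewhere in the article suggest); the helper name `spread` is illustrative.

```python
# (date, highest price, lowest frontier price) in USD per 1M output tokens,
# copied from the pricing table above. The Mar 2024 row has no low-price entry.
rows = [
    ("Jan 2025", 75.00, 1.10),
    ("Oct 2025", 25.00, 0.42),
    ("May 2026", 30.00, 0.87),
    ("Jun 2026", 50.00, 0.87),
]


def spread(highest, lowest):
    """Ratio of the highest price to the lowest, rounded to a whole multiple."""
    return round(highest / lowest)


spreads = {date: spread(high, low) for date, high, low in rows}

# The Opus price cut from $75 to $25, expressed as a percentage.
opus_cut_pct = round(100 * (75.00 - 25.00) / 75.00)

for date, multiple in spreads.items():
    print(f"{date}: {multiple}x")
print(f"Opus price cut: {opus_cut_pct}%")
```

Because the spread is a ratio, it moves when either end moves: it narrowed to 34× in May 2026 and widened again to 57× in June only because the top price rose from $30 to $50 while the low end stayed at $0.87.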

The 27 Months That Changed Everything

In March 2024, AI coding was an experiment: 33.4% Verified, $75/1M, no open-weight alternatives. In June 2026, AI coding is infrastructure. The best model scores 93.9% Verified, 80.3% Pro, 88.0% Terminal-Bench. Open-weight delivers frontier-adjacent coding at $0.87/1M. Premium Mythos-class capability costs $50/1M. GPT-5.5's 47-day Terminal-Bench reign is over. Fable 5 leads every lane.


Sources: Anthropic Fable 5 Announcement (Jun 9, 2026) | Vals.ai | SWE-bench Official | DeepSeek V4 Model Card. Fable 5/Mythos 5 share the same model. Verified score from Mythos Preview system card — Anthropic has not published a separate Fable 5 Verified score. ⚠️ Verified contaminated per OpenAI (Feb 2026).

