Claude Sonnet 5 vs Sonnet 4.6: The Biggest Sonnet Leap Ever

June 30, 2026. Anthropic retires the "4.6" suffix and ships Claude Sonnet 5: same list price ($3/$15), same 1M context, same Sonnet tier. But the benchmarks tell a different story. Terminal-Bench 2.1 jumps from 67.0% to 80.4% (+13.4). HLE with tools rises from 46.8% to 57.4% (+10.6). FrontierCode more than doubles from 15.1% to 38.8%. And on knowledge work, Sonnet 5 adds 223 Elo points, jumping from 1395 to 1618, which nominally edges Opus 4.8 (1615), though the System Card calls the two statistically tied. This is not an incremental refresh. It's the biggest generation-over-generation leap in Sonnet history. Here's every benchmark, sourced from Anthropic's Sonnet 5 System Card. Test both on CodingFleet.

TL;DR — Sonnet 5 vs Sonnet 4.6

Same price, massive gains: $3/$15 list price unchanged. Introductory $2/$10 through Aug 31, 2026.
+13.4 Terminal-Bench 2.1: 67.0% → 80.4%. The biggest jump on the agentic coding benchmarks.
+10.6 HLE (with tools): 46.8% → 57.4%. Agentic reasoning nearly matches Opus 4.8 (57.9%).
+5.1 SWE-bench Pro: 58.1% → 63.2%. Within 6 points of Opus 4.8.
+223 GDPval-AA v2: 1395 → 1618. Sonnet 5 edges Opus 4.8 (1615) on knowledge work, a statistical tie.
FrontierCode more than doubles: 15.1% → 38.8%. Structural improvement in frontier coding.
Tokenizer caveat: Same text = 1.0–1.35× more tokens vs 4.6, with English measured as high as ~1.42×. Effective cost can run 27–42% higher at scale.

Full Benchmark Comparison

Benchmark	Sonnet 5	Sonnet 4.6	Gain
Agentic coding (Terminal-Bench 2.1)	80.4%	67.0%	+13.4
HealthBench Professional	57.8%	44.2%	+13.6
FrontierCode v1	38.8%	15.1%	+23.7
CursorBench (independent)	61.2%	49.0%	+12.2
Reasoning (HLE, with tools)	57.4%	46.8%	+10.6
Reasoning (HLE, no tools)	43.2%	34.6%	+8.6
Agentic search (BrowseComp)	84.7%	76.2%	+8.5
Knowledge work (GDPval-AA v2)	1618	1395	+223
AutomationBench	13.5%	5.3%	+8.2
SWE-bench Verified	85.2%	79.6%	+5.6
Agentic coding (SWE-bench Pro)	63.2%	58.1%	+5.1
Toolathlon (multi-app agent)	54.3%	49.4%	+4.9
USAMO 2026 (math olympiad)	79.5%	55.0%	+24.5
Computer use (OSWorld-Verified)	81.2%	78.5%	+2.7

Source: Unless noted, benchmark scores are from Anthropic's Claude Sonnet 5 System Card, Table 8.1.A and relevant subsection details. All Sonnet 5 results use adaptive thinking at max effort, default sampling, averaged over 5 trials. Sonnet 4.6 scores come from the same evaluation harness, with two exceptions: CursorBench scores were measured independently by Cursor in their production harness, and the SWE-bench Verified Sonnet 4.6 score (79.6%) is from Anthropic's Sonnet 4.6 launch announcement.
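The Gain column is plain subtraction, but percentage points hide how large a change is relative to the starting score. A minimal Python sketch, using a few rows copied from the table above, computes both. The `SCORES` dict and `gains` helper are illustrative names, not anything from the System Card.

```python
# Scores copied from the comparison table: (Sonnet 5, Sonnet 4.6), in percent.
SCORES = {
    "Terminal-Bench 2.1": (80.4, 67.0),
    "FrontierCode v1": (38.8, 15.1),
    "HLE (with tools)": (57.4, 46.8),
    "SWE-bench Pro": (63.2, 58.1),
    "OSWorld-Verified": (81.2, 78.5),
}


def gains(new: float, old: float) -> tuple[float, float]:
    """Return (percentage-point gain, ratio of new score to old score)."""
    return round(new - old, 1), round(new / old, 2)


if __name__ == "__main__":
    # Sort rows by absolute gain, largest first.
    ordered = sorted(SCORES.items(), key=lambda kv: kv[1][0] - kv[1][1], reverse=True)
    for name, (new, old) in ordered:
        points, ratio = gains(new, old)
        print(f"{name:20s} +{points:>4} pts  {ratio:.2f}x")
```

Seen this way, FrontierCode's +23.7 points is a 2.57× ratio, while Terminal-Bench's +13.4 points is a 1.20× ratio from a much higher base.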

Sonnet 5 vs Sonnet 4.6 benchmark comparison bar chart — Claude Sonnet 5 (indigo) leads Sonnet 4.6 (light purple) on every benchmark. The widest gaps are on Terminal-Bench 2.1 (+13.4), FrontierCode (+23.7), and CursorBench (+12.2). The narrowest is OSWorld-Verified (+2.7) — computer use capabilities were already strong in 4.6. Every single number improved. There are no regressions.

Sonnet 5 vs Sonnet 4.6 radar chart: Sonnet 5 (solid indigo) encloses Sonnet 4.6 (dashed light purple) on every axis. The radar makes the structural improvement visible, with FrontierCode and HLE (with tools) showing the largest area expansion. Sonnet 5 outscores Sonnet 4.6 on every measured benchmark.

Sonnet 5 gains over Sonnet 4.6 horizontal bar chart — Gains sorted by magnitude. Green = double-digit gains (structural improvements). Amber = solid gains (5–10 pts). Red = modest gains (<5 pts). Terminal-Bench (+13.4), HealthBench (+13.6), and FrontierCode (+23.7) are the headline numbers. OSWorld (+2.7) is the smallest gain — computer use was already strong in 4.6.

The +13.4 Terminal-Bench Leap: Agentic Coding Transformed

One of the largest percentage-point gains in the comparison, behind only USAMO (+24.5), FrontierCode (+23.7), and HealthBench (+13.6). Sonnet 4.6 at 67.0% on Terminal-Bench 2.1 was respectable, if behind GPT-5.4 (75.1%). Sonnet 5 at 80.4% is a different tier entirely. It now sits within 2.3 points of Opus 4.8 (82.7%) and ahead of Opus 4.7 (69.7%). Terminal-Bench measures real CLI agentic coding: package management, git operations, build system debugging, server configuration. This is the benchmark that most directly reflects what developers experience in Claude Code. Anthropic's System Card confirms the evaluation uses the same mini-SWE-agent harness for both models, making this a clean apples-to-apples comparison.

The practical implication: Sonnet 4.6 failed about 33% of Terminal-Bench tasks, while Sonnet 5 fails about 20%, a cut of roughly 40% in the failure rate. For developers using Claude Code as a daily driver, this is the difference between constant hand-holding and genuine autonomy.
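The failure-rate arithmetic behind the Terminal-Bench comparison can be sketched in a few lines of Python. The `failure_reduction` helper is a hypothetical name introduced here for illustration; the only inputs are the two pass rates from the table.

```python
def failure_reduction(old_pass: float, new_pass: float) -> float:
    """Relative drop in failure rate, given two pass rates in percent."""
    old_fail = 100.0 - old_pass
    new_fail = 100.0 - new_pass
    return (old_fail - new_fail) / old_fail


if __name__ == "__main__":
    # Terminal-Bench 2.1 pass rates from the table: 67.0% -> 80.4%.
    cut = failure_reduction(67.0, 80.4)
    print(f"Failure rate 33.0% -> 19.6%, relative reduction {cut:.1%}")
```

Framing the gain as a drop in failures is often closer to what a developer notices: fewer runs that need manual rescue, rather than a higher score on a leaderboard.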

FrontierCode: More Than Doubled

FrontierCode v1 tests cutting-edge coding challenges that require novel algorithmic thinking — the kind of problems where models can't pattern-match from training data. Sonnet 4.6 at 15.1% was barely functional. Sonnet 5 at 38.8% is +23.7 points — more than double. This isn't just incremental tuning. It's a structural improvement in how the model handles unfamiliar coding problems. Anthropic's framing: "Sonnet 5 is built to be the most agentic Sonnet model yet." FrontierCode is where that claim is most visible.

HLE with Tools: Nearly Matches Opus 4.8

Humanity's Last Exam with tools is the hardest reasoning benchmark with tool access. Sonnet 4.6 scored 46.8%. Sonnet 5 scores 57.4% — a +10.6 point jump that puts it within 0.5 points of Opus 4.8 (57.9%). That's functionally identical. For developers building agentic research workflows — where the model uses browsers, code execution, and file manipulation to solve complex problems — Sonnet 5 delivers Opus-level reasoning at 60% of the price.

Without tools, the gap to Opus remains wider (43.2% vs 49.8%), but Sonnet 5 still gained +8.6 points over Sonnet 4.6 (34.6%). Raw reasoning improved substantially — tools just make it competitive with the flagship.

Knowledge Work: Sonnet 5 Edges Opus 4.8

GDPval-AA v2 is an independent benchmark from Artificial Analysis that measures real-world professional task completion across 220 tasks and 44 occupations. Sonnet 4.6 scored 1395 Elo, solidly mid-pack. Sonnet 5 scores 1618 Elo, a +223 point gain that places it nominally ahead of Opus 4.8 (1615). A 3-point Elo gap is within noise, and the System Card says as much: "Claude Sonnet 5 ranks second (ELO 1618), statistically tied with Opus 4.8 (ELO 1615) and trailing only Fable 5 (ELO 1783)." Even so, this is the first time a Sonnet-tier model has placed level with or above the concurrent Opus flagship on any benchmark.

For everyday professional use (document analysis, slide creation, spreadsheet work, research synthesis), Sonnet 5 delivers Opus-level quality at 60% of the cost. The main tradeoff is the tokenizer, which raises token counts for the same text.
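Elo gaps are easier to read as expected head-to-head win rates. The sketch below assumes GDPval-AA uses the conventional logistic Elo model with a 400-point scale; the article's sources do not confirm that, so treat the outputs as illustrative rather than as Artificial Analysis's own numbers.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: the benchmark follows the conventional 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings quoted in the article.
    sonnet_5, sonnet_46, opus_48 = 1618, 1395, 1615
    print(f"Sonnet 5 vs Sonnet 4.6: {elo_expected_score(sonnet_5, sonnet_46):.3f}")
    print(f"Sonnet 5 vs Opus 4.8:   {elo_expected_score(sonnet_5, opus_48):.3f}")
```

Under that assumption, a 223-point lead corresponds to Sonnet 5 being preferred in roughly 78% of pairings against Sonnet 4.6, while a 3-point lead over Opus 4.8 is barely distinguishable from a coin flip.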

SWE-bench: Solid Gains on Both Pro and Verified

On SWE-bench Pro (the harder, contamination-resistant variant): +5.1 points (58.1% → 63.2%). On SWE-bench Verified (the classic 500-problem set): +5.6 points (79.6% → 85.2%). These are meaningful but not structural gains — SWE-bench is becoming a saturated benchmark at the frontier. The more interesting story is in the independent evaluation: CursorBench, measured by Cursor in their production harness, shows Sonnet 5 at 61.2% vs Sonnet 4.6 at 49.0% — a +12.2 point gain. Different tasks, different harness, but a consistent signal: Sonnet 5 is materially better at real-world coding.

Math: USAMO Jumps +24.5 Points

The USA Mathematical Olympiad 2026, a proof-based competition held March 21–22, after Sonnet 5's training data cutoff, shows the most dramatic gain of any benchmark: Sonnet 4.6 at 55.0% vs Sonnet 5 at 79.5% (+24.5). Because the competition postdates pretraining data collection, the evaluation is uncontaminated, and the gain suggests genuine improvements in mathematical reasoning, not memorization. For reference, Opus 4.8 scored 96.7% and Mythos 5 scored 99.8%. Sonnet 5 still trails the frontier, but the gap has narrowed dramatically.

Specification Comparison

Feature	Claude Sonnet 5	Claude Sonnet 4.6
Released	June 30, 2026	February 17, 2026
API ID	`claude-sonnet-5`	`claude-sonnet-4-6`
Context Window	1,000,000 tokens	1,000,000 tokens
Max Output	128K (300K batch)	64K
Thinking	Adaptive (effort: high default)	Adaptive (effort levels)
Extended Thinking	No	Yes
Knowledge Cutoff	Jan 2026	Jan 2026 (reliable: Aug 2025)
Comparative Latency	Fast	Fast
Pricing (API)	$3 / $15 per MTok*	$3 / $15 per MTok
Tokenizer	New (Opus 4.7+ tokenizer)	Old (Sonnet 4.6 tokenizer)
Max Output (batch beta)	300K	Not supported

* Introductory pricing of $2/$10 per MTok through August 31, 2026. Sources: Claude Platform Docs, Sonnet 5 System Card.

The Tokenizer Caveat: Same List Price, Higher Token Count

The headline prices are identical ($3/$15), but Sonnet 5 uses the updated tokenizer Anthropic introduced with Opus 4.7. The quoted range is 1.0× to 1.35× more tokens than Sonnet 4.6 for the same text, depending on content, though Simon Willison's measurements exceed it for English: English ~1.33–1.42×, Python code ~1.27–1.28×, Spanish ~1.33×, Simplified Chinese ~1.01×.

In practice: if your workload averaged $1,000/month on Sonnet 4.6, expect ~$1,330–$1,420/month on Sonnet 5 for English-heavy tasks and ~$1,270–$1,280/month for Python-heavy ones, a real cost increase of 27–42%. On the introductory $2/$10 pricing, 1.3× inflation works out to effective rates of ~$2.60/$13.00, and even 1.42× gives ~$2.84/$14.20, still below the $3/$15 list. Anthropic's announcement: "We've increased rate limits across all surfaces to accommodate the higher token usage."
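The cost arithmetic above can be sketched as a small Python helper. It assumes the same text is sent and received on both models, that input and output tokens inflate by the same ratio, and that the old bill was charged at the $3/$15 list price; `effective_monthly_cost` and its parameters are hypothetical names for illustration.

```python
LIST_PRICE = 3.0    # USD per MTok input, list price quoted in the article
INTRO_PRICE = 2.0   # USD per MTok input, introductory price through Aug 31, 2026


def effective_monthly_cost(baseline_usd: float, token_ratio: float,
                           price_ratio: float = 1.0) -> float:
    """Projected Sonnet 5 bill from a Sonnet 4.6 bill.

    token_ratio: tokens on the new tokenizer / tokens on the old one.
    price_ratio: new per-token price / old per-token price.
    Assumes input and output inflate equally (a simplification).
    """
    return baseline_usd * token_ratio * price_ratio


if __name__ == "__main__":
    baseline = 1000.0
    for label, ratio in [("Python code", 1.27), ("English (low)", 1.33), ("English (high)", 1.42)]:
        at_list = effective_monthly_cost(baseline, ratio)
        at_intro = effective_monthly_cost(baseline, ratio, INTRO_PRICE / LIST_PRICE)
        print(f"{label:15s} list ${at_list:,.0f}  intro ${at_intro:,.0f}")
```

Because the introductory price is two-thirds of list, it absorbs token inflation up to 1.5×, which is why every measured ratio above still lands under the old bill during the introductory window.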

Safety & Alignment: Better on Most Measures

The System Card's alignment assessment (Section 6) reports that Sonnet 5 improves over Sonnet 4.6 on most measures:

Metric	Sonnet 4.6	Sonnet 5	Change
MASK (sycophantic lying rate)	13.3%	3.1%	−10.2
AA-Omniscience (factuality net score)	0.14	0.20	+0.06
AA-Omniscience (incorrect rate)	35.0%	26.5%	−8.5
Malicious request refusal (Claude Code)	76.6%	92.4%	+15.8
Prompt injection ASR (coding, with thinking)	12.71%	0.31%	−12.4

Source: Sonnet 5 System Card, Sections 5, 6. MASK = Model Alignment between Statements and Knowledge. ASR = Attack Success Rate (lower is better).

Sycophancy dropped from 13.3% to 3.1% — the lowest of any tested Claude model. Factual hallucination fell from 35.0% to 26.5%. Malicious request refusal in Claude Code jumped from 76.6% to 92.4%. And prompt injection robustness in coding environments improved dramatically — from 12.71% attack success rate to 0.31%. Mashable's coverage: "Anthropic reports Sonnet 5 shows lower rates of hallucination, sycophancy, and other undesirable behaviors."

Two small regressions: Sonnet 5 has a slightly higher "wet blanket" rate (dismissive or discouraging responses), and its over-refusal rate on benign prompts is marginally higher (0.59% vs 0.40% on API). But overall, the safety picture is overwhelmingly positive.

Should You Upgrade from Sonnet 4.6?

If you...	Decision
Use Claude Code daily for agentic coding	✅ Upgrade immediately. +13.4 TB 2.1 is transformational.
Run knowledge work / document analysis	✅ Upgrade. +223 GDPval. Statistically tied with Opus 4.8.
Need max output tokens (64K → 128K)	✅ Upgrade. Double the output ceiling.
Care about sycophancy / hallucination	✅ Upgrade. 3.1% lying rate (lowest ever).
Build prompt injection-resistant agents	✅ Upgrade. 0.31% ASR vs 12.71%.
Are cost-sensitive at high volume	⚖️ Measure. Tokenizer inflation = 1.0–1.35× more tokens (up to ~1.42× measured on English).
Use extended thinking explicitly	⚖️ Note: Sonnet 5 drops extended thinking. Adaptive only.
Need proven production stability	⚖️ Stay. Sonnet 4.6 has 4+ months of hardening.

Conclusion: The No-Brainer Upgrade (With One Asterisk)

Claude Sonnet 5 is the most decisive generation-over-generation improvement in Sonnet history. Every benchmark went up, several by double-digit margins. The gains are concentrated where they matter most: agentic coding (+13.4 TB), tool-augmented reasoning (+10.6 HLE), knowledge work (+223 GDPval), and safety (sycophantic lying rate −10.2 points, prompt injection attack success rate −12.4 points). At the same list price, this is as close to a no-brainer upgrade as enterprise AI gets.

The one asterisk is the tokenizer. If you're running high-volume English or code workloads, the effective cost increase is 27–42% due to token inflation: the price per token is unchanged, but the tokens per request go up. On the introductory $2/$10 pricing through August 31, the real cost still lands below the $3/$15 list. After that, measure your specific workload before committing at scale.

Anthropic's official line: "Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models." The data backs it up. Sonnet 4.6 was already the workhorse of the Claude lineup. Sonnet 5 makes it nearly obsolete — at the same price.

Sources & Links

Anthropic — Claude Sonnet 5 System Card (PDF) — Table 8.1.A, all benchmark scores, safety evaluations
Anthropic — Introducing Claude Sonnet 5 — official launch announcement, pricing, tokenizer notes
Claude Platform Docs — Models Overview — spec comparison table
Claude Platform Docs — What's New in Claude Sonnet 5
Simon Willison — What's New in Claude Sonnet 5 — tokenizer analysis
Mashable — Anthropic Finally Launches Claude Sonnet 5
Coursiv — Claude Sonnet 5: Release Date, Pricing, API & Benchmarks
Handy AI — Model Drop: Claude Sonnet 5
Cursor — CursorBench — independent production harness evaluation
Morphllm — Claude Benchmarks (2026) — third-party benchmark tracker