July 1, 2026. Anthropic drops Claude Sonnet 5, and across the board it beats GPT-5.5. Not just on one benchmark: on five of the six directly comparable benchmarks, with the sixth (ARC-AGI-2) a tie within noise. SWE-bench Pro: 63.2% vs 58.6%. Terminal-Bench 2.1: 80.4% vs 78.2%. HLE with tools: 57.4% vs 52.2%. And it does it at 40% cheaper input and 50% cheaper output at list price. Anthropic's mid-tier model just leapfrogged OpenAI's flagship. Here's the complete comparison, sourced from Anthropic's Sonnet 5 System Card and OpenAI's GPT-5.5 announcement.

TL;DR — Sonnet 5 vs GPT-5.5

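  • Sonnet 5 leads on 5 of the 6 directly comparable benchmarks: +4.6 SWE-bench Pro, +2.2 Terminal-Bench 2.1, +1.8 HLE, +5.2 HLE with tools, +2.5 OSWorld-Verified. GDPval is reported on different scales (Elo vs win rate), so it is not counted here.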
  • 40-50% cheaper: $3/$15 vs $5/$30 per MTok. Introductory $2/$10 through Aug 31 makes it 2.5x cheaper on input, 3x cheaper on output.
  • GPT-5.5 leads ARC-AGI-2 by ~0.3 pts: 85.0% vs ~84.7%. The only benchmark where GPT-5.5 holds an edge — and it's within noise range.
  • GPT-5.5 dominates on ecosystem: Codex CLI (4M weekly devs), browser verification, 68 days of production hardening. Sonnet 5 is days old.
  • Sonnet 5 uses a new tokenizer that produces 1.3-1.4x more tokens. At the 1.3x low end, effective cost is ~$3.90/$19.50 for English workloads; at 1.4x it is ~$4.20/$21.00. Still cheaper than GPT-5.5.
  • Near-identical context: Sonnet 5 offers a 1M-token window, GPT-5.5 edges ahead at 1.05M. Sonnet 5 has lower latency.

Full Benchmark Comparison

The table below lists every benchmark with a published score for at least one of the two models. A row is directly comparable only where both models were evaluated on the same benchmark with the same metric; rows that fail that test are flagged in the Winner column. All Sonnet 5 scores are from Anthropic's System Card (Table 8.1.A). GPT-5.5 scores are from OpenAI's official announcement and independent verification on tbench.ai.

Benchmark | Claude Sonnet 5 | GPT-5.5 | Delta | Winner
Agentic coding (SWE-bench Pro) | 63.2% | 58.6% | +4.6 | Sonnet 5
Agentic coding (Terminal-Bench 2.1) | 80.4% | 78.2% | +2.2 | Sonnet 5
Reasoning (HLE, no tools) | 43.2% | 41.4% | +1.8 | Sonnet 5
Reasoning (HLE, with tools) | 57.4% | 52.2% | +5.2 | Sonnet 5
Computer use (OSWorld-Verified) | 81.2% | 78.7% | +2.5 | Sonnet 5
Knowledge work (GDPval-AA v2)* | 1618 | 84.9% | n/a | Different scales
Science (GPQA Diamond) | Not published | 93.6% | n/a | Only GPT-5.5 data
Abstract reasoning (ARC-AGI-2) | ~84.7% | 85.0% | GPT +0.3 | GPT-5.5 (tie)
Tool orchestration (MCP Atlas) | Not published | 75.3% | n/a | Only GPT-5.5 data

* GDPval uses different scales: Anthropic reports an Elo score (1618 for Sonnet 5), OpenAI reports a win-rate percentage (84.9% for GPT-5.5). Direct comparison is not meaningful. Sources: Anthropic Sonnet 5 System Card, OpenAI GPT-5.5 announcement, Morphllm benchmarks, BuildFastWithAI GPT-5.5 review, BenchLM GPQA leaderboard.

[Figure: Claude Sonnet 5 vs GPT-5.5 benchmark comparison bar chart]

Coding: Sonnet 5's Uncontested Lead

SWE-bench Pro is the benchmark that matters most for professional software engineering. It tests real GitHub issue resolution across open-source repositories — the kind of work developers actually do. Sonnet 5 scores 63.2% vs GPT-5.5's 58.6%. That's a 4.6-point gap — not enormous, but consistent and meaningful.

What makes this remarkable is the pricing context. Sonnet 5 achieves this 4.6-point lead at 40% cheaper input cost and 50% cheaper output cost. For teams running AI-assisted code review, bug fixing, or autonomous PR resolution, that's a double win: better accuracy at lower cost.
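To make the pricing claim concrete, here is a minimal Python sketch of the list-price arithmetic. The prices are the ones quoted in this article; the dictionary keys and function names are illustrative labels, not official model identifiers.

```python
# List prices in USD per million tokens (MTok), as quoted in this article.
PRICES = {
    "sonnet-5": {"input": 3.00, "output": 15.00},
    "sonnet-5-intro": {"input": 2.00, "output": 10.00},  # intro pricing through Aug 31, 2026
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def percent_cheaper(model: str, baseline: str, kind: str) -> float:
    """Percent by which `model` undercuts `baseline` for 'input' or 'output' tokens."""
    return round(100 * (1 - PRICES[model][kind] / PRICES[baseline][kind]), 1)


def price_ratio(model: str, baseline: str, kind: str) -> float:
    """How many times more expensive `baseline` is than `model`."""
    return round(PRICES[baseline][kind] / PRICES[model][kind], 2)


if __name__ == "__main__":
    for kind in ("input", "output"):
        print(kind, percent_cheaper("sonnet-5", "gpt-5.5", kind), "% cheaper at list price")
        print(kind, price_ratio("sonnet-5-intro", "gpt-5.5", kind), "x cheaper at intro price")
```

These are per-token list prices only. They ignore differences in how many tokens each model's tokenizer produces for the same text, which the tokenizer section covers.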

On Terminal-Bench 2.1 — terminal-based agentic coding — Sonnet 5 leads 80.4% vs 78.2%. A narrower 2.2-point margin, but again: cheaper. For DevOps workflows involving package management, git operations, build systems, and Docker commands, Sonnet 5 is both more capable and more cost-effective.

The DeepSWE Wildcard

One important caveat: on DeepSWE — a third-party benchmark testing harder, longer-horizon coding tasks — the available data compares Opus 4.8, not Sonnet 5. GPT-5.5 scores 70% on DeepSWE vs Opus 4.8's 58%, suggesting GPT-5.5 may have an edge on the most complex, multi-hour coding tasks. Sonnet 5 hasn't been independently tested on DeepSWE yet.

Reasoning: The HLE Gap

Humanity's Last Exam is designed to be the hardest test ever created: 2,500 expert-level questions across math, science, and humanities that models can't simply memorize. Two variants exist: without tools (raw reasoning) and with tools (browsers, code execution, shell access).

Without tools, Sonnet 5 leads 43.2% vs 41.4% — a slim 1.8-point margin. With tools, the gap widens to 5.2 points (57.4% vs 52.2%). This pattern mirrors what we saw with Opus 4.8: Anthropic models are particularly strong at tool-augmented reasoning.

HLE with Tools — The Agent-Relevant Metric

For developers building agentic workflows, HLE with tools is the more realistic measure. Sonnet 5's 5.2-point lead here — from 52.2% to 57.4% — is the single largest gap on any comparable benchmark.
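The "largest gap" claim is easy to check from the published scores. The sketch below recomputes each delta and picks the biggest one; the scores are copied from the benchmark table in this article, and the variable and function names are just illustrative.

```python
# (Sonnet 5 score, GPT-5.5 score) in percent, from the benchmark table in this article.
SCORES = {
    "SWE-bench Pro": (63.2, 58.6),
    "Terminal-Bench 2.1": (80.4, 78.2),
    "HLE (no tools)": (43.2, 41.4),
    "HLE (with tools)": (57.4, 52.2),
    "OSWorld-Verified": (81.2, 78.7),
    "ARC-AGI-2": (84.7, 85.0),  # the Sonnet 5 figure is approximate in the source
}


def deltas(scores: dict) -> dict:
    """Sonnet 5 minus GPT-5.5, in percentage points, rounded to one decimal."""
    return {name: round(sonnet - gpt, 1) for name, (sonnet, gpt) in scores.items()}


def largest_gap(scores: dict) -> str:
    """Name of the benchmark where Sonnet 5's lead is widest."""
    d = deltas(scores)
    return max(d, key=d.get)


if __name__ == "__main__":
    for name, delta in deltas(SCORES).items():
        print(f"{name}: {delta:+.1f}")
    print("Largest Sonnet 5 lead:", largest_gap(SCORES))
```

Only benchmarks where both models report the same metric are included, which is why GDPval, GPQA Diamond, and MCP Atlas are left out.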

Computer Use: Both Beat Humans, Sonnet Leads

OSWorld-Verified measures desktop automation. The human expert baseline is 72.4%. Both models clear it: Sonnet 5 at 81.2% (8.8 pts above humans), GPT-5.5 at 78.7% (6.3 pts above humans). For computer-use agents, Sonnet 5 delivers better automation at lower cost.

[Figure: Claude Sonnet 5 vs GPT-5.5 capability radar chart]

Where GPT-5.5 Fights Back

1. Abstract Reasoning (ARC-AGI-2)

GPT-5.5 scores 85.0%. Sonnet 5's score (~84.7%) is within the margin of error. A tie, but this is the only directly comparable benchmark where GPT-5.5 isn't behind.

2. Ecosystem and Production Hardening

GPT-5.5 launched 68 days before Sonnet 5. It powers Codex CLI (4M weekly devs) with browser verification, persisted goals, and self-repair loops. Sonnet 5 is brand new and unproven in production.

3. Long-Context Retrieval

GPT-5.5 scores 74.0% on MRCR v2 at 512K-1M token contexts. Anthropic hasn't published comparable long-context benchmarks for Sonnet 5.
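The 68-day head start cited above follows directly from the two release dates given in this article (April 23, 2026 for GPT-5.5, June 30, 2026 for Sonnet 5). A quick sketch of the date arithmetic, with illustrative constant names:

```python
from datetime import date

# Release dates as listed in this article's specification comparison.
GPT_55_RELEASE = date(2026, 4, 23)
SONNET_5_RELEASE = date(2026, 6, 30)

# Subtracting two dates yields a timedelta; .days is the whole-day difference.
head_start_days = (SONNET_5_RELEASE - GPT_55_RELEASE).days

if __name__ == "__main__":
    print(f"GPT-5.5 head start: {head_start_days} days")
```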

Specification Comparison

Feature | Claude Sonnet 5 | GPT-5.5
Released | June 30, 2026 | April 23, 2026
API ID | claude-sonnet-5 | gpt-5.5
Context Window | 1,000,000 tokens | 1,050,000 tokens
Max Output | 128K (300K batch) | 128K
Thinking Mode | Adaptive (effort: high default) | xHigh reasoning effort
Knowledge Cutoff | Jan 2026 | Dec 2025
Multimodal Input | Text + Image | Text + Image + Audio
Comparative Latency | Fast | Moderate
Pricing (Input) | $3 / MTok* | $5 / MTok
Pricing (Output) | $15 / MTok* | $30 / MTok
Prompt Caching | 90% discount | 90% discount ($0.50/MTok)
Batch Processing | 50% discount | 50% discount ($15/MTok output)

* Sonnet 5 introductory pricing of $2/$10 per MTok through August 31, 2026. Sources: Claude Platform Docs, Anthropic blog, OpenAI GPT-5.5 announcement.
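For a rough sense of what the caching and batch discounts mean in dollars, the sketch below applies the percentages from the table to each model's list price. Two assumptions are baked in: the 90% caching discount applies to cached input tokens, and the 50% batch discount applies to output tokens at list price. The table only spells out the resulting dollar figures for GPT-5.5, so the Sonnet 5 numbers are derived, not quoted. Whether the two discounts stack is not stated in the source, so they are computed separately.

```python
# (input, output) list prices in USD per MTok, keyed by the API IDs in the table.
LIST_PRICES = {
    "claude-sonnet-5": (3.00, 15.00),
    "gpt-5.5": (5.00, 30.00),
}

CACHE_DISCOUNT = 0.90  # assumed to apply to cached input tokens
BATCH_DISCOUNT = 0.50  # assumed to apply to batch output tokens


def cached_input_rate(model: str) -> float:
    """Per-MTok input price after the prompt-caching discount."""
    input_price, _ = LIST_PRICES[model]
    return round(input_price * (1 - CACHE_DISCOUNT), 2)


def batch_output_rate(model: str) -> float:
    """Per-MTok output price after the batch-processing discount."""
    _, output_price = LIST_PRICES[model]
    return round(output_price * (1 - BATCH_DISCOUNT), 2)


if __name__ == "__main__":
    for model in LIST_PRICES:
        print(model, "cached input:", cached_input_rate(model),
              "batch output:", batch_output_rate(model))
```

The GPT-5.5 results reproduce the $0.50 and $15 figures in the table, which is a useful sanity check on the assumptions.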

The Tokenizer Caveat: Real Costs Matter

Sonnet 5 uses the updated tokenizer that Anthropic introduced with Opus 4.7. The same text produces roughly 1.3-1.4x more tokens compared to Sonnet 4.6 and GPT-5.5's tokenizer. Simon Willison's analysis: English text ~1.33-1.42x more tokens, Python code ~1.27-1.28x, Spanish ~1.33x, Simplified Chinese ~1.01x (essentially unchanged).

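Multiply the list price by the token inflation factor to get the effective rate. At the low end of the range (1.3x), Sonnet 5 costs ~$3.90/$19.50 for English workloads; at the high end (1.4x), ~$4.20/$21.00. Either way it is still meaningfully cheaper than GPT-5.5 at $5/$30. With the introductory $2/$10 pricing, even with 1.3x token inflation you're paying ~$2.60/$13.00, roughly half the cost of GPT-5.5.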
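Here is a small sketch of that adjustment, plus the break-even point: how much token inflation it would take before Sonnet 5 stops being cheaper. It assumes, as this article does, that GPT-5.5's token counts are the 1.0x baseline. The function names are illustrative.

```python
GPT_55 = {"input": 5.00, "output": 30.00}    # USD per MTok
SONNET_5 = {"input": 3.00, "output": 15.00}  # USD per MTok, standard list price


def effective_rate(list_price: float, token_multiplier: float) -> float:
    """List price scaled by how many more tokens the same text produces."""
    return round(list_price * token_multiplier, 2)


def break_even_multiplier(kind: str) -> float:
    """Token inflation at which Sonnet 5's effective price equals GPT-5.5's."""
    return round(GPT_55[kind] / SONNET_5[kind], 2)


if __name__ == "__main__":
    for multiplier in (1.3, 1.4):
        print(f"{multiplier}x:",
              effective_rate(SONNET_5["input"], multiplier),
              effective_rate(SONNET_5["output"], multiplier))
    print("Break-even (input):", break_even_multiplier("input"))
    print("Break-even (output):", break_even_multiplier("output"))
```

Under these assumptions the tokenizer would have to inflate counts by about 1.67x on input (2x on output) before the standard price advantage disappears, well above the 1.3-1.4x range reported here.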

Should You Use Sonnet 5 or GPT-5.5?

If you... | Decision
Need the best SWE-bench Pro coding performance | Sonnet 5. +4.6 pts. 40-50% cheaper.
Run terminal-based agentic coding (DevOps, CLI) | Sonnet 5. +2.2 pts on TB 2.1. Cheaper.
Build agents that use tools (browsers, terminal, code) | Sonnet 5. +5.2 pts on HLE with tools.
Do computer-use / desktop automation | Sonnet 5. +2.5 pts on OSWorld.
Are cost-sensitive but need frontier quality | Sonnet 5. $2/$10 intro through Aug 31.
Need abstract visual reasoning (ARC-AGI-2 tasks) | GPT-5.5. Essentially tied, but proven.
Need proven production stability | GPT-5.5. 68 days of hardening vs 1 day.
Use the Codex CLI ecosystem heavily | GPT-5.5. 4M weekly devs. Browser verification.
Process very long contexts (512K-1M tokens) | GPT-5.5. 74% MRCR v2. Sonnet 5 unproven.
Need audio input for your workflows | GPT-5.5. Sonnet 5 is text+image only.

The Ecosystem Reality Check

Benchmarks tell one story. Production tells another. GPT-5.5 has been in the wild for 68 days. It powers Codex CLI, which serves 4 million weekly developers. It's integrated into ChatGPT, Azure, and the OpenAI API with battle-tested reliability. The Codex ecosystem includes browser verification, persisted goals, and self-repair loops.

Sonnet 5 launched yesterday. The benchmarks are undeniable — it's the better model on paper. But Anthropic's developer ecosystem for Sonnet-tier models lags behind OpenAI's. Claude Code is excellent, but it doesn't have the scale, integrations, or production track record of Codex CLI.

Conclusion: The Best Model at the Best Price

Claude Sonnet 5 is the best mid-tier model ever released. It beats OpenAI's current flagship on five of the six directly comparable benchmarks and ties it on the sixth. It does it at 40-50% lower list price. The only areas where GPT-5.5 holds ground are abstract reasoning (a tie), long-context retrieval (uncontested), and ecosystem maturity (temporary).

  • Starting a new project today? Use Sonnet 5. Better benchmarks, lower price. Introductory $2/$10 through Aug 31 is the best deal in frontier AI.
  • Deep in the OpenAI ecosystem? Wait a week for early adopter reports, then switch. The 4.6-point SWE-bench Pro lead at a 40-50% lower list price is too large to ignore.
  • On GPT-5.4 or earlier? Upgrade to Sonnet 5 immediately. Better than GPT-5.5 at lower cost than GPT-5.4.

Anthropic's framing was that Sonnet 5 is "close to Opus 4.8 at lower prices." The data shows it's more than that: it's better than GPT-5.5 at lower prices. The mid-tier just became the new flagship.
