In March 2024, Claude 3 Opus made headlines by scoring 60.9% on SWE-bench Verified, the first model to break 60%. It cost $75 per million output tokens. Twenty-six months later, in May 2026, Claude Opus 4.8 scores 88.6%, a 27.7-point gain. But the price story runs in two directions: OpenAI's GPT-5.5 doubled to $30 per million output tokens while DeepSeek's open-weight V4 Pro fell to $0.87. The most expensive frontier-tier coding model now costs 34× the cheapest, with the top of the range rising and the bottom falling. This is the story of one of the most intense periods of progress in AI history, told through the benchmarks that actually matter.

📊 Key Milestones

  • Mar 2024: Claude 3 Opus breaks 60% SWE-bench at $75/1M. The "this might actually work" moment.
  • Aug 2024: Claude 3.5 Sonnet hits 69.1%. SWE-bench Verified introduced. Prices drop to $15/1M.
  • Jan 2025: DeepSeek V3 launches at $1.10/1M output — the first major price disruption. Open-weight era begins.
  • Oct 2025: Claude 4 Sonnet reaches 77.2%. SWE-bench Pro launches. Models plateau as agent harnesses become the bottleneck.
  • Feb 2026: Claude Opus 4.6 breaks 80% for the first time. OpenAI stops reporting Verified scores due to contamination.
  • Apr 2026: GPT-5.5 doubles price to $30/1M. Claude Opus 4.8 debuts at $25/1M — stable since Opus 4.5. DeepSeek, Kimi, GLM launch as open-weight challengers.
  • May 2026: DeepSeek makes 75% discount permanent: $0.87/1M. The price spread hits 34×. Claude Opus 4.8 reaches 88.6% Verified, 69.2% Pro.

The SWE-bench Verified Ascent: Only the Records Matter

SWE-bench Verified — 500 real GitHub issues from Python repositories — has been the industry's north star since August 2024. Here are the peak scores at each point in time — the records that defined each era, not the fluctuations in between:

SWE-bench Verified ascent: record-breaking scores March 2024 to May 2026
| Date | Record-Setting Model | SWE-bench Verified | Significance |
| --- | --- | --- | --- |
| Mar 2024 | Claude 3 Opus | 60.9% | First model to break 60%. The threshold of "usefulness." |
| May 2024 | GPT-4o | 67.3% | OpenAI enters the race; multimodal. First non-Claude record. |
| Aug 2024 | Claude 3.5 Sonnet | 69.1% | SWE-bench Verified introduced. Sonnet challenges Opus. |
| Mar 2025 | Claude 3.7 Sonnet | 70.3% | Slow grind through the 70% barrier. |
| Jun 2025 | Claude 4 Sonnet | ~72.0% | 2025 Acceleration Era. |
| Oct 2025 | Claude 4 Sonnet (refined) | 77.2% | 5.2-point jump over June. Then a 4-month plateau as agent harnesses became the bottleneck. |
| Feb 2026 | Claude Opus 4.6 | 80.8% | 2026 Breakout. First model to break 80%. 1M context arrives. |
| Apr 2026 | Claude Opus 4.7 | 87.6% | GPT-5.5 launches. 6.8-point leap, the biggest single jump. |
| May 2026 | Claude Opus 4.8 | 88.6% | Current record. 27.7 points higher than in March 2024. |

Sources: OpenLM SWE-bench tracker; Vals.ai; vendor system cards. ⚠️ OpenAI has considered Verified contaminated since Feb 2026. Claude set 8 of the 9 records listed.

Three eras, one near-constant: Claude models hold 8 of the 9 peak scores shown. The only interruption was GPT-4o, which took the lead in May 2024 and held it for three months. The 4-month plateau between October 2025 and February 2026 wasn't a capability ceiling; it was an agent-harness bottleneck. The models could reason better, but the scaffolding hadn't caught up. When it did (OpenHands, Claude Code), scores jumped roughly 10 points, from 77.2% in October 2025 to 87.6% in April 2026.
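The jump sizes and the record count can be checked directly from the table. The following is a minimal sketch, assuming the scores are copied verbatim from the table and that "~72.0%" is treated as exactly 72.0; the names `records` and `record_gains` are illustrative, not from any tracker's API.

```python
# Record scores copied from the SWE-bench Verified table above.
# Assumption: the approximate "~72.0%" entry is treated as 72.0.
records = [
    ("Mar 2024", "Claude 3 Opus", 60.9),
    ("May 2024", "GPT-4o", 67.3),
    ("Aug 2024", "Claude 3.5 Sonnet", 69.1),
    ("Mar 2025", "Claude 3.7 Sonnet", 70.3),
    ("Jun 2025", "Claude 4 Sonnet", 72.0),
    ("Oct 2025", "Claude 4 Sonnet (refined)", 77.2),
    ("Feb 2026", "Claude Opus 4.6", 80.8),
    ("Apr 2026", "Claude Opus 4.7", 87.6),
    ("May 2026", "Claude Opus 4.8", 88.6),
]


def record_gains(rows):
    """Return (date, model, gain) for each record relative to the previous one."""
    gains = []
    for (_, _, prev_score), (date, model, score) in zip(rows, rows[1:]):
        gains.append((date, model, round(score - prev_score, 1)))
    return gains


gains = record_gains(records)
largest = max(gains, key=lambda g: g[2])
claude_count = sum(1 for _, model, _ in records if model.startswith("Claude"))
total_gain = round(records[-1][2] - records[0][2], 1)

for date, model, gain in gains:
    print(f"{date}: {model} +{gain}")
print("largest jump:", largest)
print("Claude records:", claude_count, "of", len(records))
print("total gain:", total_gain)
```

With these inputs, the largest single step is the 6.8-point gain in April 2026, the October 2025 step is 5.2 points, and Claude models account for 8 of the 9 entries.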

The Three-Lane Race: Pro, Terminal, and Verified

By late 2025, relying on a single benchmark was no longer sufficient. SWE-bench Pro launched as the contamination-resistant successor. Terminal-Bench 2.0 emerged as the standard for agentic CLI coding:

Three-lane race: SWE-bench Pro vs Terminal-Bench vs SWE-bench Verified

The key insight: SWE-bench Verified (dashed line) is approaching saturation. SWE-bench Pro (solid green) remains the best differentiator, with a best score of 69.2% against an average of ~25%, a 44-point spread. Terminal-Bench (blue) is where GPT-5.5 leads, at 78.2% vs ~69% for Claude Opus 4.7. Different benchmarks measure fundamentally different capabilities.

Two Price Stories: OpenAI Goes Up, Open-Weight Goes Down

The AI pricing narrative is often told as "prices are collapsing." That's only half true. OpenAI has been raising prices, not lowering them. The real collapse is happening in open-weight models from Chinese labs — while Anthropic has remained stable:

Two diverging price paths: OpenAI rising, open-weight collapsing
| Date | Highest-Priced Model | $/1M Output | Lowest-Priced Model | $/1M Output | Price Spread |
| --- | --- | --- | --- | --- | --- |
| Mar 2024 | Claude 3 Opus | $75.00 | (no open-weight) | | |
| Aug 2024 | GPT-4o / Claude 3.5 Sonnet | $15.00 | (no open-weight) | | |
| Jan 2025 | GPT-4o | $15.00 | DeepSeek V3 | $1.10 | 14× |
| Aug 2025 | Claude Opus 4.1 | $25.00 | DeepSeek V3.2 | $0.42 | 60× |
| Feb 2026 | Claude Opus 4.6 | $25.00 | MiniMax M2.5 | $1.20 | 21× |
| Apr 2026 | GPT-5.5 | $30.00 | DeepSeek V4 Pro* | $3.48 | |
| May 2026 | GPT-5.5 | $30.00 | DeepSeek V4 Pro | $0.87 | 34× |

* DeepSeek V4 Pro launched at $3.48 in April; cut to $0.87 permanently in May. Claude Opus stable at $25 since August 2025. Sources: DeepSeek pricing; TokenMix comparison; AI Pricing Guru.

Three completely different pricing strategies:

  • OpenAI is scaling up — and charging for it. GPT-5 briefly dropped to $10 in August 2025. Then GPT-5.5 doubled to $30 — a deliberate bet that frontier capability commands premium pricing. More reasoning = more compute = higher cost passed to users.
  • Anthropic is stable. Claude Opus has held at $25/1M output since Opus 4.1 (August 2025). Sonnet at $15. Only Haiku has seen price movement. Anthropic's strategy isn't price competition — it's capability leadership and ecosystem lock-in via Claude Code.
  • Open-weight is falling fast. DeepSeek V3: $1.10 → V3.2: $0.42 → V4 Pro: $3.48 at launch, then $0.87 permanent. Architectural efficiency gains (CSA+HCA, MoE sparsity) are passed directly to users. The trend is down, though not monotonically: V4 Pro's $0.87 still sits above V3.2's $0.42.

The spread is now 34×: GPT-5.5 at $30 vs DeepSeek V4 Pro at $0.87. That is up from 14× in January 2025, though below the 60× peak of August 2025. Both are frontier-tier coding models. The open-weight disruptors aren't just competing on price; they're creating an entirely separate pricing universe.
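The spread figures are simple ratios of the two output prices in each row of the pricing table. A minimal sketch, assuming the prices are copied from that table; `price_rows` and `price_spread` are illustrative names, and the April 2026 row uses DeepSeek V4 Pro's $3.48 launch price.

```python
# Rows copied from the pricing table above:
# (date, highest-priced model, $/1M output, lowest-priced open-weight model, $/1M output)
price_rows = [
    ("Jan 2025", "GPT-4o", 15.00, "DeepSeek V3", 1.10),
    ("Aug 2025", "Claude Opus 4.1", 25.00, "DeepSeek V3.2", 0.42),
    ("Feb 2026", "Claude Opus 4.6", 25.00, "MiniMax M2.5", 1.20),
    ("Apr 2026", "GPT-5.5", 30.00, "DeepSeek V4 Pro", 3.48),  # launch price
    ("May 2026", "GPT-5.5", 30.00, "DeepSeek V4 Pro", 0.87),  # after the permanent cut
]


def price_spread(high_price, low_price):
    """Ratio of the highest to the lowest output price per 1M tokens."""
    return high_price / low_price


spreads = {
    date: round(price_spread(high, low), 1)
    for date, _, high, _, low in price_rows
}

for date, ratio in spreads.items():
    print(f"{date}: {ratio}x")
```

Rounded to one decimal place, the ratios are 13.6, 59.5, 20.8, 8.6, and 34.5, so the spread has moved in both directions over time instead of growing steadily.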

Claude's SWE-bench Dominance: 8 of 9 Records

All but one of the SWE-bench Verified records in this timeline belong to Anthropic. Several theories explain why:

  • Training data alignment: SWE-bench is 100% Python (Django is 46% of tasks). Claude's training skews toward software engineering — exactly the text that teaches Django debugging.
  • Agent ecosystem: Claude Code and MCP give Anthropic models a scaffolding advantage over raw API calls.
  • Reasoning style: Claude's chain-of-thought is more methodical than GPT's. For multi-file bug fixes requiring dependency tracking, this matters.
  • OpenAI's withdrawal: Since February 2026, OpenAI hasn't published Verified scores, reporting only Pro. The comparison is incomplete.

The Open-Weight Revolution

In March 2024, there were exactly zero competitive open-weight coding models. By May 2026:

| Date | Model | SWE-bench Verified | License | Significance |
| --- | --- | --- | --- | --- |
| Jan 2025 | DeepSeek V3 | ~39% | MIT | First open-weight model on SWE-bench. 14× cheaper. |
| Dec 2025 | DeepSeek V3.2 | 70.0% | MIT | First open-weight to hit 70%. 60× cheaper than Opus. |
| Feb 2026 | MiniMax M2.5 | 80.2% | Apache 2.0 | Open-weight breaks 80%. Gap to proprietary: 0.6 points. |
| Apr 2026 | Kimi K2.6 | 80.2% | Modified MIT | #1 open-weight on AA Intelligence Index (54). |
| Apr 2026 | DeepSeek V4 Pro | 80.6% | MIT | 1M context, 93.5% LiveCodeBench, permanently $0.87/1M. |

The open-weight gap to proprietary collapsed from 30+ points in late 2024 to just 7.4 points by early 2026 (CodeSota). Meanwhile, the price advantage grew from 14× to 34×, after peaking at 60× in August 2025.

The Contamination Crisis (February 2026)

No timeline is complete without the moment the industry's most trusted benchmark broke. In February 2026, OpenAI stopped reporting SWE-bench Verified scores after finding training data contamination across all frontier models. On the contamination-resistant SWE-bench Pro, Claude Opus 4.5 dropped from 80.9% (Verified) to 45.9%, and GPT-5 fell from ~55% to 23.3%. Gaps of 32 to 35 points weren't a capability cliff; they were the difference between memorization and genuine problem-solving. As CodeAnt put it: "Your codebase is not Django. Your bugs are not in scikit-learn's issue tracker."
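The size of the Verified-to-Pro drop can be worked out from the four scores quoted above. A minimal sketch, assuming GPT-5's approximate "~55%" is treated as exactly 55.0; `scores` and `verified_pro_gap` are illustrative names.

```python
# Scores quoted in the paragraph above.
# Assumption: GPT-5's "~55%" Verified score is treated as 55.0.
scores = {
    "Claude Opus 4.5": {"verified": 80.9, "pro": 45.9},
    "GPT-5": {"verified": 55.0, "pro": 23.3},
}


def verified_pro_gap(entry):
    """Return (point drop from Verified to Pro, fraction of the Verified score retained)."""
    drop = round(entry["verified"] - entry["pro"], 1)
    retained = round(entry["pro"] / entry["verified"], 2)
    return drop, retained


gaps = {model: verified_pro_gap(entry) for model, entry in scores.items()}

for model, (drop, retained) in gaps.items():
    print(f"{model}: -{drop} points, retains {retained:.0%} of its Verified score")
```

On these figures, Opus 4.5 loses 35.0 points and keeps about 57% of its Verified score, while GPT-5 loses 31.7 points and keeps about 42%.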

What Comes Next

  1. SWE-bench Verified will saturate. With the record at 88.6% and the gap between top models narrowing, it loses discriminative power within months. Pro and LiveCodeBench become the standards.
  2. The open-weight gap will close to zero. On the current trajectory, an open-weight model matches proprietary on SWE-bench Pro by Q3 2026.
  3. Price will diverge further. OpenAI/Anthropic charge more for reasoning depth. Open-weight gets cheaper through architecture. The spread could reach 100×.
  4. Release velocity keeps accelerating. The current 10-day gap between launches may shrink to a week. Selection will be about workflow fit, not raw capability.

The 26 Months That Changed Everything

In March 2024, AI coding was an experiment. One model broke 60%. No open-weight alternatives existed. Prices were prohibitive.

In May 2026, AI coding is infrastructure. The best model fixes 9 out of 10 real GitHub issues. Open-weight models deliver competitive coding at $0.87/1M — while flagship prices have doubled to $30. A heavy user builds a tiered stack for $200/month. A new model launches every 10 days.

The story isn't one trend; it's three: capabilities rising relentlessly (60.9% → 88.6%), proprietary prices flat or rising (Anthropic stable, OpenAI doubled), and open-weight prices collapsing (DeepSeek at $0.87). The AI coding revolution didn't happen gradually. It happened in 26 months, and it's still accelerating in every direction at once.


Sources: OpenLM.ai — SWE-bench Tracker | Vals.ai — SWE-bench Verified | Local AI Master — SWE-bench History | CodeAnt — SWE-bench Analysis | CodeSota — Open vs Closed Gap | DeepSeek — Permanent 75% Discount | TokenMix — Pricing History | AI Pricing Guru | OpenAI GPT-5.5 | Anthropic system cards. ⚠️ Verified considered contaminated by OpenAI (Feb 2026).