Two models launched 42 days apart. Claude Sonnet 5, Anthropic's new mid-tier model, arrived June 30 with 63.2% on SWE-bench Pro at $3/$15 per million input/output tokens. Qwen 3.7 Max, Alibaba's "Agent Frontier" model, arrived May 19 with 60.6% on SWE-bench Pro at $1.25/$3.75. On shared benchmarks, Sonnet leads by slim margins (+2.6 Pro, +4.8 Verified, +1.8 HLE). But Qwen fights back with top-tier science and math scores (GPQA Diamond 92.4%, HMMT 97.1%), a demonstrated 35-hour autonomous run, and output pricing that is 2.7× cheaper than Sonnet's introductory rate (4× cheaper than its standard rate). One model writes your code. The other runs marathons. Here's the complete comparison, sourced from Anthropic's Sonnet 5 System Card and Qwen's official blog. Test both on CodingFleet.

TL;DR — Sonnet 5 vs Qwen 3.7 Max

  • Sonnet leads shared coding benchmarks: +2.6 Pro, +4.8 Verified, +1.8 HLE. Slim but consistent.
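  • Qwen posts top-tier math and science scores: 92.4% GPQA Diamond, 97.1% HMMT, 98.3% AIME. Anthropic published none of these for Sonnet 5, so there is no head-to-head.
  • Qwen is 2.7× cheaper on output against Sonnet's introductory price ($3.75 vs $10) and 4× cheaper against the standard $15. On input it is 2.4× cheaper at standard pricing ($1.25 vs $3).
  • Qwen, the Agent Frontier: a 35-hour autonomous kernel optimization run and a 96% Kernel Bench win rate. Its long-run stamina is the better documented of the two.
  • Sonnet, the Anthropic ecosystem: OSWorld 81.2%, BrowseComp 84.7%, a 145-page System Card, and native Claude Code support.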
  • Terminal-Bench versions differ: Sonnet 80.4% on TB 2.1 vs Qwen 69.7% on TB 2.0. Not directly comparable.

Head-to-Head: Shared Benchmarks

| Benchmark | Claude Sonnet 5 | Qwen 3.7 Max | Winner |
|---|---|---|---|
| SWE-bench Pro | 63.2% | 60.6% | Sonnet (+2.6) |
| SWE-bench Verified | 85.2% | 80.4% | Sonnet (+4.8) |
| HLE (no tools) | 43.2% | 41.4% | Sonnet (+1.8) |

Sonnet 5 from Anthropic System Card. Qwen 3.7 Max from Qwen Official Blog and DataCamp analysis. Both vendor-reported. Cross-vendor comparisons are directional — different scaffolds introduce uncertainty. Terminal-Bench excluded: Sonnet uses TB 2.1, Qwen uses TB 2.0 (different task sets, not comparable).

[Figure: Sonnet 5 vs Qwen 3.7 Max shared benchmarks, bar chart]
Claude Sonnet 5 (indigo) leads Qwen 3.7 Max (red) on all three shared benchmarks. The margins are consistent but modest: +2.6 on Pro, +4.8 on Verified, +1.8 on HLE. Sonnet is the better coder, but not by a landslide. Qwen holds its own for a model whose output costs 2.7× less than Sonnet's introductory rate.

Coding: Sonnet's Modest Edge

On SWE-bench Pro — the benchmark that matters most for production coding — Sonnet 5 leads 63.2% to 60.6%. That's a 2.6-point gap. On SWE-bench Verified (the classic 500-problem set): Sonnet at 85.2% vs Qwen at 80.4% (+4.8). On HLE without tools (raw reasoning): Sonnet at 43.2% vs Qwen at 41.4% (+1.8).

These are real leads, but they're not decisive. A 2.6-point Pro gap on 1,865 tasks is ~48 tasks. For most development workflows, both models will handle the vast majority of coding tasks competently. Sonnet gets the edge on the hardest bugs; Qwen gets the edge on affordability.
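The task-count estimate is simply the percentage-point gap multiplied by the benchmark size. A minimal sketch, taking the 1,865-task figure from the text as given; the function name is illustrative:

```python
def gap_in_tasks(score_a_pct: float, score_b_pct: float, n_tasks: int) -> float:
    """Convert a percentage-point gap into an approximate number of tasks."""
    return (score_a_pct - score_b_pct) / 100 * n_tasks


# SWE-bench Pro scores and task count as quoted in this article.
tasks = gap_in_tasks(63.2, 60.6, 1865)
print(f"Sonnet solves roughly {tasks:.0f} more tasks")
```

Seen this way, the gap is a few dozen tasks out of nearly two thousand, which is why the lead reads as real but not decisive.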

The Terminal-Bench comparison is unfortunately apples-to-oranges. Sonnet 5 scores 80.4% on Terminal-Bench 2.1 (the current standard, same harness used for Opus 4.8 and GPT-5.5). Qwen 3.7 Max scores 69.7% on Terminal-Bench 2.0 (an older, different task set). The version gap makes direct comparison unreliable. DataCamp's analysis of Qwen's results: "This benchmark tests autonomous terminal-based software engineering with a 5-hour timeout and 12 CPU cores. On SWE-Pro, it scores 60.6, the highest in the comparison table." Note that "highest in the comparison table" refers to DataCamp's own table, which cannot include Sonnet 5's 63.2%.
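One way to keep mismatched versions out of a comparison is to make the benchmark version part of the score record and refuse to subtract across versions. The sketch below is illustrative only; the `Score` class and `margin` function are hypothetical names, and the numbers are the ones quoted in this article:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str
    version: str  # empty string when the article gives no version
    pct: float


def margin(a: Score, b: Score) -> float:
    """Return a's lead over b in points, refusing mismatched benchmark versions."""
    if (a.benchmark, a.version) != (b.benchmark, b.version):
        raise ValueError(
            f"not comparable: {a.benchmark} {a.version} vs {b.benchmark} {b.version}"
        )
    return round(a.pct - b.pct, 1)


sonnet_pro = Score("Claude Sonnet 5", "SWE-bench Pro", "", 63.2)
qwen_pro = Score("Qwen 3.7 Max", "SWE-bench Pro", "", 60.6)
sonnet_tb = Score("Claude Sonnet 5", "Terminal-Bench", "2.1", 80.4)
qwen_tb = Score("Qwen 3.7 Max", "Terminal-Bench", "2.0", 69.7)

print(margin(sonnet_pro, qwen_pro))  # same benchmark, so a margin is returned
try:
    margin(sonnet_tb, qwen_tb)  # TB 2.1 vs TB 2.0 is rejected
except ValueError as err:
    print(err)
```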

Math: Qwen's Dominance

This is where the comparison flips. Qwen 3.7 Max is a math powerhouse:

| Math Benchmark | Claude Sonnet 5 | Qwen 3.7 Max |
|---|---|---|
| GPQA Diamond (PhD science) | not published | 92.4% |
| HMMT Feb 2026 (competition math) | not published | 97.1% |
| AIME 2026 | not published | 98.3% |
| IMOAnswerBench | not published | 90.0% |
| LiveCodeBench (competitive coding) | not published | 91.6% |
| USAMO 2026 (proof-based) | 79.5% | not published |

GPQA Diamond at 92.4% is elite, competitive with Opus 4.8 and GPT-5.5. HMMT at 97.1% and AIME at 98.3% are near-ceiling scores. For math-heavy coding (scientific computing, algorithmic work, formal verification), the published numbers favor Qwen 3.7 Max. Sonnet 5's USAMO score (79.5%) is impressive in its own right, but Anthropic didn't publish GPQA Diamond or competition math benchmarks for this model; the System Card focuses on agentic coding and safety instead.

Agentic Autonomy: Qwen's 35-Hour Marathon

This is Qwen's signature capability — and it's something no other model in this class can claim. Yotta Labs' coverage of Qwen's launch:

"A 35-hour autonomous kernel optimization run on T-Head ZW-M890 PPUs. Across 432 kernel evaluations and 1,158 tool calls, the model finished with a 10x geometric mean speedup. Most agent models stop making progress after a few hours. Qwen 3.7-Max sustained meaningful progress past 30 hours."

And during an 86-hour RL training session, the model autonomously flagged 1,618 reward hacking cases and added 13 new heuristic rules to its own training loop. Kernel Bench L3: 96% win rate. This is the "Agent Frontier" label in action — Qwen 3.7 Max is built for unattended, long-horizon autonomous operation.

Sonnet 5's agentic capabilities are strong (80.4% TB 2.1, 81.2% OSWorld), but Anthropic hasn't demonstrated the kind of multi-hour autonomous endurance that Qwen has proven. If your workflow involves overnight agent runs or continuous autonomous optimization, Qwen's track record matters.

Pricing: Qwen is 2.7× Cheaper on Output

[Figure: Sonnet 5 vs Qwen 3.7 Max pricing comparison]
Qwen 3.7 Max at $1.25/$3.75 (50% promotional pricing through June 22, 2026) is 2.4× cheaper on input and 4× cheaper on output vs Sonnet 5 standard pricing ($3/$15). Even vs Sonnet's introductory $2/$10, Qwen is still 1.6× cheaper on input and 2.7× cheaper on output.
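The ratios can be checked directly from the listed prices. A minimal sketch, using the per-million-token prices quoted in this article (including Qwen's stated standard rate of $2.50/$7.50); the dictionary keys are illustrative labels, not official plan names:

```python
# USD per million tokens (input, output), as quoted in this article.
PRICES = {
    "qwen_promo": (1.25, 3.75),
    "qwen_standard": (2.50, 7.50),
    "sonnet_intro": (2.00, 10.00),
    "sonnet_standard": (3.00, 15.00),
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input_ratio, output_ratio) of `expensive` over `cheap`."""
    e_in, e_out = PRICES[expensive]
    c_in, c_out = PRICES[cheap]
    return round(e_in / c_in, 2), round(e_out / c_out, 2)


for sonnet in ("sonnet_intro", "sonnet_standard"):
    for qwen in ("qwen_promo", "qwen_standard"):
        print(f"{sonnet} vs {qwen}: {price_ratio(sonnet, qwen)}")
```

A ratio below 1.0 means the Sonnet tier is the cheaper one on that axis, which is worth watching once Qwen's promotion ends.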

For a workload of 10M input + 1M output tokens per day:

| Model | Daily Cost | Monthly Cost |
|---|---|---|
| Qwen 3.7 Max | $16.25 | ~$488 |
| Sonnet 5 (introductory) | $30.00 | ~$900 |
| Sonnet 5 (standard) | $45.00 | ~$1,350 |

Assumes 90% input / 10% output split, 0.5× thinking multiplier, no caching. Qwen promotional pricing.

At $488/month for Qwen vs $1,350/month for Sonnet 5 standard, the annual difference is ~$10,300. That's real money for teams running production agents.
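The workload figures can be reproduced with straight multiplication of token volume by list price. The sketch below assumes a 30-day month and applies no caching or thinking adjustments, since that is what matches the table; the function and variable names are illustrative:

```python
def daily_cost(input_mtok: float, output_mtok: float,
               input_price: float, output_price: float) -> float:
    """Cost in USD for one day, with volumes in millions of tokens."""
    return input_mtok * input_price + output_mtok * output_price


WORKLOAD = (10, 1)  # 10M input + 1M output tokens per day
PLANS = {
    "Qwen 3.7 Max (promo)": (1.25, 3.75),
    "Sonnet 5 (introductory)": (2.00, 10.00),
    "Sonnet 5 (standard)": (3.00, 15.00),
}

costs = {name: daily_cost(*WORKLOAD, *prices) for name, prices in PLANS.items()}
for name, day in costs.items():
    print(f"{name}: ${day:.2f}/day, ${day * 30:,.2f} per 30 days")

# Twelve 30-day months, Sonnet standard minus Qwen promo.
annual_gap = (costs["Sonnet 5 (standard)"] - costs["Qwen 3.7 Max (promo)"]) * 30 * 12
print(f"annual gap: ${annual_gap:,.0f}")
```

Swapping in your own token volumes is the quickest way to see whether the gap matters for your workload.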

The Radar: Coding vs Math + Autonomy

[Figure: Sonnet 5 vs Qwen 3.7 Max strengths, radar chart]
Sonnet 5 (solid indigo) leads the coding axes (Pro, Verified, HLE). Qwen 3.7 Max (dashed red) dominates math, agentic autonomy, and cost efficiency. The shapes are sharply complementary: Sonnet is a coding spike; Qwen is a broad autonomous agent platform. Notes: Sonnet's GPQA Diamond score is not published; the cost axis is the normalized inverse of output price.

Where They Don't Overlap

| Benchmark | Claude Sonnet 5 | Qwen 3.7 Max |
|---|---|---|
| OSWorld-Verified (computer use) | 81.2% | not published |
| BrowseComp (agentic search) | 84.7% | not published |
| GDPval-AA v2 (knowledge work) | 1618 Elo | not published |
| MCP Atlas (tool orchestration) | not published | 76.4% (Qwen harness) |
| GPQA Diamond | not published | 92.4% |
| Kernel Bench L3 | not published | 96% WR |
| LiveCodeBench | not published | 91.6% |

Specification Comparison

| Feature | Claude Sonnet 5 | Qwen 3.7 Max |
|---|---|---|
| Provider | Anthropic (San Francisco) | Alibaba / Qwen (Hangzhou) |
| Released | June 30, 2026 | May 19, 2026 |
| License | Proprietary | Proprietary |
| Context Window | 1,000,000 tokens | 1,000,000 tokens |
| Max Output | 128K (300K batch) | 128K |
| Thinking | Adaptive (effort levels) | Thinking model (reasoning) |
| Multimodal | Text + Image input | Text only |
| API Compatibility | Native Anthropic API | OpenAI + Anthropic compatible |
| Input Price | $2 intro / $3 std | $1.25 (50% promo)* |
| Output Price | $10 intro / $15 std | $3.75 (50% promo)* |
| Agentic Highlight | Claude Code ecosystem | 35-hour autonomous runs |

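*Qwen 3.7 Max promotional pricing: 50% off through June 22, 2026. Standard pricing: $2.50/$7.50, which is 1.2× cheaper on input and 2× cheaper on output than Sonnet 5 standard. Qwen 3.7 Max is compatible with both the OpenAI and Anthropic APIs. Sources: Qwen Official Blog, Yotta Labs.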

Should You Use Sonnet 5 or Qwen 3.7 Max?

| If you... | Decision |
|---|---|
| Do heavy coding / Claude Code daily | 🔷 Sonnet 5. +2.6 Pro, +4.8 Verified. Better coder. |
| Need the cheapest near-frontier coding | ✅ Qwen 3.7 Max. 2.7× cheaper output vs Sonnet's introductory rate. ~$488/mo vs $900 (intro) or $1,350 (standard). |
| Do math-heavy work / scientific computing | ✅ Qwen 3.7 Max. 92.4% GPQA Diamond, 97.1% HMMT. |
| Run long-horizon autonomous agents | ✅ Qwen 3.7 Max. 35-hour run demonstrated. 96% Kernel Bench. |
| Need computer use / GUI automation | 🔷 Sonnet 5. 81.2% OSWorld. Qwen unpublished. |
| Value safety transparency | 🔷 Sonnet 5. 145-page System Card. |
| Want Anthropic ecosystem integration | 🔷 Sonnet 5. Claude Code, Cowork, claude.ai default. |
| Need competitive coding / algorithms | ✅ Qwen 3.7 Max. 91.6% LiveCodeBench. |
| Run both strategically (routing) | ✅ Best of both. Sonnet for code, Qwen for math + autonomy. |
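The routing option in the last row can be sketched as a lookup from task type to model. This is an illustration of the decision table only, not a recommended production router; the task categories, the `pick_model` function, and the model identifier strings are all hypothetical placeholders rather than real API model names:

```python
from enum import Enum


class Task(Enum):
    CODING = "coding"
    GUI_AUTOMATION = "gui_automation"
    MATH = "math"
    LONG_HORIZON_AGENT = "long_horizon_agent"
    COMPETITIVE_CODING = "competitive_coding"


# Placeholder identifiers, not real API model names.
SONNET = "claude-sonnet-5"
QWEN = "qwen-3.7-max"

# Defaults taken from the decision table in this article.
ROUTES = {
    Task.CODING: SONNET,
    Task.GUI_AUTOMATION: SONNET,
    Task.MATH: QWEN,
    Task.LONG_HORIZON_AGENT: QWEN,
    Task.COMPETITIVE_CODING: QWEN,
}


def pick_model(task: Task, budget_sensitive: bool = False) -> str:
    """Choose a model per the table; cost pressure moves general coding to Qwen."""
    if budget_sensitive and task is Task.CODING:
        return QWEN
    return ROUTES[task]


print(pick_model(Task.CODING))
print(pick_model(Task.CODING, budget_sensitive=True))
print(pick_model(Task.MATH))
```

Keeping the routes in a plain dictionary makes it easy to revise the mapping after running both models on your own tasks.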

Conclusion: The Coder vs The Marathon Runner

Claude Sonnet 5 and Qwen 3.7 Max represent two different answers to the question "what should a near-frontier model excel at?" Sonnet 5 is the better coder — +2.6 Pro, +4.8 Verified, deep Anthropic ecosystem integration, and the safety infrastructure that enterprises trust. It's the model you want reviewing your pull requests.

Qwen 3.7 Max is the better autonomous agent: a demonstrated 35-hour run, top-tier math scores, output priced 2.7× below Sonnet's introductory rate, and OpenAI/Anthropic API compatibility. It's the model you want running overnight optimization jobs and tackling algorithmic heavy lifting.

Qwen's positioning: "The Agent Frontier." Anthropic's positioning: "The best combination of speed and intelligence." Both true. The 2.6-point Pro gap makes Sonnet the better coder. The output price gap (2.7× at Sonnet's introductory rate, 4× at standard) makes Qwen the better value. Your choice depends on whether you value marginal coding quality or substantial cost savings, and whether you need a model that writes code or one that runs marathons.

🔬 Side-by-Side Test

Run Claude Sonnet 5 and Qwen 3.7 Max on your own code. Coder vs marathon runner — your benchmarks are the only ones that matter.
