Tutorials, deep dives and product notes — built for developers.
Claude Sonnet 5 vs GLM 5.2: near-ties on every benchmark (±0.6-2.7 pts). GLM 3.4× cheaper on output, MIT open-weight, self-hostable. Sonnet has OSWorld, BrowseComp, Anthropic safety ecosystem. Proprietary premium vs open-weight value.
Claude Sonnet 5 ($3/$15, June 30) beats GPT-5.5 ($5/$30, April 23) on every directly comparable benchmark: +4.6 SWE-bench Pro, +2.2 Terminal-Bench 2.1, +5.2 HLE with tools, at 40% lower input and 50% lower output prices. Full benchmark comparison.
Claude Sonnet 5 vs Sonnet 4.6: every benchmark, every gain. +13.4 Terminal-Bench 2.1, +10.6 HLE tools, +5.1 SWE-bench Pro, +223 GDPval (beats Opus 4.8). Same $3/$15 list price. Tokenizer caveat explained. Full comparison with bar charts, radar, and gains chart — all sourced from Anthropic's Sonnet 5 System Card.
Claude Sonnet 5 (63.2% SWE-bench Pro, $15/1M output) vs Opus 4.8 (69.2%, $25/1M). Sonnet 5 edges Opus on knowledge work (GDPval 1618 vs 1615), nearly ties on HLE with tools (57.4% vs 57.9%), and delivers 93% of Opus capability at 60% of the price. Full benchmark comparison from Anthropic's Sonnet 5 System Card.
Interactive MCP Atlas leaderboard: 10 AI models ranked by multi-server tool orchestration. Gemini 3.5 Flash leads at 83.6%. Claude Opus 4.8 at 77.8%. GLM-5.2 at 77.0%. MCP Atlas measures how well models chain tools across MCP servers — the benchmark for real-world agentic reliability.
GLM-5.2 vs GLM-5.1: the full sibling comparison. DeepSWE +28.2 (18.0→46.2), HMMT +9.9, GPQA +5.0, SWE-bench Pro +3.7. 200K→1M context (5×). Single→dual thinking modes. Anthropic API native. Same MIT license, same $4.40/1M. All data from Z.ai official blog.
Claude Opus 4.8 (69.2% SWE-bench Pro, $25/1M) leads Kimi K2.6 (58.6%, $4/1M) on every benchmark by 3-11 pts. But Kimi fights back with a narrow BrowseComp gap (-3.9), Agent Swarm (300 sub-agents), DeepSearchQA (92.5%), and a 6.25× lower price. Full comparison with real benchmark data, 10-point verdict.
GPT-5.5 dominates agentic coding (+14.2 Terminal-Bench, +4.4 SWE-bench Pro). Gemini 3.1 Pro wins on price (2.5× cheaper), reasoning (GPQA 94.3%), and multimodal breadth. Real benchmarks, pricing analysis, and a 9-point decision matrix for choosing the right enterprise model.
MiniMax M3 (59.0% SWE-bench Pro, $1.20/1M) beats GPT-5.5 (58.6%, $30/1M) on the hardest coding benchmark at 1/25 the cost. But GPT-5.5 dominates Terminal-Bench (+16.7), OSWorld (+8.7), GPQA and HLE. 1M context, native video, MSA architecture, open-weight vs proprietary. Full comparison.
Anthropic's new Mythos-class Fable 5 (80.3% SWE-bench Pro, $50/1M) vs the outgoing flagship Opus 4.8 (69.2%, $25/1M). Fable 5 dominates every benchmark, but costs 2× as much, hallucinates more, and sometimes falls back to Opus 4.8 anyway. Full 30-benchmark comparison.
The complete Claude Fable 5 review. Mythos-class for everyone. 80.3% SWE-bench Pro, 88.0% Terminal-Bench, 93.9% SWE-bench Verified. Stripe's 50M-line migration in a day. Karpathy: "major-version-bump-deserving." Simon Willison: "a beast." Safety classifiers, $10/$50 pricing, and why this is the biggest step toward AGI yet.
Claude Fable 5 ($50/1M) vs GPT-5.5 ($30/1M). Fable 5 leads all 8 coding benchmarks (+11.8 avg). GPT-5.5 counters with lower price and Batch/Flex at $15. 5× better Pro value from Fable 5. The definitive head-to-head comparison.