MCP Atlas Leaderboard
Multi-server tool orchestration. 36 real MCP servers, 1,000 tasks, claims-based scoring. The benchmark that tests what AI agents actually do — chain tools together correctly.
Last updated: June 20, 2026 · 🆕 Claude Fable 5 added (83.3%)
⚠️ Methodology note: Scores below come from different vendors, each using its own evaluation scaffold. Scores from different sources are NOT directly comparable: an 83.6% from Google's harness versus an 83.3% from Anthropic's harness doesn't mean one model is definitively better. Within a single lab's table (e.g., Z.AI's cross-model table), scores ARE comparable.
🆕 Claude Fable 5 scores 83.3% per the Anthropic system card (June 9, 2026), up from Opus 4.8 at 82.2% on the Anthropic harness. Gemini 3.5 Flash scores 83.6% per the Google model card, the highest published MCP Atlas score from any vendor. GLM-5.2 scores 77.0% in the Z.AI table, the highest open-weight result and 0.8 pts behind Opus 4.8 in that same Z.AI evaluation.
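To make the comparability rule in the methodology note concrete, here is a minimal Python sketch that only compares two models when both were scored on the same harness. The function names (`rank_within_harness`, `score_gap`) are hypothetical, and the data is limited to the figures quoted above. The Opus 4.8 score on the Z.AI harness (77.8%) is not stated directly; it is inferred from "0.8 pts behind Opus 4.8" and should be treated as an assumption.

```python
from collections import defaultdict

# (model, harness, score in %) as quoted in the notes above.
SCORES = [
    ("Claude Fable 5", "Anthropic", 83.3),
    ("Opus 4.8", "Anthropic", 82.2),
    ("Gemini 3.5 Flash", "Google", 83.6),
    ("GLM-5.2", "Z.AI", 77.0),
    # Assumption: inferred from "GLM-5.2 ... 0.8 pts behind Opus 4.8".
    ("Opus 4.8", "Z.AI", 77.8),
]


def rank_within_harness(scores):
    """Group scores by harness and sort each group from best to worst."""
    by_harness = defaultdict(list)
    for model, harness, score in scores:
        by_harness[harness].append((model, score))
    return {
        harness: sorted(rows, key=lambda row: row[1], reverse=True)
        for harness, rows in by_harness.items()
    }


def score_gap(scores, model_a, model_b):
    """Return {harness: score_a - score_b} for harnesses that scored both models.

    An empty dict means the two models share no harness, so no
    like-for-like comparison is available.
    """
    table = defaultdict(dict)
    for model, harness, score in scores:
        table[harness][model] = score
    return {
        harness: round(models[model_a] - models[model_b], 1)
        for harness, models in table.items()
        if model_a in models and model_b in models
    }


if __name__ == "__main__":
    for harness, rows in rank_within_harness(SCORES).items():
        print(harness, rows)
    print(score_gap(SCORES, "Claude Fable 5", "Opus 4.8"))
    print(score_gap(SCORES, "Claude Fable 5", "Gemini 3.5 Flash"))
```

With this data, Claude Fable 5 and Gemini 3.5 Flash share no harness, so `score_gap` returns an empty result for that pair instead of a misleading 0.3-point difference. Opus 4.8 also appears with different scores on two harnesses, which illustrates why cross-source numbers should not be ranked together.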
Sources & Links
- Anthropic System Card — Claude Fable 5 & Mythos 5 — Fable 5: 83.3%, Opus 4.8: 82.2% (Anthropic harness)
- Google Model Card — Gemini 3.5 Flash — 83.6% (Google harness)
- Z.AI Cross-Model Table (VentureBeat) — GLM-5.2, Opus 4.8, GPT-5.5, GLM-5.1, Gemini 3.1 Pro (Z.AI harness)
- Qwen Official Blog — Qwen 3.7 Max (76.4%), Opus 4.6 (Qwen harness)
- VentureBeat — MiniMax M3 analysis
- DeepSeek V4 Pro Model Card
- Scale SEAL MCP Atlas Leaderboard — standardized evaluation methodology
- Zvi Mowshowitz — Fable 5 & Mythos 5 capabilities