MCP Atlas Leaderboard

Multi-server tool orchestration: 36 real MCP servers, 1,000 tasks, claims-based scoring. The benchmark tests what AI agents actually have to do in practice: chain tools together correctly.

Last updated: June 20, 2026 · 🆕 Claude Fable 5 added (83.3%)

# | Model | Provider | Score (%) | License | $ per 1M output tokens | Released | Source
About MCP Atlas: 1,000-task benchmark across 36 real MCP servers. Tests multi-step tool orchestration — calling the right API, in the right order, with the right parameters. Scale's April 2026 update replaced a 20-turn limit with a 100 tool-call budget and claims-based scoring.
⚠️ Methodology note: Scores below come from different vendors using their own evaluation scaffolds. Scores from different sources are NOT directly comparable: an 83.6% from Google's harness versus an 83.3% from Anthropic's harness doesn't mean one model is definitively better. Within a single lab's table (e.g., Z.AI's cross-model table), scores ARE comparable.
🆕 Claude Fable 5 at 83.3%, from the Anthropic system card (June 9, 2026), up from Opus 4.8 at 82.2% (Anthropic harness). Gemini 3.5 Flash at 83.6%, from the Google model card, is the highest published MCP Atlas score from any vendor. GLM-5.2 at 77.0%, from the Z.AI table, is the highest open-weight score, 0.8 pts behind Opus 4.8 in the same evaluation.
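
One way to apply the methodology note when reading these numbers is to rank models only within the harness that produced each score. The sketch below is illustrative, not part of the benchmark: the function names `rank_within_source` and `comparable` are hypothetical, the harness labels are shorthand for the sources named above, and the Z.AI figure for Opus 4.8 (77.8) is not quoted directly but derived from the "0.8 pts behind" statement.

```python
from collections import defaultdict

# (model, score in %, source harness), taken from the highlights above.
PUBLISHED = [
    ("Gemini 3.5 Flash", 83.6, "Google"),
    ("Claude Fable 5", 83.3, "Anthropic"),
    ("Opus 4.8", 82.2, "Anthropic"),
    ("GLM-5.2", 77.0, "Z.AI"),
    # Assumption: derived from "GLM-5.2 ... 0.8 pts behind Opus 4.8
    # in the same evaluation"; the page does not quote this number.
    ("Opus 4.8", 77.8, "Z.AI"),
]


def rank_within_source(rows):
    """Group scores by harness and sort each group from best to worst."""
    by_source = defaultdict(list)
    for model, score, source in rows:
        by_source[source].append((model, score))
    return {
        source: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for source, entries in by_source.items()
    }


def comparable(row_a, row_b):
    """Two scores are directly comparable only if they share a harness."""
    return row_a[2] == row_b[2]


if __name__ == "__main__":
    for source, entries in rank_within_source(PUBLISHED).items():
        print(source)
        for model, score in entries:
            print(f"  {model}: {score:.1f}%")
```

Note that the same model (Opus 4.8) appears under two harnesses with different scores, which is why a single cross-vendor ranking can mislead.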


Sources & Links