Terminal-Bench 2.1 Leaderboard
Command-line agentic coding. Real shell tasks — install packages, debug builds, configure servers, manage git repos. 31 models tracked with verified scores from official sources.
Last updated: June 17, 2026 · 🆕 Claude Fable 5 added at #1 (88.0%) · Sources: Anthropic · tbench.ai · DeepSeek · SWE-bench Pro →
🆕 Claude Fable 5 leads Terminal-Bench 2.1 at 88.0% (Anthropic-reported), the first model to break 85% and 4.6 points ahead of GPT-5.5 (83.4%). Opus 4.8 is also updated to 82.7%, Anthropic's reported TB 2.1 number.
GPT-5.5 at 83.4% (Codex CLI, TB 2.1) remains the strongest non-Mythos model on terminal tasks. GPT-5.3 Codex (77.3%, TB 2.0) still beats GPT-5.4 on terminal tasks, so the Codex specialization holds.