Terminal-Bench 2.1 Leaderboard

A benchmark of command-line agentic coding on real shell tasks: installing packages, debugging builds, configuring servers, and managing git repos. 31 models are tracked, with verified scores from official sources.

Last updated: June 17, 2026 · 🆕 Claude Fable 5 added at #1 (88.0%) · Sources: Anthropic · tbench.ai · DeepSeek · SWE-bench Pro

Columns: # · Model · Provider · Score · Version · Size · License · $/1M output tokens · Harness · Source
About Terminal-Bench: The benchmark evaluates AI agents on real command-line tasks: package management, build systems, git, server configuration, file manipulation, and shell scripting. 🟢 2.1 = official tbench.ai or Anthropic-verified. 🟡 2.0 = vendor-reported from model cards. Scores are not directly comparable across versions, because 2.1 is harder.
🆕 Claude Fable 5 leads at 88.0% (2.1), the first model to break 85% and 4.6 points ahead of GPT-5.5 (83.4%). The score is Anthropic-reported. Opus 4.8 has also been updated to 82.7%, which is Anthropic's TB 2.1 number.
GPT-5.5 at 83.4% (Codex CLI, 2.1) remains the strongest non-Mythos model on terminal tasks. GPT-5.3 Codex (77.3%, 2.0) still beats GPT-5.4 on terminal tasks, which suggests that Codex specialization holds.
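
Because scores are not comparable across versions, a ranking is only meaningful within a single version. The sketch below is illustrative and not part of the leaderboard itself: it groups the scores quoted in the notes above by version and refuses to compute a gap across versions. The names `SCORES`, `rank_by_version`, and `gap` are hypothetical helpers introduced here.

```python
from collections import defaultdict

# (model, score in percent, Terminal-Bench version), copied from the notes above.
SCORES = [
    ("Claude Fable 5", 88.0, "2.1"),
    ("GPT-5.5", 83.4, "2.1"),
    ("Opus 4.8", 82.7, "2.1"),
    ("GPT-5.3 Codex", 77.3, "2.0"),
]


def rank_by_version(scores):
    """Group entries by benchmark version and sort each group, highest score first."""
    groups = defaultdict(list)
    for model, score, version in scores:
        groups[version].append((model, score))
    return {
        version: sorted(rows, key=lambda row: row[1], reverse=True)
        for version, rows in groups.items()
    }


def gap(scores, leader, runner_up):
    """Point gap between two models; raises if they were scored on different versions."""
    by_name = {model: (score, version) for model, score, version in scores}
    (s1, v1), (s2, v2) = by_name[leader], by_name[runner_up]
    if v1 != v2:
        raise ValueError("scores from different versions are not comparable")
    return round(s1 - s2, 1)


rankings = rank_by_version(SCORES)
for version, rows in sorted(rankings.items(), reverse=True):
    print(version, rows)
```

Calling `gap(SCORES, "Claude Fable 5", "GPT-5.5")` reproduces the 4.6-point lead quoted above, while `gap(SCORES, "GPT-5.5", "GPT-5.3 Codex")` raises, since one score is from 2.1 and the other from 2.0.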
