Terminal-Bench 2.1 Leaderboard

Command-line agentic coding. Real shell tasks — install packages, debug builds, configure servers, manage git repos. 30 models tracked with verified scores from official sources.

Last updated: June 9, 2026 · Sources: tbench.ai · DeepSeek · Google · Qwen · OpenAI · NVIDIA

# | Model | Provider | Score | Version | Size | License | $/1M Out | Harness | Source
About Terminal-Bench: Evaluates AI agents on real command-line tasks: package management, build systems, git, server config, file manipulation, and shell scripting. 🟢 2.1 = official score from tbench.ai. 🟡 2.0 = vendor-reported score from an official model card. Scores are NOT directly comparable across versions, because 2.1 is harder.
GPT-5.5 leads at 83.4% (Codex CLI, 2.1). GPT-5.3 Codex at 77.3% (2.0) beats the newer GPT-5.4 at 75.1% — terminal is Codex's home turf. GPT-5.4 Mini punches above its weight at 60.0%, just 1.4 points behind GPT-5.2.
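Because scores from different Terminal-Bench versions are not directly comparable, one reasonable way to read the numbers is to rank models only within a version and to decline to compute a gap across versions. The snippet below is a minimal sketch of that idea, not the site's own method. It uses only the scores quoted above; the names `Entry`, `rank_within_version`, and `gap` are hypothetical, and a version is recorded as `None` wherever the commentary does not state it.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    score: float             # percent of tasks solved
    version: Optional[str]   # Terminal-Bench version; None if not stated in the text


# Only the scores quoted in the commentary; the full table is not reproduced here.
ENTRIES = [
    Entry("GPT-5.5", 83.4, "2.1"),
    Entry("GPT-5.3 Codex", 77.3, "2.0"),
    Entry("GPT-5.4", 75.1, None),       # version not stated in the text
    Entry("GPT-5.4 Mini", 60.0, None),  # version not stated in the text
]


def rank_within_version(entries):
    """Group entries by benchmark version, then sort each group by score (highest first)."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.version].append(entry)
    return {
        version: sorted(group, key=lambda e: e.score, reverse=True)
        for version, group in groups.items()
    }


def gap(a, b):
    """Return the point gap a - b, or None when the versions differ or are unknown."""
    if a.version is None or b.version is None or a.version != b.version:
        return None
    return round(a.score - b.score, 1)


if __name__ == "__main__":
    for version, group in rank_within_version(ENTRIES).items():
        label = version if version is not None else "version not stated"
        print(label, [(e.model, e.score) for e in group])
```

Returning `None` from `gap` instead of a number makes a cross-version comparison an explicit decision for the caller, not a silent subtraction.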
📊 See also: SWE-bench Pro Leaderboard — real-world bug fixing  |  AI Pricing Calculator — compare costs
