Terminal-Bench 2.1 Leaderboard
Command-line agentic coding on real shell tasks: installing packages, debugging builds, configuring servers, and managing git repos. 30 models are tracked, with verified scores from official sources.
Last updated: June 9, 2026 · Sources: tbench.ai · DeepSeek · Google · Qwen · OpenAI · NVIDIA
GPT-5.5 leads at 83.4% (Codex CLI, Terminal-Bench 2.1). GPT-5.3 Codex at 77.3% (Terminal-Bench 2.0) beats the newer GPT-5.4 at 75.1%, which suggests the terminal is Codex's home turf. GPT-5.4 Mini punches above its weight at 60.0%, just 1.4 points behind GPT-5.2.
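To make the gaps in the summary above easy to check, here is a minimal Python sketch that ranks the quoted scores and prints each model's distance from the leader. The `scores` dictionary and the `gap` helper are illustrative names, not part of any leaderboard API. The GPT-5.2 figure is not quoted directly; it is derived from the stated 1.4-point gap. Scores reported on different benchmark versions (2.0 vs 2.1) may not be strictly comparable, so treat the differences as rough.

```python
# Scores quoted in the summary (percent of tasks solved).
scores = {
    "GPT-5.5": 83.4,
    "GPT-5.3 Codex": 77.3,
    "GPT-5.4": 75.1,
    "GPT-5.4 Mini": 60.0,
}

# Derived, not quoted: the summary places GPT-5.4 Mini 1.4 points behind GPT-5.2.
scores["GPT-5.2"] = round(scores["GPT-5.4 Mini"] + 1.4, 1)


def gap(a: str, b: str) -> float:
    """Percentage-point difference scores[a] - scores[b], rounded to one decimal."""
    return round(scores[a] - scores[b], 1)


# Rank from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    behind = gap(ranked[0], name)
    print(f"{name:<14} {scores[name]:5.1f}%  ({behind:.1f} points behind leader)")
```

Calling `gap("GPT-5.3 Codex", "GPT-5.4")` gives the margin by which the older Codex model stays ahead of GPT-5.4.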