Terminal-Bench 2.1 Leaderboard: Kimi K3 Added (2026)


Command-line agentic coding on real shell tasks: installing packages, debugging builds, configuring servers, and managing git repos. This page tracks models with published scores from official sources.

Last updated: July 17, 2026 · 🆕 Kimi K3 (88.3%) & Muse Spark 1.1 (80.0%) added · Sources: Moonshot AI · Meta · tbench.ai

Model | Provider | Score | Version | Size | License | $/1M output tokens | Harness | Source

Terminal-Bench Score Comparison

Bar length uses the full 0–100 scale.

About Terminal-Bench: Evaluates AI agents on real command-line tasks: package management, build systems, git, server configuration, file manipulation, and shell scripting. 🟢 2.1 = official tbench.ai or Anthropic-verified results. 🟡 2.0 = vendor-reported results from model cards. Scores are NOT directly comparable across versions because 2.1 is harder.
🆕 Kimi K3 scores 88.3% (2.1) with KimiCode at max reasoning, only 0.5 points behind GPT-5.6 Sol's 88.8% single-agent result. Source: Moonshot's official launch table. Pricing: $3/$15 per 1M tokens.
🆕 Muse Spark 1.1 scores 80.0% (2.1) in Meta's official evaluation. It costs $1.25/$4.25 per 1M tokens with $0.15 cached input.
Harness warning: Results combine vendor and leaderboard harnesses. Compare directionally unless the same evaluator and scaffold were used. Kimi K3's weights were promised by July 27 and were not yet downloadable as of this July 17 update.
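
The comparability rules in the notes above can be sketched in a few lines of Python. This is an illustrative sketch, not how the leaderboard itself is built. The `Result` type and the labels "direct", "directional", and "not comparable" are hypothetical names chosen for this example. GPT-5.6 Sol's version is assumed to be 2.1 because the text compares it directly with Kimi K3, and any harness the page does not name is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    model: str
    score: float             # percent of tasks solved
    version: str             # Terminal-Bench version, e.g. "2.1"
    harness: Optional[str]   # agent scaffold; None when the page does not name it


# Scores quoted in the notes above. The version for GPT-5.6 Sol is an
# assumption (see lead-in); unnamed harnesses stay None.
RESULTS = [
    Result("Kimi K3", 88.3, "2.1", "KimiCode"),
    Result("GPT-5.6 Sol", 88.8, "2.1", None),
    Result("Muse Spark 1.1", 80.0, "2.1", None),
]


def comparability(a: Result, b: Result) -> str:
    """Classify how far two scores can be compared."""
    if a.version != b.version:
        return "not comparable"   # 2.0 and 2.1 scores should not be mixed
    if a.harness is None or b.harness is None or a.harness != b.harness:
        return "directional"      # same version, different or unknown scaffold
    return "direct"               # same version and same named harness


def gap_points(a: Result, b: Result) -> float:
    """Score difference in percentage points, rounded to one decimal."""
    return round(a.score - b.score, 1)


if __name__ == "__main__":
    kimi, sol, _ = RESULTS
    print(comparability(kimi, sol))   # directional
    print(gap_points(sol, kimi))      # 0.5
```

Treating an unknown harness as "directional" rather than "direct" is the conservative choice here: a score only counts as directly comparable when both entries name the same scaffold.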
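
To turn the per-token prices quoted above into a rough cost per run, a small calculation helps. The sketch below assumes each price pair reads as input/output per 1M tokens (the page does not label the pair explicitly) and that cached tokens are a subset of input tokens. The `Pricing` type, the `run_cost` function, and the token counts in the usage example are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    input_per_m: float                          # USD per 1M input tokens (assumed order)
    output_per_m: float                         # USD per 1M output tokens (assumed order)
    cached_input_per_m: Optional[float] = None  # None when no cached rate is quoted


# Prices as quoted in the notes above; the input/output reading is an assumption.
PRICING = {
    "Kimi K3": Pricing(3.00, 15.00),
    "Muse Spark 1.1": Pricing(1.25, 4.25, 0.15),
}


def run_cost(p: Pricing, input_tokens: int, output_tokens: int,
             cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one run; cached_tokens is part of input_tokens."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    # Fall back to the normal input rate when no cached rate is quoted.
    cached_rate = p.cached_input_per_m if p.cached_input_per_m is not None else p.input_per_m
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * p.input_per_m
             + cached_tokens * cached_rate
             + output_tokens * p.output_per_m)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens and 1M output tokens per model.
    for name, pricing in PRICING.items():
        print(f"{name}: ${run_cost(pricing, 1_000_000, 1_000_000):.2f}")
```

Agentic terminal runs often resend long shell transcripts as context, so the cached-input rate can matter more than the headline price; passing `cached_tokens` shows how much of the input bill it offsets.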

