SWE-bench Pro Leaderboard: Muse Spark 1.1 Added (2026)

SWE-bench Pro Leaderboard

The hardest coding benchmark for AI. Real GitHub issues, multi-file diffs, production repositories — not memorized answers. Models ranked by published scores.

Last updated: July 17, 2026 · 🆕 Muse Spark 1.1 (61.5%) & Kimi K3 added

Columns: Model · Provider · SWE-bench Pro (%) · SWE-bench Verified (%) · Size · License · $/1M output tokens · Released · Source

SWE-bench Pro Score Comparison

Published Pro scores on a consistent 0–100 scale. Models without a score are omitted.

About SWE-bench Pro: Tests whether an AI model can resolve real GitHub issues end-to-end. Unlike SWE-bench Verified (contaminated — see OpenAI's Feb 2026 withdrawal), Pro uses actively maintained repositories with no public ground-truth leakage.
🆕 Muse Spark 1.1: Meta reports 61.5% on SWE-bench Pro — above Qwen 3.7 Max (60.6%) in this cross-source table, at $1.25/$4.25 per 1M tokens.
🆕 Kimi K3: Moonshot did not publish a SWE-bench Pro result in its July 16 launch table, so it is listed with “—”. The 58.4% figure sometimes attributed to it online is unrelated; in this dataset, 58.4% is GLM-5.1's score.
Scores: Vendor-reported unless otherwise noted. “—” means not published. Cross-vendor harnesses can differ, so treat small gaps as directional.
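One way to read the notes above is as a small ranking routine: drop models with no published score, sort the rest, and mark any adjacent gap too small to treat as more than directional. The sketch below uses only the four scores quoted in these notes. The 1.0-point tolerance and all function names are illustrative assumptions, not part of the leaderboard's methodology.

```python
from typing import Optional

# Scores quoted in the notes above; None marks "not published".
PRO_SCORES: dict[str, Optional[float]] = {
    "Muse Spark 1.1": 61.5,
    "Qwen 3.7 Max": 60.6,
    "GLM-5.1": 58.4,
    "Kimi K3": None,
}

# Hypothetical tolerance in percentage points. The leaderboard does not
# define one; it only says small gaps are directional.
DIRECTIONAL_GAP = 1.0


def rank_published(scores: dict[str, Optional[float]]) -> list[tuple[str, float]]:
    """Omit models without a score, then sort highest first."""
    published = [(model, s) for model, s in scores.items() if s is not None]
    return sorted(published, key=lambda pair: pair[1], reverse=True)


def adjacent_gaps(
    ranked: list[tuple[str, float]], tolerance: float = DIRECTIONAL_GAP
) -> list[tuple[str, str, float, bool]]:
    """For each neighbouring pair, return (higher, lower, gap, is_directional)."""
    rows = []
    for (hi_model, hi_score), (lo_model, lo_score) in zip(ranked, ranked[1:]):
        gap = round(hi_score - lo_score, 1)
        rows.append((hi_model, lo_model, gap, gap < tolerance))
    return rows


ranked = rank_published(PRO_SCORES)
gaps = adjacent_gaps(ranked)

for higher, lower, gap, directional in gaps:
    label = "directional only" if directional else "clearer gap"
    print(f"{higher} vs {lower}: {gap} pts ({label})")
```

With these inputs, Kimi K3 drops out of the ranking, and the 0.9-point gap between Muse Spark 1.1 and Qwen 3.7 Max falls under the assumed tolerance. A different tolerance would change which gaps are flagged.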

Test these models on real code

20+ LLMs on CodingFleet. Side-by-side testing on your own repos. Benchmarks are a compass, not a map.
