SQL is the most-used programming language on Earth after JavaScript — and the one where AI benchmarks are most misleading. Models score 85–92% on Spider 1.0 but collapse to 6–21% on Spider 2.0 (enterprise-scale). The BIRD benchmark is the only one that matters — and Gemini dominates it. But text-to-SQL is just one piece of database coding. ORM queries (Django, SQLAlchemy), migration scripts, indexing, stored procedures — these are the tasks developers actually do. Here's the definitive guide to which AI model is best for every database task, using BIRD scores, proxy benchmarks, and real-world developer testing. Generate SQL with all these models on CodingFleet's SQL Code Generator.
📊 Key Findings
- Gemini dominates text-to-SQL. Gemini-SQL + Gemini 2.5 Pro: 77.14% on BIRD test. One Reddit tester measured Gemini 2.0 Flash at a 92.5% success rate with 40× better cost-performance than Claude 3.7 Sonnet. Google has presumably trained on massive SQL corpora, and it shows.
- Claude Opus 4.8 is best for ORM queries. 69.2% on SWE-bench Pro, used here as a proxy: in the related SWE-bench Verified set, Django accounts for 46% of tasks. When you're writing SQLAlchemy or Django queries embedded in Python, Claude's multi-file coding lead transfers directly.
- GPT-5.5 is best for database administration. 78.2% Terminal-Bench — the closest proxy for pg_dump, index rebuilds, migration execution, and CLI database workflows.
- Spider 1.0 is dead. 85–92% saturation across all frontier models. Like HumanEval. The BIRD benchmark is the SWE-bench of SQL — and Spider 2.0 (6–21%) is too hard to be useful yet.
- The BIRD benchmark has a dirty secret. MotherDuck hand-reviewed 151 train-set questions and found that 49 of the gold-standard SQL answers were wrong. Strict execution accuracy disagrees with human judgment ~40% of the time. The LLM-judge tier at 94% is more realistic.
- Most frontier models haven't been tested on BIRD. GPT-5.5, Claude Opus 4.8, DeepSeek V4 Pro, Qwen 3.7 Max, Kimi K2.6, and MiniMax M2.7 all launched after the Feb 2026 BIRD snapshot. We map proxy benchmarks to fill the gap.
🔥 CodingFleet Unlimited Plan: Generate SQL Without Per-Query Cost Anxiety
DeepSeek V4 Pro is available on CodingFleet's Unlimited plan — no weekly, daily, or hourly quotas. For SQL generation at scale (schema exploration, query drafting, migration generation), flat-rate access eliminates the per-query cost that makes API billing unpredictable on database workflows.
All models analyzed here are available on CodingFleet. Test them on your database queries →
The SQL Benchmark Landscape: One Good, Two Useless
Unlike coding benchmarks (SWE-bench, Terminal-Bench, LiveCodeBench), SQL has a fragmented evaluation landscape. Here's what exists — and what's worth paying attention to:
| Benchmark | What It Tests | Top Score | Status |
|---|---|---|---|
| Spider 1.0 | Cross-domain SQL generation, clean schemas | 85–92% | ❌ Dead. Saturated. |
| BIRD | Real-world databases, external knowledge, messy schemas | 77.14% (baseline LLM); 81.95% (agent system); human: 92.96% | ✅ The standard. |
| Spider 2.0 | Enterprise-scale: 1,000+ columns, multi-dialect, workflow-based | 6–21% | ⚠️ Too hard. |
| LiveSQLBench | Dynamic, contamination-free, industrial-scale | — | 🆕 Emerging. |
Spider 1.0 is the HumanEval of SQL. Clean schemas, only a few tables per database (3–10), human-readable column names. Every frontier model scores 85–92%. It tells you a model can write basic SQL, which we already know. It tells you nothing about which model is better.
Spider 2.0 is the opposite extreme. Enterprise databases with 1,000+ columns across BigQuery and Snowflake. Multi-dialect. Workflow-based tasks. The best model scores 21%. At this difficulty level, the benchmark can't differentiate between models — they all fail most tasks. Useful for research, useless for model selection.
BIRD is the Goldilocks benchmark. It contains 12,751 question-SQL pairs over 95 real-world databases, in domains such as finance, sports, healthcare, and education. It requires external knowledge (a hospital database question might need medical terminology). Human performance: 92.96%. Best AI agent system: 81.95%. Best baseline LLM (no scaffold): 77.14%. This is the only SQL benchmark with discriminative power for frontier models.
BIRD Benchmark: The Only SQL Ranking That Matters
| Model / System | BIRD Dev | BIRD Test | Type | Date |
|---|---|---|---|---|
| Human (Data Engineers + DB Students) | — | 92.96% | Human baseline | — |
| AskData + GPT-4o (AT&T) | 77.64% | 81.95% | Agent system | Dec 2025 |
| Gemini-SQL + Gemini 2.5 Pro (Google) | 73.27% | 77.14% | Baseline LLM | Mar 2026 |
| Q-SQL (AWS, 30B MoE) | 72.99% | 76.47% | Fine-tuned | Dec 2025 |
| Qwen3-Coder-480B (Baseline) | 66.17% | 68.14% | Baseline LLM | Feb 2026 |
| Claude 4.5 Sonnet (Baseline) | 67.34% | 66.85% | Baseline LLM | Feb 2026 |
| GLM-4.7 (Baseline) | 63.82% | 62.94% | Baseline LLM | Feb 2026 |
| DeepSeek-R1 (Baseline) | 61.67% | 60.93% | Baseline LLM | Feb 2026 |
| Kimi-K2-Thinking (Baseline) | 60.63% | 59.87% | Baseline LLM | Feb 2026 |
Source: BIRD Leaderboard. Baseline LLM scores = raw model output, no agent scaffolding. Agent systems add schema linking, self-consistency, and error correction. GPT-5.5, Claude Opus 4.8, DeepSeek V4 Pro, Qwen 3.7 Max, Kimi K2.6, MiniMax M2.7 all launched after the Feb 2026 snapshot and have not been independently evaluated on BIRD.
The critical gap: Every model in our recommended list launched after the February 2026 BIRD snapshot. GPT-5.5 (April), Claude Opus 4.8 (May), DeepSeek V4 Pro (April), Qwen 3.7 Max (May), Kimi K2.6 (April), MiniMax M2.7 (May) — none have independent BIRD scores. The table above is the best we have, but it's 3–4 months behind the frontier. For the newest models, we use proxy benchmarks.
Proxy Benchmarks: Mapping SQL Skills to Existing Benchmarks
Database coding isn't just text-to-SQL. Developers do five distinct things. Here's how they map to available benchmarks:
| Database Task | What It Involves | Best Proxy Benchmark | Why |
|---|---|---|---|
| Text-to-SQL | "How many orders in January?" → SELECT COUNT(*)... | BIRD (direct) | Actual SQL generation benchmark |
| ORM queries | Django ORM, SQLAlchemy, Active Record | SWE-bench Pro | Django = 46% of SWE-bench Verified tasks |
| DB administration | pg_dump, migration execution, index rebuilds | Terminal-Bench 2.1 | CLI workflows, chained commands |
| Schema design & reasoning | Understanding 1,000+ column schemas, cross-table relationships | MRCR v2 long-context | Large schema = long-context challenge |
| Analytics & BI queries | Window functions, CTEs, complex aggregations | GPQA + LiveCodeBench | Math-heavy analytical reasoning |
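To make the "Analytics & BI queries" row concrete, here is a minimal sketch of the kind of query that category covers: a CTE feeding two window functions. The `orders` table, its columns, and the sample rows are hypothetical, and SQLite (through Python's standard library, SQLite 3.25 or newer for window functions) stands in for whatever warehouse you actually use.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
SETUP_SQL = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, order_month TEXT, amount REAL);
INSERT INTO orders (customer, order_month, amount) VALUES
    ('acme',   '2026-01', 100.0),
    ('acme',   '2026-02', 150.0),
    ('acme',   '2026-03',  50.0),
    ('globex', '2026-01', 200.0),
    ('globex', '2026-03', 300.0);
"""

# CTE: revenue per customer per month.
# Window 1: running revenue total per customer.
# Window 2: each customer's revenue rank within the month.
REPORT_SQL = """
WITH monthly AS (
    SELECT customer, order_month, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer, order_month
)
SELECT customer,
       order_month,
       revenue,
       SUM(revenue) OVER (PARTITION BY customer ORDER BY order_month) AS running_total,
       RANK() OVER (PARTITION BY order_month ORDER BY revenue DESC) AS rank_in_month
FROM monthly
ORDER BY customer, order_month;
"""


def run_report():
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP_SQL)
        return conn.execute(REPORT_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_report():
        print(row)
```

Queries like this are where a model has to keep partitioning, ordering, and aggregation straight at the same time, which is why the table pairs them with reasoning-heavy benchmarks rather than with BIRD alone.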
Best Models for Text-to-SQL (BIRD + Community Testing)
Based on the BIRD leaderboard and independent community testing, here are the models best suited for raw SQL query generation:
🥇 Gemini — The SQL King
Every data point leads in the same direction: Google's models dominate text-to-SQL. Gemini-SQL + Gemini 2.5 Pro leads BIRD's baseline LLMs at 77.14% (no scaffold). A Reddit tester found Gemini 2.0 Flash at a 92.5% success rate with 40× better cost-performance than Claude 3.7 Sonnet. Google has presumably trained on massive SQL corpora (Stack Overflow, GitHub SQL repos, internal BigQuery logs), and it shows in the model's ability to handle messy schemas and external knowledge requirements.
Gemini 3.5 Flash is the newest Google model (May 2026) and hasn't been independently evaluated on BIRD yet, but it beats Gemini 3.1 Pro on most coding and agent benchmarks. It likely pushes Google's SQL lead even further. At $1.50 input / $9.00 output per 1M tokens, it's also cost-effective for high-volume query generation.
🥈 Claude Opus 4.8 — When Quality Matters More Than Coverage
Claude's SQL story is nuanced. On raw BIRD accuracy, Claude 4.5 Sonnet scored 66.85%, about 10 points behind Gemini. But independent testing reveals a pattern: when Claude generates SQL, it's higher quality. The Reddit tester found that Claude 3.7 Sonnet earned perfect scores on 57.5% of queries, the highest share of any model, but failed more queries overall. Claude's conservative approach (refuse rather than generate wrong SQL) is an advantage for production systems where a wrong query is worse than no query. In our hallucination analysis, Claude Opus 4.7 led on AA-Omniscience: it's the best at admitting when it doesn't know.
🥉 DeepSeek V4 Pro — The Cost-Effective Volume Option
DeepSeek-R1 scored 60.93% on BIRD, near the bottom of the tested models (only Kimi-K2-Thinking, at 59.87%, scored lower). But DeepSeek V4 Pro is a generational leap over R1 (80.6% SWE-bench Verified vs R1's ~50%). At $0.87/1M output, it's the cheapest model that can handle SQL generation at scale. For schema exploration, where you're generating dozens of queries to understand a new database, DeepSeek's cost advantage is decisive. See our heavy user's stack guide for cost analysis.
🔬 Why Google Wins SQL
Google's training data plausibly gives Gemini a structural advantage in SQL. BigQuery logs, Stack Overflow's SQL tag, and Google's internal data infrastructure could provide far more SQL training examples than other labs have access to. Google Cloud's database products (Cloud SQL, Spanner, BigQuery) create a feedback loop: internal SQL usage → training data → better SQL models → better products. We suspect the same dynamic gives Claude its Django lead (Anthropic's engineering-heavy training data) and GPT-5.5 its terminal lead (OpenAI's infrastructure self-hosting).
Best Models for ORM Queries (Django, SQLAlchemy, Active Record)
ORM queries are a different skill from raw SQL. You're not writing SELECT * FROM users WHERE... — you're writing User.objects.filter(date_joined__gte=last_month).annotate(...). The model needs to understand the ORM's API, the underlying database schema, and Python/TypeScript simultaneously.
This is where Claude Opus 4.8's SWE-bench Pro lead (69.2%) becomes relevant. SWE-bench Pro is a proxy here: the 46% Django share is measured on the related SWE-bench Verified set, where many Django fixes touch ORM queries. If that mix carries over, the 10.6-point gap over GPT-5.5 on SWE-bench Pro should translate into more accurate ORM query generation.
See our Python coding comparison for the full analysis of which models handle Django and SQLAlchemy best.
| ORM Task | Best Model | Why |
|---|---|---|
| Django ORM queries | Claude Opus 4.8 | 69.2% SWE-bench Pro; Django = 46% of SWE-bench Verified tasks |
| SQLAlchemy queries | Claude Opus 4.8 | Multi-file Python reasoning; SQLAlchemy mirrors Django patterns |
| Active Record (Rails) | Claude Opus 4.8 | SWE-bench Multilingual leader (84.4%) |
| Prisma / TypeORM | Claude Opus 4.8 | Strongest cross-language code understanding |
| Cost-effective ORM generation | Kimi K2.6 | 58.6% SWE-bench Pro at $4.00/1M output |
Best Models for Database Administration (Migrations, Indexes, Backups)
Database administration is terminal work. pg_dump, CREATE INDEX CONCURRENTLY, migration execution, replication setup. This is where GPT-5.5's Terminal-Bench lead (78.2%) matters:
| DB Admin Task | Best Model | Why |
|---|---|---|
| Migration scripts | GPT-5.5 | 78.2% Terminal-Bench; best CLI workflow model |
| Index optimization | Claude Opus 4.8 | 69.2% SWE-bench Pro; understands schema impact |
| Backup & restore procedures | GPT-5.5 | Terminal-Bench covers chained pg_dump/pg_restore |
| Replication setup | GPT-5.5 | Complex terminal workflows with error handling |
| Performance troubleshooting | Claude Opus 4.8 | HLE 49.8% — best at multi-step diagnostic reasoning |
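As a rough illustration of the "chained pg_dump/pg_restore" workflow in the table above, the sketch below builds the two commands as argument lists and joins them into one shell line. The database names and dump path are hypothetical, and nothing is executed: the functions only assemble commands, so you can review them before running anything against a real server.

```python
from __future__ import annotations

import shlex


def backup_command(dbname: str, dump_path: str) -> list[str]:
    # --format=custom writes a compressed archive that pg_restore can read.
    return ["pg_dump", "--format=custom", f"--file={dump_path}", dbname]


def restore_command(target_db: str, dump_path: str) -> list[str]:
    # --clean --if-exists drops existing objects before recreating them,
    # so the restore can be repeated against the same target database.
    return ["pg_restore", "--clean", "--if-exists", f"--dbname={target_db}", dump_path]


def as_shell(commands: list[list[str]]) -> str:
    # Chain with && so the restore only runs if the dump succeeded.
    return " && ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # Hypothetical names: "shop" is the source, "shop_staging" the target.
    print(as_shell([
        backup_command("shop", "/tmp/shop.dump"),
        restore_command("shop_staging", "/tmp/shop.dump"),
    ]))
```

Keeping each command as a list (rather than one formatted string) means it can be passed directly to `subprocess.run` without shell quoting problems if you later decide to execute it.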
The BIRD Controversy: 49 of 151 Gold Answers Are Wrong
Here's the dirty secret of SQL benchmarking that nobody talks about. MotherDuck hand-reviewed 151 BIRD train questions and found 49 gold-standard SQL answers were incorrect. That's 32% of the reference data — wrong.
They then built a four-tier scoring system:
| Tier | Train Accuracy | Test Accuracy | What It Measures |
|---|---|---|---|
| Gold Match Only | 64.0% | 58.2% | Official BIRD: strict result-set equality |
| + Platinum Answers | 73.1% | 58.5% | Corrects known errors in gold SQL |
| + Formatting Tolerance | 78.8% | 65.5% | DISTINCT differences, extra columns, rounding |
| + LLM Judge | 94.9% | 94.4% | "Would a human accept this answer?" |
The LLM-judge tier at 94.4% test accuracy is dramatically closer to real-world utility than the 58.2% official score. A 2025 study found BIRD's strict execution accuracy disagrees with human judgment nearly 40% of the time. The benchmark penalizes correct SQL that uses DISTINCT differently, rounds numbers differently, or returns extra helpful columns.
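The difference between the strict tier and the formatting-tolerant tier is easier to see in code. The sketch below is an approximation written for this article, not BIRD's or MotherDuck's actual evaluator: `strict_match` requires identical result sets, while `tolerant_match` forgives rounding, duplicate rows, and extra columns. The `orders` table and all three queries are hypothetical.

```python
import sqlite3
from itertools import permutations


def strict_match(gold_rows, pred_rows):
    # Strict tier: the two result sets must be identical (row order ignored).
    return set(gold_rows) == set(pred_rows)


def normalize(rows, ndigits=2):
    # Round floats and collapse duplicate rows so that rounding and
    # DISTINCT differences are not counted as errors.
    return {
        tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)
        for row in rows
    }


def tolerant_match(gold_rows, pred_rows, ndigits=2):
    # Tolerant tier: accept the prediction if some selection of its columns
    # reproduces the gold result after normalization (extra columns allowed).
    if not gold_rows or not pred_rows:
        return not gold_rows and not pred_rows
    gold = normalize(gold_rows, ndigits)
    gold_width, pred_width = len(gold_rows[0]), len(pred_rows[0])
    for cols in permutations(range(pred_width), gold_width):
        projected = normalize(
            [tuple(row[c] for c in cols) for row in pred_rows], ndigits
        )
        if projected == gold:
            return True
    return False


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('acme', 10.0), ('acme', 20.0), ('acme', 25.0), ('globex', 5.0);
""")

# Gold answer rounds the average; the prediction does not and adds a count.
gold = conn.execute(
    "SELECT customer, ROUND(AVG(amount), 2) FROM orders GROUP BY customer"
).fetchall()
pred = conn.execute(
    "SELECT customer, AVG(amount), COUNT(*) FROM orders GROUP BY customer"
).fetchall()
# A genuinely wrong answer: MAX instead of AVG.
wrong = conn.execute(
    "SELECT customer, MAX(amount) FROM orders GROUP BY customer"
).fetchall()
conn.close()

if __name__ == "__main__":
    print("strict:", strict_match(gold, pred))
    print("tolerant:", tolerant_match(gold, pred))
    print("tolerant on wrong answer:", tolerant_match(gold, wrong))
```

The unrounded prediction with an extra column fails the strict check but passes the tolerant one, while the `MAX` query fails both. That is the category of "correct but penalized" answer the tiers above are trying to separate from real errors.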
The implication: AI text-to-SQL is significantly better than BIRD's leaderboard suggests. When you account for formatting differences and known gold-answer errors, practical accuracy jumps from ~58% to ~94%. The models are already approaching human-level SQL — the benchmark just isn't measuring it properly.
The Performance Cliff: Why Enterprise SQL Is Still Unsolved
The gap between Spider 1.0 (85–92%) and Spider 2.0 (6–21%) tells the real story. When you move from clean academic schemas to real enterprise databases with 1,000+ columns, messy metadata, and multiple SQL dialects, AI collapses.
The 2026 Text-to-SQL Performance Cliff analysis identifies the key failure modes:
- Schema hallucination: Models invent column names that don't exist
- Dialect confusion: PostgreSQL syntax in BigQuery, MySQL LIMIT in SQL Server
- Join explosion: Models generate 12-way JOINs when a 3-way join suffices
- External knowledge gaps: Medical databases require domain terminology models lack
- Silent errors: Queries that execute but return wrong results — the most dangerous failure mode
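Schema hallucination is the one failure mode in this list that can be caught mechanically before a query ever runs. One possible approach, sketched here under the assumption that your schema can be loaded into SQLite, is to compile the generated query against an empty copy of the schema with `EXPLAIN`, which prepares the statement without executing it. The `users` table and both candidate queries are hypothetical.

```python
import sqlite3


def check_against_schema(schema_ddl, candidate_sql):
    """Return None if the query compiles against the schema, else the error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # schema only, tables stay empty
        # EXPLAIN compiles the statement, so unknown tables or columns raise
        # an error here, but the query itself is never run.
        conn.execute(f"EXPLAIN {candidate_sql}")
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


# Hypothetical schema and model outputs, for illustration only.
SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, date_joined TEXT);"
GOOD_SQL = "SELECT email FROM users WHERE date_joined >= '2026-01-01'"
BAD_SQL = "SELECT email FROM users WHERE signup_date >= '2026-01-01'"  # invented column

if __name__ == "__main__":
    print(check_against_schema(SCHEMA, GOOD_SQL))
    print(check_against_schema(SCHEMA, BAD_SQL))
```

This only catches invented names and syntax SQLite rejects. It does nothing for silent errors, where the query compiles and runs but answers a different question, and it will not validate dialect-specific syntax for other engines.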
The best defense: always paste your schema. Builder.ai2sql testing found that schema-pasted accuracy jumps from 72% to 83%. The gap between "knows your database" and "guesses your database" is 11 points. Use CodingFleet's AI Chat with file upload to paste your schema directly.
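"Paste your schema" can be automated. As a minimal sketch, assuming a SQLite database, the helper below reads the stored `CREATE` statements from `sqlite_master` and places them ahead of the question. The prompt wording, the function names, and the `orders` table are all hypothetical, and no model is called: the output is just the text you would send.

```python
import sqlite3


def dump_schema(conn):
    # sqlite_master stores the CREATE statement for every table and view.
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND sql IS NOT NULL "
        "ORDER BY name"
    ).fetchall()
    return ";\n".join(row[0] for row in rows) + ";"


def build_prompt(conn, question):
    # Put the real schema in front of the question so the model does not
    # have to guess table or column names.
    return (
        "You are writing SQLite SQL. Use only the tables and columns below.\n\n"
        f"Schema:\n{dump_schema(conn)}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, for illustration only.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_at TEXT, total REAL)")
    print(build_prompt(conn, "How many orders in January?"))
    conn.close()
```

For other engines the same idea applies with a different catalog query, for example `information_schema.columns` in PostgreSQL or MySQL.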
Which Model for Which Database Task?
| Database Task | Best Model | Why |
|---|---|---|
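| Text-to-SQL (raw queries) | Gemini 3.5 Flash | Gemini family leads BIRD's baseline LLMs (77.14%, Gemini 2.5 Pro) and community testing (92.5%, Gemini 2.0 Flash); 3.5 Flash itself is not yet evaluated on BIRD |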
| Django ORM / SQLAlchemy | Claude Opus 4.8 | 69.2% SWE-bench Pro; Django = 46% of SWE-bench Verified tasks |
| Migration generation & execution | GPT-5.5 | 78.2% Terminal-Bench; best CLI workflow model |
| Schema design & normalization | Claude Opus 4.8 | HLE 49.8%; best multi-step reasoning |
| Stored procedures & functions | Claude Opus 4.8 | SWE-bench Pro multi-file; PL/pgSQL mirrors Python patterns |
| Volume SQL generation (cost-sensitive) | DeepSeek V4 Pro | $0.87/1M output; good enough for schema exploration |
| BI & analytical queries | Qwen 3.7 Max | 91.6% LiveCodeBench; 53.5% SciCode; best analytical reasoning |
| Index & performance tuning | Claude Opus 4.8 | Understands schema impact (69.2% SWE-bench Pro); HLE 49.8% for multi-step diagnostic reasoning |
| Multi-dialect SQL | Gemini 3.5 Flash | Google's multi-database training (MySQL, PostgreSQL, BigQuery, Spanner) |
| PR review of SQL changes | Kimi K2.6 | 58.6% SWE-bench Pro; 89.6% LiveCodeBench; $4.00/1M output |
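If you want to act on a table like this in code, one simple option is a routing map from task category to model. The sketch below covers only the four headline categories, and the task keys and function name are hypothetical. The model strings are display names from this article, not API identifiers, so you would need to map them to whatever IDs your provider uses.

```python
# Hypothetical task keys mapped to the article's headline picks.
# Values are display names, not API model identifiers.
ROUTES = {
    "text_to_sql": "Gemini 3.5 Flash",
    "orm": "Claude Opus 4.8",
    "db_admin": "GPT-5.5",
    "volume": "DeepSeek V4 Pro",
}


def pick_model(task):
    # Fail loudly on unknown tasks instead of silently choosing a default.
    try:
        return ROUTES[task]
    except KeyError:
        known = ", ".join(sorted(ROUTES))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for task in ROUTES:
        print(f"{task} -> {pick_model(task)}")
```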
The Bottom Line
- Text-to-SQL has a clear winner: Gemini. Gemini-SQL + Gemini 2.5 Pro leads BIRD's baseline LLMs at 77.14%; the overall top entry, AskData + GPT-4o, reaches 81.95% only with agent scaffolding. One Reddit tester measured Gemini 2.0 Flash at a 92.5% success rate with 40× better cost-performance than Claude 3.7 Sonnet. Google's SQL training data advantage looks structural and likely to persist.
- ORM queries are Claude Opus 4.8's domain. 69.2% SWE-bench Pro, with Django making up 46% of the related SWE-bench Verified tasks. When SQL is embedded in Python, Claude's multi-file coding lead transfers directly. For Django, SQLAlchemy, and Active Record, Claude is the default.
- Database administration belongs to GPT-5.5. 78.2% Terminal-Bench. Migrations, backups, replication setup — these are terminal workflows where GPT-5.5 leads by 3.6–10 points.
- BIRD's own answers are unreliable. In MotherDuck's hand-reviewed sample of 151 train questions, 32% of gold answers were wrong. The LLM-judge tier (94.4% practical accuracy) is more realistic than the official 58.2% score. AI text-to-SQL is better than the leaderboard says.
- Enterprise SQL is unsolved. Spider 2.0 at 6–21% shows the gap between academic SQL and real enterprise databases is still enormous. Schema hallucination remains the #1 failure mode. Always paste your schema.
- Most frontier models haven't been tested on BIRD. GPT-5.5, Claude Opus 4.8, Qwen 3.7 Max, DeepSeek V4 Pro, Kimi K2.6, and MiniMax M2.7 all lack independent BIRD scores. The February 2026 snapshot is the best data available, but it is 3–4 months stale.
The SQL AI landscape is more nuanced than coding. No single model wins everything. Gemini for raw SQL. Claude for ORM. GPT-5.5 for admin. DeepSeek for volume. The optimal strategy is a multi-model database workflow — and CodingFleet gives you all of them in one place. Generate SQL with any model on the SQL Code Generator, or use the Code Converter to migrate between SQL dialects.
Sources: BIRD Leaderboard (Feb 2026 snapshot) | MotherDuck — BIRD Audit | Text-to-SQL Performance Cliff 2026 | Reddit — SQL Model Testing | AIMultiple — Text-to-SQL LLM Comparison | CIDR 2026 — Text-to-SQL Benchmarks Are Broken | Gemini 3.5 Flash model card | Anthropic Opus 4.8 system card. BIRD baseline LLM scores from February 2026 snapshot — newer frontier models not yet independently evaluated.