Benchmarks measure intelligence. But there's a more fundamental question: does the model tell the truth? We analyzed the two leading independent hallucination benchmarks — Vectara's HHEM and Artificial Analysis' Omniscience Index — to create the most comprehensive honesty comparison of current-generation frontier models. The findings challenge several assumptions: smaller models often fabricate less, and "thinking harder" paradoxically makes models lie more.
📊 Key Findings
- GPT-5.4 Mini (5.5%) leads the Vectara HHEM hallucination rankings among the current-gen models tested; GPT-5.5 (9.3%) places fourth, behind DeepSeek V4 Pro (8.6%) and MiniMax M2.7 (9.1%)
- Gemini 3.1 Pro tops AA-Omniscience at 32.9 — best at admitting ignorance rather than fabricating
- The Reasoning Tax is real: in the three model-family comparisons below, thinking/reasoning modes raise hallucination rates roughly 1.5–3.5×
- Claude Opus 4.7 takes the opposite trade: higher Vectara rate (12.0%) but AA-Omniscience Index of 26.2 (second only to Gemini 3.1 Pro)
- Qwen 3.7 Max debuts with a 22.9% AA-Omniscience hallucination rate, the lowest of the hallucination rates reported in this analysis
What These Benchmarks Actually Test
There are two fundamentally different ways to measure hallucination, and they often disagree. Understanding the difference is essential to reading any honesty ranking:
| Benchmark | What It Measures | Good Score | Risk |
|---|---|---|---|
| Vectara HHEM | How often a model invents facts when summarizing a document. Tests grounding to source material. | Lower is better: <5% excellent, >10% concerning | Models can "game" this by refusing to summarize, so answer rate must be checked |
| AA-Omniscience | Whether a model admits ignorance when it doesn't know an answer. 6,000 questions across 42 topics. | Higher Index is better: >10 good, >25 excellent | Models with very low scores either fabricate aggressively or refuse too much |
The key insight: Vectara tests "does it stick to the source?" while AA-Omniscience tests "does it know what it doesn't know?" A model can ace one and fail the other. Claude Opus 4.7 is the clearest example: 12.0% Vectara (concerning) but 26.2 AA-Omniscience Index (excellent). It fabricates more in summarization than most peers, but it's one of the best at saying "I'm not sure" when it genuinely lacks knowledge.
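To make the two scales concrete, here is a minimal sketch that applies the thresholds from the table above to a pair of scores. The function names and the fallback labels ("in between", "below good") are illustrative assumptions, not terms used by either benchmark.

```python
def vectara_tier(hallucination_rate_pct: float) -> str:
    """Label a Vectara HHEM hallucination rate (lower is better)."""
    if hallucination_rate_pct < 5.0:
        return "excellent"
    if hallucination_rate_pct > 10.0:
        return "concerning"
    return "in between"  # assumed label; the table names no tier for 5-10%


def omniscience_tier(index: float) -> str:
    """Label an AA-Omniscience Index (higher is better)."""
    if index > 25.0:
        return "excellent"
    if index > 10.0:
        return "good"
    return "below good"  # assumed label for scores of 10 or less


# Claude Opus 4.7 figures quoted above: 12.0% Vectara, 26.2 Index.
opus_vectara = vectara_tier(12.0)
opus_omniscience = omniscience_tier(26.2)
print(opus_vectara, opus_omniscience)
```

Run on the Claude Opus 4.7 figures, the two functions land in opposite tiers, which is the disagreement the paragraph above describes.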
Vectara HHEM: The Summarization Truthfulness Ranking
Vectara's Hughes Hallucination Evaluation Model (HHEM) is the industry standard for factual grounding. The benchmark evaluates models on 7,700+ documents spanning law, medicine, finance, and education — summarization tasks where source fidelity is critical. All scores below are from the new (tougher) dataset where answer rate ≥ 95%.
| Model | Provider | Vectara HHEM | Fact. Consistency | Tier |
|---|---|---|---|---|
| GPT-5.4 Mini | OpenAI | 5.5% | 94.5% | ✅ Excellent |
| DeepSeek V4 Pro | DeepSeek | 8.6% | 91.4% | ✅ Good |
| MiniMax M2.7 | MiniMax | 9.1% | 90.9% | ✅ Good |
| GPT-5.5 | OpenAI | 9.3% | 90.7% | ✅ Good |
| Claude Haiku 4.5 | Anthropic | 9.8% | 90.2% | ⚠️ Borderline |
| GLM-5* | zAI | 10.1% | 89.9% | ⚠️ Borderline |
| Gemini 3.1 Pro | Google | 10.4% | 89.6% | ⚠️ Borderline |
| Qwen 3.6 Flash** | Alibaba | 10.5% | 89.5% | ⚠️ Borderline |
| Claude Sonnet 4.6 | Anthropic | 10.6% | 89.4% | ⚠️ Borderline |
| Kimi K2.6 | Moonshot | 10.8% | 89.2% | ⚠️ Borderline |
| Claude Opus 4.7 | Anthropic | 12.0% | 88.0% | 🔴 Concerning |
* GLM-5 score shown; GLM-5.1 not yet on Vectara leaderboard. ** Qwen 3.5 Flash score shown; Qwen 3.6 Flash not yet independently verified. Source: Vectara HHEM Leaderboard, May 2026 snapshot.
Models not yet on Vectara's new dataset: Claude Opus 4.8 (released too recently), Grok 4.3, Grok 4.1 Fast, Gemini 3.5 Flash, Gemini 3 Flash, DeepSeek V4 Flash, GPT-5.4 (Instant), Qwen 3.7 Max, Qwen 3.6 Plus, GLM-5V Turbo. These will be added as Vectara updates their leaderboard.
AA-Omniscience: The "Knows What It Doesn't Know" Ranking
Artificial Analysis' Omniscience Index measures something different: calibration. It rewards models for correct answers, penalizes hallucinations aggressively, and gives zero penalty for admitting "I don't know." A score of 0 means the model gets as many right as wrong; negative means more wrong than right. This is the better metric for high-stakes factual Q&A.
| Model | Provider | AA-Omniscience Index | Accuracy | Halluc. Rate |
|---|---|---|---|---|
| Gemini 3.1 Pro | Google | 32.9 | 55.3% | 50% |
| Claude Opus 4.7 (Adaptive) | Anthropic | 26.2 | ~47% | 36% |
| Gemini 3.5 Flash | Google | 22.7 | N/A | N/A |
| GPT-5.5 (xhigh) | OpenAI | 20.1 | 57% | 86% |
| Grok 4.3 | xAI | 18.3 | N/A | N/A |
| Gemini 3 Pro | Google | 15.8 | N/A | N/A |
| Claude Opus 4.7 | Anthropic | 14.2 | N/A | N/A |
| Qwen 3.7 Max | Alibaba | 14.1 | N/A | 22.9% |
| Kimi K2.6 | Moonshot | 6.4 | N/A | N/A |
| GPT-5.4 | OpenAI | 5.7 | N/A | N/A |
| MiMo V2.5 Pro | Xiaomi | 3.6 | N/A | 24.5% |
| Qwen 3.6 Plus | Alibaba | 2.7 | N/A | N/A |
Source: BenchLM / Artificial Analysis Omniscience Index, May 2026 snapshot. Hallucination rates from Suprmind compilation of AA data.
Models not yet independently evaluated on AA-Omniscience: Claude Opus 4.8, Claude Haiku 4.5, GPT-5.5 (Instant), GPT-5.4 Mini (Instant), Gemini 3 Flash, DeepSeek V4 Pro, DeepSeek V4 Flash, Grok 4.1 Fast, GLM-5V Turbo, Qwen 3.6 Flash.
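The relationship between the three AA-Omniscience columns can be sketched in a few lines. This assumes the Index is the percentage of correct answers minus the percentage of incorrect ones, and that the hallucination rate is the share of non-correct responses that were wrong answers rather than abstentions. Both are assumptions inferred from the description above; they reproduce the Gemini 3.1 Pro and GPT-5.5 rows closely, but they are not a formula quoted from Artificial Analysis.

```python
def omniscience_index(accuracy_pct: float, hallucination_rate_pct: float) -> float:
    """Estimate the index as (% correct) - (% incorrect).

    Assumption: hallucination rate = incorrect / (incorrect + abstained),
    i.e. the share of non-correct responses that were wrong answers.
    """
    not_correct_pct = 100.0 - accuracy_pct
    incorrect_pct = not_correct_pct * hallucination_rate_pct / 100.0
    return accuracy_pct - incorrect_pct


gemini = omniscience_index(55.3, 50.0)  # table lists 32.9
gpt55 = omniscience_index(57.0, 86.0)   # table lists 20.1
print(f"Gemini 3.1 Pro: {gemini:.2f}, GPT-5.5 (xhigh): {gpt55:.2f}")
```

Under these assumptions, GPT-5.5's higher accuracy is outweighed by how often it answers wrongly instead of abstaining, which is why it sits below Gemini 3.1 Pro on the Index.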
The Reasoning Tax: Why "Think Harder" Makes Models Lie More
In the three model-family comparisons below, each pairing a base model with a reasoning-oriented counterpart, the pattern is consistent:
| Model Family | Base Mode (Vectara) | Reasoning Mode | Increase |
|---|---|---|---|
| GPT-5.4 → GPT-5 (high reasoning) | ~7%* | >10% | ~1.5× |
| DeepSeek V3.2 → R1 | 5.3% | 11.3% | 2.1× |
| Grok 3 → Grok 4 Fast (reasoning) | 5.8% | 20.2% | 3.5× |
* GPT-5.4 Pro score (8.3%); base GPT-5.4 not directly on Vectara. Source: Suprmind analysis of Vectara data.
Why? Reasoning models generate long internal chains that can drift from source material. More tokens = more opportunities to invent plausible-sounding but unfounded details. The practical rule: use base/instant mode for summarization and fact-grounded tasks; reserve reasoning mode for math, coding, and logic where fabrication risk is lower.
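The "Increase" column in the table above is simply the reasoning-mode rate divided by the base-mode rate. A minimal sketch of that arithmetic, using the two rows with exact figures (the GPT row is omitted because its values are approximate); the function and variable names are illustrative:

```python
# (base-mode rate %, reasoning-mode rate %) from the table above.
pairs = {
    "DeepSeek V3.2 -> R1": (5.3, 11.3),
    "Grok 3 -> Grok 4 Fast (reasoning)": (5.8, 20.2),
}


def reasoning_tax(base_pct: float, reasoning_pct: float) -> float:
    """How many times higher the reasoning-mode hallucination rate is."""
    return reasoning_pct / base_pct


ratios = {name: round(reasoning_tax(b, r), 1) for name, (b, r) in pairs.items()}
print(ratios)
```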
⚠️ The GPT-5.5 Paradox
GPT-5.5 simultaneously achieves the highest AA-Omniscience accuracy ever recorded (57%) and one of the worst hallucination rates (86%) among frontier models. It knows more than any other model — but when it's wrong, it's confident and convincing. For agentic workflows where errors compound, this is dangerous. For self-evaluating pipelines that verify outputs before acting, it's manageable. Calibration matters as much as capability.
Provider-Level Hallucination: Same Model, Different Truthfulness
An emerging 2026 finding: the same model served by different providers (OpenRouter vs DeepInfra vs Together.ai vs official API) can show meaningfully different hallucination rates. Quantization choices, system prompt injection, and inference configuration all affect grounding:
- DeepSeek V4 Pro on OpenRouter: 8.6% Vectara
- DeepSeek V3.2 on DeepInfra: 5.8% (vs 6.3% on OpenRouter) — 8% better
- Llama 3.3 70B on Together.ai: 4.9% vs 4.1% on OpenRouter — ~20% worse
Takeaway: When building production RAG pipelines, benchmark the specific provider endpoint you'll use, not just the model card.
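When comparing your own endpoint measurements, the "better" and "worse" percentages quoted above are relative changes, not percentage-point differences. A small sketch of that calculation, with a hypothetical helper name:

```python
def relative_change_pct(candidate_pct: float, reference_pct: float) -> float:
    """Relative change of candidate vs. reference.

    Negative means the candidate endpoint hallucinates less.
    """
    return (candidate_pct - reference_pct) / reference_pct * 100.0


deepinfra_vs_openrouter = relative_change_pct(5.8, 6.3)  # DeepSeek V3.2
together_vs_openrouter = relative_change_pct(4.9, 4.1)   # Llama 3.3 70B
print(f"{deepinfra_vs_openrouter:+.1f}%  {together_vs_openrouter:+.1f}%")
```

A gap of half a percentage point can therefore read as a much larger relative shift when the baseline rate is already low.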
Composite Honesty: Who Wins?
No single model leads both benchmarks. Here's how they stack up when we consider both Vectara HHEM and AA-Omniscience together:
| Model | Vectara HHEM | AA-Omniscience Index | Overall Honesty |
|---|---|---|---|
| 🥇 GPT-5.4 Mini | 5.5% (best) | N/A | Best summarization honesty |
| 🥈 GPT-5.5 | 9.3% | 20.1 | Best all-rounder: strong on both |
| 🥉 Gemini 3.1 Pro | 10.4% | 32.9 (best) | Best calibration & ignorance admission |
| Claude Opus 4.7 | 12.0% | 26.2 (Adaptive mode) | Best refusal strategy; worst Vectara |
| Gemini 3.5 Flash | N/A | 22.7 | Strong calibration, Vectara pending |
| DeepSeek V4 Pro | 8.6% | N/A | Strong summarization, Omniscience pending |
| Grok 4.3 | N/A | 18.3 | Solid calibration, Vectara pending |
| Qwen 3.7 Max | N/A | 14.1 | Good debut; Vectara pending |
What This Means for Developers
✅ DO
- Use GPT-5.4 Mini or GPT-5.5 (Instant) for high-fidelity summarization pipelines
- Use base/instant mode (not reasoning) for any fact-grounded generation
- Use Gemini 3.1 Pro or Claude Opus 4.7 (Adaptive) for high-stakes Q&A where "I don't know" is better than a confident lie
- Benchmark your specific provider endpoint — hallucination varies by host
- Layer a verification pass (second model checking the first) for production pipelines
❌ DON'T
- Don't use reasoning mode for summarization or fact retrieval: it showed roughly 1.5–3.5× more hallucination in the Vectara comparisons above
- Don't assume "more expensive = more truthful" — GPT-5.4 Mini (cheap) beats Opus 4.7 (expensive) on Vectara
- Don't trust a single benchmark in isolation — Vectara and AA-Omniscience measure different things
- Don't use GPT-5.5 reasoning mode for unverified agentic actions — 86% hallucination rate on AA-Omniscience
- Don't assume open-weight = less truthful — DeepSeek V4 Pro (8.6%) beats several proprietary models on Vectara
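The "verification pass" recommendation from the DO list can be sketched as a small wrapper. Here `summarize` and `is_grounded` are hypothetical callables standing in for a generator model and a second checker model. No real provider API is assumed, and the retry count is an arbitrary illustration.

```python
from typing import Callable, Optional


def summarize_with_verification(
    document: str,
    summarize: Callable[[str], str],
    is_grounded: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> Optional[str]:
    """Return a summary the checker accepts, or None if none passes.

    `summarize` and `is_grounded` are hypothetical stand-ins for a
    generator model and a second model that checks the first.
    """
    for _ in range(max_attempts):
        summary = summarize(document)
        if is_grounded(document, summary):
            return summary
    # Nothing passed: the caller should escalate rather than ship it.
    return None


# Toy stand-ins so the sketch runs without any model API.
doc = "Revenue rose 4% in Q2."
toy_summarize = lambda d: d          # echoes the source verbatim
toy_check = lambda d, s: s in d      # accepts only text found in the source
result = summarize_with_verification(doc, toy_summarize, toy_check)
print(result)
```

Returning `None` instead of the last failed attempt is a deliberate choice in this sketch: an unverified summary never reaches downstream code by default.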
The Bottom Line
Three clear patterns emerge from the May 2026 data:
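- Smaller models are often more truthful. GPT-5.4 Mini (5.5%) leads Vectara; DeepSeek V4 Pro (8.6%) beats Claude Opus 4.7 (12.0%); and within Anthropic's own lineup, Claude Haiku 4.5 (9.8%) edges out Sonnet 4.6 (10.6%) and Opus 4.7 (12.0%). On summarization, at least, smaller models in these families tend to stick closer to source material than their larger siblings.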
- The reasoning-honesty tradeoff is real and measurable. Across the GPT, DeepSeek, and Grok comparisons, reasoning mode raises Vectara hallucination rates 1.5–3.5×. Chain-of-thought is powerful for math and coding, but on these results it hurts source-grounded tasks such as summarization.
- Calibration and factual accuracy are different skills. GPT-5.5 knows the most (57% accuracy) but lies most confidently (86% hallucination rate). Claude Opus 4.7 knows less but refuses more gracefully (36% hallucination rate). Gemini 3.1 Pro strikes the best balance (32.9 Index). Choose based on whether your application tolerates false confidence or prefers graceful refusal.
The hallucination problem isn't solved — but we now have the data to choose the right model for the right task. Select based on your failure tolerance, not the brand name.
Sources: Vectara HHEM Hallucination Leaderboard (GitHub, May 2026 snapshot) | BenchLM / Artificial Analysis Omniscience Index (May 21, 2026 snapshot) | Suprmind — AI Hallucination Rates & Benchmarks (comprehensive cross-benchmark compilation) | Artificial Analysis — AA-Omniscience Evaluation | Vectara Blog — New Leaderboard Methodology. Scores are point-in-time (May 2026) and may shift with model updates. Dashes/N/A indicate no published data on that specific benchmark for that model.