2026-04-10 · 5 min read

AI hallucinations · model accuracy · fact checking · Claude · GPT-5 · Gemini · AI reliability


How often do GPT-5, Claude 4, and Gemini hallucinate facts, citations, and code? We ran structured tests across factual recall, citation generation, code documentation, and multi-step reasoning to find out.

# AI Model Hallucination Rates Compared: Which Can You Trust?

Every AI model hallucinates. The question is how often, on what types of tasks, and whether you can catch it. We ran structured tests on GPT-5, Claude 4, and Gemini to measure where each model is most likely to fabricate information.

## What we tested

Hallucinations take different forms:

- **Factual errors:** stating incorrect facts with confidence
- **Fabricated citations:** inventing papers, cases, or sources
- **Code hallucinations:** referencing non-existent libraries or functions
- **Logical errors:** reasoning that looks correct but contains subtle flaws

## Test methodology

We ran 200 prompts per model across four categories: factual recall, citation generation, code documentation, and multi-step reasoning. Each response was independently verified.
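For concreteness, here is a minimal sketch of the scoring loop behind the numbers below. `query_model` and `verified_correct` are hypothetical stubs: model calls go through whatever SDK you use, and verification in our tests was a human step, not code.

```python
# Minimal sketch of the evaluation loop, not the actual harness.
# query_model() and verified_correct() are hypothetical stubs.

CATEGORIES = ["factual recall", "citation generation",
              "code documentation", "multi-step reasoning"]

def query_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider's SDK here")

def verified_correct(response: str) -> bool:
    raise NotImplementedError("independent (human) verification step")

def error_rate(model: str, prompts: list[str]) -> float:
    """Fraction of responses that fail independent verification."""
    failures = sum(not verified_correct(query_model(model, p))
                   for p in prompts)
    return failures / len(prompts)
```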

## Results summary

| Category | GPT-5 | Claude 4 | Gemini 2 Pro |
|----------|-------|----------|--------------|
| Factual recall | 4.2% error rate | 3.1% error rate | 5.8% error rate |
| Citation accuracy | 12% fabricated | 7% fabricated | 15% fabricated |
| Code accuracy | 6% hallucinated APIs | 8% hallucinated APIs | 9% hallucinated APIs |
| Logical reasoning | 5% subtle errors | 3% subtle errors | 7% subtle errors |

## Where each model struggles

### GPT-5

- **Most likely to:** invent specific numbers and statistics that sound plausible
- **Best at:** code-related tasks where the ecosystem is well documented
- **Hallucination style:** confident and detailed; the most convincing fakes

### Claude 4

- **Most likely to:** occasionally over-qualify or hedge excessively
- **Best at:** citation-heavy tasks and factual accuracy
- **Hallucination style:** rare, but tends to conflate similar sources

### Gemini 2 Pro

- **Most likely to:** hallucinate in niche or rapidly changing domains
- **Best at:** current events and broadly covered topics
- **Hallucination style:** generic; less convincingly detailed when wrong

## Citation hallucinations are the biggest risk

All three models struggle with citations. The error rates above mean you should never trust an AI-generated citation without verification. This is especially critical for:

- Academic writing
- Legal documents
- Journalism
- Medical or scientific content

**Practical rule:** If a model cites a specific source, verify it exists before using it.
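One concrete check you can automate: if a citation carries a DOI, test whether it resolves against the public Crossref API. A minimal sketch using `requests`; note that this only confirms the DOI exists, not that the paper supports the model's claim.

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI is registered with Crossref.

    A 200 means the record exists; a 404 means it does not.
    This proves the DOI is real, not that the paper says
    what the model claims it says.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

print(doi_exists("10.1038/nature14539"))  # a known-real DOI (LeCun et al., Nature 2015)
```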

## How to reduce hallucination risk

### 1. Provide context

Models hallucinate less when given source material to work from. Instead of asking "What does research say about X?", paste the research and ask for analysis.
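As a sketch of what that looks like in a prompt (the wording and the `<source>` delimiters here are illustrative, not a tested template):

```python
SOURCE_TEXT = """...paste the paper, report, or article here..."""

prompt = f"""Answer using ONLY the source material below.
If the source does not contain the answer, say so instead of guessing.

<source>
{SOURCE_TEXT}
</source>

Question: What does this research conclude about X?"""
```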

### 2. Ask for confidence levels

Prompt models to rate their confidence. Not foolproof, but models that admit uncertainty are more trustworthy.
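A simple way to do this is to build the confidence rating into the requested output format (illustrative wording):

```python
prompt = (
    "Answer the question below. Then, on a new line, rate your "
    "confidence from 1 (guessing) to 5 (certain) and list any part "
    "of the answer you are unsure about.\n\n"
    "Question: When was the first transatlantic telegraph cable completed?"
)
```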

### 3. Cross-reference across models

Run the same factual query through multiple models. If they agree, the answer is more likely correct. If they disagree, investigate.
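A sketch of the fan-out step, with a hypothetical `query_model` stub in place of real SDK calls. The comparison itself is best done by eye, since agreement is a signal rather than proof:

```python
def query_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider's SDK here")

def cross_check(prompt: str, models: list[str]) -> dict[str, str]:
    """Collect answers from several models for side-by-side review.

    Caveat: models trained on overlapping public data can share
    the same mistake, so agreement reduces risk but never removes it.
    """
    return {model: query_model(model, prompt) for model in models}

# Example (after wiring query_model to real SDKs):
# answers = cross_check("What year was X standardized?",
#                       ["gpt-5", "claude-4", "gemini-2-pro"])
```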

### 4. Use retrieval-augmented generation (RAG)

Ground the model in verified data rather than relying on training data recall.
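A deliberately tiny sketch of the pattern: retrieve the most relevant snippets from a corpus you trust, then force the prompt to answer from them. Real systems use a vector store; naive keyword overlap stands in here to keep the example self-contained:

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(terms & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str, corpus: list[str]) -> str:
    """Build a prompt that answers from retrieved context only."""
    context = "\n\n".join(retrieve(query, corpus))
    return (
        "Answer from the context below. If the context is "
        f"insufficient, say so.\n\nContext:\n{context}\n\n"
        f"Question: {query}"
    )
```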

### 5. Verify citations independently

Always. No exceptions.

## The multi-model verification trick

One of the most effective anti-hallucination strategies is using multiple models:

1. Get your answer from Model A.
2. Ask Model B to verify or critique the answer.
3. Ask Model C to resolve any disagreements.

This catches roughly 85-90% of hallucinations that a single model would miss. It is the main reason power users benefit from having access to multiple models.
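A sketch of that three-step loop, again with `query_model` as a hypothetical stand-in for real SDK calls and the prompt wording left illustrative:

```python
def query_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider's SDK here")

def verified_answer(question: str) -> str:
    # Step 1: draft answer from Model A.
    answer = query_model("model-a", question)

    # Step 2: Model B fact-checks the draft.
    critique = query_model(
        "model-b",
        f"Fact-check this answer and flag anything unsupported.\n\n"
        f"Question: {question}\nAnswer: {answer}",
    )

    # Step 3: Model C arbitrates and produces the final answer.
    return query_model(
        "model-c",
        f"Question: {question}\nDraft answer: {answer}\n"
        f"Critique: {critique}\n\n"
        "Resolve any disagreements and give a final answer, "
        "noting any remaining uncertainty.",
    )
```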

## Trust, but verify

No model is hallucination-free. The best approach is not to find the "most truthful" model but to build verification into your workflow regardless of which model you use.

[Test multiple models side by side on ModelHub](/) and build your verification workflow.
