Longitudinal Analysis of LLM Hallucination Rates

For the past two years, the narrative around Large Language Models (LLMs) has been clear:

“Hallucinations are improving.”

New models are released, benchmarks improve, and each iteration promises greater factual reliability. At the frontier, this is largely true.

But when you step back and look at the ecosystem as a whole, a different picture begins to emerge.

I recently set out to answer a simple question:

Are hallucination rates actually improving over time?

Surprisingly, I couldn’t find any research that examined this longitudinally. Most studies look at performance at a single point in time. So I decided to build the dataset myself.

What I found was not what I expected.

The Approach

The starting point was the excellent Vectara Hallucination Leaderboard.

This leaderboard evaluates LLMs on factual consistency and hallucination-related metrics. It’s widely referenced, but only provides a snapshot of current performance.

However, there’s an interesting detail:

The leaderboard is maintained as a table inside a GitHub README.

That means every change to the leaderboard is preserved in Git history.

So instead of taking a single snapshot, I reconstructed the leaderboard's entire history by:

  1. Traversing the Git commit history of the README
  2. Extracting the leaderboard table at each commit
  3. Building a time series dataset of model performance
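The steps above can be sketched in a few dozen lines. This is a minimal illustration, not the repository's actual extraction script: the column layout of the table and the repo path are assumptions, and a real run would need the leaderboard repository cloned locally.

```python
import re
import subprocess


def parse_leaderboard(markdown: str) -> dict[str, float]:
    """Extract {model_name: hallucination_rate_percent} from a markdown table.

    Assumes a layout like `| Model | Hallucination Rate | ... |` where the
    rate cell looks like `1.5 %` (hypothetical; adjust to the real columns).
    """
    rates: dict[str, float] = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue
        # Header and separator rows fail this match and are skipped.
        m = re.fullmatch(r"([\d.]+)\s*%", cells[1])
        if m:
            rates[cells[0]] = float(m.group(1))
    return rates


def leaderboard_history(repo_dir: str, readme: str = "README.md"):
    """Yield (commit_hash, rates) for every commit touching the README, oldest first."""
    hashes = subprocess.run(
        ["git", "-C", repo_dir, "log", "--reverse", "--format=%H", "--", readme],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for h in hashes:
        blob = subprocess.run(
            ["git", "-C", repo_dir, "show", f"{h}:{readme}"],
            capture_output=True, text=True, check=True,
        ).stdout
        yield h, parse_leaderboard(blob)
```

Pairing each commit hash with its author date (`git log --format=%H %aI`) then turns this into a proper time series.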

The result is a longitudinal dataset capturing how hallucination rates evolved across:

  • multiple model generations
  • different providers
  • a rapidly expanding ecosystem

You can find the dataset and code in the accompanying repository.

What the Data Shows

Figure: (a) Temporal evolution of mean and median hallucination rates across LLMs (smoothed), (b) Frontier hallucination rate (best-performing model) over time, (c) Temporal evolution of variance (standard deviation) in hallucination rates

To make sense of the data, I looked at three perspectives:

1. The Average (Mean)
What does the ecosystem look like overall?

2. The Median
What does the typical model look like?

3. The Frontier
What is the best-performing model at any given time?
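All three perspectives reduce to simple summary statistics per snapshot. A minimal sketch (the `stdev` field anticipates the variance discussion below; the dict-of-rates input shape is an assumption carried over from the extraction step):

```python
from statistics import mean, median, pstdev


def snapshot_stats(rates: dict[str, float]) -> dict[str, float]:
    """Summarise one leaderboard snapshot.

    mean     -- the ecosystem overall
    median   -- the typical model
    frontier -- the best model (lowest hallucination rate)
    stdev    -- spread across models (population std dev)
    """
    values = list(rates.values())
    return {
        "mean": mean(values),
        "median": median(values),
        "frontier": min(values),
        "stdev": pstdev(values),
    }
```

Applying this to every reconstructed snapshot and plotting each field over commit dates yields the three panels in the figure below.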

The Good News: Frontier Models Are Improving

At the cutting edge, things look exactly as expected.

  • Hallucination rates drop significantly through 2024 and early 2025
  • Improvements happen in stepwise jumps, not smooth progress
  • Performance stabilises at relatively low levels

If you only look at top models, the story is one of clear progress.

The Expected News: Typical Models Improve Slowly

The median tells a more grounded story:

  • Gradual improvement over time
  • No dramatic breakthroughs
  • Incremental gains rather than leaps

Most models are getting better, just not dramatically so.

The Surprising News: The Ecosystem Gets Worse

This is where things get interesting.

Around late 2025 / early 2026:

  • The average hallucination rate increases
  • The variance between models increases
  • Even the median begins to rise

This is a temporal reversal.

After a period of improvement, the overall ecosystem becomes less reliable.

What’s Going On?

At first glance, this seems counterintuitive.

How can individual models be improving while aggregate performance gets worse?

The answer lies in ecosystem dynamics.

The Key Insight: More Models Does Not Mean a Better Ecosystem

We’re seeing two things happen simultaneously:

1. Frontier improvement
Leading models continue to improve.

2. Ecosystem expansion
There has been an explosion of:

  • open-source models
  • fine-tuned variants
  • experimental architectures

The result:

The distribution of model quality widens.

In other words:

  • The best models get better
  • The average is dragged down by a growing tail of weaker entrants

Three Phases of LLM Hallucination Evolution

Looking at the timeline, three distinct phases emerge:

Phase 1 — Rapid Improvement (2023 → mid-2024)
Alignment breakthroughs and strong reductions in hallucination.

Phase 2 — Stabilisation (mid-2024 → mid-2025)
Convergence across models and slower, incremental progress.

Phase 3 — Divergence (late 2025 → 2026)
Rapid model proliferation, increasing variability, and decline in average reliability.

Why This Matters

This has important implications for how we think about LLMs.

1. Benchmarks Can Be Misleading

Most discussions focus on:

“What is the best model capable of?”

But in practice, organisations don’t interact with the frontier. They interact with a chosen model within a messy ecosystem.

The average matters.

2. Reliability Is Now a Selection Problem

We’re moving into a world where:

  • Capability is abundant
  • Reliability is uneven

The challenge is no longer:

“Can LLMs do this?”

But:

“Which LLM should I trust?”

3. Continuous Evaluation Becomes Essential

If the ecosystem is diverging:

  • Static benchmarking is insufficient
  • Point-in-time evaluation becomes outdated quickly

We need continuous monitoring, context-specific evaluation, and governance around model usage.

A Note on Methodology

This analysis is based on reconstructed leaderboard data, which introduces some limitations:

  • Commit timing is irregular
  • Model sets change over time
  • Data reflects leaderboard snapshots, not controlled experiments

So the results should be interpreted as indicative ecosystem trends, not precise longitudinal measurements.

Final Thought

The dominant narrative has been:

“LLMs are getting better.”

The data suggests a more nuanced reality:

LLMs are getting better at the top — but the ecosystem is becoming more chaotic.

And that shift may be more important than any single model improvement.

Explore the Data

GitHub Repository: https://github.com/ralfepoisson/llm-hallucination-trends

  • Full dataset
  • Extraction script
  • Graph generation code

If You Found This Interesting

I’d be curious to hear your thoughts:

  • Are we entering a post-benchmark era?
  • How should organisations manage model selection?
  • Is reliability becoming the new competitive advantage?
