{"id":298,"date":"2026-03-30T05:45:05","date_gmt":"2026-03-30T05:45:05","guid":{"rendered":"https:\/\/www.ralfepoisson.com\/?p=298"},"modified":"2026-03-30T05:45:06","modified_gmt":"2026-03-30T05:45:06","slug":"longitudinal-analysis-of-llm-hallucination-rates","status":"publish","type":"post","link":"https:\/\/www.ralfepoisson.com\/index.php\/2026\/03\/30\/longitudinal-analysis-of-llm-hallucination-rates\/","title":{"rendered":"Longitudinal Analysis of LLM Hallucination Rates"},"content":{"rendered":"<p class=\"lead\">For the past two years, the narrative around Large Language Models (LLMs) has been clear:<\/p>\n<blockquote><p>\u201cHallucinations are improving.\u201d<\/p><\/blockquote>\n<p>New models are released, benchmarks improve, and each iteration promises greater factual reliability. At the frontier, this is largely true.<\/p>\n<p>But when you step back and look at the ecosystem as a whole, a different picture begins to emerge.<\/p>\n<p>I recently set out to answer a simple question:<\/p>\n<blockquote><p>Are hallucination rates actually improving over time?<\/p><\/blockquote>\n<p>Surprisingly, I couldn\u2019t find any research that examined this longitudinally. Most studies look at performance at a single point in time. So I decided to build the dataset myself.<\/p>\n<p>What I found was not what I expected.<\/p>\n<h2>The Approach<\/h2>\n<p>The starting point was the excellent Vectara Hallucination Leaderboard.<\/p>\n<p>This leaderboard evaluates LLMs on factual consistency and hallucination-related metrics. It\u2019s widely referenced, but only provides a snapshot of current performance.<\/p>\n<p>However, there\u2019s an interesting detail:<\/p>\n<blockquote><p>The leaderboard is maintained as a table inside a GitHub README.<\/p><\/blockquote>\n<p>That means every change to the leaderboard is preserved in Git history.<\/p>\n<p>So instead of taking a single snapshot, I reconstructed the entire history of the leaderboard over time by:<\/p>\n<ol>\n<li>Traversing the Git commit history of the README<\/li>\n<li>Extracting the leaderboard table at each commit<\/li>\n<li>Building a time series dataset of model performance<\/li>\n<\/ol>\n<p>The result is a longitudinal dataset capturing how hallucination rates evolved across:<\/p>\n<ul>\n<li>multiple model generations<\/li>\n<li>different providers<\/li>\n<li>a rapidly expanding ecosystem<\/li>\n<\/ul>\n<p>You can find the dataset and code in the accompanying repository.<\/p>\n<h2>What the Data Shows<\/h2>\n<table>\n<tbody>\n<tr>\n<td><img decoding=\"async\" src=\"https:\/\/github.com\/ralfepoisson\/llm-hallucination-trends\/raw\/main\/assets\/hallucination-trends.png\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/github.com\/ralfepoisson\/llm-hallucination-trends\/raw\/main\/assets\/frontier-trends.png\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/github.com\/ralfepoisson\/llm-hallucination-trends\/raw\/main\/assets\/variation.png\" \/><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><i>Figure: (a) Temporal evolution of mean and median hallucination rates across LLMs (smoothed), (b) Frontier hallucination rate (best-performing model) over time, (c) Temporal evolution of variance (standard deviation) in hallucination rates<\/i><\/p>\n<p>To make sense of the data, I looked at three perspectives:<\/p>\n<p><strong>1. The Average (Mean)<\/strong><br \/>\nWhat does the ecosystem look like overall?<\/p>\n<p><strong>2. The Median<\/strong><br \/>\nWhat does the typical model look like?<\/p>\n<p><strong>3. 
## What the Data Shows

![Temporal evolution of mean and median hallucination rates](https://github.com/ralfepoisson/llm-hallucination-trends/raw/main/assets/hallucination-trends.png)
![Frontier hallucination rate over time](https://github.com/ralfepoisson/llm-hallucination-trends/raw/main/assets/frontier-trends.png)
![Temporal evolution of variation in hallucination rates](https://github.com/ralfepoisson/llm-hallucination-trends/raw/main/assets/variation.png)

*Figure: (a) Temporal evolution of mean and median hallucination rates across LLMs (smoothed), (b) frontier hallucination rate (best-performing model) over time, (c) temporal evolution of variation (standard deviation) in hallucination rates.*

To make sense of the data, I looked at three perspectives:

**1. The Average (Mean)**
What does the ecosystem look like overall?

**2. The Median**
What does the typical model look like?

**3. The Frontier**
What is the best-performing model at any given time?
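With the per-snapshot records in hand, each perspective is a one-line aggregation. A minimal sketch with pandas, assuming the `records` structure from the extraction sketch above (lower rate is better, so the frontier is the minimum):

```python
import pandas as pd

# records as produced by build_dataset() in the earlier sketch
df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"], utc=True)

# One row per leaderboard snapshot: ecosystem average (mean), typical
# model (median), frontier (lowest rate), and spread between models.
summary = (
    df.groupby("date")["rate"]
      .agg(mean="mean", median="median", frontier="min", spread="std")
      .sort_index()
)

# Commit timing is irregular, so smooth with a rolling window before plotting.
smoothed = summary.rolling(window=5, min_periods=1).mean()
```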
## The Good News: Frontier Models Are Improving

At the cutting edge, things look exactly as expected.

- Hallucination rates drop significantly through 2024 and early 2025
- Improvements happen in stepwise jumps, not smooth progress
- Performance stabilises at relatively low levels

If you only look at the top models, the story is one of clear progress.

## The Expected News: Typical Models Improve Slowly

The median tells a more grounded story:

- Gradual improvement over time
- No dramatic breakthroughs
- Incremental gains rather than leaps

Most models are getting better, just not dramatically so.

## The Surprising News: The Ecosystem Gets Worse

This is where things get interesting.

Around late 2025 / early 2026:

- The average hallucination rate increases
- The variance between models increases
- Even the median begins to rise

This is a reversal of the earlier trend.

> After a period of improvement, the overall ecosystem becomes less reliable.

## What's Going On?

At first glance, this seems counterintuitive.

How can individual models be improving while aggregate performance gets worse?

The answer lies in ecosystem dynamics.

## The Key Insight: More Models Do Not Mean a Better Ecosystem

We're seeing two things happen simultaneously:

**1. Frontier improvement**
Leading models continue to improve.

**2. Ecosystem expansion**
There has been an explosion of:

- open-source models
- fine-tuned variants
- experimental architectures

The result:

> The distribution of model quality widens.

In other words:

- The best models get better
- The average is dragged down by a flood of weaker entrants

## Three Phases of LLM Hallucination Evolution

Looking at the timeline, three distinct phases emerge:

**Phase 1: Rapid Improvement (2023 → mid-2024)**
Alignment breakthroughs and strong reductions in hallucination.

**Phase 2: Stabilisation (mid-2024 → mid-2025)**
Convergence across models and slower, incremental progress.

**Phase 3: Divergence (late 2025 → 2026)**
Rapid model proliferation, increasing variability, and a decline in average reliability.

## Why This Matters

This has important implications for how we think about LLMs.

## 1. Benchmarks Can Be Misleading

Most discussions focus on:

> "What is the best model capable of?"

But in practice, organisations don't interact with the frontier. They interact with a chosen model inside a messy ecosystem.

The average matters.

## 2. Reliability Is Now a Selection Problem

We're moving into a world where:

- Capability is abundant
- Reliability is uneven

The challenge is no longer:

> "Can LLMs do this?"

but:

> "Which LLM should I trust?"

## 3. Continuous Evaluation Becomes Essential

If the ecosystem is diverging:

- Static benchmarking is insufficient
- Point-in-time evaluation becomes outdated quickly

We need continuous monitoring, context-specific evaluation, and governance around model usage.
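As a toy illustration of what continuous monitoring could look like, here is a hedged sketch that flags models whose latest leaderboard rate has drifted well above their own historical median. The `df` frame comes from the aggregation sketch above; the 1.25 tolerance is an arbitrary placeholder, and real governance would evaluate against your own task data rather than a public leaderboard.

```python
def flag_regressions(df, tolerance=1.25):
    """Flag models whose latest rate exceeds tolerance x their own median."""
    latest = df["date"].max()
    flagged = []
    for model, hist in df.groupby("model"):
        baseline = hist["rate"].median()
        current = hist.loc[hist["date"] == latest, "rate"]
        if not current.empty and current.iloc[0] > tolerance * baseline:
            flagged.append((model, baseline, current.iloc[0]))
    return flagged
```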
## A Note on Methodology

This analysis is based on reconstructed leaderboard data, which introduces some limitations:

- Commit timing is irregular
- The set of models on the leaderboard changes over time
- The data reflects leaderboard snapshots, not controlled experiments

So the results should be interpreted as indicative ecosystem trends, not precise longitudinal measurements.

## Final Thought

The dominant narrative has been:

> "LLMs are getting better."

The data suggests a more nuanced reality:

> LLMs are getting better at the top, but the ecosystem is becoming more chaotic.

And that shift may be more important than any single model improvement.

## Explore the Data

GitHub repository: [https://github.com/ralfepoisson/llm-hallucination-trends](https://github.com/ralfepoisson/llm-hallucination-trends)

- Full dataset
- Extraction script
- Graph-generation code

## If You Found This Interesting

I'd be curious to hear your thoughts:

- Are we entering a post-benchmark era?
- How should organisations manage model selection?
- Is reliability becoming the new competitive advantage?