Read more of this story at Slashdot.
As the frontier model race accelerates, AI devotees are splitting their loyalty across the major providers at both the user and developer levels. Differences in inference are an accepted norm — but most assume that, at the highest level, frontier LLMs would agree on basic, real-world facts.
Except that’s not the case.
An analysis published this month on the claim-verification platform Lenz found that across 1,000 recent real-user fact-check claims — statements about the world asserted as true — a panel of five frontier LLMs split on 67% of them, meaning at least one model dissented from the majority verdict, or no clear majority formed at all.
The five models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, Sonar Pro) were each given the same real-world claim and asked to pick a verdict from a 4-bucket rubric (True / Mostly True / Misleading / False). Because only one bucket can be correct per claim, any disagreement among the panel means at least one model is label-inconsistent.
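To make the split definition concrete, here is a minimal sketch of how one panel's verdicts could be tallied. It assumes that a "clear majority" means a verdict chosen by more than half of the panel; the function names `panel_outcome` and `split_rate` are hypothetical, and Lenz's actual tallying code is not described in the article.

```python
from collections import Counter

# The four rubric buckets named in the article.
BUCKETS = ("True", "Mostly True", "Misleading", "False")


def panel_outcome(verdicts):
    """Classify one claim's panel verdicts.

    Assumption: a "clear majority" is a verdict picked by more than
    half of the panel. Returns one of "unanimous",
    "majority_with_dissent", or "no_majority".
    """
    if not verdicts:
        raise ValueError("panel must contain at least one verdict")
    for v in verdicts:
        if v not in BUCKETS:
            raise ValueError(f"unknown verdict: {v!r}")
    _, top_count = Counter(verdicts).most_common(1)[0]
    if top_count == len(verdicts):
        return "unanimous"
    if top_count > len(verdicts) / 2:
        return "majority_with_dissent"
    return "no_majority"


def split_rate(panels):
    """Fraction of claims on which the panel is not unanimous."""
    split = sum(1 for p in panels if panel_outcome(p) != "unanimous")
    return split / len(panels)
```

Under this reading, a claim counts as split whenever the outcome is anything other than unanimous, which covers both cases the analysis names: a dissent from the majority, and no clear majority at all.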
According to Lenz, the “split across these five models is intentional” because it covers the spread of inference modes that are common in production AI systems.
Inference modes range from latency-sensitive to throughput-aware, resource-constrained and scalable. They are typically divided into low-latency, high-throughput inference (e.g. for interactive chatbots) and offline or batch inference, where data is accumulated first and analyzed later, typically to optimize for cost.
Research informing the May 21 paper was led by Kosta Jordanov, founder of Lenz and co-founder of Wiser, an IT consulting and software engineering group headquartered in Sofia, Bulgaria.
Jordanov tells The New Stack that the claims his team used in the research are real claims that users have fact-checked on Lenz since February 15, 2026.
“We’ve excluded private claims, near-duplicate claims, and any claims containing personally identifiable information (PII),” Jordanov says. “The interesting thing about this corpus is that, unlike the standard benchmark questions, the models have not seen these claims during training, i.e., it’s a fresh real-world corpus across science, healthcare, politics, law, and other domains on topics that people care about and fact-check.”
Beyond the 67% dissent metric, 34% of the claims are substantially disagreed on (2+ buckets apart), and 21% are polar opposites (at least one model says False and at least one says True). At this level, we can start to see the path from dissent to disagreement having a real impact on live production AI systems and tools.
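The two severity levels can be sketched as a distance on the rubric. This assumes the buckets are ordered True, Mostly True, Misleading, False, and that "2+ buckets apart" refers to the gap between the two most distant verdicts in the panel; the article does not spell out the exact computation, and the function names are hypothetical.

```python
# Assumed ordering of the rubric, from most to least truthful.
ORDER = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def bucket_spread(verdicts):
    """Gap between the two most distant verdicts in the panel."""
    positions = [ORDER[v] for v in verdicts]
    return max(positions) - min(positions)


def is_substantial(verdicts):
    """True when at least two verdicts are 2+ buckets apart."""
    return bucket_spread(verdicts) >= 2


def is_polar(verdicts):
    """True when one model says True and another says False."""
    return "True" in verdicts and "False" in verdicts
```

With this ordering, every polar-opposite claim is also a substantial disagreement, and every substantial disagreement is also a split, which is consistent with the reported figures shrinking from 67% to 34% to 21%.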
The practical takeaway is that on real-world claims, a single frontier LLM gives one opinion from a visibly unstable distribution. A second model often gives another.
“For many applications, that’s fine,” Jordanov clarifies. “But if a software engineering team operates a system where legal, financial, or reputational risk is involved, and it delivers untrue or hallucinated content to users, you should think about the ways in which you validate the AI-generated content before it reaches users.”
Why, then, do frontier models converge confidently at the True/False poles but fracture badly on middle-ground verdicts? Unfortunately, that’s hard to answer based on this research. One hypothesis Jordanov puts forward is that the Mostly True and Misleading categories are more ambiguous than the True and False categories.
“What we measured, though, is that some models use the middle buckets way less often than others – Gemini is quite ‘confident’ and classified only 6% of the claims in the two middle buckets vs. 45% for Opus 4.7,” he says.
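A middle-bucket share like the one Jordanov quotes could be computed per model along these lines. This is a minimal sketch: the function name is hypothetical, and the sample verdicts in the usage note are illustrative rather than taken from the Lenz corpus.

```python
# The two "middle" buckets of the four-bucket rubric.
MIDDLE = {"Mostly True", "Misleading"}


def middle_bucket_share(verdicts):
    """Fraction of one model's verdicts that fall in the middle buckets."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(1 for v in verdicts if v in MIDDLE) / len(verdicts)
```

For example, a model that returned `["True", "Mostly True", "Misleading", "False"]` on four claims would have a middle-bucket share of 0.5.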
Claude Opus 4.7 (which had received early criticism) aligned with the peer majority least often, at 70%. Should that concern Anthropic?
“Not necessarily,” clarifies Jordanov. “Our limited preliminary research shows that the majority is often wrong, and sometimes we see wrong unanimous verdicts; i.e., having a different opinion than the majority does not necessarily mean being wrong.”
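One plausible way to compute a peer-majority alignment figure is sketched below. It assumes the "peer majority" is the verdict chosen by more than half of the other models on the panel, and that claims with no such majority are skipped. Both choices are assumptions, as are the function names; the article does not describe how Lenz handled ties.

```python
from collections import Counter


def peer_majority(verdicts_by_model, model):
    """Verdict held by more than half of the other models, else None."""
    peers = [v for m, v in verdicts_by_model.items() if m != model]
    if not peers:
        return None
    verdict, count = Counter(peers).most_common(1)[0]
    return verdict if count > len(peers) / 2 else None


def alignment_rate(claims, model):
    """Share of claims where `model` matches its peers' majority.

    `claims` is a list of dicts mapping model name to verdict.
    Claims without a peer majority are skipped (an assumption).
    Returns None if no claim has a peer majority.
    """
    agree = 0
    total = 0
    for verdicts_by_model in claims:
        majority = peer_majority(verdicts_by_model, model)
        if majority is None:
            continue
        total += 1
        if verdicts_by_model[model] == majority:
            agree += 1
    return agree / total if total else None
```

As Jordanov notes, this measures agreement only. A low alignment rate says a model often differs from its peers, not that it is wrong.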
This research does not use any “ground truths” (indisputable real-world facts that have been widely validated and verified) and only measures the differences between the models’ verdicts. It cannot answer which model is correct for which claim.
Academic and commercially underpinned model research appears to be turning to this space right now. A study by Eddie Yang and Dashun Wang at Cornell University published in February notes that benchmarks underpin how progress in large language models (LLMs) is measured and trusted.
“Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks — MMLU-Pro and GPQA — we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models,” wrote Yang & Wang in February.
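A toy example shows how equal accuracy can hide disagreement. The answer key and model outputs below are invented for illustration and are not drawn from MMLU-Pro or GPQA; the simple item-level disagreement measure is likewise an assumption about how such a rate might be computed.

```python
def accuracy(answers, key):
    """Fraction of items answered correctly."""
    return sum(1 for a, k in zip(answers, key) if a == k) / len(key)


def disagreement(answers_a, answers_b):
    """Fraction of items on which two models give different answers."""
    pairs = list(zip(answers_a, answers_b))
    return sum(1 for a, b in pairs if a != b) / len(pairs)


# Illustrative data only: four multiple-choice items.
key = ["A", "B", "C", "D"]
model_1 = ["A", "B", "C", "A"]  # wrong on the last item
model_2 = ["B", "B", "C", "D"]  # wrong on the first item

same_accuracy = accuracy(model_1, key) == accuracy(model_2, key)
rate = disagreement(model_1, model_2)
```

Here both models score 75%, yet they disagree on half of the items because their errors fall on different questions.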
Jordanov confirms that this analysis is only a first step.
“We do plan a follow-up where we measure the models against human-provided labels, and also measure the source-based multi-step multi-model Lenz pipeline against those labels and against the frontier models,” Jordanov says. “The time-consuming part is the methodologically correct labeling by human experts in all of those domains, but we aim to publish in the coming months.”
The report concludes by stating that the point of this work isn’t to create a leaderboard.
The point is to map the structure of disagreement, i.e., where do frontier panels systematically diverge from a human consensus, where does Lenz diverge from both, how each individual model and Lenz align with the same human reference, and what categories of claims drive each kind of divergence (rubric ambiguity, temporal framing, domain specialization, calibration drift).
The post Why GPT-5.4, Claude, and Gemini can’t agree on basic, real-world facts appeared first on The New Stack.

JetBrains announced a streamlined default structure for Kotlin Multiplatform (KMP) projects, replacing the all-in-one composeApp module with a shared module and separate app modules for each platform.
The shared module contains common code, while platform-specific modules such as androidApp, desktopApp and webApp depend on it. This change clarifies module responsibilities and aligns projects with modern build conventions.
Developers can further split shared code into sharedLogic and sharedUI when using native UIs (for example, SwiftUI on iOS). Projects that include a server now gain a dedicated server module and a core module for shared models and validation logic.
The restructuring addresses issues in the previous template, which mixed multiplatform library code and application configuration, making it hard to tell where to place platform-specific settings. It also prepares projects for Android Gradle Plugin 9.0 (AGP 9), which requires the Android entry point to be in its own module.
The new setup is already available via JetBrains’ project wizard, and migration of existing projects is optional except for AGP 9.0-specific changes, which are mandatory for Android targets.
Developers can explore the new template at JetBrains’ KMP wizard and reference the migration guide for existing projects.
More information can be found on the official blog post:
https://blog.jetbrains.com/kotlin/2026/05/new-kmp-default-structure/