AI Skills · June 4, 2026 · 5 min read

If You're Using One AI to Fact-Check, You're Not Fact-Checking

Research published last week showed frontier AI models disagree with each other on 67% of real-world fact-checking claims — 77% in legal contexts, 71% in health. A second model isn't verification. It's a second opinion that contradicts the first most of the time.

By Forge Team

If you're asking a second AI to check the output of the first, the research published last week suggests you're not verifying anything. Frontier large language models disagree with each other on 67% of real-world fact-checking claims. In legal contexts that figure rises to 77%. In health, 71%. A second model doesn't confirm the first — it gives you another opinion that contradicts the first roughly two-thirds of the time.

The practical implication arrives before the methodology: any workflow that depends on AI-checked AI output for factual claims is not a verification workflow. It's sequential delegation.

What the research found

Research shared on Hacker News on May 28 (504 points) examined how frontier models handle real-world fact-checking tasks: the kind of claims professionals routinely use AI to assess, such as figures, characterisations, summaries, and attributed statements.

The models agreed most reliably on extreme verdicts — claims clearly true or clearly false. Where they diverged was on nuanced claims: figures taken slightly out of context, characterisations that depend on which data you weight, summaries that compress in ways that may or may not be accurate. The researchers rated single-model reliability as "nontrivial but limited."

That phrase matters. The models aren't randomly wrong. They fail in predictable domains — and they fail inconsistently with each other on precisely the judgments that are hardest to call.

What this changes on Monday

The workflow that breaks first is the research summary review: you receive a long document, ask AI to summarise it, then ask a second AI whether the summary is accurate. That's not a check. Both models reason from the same surface patterns; the second won't reliably catch what the first missed.

What does work:

For important factual claims: go to the primary source. AI can find it for you — "which paragraph in this document does that figure come from?" — but reading the passage yourself is the verification step, not asking a model to confirm the passage.

For AI-generated content you'll put your name on: run the specific claims — not the whole document — through multiple models with the prompt "what is the most likely error in this?" Compare what surfaces. Disagreement tells you where to look; it doesn't tell you which model is right.

For high-stakes domains (legal, health, finance): treat AI output as a first draft of research, not a verified finding. The reported disagreement rates are highest in legal (77%) and health (71%) contexts, exactly where a wrong call costs the most.
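The comparison step in the second point can be sketched in a few lines of Python. This is a minimal illustration, not a tool: the model names, claims, and verdict labels are made-up placeholders, and collecting the verdicts from each model is left out. It shows only the triage: claims where the models split get routed to the primary source.

```python
# Hypothetical verdicts: claim -> {model name -> verdict}. Every name and
# value here is an illustrative placeholder, not real model output.
verdicts = {
    "The market size figure": {"model_a": "true", "model_b": "true", "model_c": "true"},
    "The effective date": {"model_a": "true", "model_b": "false", "model_c": "true"},
}


def claims_to_check(verdicts):
    """Return the claims on which the models do not all give the same verdict."""
    return [
        claim
        for claim, by_model in verdicts.items()
        if len(set(by_model.values())) > 1
    ]


for claim in claims_to_check(verdicts):
    # Disagreement says where to look, not which model is right.
    print(f"Read the primary source for: {claim}")
```

A claim that every model agrees on is not verified by that agreement; it simply drops lower in the reading queue.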

Jamie: the research summary problem

Jamie is a senior marketing manager at a 60-person B2B software company. Her team uses AI to distill analyst reports and industry studies into internal briefing notes. After a product launch briefing where a market size figure turned out to be from a 2023 study — the model hadn't flagged the date — she changed the workflow.

She now asks the model to produce summaries with inline source citations for each key figure: not footnotes, but "per Gartner Q3 2025" in the sentence itself. For any figure her team will use externally, she opens the cited source and reads the specific paragraph. The AI still does the heavy lifting. She stops treating it as the final authority on what the document says.
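A rough way to enforce that rule before a note goes out is to scan the summary for sentences that contain a figure but no inline "per ..." attribution. The Python sketch below rests on assumptions invented for this example: the regular expressions, the function name, the sample text, and the source name "Example Analyst" are all hypothetical, and a real briefing note would need patterns tuned to its own citation style.

```python
import re

# Assumed conventions for this sketch: a "figure" is any run of digits
# (optionally with separators or a % sign), and a citation is the word
# "per" followed by a capitalised source name.
FIGURE = re.compile(r"\d[\d,.]*%?")
CITATION = re.compile(r"\bper [A-Z]")


def uncited_figures(summary):
    """Return sentences that contain a figure but no inline 'per Source' citation."""
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [s for s in sentences if FIGURE.search(s) and not CITATION.search(s)]


# Illustrative text only; the numbers and the source are placeholders.
sample = (
    "The segment is worth 4 billion dollars, per Example Analyst Q3 2025. "
    "Adoption rose 30% year over year. "
    "Buyers cite integration as the main barrier."
)
print(uncited_figures(sample))
```

The check only finds figures with no citation attached. Whether a cited source actually says what the summary claims is still settled by opening it and reading the paragraph.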

Priya: the compliance verification problem

Priya is a compliance manager at a 150-person financial services firm. Her team uses Claude to review regulatory update summaries before distributing them to advisors. When she learned about the 77% disagreement rate in legal fact-checking, she ran a test: she gave the same summary to three different models and asked each to flag errors.

They flagged different things. Two caught a date inconsistency; one missed it and flagged something the other two hadn't. The majority-vote approach (act on what at least two models agree on) caught the date error but produced a false positive on the third flag. What the test showed her: AI can expand the review surface, but final authority on regulatory accuracy stays with the person who reads the source document.
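A test like Priya's can be reproduced as a simple tally. The Python sketch below uses invented model names and flag labels (none come from her firm's setup) to show why a majority rule is only a sorting step: flags with two or more votes get reviewed first, while lone flags are kept for a human read instead of being discarded.

```python
from collections import Counter

# Hypothetical flags raised by three models on the same summary.
# Model names and flag labels are placeholders for illustration.
flags_by_model = {
    "model_a": {"date inconsistency"},
    "model_b": {"date inconsistency"},
    "model_c": {"ambiguous threshold"},
}


def sort_flags(flags_by_model, quorum=2):
    """Split flags into those raised by at least `quorum` models and the rest."""
    counts = Counter(flag for flags in flags_by_model.values() for flag in flags)
    agreed = sorted(f for f, n in counts.items() if n >= quorum)
    lone = sorted(f for f, n in counts.items() if n < quorum)
    return agreed, lone


agreed, lone = sort_flags(flags_by_model)
print("Review first:", agreed)
print("Still needs a human read:", lone)
```

Neither list is a verdict. Both are reading assignments against the source document.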

The verification step that still works

The models are good at one part of this: generating the right question to take to the primary source. "What does this document actually say about X?" is a useful prompt. The answer points you to a paragraph. Reading that paragraph is the verification step — not asking a model whether the paragraph says what you think it says.

The 67% disagreement finding doesn't mean AI is useless for factual work. It means using AI to verify AI is like asking two witnesses who were in the same room to independently confirm what they heard — they'll agree on obvious things and fracture on the details that matter most. The skill is knowing which step you still own.
