AI Found 18 Rare Diagnoses Specialists Had Missed. The Workflow Is Worth Stealing.
A NEJM AI study of 376 unresolved cases shows what the best AI collaboration pattern looks like in practice: AI generates hypotheses, humans decide what to do with them. Here is the structure worth applying to any analysis-heavy work.
By Forge Team
The collaboration model that found 18 previously unsolved rare diagnoses is worth stealing, and not because your work involves rare diseases. The structure that made it work applies to any task where AI can generate better candidates than a human would produce alone but cannot make the final call.
What the study actually showed
On June 18, 2026, OpenAI and Boston Children's Hospital published a study in NEJM AI covering 376 cases where specialist physicians had reached a diagnostic dead end. The researchers used o3 Deep Research as a hypothesis generator: the model surfaced evidence-linked possibilities, ranked by the strength of supporting evidence, with citations attached to each.
Physicians then took those hypotheses into the real world. They ordered tests, applied clinical judgment, and ruled out candidates that did not fit the full picture. The result was a 4.8% additional diagnostic yield: 18 of the 376 stalled cases ended with an answer.
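The yield figure follows from the two counts the study reports. A quick check, using only those numbers (the variable names are ours, for illustration):

```python
# Counts as reported in the study summary above.
unresolved_cases = 376
new_diagnoses = 18

# Additional diagnostic yield, as a percentage of the unresolved cases.
yield_pct = 100 * new_diagnoses / unresolved_cases
print(f"{yield_pct:.1f}%")  # 4.8%
```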
No autonomous diagnosis. Every call was made by a human physician. The AI's job was to make the hypothesis pool bigger and better sourced than any individual specialist's memory could supply, in less time than a literature search would allow.
The four-position workflow
The pattern in the study has a structure:
Generate: AI produces a list of candidates — hypotheses, patterns, possibilities — faster than a human would, and without the recency bias that shapes human recall.
Filter: A human decides which candidates are worth pursuing. This is where domain experience is non-negotiable. The model does not know which hypothesis is worth the cost of testing it; the physician does.
Test: The human (or their team) does the verification work — running an experiment, ordering a scan, reviewing a clause, making a call.
Confirm: A human makes the final determination. The AI does not close the loop.
Professionals who use AI well have internalized this sequence. They do not hand Filter back to the model. They do not let Generate run straight into Confirm.
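For teams that script their AI-assisted work, the sequence can be made explicit in code. What follows is a minimal sketch, not the study's implementation: the Candidate type, the run_workflow function, and every callback name are hypothetical, and the human steps are stubbed with simple functions where a real workflow would pause for a person. The point is structural: the model supplies only the generate step, and the other three steps are passed in separately so they cannot be silently handed back to it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    label: str      # short name of the hypothesis (hypothetical field)
    evidence: str   # citation or note the generator attached
    score: float    # generator's own ranking of evidence strength


def run_workflow(
    generate: Callable[[], List[Candidate]],
    human_filter: Callable[[Candidate], bool],
    human_test: Callable[[Candidate], bool],
    human_confirm: Callable[[List[Candidate]], Optional[Candidate]],
) -> Optional[Candidate]:
    # Generate: the only stage the model owns. Rank by its evidence score.
    candidates = sorted(generate(), key=lambda c: c.score, reverse=True)
    # Filter: a person decides which candidates justify the cost of testing.
    shortlisted = [c for c in candidates if human_filter(c)]
    # Test: verification happens outside the model.
    supported = [c for c in shortlisted if human_test(c)]
    # Confirm: a person makes the final determination, or declines to.
    return human_confirm(supported)


# Stubbed example; in practice each human_* callback is a review step.
def stub_generate() -> List[Candidate]:
    return [
        Candidate("hypothesis A", "source 1", 0.4),
        Candidate("hypothesis B", "source 2", 0.9),
        Candidate("hypothesis C", "source 3", 0.7),
    ]


result = run_workflow(
    generate=stub_generate,
    human_filter=lambda c: c.label != "hypothesis C",  # reviewer drops C
    human_test=lambda c: c.label == "hypothesis A",    # only A survives testing
    human_confirm=lambda supported: supported[0] if supported else None,
)
print(result.label if result else "no confirmed candidate")  # hypothesis A
```

Note that the top-ranked candidate (B) is not the one confirmed. The generator's ranking orders the list; it does not decide the outcome.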
Marcus: the researcher who stops before acting
Marcus runs market research at an 85-person B2B SaaS company. His team uses AI to analyze customer interview transcripts — 200 interviews from the previous two quarters. The AI surfaces patterns: specific friction points, feature requests that cluster, language customers use repeatedly.
Marcus does not act on everything the AI returns. He reads the list, removes the obvious noise, flags the patterns that match what he has been hearing in discovery calls, and presents two or three to the product team with his own interpretation layered on top. The AI gave him a better starting list than he would have built from memory after reading the same transcripts himself. Marcus is still the one who decides which patterns are real and what to do about them.
For a research or analysis task you run regularly, map where human review currently happens — and where it should happen but currently doesn't.
Priya: the compliance manager who triages, not rubber-stamps
Priya manages legal operations at a 300-person financial services firm. Her team uses AI to review supplier contracts, flagging potentially problematic clauses across 60 agreements per month. The model returns lists: jurisdiction mismatches, liability caps, IP ownership ambiguities, auto-renewal traps.
Her legal team does not treat every flag equally. Senior counsel reads the list, dismisses the clear false positives, and escalates the flags that warrant detailed review. Without AI, this screening step averaged three days per contract. With it, the flagging takes hours. The human work — deciding which flags are real risks — takes roughly the same amount of time it always did. But it is now better sourced and more consistent.
The model did not reduce the amount of human judgment required. It increased the quality of the evidence that judgment acts on.
Map one AI-assisted workflow your team runs. Mark every point where a human confirms, filters, or decides before the process continues.
The boundary that defines the skill
The physicians in the Boston Children's Hospital study knew exactly where o3's job ended and theirs began. Generating ranked hypotheses was the AI's work. Clinical judgment (which hypotheses to test, how to test them, what the results meant) was theirs. That clarity was not a constraint on the AI. It was a design choice by the researchers, and it is what let the collaboration produce 18 answers that specialist effort alone had not reached across 376 cases.
The question for your work is the same one they answered before starting: where in this workflow does your judgment stay non-negotiable?