A Top Law Firm Shipped 42 AI Hallucinations to a Judge. Here's What That Means for You.
Sullivan & Cromwell apologized to a federal judge after a court filing contained 42 AI-generated errors: fabricated citations, misquoted statutes, authorities that don't exist. The opposing firm caught them. Here's what that means for anyone using AI for research, reports, or client-facing work.
By Forge Team
If one of the most resourced law firms in the world can submit a court filing with 42 AI-generated errors — fabricated case citations, misquoted statutes, legal authorities that do not exist — and have the opposing counsel catch them, the gap between "AI produced it" and "it is ready to send" is a professional liability. Verifying AI output before submission is not extra work. It is the skill that protects your credibility now that AI tools are standard equipment in most professional roles.
What happened and why it matters
Sullivan & Cromwell — which represents clients including Goldman Sachs and Morgan Stanley — apologized to a federal bankruptcy judge after a court filing contained 42 AI-generated inaccuracies (Bloomberg, The Neuron, Apr 21). The errors were not minor formatting issues. They included fabricated case citations, misquoted statutory language, and legal authorities that simply do not exist. The opposing firm found them. A formal court apology followed.
The same week, Anthropic published a postmortem on Claude Code: three engineering bugs had silently degraded the quality of outputs for a full month. Users noticed the work felt worse but could not confirm it — Anthropic's initial response denied any problem. The resulting Hacker News thread hit 930 points. Separately, developer Simon Willison highlighted that OpenAI's own prompting guide for GPT-5.5 advises users not to carry over old prompts to new models — the same prompt can produce meaningfully different results across model versions.
The common thread across all three: AI output quality can shift without announcement, and trusting consistent-looking output is no longer a safe strategy. The model that worked well last week may be producing something subtly different today. Your process has to account for that.
What to do differently on Monday
Build one verification check into every AI-assisted work product before it leaves your desk — not a full audit of every interaction, but a targeted check on the three categories that carry the most professional risk if wrong:
- Specific numbers or statistics. Trace them to a named source. If you cannot locate the source, the number does not go out.
- Named external citations. Check that they exist, that the title and author are correct, and that the document actually says what the AI claims.
- Summaries of documents the reader will not see. Re-read the original long enough to confirm the framing is right, not just the phrasing.
For most professionals, this check takes under ten minutes. It is the ten minutes that keeps your credibility intact.
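If the draft lives in a plain text file, a few lines of code can surface everything that needs a manual trace. The sketch below is an illustration under assumptions, not a verifier: the file name, regex patterns, and command-line usage are invented for the example, and the actual checking is still done by a person.

```python
# pre_send_check.py -- a minimal sketch of a pre-send flag pass (assumes a plain-text draft).
# It verifies nothing on its own; it only surfaces numbers and citation-like strings
# so each one can be traced to a named source by hand before the document goes out.
import re
import sys

NUMBERS = re.compile(r"\d[\d,.]*\s*(?:%|percent|million|billion)?")
# Rough, illustrative citation shapes: "Smith v. Jones", "(Author, 2021)", "et al."
CITATIONS = re.compile(r"[A-Z][a-z]+ v\. [A-Z][a-z]+|\([A-Z][^()]*,\s*\d{4}\)|et al\.")

def flag_for_review(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            numbers = NUMBERS.findall(line)
            citations = CITATIONS.findall(line)
            if numbers or citations:
                print(f"line {lineno}: check numbers {numbers} and citations {citations}")

if __name__ == "__main__":
    flag_for_review(sys.argv[1])  # e.g. python pre_send_check.py draft.txt
```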
When the output goes to a client
A research analyst at a 55-person strategy consultancy prepares four or five competitive intelligence summaries each week, each pulling from three to five industry reports. She uses AI to draft the synthesis — structure is clean, language is tight, and the turnaround is fast.
The error pattern she caught: the AI occasionally swaps statistics between sources. A revenue figure from the European segment appears attributed to North America. Both numbers exist in the underlying documents. The attribution is wrong. A client would not know it; she would.
Her fix: for every specific number that appears in a final summary, she traces it to its source document before it goes out. The AI still does the drafting. She runs one tracing pass. It takes eight minutes and has caught something on roughly one in five summaries.
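Her pass is manual, but when the summary and the underlying reports are available as plain text, the same idea can be sketched in a few lines. The file layout and helper name below are assumptions for illustration; the script only reports which reports contain each figure, which is exactly the information needed to check an attribution by hand.

```python
# trace_numbers.py -- a sketch of the number-tracing pass (file names and layout are assumed).
# For each number in the draft summary, list which source reports contain it. A figure
# that appears in no report, or only in a report it is not attributed to, is the one to chase.
import re
from pathlib import Path

NUMBER = re.compile(r"\d[\d,.]*\d|\d")

def trace_numbers(summary_path: str, sources_dir: str) -> None:
    summary = Path(summary_path).read_text(encoding="utf-8")
    sources = {p.name: p.read_text(encoding="utf-8") for p in Path(sources_dir).glob("*.txt")}
    for number in sorted(set(NUMBER.findall(summary))):
        hits = [name for name, text in sources.items() if number in text]
        print(f"{number}: {', '.join(hits) if hits else 'NOT FOUND in any source report'}")

if __name__ == "__main__":
    trace_numbers("summary.txt", "reports")  # hypothetical paths
```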
Practice catching source errors before they leave your desk.
When polished format hides incomplete work
A contracts administrator at a 280-person insurance carrier uses AI to draft preliminary reviews of vendor agreements before they go to in-house counsel. The drafts are explicitly intended as a starting point — not finished analysis, just the first pass.
Three months in, she noticed counsel was returning the drafts with fewer markups. She assumed her prompts had improved. The real reason: the AI's formatting — structured headers, numbered clause references, confident tone — had started signaling "finished work" even when the underlying review was incomplete. Counsel was reading less carefully because the document looked authoritative.
She now adds a standard header to every AI-assisted draft: "First-pass extraction only. All cited clauses require verification before reliance." That one line resets the review threshold. The drafts are still useful. The person reading them now treats them correctly.
Define what your output needs to get right before you write the prompt.
The professional standard has shifted
Sullivan & Cromwell has the resources to catch 42 errors before they reach a judge. The firm did not catch them. Opposing counsel did. If that filing had gone unchallenged, those fabrications might have held — at least for a while.
Your verification habit should not depend on the model being consistent or on someone else catching the error. The professional who checks is the one whose work holds up.
Run two outputs side by side and see where the differences actually matter.
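If both outputs are saved as text files, the lowest-effort way to do that is an ordinary diff. Below is a minimal sketch with assumed file names; it only surfaces where the drafts diverge, and deciding which differences matter is still your call.

```python
# compare_outputs.py -- a sketch for putting two AI outputs side by side (file names assumed).
import difflib
from pathlib import Path

def compare(path_a: str, path_b: str) -> None:
    a = Path(path_a).read_text(encoding="utf-8").splitlines()
    b = Path(path_b).read_text(encoding="utf-8").splitlines()
    # Unified diff: lines starting with - appear only in the first draft, + only in the second.
    for line in difflib.unified_diff(a, b, fromfile=path_a, tofile=path_b, lineterm=""):
        print(line)

if __name__ == "__main__":
    compare("draft_model_a.txt", "draft_model_b.txt")
```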
Like this post?
Get the next one in your inbox. Practical AI skills, no filler.