The Supervision Gap: Three Practitioners Agree — If You've Stopped Checking AI, You're Not Working, You're Vibing.
Simon Willison admitted he skips reviewing AI code for production systems. A Stockholm cafe let AI order 120 eggs for a kitchen with no stove. Ethan Mollick says we don't yet have words for how multi-agent systems fail. The pattern is the same: as AI gets more capable, the temptation to stop checking grows — and the cost of not checking doesn't shrink.
By Forge Team
The risk of using AI at work is not that it will refuse. It is that it will comply — confidently, at length, with nothing obviously wrong — on a task it has already misunderstood. And the better models get, the harder that is to spot. Three practitioners pointed at the same gap from different angles this week.
What happened this week
Simon Willison published a reflection (simonwillison.net, May 6) admitting he increasingly skips reviewing AI-generated code even for production systems: "if I haven't reviewed the code, is it really responsible for me to use this in production?" The day before, he documented an AI-managed Stockholm cafe that ordered 120 eggs for a kitchen with no stove, 22.5 kg of tomatoes for a sandwich shop sized for a fraction of that volume, and applied for a street vending permit using a self-generated sketch of a street it had never visited (simonwillison.net, May 5).
Ethan Mollick observed this week (X, May 9) that multi-agent workflows create a new "jagged frontier" of failure — failure modes so unfamiliar that professionals don't yet have vocabulary for them. Andrej Karpathy's framework (bearblog, Apr 30) named the two poles: "vibing" — accepting whatever AI produces — versus "agentic engineering" — setting constraints, verifying output, and maintaining your own standards.
The three takes converge on the same observation: as AI gets more capable, the temptation to stop checking grows. But the cost of an unchecked error doesn't shrink as the model improves.
What to do differently on Monday
Three rules for setting your supervision depth before you start — not after the output arrives.
External work needs a human checkpoint. Anything going to a client, a supplier, a regulator, or a press contact needs a human eye before it leaves. The AI cannot be held accountable for what you submit. You can.
Internal work can tolerate lighter review — with conditions. First drafts, working notes, internal summaries: the risk of a mistake is bounded and fixable. Review proportionally, not uniformly. A mistake in an internal Slack message is correctable. A mistake in a client brief is not.
The more capable the AI seems, the more deliberately you need to choose your review depth. If you are reviewing less because outputs look better, you made that decision by default. Make it on purpose instead. Convenience is not a supervision strategy.
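For teams that script their AI tasks, the three rules can be sketched as a small gate that runs before any prompt is written. This is a minimal illustration, not a prescribed tool: the names `Destination`, `ReviewDepth`, `choose_review_depth`, and `plan_task` are hypothetical, and the two-way internal/external split is an assumption drawn from the rules above.

```python
from enum import Enum


class Destination(Enum):
    """Where the output will end up (hypothetical two-way split)."""
    INTERNAL = "internal"   # first drafts, working notes, internal summaries
    EXTERNAL = "external"   # clients, suppliers, regulators, press contacts


class ReviewDepth(Enum):
    """How much human review the output gets (hypothetical labels)."""
    COHERENCE_CHECK = "coherence check"
    FULL_VERIFICATION = "full verification"


def choose_review_depth(destination: Destination) -> ReviewDepth:
    """Rules 1 and 2: external work gets a full human check, internal work a lighter one."""
    if destination is Destination.EXTERNAL:
        return ReviewDepth.FULL_VERIFICATION
    return ReviewDepth.COHERENCE_CHECK


def plan_task(description: str, destination: Destination) -> dict:
    """Rule 3: destination has no default value, so the caller must choose it on purpose."""
    depth = choose_review_depth(destination)
    return {
        "task": description,
        "destination": destination.value,
        "review": depth.value,
    }


if __name__ == "__main__":
    print(plan_task("Summarize Q3 planning notes", Destination.INTERNAL))
    print(plan_task("Draft client regulatory briefing", Destination.EXTERNAL))
```

The design choice that matters is the missing default on `destination`: the review depth cannot be left to inertia, because the task cannot be planned until someone names where the output is going.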
When the outputs looked fine until they weren't
Maria heads legal operations at a 120-person professional services firm. Her team started using an AI research assistant to draft regulatory summaries for client briefings. The outputs were detailed, well-structured, and consistently passed an initial read. After three months, the team had quietly stopped verifying citations.
A client flagged an error: a briefing cited a regulation that had been amended six months earlier. The AI had retrieved the original version and missed the update. The amendment affected the client's decision.
Maria's team now classifies every output by destination before drafting starts. Internal memos and working notes: reviewed for coherence, not verified to source. Anything client-facing: one person checks every regulatory reference before it leaves. That classification didn't change the AI's accuracy. It changed where human attention goes — and it made the decision explicit rather than left to inertia.
Match your review depth to where the output is going — before you write the prompt.
The team that moved fast on the wrong outputs
Tom runs content at a 55-person SaaS company. He uses AI to draft internal performance reports, the kind that inform quarterly planning but stay inside the company. He cut review time on those from 40 minutes to 8, and the process works.
He does not apply the same approach to press releases, customer case studies, or anything that names a specific client or cites a specific metric. Those go through full review before publication. A factual error in an internal report can be corrected at the next all-hands. A factual error in a published case study cannot be corrected quietly.
Tom's rule is simple: before any AI task, he decides which category it belongs to. That decision comes before the prompt, not after the output. The people accumulating invisible risk are not working faster — they are outsourcing a judgment call to whatever is most convenient in the moment.
Build checkpoints into your AI workflow before a deliverable goes external.
The actual gap
The Stockholm cafe's AI didn't malfunction. It did exactly what it was asked — ordered ingredients, drafted a permit application, scheduled a launch. Nobody had defined what a human needed to check before each step went live. Willison's confession is the same pattern at a different scale: the code worked until it didn't, and the review that would have caught it had quietly stopped happening.
Vibing and agentic engineering produce different outcomes not because one uses better tools, but because one makes deliberate choices about where human judgment stays in the loop.
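One way to picture a defined checkpoint is a workflow runner that refuses to let an external step go live without a human decision. The sketch below is an assumption-laden illustration, not a description of how the Stockholm cafe or any real agent framework works: `Step`, `run_workflow`, and the `approve` callback are hypothetical names, and the example steps are invented for the demonstration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step in a multi-step AI workflow (hypothetical structure)."""
    name: str
    goes_external: bool          # True if the result leaves the organization
    action: Callable[[], str]    # produces the draft the AI wants to send


def run_workflow(steps: List[Step], approve: Callable[[str, str], bool]) -> List[str]:
    """Run steps in order and return the names of the steps that were released.

    External steps are released only if approve(name, draft) returns True.
    The workflow stops at the first rejected step so later steps cannot
    build on an unapproved one.
    """
    released = []
    for step in steps:
        draft = step.action()
        if step.goes_external and not approve(step.name, draft):
            break
        released.append(step.name)
    return released


if __name__ == "__main__":
    steps = [
        Step("internal prep notes", False, lambda: "draft menu ideas"),
        Step("supplier order", True, lambda: "120 eggs"),
        Step("permit application", True, lambda: "street sketch"),
    ]

    # Stand-in for a human reviewer: rejects the egg order for a kitchen with no stove.
    def reviewer(name: str, draft: str) -> bool:
        return "120 eggs" not in draft

    print(run_workflow(steps, reviewer))  # ['internal prep notes']
```

In practice `approve` would be a person looking at the draft, not a function. The point of the sketch is only that the checkpoint is declared per step, before the workflow runs, rather than discovered after something has already been sent.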
Practice deciding where human review fits in a multi-step AI workflow.