AI Skills · June 13, 2026 · 5 min read

80% of Anthropic's Code Is Written by AI. The Supervision Problem Is Yours Now Too.

Anthropic revealed that AI writes 80% of the company's code, and called for a global pause option in case something goes wrong at scale. If you're using AI for any workflow output that matters, you already face the same question: how do you maintain oversight when volume outpaces review?

By Forge Team

When the volume of AI-generated work in your workflows exceeds what you can read before it goes out, you have two choices: design a review system, or accept that you're not reviewing. The first requires intention. The second happens by default.

What Anthropic published this week

On June 4, Anthropic published engineering details revealing that AI now writes 80% of the code at Anthropic, the company building the AI. In the same post, they called for a global "pause option": a mechanism to halt AI systems if something goes wrong at a scale that outpaces human ability to catch it in real time. The Hacker News thread reached 520 points. One comment cut through the noise: "lines merged" isn't the same as "thinking done." Even Anthropic's engineers aren't treating AI-merged code as reviewed code. They're designing for the fact that review won't always keep up, rather than assuming it will. (Source: Anthropic engineering blog, June 4, 2026.)

The supervision problem isn't special to AI companies

If you're managing any workflow where AI contributes output — reports, client-facing communications, data analysis, product copy — you're already in a version of this position. The question isn't whether AI makes mistakes. It does. The question is what your review process looks like when you can't read everything before it goes out.

Most teams' default is informal: check what you get to, trust the rest, fix problems reactively. That's not oversight. That's hope.

Three tiers to choose among for any AI-assisted workflow:

  1. Full review. What output types require human approval before they leave? (Usually: anything irreversible, high-stakes, or public-facing with your name on it.)
  2. Sampling. What can you review a fraction of (say, 20%) and still flag anomalies? (Usually: repetitive, lower-stakes outputs where a pattern of errors is more dangerous than a single error.)
  3. Post-send audit. What can you catch after the fact from outcome data? (Usually: volume outputs where downstream signals — complaint rates, escalations, open rates — surface problems faster than pre-send review.)

The tier isn't determined by how much you trust the AI. It's determined by the consequences of a specific output being wrong.
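One way to make that rule concrete is to write the tier decision down as a small function. The sketch below is a minimal illustration, not a standard: the field names and the ordering of the checks are assumptions, and sampling is treated as the default when neither of the other tiers applies. Note that nothing in it asks how much you trust the AI.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FULL_REVIEW = "full review"
    SAMPLING = "sampling"
    POST_SEND_AUDIT = "post-send audit"


@dataclass(frozen=True)
class OutputType:
    # All field names are hypothetical; they paraphrase the three questions above.
    name: str
    irreversible: bool = False
    high_stakes: bool = False
    public_with_your_name: bool = False
    # True when complaint rates, escalations, or open rates would surface a
    # problem faster than reading the output before it goes out.
    downstream_signal_is_faster: bool = False


def assign_tier(output: OutputType) -> Tier:
    """Pick a tier from the consequences of one output being wrong."""
    if output.irreversible or output.high_stakes or output.public_with_your_name:
        return Tier.FULL_REVIEW
    if output.downstream_signal_is_faster:
        return Tier.POST_SEND_AUDIT
    # Assumption: sampling is the default for everything else.
    return Tier.SAMPLING


# Illustrative output types, not taken from any real workflow.
examples = [
    OutputType("signed contract summary", irreversible=True),
    OutputType("internal status digest"),
    OutputType("bulk newsletter variant", downstream_signal_is_faster=True),
]

for item in examples:
    print(f"{item.name}: {assign_tier(item).value}")
```

The consequence checks come first on purpose: an output that is both high-stakes and easy to monitor afterward still lands in full review.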

Laura: the marketing manager running 40 emails a quarter

Laura manages content at a 180-person healthcare software company. She uses AI to draft the quarterly product update emails — 40 versions across customer segments, each adjusted for role and company size. She can't read all 40 before they send.

Her oversight design: enterprise-tier emails (8 per quarter) get full review, meaning she reads every line before the send. Mid-market emails (20 per quarter) get sampling: she reads 4, checks tone and clinical accuracy, and flags anything that reads as templated rather than specific. SMB-tier emails (12 per quarter) get sampling plus a post-send audit: she monitors open rates and support escalations in the 48 hours after each send.

This wasn't complicated to build. It took one working session to agree on the categories and checkpoints. Before she designed it, her review process was "whatever I got to before deadline."
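If you want the sample to be something other than "the first four I opened," a few lines of code can pick it for you. This is a sketch under assumptions: the draft identifiers and the function name `pick_for_review` are made up for illustration, and nothing here says Laura selects her 4 of 20 this way.

```python
import random
from typing import Optional, Sequence


def pick_for_review(drafts: Sequence[str], k: int, seed: Optional[int] = None) -> list[str]:
    """Choose up to k drafts uniformly at random, without repeats."""
    rng = random.Random(seed)
    return rng.sample(list(drafts), min(k, len(drafts)))


# Hypothetical identifiers for the 20 mid-market variants.
mid_market = [f"mid-market-{n:02d}" for n in range(1, 21)]

# Passing a seed makes the pick reproducible, so the choice can be audited later.
to_read = pick_for_review(mid_market, k=4, seed=2026)
print(to_read)
```

Random selection matters because a reviewer who always reads the same segments will never see a pattern of errors in the others.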

Practice deciding which AI outputs in your work need full review, sampling, or post-send audit — and what determines the tier.

Ravi: the operations lead who caught the gap

Ravi is operations manager at a 55-person import logistics firm. His team uses AI to draft supplier update emails, extract key terms from freight contracts, and generate weekly variance reports from spreadsheet data.

The problem he noticed: he was applying the same level of attention — moderate, time-pressured — to all three. Supplier emails are recoverable if wrong. A missed contract clause is not. Variance report errors stay invisible until they compound.

He mapped each output type to a review tier. Contract clause extraction: full review, every time. Supplier emails: 1-in-5 sampling, with a standing rule that any email about a dispute goes to full review regardless of the cadence. Variance reports: a short checklist of three numbers that always get verified against source data; the rest reviewed monthly.

The mapping took an afternoon. Most of the time was spent agreeing on what "wrong" actually meant for each output type, and who was accountable for catching it before it mattered.
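The supplier-email rule is the one part of this mapping with an override in it, so it is worth seeing how small it is once written down. The sketch below assumes "1-in-5" means every fifth routine email, and that dispute emails do not advance that count; neither detail is stated above. The class name and the `is_dispute` flag are hypothetical, and how an email gets flagged as a dispute is left to a person or an upstream step.

```python
class SupplierEmailRouter:
    """Route supplier emails: disputes to full review, every Nth routine one to sampling."""

    def __init__(self, every: int = 5) -> None:
        if every < 1:
            raise ValueError("every must be at least 1")
        self.every = every
        self._routine_seen = 0

    def route(self, is_dispute: bool) -> str:
        # Standing rule: a dispute always gets full review, whatever the cadence.
        if is_dispute:
            return "full review"
        # Assumption: only routine emails count toward the 1-in-N cadence.
        self._routine_seen += 1
        if self._routine_seen % self.every == 0:
            return "sampled review"
        return "send"


router = SupplierEmailRouter(every=5)
flags = [False, False, True, False, False, False]  # third email is a dispute
decisions = [router.route(flag) for flag in flags]
print(decisions)
```

The override is checked before the counter is touched, which is the whole point: the cadence can never talk a dispute email out of being read.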

Design a supervision plan for an AI workflow in your team — decide what you review, what you sample, and what you catch after the fact.

The practical version of the pause option

Anthropic's pause option is designed for a scale problem — AI systems producing output faster than any team of humans can review. Your version is smaller, but it's the same shape: AI outputs reaching clients, colleagues, or decision-makers without anyone reading them first.

The cases where oversight fails aren't usually cases of obvious error. They're cases where volume, speed, or habit meant something went out that no one read. Designing for that in advance — before the incident that makes you wish you had — is what supervision looks like when volume exceeds your review capacity.
