AI Skills · April 7, 2026 · 6 min read

You're Not Using AI Anymore. You're Supervising It.

AI just passed human baselines on desktop tasks. The people building with it most still review every output. The skill that matters now isn't prompting — it's supervision.

By Forge Team

GPT-5.4 just scored 75% on OSWorld-V, a benchmark that tests whether AI can complete real desktop tasks — navigating spreadsheets, editing documents, managing files. The human baseline on the same test is 72.4%. AI is now, by at least one measure, better than the average person at operating a computer.

Here's the part that should make you pay attention: developer Simon Willison, one of the most visible AI-assisted builders working today, estimates that he didn't type 95% of the code he produces. But he reviews and tests all of it before it ships.

If someone writing almost exclusively with AI still checks everything, what does that tell you about the actual nature of AI work right now?

It tells you the job has changed. You're not using a tool anymore. You're supervising a coworker.

The supervision shift

Wharton professor Ethan Mollick has been describing this shift for months — from AI as chatbot to AI as agent. He frames the new challenge as management and delegation: the moment AI stops behaving like a calculator and starts behaving like a fast, confident coworker who is occasionally brilliant and completely unable to tell you when they've made something up.

This distinction matters because it changes what skills you need. When AI was a tool, the core skill was knowing what to ask. Now that AI is closer to a coworker, the core skill is knowing whether to trust what it gives back.

Think about what happens when you delegate to a new team member. You don't hand them a project and walk away. You check their work. You spot-check their reasoning. You ask them to explain decisions that seem off. You verify claims against sources you trust.

That's supervision. And it requires a specific set of skills that most people have never practised in the context of AI.

Willison is explicit about this. His consistent advice: treat AI as "yet another unreliable source." Not useless — unreliable. The way you'd treat a Wikipedia article or a confident intern. Useful as a starting point. Dangerous as a final answer.

Three supervision skills that actually matter

The supervision shift isn't abstract. It breaks down into concrete skills you can practise today.

1. Output comparison

When AI gives you one answer, you have nothing to evaluate it against. When you generate two or three versions of the same output, patterns emerge. You start to see where AI is consistent (probably reliable) and where it diverges (worth checking).

If you're a project manager comparing two AI-drafted stakeholder updates, the differences between drafts reveal what the model is uncertain about. If you're an analyst looking at two summaries of the same dataset, divergence points are your audit targets.

This isn't extra work — it's the fastest path to knowing which parts of AI output you can trust.
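To make this concrete, divergence points between drafts can be surfaced with a plain text diff. This is a minimal sketch: the two drafts and their figures are invented for illustration, and Python's standard `difflib` does the comparison.

```python
import difflib

# Two hypothetical AI drafts of the same stakeholder update.
draft_a = """The migration finished on schedule.
Costs came in at $48k, under the $50k budget.
Next step: decommission the legacy servers."""

draft_b = """The migration finished on schedule.
Costs came in at $52k, slightly over the $50k budget.
Next step: decommission the legacy servers."""

# Lines the drafts agree on are likely reliable;
# lines where they differ are your audit targets.
diff = difflib.unified_diff(
    draft_a.splitlines(), draft_b.splitlines(),
    fromfile="draft_a", tofile="draft_b", lineterm="")

divergences = [line for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---"))]

for line in divergences:
    print(line)
```

Here the drafts agree on the schedule and the next step but disagree on the cost figure, so the cost claim is the one to verify against the actual invoice.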


2. Flaw detection

AI outputs sound confident regardless of accuracy. A perfectly structured paragraph with clean grammar and a professional tone can contain a fabricated statistic, a logical contradiction, or a recommendation that ignores a constraint you specified two prompts ago.

The skill here isn't general scepticism. It's pattern recognition — learning the specific ways AI fails. Hallucinated citations (a model might cite "a 2024 Deloitte study on hybrid work" that sounds authoritative but doesn't exist). Plausible-sounding numbers that don't add up. Conclusions that don't follow from the evidence presented. Confident assertions in domains where the model has thin training data.

If you're a communications lead reviewing an AI-drafted press release, you're not checking whether it reads well (it will). You're checking whether the claims are accurate, the quotes are real, and the tone matches your organisation's actual position — not a more interesting one the model invented.

3. The draft-critique-refine loop

The most effective AI supervisors don't try to get perfect output on the first pass. They use AI to generate a draft, then critique it (sometimes with AI help, sometimes manually), then refine. This loop — draft, critique, refine — is the core workflow of supervision.

It's different from "prompt until you get something good." That's slot-machine thinking. The supervision loop is structured: you know what you're looking for in the critique step, and you make specific, targeted improvements in the refine step.

Willison doesn't rewrite his AI-generated code from scratch. He reads it, identifies what's wrong or suboptimal, and either fixes it directly or asks the AI to fix specific issues. That's the loop in practice.

The core supervision loop: draft, critique, refine.
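The loop above can be written down as control flow. This is a sketch only: `generate` is a stub standing in for whatever model call you use (an assumption, not a real API), and the checklist-based critique is one simple way to structure the step. The point is the shape of the loop, not the model.

```python
def generate(prompt: str) -> str:
    # Stub: a real implementation would call your model of choice.
    return f"[draft responding to: {prompt}]"

def critique(draft: str, checklist: list[str]) -> list[str]:
    # The critique step is structured: you decide what you're
    # checking for before you read the draft.
    return [item for item in checklist if item.lower() not in draft.lower()]

def refine(draft: str, issues: list[str]) -> str:
    # Targeted fix: ask only for the specific issues found,
    # rather than regenerating from scratch.
    return generate(f"Revise this draft to address {issues}: {draft}")

def supervise(prompt: str, checklist: list[str], max_rounds: int = 3) -> str:
    draft = generate(prompt)
    for _ in range(max_rounds):
        issues = critique(draft, checklist)
        if not issues:  # nothing left to fix: ship it
            break
        draft = refine(draft, issues)
    return draft
```

The `max_rounds` cap matters: a loop without an exit condition is the slot machine again, just automated.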

The real gap

There's a popular framing that the world is splitting into AI users and non-users. That the important divide is between people who've adopted AI tools and people who haven't.

That framing is already outdated. Access is widespread and growing. The divide that matters now is between people who can verify AI work and people who can't.

Consider the data from SHRM (the Society for Human Resource Management): 67% of HR professionals say their organisations are not proactively upskilling employees to work with AI. Meanwhile, AI implementation is creating new skill demands — 57% of HR professionals report that AI adoption led to upskilling or reskilling needs for employees. The tools are arriving. The training isn't. That gap isn't about access or prompt templates. It's about the verification and supervision skills that determine whether AI makes your work better or just faster and wrong.

The better AI gets, the more dangerous this gap becomes. When AI was obviously limited, people stayed cautious. They double-checked instinctively because the outputs were rough enough to trigger scepticism. As outputs become polished — as models pass human baselines on more benchmarks — the temptation to trust without verifying grows.

That's the paradox at the centre of the supervision shift. AI's increasing capability doesn't reduce the need for human oversight. It increases it. Because the mistakes become harder to spot, and the cost of missing them goes up.

The people who will thrive in this environment aren't the ones who can write the best prompts. They're the ones who know what to check after the AI finishes writing. That's not a personality trait. It's a skill. And like any skill, it responds to practice.
