AI Agents Arrived This Week. Here's What You Actually Need to Know.
In one week: Codex connected to 90+ business tools, Claude Code Routines went plain-English, and nine AI agents outperformed human researchers paid $22 an hour. Nobody has the playbook yet. Here's where to start.
By Forge Team
Your company is probably already running AI agents somewhere — a research workflow here, an automated report there, a pipeline someone built quietly on a Friday afternoon. What almost certainly isn't in place: any deliberate framework for scoping what those agents should do, what rules they operate under, or where a human should actually check the work. That gap just got more expensive to ignore.
What happened in one week
Between April 14 and 20, three products shipped that individually would have been noteworthy. Together, they mark a clear before-and-after.
OpenAI updated Codex on April 17 to connect to over 90 business tools — Jira, Salesforce, Slack, Google Workspace — and added persistent memory and multi-day task scheduling. An agent briefed on Monday can complete a multi-step workflow by Wednesday without you doing anything in between. That is not a chatbot. That is an autonomous coworker.
The day before, Anthropic launched Claude Code Routines — plain-English workflow automation that competes directly with Zapier. No code required. You describe the workflow in a sentence and the tool builds it. According to the Anthropic blog (April 16), non-technical teams can now automate multi-step processes through conversation.
On April 15, newsletters including The Neuron reported that Anthropic had demonstrated nine Claude agents working a research task in parallel, scoring 0.97 on the task against 0.23 for human researchers paid $22 an hour. That is not a synthetic benchmark. It is a cost-structure argument for anyone with a research budget.
The counterweight: Ethan Mollick noted on April 20 that despite all of this, there is essentially no real data on agents — no reliable benchmarks on what works, what fails silently, or what produces systematic mistakes at scale. Companies are deploying faster than they're learning.
What to do differently on Monday
The skill that mattered most twelve months ago was writing a clear prompt. That still matters. But the skill that separates effective AI users right now is a different one: being able to manage an agent the way you would manage a capable but uncritical new hire.
That means three specific things:
Scoping. Agents fail when the task is too vague, too large, or ambiguous about where to stop. "Help me with the competitive analysis" produces low-quality output. "Produce a one-page table of our four main competitors, one row each, covering pricing changes and product announcements in the last thirty days — flag anything you're uncertain about" gives the agent a shape to work inside.
Guardrails. What should the agent never do? What should it surface for human review instead of deciding itself? A brief without constraints is a brief that will produce confident mistakes. The constraints are not limits on capability — they are what makes the output safe to send.
Checkpoints. You do not need to review every line the agent produces. You need to identify which outputs carry the most risk if they're wrong, and review those. A checkpoint is not a lack of trust. It is the professional habit that keeps a silent systematic error from becoming a visible one.
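Taken together, the three ingredients fit in one written brief. Here is a minimal sketch in Python of what such a brief might capture; the structure, field names, and example values are illustrative assumptions, not any vendor's agent API:

```python
# A minimal sketch of an agent brief as a structured object.
# Field names and example values are illustrative, not tied to
# any specific agent product.
from dataclasses import dataclass, field

@dataclass
class AgentBrief:
    scope: str                                             # what to produce, and where to stop
    guardrails: list[str] = field(default_factory=list)    # what the agent must never do
    escalate: list[str] = field(default_factory=list)      # decisions to surface, not make
    checkpoints: list[str] = field(default_factory=list)   # outputs a human reviews before use

brief = AgentBrief(
    scope=(
        "Produce a one-page table of our four main competitors, one row each, "
        "covering pricing changes and product announcements in the last 30 days. "
        "Flag anything you are uncertain about."
    ),
    guardrails=[
        "Never contact anyone or post anything externally.",
        "Never present an inference as a confirmed fact.",
    ],
    escalate=[
        "Sources that contradict each other.",
        "Any pricing change larger than 20 percent.",
    ],
    checkpoints=[
        "A human reviews every row flagged uncertain before the table is circulated.",
    ],
)
```

The exact fields matter less than the habit: if you cannot fill in all four, the brief is not ready to hand over.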
What it looks like in practice
A marketing operations manager at a 45-person B2B SaaS wanted to automate her weekly competitive briefing. Previously: three hours of reading, collating, and writing. Her first agent attempt produced a summary — but it pulled press releases instead of product changes and mistook a competitor's fundraising round for a feature launch.
The problem was not the tool. It was the brief. Her second version specified: pull from these four sources only, flag anything categorized as a product announcement or pricing change, and output a table with one row per competitor and one column for "what changed this week." The agent now produces that in about four minutes. She checks the sources on anything that surprises her.
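Her brief was plain English, but its logic is easy to show as code. A rough sketch of the same shape, where the source names, category labels, and the fetch_items helper are hypothetical stand-ins rather than her actual setup:

```python
# A sketch of the tightened brief: fixed sources, fixed categories,
# fixed output shape. Source names and the fetch_items interface are
# hypothetical stand-ins, not a real product's API.
SOURCES = [
    "competitor-a-changelog",
    "competitor-b-changelog",
    "competitor-c-pricing-page",
    "competitor-d-blog",
]
ALLOWED = {"product announcement", "pricing change"}

def weekly_briefing(fetch_items):
    """fetch_items(source) -> list of dicts with 'competitor',
    'category', 'summary', and 'url' keys (assumed interface)."""
    rows = {}
    for source in SOURCES:                  # pull from these four sources only
        for item in fetch_items(source):
            if item["category"] not in ALLOWED:
                continue                    # drops fundraising news, press releases, etc.
            rows.setdefault(item["competitor"], []).append(
                f"{item['summary']} ({item['url']})"
            )
    # one row per competitor, one column for "what changed this week"
    return [
        {"competitor": name, "what changed this week": "; ".join(changes)}
        for name, changes in sorted(rows.items())
    ]
```

Notice what the code makes explicit that the first brief left implicit: which sources count, which categories count, and what the output looks like.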
The brief took forty minutes to write well. It has saved three hours a week since March.
Drill: Define what the agent handles, and where it stops.
A harder example
A head of research at a 22-person strategy consultancy ran a client intelligence workflow for three months before noticing a pattern: the agent was consistently underweighting qualitative signals — executive departures, strategic statements in earnings calls — relative to quantitative ones like revenue and headcount. The output looked thorough. The bias was quiet.
The fix was a checkpoint: every synthesis gets a human pass before going into a client-facing document, specifically checking that qualitative and quantitative signals are balanced for that client's brief. The agent still handles 80% of the work. The checkpoint takes eight minutes. It catches something roughly once a month.
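One way to make that checkpoint concrete is a quick balance test before the human pass. A rough sketch; the signal categories and the one-quarter threshold are assumptions for illustration, not the consultancy's actual rule:

```python
# A sketch of the balance checkpoint: flag a synthesis for closer human
# review when qualitative signals look underweighted. Categories and
# threshold are illustrative assumptions.
QUALITATIVE = {"executive departure", "strategic statement", "earnings-call remark"}
QUANTITATIVE = {"revenue", "headcount", "pricing"}

def needs_closer_review(findings):
    """findings: list of dicts with a 'signal_type' key (assumed shape)."""
    qual = sum(1 for f in findings if f["signal_type"] in QUALITATIVE)
    quant = sum(1 for f in findings if f["signal_type"] in QUANTITATIVE)
    if qual + quant == 0:
        return True     # an empty synthesis is itself a red flag
    # flag when fewer than a quarter of the signals are qualitative
    return qual / (qual + quant) < 0.25
```

The point is not the arithmetic. It is that a systematic bias needs a systematic check, because a reader skimming for quality will not catch it.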
Agent mistakes tend to be systematic rather than random. That makes them easy to miss until they aren't.
Drill: Write the constraints that keep agent output safe to send.
The actual takeaway
Mollick is right that there is no playbook yet. But that is not a reason to wait. It is a reason to start small enough that you can see what you are learning.
Pick one recurring task. Write a brief clear enough to hand to someone on their first day. Add a checkpoint for anything that would embarrass you if it was wrong. That is a playbook. Run it for two weeks and you will know more than most people deploying agents right now.
Drill: Design the checkpoint pattern for your most important AI workflow.
Put this into practice
Reading is a start, but skill comes from doing. Try the drills above now.
Like this post?
Get the next one in your inbox. Practical AI skills, no filler.