Verification Is the Next Bottleneck in AI-Assisted Development
I've run Claude Code workshops for over 100 engineers in the last six months. Teams that used to merge 10 PRs a week are now merging 40-50. The engineers doing the reviews are the same ones who were already stretched. Nobody budgeted for that.
A Latent Space analysis found that teams with high AI adoption merge 98% more pull requests but spend 91% more time in review. The output doubled. The review burden nearly doubled too.
A recent economics paper, "Some Simple Economics of AGI" by Catalini, Hui, and Wu, explains why this is hard to fix. The cost to automate is falling fast. The cost to verify is biologically bounded. You can 10x your output. You can't 10x your reviewers.
Why AI-generated code is harder to review
When a human writes a bug, they leave traces. The variable name is weird, the comment reveals confusion, the structure doesn't fit the file. AI writes clean, idiomatic, well-commented code. The surface is smooth. The bug is buried. Reviewers have to go deeper, not shallower.
There's also a confidence gap. A frontend engineer asks Claude to write a database query and gets back something that looks correct to them. They're not qualified to know whether it is.
The METR study from mid-2025 found that developers thought AI was making them 20% faster. It was actually making them 19% slower on measurable tasks. Clean-looking output feels like progress even when it isn't.
The objections
The most common: AI writes tests too, so the problem solves itself. It doesn't. AI-written tests have the same blind spots as AI-written code. Same agent, same context gaps. They document the happy path the AI already envisioned. You need verification that challenges the code, not confirms it.
The more sophisticated version: the real bottleneck is upstream, figuring out what to build. A CTO I spoke with said his constraint is understanding customer context. He's worried his team will build the wrong 10 things faster. He's right. But even if you're building exactly the right things, you still need to know that what shipped is what you intended. Separate problem.
The third: just hire more reviewers. You can't hire fast enough. Experienced reviewers are scarce. And asking senior engineers to manually review AI-generated boilerplate is the wrong use of their time.
What closes the loop
I've been building this at Opslane. Three things work together.
Tests are the foundation. Without a real test suite, you have no ground truth. Tests are how AI knows when it's done.
The piece most teams miss is acceptance criteria, written by a human before the AI touches anything. "Build a login page" is a prompt. "Users authenticate with email and password, receive a specific error on wrong credentials, land on /dashboard, session expires after 24 hours" is something a machine can check. Human judgment moves upstream, into defining what done means. That's the only work that scales.
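Those criteria can be written directly as executable checks. A minimal sketch, with a stand-in `authenticate`/`Session` implementation so the example is self-contained (the real system would hit the actual login flow; all names here are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SESSION_TTL = timedelta(hours=24)

@dataclass
class Session:
    user: str
    created_at: datetime
    landing_page: str = "/dashboard"

    def expired(self, now: datetime) -> bool:
        return now - self.created_at >= SESSION_TTL

USERS = {"alice@example.com": "correct-horse"}

def authenticate(email: str, password: str, now: datetime):
    """Return a Session on success, or the specific error string on failure."""
    if USERS.get(email) != password:
        return "invalid email or password"
    return Session(user=email, created_at=now)

# One check per acceptance criterion:
now = datetime(2025, 1, 1)
ok = authenticate("alice@example.com", "correct-horse", now)
assert isinstance(ok, Session)                       # email/password auth works
assert ok.landing_page == "/dashboard"               # lands on /dashboard
bad = authenticate("alice@example.com", "wrong", now)
assert bad == "invalid email or password"            # specific error on bad creds
assert not ok.expired(now + timedelta(hours=23))     # session valid before 24h
assert ok.expired(now + timedelta(hours=25))         # session expires after 24h
```

Each criterion maps to exactly one assertion, which is what makes "done" checkable by a machine rather than arguable in review.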
The third piece is agents verifying agents. A separate process that tries to break what the coding agent built. One CTO I spoke with has all three running. Critical changes still get a human eye. But most routine work closes automatically.
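The routing logic that setup implies can be sketched in a few lines: a verifier pass runs its checks, failures go back to the coding agent, and only changes flagged as critical escalate to a human. This is an illustrative sketch, not Opslane's actual API; every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Change:
    description: str
    critical: bool

def verify_and_route(change: Change, checks: list[Callable[[Change], bool]]) -> str:
    # A separate verifier process runs adversarial checks against the change.
    if not all(check(change) for check in checks):
        return "rejected: back to coding agent"
    # Critical changes still get a human eye; routine work closes automatically.
    if change.critical:
        return "passed checks: human review required"
    return "passed checks: merged automatically"

routine = Change("rename config key", critical=False)
risky = Change("alter billing schema", critical=True)
assert verify_and_route(routine, checks=[lambda c: True]) == "passed checks: merged automatically"
assert verify_and_route(risky, checks=[lambda c: True]) == "passed checks: human review required"
```

The interesting design decision is what populates `checks`: in the setup described above, it's a second agent generating adversarial tests, not the suite the coding agent wrote for itself.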
Teams that skip verification infrastructure over the next 12 months will accumulate a quiet debt. Catalini, Hui, and Wu call it the "Trojan Horse" externality: deploying unverified systems becomes rational for each team even as the systemic risk grows.
The tools that pull teams ahead aren't the ones that write code fastest. They're the ones that help you trust what got written.