It’s great that you actually ran that side-project experiment instead of discussing the problems and forming opinions in the abstract, because the latter is a trap a lot of people fall into. And I think the result you got is real and valid. That said, I would argue it’s evidence for a different conclusion than the one you’re drawing.
The way you described it, you tried strict review, felt you were leaving the speed gains on the table, then deliberately swung the other way to see what happened. That’s a reasonable empirical approach, and I don’t want to caricature it as “vibe coding.” But the two configurations you tried share something: in both, the load-bearing activity is review of code that already exists. In the first you did a lot of it; in the second, very little. What I’d argue is missing from both is the work that happens before the code exists: specifying what you actually want, and building the tests and harnesses that will tell you whether you got it. That work is what stops “side-panel-banana-2-lower-new-large” from happening in the first place, and no amount of post-hoc review (careful or minimal) substitutes for it.
So I want to look at your bullet list through that lens:
- Tests aren’t testing the right things. In my experience this is a specification failure, not a code-quality failure. The fix is basically red/green TDD adapted for an AI workflow: write the test first, in plain English plus a failing assertion, then let the model implement against it until the test goes green. Interestingly, TDD was always a hard sell when you were also the one writing the implementation, because it felt like doing the thinking twice. Once the implementer is an AI, that becomes a non-issue: the test is the spec you’re handing off, and the red/green cycle is the verification loop. This catches most of the “tests that test nothing” failure mode for me. Frontier models also make this mistake a lot less than they did a year ago, and an adversarial review pass by a sub-agent catches most of what remains. (There’s a minimal sketch of this test-first handoff right after the list.)
- Names are terrible. Genuinely the cheapest item on the list to fix. One pass with “rename everything in this module for clarity, here’s the domain vocab” and it’s done. Not dismissing the annoyance, but I don’t think it’s worth stressing about at write time anymore.
- Abstractions are clumsy and leaky. This is the hardest item and I don’t want to hand-wave it. “Do the architecture work upfront” is easy to say and hard to do, and I don’t think AI makes the underlying design problem easier. If anything, it makes bad designs ship faster! What it does change is the economics of writing the design down explicitly: stating module boundaries, invariants, what talks to what, and what’s allowed to know about what used to compete with the cost of just writing the code, and now it doesn’t. I’ve started treating a short architectural brief as mandatory before any non-trivial feature (there’s a sketch of one after the list), and the abstractions get meaningfully better. But I want to be honest that this is the place where I still feel like I’m figuring it out, not a solved problem.
- Memory leaks. I think this is fair, and I’ll back off the “easy and quick” framing I’d normally reach for. Leaks are hard to catch in any codebase, AI-written or not! The honest answer is that you need real load tests and real observability, and building those well is not a weekend job. What I’d say is that AI lowers the cost of building that harness enough that it’s worth doing on projects where it used to feel out of scope (a first-cut sketch follows the list). But it doesn’t make the underlying discipline optional, and if you skip it you will eat the leak in production the same way you always would have.
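On the first bullet, here’s a minimal sketch of what the test-first handoff can look like, in Python with pytest. Everything in it is hypothetical and invented for illustration: there is no `normalize_handle` yet, and that’s the point. The file starts red, and the model iterates until it’s green.

```python
# test_normalize_handle.py -- written BEFORE any implementation exists.
# Hypothetical example: the spec for a not-yet-written normalize_handle().
# Plain-English intent in the docstrings, assertions to pin it down.
import pytest

from handles import normalize_handle  # red: module doesn't exist yet


def test_lowercases_and_strips_whitespace():
    """A handle is case-insensitive and ignores surrounding whitespace."""
    assert normalize_handle("  SidePanel  ") == "sidepanel"


def test_collapses_separators_to_single_hyphen():
    """Runs of spaces/underscores become one hyphen, never leading or trailing."""
    assert normalize_handle("side  panel__v2") == "side-panel-v2"


def test_rejects_empty_input():
    """An empty or all-separator handle is a caller error, not an empty string."""
    with pytest.raises(ValueError):
        normalize_handle("   ")
```

The prompt is then just “make these tests pass without editing the test file,” and the red/green loop does the verification that post-hoc review was trying to approximate.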
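On the abstractions bullet, here’s a rough sketch of what “writing the design down” can look like when the brief is encoded directly in code. The feature and every name in it are invented for illustration, not taken from anyone’s project; the point is that boundaries and invariants get stated before any implementation exists, and the model is told to implement strictly against them.

```python
# brief.py -- hypothetical architectural brief, written before the feature.
# Imagined feature: an import pipeline (parse -> validate -> store).
#
# Boundaries:
#   * the parser emits Record objects and knows nothing about storage
#   * storage accepts Record objects and never sees raw input
# Invariant:
#   * every Record carries a non-empty source_id
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Record:
    source_id: str  # invariant: non-empty, enforced at construction
    payload: dict

    def __post_init__(self) -> None:
        if not self.source_id:
            raise ValueError("Record.source_id must be non-empty")


class Storage(Protocol):
    """The only surface the pipeline is allowed to know about storage."""

    def save(self, record: Record) -> None: ...


def run_import(raw_lines: list[str], storage: Storage) -> int:
    """To be implemented by the model, against the boundaries above."""
    ...
```

The nice property is that the brief is now cheap: it’s a page of declarations, and violations (the parser importing storage internals, say) show up in review at the boundary level rather than line by line.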
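And on the leaks bullet, here’s the kind of first-cut harness I mean, a minimal sketch rather than real observability. `handle_request` is a hypothetical stand-in for whatever code path you suspect; the idea is just that tracemalloc snapshots across batches catch monotonic growth, which is the one signature code review will never show you.

```python
# leak_check.py -- minimal leak-detection harness sketch, not production
# observability. handle_request() is a hypothetical stand-in; a real load
# test would drive the actual service.
import gc
import tracemalloc


def handle_request() -> None:
    """Placeholder for the suspect code path."""
    _ = [object() for _ in range(1000)]


def measure_growth(iterations: int = 5, calls_per_iter: int = 10_000) -> list[int]:
    """Return total allocated bytes after each batch of calls.

    A healthy code path plateaus after warm-up; a leak grows monotonically.
    """
    tracemalloc.start()
    sizes = []
    for _ in range(iterations):
        for _ in range(calls_per_iter):
            handle_request()
        gc.collect()  # rule out garbage that just hasn't been collected yet
        current, _peak = tracemalloc.get_traced_memory()
        sizes.append(current)
    tracemalloc.stop()
    return sizes


if __name__ == "__main__":
    print(measure_growth())  # roughly flat after the first entry = probably fine
```

It’s deliberately crude (a real harness drives the actual service under load and exports metrics), but it’s the kind of thing that used to feel out of scope for a side project and now costs an afternoon.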
So I’m not saying “let go of the reins.” I’m saying “hold a different set of reins.” I let go of line-level taste and hold much tighter to specifications, tests that verify real behavior, and architectural constraints written down before any code gets generated. I’ll be honest: the letting-go felt like loss. I had something like a brief identity crisis over it, because line-level craft was the part I enjoyed most. The holding-tighter feels like overhead, because it used to be optional back when a careful human was typing every character and could course-correct mid-keystroke. The trade is doing less of what used to be load-bearing and more of what used to be a nice-to-have.
Now, the part of your post I haven’t addressed yet: the compounding worry. “Hope Claude improvements continue to outpace the spaghetti code it writes.” I’d actually push back on the framing itself. It assumes we’re in a race where debt piles up at a roughly fixed rate and we’re hoping cleanup tools improve fast enough to keep up. But that’s not what I’ve been seeing. Newer models don’t just clean up old debt better. They generate much less of it in the first place. The code Opus 4.6 writes today is meaningfully better than what Sonnet 4.5 was writing a few months ago, which was meaningfully better than the model before it. The accumulation rate is dropping, not holding steady.
And on the cleanup side: the tech debt that older models generated is, in practice, easily refactored by current ones. I’ve watched this on my own codebase. Code that felt like a mess when it was written back in Sept/Oct 2025 gets cleaned up in a single pass by a newer model, often better than I would have done it myself. So both sides of the equation are moving in the direction that makes the compounding worry weaker, not stronger. The mess we’re generating today is less than the mess we were generating six months ago, and the cost of cleaning up yesterday’s mess keeps dropping.
What I’ll fully concede: if you let go of line-level review without picking up specification and verification, you get exactly the dystopia you’re describing, and “Claude operator” is the right term for it. It’s a bad place to land, and it exists right at the transition point for those trying to adopt AI. I don’t think the answer is to go back to hand-authoring, though. The new reins work, and they work better than the old ones. The engineers I know who’ve made this transition are getting more done on harder problems, and they’re not going back.