Engineering leads, what are you doing to stop the slop?

I am joining uninvited for a three-way man hug.

It’s very obvious to me at this point that the only way to win this battle is to go on the offensive. And that’s what I’ll be doing: very soon Zoi will be all over the codebase and the Saga pattern absolutely will be enforced with electric shock collars and laser-guided auto-aiming turrets… as a start.

I think we all should start collaborating – maybe via a community wiki right here on ElixirForum? – on all sorts of lints / checks for Elixir projects. I keep not finding the time, but again, it is becoming obvious that I am losing the battle that way and must start aggressively allocating time for automated quality checks.
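To give a flavor of what I mean, here's the rough shape of a project-specific Credo check. This is a minimal sketch, assuming Credo ~> 1.7; the module name and the rule itself (banning `Process.sleep/1` in lib code) are just stand-ins for whatever conventions we'd agree on:

```elixir
# Hypothetical project-specific check; the actual rule is a placeholder.
# Register it under `checks:` in .credo.exs so `mix credo` picks it up.
defmodule MyApp.Checks.NoProcessSleep do
  use Credo.Check,
    base_priority: :high,
    category: :warning,
    explanations: [check: "Process.sleep/1 in lib code usually papers over a race condition."]

  def run(source_file, params) do
    issue_meta = Credo.IssueMeta.for(source_file, params)

    # Walk the AST and collect one issue per offending call site.
    Credo.Code.prewalk(source_file, &traverse(&1, &2, issue_meta))
  end

  # Matches `Process.sleep(...)` calls.
  defp traverse(
         {{:., _, [{:__aliases__, _, [:Process]}, :sleep]}, meta, _args} = ast,
         issues,
         issue_meta
       ) do
    {ast, [issue_for(issue_meta, meta[:line]) | issues]}
  end

  defp traverse(ast, issues, _issue_meta), do: {ast, issues}

  defp issue_for(issue_meta, line_no) do
    # format_issue/2 is injected by `use Credo.Check`.
    format_issue(issue_meta, message: "Avoid Process.sleep/1 in lib code.", line_no: line_no)
  end
end
```

Wiring something like that into CI is just `mix credo --strict` in the pipeline; the hard part is agreeing on the rules, which is exactly where a shared wiki would help.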

I used to be the guy on my team who cared the most about code quality. I always left feedback on PRs about making sure the code is easy to test, complexity is managed properly, Elixir conventions are followed, and so on. I regularly brought up new coding patterns to discuss at weekly team meetings. I even built a separate repo with rules and conventions, which were way more advanced than what Credo can consume, and spent untold hours in my spare time to try to integrate it into our CI pipeline. All that to say: I understand where you’re coming from!

Things have changed a lot in the past year though, and so have my mindset and attitude.

Fundamentally, I realized that a lot of my stress came from holding onto a standard that was already a luxury. And that luxury only made sense when coding was the bottleneck and therefore the most expensive part of the software development lifecycle. After all, getting it right the first time was the thing that would prevent a risky and time-consuming refactor, so optimizing for that was logical.

Today, when I catch myself bristling at how AI wrote or structured something, I try to ask: is this actually going to hurt us in six months, or is it just not how I would have done it? Most of the time it’s the second one. After all, the code works. It is well-tested. It went through manual QA. And the honest answer is that “not how I would have done it” was never a good reason to block a teammate’s PR either. We got away with it mostly because the volume was low enough that taste-policing felt like quality control. At AI-assisted volume though, the same instinct becomes a bottleneck that actively prevents users from getting the features they need.

Stuff like duplicated functions and drifting modules… sure, those are worth flagging and fixing. But these days I try to fix them the way I would fix any tech debt, which is with a refactor pass when it actually starts to cost us, not by trying to prevent every instance at write time. There is AI-generated code in our repo today that was written before Opus 4.5. It’s not great code. It would probably stress out any Elixir developer! But you know what? It works, and the features it powers are really valuable to users. If the time comes when we need to refactor them, we will probably also use AI for it, because the models today are much better than those that wrote the code. The cleanup will be cheaper than the prevention, which is an inversion of how we used to think about it.

Over the holidays a few months ago, I had an epiphany: I’m not a code author anymore. I’m a product author and the code is just the medium. Every minute I spend agonizing over whether a function should exist, or should be written differently, or split into three, or be placed in some other module, etc. is time I’m not spending on whether the feature should exist, whether users will actually use it, whether the thing we’re building is the right thing. That’s where caring truly pays off now. The most difficult thing I’ve had to admit to myself recently, as a professional software engineer, is that code-level stuff is increasingly becoming a hobby. We’re not quite there yet, but it’s just a matter of time.

5 Likes

I sympathize with most of what you say in this post, and I wonder myself if this is an opportunity to let go of stress instead of fighting, but I get stuck here:

I have noticed this myself in a side project I am using to test what this all feels like before making sweeping changes to our org processes. I started out demanding it rewrite and rewrite until I was satisfied, but then I felt like I was missing out on the speed gains. So I started more and more to vibe code a feature, making an effort to invest in planning but doing minimal review. Then when something broke I could look closer and suggest refactors as part of the fixes. And for the most part it worked, as long as I let go of my obsession with “code quality.” As you put it, I’m a “product author” now. The question becomes: how’s the product quality? And that I’m a lot less sure about.

  • Memory leaks and similar issues are a given
  • Tests often aren’t testing the right things (sometimes nothing at all)
  • Names are terrible
  • Abstractions are clumsy, leaky as all hell
  • List goes on

Oh well. Is this really a problem? The product works. Well, until the memory leak actually becomes a server fault and then it’s an emergency fix. Hope Claude is up to the task. Or there’s a bug in production the tests didn’t catch. Hope Claude can fix it before the complaints come in. Hope Claude improvements continue to outpace the spaghetti code it writes, because new features are taking more and more tokens and I don’t do that stuff anymore, thankfully, because this div is called “side-panel-banana-2-lower-new-large”. I’m a product author… or really, a Claude operator. Where Claude goes, there go I. Not a position I love to be in with a private company’s product.

From your posts I can tell that whatever you’re doing is working for you, but letting go of the reins seems like a huge risk from where I’m standing. And until I do, I can’t let go of the stress either.

7 Likes

It is both scary and heartwarming that this thread is turning into a post-apocalyptic support group.

1 Like

I think this is something AI specifically is changing. These decisions are often not expensive anymore: generating tons of code has become cheap. We still try to work on this stuff ahead of time where possible, but we’ve actually created a separate AI repo (a copy of our existing repo) specifically to build multiple prototypes with AI quickly. We use it to review these kinds of decisions with working code (not necessarily good, though sometimes good, but working). That lets us run low-risk, complex spikes that we learn from much faster. We can then review the results and build the path we want. Having an actual PR/MR, with a diff to read and code to play with, makes it much easier to compare the different approaches.

We were recently debating whether we wanted to build a new table per entity, a table with an entity_id and type column, or a table with 4-5 different entity_id foreign keys by type, with the unused ones being null for each row. We were able to build all three, look at them, have the AI help with some analysis of the options, then decide what we want. Those conversations used to be both more theoretical and more reliant on our most experienced engineers. Now anyone can try a few different methods as part of a spike. More junior folks are more quickly able to develop opinions about the ergonomics of the produced code, and the implementation steps are far better validated.
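For anyone curious, here's roughly what the third option looks like as an Ecto migration. The table and column names are invented (not our actual domain), and the check constraint is just one way to keep the mutually exclusive foreign keys honest in Postgres:

```elixir
# Hypothetical "multiple nullable FKs" variant: one comments table that
# belongs to exactly one of several parent entities.
defmodule MyApp.Repo.Migrations.CreateComments do
  use Ecto.Migration

  def change do
    create table(:comments) do
      add :body, :text, null: false

      # Exactly one of these is set per row; the others stay NULL.
      add :post_id, references(:posts, on_delete: :delete_all)
      add :invoice_id, references(:invoices, on_delete: :delete_all)
      add :ticket_id, references(:tickets, on_delete: :delete_all)

      timestamps()
    end

    # Enforce "exactly one parent" at the database level (Postgres syntax).
    create constraint(:comments, :exactly_one_parent,
             check: """
             (post_id IS NOT NULL)::int +
             (invoice_id IS NOT NULL)::int +
             (ticket_id IS NOT NULL)::int = 1
             """
           )

    create index(:comments, [:post_id])
    create index(:comments, [:invoice_id])
    create index(:comments, [:ticket_id])
  end
end
```

Seeing each option spelled out as working code like this made the trade-offs much more concrete than the whiteboard version of the same debate.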

I think the specific change with AI is that code is no longer expensive. Decision-making and shared understanding were, and remain, the expensive part, but the cost of having code now is much lower. I feel like the main thing we’re getting is higher-quality decisions and shared understanding, because we can now compare authentic, working versions of genuinely different implementations. When code is cheap, you can start with the information the code provides you to help you get to better decisions.

So I disagree somewhat that the design/gate function moves; I think what moves is code. Code is now upstream and can be used to inform design in a way that was far more costly in the past.

It’s great that you actually ran that side project experiment instead of discussing the problems and forming opinions in the abstract because I think the latter is a trap a lot of people fall into. And I think the result you got is real and valid. That said, I would argue it’s evidence for a different conclusion than the one you’re drawing.

The way you described it, you tried strict review, felt you were leaving the speed gains on the table, then deliberately went the other direction to see what happened. That’s a reasonable empirical approach, and I don’t want to caricature it as “vibe coding.” But I do think the two configurations you tried have something in common: in both of them, the load-bearing activity is review of code that already exists. In the first, you did a lot of it, and in the second, very little. What I’d argue is missing from both is the work that happens before the code exists, which is the specification of what you actually want, and the construction of tests and harnesses that will tell you whether you got it. That work is what stops “side-panel-banana-2-lower-new-large” from happening in the first place, and no amount of post-hoc review (careful or minimal) substitutes for it.

So I want to look at your bullet list through that lens:

  • Tests aren’t testing the right things. In my experience this is a specification failure, not a code quality failure. The fix is basically red/green TDD adapted for an AI workflow: write the test first, in plain English plus a failing assertion, then let the model implement against it until the test goes green (there’s a small ExUnit sketch of this after the list). Interestingly, TDD was always a hard sell when you were also the one writing the implementation, because it felt like doing the thinking twice. Once the implementer is AI, that becomes a non-issue. The test is the spec you’re handing off, and the red/green cycle is the verification loop. This catches most of the “tests that test nothing” failure mode for me. Frontier models also make this mistake a lot less than they did a year ago, and an adversarial review pass by a sub-agent catches most of what remains.

  • Names are terrible. Genuinely the cheapest item on the list to fix. One pass with “rename everything in this module for clarity, here’s the domain vocab” and it’s done. Not dismissing the annoyance, but I don’t think it’s worth stressing about at write time anymore.

  • Abstractions are clumsy and leaky. This is the hardest item and I don’t want to hand-wave it. “Do the architecture work upfront” is easy to say and hard to do, and I don’t think AI makes the underlying design problem easier. If anything, it makes bad designs ship faster! What it does change is that the cost of writing down the design explicitly (module boundaries, invariants, what talks to what, what’s allowed to know about what) used to compete with the cost of just writing the code, and now it doesn’t. I’ve started treating a short architectural brief as mandatory before any non-trivial feature, and the abstractions get meaningfully better. But I want to be honest that this is the place where I still feel like I’m figuring it out, not a solved problem.

  • Memory leaks. I think this is fair and I’ll back off the “easy and quick” framing from how I’d normally put this. Leaks are hard to catch in any codebase, AI-written or not! And the honest answer is that you need real load tests and real observability, and building those well is not a weekend job. What I’d say is that AI lowers the cost of building that harness enough that it’s worth doing on projects where it used to feel out of scope. But it doesn’t make the underlying discipline optional, and if you skip it you will eat the leak in production the same way you always would have.
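Here's the ExUnit sketch I promised under the first bullet: the kind of spec-first test I hand to the agent before any implementation exists. Everything in it is hypothetical (the module, the function, and the proration rule itself); the point is the shape, with plain-English intent in the test names and hard assertions underneath, red until the model makes it green.

```elixir
# Spec-first test written before MyApp.Billing.Proration exists.
defmodule MyApp.Billing.ProrationTest do
  use ExUnit.Case, async: true

  alias MyApp.Billing.Proration

  describe "charge/3 when a plan changes mid-cycle" do
    test "charges only for the days remaining on the new plan" do
      # 30-day cycle at 3000 cents/month, upgraded with 20 days left -> 2000 cents.
      assert Proration.charge(~D[2025-06-10], ~D[2025-06-30], 3000) == 2000
    end

    test "never charges a negative amount" do
      assert Proration.charge(~D[2025-06-30], ~D[2025-06-30], 3000) == 0
    end
  end
end
```

Nothing fancy; it just moves the thinking to before the code exists.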

So I’m not saying “let go of the reins.” I’m saying “hold a different set of reins.” I let go of line-level taste and hold much tighter to specifications, tests that verify real behavior, and architectural constraints written down before code gets generated. I’ll be honest: the first felt like loss. I had something like a brief identity crisis over it, because it’s the part that used to feel like craft and I enjoyed it a lot. The second feels like overhead because it used to be optional when a careful human was typing every character and could course-correct mid-keystroke. The trade is doing less of what used to be load-bearing and more of what used to be a nice-to-have.

Now, the part of your post I haven’t addressed yet: the compounding worry. “Hope Claude improvements continue to outpace the spaghetti code it writes.” I’d actually push back on the framing itself. It assumes we’re in a race where debt piles up at a roughly fixed rate and we’re hoping cleanup tools improve fast enough to keep up. But that’s not what I’ve been seeing. Newer models don’t just clean up old debt better. They generate much less of it in the first place. The code Opus 4.6 writes today is meaningfully better than what Sonnet 4.5 was writing a few months ago, which was meaningfully better than the model before it. The accumulation rate is dropping, not holding steady.

And on the cleanup side: the tech debt that older models generated is, in practice, easily refactored by current ones. I’ve watched this on my own codebase. Code that felt like a mess when it was written back in Sept/Oct 2025 gets cleaned up in a single pass by a newer model, often better than I would have done it myself. So both sides of the equation are moving in the direction that makes the compounding worry weaker, not stronger. The mess we’re generating today is less than the mess we were generating six months ago, and the cost of cleaning up yesterday’s mess keeps dropping.

What I’ll fully concede: if you let go of line-level review without picking up specification and verification, you get exactly the dystopia you’re describing, and “Claude operator” is the right term for it. It’s a bad place to land, and it exists right at the transition point for those trying to adopt AI. I don’t think the answer is to go back to hand-authoring, though. The new reins work, and they work better than the old ones. The engineers I know who’ve made this transition are getting more done on harder problems, and they’re not going back.

3 Likes

This does not line up with my experiences at all.

Newer models are just as guilty of creating tests that don’t actually test anything, or super-overcomplicated architectures, or code that looks reasonable on a cursory inspection but whose WTFs only become apparent once you really dig in. They try to take shortcuts at nearly every opportunity to paper over cracks and issues, and they require real diligence and supervision.

Trying to refactor older vibe code ends up in a messy, growing loop because the agent can’t pick apart the code and make small incremental updates. Either it writes plans to do so that fall apart at step 1, so it ends up doing a big-bang refactor (which has many of the same problems as the original code, just with a different skin), or it makes changes so small that there’s no meaningful progress towards the goal and you end up with a behemoth of a bowl of spaghetti.

I’m not saying the tools aren’t useful (they are), but there’s definitely some bias in how these things are being experienced. If you think the mess is less now, are you sure it’s not only because you haven’t felt the pain of digging into it yet? That’s when the biggest problems become apparent, not “oh, it’s naming things better now, so problems are solved”.

3 Likes

Yeah, I don’t know what to tell you because our experiences differ massively. For one of my personal projects, Opus 4.5 did a phenomenal job cleaning up a lot of code that Sonnet 4.5 had implemented in the two months preceding the release of the former. We’re talking absolute night-and-day difference in terms of code quality: it went from exploratory PoC level to production-ready in a little less than two days.

Regarding tests, as I said, red/green TDD has been an extremely reliable approach for me. It’s not often that all tests written by my AI agents pass but my manual testing reveals a bug, and in those situations the problem is almost always a gap in testing rather than a test that turns out to not actually do what it claims to. It does occasionally use looser assertions than I’d like, but adversarial AI review catches those pretty reliably.

Our different experiences could be due to different harnesses. I’ve been hearing a lot of grumbling about Opus 4.6 recently, and that turned out to specifically be the result of a change Anthropic made to Claude Code where the effort level defaulted to Medium. In those types of situations I can totally see the AI taking shortcuts to stay under the effort threshold. I always use it in Cursor with max effort though so I haven’t run into any problems.

It could also be a matter of problem domains. I’ve said in other threads that at work my team collectively noticed that AI doesn’t perform nearly as well with Ash as it does with other parts of the Elixir codebase, for example (although others said usage_rules resolves this for the most part?). Unfortunately there are a lot of variables and the non-deterministic nature of AI makes it difficult to pin it on anything specific.

1 Like

Just read this post by the creator of the Pi coding agent. Good reading for the people dealing with the issues in this topic.

I can’t post the link because it’s got a blue word in the URL :sweat_smile: - but if you’re interested, go to Mario Zechner’s blog and look for his post about slowing down.

And I would like to suggest that slowing the f*ck down is the way to go. Give yourself time to think about what you’re actually building and why. Give yourself an opportunity to say, f*ck no, we don’t need this. Set yourself limits on how much code you let the clanker generate per day, in line with your ability to actually review the code.

AI gives us all 10x capabilities, but does that mean we need to crank the lever to max all the time? Maybe 5x is better? I dunno, but the idea that there’s a balance where we can enjoy the benefits of using AI whilst also minimising the slop problem feels reasonable to me, and like a cultural thing that good dev teams should be aiming for.

6 Likes

Code quality still matters. Just like devs, LLMs tend to copy the patterns that exist. Duplicated, confusing, or contradictory code can multiply.

However, I’m having good results using Claude to both get new features done faster and address long-standing technical debt. I’m almost finished with a cleanup that is so much effort, it probably would never have been done without AI - it’s a pervasive issue and it’s “all-or-nothing” to fix; at some point, we’ll flip a feature flag and many, many conditionals will switch us to some new plumbing in prod. Then I’ll start ripping out the old. I’m excited about the ability to get this kind of thing done. And the fact that I can ship features while I’m doing it gives me the time to do the cleanup; nobody is saying “what, 2 months for nothing but a refactor?”

I spent a lot of effort up front writing tests (with AI) and have insisted that they pass whether the feature flag is off or on. Those guardrails have been key.
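Concretely, the guardrail looks something like the sketch below. The names are invented and I'm assuming a plain Application-env flag rather than whatever flag library you might use; the point is that the same behavioral assertions get generated for both flag values, so the new plumbing has to match the old before we flip anything in prod.

```elixir
defmodule MyApp.CheckoutTest do
  use ExUnit.Case

  # Each test is tagged with a flag value; this setup applies it and restores
  # the previous value afterwards.
  setup %{new_plumbing: flag} do
    previous = Application.get_env(:my_app, :new_plumbing, false)
    Application.put_env(:my_app, :new_plumbing, flag)
    on_exit(fn -> Application.put_env(:my_app, :new_plumbing, previous) end)
    :ok
  end

  # Generate every test twice: once with the flag off, once with it on.
  for flag <- [false, true] do
    @tag new_plumbing: flag
    test "cart totals are identical with new_plumbing=#{flag}" do
      assert MyApp.Checkout.total([{"widget", 2, 500}]) == 1000
    end
  end
end
```

Once the old path is ripped out, the `for` collapses back into a normal test module.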

3 Likes

I just wanted to say that we’re also cleaning up these issues far more rapidly with AI, and we sit far less on migrations to new patterns we want to introduce. It used to be a big effort to, for example, convert every table to a new behavior/component. Now we can practically one-shot it.

I don’t understand why writing code is a sunk cost, but design docs, whiteboards, and Slack huddles are not. The code is the design and can be changed or reworked if it’s not right. Sometimes we don’t know the implications of our ideas until we’ve implemented them, sometimes more than once - writing the code is a necessary part of that process, not the final step. Some issues won’t be identifiable until they’re deployed to a running system, but I think a few iterations in code of some features can highlight design issues and be far less costly than going with the first design a person comes up with (or generates with an LLM), and then trying to coerce that into what it should have been by expensive patches to a running system.

I hear this often but I’m not there yet. I still feel that the code is the design and is inextricable from the product.