AI is getting ridiculously productive

I am 80% of the way through a massive refactor that I have been neck-deep in for about 15 full days now, assisted by Opus 4.6.

I’ve thought about doing this architectural change for the last 18 months but, being alone, could not justify its cost. My experience is about the same as yours, except that I am creating a new prod environment, doing a lot of manual intervention in the plans and QA, and will migrate the tenants one by one to make it less dramatic. My tenants are fully isolated anyway, so I do not have a real choice there.

For the record, I never planned for this specific app to become a real company, so I am refactoring it from a single-tenant setup (« server closet » model) with local storage and singleton processes to a horizontally scalable multi-tenant architecture with object storage. Never planned for it to be sold to more than 1 client at the start :man_shrugging:.

Creating a lot of automated « architecture checks » helped a lot. Making the changes as mechanical and TDD-driven as possible helped too. Without this tooling, my other options were not doing this migration at all or seeking funding. Despite having planned it carefully for « when I will have help », my plans were only a huge GitHub issue for my own attention and an (un?)healthy dose of hope.
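
To give an idea of what such an architecture check can look like, here is a minimal sketch in Python. The rule names, the regex patterns, the `ObjectStore` call, and the file names are all made up for illustration; they are not the actual checks used in this refactor.

```python
import re
from typing import NamedTuple


class Violation(NamedTuple):
    path: str
    line_no: int
    rule: str


# Hypothetical rules: each maps a rule name to a regex that should no longer
# appear once the code is multi-tenant and backed by object storage.
FORBIDDEN = {
    "local-storage": re.compile(r"\bopen\(|File\.write"),
    "singleton-process": re.compile(r"name:\s*__MODULE__"),
}


def check_architecture(sources, rules=FORBIDDEN):
    """Return one Violation per source line that matches a forbidden pattern.

    `sources` maps a file path to that file's text.
    """
    violations = []
    for path, text in sorted(sources.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            for rule, pattern in rules.items():
                if pattern.search(line):
                    violations.append(Violation(path, line_no, rule))
    return violations


if __name__ == "__main__":
    # Made-up file contents, only to show the check in action.
    demo = {
        "lib/report.ex": 'File.write("/tmp/report.csv", data)\n',
        "lib/store.ex": "ObjectStore.put(bucket, key, data)\n",
    }
    for v in check_architecture(demo):
        print(f"{v.path}:{v.line_no}: violates {v.rule}")
```

Run as part of CI or a pre-commit hook, a check like this fails as soon as a change reintroduces one of the old patterns, which is what keeps a mechanical refactor from quietly regressing.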

I also considered throwing everything away and restarting from scratch with a scalable model from the get-go, but it wasn’t an option I was keen on, for obvious reasons. I don’t like wasting a useful app.

Yeah, my motivation for the architectural change was actually the same as yours: eight years ago, the business plan my (non-technical) cofounder and I had was to sell to relatively few large companies, for which a “one-schema-per-tenant” model makes sense. Since then we’ve switched to targeting smaller firms and solo shops. And it turns out that having thousands of schemas in Postgres causes a lot of problems and bottlenecks in all kinds of places. I found this HN thread where people shared similar/identical experiences.

I am mostly using Claude, and my only problem is that it frequently tries to cheat. E.g. I was trying to write a Tesla adapter for Hackney 3 and found that Hackney ignores the timeout option. Claude was so eager that it rewrote the tests for the Hackney adapter instead of reporting the issue :stuck_out_tongue:

My experience is that, when it is stuck on something, it takes very weird shortcuts and calls it “heuristics”. I needed to add a strict “no heuristics unless approved” rule.

But it is ridiculously good at translating between languages/representations and huge refactors. Those things are usually not worth automating in a deterministic way and LLMs just speed through it :slight_smile:

Yes, Claude is a lazy, lying, shortcut-loving cheater that skips work. It is a real problem for me too at times. And you can’t just ask for confirmation or double-check whether something is done by asking. The answer to that might be a lie too. So my CLAUDE.md has a few strategies to catch and prevent that. Ironically, many were suggested by Claude, as it is strangely aware of its shortcomings but just can’t help itself. So asking Claude how to avoid such issues in the future actually gives sensible answers.

Requiring proof of work, as in having to present something, forces the work to actually be done. Comparing options also seems to help. And certain words such as “properly”, “systematically”, and “production quality/ready” seem to be taken more seriously than others.

Good read so far. And I can agree with most of it. I am working at a small company and we have so many ideas and far too few people to accomplish them. The business is close to the education sector and money is always scarce. So hiring more people is just not possible.

For me, Claude is a game changer. I am able to work on various smaller projects at the same time. It fixes issues I could not prioritize before due to time constraints, and my coworkers are super happy that those are finally fixed.

At the same time, I can rework our central management system. I have been working on it for quite a few days already, but most of it is specifications and thought work. Having Claude challenge my ideas and suggest potential implementations is so refreshing. We established clear boundaries and implementation phases. And so far we have progressed quite far.

I cannot imagine going back. So many ideas are now within reach.

And because I am lazy and need more tooling support, I just let Claude Code write me 2 small Mac apps which are perfect for my workflow. I have never coded a Mac app before, but it just works. They are black boxes, but I do not care, because I will never ship them to a customer. And for me it works. I guess we will see many more of those personal tools in the future.

I just hope that people start marking their repos as AI-only coded, so we understand the risks before using them.

Yesterday for the first time I had a long Claude Code + Sonnet 4.6 session where it never got context drunk. Claude was trying to track down a bug in some complex logic that resulted in a silent failure, so no stack trace to start from. It investigated many, many things, and the conversation was long enough that it auto-compacted twice. Normally this would cause it to start giving crazy answers that have no bearing on reality, or get stuck in circular reasoning, but instead it eventually found and fixed the bug!

While watching the token usage, it looked like Claude Code was doing incremental uploads. Frequently the “up” token count was in the hundreds, not thousands. Is it using server-side caching during the session now?

After they raised the context from 200k to 1M tokens, my Claude use has improved considerably. I have two long-term projects that have been going on for months, with likely hundreds of compactions. The 1M context keeps focus much longer at a time, so less frustration, a “smarter” Claude overall, and joy for all I guess. At least for me.

One of the things that really annoys me is this very overeager focus on reducing work. For example today I was working on a project with lots of statistics. Part of that was processing raw input data. Then later in the pipeline the code has to pick lots of random samples where more samples help reduce noise impact.

This was a previously confirmed working pipeline. And I ended up spending hours exploring why some new data samples gave very weird results. Long story short: To save time Claude skipped over the raw data processing and instead just passed it right on. And Claude had decided to disregard the coded number of samples and instead reduced it to one third to save processing time. Thus I was really just looking at random noise samples.

I’m sure I’ve allowed edits of the files at some point, fair enough, but silently doing this to save work in an entirely different section of code than the one we were actually working on was not OK. And certainly not without asking or informing me.

As a sidenote: has anyone else gotten a tired Claude lately? “Long session, let’s stop here and continue tomorrow?” “Let’s stop and continue with this in the next session”. And a few other variants. I think that is new.

I make every AI agent use red/green TDD, so this is never a problem for me. I know it fixed the bug it claimed to have fixed, because I can review its code edits and the outputs of its commands to see how it first wrote a test and then ensured it failed before implementing the fix and getting the test to pass.

I think this is why I prefer Cursor over Claude Code. With Cursor, there’s still a clear review process, the UX is clear and everything is transparent. Claude Code though positions you further away from that, especially now that they made tool call results more opaque.

Yes, tests would catch some of my issues for sure, and test-driven development in itself requires proof of work, so that helps on many levels. And a test that the working pipeline actually kept working would have caught one of the Claude-introduced issues quickly.

But there are others: how do you test that Claude actually read a file properly before coding? Not just the first few pages, before deciding to skip ahead to some random page to be efficient.

And my latest issue today where Claude goes in and changes a hard coded value. I shouldn’t have to test that hard coded values or parameters remain as I made them. I expect them to stay unchanged in code we’re not actively working on - even if changing them could make for faster runs.

A more specific strict requirements file might have helped to pin down those values, but I generally try a lean approach to coding with Claude. I sometimes forget that Claude lacks common sense, and well, weird things can follow.

Presumably I don’t need to know, because if Claude does that, then whatever it is implementing will be wrong, and the red TDD tests it has written won’t pass.

I’m not quite sure what you mean here. If you have proper test coverage, changes to hardcoded values like module attributes or constants will be detected immediately, as the tests will start failing.
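
As a minimal sketch of that idea in Python: a test can pin both a hard-coded value and the behaviour that depends on it. `N_SAMPLES`, its value of 300, and `draw_samples` are hypothetical stand-ins, not taken from the actual pipeline discussed here.

```python
import random

# Hypothetical constant standing in for the hard-coded sample count in the pipeline.
N_SAMPLES = 300


def draw_samples(data, rng):
    """Draw N_SAMPLES random samples (with replacement) from data."""
    return [rng.choice(data) for _ in range(N_SAMPLES)]


def test_sample_count_is_unchanged():
    # Pin the constant itself ...
    assert N_SAMPLES == 300
    # ... and the behaviour that depends on it.
    samples = draw_samples([0.1, 0.2, 0.3], random.Random(0))
    assert len(samples) == 300


if __name__ == "__main__":
    test_sample_count_is_unchanged()
    print("sample count unchanged")
```

If an agent reduces the constant to a third to save processing time, both assertions fail on the next test run instead of the change surfacing hours later as strange results.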

Not sure I can test for something I don’t know the answer to? The project is an exploration mission of sorts. Claude’s reading is likely not fully deterministic either, so I expect some variety in the solutions that follow even when everything has been read. My main point is simple though: when I ask Claude to read something, it should actually just read it.

In this case we potentially have a variable tiny signal in a lot of noise. Sometimes it is just noise, sometimes there is a signal I fail to find, and sometimes there is a signal I do find. The only reliable test I can think of would be to feed in a data stream known to contain a signal with known characteristics and make sure that shows up with a given technique. And maybe the opposite, a data stream known to be just noise, and not have any signal show up. That would catch faults in the pipeline, and maybe that is exactly what I have to do. But I’m opposed to the idea that I should have to test for Claude changing things outside of the working scope.
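
The known-signal and known-noise test described above could look roughly like this in Python. The detector (correlation against a sine of known period), the threshold, and all parameter values are assumptions chosen for illustration; the real pipeline and its techniques are not shown in this thread.

```python
import math
import random


def make_stream(n, period, amplitude, noise_sd, seed):
    """Synthetic data: a sine of known period and amplitude buried in Gaussian noise."""
    rng = random.Random(seed)
    return [
        amplitude * math.sin(2 * math.pi * i / period) + rng.gauss(0, noise_sd)
        for i in range(n)
    ]


def detect_signal(samples, period, threshold):
    """Hypothetical detector: correlate the samples with a sine of the given period
    and report a detection when the normalized correlation exceeds the threshold."""
    corr = sum(s * math.sin(2 * math.pi * i / period) for i, s in enumerate(samples))
    return abs(corr) / len(samples) > threshold


def test_known_signal_is_found():
    # Signal amplitude is a fifth of the noise standard deviation.
    stream = make_stream(n=20000, period=50, amplitude=0.2, noise_sd=1.0, seed=1)
    assert detect_signal(stream, period=50, threshold=0.05)


def test_pure_noise_is_not_flagged():
    stream = make_stream(n=20000, period=50, amplitude=0.0, noise_sd=1.0, seed=2)
    assert not detect_signal(stream, period=50, threshold=0.05)


if __name__ == "__main__":
    test_known_signal_is_found()
    test_pure_noise_is_not_flagged()
    print("pipeline sanity checks passed")
```

With these numbers the expected normalized correlation is about 0.1 when the signal is present and about 0 for pure noise, with a spread of roughly 0.005 in both cases, so the 0.05 threshold separates the two comfortably. Cutting the sample count or skipping a processing step would shrink that margin and show up as a failing test.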

I think I can sum my issue up fairly concisely:

  1. I am not used to computers not doing as asked.
  2. I am not used to computers doing things without being asked or informing me.

This is for better and worse. I have been pleasantly surprised to see Claude take it upon itself to make backups and distribute them without me asking. That was a nice touch.

I guess I need to adjust my mental model and work routines too. More testing is already in place where possible.

Out of interest, why did you switch from a multi-tenant architecture to a single-schema architecture? I’m planning a project with a multi-tenant architecture, so it would be nice to know the downsides.

I’ve found that if there are existing examples, good documentation, and good test libraries available, then you can do amazing things with Claude and Elixir.

For example, I’m working on:
(1) A complete XSLT 3 implementation in pure Elixir. There is lots of documentation and thousands of conformance tests, so Claude has plenty to work from. As far as I know the only XSLT 3 implementations are Raptor and Saxon.
(2) MCP SDK in Elixir using published standards and conformance tests
(3) ADK SDK in Elixir using published standards and conformance tests
(4) A2A SDK in Elixir using published standards and conformance tests. I’m up to 100% pass rate on conformance tests.

All are waiting on me writing up examples, livebooks, a website, etc. before I publish to Hex.
