Questions re experience using LLMs on non-trivial (non-"flat"), frameworked, large-user-base projects in production


I would be very grateful if someone (even just a couple of people) who has already used LLM-driven code generation/debugging/design assistance on a large front-end or full-stack project in production could share their experience of how useful it actually is.

Environment expectations:

  1. The person considers themselves super-senior (15+ years of hands-on production experience) and very experienced in the entire stack used (i.e. otherwise capable of doing everything by themselves without any LLM assistance).
  2. The project is at least 50,000+ (human-written) LoC.
  3. The project is at least moderately complex, with a proprietary design of the front-end components, and it leverages at least two frameworks (at least one server-side and one client-side; e.g. LiveView + Vue or whatever else), preferably more than two, even if proprietary.
  4. The project has/had to address several architectural (non-functional) challenges otherwise not supported out-of-the-box by the libraries/frameworks used.
  5. The project has a fair share of integration between at least two languages (e.g. Elixir + JS), either through frameworks or otherwise (i.e. there’s plenty of state management both server-side and client-side, as well as exchange between them).

The answers I’m interested in fall into the following groups:

  1. How granular and articulated (as in well-designed) does the “code” in English have to be to achieve a match to requirements/quality/no-bloat comparable to that of a super-senior human? By granularity I’m referring to what’s addressed in this article by OpenAI: https://openai.com/index/harness-engineering/
  2. How good is the LLM of choice at leveraging the entire stack and “deciding” where to apply changes when requirements change (e.g. the Elixir backend vs. the server-side front-end vs. the JS client-side vs. CSS/Tailwind)?
  3. What are your (current) “definitive” conclusions on which tasks it can be given to solve virtually autonomously (and at what expense in terms of writing detailed behavioral specifications), vs. the types of tasks it’s better not to even consider letting it handle?
  4. How beneficial really (quality-/time-/requirements-match-wise) is letting the LLM directly access the code base (and change it), relative to querying it for copy-paste snippets/suggestions via a prompt from which it can’t see the project code base?
  5. Anything else worth mentioning.

Thank you

Edit: Last but not least, what’s the expected cost of using the LLM of choice 8 hours a day for the said tasks without hitting the limits?

UPDATE:

Just yesterday, after posting, I found out about this ultra-useful article (and all the stuff and resources it links to): advanced-context-engineering-for-coding-agents/ace-fca.md at main · humanlayer/advanced-context-engineering-for-coding-agents · GitHub

It partially answers my questions:

  1. Agents can be very efficient at pinpointing and maybe even fixing bugs (assuming they’re prompted/compacted correctly)
  2. Agents do not work well when refactoring vertical structures
  3. Agents are still pretty costly for full-time use, especially if the developers themselves need to shell out the money for the token churn

Besides the loss of privacy (the tradeoff to weigh the potential benefits against), I’m still having issues with the following:

  1. Guaranteed cost vs. uncertain efficiency over time: what if the (paid-for) failures to do tasks correctly, plus still having to pay humans to eventually redo some of them, ultimately add up to more total money/time spent than without using the agents?

  2. The author (being obviously very experienced) draws a parallel between not versioning the prompts and not versioning Java source code once the JARs are packaged.
    I have another parallel: what about not versioning the models? Are we expected to believe that new (unversioned) models will keep giving equal results for the same (versioned) prompts over time? Wouldn’t that be akin to upgrading a compiler with undeclared (but virtually guaranteed) backward-compatibility issues?
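One partial mitigation for the model-versioning concern: some providers expose dated model snapshots alongside floating "latest" aliases, so the model ID can be pinned in config the same way a compiler version would be. A minimal sketch of that idea (the model IDs below are purely illustrative placeholders, not real identifiers):

```python
# Sketch: pin a dated model snapshot instead of a floating alias, so that a
# versioned prompt keeps running against the exact model it was tested with.
# Both IDs below are hypothetical examples of the dated-snapshot naming style.
FLOATING_ALIAS = "some-model-latest"       # resolves to whatever is newest
PINNED_SNAPSHOT = "some-model-2024-10-22"  # a fixed, versioned snapshot

def model_for(env: str) -> str:
    """Use the pinned snapshot in production; allow the alias in experiments."""
    return PINNED_SNAPSHOT if env == "production" else FLOATING_ALIAS
```

This only helps until the provider deprecates the snapshot, of course, which is exactly the backward-compatibility gap the compiler analogy points at.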

  3. The need to pile repetitive DON’Ts and NEVERs into the prompts is an incredible turn-off, IMO. It emphasizes how relative the expected result is. The mere sensation that it may disobey (and it may, and frequently does) is very discouraging. How many of those DON’Ts and NEVERs are even enough? Will they still be enough for future model versions? Etc., etc.

All this is why I’d really appreciate someone with no vested interest answering the questions in my original post. I’d really like to know what’s real here and whether losing privacy (effectively handing over the codebase to Anthropic, Google or whoever else) is worth it.

This reminded me of a post from last year:


Thanks for the link!