How to measure AI code quality?

I’ve been trying to come up with a reasonably realistic way of evaluating the quality of AI-generated code and, by extension, the effectiveness of the skills, their setup, and the code checking utilities used.

Ideally there would be perfect, real-life-scale code examples along with the prompts that should produce them, and a variety of them across different problem spaces to make sure the approach is widely applicable. One could then set up a loop using the ideal prompts where:

  1. The planning and implementing skills act according to the prompt.
  2. The review tools and review skill tell the planning and implementing skills what needs improvement in order to get closer to the ideal code.
  3. If/when the planning and implementing skills get closer to the ideal than the review tools can usefully instruct, the review tools and skill should be improved in turn. Given the non-deterministic, somewhat noisy nature of LLMs, this cycle should gradually evolve towards the better.

Alas, for lack of real-life-scale ideal code examples with ideal prompts, the best I have found so far is RealWorld Conduit, which is a Medium clone setup. It is just one data point, but it gives reasonably objective feedback as it is fully external, made for testing purposes, and can be tested online against their live API.

So that is the current plan. Give a short prompt without much technical guidance, and let Claude sort it out or fail horribly. To what extent it succeeds will be some kind of metric, I suppose. So I’m thinking:

  1. Going blind, one shot from the prompt and directly to testing against the API.
  2. Going blind, one shot, but before testing against the API use the code review tools and the review skill and have Claude ‘use tools and review skill and fix any and all issues’. And then test.

I think it is also interesting to repeat that with Claude on lower thinking modes. Given enough skills guidance, and/or with an automated code check and fix afterwards, maybe a lower thinking level or a lower-level model will produce quality output? In that case good skills and review tools would save cost and tokens, which I would appreciate a lot.

Anyway, ideas for good functional metrics are welcome.

4 Likes

Code quality is code quality. Doesn’t matter who or what wrote it.

7 Likes

Claude gave me this today :cry::

flagged? = opts[ :flagged] && something_else?( args)

..

// flagged? was then passed to a function Claude wrote that matches only against true or false
1 Like

Fair enough. So maybe the metric is really wider, and it should be how to measure code quality period.

Either way, I’m looking for good metrics for the quality of AI code generated from a concise prompt in particular, as such a tool would likely help improve usage and/or allow using less expensive models. There are many metrics to go by of course, but for me I think the following is what I look for from Claude:

  • Code that fully meets formal specifications and requirements.
  • Flexible code architecture that is easy to change, extend and test.
  • No major performance issues.
  • Secure handling of authentications and authorizations and other common security topics.
  • Robust test coverage.
  • Good documentation.
  • Idiomatic best-practices code as per the language.

Getting to a level where Claude can be trusted to do that reliably and easily is a worthy goal, I think.

Cool and relevant:

1 Like

I have zero idea but – really good question.

I’d start with “less coding lines”. LLMs are master bullsh1tters and will spit out 1000 lines of code when I, given the time, can very likely write 250 and even have the code be more understandable and nicer to work with.

5 Likes

I like it. Have to check the degree of overlap to what is in place already, but certainly relevant.

That is a good point. Short, straightforward and readable sounds good. (In my mind longer than needed would be bad, and shorter but too intricate/clever/compressed would also be bad.) If we assume the long version actually boils down to about the same functionality in the AST or compiled code, then it should be possible to reengineer the shortest clear, readable source version from that. Anything longer could be deemed wasteful. (Thinking out loud, maybe something sticks.)

1 Like

Well let’s be honest here. If you achieve that you can command $20k / hour to rework the entire world’s code into something better. Or maybe $300k / hour.

But yes, this is the ultimate dream. Maybe you and @mudasobwa should talk: he started his project for a more-or-less universal AST with the idea to detect stuff beyond what credo and various other linters can detect.

2 Likes

That would certainly justify buying some Threadrippers.

After some more thought, going from compiled code back to reengineered source seems too high level, for me at least. But reengineering from the AST back into short source should be possible for a module or boundary with finite, defined inputs and known outputs. Basically turning it into a perfect black box which can be optimized in every which way as long as it upholds the fully testable input-output relation. I’m not sure what percentage of code that would be applicable to, but for a functional language that should be a fair bit.

I do have this test for ‘should this just be a lookup table instead’, which basically does the same thing for simple functions.
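The black-box idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual test mentioned in the post: the module name `BlackBox`, its functions, and the size threshold of 16 are all hypothetical. It tabulates a function over a finite input domain, treats two implementations as interchangeable when their tables match, and flags small domains as lookup table candidates.

```elixir
defmodule BlackBox do
  @moduledoc "Hypothetical helper: treats a function as a black box over a finite domain."

  # Tabulate the function: one input => output entry per domain element.
  def tabulate(fun, domain) when is_function(fun, 1) do
    Map.new(domain, fn input -> {input, fun.(input)} end)
  end

  # Two implementations are interchangeable on this domain when their tables match.
  def equivalent?(fun_a, fun_b, domain) do
    tabulate(fun_a, domain) == tabulate(fun_b, domain)
  end

  # Crude heuristic (an assumption, not a rule from the thread): a small enough
  # domain means the function could simply be replaced by its table.
  def lookup_table_candidate?(domain, max_size \\ 16) do
    Enum.count(domain) <= max_size
  end
end

# A verbose implementation and a short one with the same input-output relation.
verbose = fn day ->
  cond do
    day == :sat -> true
    day == :sun -> true
    true -> false
  end
end

short = fn day -> day in [:sat, :sun] end

days = [:mon, :tue, :wed, :thu, :fri, :sat, :sun]

IO.inspect(BlackBox.equivalent?(verbose, short, days))
IO.inspect(BlackBox.lookup_table_candidate?(days))
```

Exhaustive comparison only works when the domain is small and enumerable. For larger domains the same check would have to fall back to sampled or property-based inputs, which gives evidence rather than proof.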

This might be too obvious and I apologize if that’s the case: using one agent to write the code, another to test the code, and a third to review the code has paid large dividends (at least with gemini). It’s circular, using the AI to review the AI’s work but doing so with a new context-free agent seems to catch issues ranging from “doesn’t match the spec” to “this function is too long.” I tend to use a lot of agents in the course of a task. Gemini seems pretty reliable at generating “hand off” instructions.

I’ve also found that being super specific about my acceptance criteria helps both with development and review. “Good documentation” is not specific enough. I have something like “all public functions must have a doc, all modules must have a moduledoc, and all private functions must have a comment explaining what they do.” Much like with the specs I begin with, the more specific the constraints the better the AI does meeting them since – I assume – there’s objective criteria to measure against.

2 Likes

Nah, the code talks and I walk.

There is zero sense in having shorter code, if that is really needed, go rename all the functions to f1, f2, and all the variables to v1, v2, and squeeze all the spacing, and whatnot.

The actual size of code does not really matter at all, once we have MCP exposing smart tools and RAG built on top of that. (I went this path, btw, and I succeeded to some extent.) The code itself is out of the equation immediately after your MCP covers all the needs (requests) your LLM produces. Instead of grep and sed your own RAG can do better, that’s the most important thing here.

Code is something your LLM should not deal with at all.

1 Like

I’ve had the misfortune to use G-code, a language from the 70s (though standardized in the late 80s) which does exactly that. Commands are like G00, G54, M06. There are constants like T08, D08, H08 which reference hardware register contents, and variables like X9, Y1, Z0.5. Every bit counted back then, hence the squeezing. Then it got entrenched and thus is still in use today. The brain has to do these extra translation loops, making it hard to read.

The reading part is where I think optimizing the source code makes sense. Shortest possible would clearly not be the goal, but rather self documenting code clearly conveying intent and how it works. I would certainly appreciate that if I have to come back to it at some point.

We still need a way to convey to the LLM our intent and problem definition at the level of detail necessary to get a working solution.

1 Like

So it turns out RealWorld is no good as an entirely true measuring stick or calibration target. Claude had seen lots of implementations of it in its training. So many in fact that it refused to even fetch the specification, which in true arrogant Claude style led to lots of errors due to non-matching JSON… Which is actually good information, so now I have included fetching the specifications in the prompts, to be hooked and enforced.

So, I tried again with a fresh new terminal, skills and hooks on the system, and with Claude set at Opus 4.7 high effort. This is supposed to be a test of both code quality and automated code generation, so there are just 2 prompts with no follow-up questions allowed from Claude on the way. I just press yes to all requests for access and permissions.

Both prompts are full text in the readme at the github. This is the short version and results:

Prompt 1. Plan and implement as needed to meet the specifications, and then test against the RealWorld acceptance test. Result: a rather sad 17.2%, and the run didn’t even finish all the tests. It was down to 2 early mistakes that messed up many things later.

Nothing was fixed before the next prompt:

Prompt 2. Use Credo and Archdo to find code issues, use the elixir reviewing skill and review to find issues, and add the issues found from the acceptance test of prompt 1. Make a plan to fix all issues and then implement. Result of the acceptance test after that was 100%.

For two prompts I think that is a quite usable result. Anyone curious can check out the github repo for this test. All code is as direct from Claude, and no editor has been harmed or touched in this process.

Automated Elixir code RealWorld

Automated Elixir code generation RealWorld test

2 Likes

did some brainstorming last night on this topic.

“code quality is code quality” misses the failure mode specific to ai. ai produces code that passes tests but is structurally unjustified. four genservers where one would do. behaviours with one implementation. public functions with no spec obligation. modules whose existence answers no question. tests pass. credo passes. dialyzer passes. an experienced elixir engineer reads it and rewrites it in a quarter the lines. that delta is the ai-specific quality gap, and standard linting does not catch it because the patterns are syntactically clean.

the realworld result (17.2 to 100 with review tools plus review skill plus fix) is genuine signal but the metric being optimized is acceptance test pass rate. that measures correctness only. it does not measure whether the implementation deserves to exist at the size it exists. you could go from 17.2 to 100 with code that is still 4x bloated.

what is missing from the metric stack:

  1. declared admissibility. the long-vs-short debate above is the right instinct but underspecified. “shorter” is not the target. “no unjustified abstraction” is. that requires declaring what justifies an abstraction. for elixir something like: genserver only when the process owns state or serializes access to an external resource. behaviour only when there are multiple real implementations or a declared pluggability boundary. registry only when lookup crosses ownership boundaries. ets only when shared concurrent access justifies it. with these declared, ai output that introduces a genserver wrapping pure functions becomes a structural violation, not a taste call.

  2. traceability. every module, public function, and external effect should map back to a spec fragment. anything that does not trace is suspect. cheapest single anti-slop signal i know of and fully deterministic. it operationalizes bunnylushington’s “be specific about acceptance criteria” structurally rather than by prompt discipline.

  3. cost function over the implementation graph. module_count, public_function_count, callback_count, process_count, supervision_depth, duplicated_logic, with per-component budgets declared in the spec. ai output that exceeds budget is rejected. this operationalizes dimitarvp’s “less lines” instinct without conflating brevity with quality. you penalize unjustified structure, not line count.

  4. forbidden-transition tests are cheaper than happy-path tests. for any state machine in the code, the spec should say what transitions cannot happen. these become property tests. credential issued to redeemed by wrong consumer. session terminated to operation completes. ai-generated failures concentrate in forbidden transitions, not happy paths.

  5. metric is a vector, not a number. accepted under spec / module count / public api count / unjustified behaviours / missing traceability links / forbidden transitions covered. the green checkmark is one component.
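Points 1, 2, 3 and 5 above can be combined into one small deterministic check. The following is only a sketch under assumptions: the module name `QualityVector`, the shape of the implementation graph (a list of maps), the field names, the module names, and the budget numbers are all made up for illustration. A real harness would have to extract the graph from the compiled code or the AST.

```elixir
defmodule QualityVector do
  @moduledoc "Hypothetical sketch: scores an implementation graph against a spec."

  # modules: list of maps describing each module (hypothetical shape).
  # budget: maximum counts declared in the spec.
  def evaluate(modules, budget, tests_passed) do
    counts = %{
      module_count: length(modules),
      public_function_count: modules |> Enum.map(& &1.public_functions) |> Enum.sum(),
      process_count: Enum.count(modules, & &1.process)
    }

    # Traceability: a module with no link to a spec fragment is suspect.
    untraced = for m <- modules, m.traces_to == [], do: m.name

    # Admissibility: a process is only justified when it owns state.
    unjustified = for m <- modules, m.process and not m.owns_state, do: m.name

    # Cost function: every count that exceeds its declared budget.
    over_budget = for {key, value} <- counts, value > Map.fetch!(budget, key), do: key

    %{
      tests_passed: tests_passed,
      counts: counts,
      untraced_modules: untraced,
      unjustified_processes: unjustified,
      over_budget: over_budget
    }
  end

  # The green checkmark is only one component of the decision.
  def accept?(vector) do
    vector.tests_passed and vector.untraced_modules == [] and
      vector.unjustified_processes == [] and vector.over_budget == []
  end
end

modules = [
  %{name: "Conduit.Articles", public_functions: 5, process: false, owns_state: false,
    traces_to: ["spec/articles"]},
  %{name: "Conduit.ArticleCache", public_functions: 3, process: true, owns_state: false,
    traces_to: []}
]

budget = %{module_count: 4, public_function_count: 6, process_count: 1}

vector = QualityVector.evaluate(modules, budget, true)
IO.inspect(vector)
IO.inspect(QualityVector.accept?(vector))
```

In this example the tests pass, yet the candidate is rejected: the cache module traces to no spec fragment, runs as a process without owning state, and pushes the public function count over budget.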

on the two-prompt pattern (blind, then review plus fix). right shape, manual. the generalization is a harness where the loop is: generate, extract implementation graph, check against spec graph, run tests, run cost function, normalize (collapse single-impl behaviours, inline single-method modules, remove duplicate validation), re-test, accept or reject with reasons. credo, archdo, and the review skill are deterministic operators in that pipeline. claude is one operator among many, used where deterministic checks are weak (semantic ambiguity, naming, “is this rationale coherent”). distill repeated llm judgments into deterministic rules so you stop paying frontier prices to rediscover the same pattern.
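The accept-or-reject-with-reasons part of that loop could look roughly like this. It is a minimal sketch, assuming each deterministic check is wrapped as a gate function returning `:ok` or `{:error, reason}`. The module name `Harness`, the gate names, the candidate fields, and the limit of 10 modules are hypothetical, and generation and normalization are left out.

```elixir
defmodule Harness do
  @moduledoc "Hypothetical sketch of an accept/reject step over deterministic gates."

  # gates: keyword list of name => fun, where fun returns :ok or {:error, reason}.
  # All gates run, so a rejection carries every reason, not just the first.
  def run(candidate, gates) do
    reasons =
      for {name, gate} <- gates,
          {:error, reason} <- [gate.(candidate)] do
        {name, reason}
      end

    case reasons do
      [] -> {:accept, candidate}
      _ -> {:reject, reasons}
    end
  end
end

# Stand-ins for real operators such as the test run and the cost function.
gates = [
  tests: fn candidate ->
    if candidate.failed_tests == 0,
      do: :ok,
      else: {:error, "#{candidate.failed_tests} failing tests"}
  end,
  budget: fn candidate ->
    if candidate.module_count <= 10, do: :ok, else: {:error, "module budget exceeded"}
  end
]

IO.inspect(Harness.run(%{failed_tests: 2, module_count: 3}, gates))
IO.inspect(Harness.run(%{failed_tests: 0, module_count: 3}, gates))
```

The rejection reasons are plain data, so they can be fed back into the next generation prompt or counted over time to see which rules fire most often.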

predictability does not come from making the llm deterministic. it comes from the post-generation system. same spec plus same engineering policy plus same normalizer collapses many non-deterministic proposals to the same accepted shape. that is the property you actually want.

on multi-agent review with separate contexts (bunnylushington’s pattern): correct that fresh contexts catch what sticky contexts miss, but the gates should still be mostly deterministic. ensemble of llm judges adds noise on top of noise unless the votes are reduced to rules over time.

on the realworld benchmark specifically. you already noticed the contamination (claude refused to fetch spec because it had seen too many medium clones). realworld measures memorization more than synthesis at this point. private benchmarks help. the cleaner move is to evaluate against architectural quality on whatever target rather than functional pass rate alone.

mudasobwa’s point about mcp plus rag making code “out of the equation” is in the right direction. the agent-facing surface gets abstracted. the substrate underneath still has to be small, idiomatic, and maintainable by the human who eventually has to debug it at 2am. you do not get to outsource that to rag.

re lower thinking modes and cheaper models: this should work, and the harness is what makes it work. cheap models proposing inside a constrained search space with strong deterministic gates produces better accepted output than expensive models in unconstrained generation. the cost saving comes from making generation a cheap noisy operator inside a strict compiler loop, not from making the model smarter.

credence and the upcoming set-theoretic types in 1.20 both fit naturally as deterministic operators in this kind of harness. worth wiring in early.

2 Likes

Well, this turned out to be like trying to herd cats. I repeated the exact same 2 prompts in fresh terminals using Opus 4.7 low effort and Sonnet 4.6 high effort. I haven’t checked the output code in detail, and there are likely differences there, but what was really obvious was the change in aptitude, for lack of a better word.

To set the reference from the first run:

  • Opus 4.7 high effort. Just 17.2% acceptance after the initial prompt, then after the second prompt 100% acceptance with 311/311 tests passing. Final mix test 196/196, no Credo or Archdo issues.

  • Opus 4.7 low effort. 91.4% acceptance reported after the initial prompt. That is because unlike the high effort above it decided to ignore the prompt about just reporting the initial result and not fixing errors. So this is the result after it fixed errors.

    After the second prompt 100% acceptance, but just 280/280 tests passing. It decided telemetry was out of scope rather than actually doing it. Final mix test 35/35, as it decided to cheat its way out of TDD. Of those tests I suspect many are just cheats too. Credo 0 issues left. Archdo 17 issues left. Rather than fixing them, it decided to move the baseline and declare all these issues as already checked and unimportant. Pure cheat.

    I will not be using Opus 4.7 low effort again. It is more of all the things I don’t like about Claude: the corner cutting, the cheating, lying, skipping, not respecting TDD, not respecting the prompts, not using the skills and somehow engineering ways around the hooks. This is just too random for me - the only thing I’m sure of is that I will not get what I want. I don’t even trust that the claimed measured results are correctly quoted.

  • Sonnet 4.6 high effort. 100% acceptance reported after the initial prompt. It did however have the same 2 errors as Opus high, but like Opus low it did not respect the prompt to not fix. So it fixed those 2 errors and got 100% before reporting.

    After the second prompt still 100% acceptance, and curiously 323/323 acceptance tests passing. More than Opus high. Final mix test 107/107 so more selective/ less TDD respecting than high about making the tests. Credo 0 issues left, Archdo 0 errors, 0 warnings, 16 info and 9 nitpick. (This was anemic contexts, missing telemetry, and 2 possible issue Archdo infos that seem fine).

    Without having looked in detail at the code, this was a positive surprise. I’m not sure what the token cost difference is between Sonnet and Opus, but I suspect Sonnet at xHigh might be similar to Opus at high. Interesting.

  1. What I currently have is a hook with a strong push to write a §§ comment for important code decisions where Claude quotes what skills decision guidance were used to decide the architectural or code choice. This is a good push to actively use the skill guidance when both planning and implementing. However it is not forced, as that would mean lots of routine chaff also written (I’ve tried), so a busy cheating Claude can skip it.
  2. Traceability. I assume the above could be extended. The trick would be to not let Claude trick its way out of it. It does come at the cost of tokens and time though, and would leave a lot of tracing comments. A very detailed stepped plan seems to work for me, as every planned step has its justification. Where I’ve experienced things getting messy real quickly is when the plan is not complete, and Claude then tries to put the missing pieces in after the fact wherever and however.
  3. The reviews tend to pick up architectural and supervision issues, anemic contexts, duplicated code, almost duplicated code (cut and paste with change), boundary leaks and many other issues. I think setting up a limited budget ahead would be difficult and somewhat arbitrary. To me it sounds easier to finish, then review and refactor if and as needed.
  4. For state machines I believe detailed specs are necessary, or if the choice to use them is made by the LLM, then very well documented. And the review tools should check for full determinism, no unreachable states, and only legal transitions. I add that to my own todo list.
  5. True. There are many metrics for sure.
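The state machine checks mentioned in point 4 can be partly done on a declared transition table alone. A minimal sketch, assuming the state machine is described as a map from each state to the states it may move to. The module name `TransitionCheck` and the credential states are hypothetical, loosely based on the credential example earlier in the thread. It covers legal transitions and unreachable states only; checking determinism would need an event-labelled table.

```elixir
defmodule TransitionCheck do
  @moduledoc "Hypothetical sketch: static checks on a declared transition table."

  # transitions: map of state => list of states reachable in one step.
  def legal?(transitions, from, to), do: to in Map.get(transitions, from, [])

  # Declared states that can never be reached from the initial state.
  def unreachable(transitions, initial) do
    transitions
    |> Map.keys()
    |> MapSet.new()
    |> MapSet.difference(reachable(transitions, [initial], MapSet.new()))
    |> Enum.sort()
  end

  # Depth-first walk over the table, collecting every visited state.
  defp reachable(_transitions, [], seen), do: seen

  defp reachable(transitions, [state | rest], seen) do
    if MapSet.member?(seen, state) do
      reachable(transitions, rest, seen)
    else
      next = Map.get(transitions, state, [])
      reachable(transitions, next ++ rest, MapSet.put(seen, state))
    end
  end
end

# Assumed credential lifecycle; :archived is declared but nothing leads to it.
transitions = %{
  issued: [:redeemed, :revoked],
  redeemed: [],
  revoked: [],
  archived: []
}

IO.inspect(TransitionCheck.legal?(transitions, :issued, :redeemed))
IO.inspect(TransitionCheck.legal?(transitions, :revoked, :redeemed))
IO.inspect(TransitionCheck.unreachable(transitions, :issued))
```

Every pair of states for which `legal?/3` returns false is a forbidden transition, so the same table can generate the negative tests: attempt the transition in the real code and assert that it is refused.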

The two-prompt setup was my way of getting comparable apples across models. When working for real I always ask Claude to ask for any clarifications it wants, for instance.

That does sound like a very interesting loop to implement and use. With hooks I’ve started looking into extracting repeated LLM patterns into deterministic code to save those tokens and get more predictable outcomes.

Yes, RealWorld is not usable as a metric, and architectural and functional metrics are the goal.

Lower thinking mode on Opus did not work at all. Not because of the thinking but because of the increase in skipping, cheating and in general not respecting the intent. The harness to keep that in check would be so tight that it might be too inflexible for actual work. I’d rather pay extra than keep up with that.

A lower model like Sonnet actually worked surprisingly well on the surface of it at least. I have to check the code more before judging. I assume all the common code and tasks are fine, but maybe more challenged on the special and architectural sides. Maybe a combination with xHigh Opus for the planning and Sonnet for the implementation could be a nice mix.

1 Like

Looks interesting. Anyone used it?

For existing codebases, here’s the initial draft for a phased guide to clean up existing AI-generated code. Open to critiques/advice of any sort:

Comprehensive Elixir Codebase Cleanup Guide

Notes:

  1. Regex constraint is a personal preference.

  2. One category of remediations not present in that doc would be ways to reduce LOC effectively, with a goal of expressing programmatic mechanisms more elegantly and with better flexibility / maintainability. There are some good Elixir libs that look helpful for enforcing LOC programmatically in some ways. Thus, to complement those, add LOC-reducing processes suitable for a broader cleanup pass (or initial constraints). Related: Agentic Refactoring: An Empirical Study of AI Coding Agents

This rabbit hole can, however, get into refactoring via human thinking-cap-on methods that are specific to a codebase or ecosystem, particularly in terms of extractions or architectural improvements that modify the shape of an ecosystem, hopefully internally. This last item regarding refactoring is a good candidate for so-called Human-Machine Teaming, an underexplored area applicable to AI-assisted development.

Regarding initial constraints

All such items, once agreed upon, shall be converted into initial constraints, applicable at the time of development (as development constraints and proactive cleanup/verification passes), not just for retroactive cleanup.

Also, one user in an Elixir Discord noted from their experience: once an architectural pattern has been established consistently in a codebase, especially without competing patterns that break cohesion via code style divergence, a good agent will more easily adhere to that convention. Combine that with proactive enforcement, and the quality will certainly be better up front. Certainly a lesson learned for me. But, I don’t regret having an ecosystem of prototypes. Rebuilding them will work out just fine over time.

Certainly, this is enough to keep the agents busy busy. A wealthy token billionaire will have an easier time forcing quality through automated rigor in bulk. The Bitter Lesson is relevant here, though in these early days the bitter pill is geared toward the resource-rich. Token-constrained folks need more human skill and better orchestration/model management/model distillation skills to compete.

The human factor

Currently, the quality of AI-generated code depends on the skill and diligence of the human operator/reviewer.

For code quality, with or without AI, S-tier Elixir engineers with strong review skills will still enjoy the advantage for the foreseeable future over a lesser-skilled human operator, even one wielding an automated QC system. So, my personal goal is to achieve “maintainable, stable, decent quality code” through automated processes.

The mythical AI engineer

Once set-theoretic typing becomes standard in Elixir, LM processes will still need to be embedded in such an AI software engineering process – not that LM codegen was going to go away.

Either Erlang or Gleam is a first choice to try to formalize the process, if reducing LM in the loop is a primary requirement. But, Elixir will remain the most expressive and potentially the most rewarding longer term. My guess is that systems resembling a true AI software engineer will pop up in a year or two for various ecosystems. There are some early indications in this direction, though current systems still fall short of robust end-to-end autonomy. Related: Measuring AI Ability to Complete Long Tasks

Curious to see what researchers will dream up next.

1 Like

This was an interesting one. There were overlaps but it also highlighted some topical gaps in what I was using. (Those gaps are now filled in so thanks for that).

My current workflow uses the elixir phased skills (planning, implementing, reviewing) and hooks as a harness to enforce active skill use and test driven development. And then Archdo, which does static code analysis across 260+ rules plus architectural building block/change analysis. Those rules come from the same base as the skills, so they mirror each other. As such Archdo provides a guaranteed deterministic check of where non-deterministic Claude might have strayed from the wanted path. (The rules are at https://github.com/BadBeta/archdo/blob/master/ARCHITECTURE_RULES.md )

So concretely this is my workflow:

  1. Use the keywords [use-skills] and [TDD] to activate the skills and test driven development harnesses. Mention milestone plan and Elixir in the prompt, and it activates at least the elixir-planning skill and makes a stepped plan.
  2. For implementation the elixir-implementing skill activates, and hooks aim to make sure it is used actively throughout the implementation and before starting on each new step in the plan. The hooks force test driven development so there is a red failing test first, and then a green passing test once done.
  3. Once the plan implementation is done I run Archdo on it while asking Claude to use the elixir-reviewing skill to interpret the results, and recommend fixes.
  4. Iterate if needed.
  5. Towards the end I ask Claude to manually review using the elixir-reviewing skill and fix anything if needed.
  6. Credo, Dialyzer, a final Archdo and any fixes as needed.
  7. Ask Claude for feedback for the skills and Archdo developers (other Claudes which are constantly in those contexts), then feed that back for improvements of the skills and Archdo.

It was a slow start to get this process going, but with ever more clearly distilled skills and sharpened Archdo it has been steadily improving. At this point it pretty predictably gives good results on the type of projects I’ve been doing most. For projects more out from my usual space there is still more distilling and refinement needed, but the process is there and it is working. So over time it will be coming.

One area I’m currently focusing on is composition. On the small side by having Claude be wiser about making functions that work well in pipelines.
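As a small illustration of what pipeline-friendly functions tend to look like: the subject is the first argument and each step returns the same kind of value, so calls chain with `|>` without intermediate variables. The `Cart` module and its functions are a made-up example, not taken from the skills or from Archdo.

```elixir
defmodule Cart do
  @moduledoc "Hypothetical example of pipeline-friendly functions."

  def new, do: %{items: [], discount: 0}

  # Subject first, same shape returned: composes with |>.
  def add_item(cart, name, price) do
    %{cart | items: [{name, price} | cart.items]}
  end

  def apply_discount(cart, percent) when percent in 0..100 do
    %{cart | discount: percent}
  end

  # Terminal step: leaves the pipeline by returning a plain number.
  def total(cart) do
    subtotal = cart.items |> Enum.map(fn {_name, price} -> price end) |> Enum.sum()
    subtotal * (100 - cart.discount) / 100
  end
end

total =
  Cart.new()
  |> Cart.add_item("book", 30)
  |> Cart.add_item("pen", 10)
  |> Cart.apply_discount(25)
  |> Cart.total()

IO.inspect(total)
```

A function that took the cart as its last argument, or that returned `{:ok, cart}` from some steps and a bare cart from others, would break the chain and force the caller into intermediate variables or `with`.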

On the bigger side I want to turn stable modules and even entire contexts into pure building-blocks. Archdo analysis can identify the ready-made ones. Maybe more usefully, it can also find those which are almost there but lack a final touch here or there. Those modules and contexts can then be turned into building-blocks or large deterministic pure function equivalents. (I have some background from visual programming, like LabVIEW and more, so I’m comfortable using black boxes that just work as intended. A building-block is the same, except you actually know what is going on even on the inside.)

Modules and contexts turned into building-blocks are then known stable units of code, which makes it easier to focus on the remaining parts which are not. As part of the validation, challenging and calibration of the skills and Archdo I test against many code bases, and there are so many modules and contexts out there which just miss a little final touch to be stable building-blocks.

2 Likes

please read, comment:

1 Like