Vidar

How to measure AI code quality?

I’ve been trying to come up with a reasonably realistic, representative way of evaluating the quality of AI-generated code, and by extension how well the skills, their setup, and the code-checking utilities work.

Ideally there would be perfect, real-life-scale code examples along with the prompts that should produce them, in a variety of problem spaces to make sure the approach is widely applicable. One could then set up a loop using the ideal prompts where:

The planning and implementing skills act according to the prompt.
The review tools and review skill tell the planning and implementing skills what needs improvement in order to get closer to the ideal code.
If/when the planning and implementing skills get closer to the ideal than the review tools can actively guide them, the review tools and skill should improve in turn.

Given the non-deterministic, somewhat noisy nature of LLMs, this cycle should gradually evolve towards better output.
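The loop described above could be sketched roughly as follows. This is only an illustration: `implement`, `review`, and `score` are hypothetical callables standing in for the implementing skill, the review skill/tools, and some measure of closeness to the ideal code. None of them are real APIs.

```python
from typing import Callable


def improvement_loop(
    implement: Callable[[str, list[str]], str],  # hypothetical: prompt + feedback -> code
    review: Callable[[str], list[str]],          # hypothetical: code -> list of findings
    score: Callable[[str], float],               # hypothetical: code -> closeness to the ideal
    prompt: str,
    max_rounds: int = 5,
) -> tuple[str, float, int]:
    """Run implement -> review rounds until the reviewer has no findings left."""
    feedback: list[str] = []
    best_code, best_score = "", float("-inf")
    for round_no in range(1, max_rounds + 1):
        code = implement(prompt, feedback)
        current = score(code)
        if current > best_score:
            best_code, best_score = code, current
        feedback = review(code)
        if not feedback:
            # The reviewer has nothing left to say. If the score is still
            # short of the ideal, it is the review side that needs to improve.
            return best_code, best_score, round_no
    return best_code, best_score, max_rounds
```

Keeping the best-scoring attempt rather than the last one is a deliberate choice here, since noisy output means a later round can be worse than an earlier one.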

Alas, for lack of real-life-scale ideal code examples with ideal prompts, the best I have found so far is the RealWorld Conduit app, which is a Medium clone. It is just one data point, but it gives reasonably objective feedback, as it is fully external, made for testing purposes, and can be tested online against their live API.

So that is the current plan: give a short prompt without much technical guidance, and let Claude sort it out or fail horribly. To what extent it succeeds will be some kind of metric, I suppose. So I’m thinking:

1. Going blind, one shot from the prompt and directly to testing against the API.
2. Going blind, one shot, but before testing against the API, use the code review tools and the review skill and have Claude ‘use tools and review skill and fix any and all issues’. And then test.
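One simple way to turn the two setups into a number is the share of API tests passed per setup, pooled over repeated runs to dampen the noise. A minimal sketch, assuming each run reports how many API tests passed out of how many were executed (the `Run` record and the condition names are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class Run:
    condition: str  # hypothetical label, e.g. "one_shot" or "one_shot_plus_review"
    passed: int     # API tests passed in this run
    total: int      # API tests executed in this run


def pass_rate_by_condition(runs: list[Run]) -> dict[str, float]:
    """Pool all runs per condition and return passed / total for each."""
    totals: dict[str, list[int]] = {}
    for run in runs:
        pooled = totals.setdefault(run.condition, [0, 0])
        pooled[0] += run.passed
        pooled[1] += run.total
    return {cond: (p / t if t else 0.0) for cond, (p, t) in totals.items()}
```

The difference between the two pass rates is then a rough measure of what the review tools and review skill contribute.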

I think it is also interesting to repeat that with Claude on lower thinking modes. Given enough skills guidance, and/or automated code checking and fixing afterwards, maybe a lower thinking level or a lower-tier model will produce quality output? In that case good skills and review tools would save cost and tokens, which I would appreciate a lot.
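To compare setups on cost as well as quality, one option is to pick the cheapest setup that still clears a chosen pass-rate bar. A rough sketch, where the `Config` record, the setup names, and all the numbers in the usage example are invented placeholders rather than measurements:

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    name: str         # hypothetical setup label
    pass_rate: float  # share of API tests passed, 0.0 to 1.0
    tokens: int       # total tokens spent to get there


def cheapest_acceptable(configs: list[Config], min_pass_rate: float) -> Optional[Config]:
    """Return the lowest-token setup whose pass rate meets the bar, if any."""
    acceptable = [c for c in configs if c.pass_rate >= min_pass_rate]
    return min(acceptable, key=lambda c: c.tokens) if acceptable else None


# Placeholder numbers, purely to show the call.
candidates = [
    Config("high_thinking_no_review", 0.95, 120_000),
    Config("low_thinking_with_review", 0.93, 60_000),
    Config("low_thinking_no_review", 0.70, 30_000),
]
choice = cheapest_acceptable(candidates, min_pass_rate=0.90)
```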

Anyway, ideas for good functional metrics are welcome?

#metrics #ai #claude

2026-05-17 02:35:22 UTC

First 10 of 70 Posts!

jdiago

Code quality is code quality. Doesn’t matter who or what wrote it.

Post #1

DaAnalyst

Claude gave me this today:

flagged? = opts[ :flagged] && something_else?( args)

..

// flagged? was then passed to a function that Claude made match against true or false
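One possible reading of the problem (an assumption, the post does not spell it out) is that a short-circuit `&&` can yield nil rather than false when the option is missing, so a function that only matches true or false has no clause for it. The same effect in Python, with `handle` and `opts` as made-up names:

```python
def handle(flagged):
    """Strict dispatch that accepts only real booleans (hypothetical function)."""
    if flagged is True:
        return "flagged"
    if flagged is False:
        return "not flagged"
    raise ValueError(f"expected a boolean, got {flagged!r}")


opts = {}  # the "flagged" option was never set

leaky = opts.get("flagged") and True         # short-circuits to None, not False
strict = bool(opts.get("flagged")) and True  # coerced first, so this is False
```

Here `handle(strict)` works, while `handle(leaky)` raises because `None` is neither `True` nor `False`.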

Post #2

Vidar

Fair enough. So maybe the metric is really wider, and it should be how to measure code quality, period.

Either way, I’m looking for good metrics for the quality of AI-generated code from a concise prompt in particular, as such a tool would likely help with improving usage and/or with using less expensive models. There are many metrics to go by of course, but I think the following is what I look for from Claude:

Code that fully meets formal specifications and requirements.
Flexible code architecture that is easy to change, extend and test.
No major performance issues.
Secure handling of authentication, authorization, and other common security topics.
Robust test coverage.
Good documentation.
Idiomatic best-practices code as per the language.

Getting to a level where Claude can be trusted to do that reliably and easily is a worthy goal, I think.
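If each of the criteria above can be scored somehow on a 0.0 to 1.0 scale, they could be folded into a single number with a weighted average. A minimal sketch; the criterion keys and the weights are arbitrary assumptions, not a recommendation:

```python
# Hypothetical weights per criterion; tune to taste.
CRITERIA_WEIGHTS = {
    "spec_conformance": 0.30,
    "architecture": 0.15,
    "performance": 0.10,
    "security": 0.15,
    "test_coverage": 0.15,
    "documentation": 0.05,
    "idiomatic": 0.10,
}


def quality_score(scores: dict[str, float], weights: dict[str, float] = CRITERIA_WEIGHTS) -> float:
    """Weighted average of per-criterion scores, each expected in [0.0, 1.0]."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight
```

How to obtain each individual score is the hard part, and is left open here.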

Post #3

gtcode

Cool and relevant:

https://github.com/Cinderella-Man/credence

Post #4

dimitarvp

I have zero idea but – really good question.

I’d start with “fewer lines of code”. LLMs are master bullsh1tters and will spit out 1000 lines of code when I, given the time, can very likely write 250 and even have the code be more understandable and nicer to work with.

Post #5

Vidar

I like it. I have to check the degree of overlap with what is in place already, but it is certainly relevant.

That is a good point. Short, straightforward, and readable sounds good. (In my mind longer than needed would be bad, and shorter but too intricate/clever/compressed would also be bad.) If we assume the long version actually boils down to about the same functionality in the AST or compiled code, then it should be possible to re-engineer the shortest clear, readable source version from that. Anything longer could be deemed wasteful. (Thinking out loud, maybe something sticks.)
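As a crude stand-in for "size at the AST level", one could count AST nodes instead of source lines. A sketch using Python's standard `ast` module on Python source, purely for illustration (the two `is_even` variants are invented examples):

```python
import ast


def ast_node_count(source: str) -> int:
    """Count every node in the parsed syntax tree of the given source."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


verbose = """
def is_even(n):
    if n % 2 == 0:
        return True
    else:
        return False
"""

concise = """
def is_even(n):
    return n % 2 == 0
"""

# Same behaviour, but the verbose version carries extra nodes.
wasteful = ast_node_count(verbose) > ast_node_count(concise)
```

A node count ignores whitespace and identifier length, so it measures structure rather than typing effort. It says nothing about readability on its own.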

Post #6

dimitarvp

Well let’s be honest here. If you achieve that you can command $20k / hour to rework the entire world’s code into something better. Or maybe $300k / hour.

But yes, this is the ultimate dream. Maybe you and @mudasobwa should talk: he started his project for a more-or-less universal AST with the idea to detect stuff beyond what credo and various other linters can detect.

Post #7

Vidar

That would certainly justify buying some Threadrippers.

Having given it some more thought, going from compiled code to re-engineered source seems too high level, for me at least. But re-engineering from the AST back into short source should be possible for a module or boundary with finite, defined inputs and known outputs. Basically turning it into a perfect black box which can be optimized every which way as long as it upholds the fully testable input-output relation. I’m not sure what percentage of code that would be applicable to, but for a functional language that should be a fair bit.

I do have this test for ‘should this just be a lookup table instead’, which basically does the same thing for simple functions.
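For a pure function over a small finite domain, the black-box idea can be shown directly: enumerate the inputs, record the outputs, and check that any replacement agrees on every input. A minimal sketch, with `days_in_month` as a made-up example function (leap years ignored):

```python
from typing import Callable, Hashable, Iterable


def to_lookup_table(fn: Callable, domain: Iterable[Hashable]) -> dict:
    """Evaluate fn on every input of a finite domain and store the results."""
    return {x: fn(x) for x in domain}


def equivalent_on(fn_a: Callable, fn_b: Callable, domain: Iterable[Hashable]) -> bool:
    """True if both callables agree on every input in the domain."""
    return all(fn_a(x) == fn_b(x) for x in domain)


def days_in_month(month: int) -> int:
    """Hypothetical example function with branching logic."""
    if month == 2:
        return 28
    if month in (4, 6, 9, 11):
        return 30
    return 31


months = range(1, 13)
table = to_lookup_table(days_in_month, months)
same = equivalent_on(days_in_month, table.__getitem__, months)
```

Once equivalence over the whole domain holds, the implementation behind the boundary can be swapped freely.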

Post #8

bunnylushington

This might be too obvious and I apologize if that’s the case: using one agent to write the code, another to test the code, and a third to review the code has paid large dividends (at least with Gemini). It’s circular, using the AI to review the AI’s work, but doing so with a new context-free agent seems to catch issues ranging from “doesn’t match the spec” to “this function is too long.” I tend to use a lot of agents in the course of a task. Gemini seems pretty reliable at generating “hand off” instructions.

I’ve also found that being super specific about my acceptance criteria helps both with development and review. “Good documentation” is not specific enough. I have something like “all public functions must have a doc, all modules must have a moduledoc, and all private functions must have a comment explaining what they do.” Much like with the specs I begin with, the more specific the constraints, the better the AI does at meeting them, since, I assume, there are objective criteria to measure against.
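A criterion that specific can also be checked mechanically. The rule quoted above is about Elixir docs and moduledocs; the sketch below is only a Python analogue using docstrings and the standard `ast` module, and it covers just the public-definition and module parts (not the comments on private functions):

```python
import ast


def missing_docs(source: str) -> list[str]:
    """List the module and public definitions that lack a docstring."""
    tree = ast.parse(source)
    problems = []
    if ast.get_docstring(tree) is None:
        problems.append("<module>")
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Names starting with an underscore are treated as private.
            if not node.name.startswith("_") and ast.get_docstring(node) is None:
                problems.append(node.name)
    return problems
```

An empty result means the criterion is met, which gives both the writing agent and the reviewing agent the same objective target.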

Post #9

mudasobwa

Creator of Cure

Nah, the code talks and I walk.

There is zero sense in having shorter code. If that is really needed, go rename all the functions to f1, f2, and all the variables to v1, v2, and squeeze all the spacing, and whatnot.

The actual size of code does not really matter at all, once we have MCP exposing smart tools and RAG built on top of that. (I went this path, btw, and I succeeded to some extent.) The code itself is out of the equation immediately after your MCP covers all the needs (requests) your LLM produces. Instead of grep and sed, your own RAG can do better; that’s the most important thing here.

Code is something your LLM should not deal with at all.

Post #10

Last Post!

johns10davenport

Aspirationally I’d like role-based task assignments, handled by the main agents only. Subagents are a rabbit hole. I’m using them, and it’s reasonably effective, but I’ll move away eventually.

Post #71