How to measure AI code quality?

Validating the quality of the code, produced by the very sophisticated code completion tool with another code completion tool always looked silly to me. That’s why I extensively developed static code analysis and linter tools in the last several month.

Additional credo checks helped me to catch stuff several times → OeditusCredo v0.4.0 — Documentation

Literally last week I caught and broke the huge compile-connected mega-cycle (459 nodes) with Ragex.

Old good static analysis works with the code on the level, LLM could never ever achieve, that’s why.

4 Likes

i’d like to lightly push back on the tautological definition of quality, and say that there is a human factor

meaning

quality is in large part dependent on who, or what is running, reading, or interacting with it

if its machine to machine, then it doesn’t mater how complicated or unreadable it is if its hyper optizmized or a slop trough

but for me, i’f i’m looking at it it has to be simple, grokkable, scoped, abstractable in an interface way where large swathes of the code can be boiled down to a simple interface with simple outputs and mutations

so i guess i’m saying that quality has facets, factors, observers, or users which will have their own definitions of quality, and then there are core metrics universal which do reflect quality and are totally fine to use for guidance (but generally they stem from somewhere, for a reason, the measurements, the qualities of craftsmanship, and excellence in engineering, i mean)

as i sit on the PR assembly line in the inspections department, its seriously a strain having to read every line of code, its too easy to miss a little line here and there, i am only able to fully inspect code that is quality code, meaning it passes all prior tests, all linters, passes conventions, and whatever else things which make sense to have

all of this to say i’m in agreement with @dimitarvp

I’d start with “less coding lines”.

in regards to the AST

see a discussion here too: BNF Grammars ( question/discussion ) · Issue #4 · yogthos/Matryoshka · GitHub

especially the paper which Matryoshka implemented

Ragex looks interesting, its so closely related to the AST things for compressions and cacheing

2 Likes

And if^W when it blows up in production what are we supposed to do? Rewrite from scratch?

1 Like

I will have to take some time to read and then digest that.

I’ve also come to that a deterministic static analysis tool is a good match to the non-deterministic nature and practice of the LLMs. I will most certainly take a good look at those tools to either use directly or get inspired by. Hope you don’t mind?

I do think there are some universal aspects to quality code, some aspects that are particular to certain paradigms, architectures or language idioms, and some that are more human taste.

The human taste is less important for LLMs maybe, but still important for reviewing.

I think I chose the wrong headline for this topic as that is too wide and open for interpretation. Maybe it should have been stated the opposite way instead: how to measure and minimize bad quality AI code?

I have not tested, but I feel fairly certain that LLMs also benefit from standards/ patterns, simple and readable code. Easier to recognize, easier to use, and ever recycled code bits are less prone to errors. And if nothing else likely spends less tokens.

1 Like

It does not matter whether I do mind or not, everything is MIT-licensed.

1 Like

It matters to me though. :slight_smile:

I’ve taken a brief look at Regex and that is a massive project. I will have to check that out more in depth.

When I mind others looking into my projects, I accomplish them using a pen and a paper and keep in my bedside table :slight_smile:

Unfortunately, static code analysis is always massive. Theoretically, Ragex supports all the languages, supported by metastatic, but frankly I never used it for Haskell and/or Ruby.

1 Like

Yes, but the project seem to go beyond just static analysis and also into AI agnostic, multi-language, RAG and refactoring edits. I likely missed some or even many! :slight_smile:

A quick search show that Archdo is almost fully overlapped (except change economy and composibility/ building blocks which are still fresh additions), but the intent and scope seems fairly different. I should try ragex on some projects and get a feel.

I have pen and paper by the bed too, but that is more for remembering any good solutions or ideas that pop up while sleeping.

So I just scribbled down some thoughts while reading, although most are just me checking that I got it right.

  1. We are basically looking at the entire frame of the project for the lifetime of the project?
  2. So if I got this correct:
    1. So inner loop is the code generation.
    2. Middle loop track specs and changes, normalize the code and optimize for size? (How do you set the size budget? What if the solution is not possible within the budget?)
    3. Outer loop accept the code and track its pedigree so to speak, or fail it and use the failure to change prompt, specification, hyper parameters, etc and try again from inner loop.
  3. That will achieve a single source of truth actively maintained through changes. Event sourced for projections/ linage/ data mining later on?
  4. Graphs make sense. For the LineageGraph it seems to me some of these are general and not project specific? Guidance/ skills, rules, exceptions and humans are likely the same across many projects.
  5. Over time that should give a good dataset for improvement.
  6. My setup is very skill centered so I have basically banned the use of agents for planning/ implementing/ reviewing. (You can set them up to use skills too, but between lots of skills and hooks I don’t bother). Anyway, my experience using agents actively is limited.
  7. Ok, so the graph is the true representation and changes are gated.
  8. Adversarial challenge for challenging assumed truths, finding gaps or issues. Makes sense as a concept.
  9. Yes, but ideally I do want to reach a static style/ patterns guide that stays as stable as possible, and only as flexible as needed. Once that goalpost is reached it should only move if forced. Stable is a key quality in itself I think.
  10. Doing hyperparameter optimization/ search on non-deterministic processes with fuzzy evaluation of quality/ ranking will take lots of iterations not to mention tokens?
  11. I like the concept but how will it actually work? Static analysis and tests will not actually fix an issue just locate it? LLMs can both locate and fix, but then in a not guaranteed manner, so the resolved claim itself will have to checked or tested?
  12. Ok, I’m lost here for the most part.
  13. So, open and fully declared context initialization settings including what is today hidden attributes? And the harness/ environment around packed in too? I would like that. And easy to change or spin-up different ones.
  14. I can’t say I have looked much at the credential side so far, but a granular setup makes sense.
  15. I have not made code intelligence systems so nothing sensible to say about that. Visual programming can be very similar to graphs and code language is made to accurately describe a resulting graph. To me they (naively) seem like two sides of the same coin, although different code can lead to the same graph nodes. I assume the clue is that one is easier to work with than the other for certain kind of work
  16. Living documents are good if the living is automated.

One thing that is a bit unclear to me is the granularity. In general it seems to focus project wide, but some parts make more sense at the plan milestone, code bits or code change level.

In practice I use static architecture with initialization and three levels of iteration loops:

  1. There is the initial setup with prompt/ specifications and using skills to plan to the actual plan.
  2. The failed test - implement - passed test loop
  3. Plan done it is the static analysis, LLM reviews and my checks. Then back to 2 as needed.
  4. Once the project is accepted as done, there is the second loop with skill feedback.

I can see how the living substrate concept would create a living maintained long term project, and an ever improving LLM setup, both of which are good. It is relatively complex though.

For me I think the ease of use and code output will be the two key factors.

1 Like

1. Lifetime frame?

Yes — but not “model every byte forever.” It’s the long-term frame for load-bearing facts: boundaries, contracts, capabilities, evidence, lineage. Transient stuff (failed attempts, local patches) lives in lineage storage, not the main design surface.

2. Three loops — got it right?

Pretty much. One refinement on the middle loop: it’s not just size optimization. Size is one signal. The real goal is minimizing engineering cost while keeping behavioral, architectural, and evidence constraints intact. A smaller implementation that breaks an invariant is invalid.

On budget: it’s per SpecCell type, not global. A pure domain module gets a tight budget; a stateful process gets more room because it needs it.

If the solution can’t fit the budget, that’s a design signal, not a fatal error. Three outcomes: re-budget (the cell was under-budgeted), split (it’s too large), or redesign (the approach is over-mechanized).

3. Single source of truth, event-sourced?

Yes. The cleanest model is event-sourced at the semantic-fact level — spec asserted, invariant violated, normalizer applied, exception approved, etc. From that stream you can materialize the spec graph, implementation graph, evidence coverage, lineage traces, slop reports, whatever you need. Code stays the executable reality; the graph is the engineering truth.

4. Some LineageGraph concepts are general across projects?

Correct. Better to split it: a project-specific LineageGraph (cells, patches, exceptions, decisions) and a reusable Harness Doctrine Graph (skills, rules, detectors, normalizers, failure patterns). The reusable layer becomes your cross-project improvement dataset.

5. Good dataset for improvement?

Yes, and judgment traces are the most valuable output besides accepted code. A trace captures rejected designs, reasons for rejection, normalizer effects, and the accepted normal form — far richer than a normal commit. That’s what makes it useful for improving context bundles, rules, normalizers, repair classifiers, and model selection.

6. Skill-centered, no agents for planning/review — compatible?

Fully compatible. The architecture doesn’t need agent swarms. spec.audit, spec.bundle, spec.accept are just bounded operators — skills. The substrate owns authority, state, acceptance, and lineage. The LM fills a constrained hole; it doesn’t own the plan or verdict. The doc should probably say “bounded proposal operators” instead of making it sound agent-heavy.

7. Graph is true representation, changes are gated?

Yes, with one nuance: code is still a reality source. Brownfield or handwritten code can reveal the graph is incomplete. A graph mismatch doesn’t always mean “reject” — sometimes it means “the graph was missing a legitimate fact.” The important thing is drift can never silently merge.

8. Adversarial challenge?

Yes. The adversary attacks assumptions, not just code — can this invariant be bypassed? Can this credential leak through logs? Can this state transition happen out of order? Good adversarial findings get promoted into deterministic rules or property tests whenever possible.

9. Stable style/pattern guide vs. dynamic ENF?

You’re right that stability is a key quality. The resolution is layering ENF: a stable core that rarely changes, project policy that only changes via ADRs, experimental rules as warnings only, and explicit scoped exceptions. Living shouldn’t mean moving goalposts — it means stable doctrine plus evidence-driven exceptions and promotions.

10. Hyperparameter search on non-deterministic fuzzy processes — too many iterations?

Yes, if done naively. The system shouldn’t do broad HPO over fuzzy LLM judgments. Harness evolution should mostly tune deterministic or semi-deterministic things: context bundle contents, operator ordering, cost weights, normalizer selection, model choice by task class. Search should be small, cached, off the critical path, and judged by concrete metrics — fewer ENF violations, smaller implementation graph, more mutants killed, lower human review defects. LLM-as-judge can propose hypotheses but shouldn’t be the verdict engine.

11. Static analysis and tests don’t fix issues, they locate them — how does repair actually work?

Exactly right. Static analysis and tests produce evidence and counterexamples; they don’t fix anything. The repair loop is: detector finds violation → classify it → rebuild context bundle with the failure and allowed repair scope → bounded operator proposes a patch → same detector must pass → implicated invariants must pass. LLM proposes the repair, harness verifies it, mutation/adversary tests prevent shallow gaming. The resolved claim isn’t trusted until checked by deterministic evidence.

12. Lost here.

That section (AccessGraph) needs better explanation. Short version: it’s one substrate primitive that answers who may read/modify/execute/delegate anything — code, credentials, agent scope, all of it. The key distinction is read broad context, modify narrow scope, escalate for architecture changes.

13. Open, fully declared context initialization including hidden attributes, easy to spin up different modes?

Yes, that’s exactly the intent. Task intent, SpecCell, capability bundle, model settings, allowed files, forbidden actions, runtime assumptions, tool permissions, hidden harness defaults, cost budgets, trust zone — all declared. Hidden attributes should be harness-controlled, not undocumented ambient behavior. Enables easy mode switching: local-dev, strict CI, security-critical, brownfield audit, etc.

14. Granular credential setup makes sense?

Yes. The central object isn’t the raw secret — it’s an auditable, scoped, non-exportable lease. The agent holds a reference to the lease; only the trusted connector redeems it at the final effect boundary. Granularity lets you express exactly which connector can redeem, enforce expiry and revocation, and guarantee secrets never appear in logs or telemetry.

15. Nothing to say about code intelligence, but graphs and code seem like two sides of the same coin?

Your intuition is right. The key distinction is authority. Most code intelligence systems build a graph from code and treat it as a cache (code → graph). This inverts it: the graph is the source of truth, code is a projection. Visual programming and code are both projections of the same underlying structure — different projections are easier to work with for different kinds of work.

16. Living documents only good if automated. Granularity is unclear?

Agreed on both. SpecCells should exist at multiple levels — system, subsystem, component, operation, code-change — depending on risk and change surface. Don’t SpecCell every helper; do SpecCell load-bearing units: public APIs, capability boundaries, process lifecycles, external effects, credentialed operations. And yes, if humans have to manually update everything, the system fails. Automation is the requirement, not a nice-to-have.


Your existing loop (setup → test/implement/test → analysis/review → feedback) maps directly onto the three nested loops. The main difference is the living substrate stores and operationalizes the feedback instead of leaving it as informal skill memory. Complexity is real — ease of use and code output quality are the right things to optimize for.

1 Like

QC tools of interest:

Elixir Vibe · GitHub (many)
GitHub - Cinderella-Man/credence: Credence is a semantic linter for Elixir · GitHub

1 Like

Agree here - an author’s definition of Quality must be defined.

I’ve attempted to do this here for Jido Packages: Package Quality Standards · Agent Jido

I’ll be the first to say that I find this definition incomplete. It’s a continuous work in progress. I have had some good success with pointing agents at this document to align a package with this definition.

I’ve been collaborating with @dannote and his work in packages like ex_slop to help with this - I know there are others. Here’s some results:

It’s a constant process.

2 Likes

TIL I learned about ex_slop, thank you for making me aware of it!

I’ll also read your article.

1 Like

I forgot to mention - just this week - I’ve been iterating on my shared Github Actions for the Jido ecosystem: GitHub - agentjido/github-actions · GitHub

I learned a lot from the Ash Ecosystem on this - ash/.github/workflows/ash-ci.yml at main · ash-project/ash · GitHub

This underscores my point above that “Quality” must be defined - and then encoded in both a welcoming developer experience but also serve a pragmatic purpose.

That definition of Quality must be accessible to both Humans and Agents. It’s critical across a large codebase - public or private - to maintain mental alignment among a team and that doesn’t come for free. Great projects and teams do a lot of work to make this happen. I am a big believer that if you optimize for a healthy and great “Human Developer” experience, you’ll get the good agentic experience by default too (with a few caveats!).

1 Like

I am a big believer in the same, it’s just that I am so busy lately that I can only dream of formalizing things in such ways. Big appreciation from me that you and many others are actively working on this – love it, will try and use every applicable bit for my personal and paid projects, and will provide feedback (eventually).

1 Like