did some brainstorming last night on this topic.
“code quality is code quality” misses the failure mode specific to ai. ai produces code that passes tests but is structurally unjustified. four genservers where one would do. behaviours with one implementation. public functions with no spec obligation. modules whose existence answers no question. tests pass. credo passes. dialyzer passes. an experienced elixir engineer reads it and rewrites it in a quarter the lines. that delta is the ai-specific quality gap, and standard linting does not catch it because the patterns are syntactically clean.
the realworld result (17.2 to 100 with review tools plus review skill plus fix) is genuine signal but the metric being optimized is acceptance test pass rate. that measures correctness only. it does not measure whether the implementation deserves to exist at the size it exists. you could go from 17.2 to 100 with code that is still 4x bloated.
what is missing from the metric stack:
- declared admissibility. the long-vs-short debate above is the right instinct but underspecified. “shorter” is not the target. “no unjustified abstraction” is. that requires declaring what justifies an abstraction. for elixir something like: genserver only when the process owns state or serializes access to an external resource. behaviour only when there are multiple real implementations or a declared pluggability boundary. registry only when lookup crosses ownership boundaries. ets only when shared concurrent access justifies it. with these declared, ai output that introduces a genserver wrapping pure functions becomes a structural violation, not a taste call.
- traceability. every module, public function, and external effect should map back to a spec fragment. anything that does not trace is suspect. cheapest single anti-slop signal i know of and fully deterministic. it operationalizes bunnylushington’s “be specific about acceptance criteria” structurally rather than by prompt discipline.
- cost function over the implementation graph. module_count, public_function_count, callback_count, process_count, supervision_depth, duplicated_logic, with per-component budgets declared in the spec. ai output that exceeds budget is rejected. this operationalizes dimitarvp’s “less lines” instinct without conflating brevity with quality. you penalize unjustified structure, not line count.
- forbidden-transition tests are cheaper than happy-path tests. for any state machine in the code, the spec should say what transitions cannot happen. these become property tests. examples: a credential goes from issued to redeemed by the wrong consumer. an operation completes after its session was terminated. ai-generated failures concentrate in forbidden transitions, not happy paths.
- metric is a vector, not a number. accepted under spec / module count / public api count / unjustified behaviours / missing traceability links / forbidden transitions covered. the green checkmark is one component.
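a rough sketch of how the admissibility rules, the traceability check, and the budgets could combine into one vector. this is python only to keep it neutral and runnable. the `Component` shape, the field names, and the budget format are all made up for illustration. a real version would extract the graph from the compiled elixir project rather than take it as hand-written data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    # Hypothetical node of the implementation graph.
    name: str
    kind: str                          # "module", "genserver", or "behaviour"
    owns_state: bool = False           # genserver: the process owns state
    serializes_resource: bool = False  # genserver: serializes an external resource
    implementations: int = 0           # behaviour: number of real implementations
    pluggable: bool = False            # behaviour: declared pluggability boundary
    spec_refs: tuple = ()              # spec fragment ids this component traces to


def admissibility_violations(components):
    """Apply the declared rules; each hit is a structural violation."""
    out = []
    for c in components:
        if c.kind == "genserver" and not (c.owns_state or c.serializes_resource):
            out.append(f"{c.name}: genserver wraps pure functions")
        if c.kind == "behaviour" and c.implementations < 2 and not c.pluggable:
            out.append(f"{c.name}: behaviour with a single implementation")
    return out


def untraced(components):
    """Components that map back to no spec fragment."""
    return [c.name for c in components if not c.spec_refs]


def over_budget(components, budget):
    """Per-kind excess over the declared budget, e.g. {"genserver": 1}."""
    counts = Counter(c.kind for c in components)
    return {kind: counts[kind] - limit
            for kind, limit in budget.items() if counts[kind] > limit}


def score(components, budget, tests_passed):
    """The metric as a vector; the green checkmark is one entry."""
    return {
        "accepted_under_spec": tests_passed,
        "component_count": len(components),
        "unjustified": admissibility_violations(components),
        "missing_traceability": untraced(components),
        "over_budget": over_budget(components, budget),
    }
```

the point of returning a dict instead of a boolean is that a candidate with passing tests and a non-empty `unjustified` list is visibly different from one that is clean on every axis.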
on the two-prompt pattern (blind, then review plus fix): right shape, but manual. the generalization is a harness where the loop is: generate, extract implementation graph, check against spec graph, run tests, run cost function, normalize (collapse single-impl behaviours, inline single-method modules, remove duplicate validation), re-test, accept or reject with reasons. credo, archdo, and the review skill are deterministic operators in that pipeline. claude is one operator among many, used where deterministic checks are weak (semantic ambiguity, naming, “is this rationale coherent”). distill repeated llm judgments into deterministic rules so you stop paying frontier prices to rediscover the same pattern.
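the loop above could look roughly like this. everything here is a stand-in: `Candidate`, `review`, and `harness` are hypothetical names, the generator and test runner are injected callables, and the candidate is assumed to already be the extracted implementation graph. applying the module budget after normalization rather than before it is my assumption about the ordering.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Candidate:
    # Hypothetical implementation graph:
    # module name -> spec fragment ids it claims to implement.
    modules: Dict[str, Set[str]]


def review(candidate: Candidate,
           spec_ids: Set[str],
           run_tests: Callable[[Candidate], bool],
           normalize: Callable[[Candidate], Candidate],
           max_modules: int) -> Tuple[Candidate, List[str]]:
    """Run the deterministic gates once; return the normalized candidate and reasons."""
    reasons: List[str] = []
    # check the implementation graph against the spec graph
    for name, refs in sorted(candidate.modules.items()):
        if not refs or not refs <= spec_ids:
            reasons.append(f"{name}: no valid spec trace")
    if not run_tests(candidate):
        reasons.append("tests fail")
    # normalize, then re-test so the normalizer cannot silently break behaviour
    normalized = normalize(candidate)
    if not run_tests(normalized):
        reasons.append("tests fail after normalization")
    # cost function: a single module budget in this sketch
    if len(normalized.modules) > max_modules:
        reasons.append(
            f"module budget exceeded: {len(normalized.modules)} > {max_modules}")
    return normalized, reasons


def harness(generate: Callable[[int], Candidate],
            spec_ids: Set[str],
            run_tests: Callable[[Candidate], bool],
            normalize: Callable[[Candidate], Candidate],
            max_modules: int,
            attempts: int = 3) -> Tuple[Optional[Candidate], List[List[str]]]:
    """Treat generation as a noisy operator: retry until a candidate passes every gate."""
    rejections: List[List[str]] = []
    for attempt in range(attempts):
        normalized, reasons = review(
            generate(attempt), spec_ids, run_tests, normalize, max_modules)
        if not reasons:
            return normalized, rejections
        rejections.append(reasons)
    return None, rejections
```

the rejection log is the useful by-product: it is the "reject with reasons" part, and it can be fed back into the next generation attempt or mined for new deterministic rules.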
predictability does not come from making the llm deterministic. it comes from the post-generation system. same spec plus same engineering policy plus same normalizer collapses many non-deterministic proposals to the same accepted shape. that is the property you actually want.
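a toy illustration of that collapse, assuming a made-up proposal format (a dict with `modules`, `behaviours`, and `single_function_helpers`) and a normalizer that implements only two of the rewrites mentioned earlier. two structurally different proposals end up as the same canonical shape.

```python
def normalize(proposal):
    """Collapse a proposal to a canonical, order-independent shape.

    Hypothetical proposal format:
      modules:                 iterable of module names
      behaviours:              behaviour name -> list of implementing modules
      single_function_helpers: helper module -> the one module that calls it
    """
    modules = set(proposal["modules"])

    # collapse single-impl behaviours: keep only behaviours with 2+ implementations
    behaviours = {
        name: tuple(sorted(impls))
        for name, impls in proposal.get("behaviours", {}).items()
        if len(impls) >= 2
    }

    # inline single-function helper modules into their only caller
    for helper, caller in proposal.get("single_function_helpers", {}).items():
        modules.discard(helper)
        modules.add(caller)

    return (tuple(sorted(modules)), tuple(sorted(behaviours.items())))


# proposal A: a behaviour with one implementation plus a tiny helper module
proposal_a = {
    "modules": ["Cache", "Cache.Validator"],
    "behaviours": {"Cache.Backend": ["Cache"]},
    "single_function_helpers": {"Cache.Validator": "Cache"},
}

# proposal B: the same functionality written as a single module
proposal_b = {"modules": ["Cache"]}

same_shape = normalize(proposal_a) == normalize(proposal_b)
```

here `same_shape` is true: both proposals normalize to one `Cache` module with no behaviours, even though the generator produced them differently.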
on multi-agent review with separate contexts (bunnylushington’s pattern): correct that fresh contexts catch what sticky contexts miss, but the gates should still be mostly deterministic. an ensemble of llm judges adds noise on top of noise unless the votes are reduced to rules over time.
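one way "reduce the votes to rules" could be made mechanical, as a sketch only. the finding ids, the thresholds, and the idea that each judge returns a set of finding ids are assumptions for illustration. a finding becomes a candidate rule once enough judges agree on it in enough separate runs.

```python
from collections import Counter


def promote_rules(runs, min_votes=2, min_runs=3):
    """Return finding ids that recur often enough to be worth a deterministic rule.

    runs: list of runs; each run is a list of per-judge collections of finding ids.
    A finding counts for a run when at least `min_votes` judges flagged it,
    and is promoted when that happens in at least `min_runs` runs.
    """
    recurring = Counter()
    for run in runs:
        # set() so a judge repeating itself still counts as one vote
        votes = Counter(f for judge in run for f in set(judge))
        for finding, n in votes.items():
            if n >= min_votes:
                recurring[finding] += 1
    return sorted(f for f, n in recurring.items() if n >= min_runs)
```

anything this returns is a pattern the judges keep rediscovering, so it is a candidate for a credo check or similar. anything that never clears the threshold stays with the llm.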
on the realworld benchmark specifically. you already noticed the contamination (claude refused to fetch spec because it had seen too many medium clones). realworld measures memorization more than synthesis at this point. private benchmarks help. the cleaner move is to evaluate against architectural quality on whatever target rather than functional pass rate alone.
mudasobwa’s point about mcp plus rag making code “out of the equation” is in the right direction. the agent-facing surface gets abstracted. the substrate underneath still has to be small, idiomatic, and maintainable by the human who eventually has to debug it at 2am. you do not get to outsource that to rag.
re lower thinking modes and cheaper models: this should work, and the harness is what makes it work. cheap models proposing inside a constrained search space with strong deterministic gates produce better accepted output than expensive models in unconstrained generation. the cost saving comes from making generation a cheap noisy operator inside a strict compiler loop, not from making the model smarter.
credence and the upcoming set-theoretic types in 1.20 both fit naturally as deterministic operators in this kind of harness. worth wiring in early.