Code quality metrics that actually predict pain — and the vanity numbers to ignore

the short answer

The code quality metrics that matter are the ones that predict the cost of change: file size (past ~600 lines a file holds too many jobs), nesting depth (6+ levels is where logic bugs live), duplication (copies that must change together), dead code volume, consistency, and repo hygiene. Raw lines of code, comment density, and a coverage percentage chased for its own sake are vanity numbers. clean·vibes measures the first set as six weighted subscores rolled into one 0–100 score built to be tracked as a delta between scans.

Most code metrics answer a question nobody asked. Total lines of code measures how much code you have, not whether it's good. Comment density rewards narrating the obvious. Coverage chased as a number produces tests that assert nothing. The metrics worth watching share one property: they predict the cost of change — how long the next feature takes, how likely the next fix is to break something else.

That lens, cost of change, is the honest way to evaluate any quality number, including clean·vibes's. This page sorts the common metrics into the ones that predict pain and the ones that don't, then explains how the useful ones become the six subscores in a cleanliness score.

51.5% of 99 vibe-coded repos had at least one file past the ~600-line threshold, averaging 3.4 flagged files each.
Source: clean·vibes rules-engine study, 99 public AI/vibe-coded GitHub repos, collected June 10, 2026

The metrics that predict the cost of change

File size is the bluntest and one of the most predictive: past ~600 lines a file almost always holds several unrelated jobs, and past 1,200 it's where bugs hide — nobody reads it whole, every change risks side effects, merge conflicts become constant. Nesting depth is its partner for readability: six levels deep means a reader holds six conditions in their head at once, which is precisely where logic bugs live. Both are cheap to measure and hard to argue with.
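Both checks are small enough to sketch. The Python below is a minimal illustration, not clean·vibes's implementation: the function names, the choice of which statements count as a nesting level, and the treatment of `elif` as staying on the same level are all assumptions. Only the 600, 1,200, and 6 thresholds come from the text above.

```python
import ast

# Thresholds quoted in the text; the constant names are hypothetical.
LARGE_FILE_LINES = 600
HUGE_FILE_LINES = 1200
DEEP_NESTING = 6

# Statements treated as opening a nesting level (an assumption of this sketch).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try,
                 ast.AsyncFor, ast.AsyncWith)


def file_size_flag(source: str) -> str:
    """Classify a file as 'ok', 'large' (>600 lines) or 'huge' (>1,200 lines)."""
    lines = len(source.splitlines())
    if lines > HUGE_FILE_LINES:
        return "huge"
    if lines > LARGE_FILE_LINES:
        return "large"
    return "ok"


def max_nesting_depth(source: str) -> int:
    """Return the deepest chain of nested control-flow blocks in Python source."""
    def walk(node: ast.AST, depth: int) -> int:
        deepest = depth
        for child in ast.iter_child_nodes(node):
            # The parser represents `elif` as an If that is the only statement
            # in the parent's else branch; keep it on the parent's level.
            is_elif = (isinstance(node, ast.If) and isinstance(child, ast.If)
                       and node.orelse == [child])
            if isinstance(child, NESTING_NODES) and not is_elif:
                child_depth = depth + 1
            else:
                child_depth = depth
            deepest = max(deepest, walk(child, child_depth))
        return deepest

    return walk(ast.parse(source), 0)


def is_too_deep(source: str) -> bool:
    """True when the source reaches the 6-level mark from the text."""
    return max_nesting_depth(source) >= DEEP_NESTING
```

A language-agnostic scanner would more likely count indentation or braces than parse an AST; the AST version is used here only because it is short and unambiguous for Python input.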

Duplication measures how many places one decision is written — every coupled copy multiplies the cost of changing that decision. And the leftover metrics — dead code volume, consistency violations, hygiene gaps — measure friction: each commented-out block, phantom diff, and missing lockfile is a small tax on every future interaction with the repo. None of these guarantee bugs; all of them raise the price of the next change, which is what quality means in practice.
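One common way to find "places one decision is written" is to hash sliding windows of normalized lines and look for collisions. The sketch below assumes that approach; the text does not say how clean·vibes detects duplication, and the function name, the six-line default window, and the whitespace-only normalization are illustrative choices.

```python
import hashlib
from collections import defaultdict


def find_duplicate_blocks(files: dict[str, str],
                          window: int = 6) -> list[list[tuple[str, int]]]:
    """Group locations where the same `window` consecutive non-blank lines recur.

    `files` maps a path to its source text. Each returned group is a list of
    (path, starting line number) pairs that share identical content.
    """
    seen = defaultdict(list)
    for path, source in files.items():
        # Normalize: strip indentation, drop blank lines, keep original numbers.
        lines = [(number, text.strip())
                 for number, text in enumerate(source.splitlines(), start=1)
                 if text.strip()]
        for i in range(len(lines) - window + 1):
            chunk = "\n".join(text for _, text in lines[i:i + window])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            seen[digest].append((path, lines[i][0]))
    return [locations for locations in seen.values() if len(locations) > 1]
```

A duplicated block longer than the window shows up as several overlapping groups, so a real report would merge adjacent hits before counting them. Exact-match hashing also misses copies where a variable was renamed, which is often how coupled copies drift.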

The vanity numbers

Total lines of code tells you size, not quality — and used as a productivity metric it's actively harmful, rewarding exactly the bloat the useful metrics flag. Comment density assumes comments are good per se; a codebase that needs narration everywhere is usually one whose names and structure have failed, and the highest-value comments (the rare why-comments) disappear into a density average.

Coverage deserves a more careful verdict: having tests genuinely matters — clean·vibes flags a repo with zero tests as a hygiene finding — but the percentage makes a poor target. Chasing a number produces assertion-free tests that inflate it, and the difference between 70% and 85% says little about whether the tests would catch a real regression. Track whether the critical paths are tested; ignore the decimal.

From metrics to a score you can track

Individual metrics are most useful when something turns them into decisions. clean·vibes's model: every check feeds one of six categories (structure & size 20, readability & complexity 20, duplication 15, dead code 15, consistency 15, hygiene 15), each category starts at 100 and loses points per finding by severity (critical −40, high −22, medium −10, low −4), and the overall score is the weighted average. One number, fully decomposable back into the findings that produced it — the opposite of a black-box grade.
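The arithmetic described above fits in a few lines. This is a sketch built from the weights and penalties quoted in the paragraph, not clean·vibes's code: the function names are hypothetical, and flooring a category at 0 is an assumption, since the text does not say what happens when penalties exceed 100.

```python
# Category weights and severity penalties as quoted in the text.
WEIGHTS = {
    "structure & size": 20,
    "readability & complexity": 20,
    "duplication": 15,
    "dead code": 15,
    "consistency": 15,
    "hygiene": 15,
}
PENALTIES = {"critical": 40, "high": 22, "medium": 10, "low": 4}


def category_score(severities: list[str]) -> int:
    """Start at 100 and subtract a penalty per finding."""
    # Assumption: a category cannot drop below 0.
    return max(0, 100 - sum(PENALTIES[s] for s in severities))


def overall_score(findings: dict[str, list[str]]) -> float:
    """Weighted average of the six category scores."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(weight * category_score(findings.get(category, []))
                   for category, weight in WEIGHTS.items())
    return weighted / total_weight


findings = {
    "structure & size": ["high", "medium"],  # 100 - 22 - 10 = 68
    "duplication": ["critical"],             # 100 - 40 = 60
    "hygiene": ["low"],                      # 100 - 4 = 96
}
print(overall_score(findings))  # 87.0
```

Working it by hand: (20×68 + 20×100 + 15×60 + 15×100 + 15×100 + 15×96) / 100 = 8700 / 100 = 87. Because every point lost traces to a specific finding, the score can be decomposed the same way in reverse.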

The score is built to be tracked as a delta, not worshipped as an absolute: scan, fix the ranked findings (each ships a paste-ready Claude prompt), re-scan, and the report shows movement since last scan. And the standing caveat applies to every metric on this page: these measure the cost of change, not correctness — a clean repo can still compute the wrong answer, which is what tests and review are for.
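Tracking the score as a delta amounts to subtracting one scan's numbers from the next. A minimal sketch, assuming each scan is summarized as a mapping from category name to score (the report format and the function name are hypothetical):

```python
def score_delta(previous: dict[str, float],
                current: dict[str, float]) -> dict[str, float]:
    """Per-category change between two scans; positive means improvement."""
    return {category: current[category] - previous[category]
            for category in current if category in previous}


last_scan = {"overall": 71.0, "duplication": 60.0}
this_scan = {"overall": 78.5, "duplication": 100.0}
print(score_delta(last_scan, this_scan))  # {'overall': 7.5, 'duplication': 40.0}
```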

Quality metrics sorted by whether they predict the cost of change

| Metric | What it actually tells you | Verdict |
| --- | --- | --- |
| File size | Files past ~600 lines hold several jobs; past 1,200, bugs hide | Watch it (clean·vibes: structure, weight 20) |
| Nesting depth | 6+ levels = conditions nobody can hold in their head | Watch it (clean·vibes: readability, weight 20) |
| Duplication | How many places one decision is written | Watch it (clean·vibes: weight 15) |
| Dead code volume | Friction and false leads for every reader | Watch it (clean·vibes: weight 15) |
| Consistency + hygiene | Phantom diffs, unreproducible installs, onboarding cost | Watch them (clean·vibes: weight 15 each) |
| Total lines of code | Size. Nothing else. | Ignore (harmful as a target) |
| Comment density | How much narration exists, not whether it helps | Ignore the average; keep the why-comments |
| Coverage % | That tests ran lines, not that they'd catch a regression | Have tests; don't chase the number |

frequently asked

What about cyclomatic complexity?
It's a legitimate cousin of nesting depth — both measure how many paths a reader must track. Depth-based checks catch most of the same offenders while being easier to read in a report, which is the trade clean·vibes makes. If you already track cyclomatic complexity in CI, keep it; the two agree far more than they differ.
Why is file size weighted so heavily?
Because it compounds: a giant file taxes every change anyone ever makes in it, and AI coding tools grow files aggressively, since adding to the current file always works. Structure & size and readability together carry 40 of the 100 weight points for exactly that reason.
Can I game the score by splitting files mechanically?
Somewhat — any metric can be gamed. But the gaming is mostly the fix: splitting a 1,500-line file into coherent modules is the improvement, and the prompts clean·vibes writes push toward seam-based splits, not arbitrary ones. A score you gamed honestly is a codebase you improved.
Does a high cleanliness score mean my code is correct?
No — and any tool that implies otherwise is overselling. Cleanliness predicts the cost of change; correctness is verified by tests and review. clean·vibes doesn't execute your code. The honest pitch is that clean code makes correctness cheaper to achieve and maintain.

Published June 10, 2026 · Last updated June 11, 2026
