Code quality metrics that matter (and the ones that don't)

The metrics that predict the cost of change

File size is the bluntest and one of the most predictive: past ~600 lines a file almost always holds several unrelated jobs, and past 1,200 it's where bugs hide — nobody reads it whole, every change risks side effects, merge conflicts become constant. Nesting depth is its partner for readability: six levels deep means a reader holds six conditions in their head at once, which is precisely where logic bugs live. Both are cheap to measure and hard to argue with.
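Both measurements fit in a few lines. The sketch below is one possible way to take them for Python source, not how clean·vibes does it: the function names, the verdict labels, and the choice of which statements count as a nesting level are all assumptions made for illustration. Only the 600 and 1,200 line thresholds come from the text above.

```python
import ast

# Thresholds from the text; the labels returned below are hypothetical.
LARGE_FILE_LINES = 600
HUGE_FILE_LINES = 1200

# Statements counted as one nesting level each (an assumption).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def line_count(source: str) -> int:
    """Number of lines in the file, blank lines included."""
    return len(source.splitlines())


def size_verdict(n_lines: int) -> str:
    """Map a line count onto the two thresholds named in the text."""
    if n_lines > HUGE_FILE_LINES:
        return "huge"
    if n_lines > LARGE_FILE_LINES:
        return "large"
    return "ok"


def max_nesting_depth(source: str) -> int:
    """Deepest stack of nested control-flow statements in the source."""

    def walk(node: ast.AST, depth: int) -> int:
        deepest = depth
        for child in ast.iter_child_nodes(node):
            # An `elif` parses as an If inside the parent's orelse;
            # treat it as the same level rather than a deeper one.
            is_elif = (
                isinstance(node, ast.If)
                and isinstance(child, ast.If)
                and node.orelse == [child]
            )
            nests = isinstance(child, NESTING_NODES) and not is_elif
            deepest = max(deepest, walk(child, depth + 1 if nests else depth))
        return deepest

    return walk(ast.parse(source), 0)
```

A `for` containing an `if` containing a `while` reports depth 3, while a flat `if`/`elif`/`elif` chain reports depth 1, which matches how a reader experiences the two shapes.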

Duplication measures how many places one decision is written: every coupled copy multiplies the cost of changing that decision. The remaining metrics (dead code volume, consistency violations, hygiene gaps) measure friction: each commented-out block, phantom diff, and missing lockfile is a small tax on every future interaction with the repo. None of these guarantees bugs; all of them raise the price of the next change, which is what quality means in practice.
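To make "how many places one decision is written" concrete, here is a minimal sketch of a textual duplicate finder. It is a generic sliding-window technique, not a description of clean·vibes's detector; the function name, the four-line window, and the whitespace-only normalization are assumptions.

```python
from collections import defaultdict


def duplicated_windows(files: dict[str, str], window: int = 4):
    """Find runs of `window` consecutive non-blank lines that occur more than once.

    Returns a dict mapping each repeated run (a tuple of stripped lines)
    to the list of (filename, starting line number) where it appears.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Keep original line numbers, drop blank lines, ignore indentation.
        numbered = [
            (number, line.strip())
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip()
        ]
        for start in range(len(numbered) - window + 1):
            chunk = numbered[start:start + window]
            key = tuple(line for _, line in chunk)
            seen[key].append((name, chunk[0][0]))
    return {key: places for key, places in seen.items() if len(places) > 1}
```

Each entry in the result is one decision written in several places. The length of its location list is roughly the multiplier on the cost of changing it.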

The vanity numbers

Total lines of code tells you size, not quality — and used as a productivity metric it's actively harmful, rewarding exactly the bloat the useful metrics flag. Comment density assumes comments are good per se; a codebase that needs narration everywhere is usually one whose names and structure have failed, and the highest-value comments (the rare why-comments) disappear into a density average.

Coverage deserves a more careful verdict: having tests genuinely matters — clean·vibes flags a repo with zero tests as a hygiene finding — but the percentage makes a poor target. Chasing a number produces assertion-free tests that inflate it, and the difference between 70% and 85% says little about whether the tests would catch a real regression. Track whether the critical paths are tested; ignore the decimal.

From metrics to a score you can track

Individual metrics are most useful when something turns them into decisions. clean·vibes's model: every check feeds one of six categories (structure & size 20, readability & complexity 20, duplication 15, dead code 15, consistency 15, hygiene 15), each category starts at 100 and loses points per finding by severity (critical −40, high −22, medium −10, low −4), and the overall score is the weighted average. One number, fully decomposable back into the findings that produced it — the opposite of a black-box grade.
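The arithmetic is simple enough to sketch. The weights and penalties below are the ones stated above; the category keys, the function names, and the floor at zero (so a category cannot go negative) are assumptions, since the text does not say how clean·vibes handles a category with more than 100 points of findings.

```python
# Category weights and severity penalties as stated in the text.
CATEGORY_WEIGHTS = {
    "structure_size": 20,
    "readability_complexity": 20,
    "duplication": 15,
    "dead_code": 15,
    "consistency": 15,
    "hygiene": 15,
}
SEVERITY_PENALTY = {"critical": 40, "high": 22, "medium": 10, "low": 4}


def category_score(severities: list[str]) -> int:
    """Start at 100, subtract per finding. Flooring at 0 is an assumption."""
    return max(0, 100 - sum(SEVERITY_PENALTY[s] for s in severities))


def overall_score(findings: dict[str, list[str]]) -> float:
    """Weighted average of the six category scores."""
    total_weight = sum(CATEGORY_WEIGHTS.values())
    weighted = sum(
        weight * category_score(findings.get(category, []))
        for category, weight in CATEGORY_WEIGHTS.items()
    )
    return weighted / total_weight


findings = {
    "structure_size": ["critical", "high"],  # 100 - 40 - 22 = 38
    "duplication": ["medium"],               # 100 - 10 = 90
    "hygiene": ["low", "low"],               # 100 - 4 - 4 = 92
}
print(overall_score(findings))  # 84.9
```

Because each category score is a plain sum over its findings, any overall number can be traced back to the specific findings that lowered it.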

The score is built to be tracked as a delta, not worshipped as an absolute: scan, fix the ranked findings (each ships a paste-ready Claude prompt), re-scan, and the report shows movement since last scan. And the standing caveat applies to every metric on this page: these measure the cost of change, not correctness — a clean repo can still compute the wrong answer, which is what tests and review are for.
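Tracking a delta needs nothing more than the category scores from two scans. A minimal sketch, with a hypothetical function name and made-up scores; the actual report format is not described here:

```python
def score_movement(previous: dict[str, float], current: dict[str, float]):
    """Per-category change since the last scan, largest movement first."""
    deltas = {
        category: current[category] - previous[category]
        for category in current
        if category in previous
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


# Illustrative numbers only.
last_scan = {"structure_size": 38, "duplication": 90, "hygiene": 92}
this_scan = {"structure_size": 78, "duplication": 90, "hygiene": 96}
print(score_movement(last_scan, this_scan))
# [('structure_size', 40), ('hygiene', 4), ('duplication', 0)]
```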

Quality metrics sorted by whether they predict the cost of change

Metric	What it actually tells you	Verdict
File size	Files past ~600 lines hold several jobs; past 1,200, bugs hide	Watch it — clean·vibes: structure, weight 20
Nesting depth	6+ levels = conditions nobody can hold in their head	Watch it — clean·vibes: readability, weight 20
Duplication	How many places one decision is written	Watch it — clean·vibes: weight 15
Dead code volume	Friction and false leads for every reader	Watch it — clean·vibes: weight 15
Consistency + hygiene	Phantom diffs, unreproducible installs, onboarding cost	Watch them — clean·vibes: weight 15 each
Total lines of code	Size. Nothing else.	Ignore (harmful as a target)
Comment density	How much narration exists, not whether it helps	Ignore the average; keep the why-comments
Coverage %	That tests ran lines — not that they'd catch a regression	Have tests; don't chase the number

Frequently asked

What about cyclomatic complexity?

It's a legitimate cousin of nesting depth — both measure how many paths a reader must track. Depth-based checks catch most of the same offenders while being easier to read in a report, which is the trade clean·vibes makes. If you already track cyclomatic complexity in CI, keep it; the two agree far more than they differ.
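For comparison, a rough cyclomatic count can be sketched as one plus the number of decision points. This is a simplified approximation of McCabe's metric for Python source, and the set of node types counted is an assumption; established tools differ in the details.

```python
import ast

# Node types counted as one decision point each (a simplifying assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # The loop itself plus each `if` filter.
            complexity += 1 + len(node.ifs)
    return complexity
```

The two metrics diverge mainly on flat code: a function with ten sequential `if` statements scores high here but has a nesting depth of one.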

Why is file size weighted so heavily?

Because it compounds: a giant file taxes every change anyone ever makes in it, and AI coding tools grow files aggressively, since adding to the current file always works. Structure & size and readability & complexity together carry 40 of the 100 weight points for exactly that reason.

Can I game the score by splitting files mechanically?

Somewhat — any metric can be gamed. But the gaming is mostly the fix: splitting a 1,500-line file into coherent modules is the improvement, and the prompts clean·vibes writes push toward seam-based splits, not arbitrary ones. A score you gamed honestly is a codebase you improved.

Does a high cleanliness score mean my code is correct?

No — and any tool that implies otherwise is overselling. Cleanliness predicts the cost of change; correctness is verified by tests and review. clean·vibes doesn't execute your code. The honest pitch is that clean code makes correctness cheaper to achieve and maintain.

Published June 10, 2026 · Last updated June 11, 2026
