The state of vibe-coded code quality: 549 repos measured

The headline numbers: duplication, dead code, and size

55.6% of the 549 repos had at least one duplication finding, and cross-file duplication alone — the same block living in two or more files — hit 46.4% of repos. 63.2% had dead-code findings, with commented-out code the single most common finding in the entire study at 59.2% of repos. And 60.5% had at least one file flagged for size past the ~600-line threshold, averaging 5.3 size-flagged files per affected repo.

These three are the signature of how AI tools write code: generating a fresh copy is cheaper than finding the existing one, abandoned approaches get commented out rather than deleted, and the current file grows forever because adding to it always works. None of it breaks the app, which is exactly why it ships, and why each one shows up in more than half the corpus.

A bimodal distribution: the high median and the 71 Fs are both true

The median repo scored 89, a B — and 71 of the 549 repos graded F, with the lowest score at 13. Both numbers are real; the distribution is bimodal. The median is inflated by tiny repos: 85 of the 549 have 15 or fewer scannable files (workshop demos, single-page toys, docs-heavy repos), and a tiny repo trivially scores an A because there's almost nothing to flag. They're genuinely self-described vibe-coded output, so they were counted, not filtered.

The honest read is the mean and the spread: mean score 80.7, grades A 273, B 92, C 65, D 48, F 71. That's 276 repos below A and a heavy tail: 39.5% of repos had at least one critical or high finding. Findings per repo tell the same two-population story: mean 21.4 against a median of 6. Because reporting is capped at 10 findings per rule, the messiest repos are undercounted, so that mean is a floor rather than an exact figure.
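The grade arithmetic above is easy to check by hand, and a short Python sketch makes the reporting cap concrete. The grade counts come straight from the text. The `reported_findings` helper and the three per-rule counts fed to it are hypothetical, invented only to illustrate how a cap of 10 per rule understates a messy repo; they are not the rules engine's actual code or data.

```python
# Grade counts as reported in the study text.
grade_counts = {"A": 273, "B": 92, "C": 65, "D": 48, "F": 71}

total = sum(grade_counts.values())                    # all graded repos
below_a = total - grade_counts["A"]                   # repos graded B through F
f_share = round(100 * grade_counts["F"] / total, 1)   # percent of repos graded F

REPORT_CAP = 10  # per-rule reporting cap stated in the text


def reported_findings(raw_counts_per_rule):
    """Sum per-rule finding counts after applying the cap (hypothetical helper)."""
    return sum(min(count, REPORT_CAP) for count in raw_counts_per_rule)


# Hypothetical repo: three rules with 4, 37 and 12 raw hits.
raw_counts = [4, 37, 12]
messy_raw = sum(raw_counts)                    # uncapped total
messy_reported = reported_findings(raw_counts)  # 4 + 10 + 10

print(total, below_a, f_share, messy_raw, messy_reported)
```

In this toy case the repo would be reported with 24 findings against 53 raw hits, which is why a capped mean can only sit at or below the true one.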

Cutting the corpus to repos with 15 or more scannable files (n = 467) makes the point directly: in that substantive subset, 70% had dead-code findings, 66% carried commented-out code, 63% had duplication, and 42.6% had at least one critical or high finding. Every hit rate rises once the toy repos are out.
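As a rough sketch of that cut, the same hit rate can be computed twice: once over everything and once over repos at or above a file-count floor. The `RepoScan` record, its field names, and the five toy repos are all assumptions made up for illustration. They are not the study's data or the clean·vibes output format.

```python
from dataclasses import dataclass


@dataclass
class RepoScan:
    # Hypothetical record shape; field names are illustrative only.
    name: str
    scannable_files: int
    has_dead_code: bool


def hit_rate(repos, min_files=0):
    """Percent of repos with at least `min_files` scannable files
    that carry a dead-code finding."""
    subset = [r for r in repos if r.scannable_files >= min_files]
    if not subset:
        return 0.0
    hits = sum(1 for r in subset if r.has_dead_code)
    return round(100 * hits / len(subset), 1)


# Toy corpus (invented): two tiny clean demos and three larger apps.
toy = [
    RepoScan("demo-a", 3, False),
    RepoScan("demo-b", 8, False),
    RepoScan("app-c", 40, True),
    RepoScan("app-d", 120, True),
    RepoScan("app-e", 65, False),
]

full_rate = hit_rate(toy)                   # all five repos
subset_rate = hit_rate(toy, min_files=15)   # only repos with 15 or more files

print(full_rate, subset_rate)
```

Dropping the two tiny repos removes clean entries from the denominator without removing any hits, so the rate rises. The same mechanism is behind the rise in the real subset.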

Where the mess concentrates, and what we didn't measure

By category, 42.8% of repos had at least one hygiene finding, 55.6% duplication, 61.2% structure, 63.2% dead code, 55.9% readability, and 22.8% consistency. The top individual findings, in descending order: commented-out code (59.2%), large files past ~600 lines (56.1%), deep nesting (53.6%), cross-file duplication (46.4%), in-file repetition (41.2%), and giant files past 1,200 lines (33.0%).

Two limits are worth stating plainly. The study ran the heuristic rules engine only: the Claude review that runs in every real clean·vibes scan was switched off, so anything needing code-reading judgment went uncounted, and scores would likely shift down with it on. And "vibe-coded" means self-described: we took repos at their word.

About this study: 549 public GitHub repos self-described as AI- or vibe-coded, data collected July 2026, scanned by the clean·vibes rules engine. The full methodology is on the how-we-benchmarked page.

Share of the 549 repos with at least one finding, by clean·vibes category (rules engine only, July 2026)

Category	Repos with ≥1 finding	Most common finding inside it
Repo hygiene	42.8%	No .gitignore (15.3% of repos), no README (11.7%)
Duplication	55.6%	Cross-file duplication (46.4% of repos)
Structure & size	61.2%	Files past ~600 lines (56.1% of repos)
Dead code & leftovers	63.2%	Commented-out code (59.2% of repos)
Readability & complexity	55.9%	Deep nesting (53.6% of repos)
Consistency & style	22.8%	Mixed conventions and competing lockfiles

frequently asked

Does this prove AI-generated code is messy?

It proves something narrower and more useful: more than half of self-described vibe-coded repos carry duplication, nearly two-thirds carry dead code, three-fifths carry oversized files, and 71 of 549 grade F — measured by deterministic rules, not opinion. The mess is predictable, which is also why it's catchable.

Why lead with hit rates when the median score is 89?

Because the median is inflated by tiny demo repos — 85 of the 549 have 15 or fewer files and trivially score A. The distribution is bimodal: a clean small half and a genuinely messy tail. Means, the grade spread, and per-category hit rates describe both halves; the median describes neither.

Was AI used in the scoring?

No. The study ran the clean·vibes rules engine only. The Claude review that runs in normal scans was deliberately switched off, so every number here is reproducible pattern detection. Real scans add an AI pass on top, which finds more, not less.

Can I see how my repo compares?

Yes — paste your GitHub repo link into clean·vibes and you get the same six-category scan, scored 0–100 with a letter grade, in under a minute. The free tier covers one scan, no card.

Published June 11, 2026 · Last updated July 25, 2026
