The headline numbers: duplication, dead code, and size
56.6% of the 99 repos had at least one duplication finding, and cross-file duplication alone — the same block living in two or more files — hit 45.5% of repos. 50.5% had dead-code findings, with commented-out code the single most common finding in the entire study at 47.5% of repos. And 51.5% had at least one file flagged for size past the ~600-line threshold, averaging 3.4 size-flagged files per affected repo.
These three are the signature of how AI tools write code: generating a fresh copy is cheaper than finding the existing one, abandoned approaches get commented out rather than deleted, and the current file grows forever because adding to it always works. None of it breaks the app — which is exactly why it ships, and why it shows up in half the corpus.
A bimodal distribution: the median A and the 12 Fs are both true
The median repo scored 92, an A — and 12 of the 99 repos graded F, with the lowest score at 30. Both numbers are real; the distribution is bimodal. The median is inflated by tiny repos: 22 of the 99 have 15 or fewer scannable files (workshop demos, single-page toys, docs-heavy repos), and a tiny repo trivially scores an A because there's almost nothing to flag. They're genuinely self-described vibe-coded output, so they were counted, not filtered.
The honest read is the mean and the spread: mean score 83.4, with grades A 54, B 16, C 12, D 5, F 12. That's 45 repos below A and a heavy tail: 35.4% of repos had at least one critical or high finding. Findings per repo tell the same two-population story: mean 16.3 against a median of 5. And because reporting caps each rule at 10 findings, the messiest repos are undercounted, so 16.3 is a floor on the true mean.
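The arithmetic behind that paragraph can be checked in a few lines. This is a minimal sketch: the grade tally is copied from the text, while the per-rule counts in `repos` are hypothetical, invented only to show how a per-rule cap pulls the reported mean below the true one and leaves the median untouched. The `capped` helper and the default cap of 10 per rule are an assumed reading of how the cap applies, not the cleanvibes implementation.

```python
from statistics import mean, median

# Grade tally as reported in the text above.
grades = {"A": 54, "B": 16, "C": 12, "D": 5, "F": 12}

total = sum(grades.values())          # size of the corpus
below_a = total - grades["A"]         # repos graded B or worse
shares = {g: round(100 * n / total, 1) for g, n in grades.items()}


def capped(counts_per_rule, cap=10):
    """Total findings a repo reports when each rule stops counting at `cap`."""
    return sum(min(n, cap) for n in counts_per_rule)


# HYPOTHETICAL per-rule finding counts for four repos (not study data).
repos = [[1, 0, 2], [3, 1, 1], [0, 4, 2], [40, 25, 3]]

true_totals = [sum(r) for r in repos]
reported_totals = [capped(r) for r in repos]

print(total, below_a, shares)
print("mean:", mean(true_totals), "vs reported", mean(reported_totals))
print("median:", median(true_totals), "vs reported", median(reported_totals))
```

Only the one messy repo is affected by the cap, so the median does not move while the reported mean drops. That is the sense in which a capped mean is a lower bound.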
Where the mess concentrates, and what we didn't measure
By category, 60.6% of repos had at least one hygiene finding, 56.6% duplication, 52.5% structure, 50.5% dead code, 47.5% readability, and 18.2% consistency. The top individual findings: commented-out code (47.5%), deep nesting (46.5%), large files (46.5%), cross-file duplication (45.5%), in-file repetition (38.4%), and giant files past 1,200 lines (26.3%).
Two limits worth stating plainly. The study ran the heuristic rules engine only, with the Claude review that runs in every real cleanvibes scan turned off, so anything needing code-reading judgment went uncounted, and scores would likely shift down with it on. And "vibe-coded" means self-described: we took repos at their word.
About this study: 99 public GitHub repos self-described as AI- or vibe-coded, data collected June 10, 2026, scanned by the cleanvibes rules engine. The full methodology is on the how-we-benchmarked page.
Share of the 99 repos with at least one finding, by cleanvibes category (rules engine only, June 2026)
| Category | Repos with ≥1 finding | Most common finding inside it |
|---|---|---|
| Repo hygiene | 60.6% | No .gitignore (23.2% of repos), no README (15.2%) |
| Duplication | 56.6% | Cross-file duplication (45.5% of repos) |
| Structure & size | 52.5% | Files past ~600 lines (46.5% of repos) |
| Dead code & leftovers | 50.5% | Commented-out code (47.5% of repos) |
| Readability & complexity | 47.5% | Deep nesting (46.5% of repos) |
| Consistency & style | 18.2% | Mixed conventions and competing lockfiles |
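With a corpus of 99, each percentage in the table maps to a whole number of repos. The sketch below recovers those counts and checks that they round back to the published figures. The helper names `to_count` and `to_pct` are illustrative, and the counts are derived from the percentages rather than taken from the study's raw data.

```python
TOTAL_REPOS = 99  # corpus size stated in the study

# Category hit rates copied from the table above.
category_pct = {
    "Repo hygiene": 60.6,
    "Duplication": 56.6,
    "Structure & size": 52.5,
    "Dead code & leftovers": 50.5,
    "Readability & complexity": 47.5,
    "Consistency & style": 18.2,
}


def to_count(pct, total=TOTAL_REPOS):
    """Nearest whole number of repos implied by a percentage."""
    return round(pct * total / 100)


def to_pct(count, total=TOTAL_REPOS):
    """Percentage of the corpus, rounded to one decimal like the table."""
    return round(100 * count / total, 1)


category_counts = {name: to_count(p) for name, p in category_pct.items()}

for name, count in category_counts.items():
    # Each derived count should round back to the published percentage.
    print(f"{name}: {count} of {TOTAL_REPOS} repos ({to_pct(count)}%)")
```

Read this way, 60.6% for repo hygiene is 60 repos and 18.2% for consistency is 18.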
Frequently asked
- Does this prove AI-generated code is messy?
- It proves something narrower and more useful: more than half of self-described vibe-coded repos carry duplication, half carry dead code and oversized files, and 12 in 99 grade F — measured by deterministic rules, not opinion. The mess is predictable, which is also why it's catchable.
- Why lead with hit rates when the median grade is an A?
- Because the median is inflated by tiny demo repos — 22 of the 99 have 15 or fewer files and trivially score A. The distribution is bimodal: a clean small half and a genuinely messy tail. Means, the grade spread, and per-category hit rates describe both halves; the median describes neither.
- Was AI used in the scoring?
- No. The study ran the cleanvibes rules engine only. The Claude review that runs in normal scans was deliberately turned off, so every number here is reproducible pattern detection. Real scans add an AI pass on top, which finds more, not less.
- Can I see how my repo compares?
- Yes — paste your GitHub repo link into cleanvibes and you get the same six-category scan, scored 0–100 with a letter grade, in under a minute. The free tier covers about 5 scans a month.
Last updated June 11, 2026