The state of vibe-coded code quality: we scanned 99 self-described AI-built repos

the short answer

We ran the cleanvibes heuristic rules engine (no AI pass) over 99 public GitHub repos that describe themselves as AI- or vibe-coded, collected June 10, 2026. 56.6% had at least one duplication finding, 50.5% had dead-code findings, and 51.5% had at least one file past the ~600-line threshold. The median repo scored 92 (an A), but the distribution is bimodal: the mean is 83.4 and 12 repos graded F. The median is inflated by tiny demo repos, and the messy tail is genuinely messy.

Everyone has an opinion about what AI-generated code looks like inside. We wanted numbers instead, so on June 10, 2026 we collected 99 public GitHub repos that describe themselves as AI- or vibe-coded and ran every one through the cleanvibes rules engine. These are the same deterministic checks every scan starts with, run with the Claude review switched off so the results are pure, repeatable pattern detection.

The result is a portrait with two halves. Most of the corpus scores well — partly because much of it is small. But the hit rates are striking: more than half of all repos carry duplicated blocks, half carry dead code, half have at least one file past the size threshold — and 12 of the 99 graded F. This page is the report, including where the dataset flatters itself and how to read it honestly.

56.6% of 99 vibe-coded repos had at least one duplication finding: the same logic written more than once
Source: cleanvibes rules-engine study, 99 public self-described AI/vibe-coded GitHub repos, collected June 10, 2026

The headline numbers: duplication, dead code, and size

56.6% of the 99 repos had at least one duplication finding, and cross-file duplication alone — the same block living in two or more files — hit 45.5% of repos. 50.5% had dead-code findings, with commented-out code the single most common finding in the entire study at 47.5% of repos. And 51.5% had at least one file flagged for size past the ~600-line threshold, averaging 3.4 size-flagged files per affected repo.
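For readers who want to work with these figures, here is a minimal sketch of the arithmetic behind a hit rate. The study publishes percentages, not raw counts, so the repo counts below are inferred by rounding and should be treated as an assumption; the names `hit_rate` and `implied_count` are illustrative and are not part of cleanvibes.

```python
TOTAL_REPOS = 99  # corpus size stated in the study

# Published hit rates: percent of repos with at least one finding.
PUBLISHED = {
    "duplication": 56.6,
    "cross_file_duplication": 45.5,
    "dead_code": 50.5,
    "commented_out_code": 47.5,
    "size_flagged": 51.5,
}


def hit_rate(repos_with_finding: int, total: int = TOTAL_REPOS) -> float:
    """Share of repos with at least one finding, as a percentage to one decimal."""
    return round(100 * repos_with_finding / total, 1)


def implied_count(percent: float, total: int = TOTAL_REPOS) -> int:
    """Nearest whole number of repos that a published percentage corresponds to."""
    return round(percent / 100 * total)


# Inferred counts (an assumption: the study does not list them directly).
implied = {name: implied_count(pct) for name, pct in PUBLISHED.items()}

# Sanity check: each inferred count rounds back to the published percentage.
for name, count in implied.items():
    assert hit_rate(count) == PUBLISHED[name], name
```

A hit rate counts a repo once no matter how many findings it has, which is why it is a different measure from findings per repo.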

These three are the signature of how AI tools write code: generating a fresh copy is cheaper than finding the existing one, abandoned approaches get commented out rather than deleted, and the current file grows forever because adding to it always works. None of it breaks the app — which is exactly why it ships, and why it shows up in half the corpus.
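To make "deterministic pattern detection" concrete, here is a toy sketch of two such checks: a file-size flag and a cross-file duplicate-block finder. This is not the cleanvibes implementation, which the study does not describe. The 600 and 1,200 line thresholds come from the study, while the six-line comparison window and both function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

LARGE_FILE_LINES = 600   # approximate size threshold named in the study
GIANT_FILE_LINES = 1200  # "giant file" threshold named in the study
WINDOW = 6               # hypothetical: consecutive lines compared per block


def size_flag(source: str) -> Optional[str]:
    """Return 'giant', 'large', or None based on the file's line count."""
    line_count = len(source.splitlines())
    if line_count > GIANT_FILE_LINES:
        return "giant"
    if line_count > LARGE_FILE_LINES:
        return "large"
    return None


def cross_file_duplicates(
    files: Dict[str, str], window: int = WINDOW
) -> Dict[Tuple[str, ...], Set[str]]:
    """Map each block of `window` consecutive non-blank lines to the files
    that contain it, keeping only blocks found in two or more files."""
    seen: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for path, source in files.items():
        # Normalize by stripping whitespace and dropping blank lines.
        lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
        for i in range(len(lines) - window + 1):
            seen[tuple(lines[i:i + window])].add(path)
    return {block: paths for block, paths in seen.items() if len(paths) >= 2}


# Usage: the same six-line helper pasted into two files.
shared = "\n".join([
    "def total(items):",
    "result = 0",
    "for item in items:",
    "result += item.price",
    "result = round(result, 2)",
    "return result",
])
demo_files = {"cart.py": shared + "\nx = 1", "invoice.py": "y = 2\n" + shared}
duplicates = cross_file_duplicates(demo_files)
```

Because the checks are plain string comparisons and line counts, the same input always produces the same findings, which is the property the study relies on.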

A bimodal distribution: the median A and the 12 Fs are both true

The median repo scored 92, an A — and 12 of the 99 repos graded F, with the lowest score at 30. Both numbers are real; the distribution is bimodal. The median is inflated by tiny repos: 22 of the 99 have 15 or fewer scannable files (workshop demos, single-page toys, docs-heavy repos), and a tiny repo trivially scores an A because there's almost nothing to flag. They're genuinely self-described vibe-coded output, so they were counted, not filtered.

The honest read is the mean and the spread: mean score 83.4, with grades A 54, B 16, C 12, D 5, F 12. That's 45 repos below A and a heavy tail: 35.4% of repos had at least one critical or high finding. Findings per repo tell the same two-population story: a mean of 16.3 against a median of 5. Because per-rule reporting caps at 10 findings, the messiest repos are undercounted, so that mean is a floor rather than an exact figure.
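The grade arithmetic can be checked in a few lines. This sketch uses only the grade counts published above and assumes grades are ordered by score, with A highest. It shows why a median of A and 12 Fs coexist: the median is the 50th of 99 repos, and 54 of them are A. The second function illustrates the per-rule cap with hypothetical per-rule counts that are not from the study.

```python
GRADES = {"A": 54, "B": 16, "C": 12, "D": 5, "F": 12}  # counts from the study

total = sum(GRADES.values())     # size of the corpus
below_a = total - GRADES["A"]    # repos graded B through F
median_rank = (total + 1) // 2   # middle repo when sorted best-first


def grade_at_rank(rank: int) -> str:
    """Grade of the repo at a 1-based rank, best score first."""
    seen = 0
    for grade in "ABCDF":
        seen += GRADES[grade]
        if rank <= seen:
            return grade
    raise ValueError("rank exceeds corpus size")


median_grade = grade_at_rank(median_rank)


def reported_findings(true_counts_per_rule, cap=10):
    """Findings reported for one repo when each rule reports at most `cap`."""
    return sum(min(count, cap) for count in true_counts_per_rule)


# Hypothetical repo: three rules with 3, 25, and 40 true findings.
hypothetical_true = [3, 25, 40]
reported = reported_findings(hypothetical_true)
```

In the hypothetical repo the cap reports 23 findings against 68 actual ones, so any average built from capped counts can only understate the messy tail.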

Where the mess concentrates, and what we didn't measure

By category, 60.6% of repos had at least one hygiene finding, 56.6% duplication, 52.5% structure, 50.5% dead code, 47.5% readability, and 18.2% consistency. The top individual findings: commented-out code (47.5%), deep nesting (46.5%), large files (46.5%), cross-file duplication (45.5%), in-file repetition (38.4%), and giant files past 1,200 lines (26.3%).

Two limits are worth stating plainly. First, the study ran the heuristic rules engine only. The Claude review that is part of every real cleanvibes scan was off, so anything needing code-reading judgment went uncounted, and scores would likely shift down with it on. Second, "vibe-coded" means self-described: we took repos at their word.

About this study: 99 public GitHub repos self-described as AI- or vibe-coded, data collected June 10, 2026, scanned by the cleanvibes rules engine. The full methodology is on the how-we-benchmarked page.

Share of the 99 repos with at least one finding, by cleanvibes category (rules engine only, June 2026)

| Category | Repos with ≥1 finding | Most common finding inside it |
| --- | --- | --- |
| Repo hygiene | 60.6% | No .gitignore (23.2% of repos), no README (15.2%) |
| Duplication | 56.6% | Cross-file duplication (45.5% of repos) |
| Structure & size | 52.5% | Files past ~600 lines (46.5% of repos) |
| Dead code & leftovers | 50.5% | Commented-out code (47.5% of repos) |
| Readability & complexity | 47.5% | Deep nesting (46.5% of repos) |
| Consistency & style | 18.2% | Mixed conventions and competing lockfiles |

frequently asked

Does this prove AI-generated code is messy?
It proves something narrower and more useful: more than half of self-described vibe-coded repos carry duplication, half carry dead code and oversized files, and 12 in 99 grade F — measured by deterministic rules, not opinion. The mess is predictable, which is also why it's catchable.
Why lead with hit rates when the median grade is an A?
Because the median is inflated by tiny demo repos — 22 of the 99 have 15 or fewer files and trivially score A. The distribution is bimodal: a clean small half and a genuinely messy tail. Means, the grade spread, and per-category hit rates describe both halves; the median describes neither.
Was AI used in the scoring?
No. The study ran the cleanvibes rules engine only. The Claude review that runs in normal scans was deliberately switched off, so every number here is reproducible pattern detection. Real scans add an AI pass on top, which finds more, not less.
Can I see how my repo compares?
Yes — paste your GitHub repo link into cleanvibes and you get the same six-category scan, scored 0–100 with a letter grade, in under a minute. The free tier covers about 5 scans a month.

Last updated June 11, 2026
