How we benchmarked 99 vibe-coded repos for code quality

The corpus: self-described, not curated

We collected public GitHub repos that describe themselves as AI- or vibe-coded: the term appears in the repo's name, description, or README. The search produced 107 candidates; we kept the first 100 and did no further curation. One repo was skipped because its archive exceeded 40 MB compressed, leaving n = 99 analyzed.
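
The selection rule is mechanical, so here is a minimal sketch of it. This is an illustration under assumptions, not the study's collection script: `Candidate`, `select_corpus`, and the `archive_mb` field are hypothetical names, and the archive sizes are made up purely to reproduce the 107 → 100 → 99 counts.

```python
from dataclasses import dataclass

MAX_CANDIDATES = 100   # cap stated in the write-up
MAX_ARCHIVE_MB = 40    # compressed-size limit stated in the write-up


@dataclass
class Candidate:
    name: str
    archive_mb: float  # hypothetical field: compressed archive size


def select_corpus(candidates):
    """Cap at the first 100, then skip oversized archives. No other curation."""
    capped = candidates[:MAX_CANDIDATES]
    analyzed = [c for c in capped if c.archive_mb <= MAX_ARCHIVE_MB]
    skipped = [c for c in capped if c.archive_mb > MAX_ARCHIVE_MB]
    return analyzed, skipped


# Made-up sizes: 107 candidates, one oversized archive among the first 100
found = [Candidate(f"repo-{i}", 2.0) for i in range(107)]
found[17].archive_mb = 55.0

analyzed, skipped = select_corpus(found)
print(len(analyzed), len(skipped))
```

The order matters for the arithmetic: the cap is applied before the size check, which is why one oversized archive leaves 99 rather than 100.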

Self-description is the honest framing and the honest limitation: we took repos at their word. Some are substantial applications; some are workshop demos and single-page toys — 22 of the 99 have 15 or fewer scannable files, and 8 have 5 or fewer (the median repo has 54 scannable files, the mean 107). We counted them all, because they genuinely are self-described vibe-coded output, and we report means and grade distributions alongside medians precisely because tiny repos trivially score A and inflate the medians.

What ran: the rules engine, with the AI pass off

Each repo's default-branch archive was streamed in memory by cleanvibes's own fetcher — the same path a normal scan uses, nothing written to disk — with a 90-second per-repo timeout. The cleanvibes rules engine then ran its full check set across the six categories (structure & size, readability & complexity, duplication, dead code & leftovers, consistency & style, repo hygiene) with production thresholds, unmodified for the study — including the 8-line normalized-window hashing that detects duplication.
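
For readers unfamiliar with window hashing, a rough sketch of the general technique follows. Only the 8-line window size comes from the study. The normalization (collapse whitespace, drop blank lines), the choice of SHA-1, and the function names are assumptions for illustration; the production engine's normalization rules are not described here.

```python
import hashlib
from collections import defaultdict

WINDOW = 8  # window size stated in the write-up


def normalize(line: str) -> str:
    # Assumed normalization: collapse all runs of whitespace
    return " ".join(line.split())


def window_hashes(text: str):
    """Yield (start, digest) for every run of WINDOW consecutive normalized lines.

    `start` indexes the normalized, non-blank lines, not the original file lines.
    """
    lines = [n for n in (normalize(raw) for raw in text.splitlines()) if n]
    for i in range(len(lines) - WINDOW + 1):
        chunk = "\n".join(lines[i:i + WINDOW])
        yield i, hashlib.sha1(chunk.encode("utf-8")).hexdigest()


def find_duplicates(files: dict) -> dict:
    """Map each digest seen in more than one place to its (path, start) locations."""
    seen = defaultdict(list)
    for path, text in files.items():
        for start, digest in window_hashes(text):
            seen[digest].append((path, start))
    return {d: locs for d, locs in seen.items() if len(locs) > 1}


# Same eight statements, but b.py is indented and double-spaced
block = "\n".join(f"x{i} = compute({i})" for i in range(8))
files = {"a.py": block, "b.py": "    " + block.replace("\n", "\n\n    ")}
dups = find_duplicates(files)
```

Because hashing happens after normalization, the two files above collide even though their raw text differs, which is the point of normalizing first.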

The Claude review that's part of every real cleanvibes scan was deliberately switched off, so every number in the study is deterministic and reproducible: same repo in, same findings out. That cuts both ways — anything that needs code-reading judgment went uncounted, so the study's findings are a floor on what a full scan reports, and scores would likely shift down with the AI pass on. Scores use the production model: six weighted category subscores, severity deductions, weighted average, letter grades.
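
To make the shape of the scoring model concrete, here is a hedged sketch. The six category names and the severity deductions (critical 40, high 22, medium 10, low 4) come from this write-up. The equal weights, the 100-point starting subscore floored at 0, and the letter-grade cutoffs are placeholders we made up for the example, not the production values.

```python
DEDUCTIONS = {"critical": 40, "high": 22, "medium": 10, "low": 4}  # from the write-up

CATEGORIES = [
    "structure & size", "readability & complexity", "duplication",
    "dead code & leftovers", "consistency & style", "repo hygiene",
]

# Hypothetical: equal weights (the production weights are not listed here)
WEIGHTS = {category: 1.0 for category in CATEGORIES}

# Hypothetical grade cutoffs, checked from highest to lowest
GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def subscore(severities):
    # Assumed: each category starts at 100, loses points per finding, floors at 0
    return max(0, 100 - sum(DEDUCTIONS[s] for s in severities))


def score_repo(findings):
    """findings maps a category name to a list of severity strings."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[c] * subscore(findings.get(c, [])) for c in CATEGORIES)
    score = weighted / total_weight
    grade = next((g for cutoff, g in GRADE_CUTOFFS if score >= cutoff), "F")
    return score, grade


score, grade = score_repo({"duplication": ["high", "medium"], "repo hygiene": ["low"]})
print(score, grade)
```

Under these placeholder weights, a repo with no findings at all scores 100, which mirrors why very small repos land on A so easily.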

Caps, floors, and the caveats that matter

The engine caps reporting at 10 findings per rule and 300 per repo. For aggregate stats this matters: per-repo finding counts (mean 16.3, median 5) are floors, not totals, for the messiest repos — a repo with thirty 600-line files reports ten of them. Repo-level hit rates — "x% of repos had at least one finding of type y" — are unaffected by the caps, which is one reason we lead with them.
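
A small sketch shows why the caps distort counts but not hit rates. The two cap values are the ones stated above; the rule id `file-too-long`, the function names, and the order in which the caps are applied are assumptions for illustration.

```python
from collections import Counter

PER_RULE_CAP = 10    # from the write-up
PER_REPO_CAP = 300   # from the write-up


def apply_caps(findings):
    """findings: one rule id per raw finding, in the order the engine saw them."""
    per_rule = Counter()
    reported = []
    for rule in findings:
        if len(reported) >= PER_REPO_CAP:
            break
        if per_rule[rule] < PER_RULE_CAP:
            per_rule[rule] += 1
            reported.append(rule)
    return reported


def hit_rate(repos, rule):
    """Share of repos with at least one finding of `rule`."""
    return sum(1 for findings in repos if rule in findings) / len(repos)


# The example from the text: one repo with thirty oversized files
raw = {"messy": ["file-too-long"] * 30, "small": ["file-too-long"], "clean": []}
capped = {name: apply_caps(findings) for name, findings in raw.items()}

raw_rate = hit_rate(list(raw.values()), "file-too-long")
capped_rate = hit_rate(list(capped.values()), "file-too-long")
```

The messy repo drops from 30 raw findings to 10 reported, yet the hit rate is two of three repos either way, because a cap can shrink a count but never to zero.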

The rest of the honest list: this measures cleanliness, not correctness. No code was executed and no tests were run, so a clean grade means a tidy repo, not a verified one. "Vibe-coded" is self-description, not verified provenance. And the corpus is one snapshot, collected June 10, 2026; repos change, and a different month's corpus would produce somewhat different numbers. We'd rather you know all of this and trust the numbers that survive it: the hit rates are unaffected by the caps, the bimodal grade distribution is real, and every figure traces to a per-repo result row.

How it works

01
Collect candidates
Search public GitHub for repos that self-describe as AI- or vibe-coded in name, description, or README. 107 candidates found, capped at the first 100, no further curation.
02
Fetch each repo the way a scan does
Default-branch archive, streamed in memory by cleanvibes's own fetcher — nothing written to disk — with a 90-second per-repo timeout. One repo skipped for size (>40MB compressed), leaving n = 99.
03
Run the rules engine, unmodified
The production cleanvibes check set across all six categories, production thresholds and severity model, AI pass off — so every result is deterministic and reproducible.
04
Score with the production model
Six weighted category subscores, severity deductions (critical −40, high −22, medium −10, low −4), weighted average, letter grades — identical to what a real scan reports.
05
Aggregate per repo, report distributions
Hit rates as share-of-repos-with-≥1-finding (immune to the per-rule caps), plus means, medians, and the full grade distribution — with the tiny-repo inflation of the medians stated rather than hidden.
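
The aggregation step can be sketched in a few lines. The rows below are invented toy values, not study data, and `summarize` is a hypothetical helper; the point is only to show how a cluster of near-perfect small repos pulls the median away from the mean.

```python
from collections import Counter
from statistics import mean, median

# Made-up per-repo rows (NOT study data): (grade, score, finding_count)
rows = [("A", 98, 0), ("A", 96, 1), ("A", 95, 0), ("C", 74, 22), ("D", 61, 57)]


def summarize(rows):
    scores = [score for _, score, _ in rows]
    counts = [count for _, _, count in rows]
    return {
        "grade_distribution": dict(Counter(grade for grade, _, _ in rows)),
        "mean_score": mean(scores),
        "median_score": median(scores),
        "mean_findings": mean(counts),
        "median_findings": median(counts),
        # Repo-level hit rate: share of repos with at least one finding
        "share_with_findings": sum(1 for count in counts if count > 0) / len(rows),
    }


summary = summarize(rows)
```

In this toy set the median score sits well above the mean and the median finding count far below the mean, which is the same pattern the report guards against by publishing means and the full grade distribution next to the medians.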

Frequently asked

Why publish the methodology at all?

Because the numbers are only worth citing if you can see how they were made. Corpus selection, what ran, what was off, and where the floors are: it's all here, and the headline stats survive the caveats.

Why was the AI pass switched off?

Reproducibility. Rules-only results are deterministic (same repo in, same findings out), which makes the study auditable. It also makes the numbers conservative: a full cleanvibes scan adds a Claude review on top, which finds more, not less.

Did you filter out the tiny demo repos?

No. They self-describe as vibe-coded, so they're legitimately in the population. We do flag that 22 of 99 have ≤15 files and trivially score A, which inflates the medians and creates the bimodal shape. That's why the report leads with means, the grade distribution, and per-category hit rates instead.

Will you re-run the study?

The setup makes re-runs cheap (same engine, fresh corpus), so a periodic re-run is the plan. The June 10, 2026 snapshot is version one; if the numbers move, the comparison gets its own write-up.

Last updated June 11, 2026
