
How we benchmarked 99 vibe-coded repos: corpus, engine, caps, and caveats

the short answer

Our code-quality study scanned 99 public GitHub repos that describe themselves as AI- or vibe-coded, collected June 10, 2026. It used only the cleanvibes heuristic rules engine: AI pass off, reporting capped at 10 findings per rule and 300 per repo, archives streamed in memory with a 90-second timeout. The main caveats are that 22 tiny demo repos inflate the medians, finding counts are floors because of the caps, and "vibe-coded" means self-described, not verified provenance.

A statistic is only as good as the method behind it, so this page writes the method down: where the repos came from, exactly what ran against them, what was deliberately switched off, and every caveat we know about. If you've seen our numbers quoted — 56.6% of vibe-coded repos with duplication findings, 12 of 99 grading F — this is where they come from.

The short version: 99 public GitHub repos self-described as AI- or vibe-coded, data collected June 10, 2026, each scanned in memory by the cleanvibes rules engine, no AI pass, results aggregated per repo. Nothing was filtered to make the numbers better or worse.

99 repos scanned with the same deterministic rules engine every cleanvibes scan starts with: fixed thresholds, no AI pass.

Source: cleanvibes rules-engine study, public self-described AI/vibe-coded GitHub repos, collected June 10, 2026

The corpus: self-described, not curated

We collected public GitHub repos that describe themselves as AI- or vibe-coded — the term appears in their name, description, or README. The search produced 107 candidates; we capped at the first 100 and didn't curate beyond that. One repo was skipped because its archive exceeded 40MB compressed, leaving n = 99 analyzed.
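The arithmetic from 107 candidates to n = 99 can be sketched as below. This is a minimal illustration, not the study's collection script: the `Candidate` fields, the placeholder repo names, and the reading of "40MB" as 40 MiB are all assumptions.

```python
from dataclasses import dataclass

CORPUS_CAP = 100                      # from the text: first 100 candidates
MAX_ARCHIVE_BYTES = 40 * 1024 * 1024  # assumption: "40MB" read as 40 MiB


@dataclass(frozen=True)
class Candidate:
    full_name: str      # hypothetical field, e.g. "owner/repo"
    archive_bytes: int  # hypothetical field: compressed archive size


def select_corpus(candidates):
    """Cap at the first CORPUS_CAP candidates, then split off oversized archives."""
    capped = candidates[:CORPUS_CAP]
    analyzed = [c for c in capped if c.archive_bytes <= MAX_ARCHIVE_BYTES]
    skipped = [c for c in capped if c.archive_bytes > MAX_ARCHIVE_BYTES]
    return analyzed, skipped


# Placeholder data shaped like the study's counts: 107 candidates, one oversized.
candidates = [Candidate(f"owner/repo-{i}", 1_000_000) for i in range(107)]
candidates[41] = Candidate("owner/repo-41", 55 * 1024 * 1024)

analyzed, skipped = select_corpus(candidates)
print(len(analyzed), len(skipped))  # 99 1
```

The cap is applied before the size check, which matches the order described above: the skipped repo is not replaced by candidate 101.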

Self-description is the honest framing and the honest limitation: we took repos at their word. Some are substantial applications; some are workshop demos and single-page toys — 22 of the 99 have 15 or fewer scannable files, and 8 have 5 or fewer (the median repo has 54 scannable files, the mean 107). We counted them all, because they genuinely are self-described vibe-coded output, and we report means and grade distributions alongside medians precisely because tiny repos trivially score A and inflate the medians.

What ran: the rules engine, with the AI pass off

Each repo's default-branch archive was streamed in memory by cleanvibes's own fetcher — the same path a normal scan uses, nothing written to disk — with a 90-second per-repo timeout. The cleanvibes rules engine then ran its full check set across the six categories (structure & size, readability & complexity, duplication, dead code & leftovers, consistency & style, repo hygiene) with production thresholds, unmodified for the study — including the 8-line normalized-window hashing that detects duplication.
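To make the duplication check concrete, here is a minimal sketch of windowed hashing over normalized lines. Only the 8-line window comes from the text. The normalization (collapsing whitespace, dropping blank lines), the SHA-256 digest, and the function names are assumptions for illustration, not the cleanvibes implementation.

```python
import hashlib
from collections import defaultdict

WINDOW = 8  # from the text: 8-line windows


def normalize(line: str) -> str:
    """Assumed normalization: collapse every whitespace run to one space."""
    return " ".join(line.split())


def window_hashes(text: str, window: int = WINDOW):
    """Yield (start_line, digest) for each window of non-blank normalized lines."""
    lines = [(i + 1, normalize(raw)) for i, raw in enumerate(text.splitlines())]
    lines = [(n, s) for n, s in lines if s]  # drop blank lines, keep line numbers
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(s for _, s in lines[i:i + window])
        yield lines[i][0], hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def find_duplicates(files: dict, window: int = WINDOW):
    """Map digest -> [(path, start_line), ...] for windows seen more than once."""
    seen = defaultdict(list)
    for path, text in files.items():
        for start, digest in window_hashes(text, window):
            seen[digest].append((path, start))
    return {d: locs for d, locs in seen.items() if len(locs) > 1}


# The same 8 lines, once at top level and once indented inside a function.
block = "\n".join(f"x{i} = compute({i})" for i in range(8))
indented = "def f():\n" + "\n".join("    " + line for line in block.splitlines())
files = {"a.py": block + "\n", "b.py": indented + "\n"}

dups = find_duplicates(files)
print(list(dups.values()))  # [[('a.py', 1), ('b.py', 2)]]
```

Because lines are normalized before hashing, the indented copy in `b.py` still matches the top-level copy in `a.py`.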

The Claude review that's part of every real cleanvibes scan was deliberately switched off, so every number in the study is deterministic and reproducible: same repo in, same findings out. That cuts both ways — anything that needs code-reading judgment went uncounted, so the study's findings are a floor on what a full scan reports, and scores would likely shift down with the AI pass on. Scores use the production model: six weighted category subscores, severity deductions, weighted average, letter grades.
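The shape of that scoring model can be sketched as follows. The severity deductions are the ones this page lists in its scoring step (critical −40, high −22, medium −10, low −4). Everything else is an assumption made for illustration: equal category weights, subscores that start at 100 and floor at 0, and the 90/80/70/60 grade cutoffs.

```python
# From this page: severity deductions used by the production model.
DEDUCTIONS = {"critical": 40, "high": 22, "medium": 10, "low": 4}

CATEGORIES = [
    "structure & size",
    "readability & complexity",
    "duplication",
    "dead code & leftovers",
    "consistency & style",
    "repo hygiene",
]

# Assumption: equal weights. The real weights are not stated on this page.
WEIGHTS = {category: 1.0 for category in CATEGORIES}

# Assumption: conventional letter-grade cutoffs, not confirmed by the source.
GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def subscore(severities):
    """Assumed form: start at 100, subtract each deduction, floor at 0."""
    return max(0, 100 - sum(DEDUCTIONS[s] for s in severities))


def repo_score(findings):
    """findings maps category -> list of severity strings; returns 0..100."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[c] * subscore(findings.get(c, [])) for c in CATEGORIES)
    return weighted / total_weight


def grade(score):
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


findings = {
    "duplication": ["high", "medium"],  # subscore 100 - 32 = 68
    "structure & size": ["critical"],   # subscore 100 - 40 = 60
}
score = repo_score(findings)
print(score, grade(score))  # 88.0 B
```

With no findings in any category, the sketch returns 100 and an A, which is the mechanism by which very small repos end up at the top of the grade distribution.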

Caps, floors, and the caveats that matter

The engine caps reporting at 10 findings per rule and 300 per repo. For aggregate stats this matters: per-repo finding counts (mean 16.3, median 5) are floors, not totals, for the messiest repos — a repo with thirty 600-line files reports ten of them. Repo-level hit rates — "x% of repos had at least one finding of type y" — are unaffected by the caps, which is one reason we lead with them.
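A short sketch shows why the caps turn counts into floors while leaving hit rates alone. The two cap values come from the text. The rule names, the input format, and the order in which the caps are applied are assumptions.

```python
from collections import Counter

PER_RULE_CAP = 10   # from the text
PER_REPO_CAP = 300  # from the text


def apply_caps(raw_findings):
    """raw_findings: rule ids in detection order. Returns the reported subset."""
    per_rule = Counter()
    reported = []
    for rule in raw_findings:
        if len(reported) >= PER_REPO_CAP:
            break  # repo-level cap reached
        if per_rule[rule] >= PER_RULE_CAP:
            continue  # this rule already reported its maximum
        per_rule[rule] += 1
        reported.append(rule)
    return reported


def hit_rate(repos, rule):
    """Share of repos with at least one finding for `rule`."""
    hits = sum(1 for found in repos.values() if rule in found)
    return hits / len(repos)


# Hypothetical repo: thirty oversized files and three leftover TODO comments.
raw = ["large-file"] * 30 + ["todo-comment"] * 3
reported = apply_caps(raw)
print(Counter(reported))  # Counter({'large-file': 10, 'todo-comment': 3})

raw_repos = {"messy": raw, "tidy": []}
capped_repos = {name: apply_caps(found) for name, found in raw_repos.items()}
print(hit_rate(raw_repos, "large-file"), hit_rate(capped_repos, "large-file"))  # 0.5 0.5
```

The reported count drops from 33 to 13, but the repo still counts once toward the hit rate either way, because a cap of 10 never reduces "at least one" to zero.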

The rest of the honest list: this measures cleanliness, not correctness. No code was executed and no tests were run, so a clean grade means a tidy repo, not a verified one. "Vibe-coded" is self-description, not verified provenance. And the corpus is one snapshot, collected June 10, 2026; repos change, and a different month's corpus would produce somewhat different numbers. We'd rather you know all of this and trust the numbers that survive it: the hit rates are unaffected by the caps, the bimodal grade distribution is real, and every figure traces to a per-repo result row.

how it works

  1. Collect candidates

    Search public GitHub for repos that self-describe as AI- or vibe-coded in name, description, or README. 107 candidates found, capped at the first 100, no further curation.

  2. Fetch each repo the way a scan does

    Default-branch archive, streamed in memory by cleanvibes's own fetcher (nothing written to disk), with a 90-second per-repo timeout. One repo skipped for size (>40MB compressed), leaving n = 99.

  3. Run the rules engine, unmodified

    The production cleanvibes check set across all six categories, production thresholds and severity model, AI pass off, so every result is deterministic and reproducible.

  4. Score with the production model

    Six weighted category subscores, severity deductions (critical −40, high −22, medium −10, low −4), weighted average, letter grades: identical to what a real scan reports.

  5. Aggregate per repo, report distributions

    Hit rates as share-of-repos-with-≥1-finding (immune to the per-rule caps), plus means, medians, and the full grade distribution, with the tiny-repo inflation of the medians stated rather than hidden.
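The fetch step in the list above (in memory, size-limited, time-limited) might look roughly like this. It is a sketch under stated assumptions, not the cleanvibes fetcher: the helper names, the gzip-compressed tar format, and the reading of "40MB" as 40 MiB are all hypothetical. It reads from any binary file-like object, so the demo uses an archive built in memory instead of a network call.

```python
import io
import tarfile
import time

MAX_ARCHIVE_BYTES = 40 * 1024 * 1024  # assumption: "40MB" read as 40 MiB
TIMEOUT_SECONDS = 90                  # from the text: per-repo timeout


class ArchiveTooLarge(Exception):
    """Raised when the compressed archive exceeds the size limit."""


def read_capped(stream, max_bytes=MAX_ARCHIVE_BYTES,
                timeout=TIMEOUT_SECONDS, chunk_size=64 * 1024):
    """Read a binary stream fully into memory, enforcing size and time budgets."""
    start = time.monotonic()
    buf = io.BytesIO()
    while True:
        if time.monotonic() - start > timeout:
            raise TimeoutError("archive read exceeded the time budget")
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.write(chunk)
        if buf.tell() > max_bytes:
            raise ArchiveTooLarge(f"archive larger than {max_bytes} bytes")
    return buf.getvalue()


def list_files(archive_bytes):
    """List regular files in a gzip-compressed tar archive without touching disk."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]


def build_demo_archive():
    """Build a one-file tar.gz in memory so the sketch runs offline."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        payload = b"print('hi')\n"
        info = tarfile.TarInfo("repo-main/app.py")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return out.getvalue()


data = read_capped(io.BytesIO(build_demo_archive()))
print(list_files(data))  # ['repo-main/app.py']
```

The elapsed-time check only runs between reads, so a real network fetcher would also need a socket-level timeout to interrupt a read that stalls.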

frequently asked

Why publish the methodology at all?
Because the numbers are only worth citing if you can see how they were made. Corpus selection, what ran, what was off, and where the floors are: it's all here, and the headline stats survive the caveats.

Why was the AI pass switched off?
Reproducibility. Rules-only results are deterministic (same repo in, same findings out), which makes the study auditable. It also makes the numbers conservative: a full cleanvibes scan adds a Claude review on top, which finds more, not less.

Did you filter out the tiny demo repos?
No. They self-describe as vibe-coded, so they're legitimately in the population. We do flag that 22 of 99 have ≤15 files and trivially score A, which inflates the medians and creates the bimodal shape. That's why the report leads with means, the grade distribution, and per-category hit rates instead.

Will you re-run the study?
The setup makes re-runs cheap (same engine, fresh corpus), so a periodic re-run is the plan. The June 10, 2026 snapshot is version one; if the numbers move, the comparison gets its own write-up.

Last updated June 11, 2026
