
How we benchmarked 99 vibe-coded repos: corpus, engine, caps, and caveats

the short answer

Our security study scanned 99 public GitHub repos that describe themselves as AI- or vibe-coded (collected June 10, 2026) using only the securevibes heuristic rules engine: AI pass off, reporting capped at 10 findings per rule and 300 per repo, archives fetched in memory with a 90-second timeout. The main caveats are that 22 tiny demo repos inflate the median scores, finding counts are floors because of the caps, and "vibe-coded" means self-described, not verified provenance.

A statistic is only as good as the method behind it, so this page writes the method down: where the repos came from, exactly what ran against them, what was deliberately switched off, and every caveat we know about. If you've seen our numbers quoted — 21.2% of vibe-coded repos with a secrets finding, 27.3% with a critical or high — this is where they come from.

The short version: 99 public GitHub repos self-described as AI- or vibe-coded, data collected June 10, 2026, each scanned by the securevibes rules engine in memory, no AI pass, results aggregated per repo. Nothing was filtered to make the numbers better or worse.

99 repos scanned with the same deterministic rules engine every securevibes scan starts with: fixed thresholds, no AI pass.
Source: securevibes rules-engine study, public self-described AI/vibe-coded GitHub repos, collected June 10, 2026

The corpus: self-described, not curated

We collected public GitHub repos that describe themselves as AI- or vibe-coded — the term appears in their name, description, or README. The search produced 107 candidates; we capped at the first 100 and didn't curate beyond that. One repo was skipped because its archive exceeded 40MB compressed, leaving n = 99 analyzed.

Self-description is the honest framing and the honest limitation: we took repos at their word. Some are substantial applications; some are workshop demos and single-page toys: 22 of the 99 have 15 or fewer scannable files, and 8 have 5 or fewer (the median repo has 54 scannable files, the mean 107). We counted them all, because they genuinely are self-described vibe-coded output, and we report means and distributions alongside medians precisely because the tiny repos, which trivially score A, inflate the median scores.

What ran: the rules engine, with the AI pass off

Each repo's default-branch archive was fetched in memory by securevibes's own fetcher — the same path a normal scan uses, nothing cloned to disk — with a 90-second per-repo timeout. The securevibes rules engine then ran its full check set across the six categories (secrets, injection, auth, data exposure, dependencies, transport) with production thresholds, unmodified for the study.
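The fetch step described above can be sketched roughly as follows. This is not the securevibes fetcher, whose code is not shown on this page; the names `read_capped`, `list_files`, `fetch_archive`, and `ArchiveTooLarge` are hypothetical, the archive is assumed to be a zip, and "40MB" is read as 40 MiB for illustration.

```python
import io
import time
import urllib.request
import zipfile

MAX_ARCHIVE_BYTES = 40 * 1024 * 1024  # ASSUMPTION: "40MB compressed" read as 40 MiB
TIMEOUT_SECONDS = 90                  # per-repo timeout stated in the study


class ArchiveTooLarge(Exception):
    """Raised when the compressed archive exceeds the size cutoff."""


def read_capped(stream, max_bytes=MAX_ARCHIVE_BYTES, deadline_s=TIMEOUT_SECONDS):
    """Copy a binary stream into memory, enforcing a size cap and a deadline."""
    started = time.monotonic()
    buf = io.BytesIO()
    while True:
        # The deadline is checked between chunks, not during a blocking read.
        if time.monotonic() - started > deadline_s:
            raise TimeoutError("archive download exceeded the per-repo deadline")
        block = stream.read(64 * 1024)
        if not block:
            break
        buf.write(block)
        if buf.tell() > max_bytes:
            raise ArchiveTooLarge(f"archive larger than {max_bytes} bytes")
    buf.seek(0)
    return buf


def list_files(archive):
    """Return the file paths inside an in-memory zip archive."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def fetch_archive(url):
    """Download an archive URL into memory; nothing is written to disk."""
    # urlopen's timeout bounds each blocking socket operation; the overall
    # deadline is enforced chunk by chunk inside read_capped.
    with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
        return read_capped(response)
```

Splitting the capped read from the network call keeps the size and timeout logic testable without touching the network; a repo that raises `ArchiveTooLarge` here would correspond to the one skipped repo in the study.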

The Claude review that's part of every real securevibes scan was deliberately switched off, so every number in the study is deterministic and reproducible: same repo in, same findings out. That cuts both ways — anything that needs code-reading judgment went uncounted, so the study's findings are a floor on what a full scan reports, and scores would likely shift down with the AI pass on. Scores use the production model: six weighted category subscores, severity deductions, weighted average, letter grades.
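As a rough illustration of that scoring shape, here is a minimal sketch. The severity deductions are the ones listed in the scoring step under "how it works"; everything else is an assumption made for the example: equal category weights, subscores that start at 100 and floor at 0, and the letter-grade cutoffs. The production weights and cutoffs are not stated on this page.

```python
# Severity deductions as listed in the "how it works" scoring step.
DEDUCTIONS = {"critical": 40, "high": 22, "medium": 10, "low": 4}

CATEGORIES = (
    "secrets", "injection", "auth",
    "data exposure", "dependencies", "transport",
)

# ASSUMPTION: equal weights. The real per-category weights are not published here.
WEIGHTS = {category: 1.0 for category in CATEGORIES}

# ASSUMPTION: hypothetical grade cutoffs, for illustration only.
GRADE_CUTOFFS = ((90, "A"), (80, "B"), (70, "C"), (60, "D"))


def category_subscore(severities):
    """Start at 100 and subtract each finding's deduction, floored at 0 (assumed)."""
    return max(0, 100 - sum(DEDUCTIONS[s] for s in severities))


def repo_score(severities_by_category):
    """Weighted average of the six category subscores."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(
        WEIGHTS[c] * category_subscore(severities_by_category.get(c, []))
        for c in CATEGORIES
    )
    return weighted / total_weight


def letter_grade(score):
    """Map a 0-100 score to a letter using the assumed cutoffs."""
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"
```

Under these assumptions a repo with no findings scores 100, which is why very small repos tend to land on an A: there is little code for any rule to match.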

Caps, floors, and the caveats that matter

The engine caps reporting at 10 findings per rule and 300 per repo. For aggregate stats this matters: per-repo finding counts (mean 4.1, median 2) are floors, not totals, for the messiest repos. Repo-level hit rates — "x% of repos had at least one finding of type y" — are unaffected by the caps, which is one reason we lead with them.
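A small sketch of how such caps behave, and why they leave hit rates alone. The function name `cap_findings`, the finding shape (a dict with a `rule` key), the rule names, and the order in which the two caps are applied are all assumptions for illustration, not the engine's actual implementation.

```python
from collections import Counter

PER_RULE_CAP = 10   # reporting cap per rule, from the study
PER_REPO_CAP = 300  # reporting cap per repo, from the study


def cap_findings(findings, per_rule=PER_RULE_CAP, per_repo=PER_REPO_CAP):
    """Keep at most per_rule findings for each rule and per_repo overall."""
    kept = []
    seen = Counter()
    for finding in findings:
        if len(kept) >= per_repo:
            break  # repo-level cap reached; everything after is unreported
        if seen[finding["rule"]] >= per_rule:
            continue  # this rule already hit its cap
        seen[finding["rule"]] += 1
        kept.append(finding)
    return kept


# Hypothetical repo: one noisy rule with 25 raw matches, one with 3.
raw = [{"rule": "hardcoded-secret"}] * 25 + [{"rule": "plain-http-url"}] * 3
reported = cap_findings(raw)
```

Here 28 raw matches become 13 reported findings, so the count is a floor. Because a cap can never turn a nonzero count into zero, "did this repo have at least one finding" is the same before and after capping.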

The rest of the honest list: this is heuristic static analysis, not a pentest. There is no code execution and no vulnerability database, so the study measures pattern-detectable issues only. "Vibe-coded" is self-description, not verified provenance. And the corpus is one snapshot, collected June 10, 2026; repos change, and a different month's corpus would produce somewhat different numbers. We'd rather you know all of this and trust the numbers that survive it: the hit rates are clean, the grade distribution is real, and every figure traces to a per-repo result row.

how it works

  1. Collect candidates

    Search public GitHub for repos that self-describe as AI- or vibe-coded in name, description, or README. 107 candidates found, capped at the first 100, no further curation.

  2. Fetch each repo the way a scan does

    Default-branch archive, fetched in memory by securevibes's own fetcher (nothing written to disk) with a 90-second per-repo timeout. One repo skipped for size (>40MB compressed), leaving n = 99.

  3. Run the rules engine, unmodified

    The production securevibes check set across all six categories, production thresholds and severity model, AI pass off, so every result is deterministic and reproducible.

  4. Score with the production model

    Six weighted category subscores, severity deductions (critical −40, high −22, medium −10, low −4), weighted average, letter grades: identical to what a real scan reports.

  5. Aggregate per repo, report distributions

    Hit rates as share-of-repos-with-≥1-finding (immune to the per-rule caps), plus means, medians, and the full grade distribution, with the tiny-repo inflation of the medians stated rather than hidden.
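The aggregation in the last step can be sketched as below. The per-repo result shape (a mapping from repo name to a list of finding dicts with `category` and `severity` keys) and the function names are assumptions for illustration, and the four-repo corpus is toy data, not study data.

```python
from statistics import mean, median


def hit_rate(findings_by_repo, matches):
    """Share of repos with at least one finding for which matches(finding) is true."""
    repos = list(findings_by_repo.values())
    hits = sum(1 for findings in repos if any(matches(f) for f in findings))
    return hits / len(repos)


def count_summary(findings_by_repo):
    """Mean and median of per-repo (capped) finding counts."""
    counts = [len(findings) for findings in findings_by_repo.values()]
    return {"mean": mean(counts), "median": median(counts)}


# Toy corpus with hypothetical repo names; NOT data from the study.
toy = {
    "repo-a": [{"category": "secrets", "severity": "critical"}],
    "repo-b": [],
    "repo-c": [
        {"category": "injection", "severity": "low"},
        {"category": "secrets", "severity": "high"},
    ],
    "repo-d": [{"category": "transport", "severity": "medium"}],
}

secrets_rate = hit_rate(toy, lambda f: f["category"] == "secrets")
serious_rate = hit_rate(toy, lambda f: f["severity"] in ("critical", "high"))
summary = count_summary(toy)
```

As a consistency check on the headline figures, 21 of 99 repos rounds to 21.2% and 27 of 99 rounds to 27.3%; those underlying counts are inferred from the percentages rather than quoted from the study.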

frequently asked

Why publish the methodology at all?
Because the numbers are only worth citing if you can see how they were made. Corpus selection, what ran, what was off, and where the floors are — it's all here, and the headline stats survive the caveats.
Why was the AI pass switched off?
Reproducibility. Rules-only results are deterministic — same repo in, same findings out — which makes the study auditable. It also makes the numbers conservative: a full securevibes scan adds a Claude review on top, which finds more, not less.
Did you filter out the tiny demo repos?
No. They self-describe as vibe-coded, so they're legitimately in the population — but we flag that 22 of 99 have ≤15 files and trivially score A, which inflates the medians. That's why the report leads with means, grade distribution, and per-category hit rates instead.
Will you re-run the study?
The setup makes re-runs cheap (same engine, fresh corpus), so a periodic re-run is the plan. The June 10, 2026 snapshot is version one; if the numbers move, the comparison gets its own write-up.

Last updated June 11, 2026
