use case

How much code duplication is too much — and which copies actually matter

the short answer

Code duplication becomes too much when the copies must change together: duplicated business logic, validation, or data handling means every fix has to land in every copy, and eventually one doesn't. Small incidental repeats (imports, config boilerplate, test setup) are mostly harmless. cleanvibes detects duplication by hashing normalized sliding windows of code, flags file pairs that share substantial blocks as well as heavy repetition inside single files, and weights the category at 15 of 100 in its cleanliness score.

"Don't repeat yourself" is the most over-applied rule in programming, and the corrective — "a little copying is better than a little dependency" — is real wisdom. So how much duplication is actually too much? The honest answer isn't a percentage; it's a question: if this block changes, do all its copies have to change too?

Vibe-coded apps sit on the wrong side of that question more often than most, because generating a fresh copy is cheaper than finding the existing one — ask an AI tool for similar logic twice and you'll get it twice, with one variable renamed. This page draws the practical line, explains how detection works, and covers what to do once you know where your copies live.

56.6% of 99 self-described vibe-coded repos had at least one duplication finding; cross-file copies alone hit 45.5%.

Source: cleanvibes rules-engine study, 99 public AI/vibe-coded GitHub repos, collected June 10, 2026

The line: copies that must change together

Harmful duplication is coupled duplication. The same validation logic in three endpoints, the same fetch-and-error-handle block in four components, the same price calculation in the cart and the checkout — these copies encode one decision in several places, so when the decision changes, someone has to find every copy. The first time they miss one, you have a bug that exists in some code paths and not others, which is the most confusing kind to chase.
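To make the "bug in some code paths and not others" failure concrete, here is a small illustrative sketch. The function names and the validation rule are hypothetical and not taken from any real codebase; the point is only that a fix applied to one copy silently leaves the other behind.

```python
def signup_is_valid_email(raw: str) -> bool:
    # Copy 1: later fixed to ignore surrounding whitespace.
    email = raw.strip()
    return "@" in email and " " not in email


def invite_is_valid_email(raw: str) -> bool:
    # Copy 2: same rule, but the whitespace fix never landed here.
    return "@" in raw and " " not in raw


pasted = " ada@example.com "
print(signup_is_valid_email(pasted))  # True
print(invite_is_valid_email(pasted))  # False
```

The same pasted address is accepted on one path and rejected on the other, which is exactly the kind of inconsistency that is hard to trace back to a missed copy.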

Harmless duplication is incidental: import blocks, config boilerplate, test arrange-act-assert scaffolding, two functions that happen to look similar today but serve different masters and will drift apart for good reasons. Deduplicating those buys you nothing and costs you an abstraction. The test is never "do these lines match?" — it's "is this one piece of knowledge written twice?"

How detection actually works

You can't grep for duplication, because the copies are never quite identical — a renamed variable, different whitespace, a reordered argument. Real detectors normalise first and compare windows of code rather than whole files. cleanvibes's approach: strip and normalise each line, slide an 8-line window through every code file, hash each window, and look for the same hashes appearing in more than one place.
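The windowing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not cleanvibes's actual implementation: the normalisation here only strips lines, collapses whitespace, and drops blank lines, and the choice of SHA-1 is arbitrary. How the real tool normalises things like renamed identifiers is not specified in this article.

```python
import hashlib
import re

WINDOW = 8  # window size in lines, as described in the text


def normalise(line: str) -> str:
    """Strip the line and collapse runs of whitespace (an assumed normalisation)."""
    return re.sub(r"\s+", " ", line.strip())


def window_hashes(source: str, window: int = WINDOW) -> list[str]:
    """Hash every run of `window` consecutive non-blank, normalised lines."""
    lines = [n for n in map(normalise, source.splitlines()) if n]
    hashes = []
    for start in range(len(lines) - window + 1):
        chunk = "\n".join(lines[start:start + window])
        hashes.append(hashlib.sha1(chunk.encode("utf-8")).hexdigest())
    return hashes
```

Two files whose code differs only in indentation or spacing produce identical hashes for the matching windows, while a file shorter than the window produces none.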

File pairs sharing several windows get flagged as cross-file duplication (the report names both files and estimates the duplicated line count), and heavy repetition inside a single file gets flagged separately, because a block repeated five times in one file is a loop or a helper that never got written. Both kinds of finding feed the duplication subscore, which carries weight 15 in the overall cleanliness score.
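A rough sketch of how the two kinds of finding could be derived from window hashes follows. The thresholds `MIN_SHARED_WINDOWS` and `MIN_IN_FILE_REPEATS` and the line-count estimate are assumptions made for illustration; the article does not state the values or the formula cleanvibes uses.

```python
import hashlib
from collections import Counter
from itertools import combinations

WINDOW = 8
MIN_SHARED_WINDOWS = 3   # hypothetical threshold for "several" shared windows
MIN_IN_FILE_REPEATS = 3  # hypothetical threshold for "heavy" in-file repetition


def window_hashes(source: str, window: int = WINDOW) -> list[str]:
    """Hash each run of `window` non-blank lines after collapsing whitespace."""
    lines = [" ".join(l.split()) for l in source.splitlines() if l.strip()]
    return [
        hashlib.sha1("\n".join(lines[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - window + 1)
    ]


def find_duplication(files: dict[str, str]):
    """Return (cross_file, in_file) findings for a {path: source} mapping."""
    per_file = {path: Counter(window_hashes(src)) for path, src in files.items()}

    cross_file = []
    for a, b in combinations(sorted(per_file), 2):
        shared = per_file[a].keys() & per_file[b].keys()
        if len(shared) >= MIN_SHARED_WINDOWS:
            # Assumed estimate: a contiguous run of n shared windows
            # spans n + WINDOW - 1 lines.
            cross_file.append((a, b, len(shared) + WINDOW - 1))

    in_file = [
        (path, max(counts.values()))
        for path, counts in per_file.items()
        if counts and max(counts.values()) >= MIN_IN_FILE_REPEATS
    ]
    return cross_file, in_file
```

Calling `find_duplication({"cart.py": cart_source, "checkout.py": checkout_source})` with hypothetical file contents returns the flagged pairs with an estimated duplicated-line count, plus any file whose most-repeated window appears at least `MIN_IN_FILE_REPEATS` times.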

Fixing it without over-abstracting

The fix for coupled duplication is boring and correct: extract the shared block into one function or module and import it from every former copy. Resist the urge to build a configurable mega-helper that handles all the copies' slight differences with flags — if the copies genuinely differ, extract only the truly shared core and let the call sites keep their differences visibly.
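As an illustration of extracting only the shared core, consider the cart and checkout price calculation. Everything in this sketch is hypothetical (the function names, the tax rate, the item shape); it only shows the structure: one function owns the shared decision, and each call site keeps its own differences in plain view instead of passing flags into a mega-helper.

```python
from decimal import Decimal

TAX_RATE = Decimal("0.08")  # hypothetical rate, for illustration only


def line_total(unit_price: Decimal, quantity: int) -> Decimal:
    """The one shared decision: how a line's price is computed, tax included."""
    return (unit_price * quantity * (1 + TAX_RATE)).quantize(Decimal("0.01"))


def cart_summary(items):
    """Cart call site: shows a per-line breakdown."""
    return [(name, line_total(price, qty)) for name, price, qty in items]


def checkout_total(items, shipping: Decimal) -> Decimal:
    """Checkout call site: sums the lines and adds shipping, its own concern."""
    lines = (line_total(price, qty) for _, price, qty in items)
    return sum(lines, Decimal("0")) + shipping
```

If the pricing rule changes, only `line_total` changes. Shipping stays in the checkout code because the cart never needed it, so there is no `include_shipping=True` flag to maintain.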

This is mechanical work that coding agents do well with precise instructions, which is why every cleanvibes duplication finding ships a ready-to-paste Claude prompt naming both files and the extraction to perform, with behaviour-preserving constraints. Worth saying plainly: window-based detection is a heuristic — it finds textual near-copies, not every conceptual repeat — so treat the report as the high-confidence list, not the complete one.

Duplication that matters vs duplication that doesn't

| Kind | Example | Verdict |
| --- | --- | --- |
| Coupled business logic | Same price calculation in cart and checkout | Fix now: copies must change together |
| Repeated handling blocks | Same fetch-and-error block in four components | Extract a shared helper |
| In-file repetition | Same block five times in one file | Fold into a loop or function |
| Test scaffolding | Similar setup across test files | Mostly fine: clarity beats DRY in tests |
| Boilerplate | Imports, config blocks, type declarations | Leave it: deduplicating buys nothing |
| Coincidental similarity | Two look-alike functions serving different features | Leave it: they'll drift apart for good reasons |

frequently asked

Is there an acceptable percentage of duplication?
Percentages are the wrong lens — 5% duplicated boilerplate is fine and 2% duplicated business logic is a problem. Ask whether the copies encode one decision in several places. That's the duplication that bills you.
Why do AI coding tools duplicate so much?
Because generating a fresh copy is the path of least resistance: the tool doesn't reliably know a helper already exists elsewhere in your codebase, and writing new code always works. Unless you explicitly point at the existing function, you often get a second one.
How does cleanvibes count duplicated lines?
It slides an 8-line normalized window through every code file, hashes the windows, and counts windows shared between file pairs. Pairs sharing several windows are reported with an estimated duplicated-line count and both file names — enough to go straight to the extraction.
Won't extracting everything make my code harder to read?
Over-extraction is a real failure mode, which is why the right unit is the genuinely shared core, not everything that looks similar. The fix prompts cleanvibes writes are scoped to the flagged blocks — one extraction per finding, smallest reasonable diff, behaviour unchanged.

Published June 10, 2026 · Last updated June 11, 2026
