The exact rule
The block is two lines: 'User-agent: CCBot' followed by 'Disallow: /'. The user-agent line names Common Crawl's bot exactly, and the disallow with a single slash means every path. Save it in a plain text robots.txt file at your domain root so it resolves at example.com/robots.txt. Common Crawl honours robots.txt, so a compliant crawl will skip your site once the rule is in place.
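As a quick sanity check, the two-line rule can be run through the robots.txt parser in Python's standard library. This is a minimal sketch rather than Common Crawl's own matching logic; the helper name is_allowed and the example.com URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # CCBot is refused everywhere; crawlers the file does not name are unaffected.
    print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/any/page"))      # False
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/any/page"))  # True
```

The second check shows that the rule only applies to the user-agent it names, which is why it leaves search crawlers alone.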
If you want to hold back only part of your site, replace the slash with a path, for example 'Disallow: /research/', and stack multiple Disallow lines under the same user-agent for several sections. As with all robots.txt rules, this is a request rather than enforcement, so pair it with a firewall for any scraper that ignores the standard outright.
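A partial opt-out can be checked the same way. The sketch below stacks two Disallow lines under one user-agent; /research/ comes from the example above, while /drafts/, the helper name ccbot_may_fetch, and the example.com host are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Partial opt-out: /research/ is from the text, /drafts/ is a hypothetical second section.
PARTIAL_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /research/
Disallow: /drafts/
"""


def ccbot_may_fetch(path: str, robots_txt: str = PARTIAL_ROBOTS_TXT) -> bool:
    """Return True if the rules let CCBot fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", "https://example.com" + path)


if __name__ == "__main__":
    for path in ("/research/paper.html", "/drafts/notes", "/blog/post"):
        print(path, ccbot_may_fetch(path))
```

Rules match by path prefix, so with this parser 'Disallow: /research/' covers everything under that directory but not the bare path /research without the trailing slash.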
Why blocking CCBot has outsized reach
Most AI crawlers collect data for a single organisation's models. CCBot is different: Common Crawl publishes its archive openly, and that archive is then used as training material by many separate teams and models. Blocking GPTBot affects OpenAI's training; blocking CCBot affects everyone who trains on Common Crawl, which is a far longer list. That makes it one of the highest-leverage single entries in your robots.txt.
There is a timing nuance. Common Crawl periodically publishes new snapshots, and a model can only train on snapshots that existed when it was built. Blocking CCBot today keeps your content out of future snapshots, but anything captured in past public archives has already been distributed and stays available. Adding the rule sooner rather than later limits how much of your site flows into the next round of downstream training.
CCBot is one of several to block
CCBot is high-leverage, but it is not the only AI crawler that matters. Plenty of labs run their own dedicated crawlers in addition to using Common Crawl, so a thorough opt-out also names GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, PerplexityBot, Bytespider, and others. Each is a separate user-agent token that must be named explicitly, because a rule written for CCBot does not apply to any of them.
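Writing one block per crawler is repetitive, so it can be generated. The following is a rough sketch that assumes the user-agent names listed above are still current; the function names build_ai_blocks and blocked_agents are hypothetical, and the list will need updating as crawlers change.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the text; treat this list as a snapshot, not a complete registry.
AI_USER_AGENTS = [
    "CCBot",
    "GPTBot",
    "ClaudeBot",
    "anthropic-ai",
    "Google-Extended",
    "Applebot-Extended",
    "PerplexityBot",
    "Bytespider",
]


def build_ai_blocks(user_agents, path="/"):
    """Return robots.txt text with one User-agent/Disallow group per crawler."""
    groups = [f"User-agent: {agent}\nDisallow: {path}" for agent in user_agents]
    return "\n\n".join(groups) + "\n"


def blocked_agents(robots_txt, user_agents, url="https://example.com/"):
    """Return the subset of user_agents that robots_txt refuses for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in user_agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    text = build_ai_blocks(AI_USER_AGENTS)
    print(text)
    # Googlebot is included only to confirm it is not caught by any of the groups.
    print(blocked_agents(text, AI_USER_AGENTS + ["Googlebot"]))
```

Separate groups keep each crawler easy to remove or scope to a path later; the robots.txt format also allows several User-agent lines to share one set of rules if you prefer a shorter file.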
Tracking that list by hand is the hard part, because the names change as new crawlers appear. robot.guard maintains a curated list of these AI user-agents, including CCBot, so you toggle them on rather than memorising them. It also whitelists legitimate search and archive crawlers like Googlebot and the Internet Archive, lets you add your own path rules, and previews the exact file before you download it for your site root.
How it works
- 01
Open or create robots.txt
Open your robots.txt, or create a plain text file named robots.txt at your domain root so it loads at example.com/robots.txt. In robot.guard you can begin in the free editor immediately.
- 02
Add the CCBot block
Add 'User-agent: CCBot' then 'Disallow: /'. To block only a section, use a path such as '/research/' in place of the slash. In robot.guard, toggle CCBot in the curated list to insert it.
- 03
Add the other high-value AI crawlers
Since many labs also run their own bots, add GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, PerplexityBot, and Bytespider. robot.guard keeps these in one maintained list so you do not track names yourself.
- 04
Preview, download, and deploy
Preview the file to confirm Googlebot and the Internet Archive are still allowed, download robots.txt, and upload it to your site root. Confirm by loading example.com/robots.txt in a browser.
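The final check in the steps above can also be scripted. This is a sketch under stated assumptions: audit_robots is a hypothetical helper, the sample file is invented, and the Internet Archive's crawler token is left out because the text does not give it; add it to must_allow once you have confirmed the name.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, must_block, must_allow, url="https://example.com/"):
    """Return a list of problems found; an empty list means the file passes."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for agent in must_block:
        if parser.can_fetch(agent, url):
            problems.append(f"{agent} is not blocked")
    for agent in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"{agent} is blocked")
    return problems


if __name__ == "__main__":
    # Invented sample: ClaudeBot was forgotten, so the audit should flag it.
    sample = "User-agent: CCBot\nDisallow: /\n\nUser-agent: GPTBot\nDisallow: /\n"
    print(audit_robots(sample, ["CCBot", "GPTBot", "ClaudeBot"], ["Googlebot", "Bingbot"]))
    # -> ['ClaudeBot is not blocked']
```

To audit the deployed file rather than a string, download the text served at example.com/robots.txt and pass it in as robots_txt.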
Frequently asked
- Why does blocking CCBot matter more than blocking one AI company's bot?
- Common Crawl publishes its archive openly, and many separate models train on it. Blocking CCBot keeps your content out of future snapshots of a dataset used by a long list of downstream models, not just one organisation.
- Does blocking CCBot remove my pages from existing datasets?
- No. It keeps your content out of future Common Crawl snapshots. Anything captured in past public archives is already distributed, which is why adding the rule sooner limits future exposure.
- Will blocking CCBot affect my search ranking?
- No. CCBot is a data-collection crawler separate from search crawlers like Googlebot and Bingbot. Blocking it has no effect on how your site is indexed or ranked in search.
- Is blocking CCBot enough on its own?
- It is high-leverage but not complete. Many labs also run dedicated crawlers, so add GPTBot, ClaudeBot, anthropic-ai, and others. robot.guard keeps the full curated list in one place.
Last updated June 9, 2026