How to block CCBot (Common Crawl)

the short answer

To block CCBot, add 'User-agent: CCBot' and 'Disallow: /' to your robots.txt. This keeps your pages out of future releases of Common Crawl's open dataset, which many downstream AI models train on.

CCBot is the crawler operated by Common Crawl, a nonprofit that publishes a giant, free archive of the web. Blocking it takes two lines in robots.txt. But CCBot is worth special attention because of what sits downstream of it: a single block here has outsized reach.

Common Crawl's dataset is one of the most widely used training corpora in AI. Many language models, from large labs and small projects alike, are trained on extracts of it. So when you block CCBot, you are not just opting out of one company's bot; in one move, you are keeping your content out of future releases of a dataset that feeds a long list of models.

~32% of all web traffic came from bad bots in 2023, much of it automated scraping (Imperva, 2024 Bad Bot Report)

The exact rule

The block is two lines: 'User-agent: CCBot' followed by 'Disallow: /'. The user-agent line names Common Crawl's bot exactly, and the disallow with a single slash means every path. Save it in a plain text robots.txt file at your domain root so it resolves at example.com/robots.txt. Common Crawl honours robots.txt, so a compliant crawl will skip your site once the rule is in place.
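As a quick sanity check of the rule described above, here is a minimal sketch that parses the two lines with Python's standard-library robots.txt parser. The helper name is_allowed and the example.com URL are illustrative assumptions, and the standard-library parser only approximates how a real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the text, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text.

    is_allowed is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# CCBot is refused everywhere; crawlers not named in the file are unaffected.
ccbot_ok = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/any/page")
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/any/page")
print("CCBot allowed:", ccbot_ok)
print("Googlebot allowed:", googlebot_ok)
```

Because the file names only CCBot, any crawler without a matching group falls back to "allowed", which is why search crawlers are untouched by this rule.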

If you want to hold back only part of your site, replace the slash with a path, for example 'Disallow: /research/', and stack multiple Disallow lines under the same user-agent for several sections. As with all robots.txt rules, this is a request rather than enforcement, so pair it with a firewall for any scraper that ignores the standard outright.
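A partial block can be checked the same way. The sketch below assumes two sections, /research/ (from the text) and /drafts/ (a hypothetical second path added for illustration), and uses Python's standard-library parser to show which URLs a compliant CCBot would skip.

```python
from urllib.robotparser import RobotFileParser

# Several Disallow lines stacked under one user-agent.
# /drafts/ is a hypothetical path used only for this example.
PARTIAL_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /research/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(PARTIAL_ROBOTS_TXT.splitlines())

# Paths under a disallowed prefix are refused; everything else stays open.
research_ok = parser.can_fetch("CCBot", "https://example.com/research/paper.html")
drafts_ok = parser.can_fetch("CCBot", "https://example.com/drafts/notes.html")
blog_ok = parser.can_fetch("CCBot", "https://example.com/blog/post.html")

print("research allowed:", research_ok)
print("drafts allowed:", drafts_ok)
print("blog allowed:", blog_ok)
```

Rules match by path prefix, so the trailing slash matters: '/research/' covers everything inside that directory but not a separate page whose path merely begins with '/research'.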

Why blocking CCBot has outsized reach

Most AI crawlers collect data for a single organisation's models. CCBot is different: Common Crawl publishes its archive openly, and that archive is then used as training material by many separate teams and models. Blocking GPTBot affects OpenAI's training; blocking CCBot affects everyone who trains on Common Crawl, which is a far longer list. That makes it one of the highest-leverage single entries in your robots.txt.

There is a timing nuance. Common Crawl periodically publishes new snapshots, and a model trains on whichever snapshots existed when it was built. Blocking CCBot today keeps your content out of future snapshots, but content captured in past public archives has already been distributed and stays available. Adding the rule sooner rather than later limits how much of your site flows into the next round of downstream training.

CCBot is one of several to block

CCBot is high-leverage, but it is not the only AI crawler that matters. Plenty of labs run their own dedicated crawlers in addition to using Common Crawl, so a thorough opt-out also names GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, PerplexityBot, Bytespider, and others. Each is a separate user-agent token that must be named in robots.txt, either in its own block or in a group that shares the same rules.
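Writing one block per crawler by hand is repetitive, so a small generator can help. The following is a rough sketch, not how any particular tool builds its file: build_robots_txt is a hypothetical helper, and the agent list is simply the names mentioned above, which will go stale as new crawlers appear.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from the list in the text; not an exhaustive registry.
AI_USER_AGENTS = [
    "CCBot",
    "GPTBot",
    "ClaudeBot",
    "anthropic-ai",
    "Google-Extended",
    "Applebot-Extended",
    "PerplexityBot",
    "Bytespider",
]


def build_robots_txt(agents, disallow_paths=("/",)):
    """Emit one robots.txt group per user-agent, separated by blank lines.

    build_robots_txt is a hypothetical helper written for this sketch.
    """
    groups = []
    for agent in agents:
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in disallow_paths]
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(AI_USER_AGENTS)
print(robots_txt)

# Re-parse the generated text to confirm every listed agent is refused.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
still_allowed = [
    agent for agent in AI_USER_AGENTS
    if parser.can_fetch(agent, "https://example.com/")
]
print("agents still allowed:", still_allowed)
```

The sketch emits a separate group for each agent because that form is the easiest to read and audit. The robots.txt standard also permits several User-agent lines to share one set of rules, which produces a shorter file with the same effect.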

Tracking that list by hand is the hard part, because the names change as new crawlers appear. robot.guard maintains a curated list of these AI user-agents, including CCBot, so you toggle them on rather than memorising them. It also whitelists legitimate search and archive crawlers like Googlebot and the Internet Archive, lets you add your own path rules, and previews the exact file before you download it for your site root.

how it works

  1. Open or create robots.txt

    Open your robots.txt, or create a plain text file named robots.txt at your domain root so it loads at example.com/robots.txt. In robot.guard you can begin in the free editor immediately.

  2. Add the CCBot block

    Add 'User-agent: CCBot' then 'Disallow: /'. To block only a section, use a path such as '/research/' in place of the slash. In robot.guard, toggle CCBot in the curated list to insert it.

  3. Add the other high-value AI crawlers

    Since many labs also run their own bots, add GPTBot, ClaudeBot, anthropic-ai, Google-Extended, PerplexityBot, and Bytespider. robot.guard keeps these in one maintained list so you do not track names yourself.

  4. Preview, download, and deploy

    Preview the file to confirm Googlebot and the Internet Archive are still allowed, download robots.txt, and upload it to your site root. Confirm by loading example.com/robots.txt in a browser.
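Loading the file in a browser confirms it is served, but not that the rules do what you intended. The sketch below is one possible way to audit a deployed file under some assumptions: audit and fetch_robots_txt are hypothetical helper names, the blocked list is a shortened sample, and archive.org_bot is assumed here to be the Internet Archive's user-agent, so check the name against the archive's own documentation before relying on it.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# Crawlers that should be refused (a shortened sample from the steps above).
SHOULD_BE_BLOCKED = ["CCBot", "GPTBot", "ClaudeBot"]
# Crawlers that should stay allowed. archive.org_bot is an assumed name.
SHOULD_BE_ALLOWED = ["Googlebot", "archive.org_bot"]


def fetch_robots_txt(site: str) -> str:
    """Download robots.txt from a site root such as https://example.com."""
    with urlopen(f"{site.rstrip('/')}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def audit(robots_txt: str, probe_url: str = "https://example.com/") -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for agent in SHOULD_BE_BLOCKED:
        if parser.can_fetch(agent, probe_url):
            problems.append(f"{agent} is still allowed")
    for agent in SHOULD_BE_ALLOWED:
        if not parser.can_fetch(agent, probe_url):
            problems.append(f"{agent} is blocked")
    return problems


# Offline check against a sample file, so the sketch runs without a network.
sample = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""
print(audit(sample))
```

To check a live site, pass the downloaded text in place of the sample, for example audit(fetch_robots_txt("https://example.com")). A blanket 'User-agent: *' with 'Disallow: /' would show up here as Googlebot being blocked, which is the mistake the preview step is meant to catch.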

frequently asked

Why does blocking CCBot matter more than blocking one AI company's bot?
Common Crawl publishes its archive openly, and many separate models train on it. Blocking CCBot removes your content from a dataset used by a long list of downstream models, not just one organisation.
Does blocking CCBot remove my pages from existing datasets?
No. It keeps your content out of future Common Crawl snapshots. Anything captured in past public archives is already distributed, which is why adding the rule sooner limits future exposure.
Will blocking CCBot affect my search ranking?
No. CCBot is a data-collection crawler separate from search crawlers like Googlebot and Bingbot. Blocking it has no effect on how your site is indexed or ranked in search.
Is blocking CCBot enough on its own?
It is high-leverage but not complete. Many labs also run dedicated crawlers, so add GPTBot, ClaudeBot, anthropic-ai, and others. robot.guard keeps the full curated list in one place.

Last updated June 9, 2026
