
How to protect your content from AI training with robots.txt

the short answer

To keep your content out of AI training, disallow the training crawlers in robots.txt: GPTBot, Google-Extended, CCBot, anthropic-ai / ClaudeBot and similar. The major AI companies state that they honour this as an opt-out, so it stops your pages feeding new models while leaving search crawlers like Googlebot fully allowed.

If you publish writing, photography, code, or data, it is now part of the raw material AI companies use to train models — unless you opt out. The good news is that the main training crawlers treat robots.txt as exactly that opt-out, so a correctly configured file is a real, respected way to keep your work out of the next dataset.

Here is how the opt-out works, which crawlers to name, and the limits worth being honest about.

opt-out: how major AI crawlers treat a robots.txt block. They honour it as consent withdrawn.

curated ai scraper blocklist

kept current as new crawlers appear — toggle the ones you want shut out.

5 of 6 blocked
user-agent        operator        purpose
GPTBot            OpenAI          model training
ClaudeBot         Anthropic       model training
CCBot             Common Crawl    training dataset
Google-Extended   Google          Gemini training
Bytespider        ByteDance       model training
PerplexityBot     Perplexity      answer engine
tip: robots.txt is a polite request; pair it with a firewall rule for crawlers that ignore it.

where this happens in the app

the training crawlers (GPTBot, Google-Extended, CCBot, ClaudeBot) honour a robots.txt block as an opt-out; robot.guard keeps the list current so your opt-out covers today's bots, not last year's.

  1. block training and dataset crawlers without touching Googlebot; they're separate user-agents.
  2. toggle each one off; search visibility stays fully intact.

The crawlers to block — and why search is safe

Training and dataset crawlers include GPTBot (OpenAI's training crawler), Google-Extended (Google's AI-training token), CCBot (Common Crawl, whose open dataset trains many models), and anthropic-ai / ClaudeBot (Anthropic's tokens). Disallowing each by its user-agent tells these companies, in the channel they have committed to honour, that your content is off-limits for training.

The reason this doesn't hurt your reach is that training tokens are deliberately separate from search tokens. Google-Extended is not Googlebot; blocking the former opts you out of AI training while your pages keep ranking in Search. That separation is the whole point — it lets you say no to training and yes to discovery at the same time.
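As a minimal sketch of what that separation looks like in a real file, the Python below builds a robots.txt with one Disallow: / group per training token, then checks the result with the standard library's urllib.robotparser. The helper names (build_robots_txt, is_allowed) and the example.com URL are made up for illustration, and robotparser's matching is only an approximation of how any individual crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Training and dataset tokens named in this article.
TRAINING_BOTS = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot", "anthropic-ai"]


def build_robots_txt(blocked):
    """Return robots.txt text with one 'Disallow: /' group per blocked token."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    # Every other crawler, including Googlebot and Bingbot, stays allowed.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, user_agent, url="https://example.com/post"):
    """Ask the stdlib parser whether user_agent may fetch url (hypothetical URL)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(TRAINING_BOTS)
print(robots_txt)
print("GPTBot allowed:", is_allowed(robots_txt, "GPTBot"))
print("Googlebot allowed:", is_allowed(robots_txt, "Googlebot"))
```

The printed file is what you would serve at /robots.txt; the two checks at the end confirm that the training token is refused while the search crawler is not.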

What this can and can't do

Be clear-eyed about scope. A robots.txt opt-out works for crawlers that honour it, going forward — it does not retroactively remove content already in existing datasets, and it does not stop a bad actor who ignores the standard. For content that absolutely cannot be scraped, you need authentication or a firewall as well.
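For the enforcement side, here is one possible sketch at the application layer: a small WSGI middleware that returns 403 when the User-Agent header contains a blocked token. The token list and the names is_blocked and block_scrapers are illustrative assumptions, not part of any product. Real crawler user-agent strings vary, a scraper that ignores robots.txt can also spoof its header, and Google-Extended is a robots.txt token rather than a request user-agent, so it cannot be matched this way.

```python
# Illustrative tokens only; match them against the lowercased User-Agent header.
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider")


def is_blocked(user_agent_header):
    """True if the User-Agent header contains any blocked token."""
    ua = (user_agent_header or "").lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def block_scrapers(app):
    """Wrap a WSGI app: answer 403 for blocked user-agents, pass others through."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

A header check like this only raises the bar slightly; content that must never be scraped still belongs behind authentication.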

Within those limits, it is the highest-leverage step available: free, immediate, and respected by the largest AI crawlers. robot.guard makes it reliable by keeping the training-crawler list current and writing each block correctly, so your opt-out actually covers today's crawlers instead of last year's.

how it works

  1. name the training bots

     List GPTBot, Google-Extended, CCBot, ClaudeBot and similar training crawlers.

  2. disallow each

     Add a Disallow: / block per user-agent, or toggle them in robot.guard.

  3. keep search on

     Leave Googlebot and Bingbot allowed so discovery is unaffected.

  4. refresh as needed

     Revisit the list as new training crawlers launch.
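The last step, refreshing the list, can be sketched as a small audit: given the robots.txt you already serve and the list of crawlers you currently want blocked, report the ones the file does not yet cover. The function name unblocked_bots, the sample file, and the probe URL are hypothetical, and the check only tests the site root, so a partial Disallow would not be detected.

```python
from urllib.robotparser import RobotFileParser


def unblocked_bots(robots_txt, bots, probe_url="https://example.com/"):
    """Return the bots from `bots` that robots_txt still lets fetch probe_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if parser.can_fetch(bot, probe_url)]


# A file written a while ago that only names two crawlers.
existing = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

current_list = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot"]
missing = unblocked_bots(existing, current_list)
print(missing)  # ['Google-Extended', 'ClaudeBot']
```

Each name in the result needs its own Disallow: / group added to the file.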

frequently asked

Do AI companies really honour a robots.txt opt-out?
The major ones publish their training crawler user-agents and state that they respect robots.txt. It's an opt-out for compliant crawlers, not a guarantee against ones that ignore the standard.
Does blocking training crawlers remove my content from existing models?
No. It prevents future training crawls that honour it; it can't retroactively pull content from datasets already collected.
Will this affect how I show up in AI search answers?
Possibly — some answer-engine crawlers are separate from training crawlers, so you can block training while allowing answer engines, or block both. robot.guard separates them so you choose.
Is robots.txt enough to protect sensitive content?
No. For content that must not be scraped at all, combine robots.txt with authentication or a firewall, since those enforce rather than request.

Last updated June 9, 2026
