The crawlers to block — and why search is safe
Training and dataset crawlers include GPTBot (OpenAI's training crawler), Google-Extended (Google's AI-training token, a robots.txt control token rather than a separate crawler), CCBot (Common Crawl, whose open dataset trains many models), and Anthropic's anthropic-ai and ClaudeBot. Disallowing each by its user-agent tells these companies, through the channel they have committed to honour, that your content is off-limits for training.
This doesn't hurt your reach, because training tokens are deliberately separate from search tokens. Google-Extended is not Googlebot; blocking the former opts you out of AI training while your pages keep ranking in Search. That separation is the whole point: it lets you say no to training and yes to discovery at the same time.
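To see that separation in practice, here is a minimal Python sketch that parses a sample robots.txt with the standard library's urllib.robotparser and checks which tokens may fetch a page. The file contents and the example.com URL are assumptions for illustration, and Python's parser only approximates how each company's crawler reads the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one Disallow block per training token named above.
# Googlebot and Bingbot are not mentioned, so they stay allowed by default.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the sample robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Google-Extended", "CCBot", "ClaudeBot", "Googlebot", "Bingbot"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```

Run as a script, it reports the training tokens as blocked and Googlebot and Bingbot as allowed, because no rule names them.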
What this can and can't do
Be clear-eyed about scope. A robots.txt opt-out works for crawlers that honour it, going forward — it does not retroactively remove content already in existing datasets, and it does not stop a bad actor who ignores the standard. For content that absolutely cannot be scraped, you need authentication or a firewall as well.
Within those limits, it is the highest-leverage step available: free, immediate, and respected by the largest AI crawlers. robot.guard makes it reliable by keeping the training-crawler list current and writing each block correctly, so your opt-out actually covers today's crawlers instead of last year's.
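Where a request is not enough, enforcement happens on the server. As a rough sketch of the authentication side, assuming a Python WSGI app, a hypothetical /private/ path prefix and a shared bearer token (all illustrative choices, not something robot.guard provides), a middleware can refuse requests that lack the credential:

```python
import hmac


def require_token(app, token: str, prefix: str = "/private/"):
    """Wrap a WSGI app so paths under `prefix` need 'Authorization: Bearer <token>'.

    `prefix` and the bearer-token scheme are assumptions made for this example.
    """
    expected = f"Bearer {token}".encode("latin-1")

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        supplied = environ.get("HTTP_AUTHORIZATION", "").encode("latin-1")
        # Constant-time comparison avoids leaking the token through timing.
        if path.startswith(prefix) and not hmac.compare_digest(supplied, expected):
            start_response(
                "401 Unauthorized",
                [("Content-Type", "text/plain"), ("WWW-Authenticate", "Bearer")],
            )
            return [b"authentication required\n"]
        # Public paths and correctly authenticated requests pass through.
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]
```

Unlike a Disallow line, this check applies to every client, including scrapers that ignore robots.txt or spoof a user-agent.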
how it works
- 01 name the training bots
  List GPTBot, Google-Extended, CCBot, ClaudeBot and similar training crawlers.
- 02 disallow each
  Add a Disallow: / block per user-agent, or toggle them in robot.guard.
- 03 keep search on
  Leave Googlebot and Bingbot allowed so discovery is unaffected.
- 04 refresh as needed
  Revisit the list as new training crawlers launch.
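The four steps above can be sketched as a small generator. This is an illustration under assumptions, not a description of how robot.guard works internally: the function name build_robots_txt, the TRAINING_BOTS list and the SEARCH_BOTS guard are all hypothetical, and the list needs maintaining as crawlers change.

```python
# Hypothetical list; revisit it as new training crawlers launch (step 04).
TRAINING_BOTS = ["GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot"]

# Search crawlers that should never end up in the block list (step 03).
SEARCH_BOTS = {"googlebot", "bingbot"}


def build_robots_txt(training_bots: list[str]) -> str:
    """Return robots.txt text with one 'Disallow: /' block per training bot."""
    blocks = []
    for bot in training_bots:
        if bot.lower() in SEARCH_BOTS:
            raise ValueError(f"{bot} is a search crawler; blocking it would hurt discovery")
        # Step 02: one group per user-agent, disallowing the whole site.
        blocks.append(f"User-agent: {bot}\nDisallow: /")
    # Blank lines separate the groups; search crawlers are allowed by omission.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(TRAINING_BOTS), end="")
```

If your site already has a robots.txt, append the generated blocks to it rather than overwriting the file, so existing rules for other crawlers stay in place.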
frequently asked
- Do AI companies really honour a robots.txt opt-out?
  The major ones publish their training crawler user-agents and state that they respect robots.txt. It's an opt-out for compliant crawlers, not a guarantee against ones that ignore the standard.
- Does blocking training crawlers remove my content from existing models?
  No. It prevents future training crawls that honour it; it can't retroactively pull content from datasets already collected.
- Will this affect how I show up in AI search answers?
  Possibly. Some answer-engine crawlers are separate from training crawlers, so you can block training while allowing answer engines, or block both. robot.guard separates them so you choose.
- Is robots.txt enough to protect sensitive content?
  No. For content that must not be scraped at all, combine robots.txt with authentication or a firewall, since those enforce rather than request.
Last updated June 9, 2026