How to protect your content from AI training with robots.txt

By ogbuilds, the studio behind robot·guard · updated 2026-06-09

the short answer

To keep your content out of AI training, disallow the training crawlers in robots.txt: GPTBot, Google-Extended, CCBot, anthropic-ai/ClaudeBot and similar. The major AI companies honour this as an opt-out, so it stops your pages feeding new models while leaving search crawlers like Googlebot fully allowed.

If you publish writing, photography, code, or data, it is now part of the raw material AI companies use to train models — unless you opt out. The good news is that the main training crawlers treat robots.txt as exactly that opt-out, so a correctly configured file is a real, respected way to keep your work out of the next dataset.

Here is how the opt-out works, which crawlers to name, and the limits worth being honest about.

opt-out: how major AI crawlers treat a robots.txt block. They honour it as consent withdrawn.

The crawlers to block — and why search is safe

Training and dataset crawlers include GPTBot (OpenAI's training crawler), Google-Extended (Google's AI-training token), CCBot (Common Crawl, whose open dataset trains many models), and anthropic-ai / ClaudeBot (Anthropic's crawler tokens). Disallowing each by its user-agent tells these companies, in the channel they have committed to honour, that your content is off-limits for training.
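As a rough illustration of what those per-crawler blocks look like, the Python sketch below writes one Disallow: / group for each user-agent named above. The function name build_robots_txt and the TRAINING_CRAWLERS list are hypothetical and only as current as this article. The closing wildcard group that allows every other crawler is also an assumption about what you want.

```python
# Hypothetical helper: names are illustrative, not any real tool's API.
TRAINING_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot", "anthropic-ai", "ClaudeBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text with one full-site Disallow group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Assumption: every other crawler, including search, stays allowed.
    groups.append("User-agent: *\nAllow: /")
    # Blank lines separate groups so each rule binds to a single user-agent.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(TRAINING_CRAWLERS))
```

Save the printed text as robots.txt at the root of your site. If you already have a robots.txt, merge the groups into it rather than overwriting your existing rules.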

The reason this doesn't hurt your reach is that training tokens are deliberately separate from search tokens. Google-Extended is not Googlebot; blocking the former opts you out of AI training while your pages keep ranking in Search. That separation is the whole point — it lets you say no to training and yes to discovery at the same time.
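One way to check that separation on your own file is with Python's standard-library urllib.robotparser. The sketch below assumes a two-group robots.txt and a placeholder URL on example.com; the helper name is_allowed is hypothetical. Python's parser approximates the standard, so treat the result as a sanity check rather than proof of how any particular crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample file: blocks two training tokens and names no search crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/article"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot", "Google-Extended", "GPTBot"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent) else "blocked"
        print(f"{agent}: {verdict}")
```

With this sample file, the search crawlers should come back allowed and the two training tokens blocked, because no group names Googlebot or Bingbot.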

What this can and can't do

Be clear-eyed about scope. A robots.txt opt-out works for crawlers that honour it, going forward — it does not retroactively remove content already in existing datasets, and it does not stop a bad actor who ignores the standard. For content that absolutely cannot be scraped, you need authentication or a firewall as well.

Within those limits, no step gives you more for the effort: free, immediate, and respected by the largest AI crawlers. robot·guard makes it reliable by keeping the training-crawler list current and writing each block correctly, so your opt-out actually covers today's crawlers instead of last year's.

how it works

01. name the training bots: List GPTBot, Google-Extended, CCBot, ClaudeBot and similar training crawlers.
02. disallow each: Add a Disallow: / block per user-agent, or toggle them in robot·guard.
03. keep search on: Leave Googlebot and Bingbot allowed so discovery is unaffected.
04. refresh as needed: Revisit the list as new training crawlers launch.
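The refresh step can be sketched as a small audit: given the text of an existing robots.txt and whatever crawler list you currently maintain, report which names are still allowed to fetch the site root. The function name unblocked_crawlers and both inputs are hypothetical, and the check relies on Python's urllib.robotparser, which may match user-agents slightly differently from the crawlers themselves.

```python
from urllib.robotparser import RobotFileParser


def unblocked_crawlers(robots_txt, crawlers):
    """Return the crawler names that robots_txt still allows at the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if parser.can_fetch(name, "/")]


if __name__ == "__main__":
    # Assumed example: an older file that predates two entries on the list.
    old_file = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n"
    current_list = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot"]
    print(unblocked_crawlers(old_file, current_list))
```

Any name the function returns is a crawler on your list that the file does not yet disallow, which is the gap a periodic review is meant to catch.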

frequently asked

Do AI companies really honour a robots.txt opt-out?

The major ones publish their training crawler user-agents and state that they respect robots.txt. It's an opt-out for compliant crawlers, not a guarantee against ones that ignore the standard.

Does blocking training crawlers remove my content from existing models?

No. It prevents future training crawls that honour it; it can't retroactively pull content from datasets already collected.

Will this affect how I show up in AI search answers?

Possibly — some answer-engine crawlers are separate from training crawlers, so you can block training while allowing answer engines, or block both. robot·guard separates them so you choose.

Is robots.txt enough to protect sensitive content?

No. For content that must not be scraped at all, combine robots.txt with authentication or a firewall, since those enforce rather than request.
