The compliant majority: who actually obeys
The AI crawlers you are most likely worried about are also the ones most likely to behave. OpenAI's GPTBot and OAI-SearchBot, Anthropic's ClaudeBot, Common Crawl's CCBot, and PerplexityBot all publish documented user-agent strings. Google's Google-Extended is slightly different: it is a documented robots.txt control token rather than a separate crawler with its own user-agent string, but you address it in robots.txt the same way. All of these operators state that they read and respect robots.txt. They are large, reputational operators with legal and PR incentives to follow the standard, and independent reporting on crawler behaviour has generally found that the named bots do honour disallow rules.
For these crawlers, a clean robots.txt is genuinely effective. When robot.guard writes a disallow rule for GPTBot or ClaudeBot, you are speaking the exact language those operators read, and they back off. These operators account for the bulk of the identifiable AI scraping pressure most sites face, which is why maintaining an accurate, current block list is the single highest-leverage thing you can do.
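As a minimal sketch of what such a rule looks like and how a compliant crawler would read it, the snippet below parses a small robots.txt with Python's standard `urllib.robotparser`. The file contents, the `is_allowed` helper, and the example.com URLs are illustrative assumptions, not output produced by robot.guard.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not robot.guard's actual output):
# two AI crawlers are disallowed site-wide, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honours robots.txt may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/articles/")
        print(f"{agent}: {'allowed' if verdict else 'disallowed'}")
```

This only models what a well-behaved crawler decides for itself after reading the file; nothing in it stops a request from being made.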
The caveats: spoofing, undisclosed agents, and the no-wall problem
Now the uncomfortable part. Some bots ignore robots.txt entirely. Others crawl under generic or undisclosed user-agents specifically so they do not match your rules. And because robots.txt is voluntary, nothing technically stops a bad actor from reading your disallow rules and crawling anyway. A user-agent string can be faked, so a determined scraper can even impersonate a browser. There have been documented disputes about whether certain AI crawlers fully respected robots.txt in practice, which is a useful reminder that the file is a request, not a lock.
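To make the spoofing point concrete, here is a small sketch of a naive filter that trusts the User-Agent header. The token list, the function name, and both header strings are hypothetical and chosen only for illustration; they are not the real strings any operator sends.

```python
# Hypothetical block list of lowercase crawler tokens (illustrative only).
BLOCKED_TOKENS = ("gptbot", "claudebot", "ccbot", "perplexitybot")


def blocked_by_user_agent(user_agent_header: str) -> bool:
    """Naive check: block a request if its User-Agent names a listed crawler."""
    ua = user_agent_header.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# A crawler that identifies itself honestly is caught by the check.
honest_header = "ExampleCrawler/1.0 (compatible; GPTBot)"

# The same crawler sending a browser-like header is not, because the check
# only sees whatever string the client chooses to send.
spoofed_header = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

if __name__ == "__main__":
    print(blocked_by_user_agent(honest_header))
    print(blocked_by_user_agent(spoofed_header))
```

Any rule keyed on the header alone, whether in robots.txt or in server config, shares this weakness: the client controls the value being matched.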
This is why the right mental model is layered. robots.txt handles the compliant majority cheaply and cleanly. For everything else, you pair it with enforcement at the network level: a firewall or WAF that can rate-limit, challenge, or hard-block traffic regardless of what user-agent it claims. robot.guard manages the request layer well, and is honest that the file is one layer, not the whole defense.
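One way to picture enforcement that ignores the claimed user-agent is a rate limiter keyed on the client IP. The sketch below is a simplified, in-memory sliding-window limiter; the class name, thresholds, and IP addresses are assumptions for illustration, and a production firewall or WAF would do this at the network edge with far more signals.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client IP within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps client IP -> timestamps of its recently accepted requests.
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str, now: float) -> bool:
        """Return True if this request is within the limit. The user-agent
        is deliberately not an input, so spoofing it changes nothing."""
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0, 10.0):
        print(t, limiter.allow("203.0.113.7", now=t))
```

Passing `now` explicitly keeps the sketch deterministic; a real deployment would read a monotonic clock and would likely challenge or block repeat offenders rather than simply reject individual requests.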
What a realistic strategy looks like
Start with an accurate robots.txt that explicitly allows the search and archive crawlers you want (Googlebot, Bingbot, DuckDuckBot, the Internet Archive) and disallows the AI user-agents you do not. Because the list of AI crawlers changes constantly, a maintained, curated block list beats a static snippet you copied off a forum two years ago. That alone neutralizes most identifiable AI scraping from operators that play by the rules.
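As a rough sketch of keeping that file generated from lists rather than hand-edited, the function below builds a robots.txt from an allow list and a block list. The function name and the list contents are assumptions drawn from the crawlers named in this article, not robot.guard's actual list or format; the Internet Archive is left out because its robots.txt token is not stated here.

```python
# Illustrative lists only; a maintained list would change over time.
ALLOWED_CRAWLERS = ["Googlebot", "Bingbot", "DuckDuckBot"]
BLOCKED_AI_AGENTS = [
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "CCBot",
    "Google-Extended",
    "PerplexityBot",
]


def build_robots_txt(allowed: list[str], blocked: list[str]) -> str:
    """Render one group per agent: Allow for wanted crawlers,
    Disallow for unwanted ones, separated by blank lines."""
    groups = [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(ALLOWED_CRAWLERS, BLOCKED_AI_AGENTS))
```

Generating the file from data means that updating the block list is a one-line change, which is the practical advantage of a curated list over a pasted snippet.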
Then add a firewall for the rest. Use it to catch crawlers that ignore your file, hammer your server, or hide behind generic agents. Think of robots.txt as the front door sign that polite visitors read, and the firewall as the lock for everyone who does not. You need both, and they do different jobs, but the robots.txt layer is where you start because it is free and it stops the majority.
Frequently asked questions
- Do AI crawlers really obey robots.txt?
- The major, identifiable ones do. GPTBot, ClaudeBot, CCBot, Google-Extended, and PerplexityBot are publicly documented, their operators state that they honour robots.txt, and reporting suggests their compliance is generally good. Bots that hide their identity or ignore the standard are the exception you need a firewall for.
- Can a crawler just ignore my robots.txt?
- Yes, technically. robots.txt is a voluntary request, not enforcement. Compliant crawlers honour it by choice; a bad actor can read your rules and crawl anyway, or spoof a user-agent. That is why robots.txt alone is not a wall.
- If it can be ignored, is robots.txt even worth it?
- Absolutely. It cheaply stops the large, reputational AI operators that generate most identifiable scraping traffic and that do follow the standard. It is the highest-leverage first layer; you just pair it with a firewall for the bots that do not comply.
- How do I block the bots that ignore robots.txt?
- Use network-level enforcement: a firewall or WAF that rate-limits, challenges, or blocks traffic by IP and behaviour rather than trusting the user-agent string. robot.guard handles the robots.txt request layer; the firewall handles enforcement.
Last updated June 9, 2026