Do AI bots actually respect robots.txt?

the short answer

The major, identifiable AI crawlers, such as GPTBot, ClaudeBot, CCBot, Google-Extended, and PerplexityBot, publish their user-agent tokens and state that they honour robots.txt. The file is a voluntary request rather than an enforced wall, though: it stops the compliant majority, while a firewall is needed for bots that ignore it or hide their identity.

Scepticism is fair here. You put a few lines in a text file and trust that billion-dollar AI companies will read them and politely back off. Does that actually happen? The honest answer is mostly yes, with real caveats. The big, identifiable AI crawlers do publish their user-agent tokens and publicly commit to honouring robots.txt, and reporting on real-world traffic suggests compliance among the named operators is generally good.

But robots.txt has always been a request, not an enforcement mechanism. It works because well-behaved crawlers choose to obey it. That covers the operators that matter most, the ones whose traffic shows up clearly in your logs. It does not cover bots that ignore the standard, spoof their identity, or crawl under undisclosed user-agents. Understanding that line is the difference between a realistic robots.txt strategy and false confidence.

~49.6% of all web traffic was automated in 2023, with bad bots at roughly 32% (Imperva 2024 Bad Bot Report)

The compliant majority: who actually obeys

The AI crawlers you are most likely worried about are also the ones most likely to behave. OpenAI's GPTBot and OAI-SearchBot, Anthropic's ClaudeBot, Common Crawl's CCBot, and PerplexityBot all publish documented user-agent strings and state that they read and respect robots.txt. Google-Extended works slightly differently: it is a robots.txt token that Google's existing crawlers honour, rather than a separate crawler with its own user-agent string. These are large, reputation-conscious operators with legal and PR incentives to follow the standard, and independent reporting on crawler behaviour has generally found that the named bots do honour disallow rules.
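Because these crawlers identify themselves, you can measure how much of your traffic they account for. The sketch below is one minimal way to do that, assuming access log lines that include the request's user-agent text. The function name and the sample log lines are hypothetical, and Google-Extended is left out on the assumption that it is a robots.txt control token and does not appear in request headers.

```python
from collections import Counter

# Tokens taken from the crawlers named above. Google-Extended is omitted
# (assumption: it is a robots.txt token, not a request user-agent).
AI_CRAWLER_TOKENS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "CCBot", "PerplexityBot")


def count_ai_crawler_hits(log_lines):
    """Count log lines that mention a known AI crawler token.

    Matching is a case-insensitive substring test, so it only reflects
    what each request *claims* to be.
    """
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # count each line once
    return counts


# Hypothetical log lines, using documentation-range IP addresses.
sample_log = [
    '203.0.113.5 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - "GET /blog HTTP/1.1" 200 "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
]
hits = count_ai_crawler_hits(sample_log)
```

A count like this shows only self-declared traffic; anything crawling under a generic or faked user-agent will not appear in it.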

For these crawlers, a clean robots.txt is genuinely effective. When robot.guard writes a disallow rule for GPTBot or ClaudeBot, you are speaking the exact language those operators read, and they back off. That is the bulk of the AI scraping pressure most sites face, which is why maintaining an accurate, current block list is the single highest-leverage thing you can do.

The caveats: spoofing, undisclosed agents, and the no-wall problem

Now the uncomfortable part. Some bots ignore robots.txt entirely. Others crawl under generic or undisclosed user-agents specifically so they do not match your rules. And because robots.txt is voluntary, nothing technically stops a bad actor from reading your disallow rules and crawling anyway. A user-agent string can be faked, so a determined scraper can even impersonate a browser. There have been documented disputes about whether certain AI crawlers fully respected robots.txt in practice, which is a useful reminder that the file is a request, not a lock.
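One way to check compliance on your own site is to compare what your logs show against what your robots.txt asks for. The sketch below uses Python's standard `urllib.robotparser` for that. The `find_violations` helper and the sample data are hypothetical, and it assumes you have already parsed each log entry into a (claimed agent, path) pair.

```python
from urllib.robotparser import RobotFileParser


def find_violations(robots_txt, requests):
    """Return the (agent, path) pairs that fetched a disallowed path.

    `requests` is an iterable of (claimed_agent, path) tuples. A match
    shows only that the *claimed* agent requested a disallowed path;
    since user-agents can be faked, it does not prove who sent it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent, path in requests
        if not parser.can_fetch(agent, path)
    ]


# Hypothetical rule and hypothetical parsed log entries.
rules = "User-agent: GPTBot\nDisallow: /\n"
observed = [("GPTBot", "/pricing"), ("Googlebot", "/pricing")]
violations = find_violations(rules, observed)
```

An audit like this cannot see a scraper that presents itself as an ordinary browser, which is exactly the gap network-level enforcement has to cover.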

This is why the right mental model is layered. robots.txt handles the compliant majority cheaply and cleanly. For everything else, you pair it with enforcement at the network level: a firewall or WAF that can rate-limit, challenge, or hard-block traffic regardless of what user-agent it claims. robot.guard manages the request layer well, and is honest that the file is one layer, not the whole defense.

What a realistic strategy looks like

Start with an accurate robots.txt that allows the search and archive crawlers you want (Googlebot, Bingbot, DuckDuckBot, the Internet Archive) and disallows the AI user-agents you do not. Because the list of AI crawlers changes constantly, a maintained, curated block list beats a static snippet you copied off a forum two years ago. That alone neutralizes most identifiable AI scraping from operators that play by the rules.
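As a rough illustration, the sketch below embeds a small robots.txt of this shape and checks it with Python's standard `urllib.robotparser`. The agent names come from this article, while the exact block list, the `is_allowed` helper, and the example.com URL are assumptions for illustration, not robot.guard's actual output.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block the named AI agents, allow everyone else.
# Which agents you list is a policy choice; this list is only an example.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent, url="https://example.com/article"):
    """Report whether the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_ok = is_allowed("GPTBot")        # blocked by the first group
googlebot_ok = is_allowed("Googlebot")  # falls through to the "*" group
```

Running a check like this before you deploy a change helps catch the common mistake of accidentally blocking a search crawler you meant to keep.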

Then add a firewall for the rest. Use it to catch crawlers that ignore your file, hammer your server, or hide behind generic agents. Think of robots.txt as the front door sign that polite visitors read, and the firewall as the lock for everyone who does not. You need both, and they do different jobs, but the robots.txt layer is where you start because it is free and it stops the majority.
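To make the enforcement layer concrete, here is a minimal sketch of one thing a firewall or WAF does that robots.txt cannot: limit requests per client IP, whatever user-agent the client claims. The `SlidingWindowLimiter` class is a hypothetical illustration of a sliding-window limit, not how any particular firewall product implements it.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client IP."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: block or challenge this request
        hits.append(now)
        return True


# Example: at most 2 requests per 10 seconds per IP.
limiter = SlidingWindowLimiter(limit=2, window=10.0)
decisions = [limiter.allow("203.0.113.5") for _ in range(3)]
```

The key is the IP address rather than the user-agent, so a faked user-agent string does not help a scraper get past the limit. Real deployments usually add challenges and IP reputation data on top of a simple rate limit like this.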

frequently asked

Do AI crawlers really obey robots.txt?
The major, identifiable ones do. GPTBot, ClaudeBot, CCBot, Google-Extended, and PerplexityBot publish user-agent tokens and state they honour robots.txt, and reporting suggests their compliance is generally good. Bots that hide their identity or ignore the standard are the exception, and they are what you need a firewall for.

Can a crawler just ignore my robots.txt?
Yes, technically. robots.txt is a voluntary request, not enforcement. Compliant crawlers honour it by choice; a bad actor can read your rules and crawl anyway, or spoof a user-agent. That is why robots.txt alone is not a wall.

If it can be ignored, is robots.txt even worth it?
Absolutely. It cheaply stops the large, reputation-conscious AI operators that generate most identifiable scraping traffic and that do follow the standard. It is the highest-leverage first layer; you just pair it with a firewall for the bots that do not comply.

How do I block the bots that ignore robots.txt?
Use network-level enforcement: a firewall or WAF that rate-limits, challenges, or blocks traffic by IP and behaviour rather than trusting the user-agent string. robot.guard handles the robots.txt request layer; the firewall handles enforcement.

Last updated June 9, 2026
