How robots.txt actually works
The file is a list of rules grouped by User-agent (the name a crawler identifies itself with). Under each user-agent you write Allow and Disallow lines that name paths. A crawler that respects the standard fetches /robots.txt first, finds the block that applies to it, and stays within those rules for the rest of its visit.
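As a rough illustration of that grouping, the sketch below parses a small robots.txt with Python's standard-library `urllib.robotparser` and asks what each crawler may fetch. The rules, the `ExampleBot` and `OtherBot` names, and the `load_rules` helper are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a default group plus a group for a made-up crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: ExampleBot
Allow: /public/
Disallow: /
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# ExampleBot has its own group, so only that group's rules apply to it.
print(rules.can_fetch("ExampleBot", "https://example.com/public/page.html"))  # True
print(rules.can_fetch("ExampleBot", "https://example.com/private/"))          # False

# OtherBot has no group of its own and falls back to the "*" group.
print(rules.can_fetch("OtherBot", "https://example.com/admin/settings"))      # False
print(rules.can_fetch("OtherBot", "https://example.com/blog/"))               # True
```

One caveat: Python's parser applies the first matching rule in a group, while some crawlers pick the most specific (longest) match instead. Listing the Allow line before the broader Disallow, as here, gives the same answer under both approaches.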
Crucially, robots.txt governs crawling, not indexing or access. A Disallow tells a compliant crawler not to request a path; it does not password-protect anything, and a determined or badly behaved bot can ignore the file entirely. It is a polite, widely honoured request, powerful precisely because the bots that matter most choose to obey it.
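To make the "request, not lock" point concrete, here is a small in-memory sketch. The `PAGES` dictionary stands in for a web server, and `serve`, `polite_crawl`, `rude_crawl` and the `PoliteBot` name are all invented for illustration. The only thing keeping the draft page out of the polite crawler's hands is the check that crawler chooses to run.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""

# Stand-in for a web server. It serves whatever is requested, because
# robots.txt is never enforced on the server side.
PAGES = {
    "/": "home page",
    "/drafts/secret-plan": "unpublished draft",
}


def serve(path: str) -> str:
    return PAGES[path]


def polite_crawl(agent: str, paths: list[str]) -> list[str]:
    """A compliant crawler: checks the rules before every request."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())
    return [serve(p) for p in paths if rules.can_fetch(agent, p)]


def rude_crawl(paths: list[str]) -> list[str]:
    """A non-compliant crawler: never looks at robots.txt."""
    return [serve(p) for p in paths]


targets = ["/", "/drafts/secret-plan"]
print(polite_crawl("PoliteBot", targets))  # ['home page']
print(rude_crawl(targets))                 # ['home page', 'unpublished draft']
```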
Why it matters in the age of AI scrapers
Search engines crawl to send you visitors. AI training crawlers crawl to absorb your content into a model, and they can hit a site far more aggressively while returning no traffic at all. Because most of these AI bots publish a user-agent and say they honour robots.txt, a single well-maintained file is the simplest lever you have: allow the crawlers that help you and turn away the ones that only cost you.
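A minimal sketch of such a policy, checked with Python's `urllib.robotparser`, might look like the following. `GPTBot` and `CCBot` are used only as two examples of published AI-crawler tokens; treat the list as an assumption and check each operator's documentation for the tokens that are current. The `audit` helper is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example policy: turn away two AI crawlers, leave everyone else alone.
POLICY = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def audit(policy: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report, per user-agent, whether the policy lets it fetch `url`."""
    rules = RobotFileParser()
    rules.parse(policy.splitlines())
    return {agent: rules.can_fetch(agent, url) for agent in agents}


report = audit(POLICY, ["Googlebot", "GPTBot", "CCBot"], "https://example.com/articles/")
print(report)  # {'Googlebot': True, 'GPTBot': False, 'CCBot': False}
```

Two User-agent lines stacked above one rule set share that group, which keeps the file short as the list of crawlers to turn away grows.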
The catch is that the file has to be correct and current. A typo can accidentally block Google; an out-of-date file has no rules for this year's AI user-agents. That gap between what robots.txt can do and how hard it is to maintain by hand is exactly what a manager like robot.guard closes.
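One way to catch that kind of mistake before it ships is a small pre-deploy check. The sketch below is not how robot.guard works; it is a hypothetical `find_problems` helper that compares a file against a list of expectations, using a made-up typo where a path was cut short.

```python
from urllib.robotparser import RobotFileParser


def find_problems(robots_txt, must_allow, must_block):
    """Return human-readable mismatches between the file and expectations."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    problems = []
    for agent, path in must_allow:
        if not rules.can_fetch(agent, path):
            problems.append(f"{agent} is blocked from {path}")
    for agent, path in must_block:
        if rules.can_fetch(agent, path):
            problems.append(f"{agent} is allowed to fetch {path}")
    return problems


# The author meant "Disallow: /tmp/" but the path was cut short.
BROKEN = """\
User-agent: *
Disallow: /
"""

FIXED = """\
User-agent: *
Disallow: /tmp/
"""

EXPECTATIONS = {
    "must_allow": [("Googlebot", "/"), ("Googlebot", "/blog/post")],
    "must_block": [("Googlebot", "/tmp/cache.html")],
}

print(find_problems(BROKEN, **EXPECTATIONS))
# ['Googlebot is blocked from /', 'Googlebot is blocked from /blog/post']
print(find_problems(FIXED, **EXPECTATIONS))
# []
```

The same expectations list can cover staleness: adding a newly announced AI user-agent to `must_block` makes an out-of-date file fail the check.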
Frequently asked questions
- Where does robots.txt go?
- At the root of your domain, reachable at yoursite.com/robots.txt. Crawlers only look there; a robots.txt in a subfolder is ignored.
- Does robots.txt make a page private?
- No. It asks compliant crawlers not to fetch a path, but it does not block human visitors or bots that ignore it. For real privacy, use authentication or a firewall.
- Can robots.txt remove a page from Google?
- Not reliably. To keep a page out of search results, use a noindex robots meta tag or an X-Robots-Tag HTTP header, and leave the page crawlable. Disallowing it in robots.txt can actually leave a bare URL in results, because Google cannot fetch the page and so never sees the noindex.
- What happens if I don't have a robots.txt?
- Crawlers assume they may access everything. That is fine for many sites, but it means every AI scraper is welcome by default — which is increasingly not what owners want.
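Three of the answers above can be sketched in a few lines of Python. The `robots_url` helper and the constant names are hypothetical, and the empty rule set is only a stand-in for a site that serves no robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the one location crawlers check: /robots.txt at the host root."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yoursite.com/blog/2026/post.html?ref=home"))
# https://yoursite.com/robots.txt

# Two ways to ask search engines not to index a page that stays crawlable.
NOINDEX_META = '<meta name="robots" content="noindex">'
NOINDEX_HEADER = ("X-Robots-Tag", "noindex")

# An empty rule set, standing in for a site with no robots.txt at all.
no_rules = RobotFileParser()
no_rules.parse([])
print(no_rules.can_fetch("AnyBot", "https://yoursite.com/anything"))  # True
```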
Last updated June 9, 2026