
What is robots.txt, and why it matters more than ever

the short answer

robots.txt is a plain-text file at the root of your site (yoursite.com/robots.txt) that tells well-behaved web crawlers which paths they may and may not request. It is a request, not a lock: it works for compliant bots, such as the search engines and AI crawlers that honour it, but not for bots that ignore it.

robots.txt is one of the oldest conventions on the web and one of the least understood. It is a single plain-text file you place at the root of your domain, and almost every legitimate crawler reads it before it starts requesting pages. In a few lines it tells those crawlers which parts of your site they are welcome to visit and which to leave alone.

For years it was a quiet SEO housekeeping file. Then AI companies began crawling the entire web to build training datasets, and the same file became the front line for deciding who gets to harvest your content. Understanding what robots.txt is — and what it is not — is the first step to using it well.

1994: the year the robots.txt standard was introduced, and still the web's default crawler contract

How robots.txt actually works

The file is a list of rules grouped by User-agent (the name a crawler identifies itself with). Under each user-agent you write Allow and Disallow lines that name paths. A crawler that respects the standard fetches /robots.txt first, finds the block that applies to it, and stays within those rules for the rest of its visit.
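
To make the grouping concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules, the ExampleBot name, and the yoursite.com URLs are illustrative assumptions, not taken from any real site. One caveat: this parser applies the first matching rule in a group, while some crawlers pick the most specific match, so the sketch orders its rules so that both readings agree.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for a named bot, one catch-all group.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text the way a compliant crawler would before crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleBot matches its own group, which disallows everything.
print(parser.can_fetch("ExampleBot", "https://yoursite.com/blog/post"))  # False
# Any other bot falls through to the "*" group.
print(parser.can_fetch("OtherBot", "https://yoursite.com/blog/post"))    # True
print(parser.can_fetch("OtherBot", "https://yoursite.com/admin/users"))  # False
```

The check happens entirely on the crawler's side, which is why the file only constrains bots that choose to run it.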

Crucially, robots.txt governs crawling, not indexing or access. A Disallow tells a compliant crawler not to request a path; it does not password-protect anything, and a determined or badly-behaved bot can ignore the file entirely. It is a polite, widely-honoured request — powerful precisely because the bots that matter most choose to obey it.

Why it matters in the age of AI scrapers

Search engines crawl to send you visitors. AI training crawlers crawl to absorb your content into a model, and they can hit a site far more aggressively while returning no traffic at all. Because most of these AI bots do publish a user-agent and do honour robots.txt, a single well-maintained file is the simplest lever you have to allow the crawlers that help you and turn away the ones that only cost you.
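
As a rough illustration of that lever, the sketch below generates a robots.txt that turns away a list of AI crawlers and leaves everyone else alone. The function name render_robots_txt and the /admin/ path are hypothetical. GPTBot and CCBot are user-agent tokens published by their operators at the time of writing, but such lists change, so treat the names as an assumption to verify against each vendor's documentation.

```python
from urllib.robotparser import RobotFileParser


def render_robots_txt(blocked_agents: list[str], private_paths: list[str]) -> str:
    """Build robots.txt text: full block for listed agents, path rules for the rest."""
    lines: list[str] = []
    for agent in blocked_agents:
        # One group per blocked crawler, disallowing the whole site.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in private_paths]
    if not private_paths:
        # An empty Disallow means "nothing is disallowed".
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


robots_txt = render_robots_txt(["GPTBot", "CCBot"], ["/admin/"])
print(robots_txt)

# Sanity-check the result with the standard-library parser.
check = RobotFileParser()
check.parse(robots_txt.splitlines())
print(check.can_fetch("GPTBot", "https://yoursite.com/"))     # False
print(check.can_fetch("Googlebot", "https://yoursite.com/"))  # True
```

Generating the file from a list keeps the blocked agents in one place, which makes it easier to update when new crawlers appear.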

The catch is that the file has to be correct and current. A typo can accidentally block Google; an out-of-date file does not know this year's AI user-agents. That gap between what robots.txt can do and how hard it is to maintain by hand is exactly what a manager like robot.guard closes.
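
One way to catch that kind of typo before it ships is a small pre-deploy check, sketched here with Python's standard library. This is an illustration, not how robot.guard works; the function name find_blocked, the two sample files, and the list of crawlers to protect are all assumptions.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the agents that this robots.txt would stop from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Intended file: only the drafts folder is off limits.
INTENDED = "User-agent: *\nDisallow: /drafts/\n"
# Same file with the path accidentally truncated to "/".
TYPO = "User-agent: *\nDisallow: /\n"

SEARCH_BOTS = ["Googlebot", "Bingbot"]
HOME = "https://yoursite.com/"

print(find_blocked(INTENDED, SEARCH_BOTS, HOME))  # []
print(find_blocked(TYPO, SEARCH_BOTS, HOME))      # ['Googlebot', 'Bingbot']
```

A non-empty result for the crawlers you depend on is a signal to stop and review the file before publishing it.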

frequently asked

Where does robots.txt go?
At the root of your domain, reachable at yoursite.com/robots.txt. Crawlers only look there; a robots.txt in a subfolder is ignored.
Does robots.txt make a page private?
No. It asks compliant crawlers not to fetch a path, but it does not block human visitors or bots that ignore it. For real privacy, use authentication or a firewall.
Can robots.txt remove a page from Google?
Not reliably. To keep a page out of search results, use a noindex meta tag or an X-Robots-Tag response header, and leave the page crawlable. Disallowing it in robots.txt can actually leave a bare URL in results, because Google cannot fetch the page to see the noindex.
What happens if I don't have a robots.txt?
Crawlers assume they may access everything. That is fine for many sites, but it means every AI scraper is welcome by default — which is increasingly not what owners want.
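
Two of the answers above can be sketched in a few lines of standard-library Python: where a crawler looks for the file, and what it concludes when there are no rules. The helper name robots_url and the sample URLs are assumptions for illustration, and the empty rule set stands in for a missing file.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Map any page URL to the single robots.txt location for its host."""
    parts = urlsplit(page_url)
    # Keep scheme and host; discard the page's own path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yoursite.com/blog/2024/post?id=7"))
# https://yoursite.com/robots.txt

# With no rules at all, every crawler is allowed everywhere.
no_rules = RobotFileParser()
no_rules.parse([])
print(no_rules.can_fetch("AnyBot", "https://yoursite.com/anything"))  # True
```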

Last updated June 9, 2026
