
A developer's guide to robots.txt rules that don't bite you later

the short answer

For developers: a crawler obeys only the most specific user-agent group that matches it (not every matching group), resolves Allow vs Disallow by longest matching path, and treats paths as case-sensitive. Modern crawlers also support the * and $ wildcards. Validate the file before deploy, because a single wrong line can stop search engines crawling an entire site.

robots.txt looks trivial — a few User-agent and Disallow lines — which is exactly why it bites. The matching rules are more subtle than they appear, the precedence between Allow and Disallow surprises people, and the failure mode is severe: a wrong line can quietly stop search engines crawling your whole site. This is a developer-level reference for getting it right.

Treat the file like code: understand the resolution rules, validate it, and don't hand-edit it in production on a Friday.

longest-match: how compliant crawlers resolve a conflict between Allow and Disallow

The rules that actually decide behaviour

A crawler picks the most specific User-agent group that matches its name and obeys only that group; it does not layer the generic * rules on top. So if you have a generic User-agent: * block and a specific User-agent: Googlebot block, Googlebot follows its own block exclusively and ignores the wildcard one. (If the same crawler is named in several groups, RFC 9309 says those groups are combined, but the * group still does not apply.) Forgetting this is how people accidentally exempt a bot from rules they thought were global.
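The group selection described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular crawler is implemented: the function names parse_groups and select_rules are hypothetical, and the sketch assumes an exact, case-insensitive match on the user-agent token, whereas real crawlers apply their own product-token matching.

```python
def parse_groups(text):
    """Split robots.txt text into groups: a list of (agents, rules) pairs."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule was already seen, so this line opens a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())  # tokens compare case-insensitively
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def select_rules(groups, crawler_token):
    """Return the rules a crawler obeys: its named groups, else the * group."""
    token = crawler_token.lower()
    named = [rules for agents, rules in groups if token in agents]
    # Assumption: exact token match only. The * group is a fallback,
    # never merged into a named group.
    chosen = named or [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in chosen for rule in rules]


SAMPLE = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

if __name__ == "__main__":
    groups = parse_groups(SAMPLE)
    print(select_rules(groups, "Googlebot"))
    print(select_rules(groups, "Bingbot"))
```

With the sample file, Googlebot receives only the /drafts/ rule, so /private/ is not blocked for it. That is the accidental exemption the paragraph above warns about.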

Within a group, modern crawlers resolve Allow versus Disallow by the longest matching path, not by line order. Allow: /blog/public beats Disallow: /blog for a URL under /blog/public because it is the longer match; if an Allow and a Disallow match at equal length, RFC 9309 recommends the Allow. Paths are case-sensitive, * matches any sequence of characters, and $ anchors the end of a URL. Wildcards are well supported by Google and Bing but not by every crawler, so keep rules simple.
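As a rough sketch of longest-match resolution, the Python below evaluates a list of rules against a path. The helper names pattern_to_regex and is_allowed are hypothetical, and pattern length is measured as the raw character count of the rule, which is a simplification of what real crawlers do (for example around percent-encoding).

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then join the pieces with "any sequence".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """rules is a list of (kind, pattern), kind being 'allow' or 'disallow'."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):  # case-sensitive match
            is_allow = kind == "allow"
            # The longer pattern wins; on a tie, prefer Allow.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


if __name__ == "__main__":
    rules = [("disallow", "/blog"), ("allow", "/blog/public")]
    for path in ("/blog/public/post", "/blog/private", "/Blog/private"):
        print(path, is_allowed(rules, path))
```

Because the result depends only on pattern length, reordering the rules list does not change the outcome, which is the property that distinguishes longest-match from first-match parsers.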

Keep it boring and validated

The safest robots.txt is an unclever one: explicit per-bot groups, minimal wildcards, and no rule you can't explain. Validate every change against a tester before it ships, because there is no runtime error for a bad robots.txt — it just silently changes what gets crawled. Many teams also keep the file in version control so changes are reviewed like any other deploy.
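One way to approximate "validate before it ships" is a small lint step in CI. The sketch below is illustrative only and is not a substitute for a real tester: lint_robots and the KNOWN_FIELDS set are assumptions made for this example (Crawl-delay, for instance, is a non-standard field that only some crawlers honour), and the checks cover just a few common mistakes.

```python
# Assumption: the fields this hypothetical project allows in its robots.txt.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(text):
    """Return a list of (line_number, message) warnings for robots.txt text."""
    warnings = []
    current_agents = []
    seen_rule = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue
        if ":" not in line:
            warnings.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            warnings.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                current_agents, seen_rule = [], False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            seen_rule = True
            if not current_agents:
                warnings.append((number, "rule appears before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                warnings.append((number, "path should start with '/' or '*'"))
            if field == "disallow" and value == "/" and "*" in current_agents:
                warnings.append((number, "blocks the whole site for all crawlers"))
    return warnings


if __name__ == "__main__":
    draft = "User-agent: *\nDisalow: /tmp\nDisallow: /\nAllow: blog\n"
    for number, message in lint_robots(draft):
        print(f"line {number}: {message}")
```

Failing the build when lint_robots returns any warning turns a silent crawl change into a visible error, which is the gap the paragraph above describes.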

robot.guard fits this workflow by removing the hand-editing class of bugs: each toggle emits a correctly-scoped group, the preview is the exact file, and the curated AI list means you are not pasting user-agents from a dozen docs pages. You download a validated file and commit it, instead of editing live and hoping.

how it works

  1. group by user-agent

    Write one explicit group per crawler; remember only the most specific group applies.

  2. mind precedence

    Resolve conflicts with longest-match Allow/Disallow, not line order.

  3. validate before deploy

    Run the file through a tester, because there's no error message for a broken robots.txt.

  4. version it

    Commit the generated file so robots.txt changes get reviewed like code.

frequently asked

Do all crawlers merge matching user-agent groups?
No. Compliant crawlers obey only the single most specific matching group. Rules in a less-specific group (like *) are ignored once a more specific group matches.
Is Allow or Disallow stronger?
Neither by default — modern crawlers use the longest matching path. A more specific Allow overrides a broader Disallow and vice versa.
Are robots.txt paths case-sensitive?
Yes. /Blog and /blog are different paths. The user-agent token, however, is matched case-insensitively.
Can I comment robots.txt?
Yes. Everything from a # to the end of the line is a comment, so a comment can sit on its own line or after a rule. robot.guard adds a generated-by comment so it's clear the file is managed.

Last updated June 9, 2026
