robots.txt vs noindex: which actually removes a page from Google?

Crawling versus indexing: the distinction that trips everyone up

Search engines do two separate jobs. First they crawl: a bot fetches the page and reads its content. Then they index: the engine decides whether to store that page and show it in results. robots.txt only touches the first stage: a Disallow rule asks a compliant crawler not to fetch the page, so the bot never downloads it or reads anything on it.

The noindex directive lives at the second stage. It is an instruction placed on the page itself (an HTML meta robots tag, or an X-Robots-Tag HTTP header) that tells the search engine: you may look at this, but do not keep it in your index. Because crawling and indexing are different steps, blocking one does not control the other, and that is exactly where the classic trap is sprung.
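As a rough illustration of the two places a noindex directive can live, the sketch below checks a page's HTML and its response headers using only the Python standard library. The names RobotsMetaParser and has_noindex are made up for this example, and the logic is deliberately simplified: it ignores bot-specific forms such as a meta tag named after one crawler or a header value prefixed with a user agent.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html, headers):
    """Return True if the meta robots tag or X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)

    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    return "noindex" in parser.directives or "noindex" in header_directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page, {}))                                       # True
print(has_noindex("<html></html>", {"X-Robots-Tag": "noindex"}))   # True
print(has_noindex("<html></html>", {}))                            # False
```

Either signal is only useful if the crawler is allowed to fetch the response in the first place, since both the tag and the header arrive with the page.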

Why blocking a page in robots.txt can backfire

Here is the catch. If you Disallow a page in robots.txt, the crawler never fetches it, so it never sees the noindex tag sitting inside the page. Google can still learn that the URL exists from links pointing to it, so it may index the address anyway and show a bare result with no snippet, and at most a title inferred from link text, because it was not allowed to read the content. You end up with the opposite of what you wanted.

To genuinely remove a page from search, you do the reverse of most people's instinct: leave the page crawlable, add a noindex tag, and wait for Google to recrawl and drop it. Only after the page has fallen out of the index is it safe to block it in robots.txt if you also want to save crawl budget. Block first and you lock the noindex behind a door the bot cannot open.
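The ordering problem can be sketched as a small audit that combines the two signals. This is a minimal example, assuming you already know whether the page carries noindex; removal_status and its return strings are invented for illustration. It relies on urllib.robotparser from the standard library, which applies the first matching rule and so only approximates how Google resolves conflicting Allow and Disallow lines.

```python
from urllib.robotparser import RobotFileParser


def removal_status(robots_txt, url, page_has_noindex, user_agent="Googlebot"):
    """Classify a URL by whether its noindex can actually be seen by the crawler."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(user_agent, url)

    if page_has_noindex and crawlable:
        return "ok: crawler can fetch the page and read noindex"
    if page_has_noindex and not crawlable:
        return "trap: Disallow hides the noindex tag from the crawler"
    if not crawlable:
        return "blocked only: URL may still be indexed from links"
    return "indexable: no noindex and no Disallow"


# Hypothetical robots.txt and URLs, for illustration only.
robots = "User-agent: *\nDisallow: /private/\n"
print(removal_status(robots, "https://example.com/private/old.html", True))
print(removal_status(robots, "https://example.com/old.html", True))
```

The first call reports the trap described above, the second the setup that lets the page drop out of the index.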

So when do you actually want robots.txt?

robots.txt shines at crawl control, deciding which bots may fetch which URLs, not at cleaning up search results. Use it to stop crawlers wasting time on infinite faceted-search URLs, staging directories, or heavy endpoints, and to turn away AI crawlers like GPTBot, ClaudeBot, and CCBot that you do not want training on your content, bearing in mind that only compliant bots honour the file. These are jobs about who gets to fetch what, which is precisely what robots.txt was designed for.
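A crawl-control file of this kind might be assembled and sanity-checked as follows. This is only a sketch: build_robots_txt is a hypothetical helper, the bot list holds just the three crawlers named above rather than a complete list, and the paths /staging/ and /search are placeholders for whatever your site wants to keep bots away from.

```python
from urllib.robotparser import RobotFileParser

# Only the bots named in the text; a real list would be longer and maintained.
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]


def build_robots_txt(blocked_bots, disallowed_paths):
    """Block the listed bots entirely; restrict everyone else to the given paths."""
    lines = []
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(AI_BOTS, ["/staging/", "/search"])
print(robots_txt)

# Verify the result the way a compliant crawler would read it.
rules = RobotFileParser()
rules.parse(robots_txt.splitlines())
print(rules.can_fetch("GPTBot", "https://example.com/article"))           # False
print(rules.can_fetch("Googlebot", "https://example.com/article"))        # True
print(rules.can_fetch("Googlebot", "https://example.com/staging/draft"))  # False
```

Nothing here affects indexing: the generated file only decides who may fetch what, which is the job robots.txt is suited to.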

A tool like robot·guard makes that side straightforward: whitelist the search and social bots you want, block AI crawlers from a curated, maintained list, add custom rules, preview the result live, and download the file. Pair that with noindex tags applied page by page for anything you want kept out of search results, and each tool does the job it is actually good at.

robots.txt vs the noindex meta tag at a glance

	robots.txt (Disallow)	noindex meta tag
What it controls	Crawling: whether a bot fetches the page	Indexing: whether the page appears in results
Where it lives	A single file at your site root	On the page itself (meta tag or X-Robots-Tag header)
Removes a page from Google?	No: it can even leave a bare URL indexed	Yes: this is the intended way to drop a page from results
Requires the page to be crawlable?	No: it blocks crawling by definition	Yes: the bot must fetch the page to see it
Best for	Crawl control, saving crawl budget, blocking AI scrapers	Keeping specific pages out of search results

Frequently asked questions

I blocked a page in robots.txt but it still shows in Google. Why?

robots.txt only stops crawling, not indexing. Google can still index the URL from links pointing to it, showing a bare result with no snippet because it was not allowed to read the page. Use noindex instead, and leave the page crawlable so Google can see the tag.

Can I use robots.txt and noindex together on the same page?

Not at the same time for the same goal. If you block the page in robots.txt, the crawler never reads the noindex tag. Apply noindex first, wait for the page to drop from the index, and only then consider blocking it in robots.txt.

What is the right way to remove a page from search results?

Add a noindex meta robots tag (or X-Robots-Tag header) to the page, make sure the page is NOT disallowed in robots.txt, and wait for Google to recrawl it. Once it leaves the index, you can block it in robots.txt if you also want to save crawl budget.

Does robots.txt control access to a page?

No. robots.txt is a request that compliant crawlers honour; it controls crawling, not access or indexing. Anyone with the URL can still open the page, and non-compliant bots can ignore the file entirely. Use authentication for true access control.

Last updated June 9, 2026
