The AI crawler user-agents to know in 2026

the short answer

The AI crawler user-agents to know in 2026 include OpenAI's GPTBot and OAI-SearchBot, Anthropic's ClaudeBot and anthropic-ai, Google's Google-Extended, Common Crawl's CCBot, Meta's Meta-ExternalAgent, Apple's Applebot-Extended, Perplexity's PerplexityBot, ByteDance's Bytespider, and Amazon's Amazonbot, among others. Because the list changes constantly, a maintained tool beats a static copy-pasted snippet.

If you want to control which AI systems can crawl your site, you first need to know who is knocking. AI crawlers identify themselves with user-agent strings, and each operator runs one or more tokens for different purposes: some gather training data, some power live answer engines, some build public datasets. Knowing which is which lets you make deliberate choices instead of blanket-blocking or blanket-allowing.

Below is a rundown grouped by operator. Treat it as a snapshot for 2026, because the single most important thing to understand about this list is that it does not hold still. Operators add tokens, rename them, and split crawling from training into separate agents. A static list you copy once is out of date the moment someone ships a new bot, which is exactly why a maintained, curated tool is worth more than a one-time snippet.

~32% of all web traffic came from bad bots in 2023, the highest share Imperva had recorded (Imperva 2024 Bad Bot Report).

Grouped by operator: who runs what

OpenAI runs GPTBot for training data and OAI-SearchBot for its search and answer features. Anthropic runs ClaudeBot, with anthropic-ai also appearing in the wild as an associated token. Google uses Google-Extended as its dedicated AI-training opt-out token, separate from Googlebot. Common Crawl operates CCBot, whose archives feed a huge range of downstream AI training because so many models train on Common Crawl data.

Beyond those, Meta crawls with Meta-ExternalAgent, Apple uses Applebot-Extended as its AI-training control, Perplexity runs PerplexityBot for its answer engine, ByteDance operates Bytespider, and Amazon runs Amazonbot. Smaller but relevant operators include Cohere (cohere-ai), Diffbot, and ImagesiftBot. robot.guard ships all of these on a curated, maintained block list so you do not have to track each operator's announcements yourself.
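To see which of these operators are actually visiting a site, one option is to scan the User-Agent values in your access logs for the tokens above. The following is a minimal sketch under stated assumptions, not a definitive detector: the function names are hypothetical, the matching is a plain case-insensitive substring check, and Google-Extended and Applebot-Extended are left out on the assumption that they are robots.txt control tokens rather than strings sent in request headers.

```python
from collections import Counter
from typing import Iterable, Optional

# Token -> operator, taken from the rundown above (a 2026 snapshot).
# Assumption: Google-Extended and Applebot-Extended are omitted because they
# are treated here as robots.txt control tokens, not request user-agents.
AI_CRAWLER_TOKENS = {
    "GPTBot": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "CCBot": "Common Crawl",
    "Meta-ExternalAgent": "Meta",
    "PerplexityBot": "Perplexity",
    "Bytespider": "ByteDance",
    "Amazonbot": "Amazon",
}


def match_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the first known token found in a User-Agent value, else None."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


def count_ai_hits(user_agents: Iterable[str]) -> Counter:
    """Count requests per operator across an iterable of User-Agent values."""
    hits: Counter = Counter()
    for ua in user_agents:
        token = match_ai_crawler(ua)
        if token is not None:
            hits[AI_CRAWLER_TOKENS[token]] += 1
    return hits
```

User-agent strings can be spoofed, so counts like these are a rough indication of who is crawling rather than proof of identity.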

Training, answer engines, and datasets are not the same

It helps to sort these by purpose. Training agents like GPTBot, Google-Extended, and Applebot-Extended cover content used to train or improve models. Answer-engine crawlers like OAI-SearchBot and PerplexityBot fetch pages to generate live answers and cite sources, which is closer to search than to training. Dataset crawlers like CCBot build public corpora that many other companies then train on, so CCBot is effectively a training crawler one step removed, and a single rule blocking it has outsized reach.

The distinction matters because your goals might differ by category. You may be happy to appear in answer engines that cite and link you while refusing to be used as raw training material, or you may want to block both. Whitelisting the legitimate search crawlers (Googlebot, Bingbot, DuckDuckBot, the Internet Archive) while disallowing the AI training agents is a common, sensible split, and it is the default posture robot.guard is designed around.
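The category split described above can be sketched as a small generator that allows the search crawlers and disallows only the purposes you choose. This is an illustration under stated assumptions: the function name and the purpose labels are hypothetical, the token grouping follows this article's snapshot, and the Internet Archive is left out of the allowlist because its token is not given in the text.

```python
from typing import Iterable

# Purpose -> tokens, following this article's grouping (an assumption that
# will go stale as operators add or rename tokens).
TOKENS_BY_PURPOSE = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"],
    "answer-engine": ["OAI-SearchBot", "PerplexityBot"],
    "dataset": ["CCBot"],
}

# Search crawlers named in the text; the Internet Archive token is omitted
# rather than guessed.
SEARCH_ALLOWLIST = ["Googlebot", "Bingbot", "DuckDuckBot"]


def build_robots_txt(block_purposes: Iterable[str]) -> str:
    """Build robots.txt text that allows search crawlers and blocks the chosen purposes."""
    lines = []
    for agent in SEARCH_ALLOWLIST:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for purpose in block_purposes:
        for token in TOKENS_BY_PURPOSE[purpose]:
            lines += [f"User-agent: {token}", "Disallow: /", ""]
    # Join groups with blank lines and end with a single trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    # Block training and dataset crawlers, stay visible to answer engines.
    print(build_robots_txt(["training", "dataset"]))
```

Passing `["training", "dataset"]` keeps answer engines unblocked, while adding `"answer-engine"` blocks all three categories. Compliance with robots.txt is voluntary, so the file expresses a preference rather than enforcing one.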

Why a maintained list beats a copy-paste

A robots.txt snippet you copied from a blog post is frozen in time. The AI crawler landscape is not. New tokens appear, operators split training from search into separate agents, and casing or naming changes can quietly break a hand-edited rule. When that happens, your static file silently stops covering bots it was supposed to catch, and you have no way of knowing until you audit it.
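One way to run that audit is to parse the published file and ask, for each token you expect to be blocked, whether it may still fetch the site root. The sketch below uses Python's standard `urllib.robotparser`; the function name, the probe URL, and the sample token list are assumptions for illustration, and the standard-library parser's matching may differ in detail from how any given crawler interprets the file.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def uncovered_tokens(
    robots_txt: str,
    expected_blocked: Iterable[str],
    probe_url: str = "https://example.com/",  # hypothetical URL; only the path is checked
) -> List[str]:
    """Return the tokens that the given robots.txt text still allows to fetch probe_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in expected_blocked if parser.can_fetch(token, probe_url)]


if __name__ == "__main__":
    # A stale snippet that predates the split of search from training agents.
    stale = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n"
    print(uncovered_tokens(stale, ["GPTBot", "CCBot", "OAI-SearchBot"]))
```

Running a check like this whenever the token list changes turns silent drift into a visible list of gaps.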

A maintained tool solves this by keeping the curated list current and letting you preview the exact file before you publish. With robot.guard you toggle the operators you want to block, see the precise robots.txt output live, and download it, so your file stays aligned with the real-world list instead of drifting out of date. The free editor covers this; Pro adds accounts and cloud-synced configs if you manage multiple sites.

Common AI crawler user-agents in 2026, grouped by operator and purpose.

| User-agent | Operator | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | AI model training |
| OAI-SearchBot | OpenAI | Search and answer engine |
| ClaudeBot | Anthropic | AI model training |
| Google-Extended | Google | AI training opt-out token |
| CCBot | Common Crawl | Public dataset that feeds many models |
| PerplexityBot | Perplexity | Answer engine fetching and citation |
| Meta-ExternalAgent | Meta | AI data collection |
| Applebot-Extended | Apple | AI training control token |
| Bytespider | ByteDance | AI data collection |
| Amazonbot | Amazon | Crawling and AI data collection |

frequently asked

What are the main AI crawler user-agents in 2026?
The ones to know include GPTBot and OAI-SearchBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google), CCBot (Common Crawl), Meta-ExternalAgent (Meta), Applebot-Extended (Apple), PerplexityBot (Perplexity), Bytespider (ByteDance), and Amazonbot (Amazon), plus cohere-ai (Cohere), Diffbot, and ImagesiftBot.

What is the difference between a training crawler and an answer-engine crawler?
Training crawlers like GPTBot, Google-Extended, and CCBot gather content to train models. Answer-engine crawlers like OAI-SearchBot and PerplexityBot fetch pages to generate live, cited answers, which is closer to search. You can choose to block one category and allow the other.

Why not just copy a robots.txt block list from a blog?
Because the list changes constantly. Operators add tokens, rename them, and split training from search. A copied snippet goes stale and silently stops covering new bots. A maintained, curated tool keeps the list current and lets you preview the exact file.

Does CCBot matter more than other crawlers?
It has outsized reach. Common Crawl builds a public dataset that many other companies train models on, so blocking CCBot can remove your content from numerous downstream training pipelines at once, not just one company's models.

Last updated June 9, 2026
