
AI bots accounted for 4.2% of all HTML requests across Cloudflare's network in 2025, peaking at 6.4% in late June — and AI traffic grew 187% in a single year. Meanwhile, of the major crawlers, essentially only Googlebot (and Applebot) actually executes JavaScript. The others (GPTBot, ClaudeBot, PerplexityBot, Bytespider) fetch your pages but never run a line of your code.
If your mental model of crawlers still starts and ends with Googlebot, it's already obsolete. There are 30+ AI bots active on the open web in 2026, falling into three operationally different categories. Each demands a different robots.txt rule, and each rewards different optimizations.
This guide is the technical and practical map. By the end, you'll know exactly which bots are hitting your site, what they want, how to allow the ones that drive traffic and block the ones that don't, and how to structure content so AI engines actually cite you.
The three categories at a glance:
- Training crawlers — scrape the web to build foundation models. (GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent)
- Retrieval bots — fetch URLs in real time when an LLM needs to ground an answer. (OAI-SearchBot, Claude-SearchBot, PerplexityBot)
- AI agents — autonomous browsers acting on behalf of a user. (ChatGPT Atlas, Perplexity Comet, Claude Computer Use)
A single "block all AI" rule is now the wrong default. Each category does something different on your site, and each has a different impact on your business — so each deserves a different decision.
Purpose: scrape the web to build training datasets for foundation models. They behave like classic spiders — breadth-first traversal, sitemap-driven, polite-ish.
Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), Bytespider (ByteDance), Meta-ExternalAgent, Google-Extended (which is actually a control token, not a separate crawler), and Applebot-Extended.
Business implication: blocking these costs zero AI search visibility. The one nuance is CCBot, which feeds Common Crawl — many open-source models train on this dataset, so blocking it has indirect effects.
Purpose: fetch URLs in real time when an LLM needs to ground an answer or build a search index. They are bursty, query-driven, and often fetch the same authoritative pages repeatedly.
Examples: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), PerplexityBot and Perplexity-User, DuckAssistBot, Amazonbot, MistralAI-User, and YouBot (full table below).
Business implication: these are the bots that drive citations and referral traffic. Block them and you go invisible in AI search. The default for most SMBs and publishers should be: allow.
Purpose: autonomous browsing on behalf of a user — book a flight, fill out a form, compare prices, complete a checkout.
Examples: ChatGPT Atlas / Agent (OpenAI), Perplexity Comet, Claude Computer Use, Google Project Mariner, and signed agents like Browserbase, Anchor, and Block's Goose.
Behavior: agents act like real browsers. Some now sign their requests with Web Bot Auth (HTTP Message Signatures, RFC 9421), which lets servers cryptographically verify the requester. ChatGPT Agent already signs every request.
HUMAN Security observed a 6,900% increase in agentic browser traffic since July 2025, and agentic e-commerce surged 144.7% during Black Friday–Cyber Monday 2025. This is the fastest-growing category by percentage.

Training dominates AI crawler volume today, but user-action traffic grew >15× in 2025.
Bookmark this section. Three tables — training crawlers, retrieval bots, and agents — covering every AI bot you'll likely see in your logs.
| Bot | Operator | User-agent token | Respects robots.txt? |
|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Yes (verified) |
| ClaudeBot | Anthropic | ClaudeBot | Yes (per docs) |
| Google-Extended | Google | Google-Extended | Yes (control token; no Search impact) |
| Applebot-Extended | Apple | Applebot-Extended | Yes (control token only) |
| Bytespider | ByteDance / TikTok | Bytespider | No — documented violations |
| CCBot | Common Crawl | CCBot | Yes |
| Meta-ExternalAgent | Meta | meta-externalagent | Yes |
| FacebookBot | Meta | FacebookBot | Yes |
| Diffbot | Diffbot | Diffbot | Configurable |
| AI2Bot | Allen Institute | AI2Bot | Yes |
| cohere-ai | Cohere | cohere-ai | Yes |
| DeepSeekBot | DeepSeek | DeepSeekBot | Inconsistent |
| Bot | Operator | User-agent token | Drives citations? |
|---|---|---|---|
| OAI-SearchBot | OpenAI | OAI-SearchBot | Yes — ChatGPT Search index |
| ChatGPT-User | OpenAI | ChatGPT-User | Yes — user-initiated fetches |
| Claude-SearchBot | Anthropic | Claude-SearchBot | Yes — Claude search index |
| Claude-User | Anthropic | Claude-User | Yes — user-initiated fetches |
| PerplexityBot | Perplexity | PerplexityBot | Yes — index builder |
| Perplexity-User | Perplexity | Perplexity-User | Yes (may bypass robots.txt) |
| DuckAssistBot | DuckDuckGo | DuckAssistBot | Yes |
| Amazonbot | Amazon | Amazonbot | Yes — Search + Alexa |
| Google-CloudVertexBot | Google | Google-CloudVertexBot | Yes — Vertex grounding |
| MistralAI-User | Mistral | MistralAI-User | Yes |
| YouBot | You.com | YouBot | Yes |
| Bingbot | Microsoft | Bingbot | Dual-purpose: Bing + Copilot |
| Agent | Operator | Detection method | Signs requests? |
|---|---|---|---|
| ChatGPT Atlas | OpenAI | Chrome 142 UA + 'ChatGPT Atlas' on favicon fetches | No |
| ChatGPT Agent | OpenAI | Web Bot Auth signature | Yes |
| Perplexity Comet | Perplexity | Chromium UA; internal extension visible in DOM | No |
| Claude Computer Use | Anthropic | API-driven; hard to distinguish from browser | No |
| Project Mariner | Google | Cloud-based VM (AI Ultra subscribers) | No |
| Browserbase / Anchor / Goose | Various | Web Bot Auth signature | Yes |
Two things almost every blog post gets wrong:
- Google-Extended is NOT a separate crawler. It's a control token; the crawling is done by Googlebot. The same applies to Applebot-Extended.
- ChatGPT Atlas is currently almost indistinguishable from Chrome. The only reliable identifier is a 'ChatGPT Atlas/.../CFNetwork/.../Darwin/...' substring sent only on favicon fetches.
AI crawlers discover URLs the same way classic crawlers do: sitemaps, internal and external links, prior crawls, and Common Crawl seed lists. Retrieval bots additionally take URLs from upstream search engines — Bing for ChatGPT search, Brave for Claude search, an internal index for Perplexity.
Training: breadth-first, sitemap-driven, full traversal at low frequency.
Retrieval: query-bursty, narrow URL set per session, frequent revisits to authoritative pages.
User-action: tracks live human chat patterns. Cloudflare's 2025 data shows distinct weekday peaks and weekend dips for ChatGPT-User — a clear giveaway in your logs.
This is the single most important technical fact in this guide. Vercel analyzed billions of crawler requests across its customer base and found:
GPTBot, ClaudeBot, PerplexityBot, Bytespider, and Meta-ExternalAgent all fetch JavaScript files but never execute them. They download the .js file, count it as a request, and move on. Only Googlebot (and by extension Gemini's grounding) and Applebot render JavaScript.
ChatGPT crawlers spend 57.7% of fetches on HTML, 11.5% on JS files (downloaded but never run), and 31% on images, CSS, and other resources. Claude spends 35% of fetches on images and 23.8% on unexecuted JS. The 404 rate exceeds 34% for both — they request many URLs that no longer exist.
The implication is non-negotiable: if your site is a client-side-rendered SPA, ChatGPT and Claude see effectively nothing. Server-side rendering, static export, or pre-rendering is required for AI visibility.

Five of the seven major AI crawlers do not execute JavaScript. Server-side rendering is mandatory.
1. A user asks the AI a question.
2. The LLM identifies candidate URLs via its own search index, an upstream search engine, or training data.
3. A retrieval bot (ChatGPT-User, Claude-User, Perplexity-User) fires HTTP fetches to those URLs.
4. Returned HTML is passed through a content extractor (boilerplate removal, semantic block detection).
5. Content is chunked (typical chunks: 200–1,000 tokens with 10–20% overlap), embedded, and either fed directly into the LLM context window or stored in a vector database for retrieval.
6. The LLM generates an answer grounded in retrieved chunks, often with inline citations linking back to your URL.
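The chunking step above can be sketched in a few lines. This is an illustrative sketch, not any vendor's actual pipeline: it uses whitespace-split words as a stand-in for model tokens, with a 500-token chunk size and 15% overlap picked from the ranges quoted above.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap_ratio: float = 0.15) -> list[str]:
    """Split text into overlapping chunks, as in the retrieval pipeline above.

    Whitespace words stand in for model tokens (an illustrative assumption;
    real extractors count tokenizer tokens, not words).
    """
    tokens = text.split()
    # Advance by chunk_size minus the overlap, so consecutive chunks share
    # roughly overlap_ratio of their content.
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final window already reached the end of the text
    return chunks
```

Each chunk is then embedded and either injected into the context window or written to a vector store; the overlap keeps sentences that straddle a chunk boundary retrievable from both sides.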
A single regex grep covers the entire current landscape. Run this against any Apache or Nginx access log to inventory AI bot traffic:
```shell
grep -E "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|meta-externalagent|CCBot|Applebot|Amazonbot|DuckAssistBot|MistralAI-User|YouBot|Bingbot|cohere-ai|AI2Bot|ChatGPT Atlas" access.log
```
If 404 rates are above 34% for ChatGPT or Claude, your sitemap is stale or you have broken redirect chains — fixing those is a free visibility win.
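If you want per-bot counts and 404 rates in one pass rather than a raw grep, a short script over the same access log works. A sketch assuming the standard combined log format (status code after the quoted request, user-agent as the final quoted field); the token list is a subset of the tables above.

```python
import re
from collections import Counter

# Subset of the user-agent tokens from the tables above; extend as needed.
BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User", "Bytespider",
    "meta-externalagent", "CCBot", "Amazonbot", "DuckAssistBot",
]

# Combined log format: ... "GET /path HTTP/1.1" STATUS SIZE "referer" "user-agent"
# The status code follows the quoted request; the UA is the last quoted field.
LINE_RE = re.compile(r'" (\d{3}) .*"([^"]*)"$')

def ai_bot_stats(lines):
    """Return {bot_token: (total_requests, total_404s)} for known AI bots."""
    hits, misses = Counter(), Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # malformed or non-combined-format line
        status, user_agent = m.group(1), m.group(2).lower()
        for token in BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                if status == "404":
                    misses[token] += 1
                break  # count each request once
    return {bot: (hits[bot], misses[bot]) for bot in hits}
```

Feed it a file with `ai_bot_stats(open("access.log"))` and flag any bot whose 404 share approaches the ~34% baseline noted above.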
49–53% of internet traffic is now bots (Imperva 2025 Bad Bot Report; Thales 2026).
Cloudflare CEO Matthew Prince predicted at SXSW 2026 that AI bots will exceed human traffic by 2027.
Lumen CEO Kate Johnson stated in April 2026 that AI bots already exceed 50% of total internet traffic.
This is the most-cited stat in 2025–2026 GEO discourse, and it's worth dwelling on. Cloudflare publishes per-platform crawl-to-refer ratios — how many times an AI platform crawls your site relative to how many human visits it sends back:
| Platform | Peak ratio (2025) | Recent ratio (mid-2025) |
|---|---|---|
| Anthropic | 500,000 : 1 (January) | ~38,000 : 1 (July) |
| OpenAI | 3,700 : 1 (March) | ~1,100 : 1 (July) |
| Perplexity | 195 : 1 | 32 : 1 (News & Publications) |
| | 30 : 1 (April) | 9 : 1 (July) |
| DuckDuckGo | <1 : 1 | Sends more than it crawls |
| Mistral | <1 : 1 | Sends more than it crawls |
Plain-English meaning: for every visitor an AI sends you, it crawls your site thousands of times. The economics of "crawl in exchange for traffic" that defined classic SEO no longer hold for most AI platforms — though Anthropic's ratio improved 87% after Claude added web search in March/May 2025, which is the direction the whole industry needs to move.
Vercel's crawler traffic study captures the absolute volume reality across its customer base in a single month:

GPTBot alone generates 569M monthly requests on Vercel — roughly 12% of Googlebot's volume.
Combined, the four major AI crawlers add up to roughly 28% of Googlebot's request volume — meaning AI traffic is no longer a rounding error in your infrastructure budget. Smaller sites (under 5,000 pages) saw GPTBot alone consume 14% of total CPU in a 30-day 2026 study.
OpenAI — publishes IP ranges at openai.com/gptbot.json, searchbot.json, and chatgpt-user.json. ChatGPT Agent signs requests via Web Bot Auth.
Anthropic — publishes IP ranges via the Help Center; respects robots.txt and Crawl-delay directives.
Google, Apple, Amazon, Common Crawl, DuckAssistBot — generally compliant.
Perplexity — Cloudflare's August 2025 investigation "Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives" found Perplexity rotating ASNs and spoofing a Chrome-on-macOS user-agent when blocked. Cloudflare delisted Perplexity from its Verified Bots program. Reddit's October 2025 lawsuit alleged "industrial-scale, unlawful circumvention" via Google SERPs scraping — Reddit set up a hidden honeypot post that Google indexed and Perplexity surfaced within hours.
Bytespider (ByteDance / TikTok) — researchers measured Bytespider scraping at 25× the rate of GPTBot and 3,000× the rate of ClaudeBot, with no respect for robots.txt. After widespread backlash, its share of AI crawling dropped from 14.1% to 2.4% between July 2024 and July 2025.
Anthropic ClaudeBot (2024 incidents) — iFixit reported nearly one million ClaudeBot hits in 24 hours in July 2024; Linux Mint and Read the Docs reported similar episodes. Anthropic stated it respects robots.txt and the crawls stopped after iFixit added a Crawl-delay directive. Cloudflare's August 2025 stealth-crawler analysis explicitly cleared Anthropic and contrasted ChatGPT and Anthropic compliance with Perplexity's evasion.
NYT v. OpenAI — motion to dismiss denied March 2025; could define the boundary between fair use and "market substitution" for AI training.
Reddit v. Perplexity, Oxylabs, SerpApi, AWMProxy — novel "data laundering" theory using Google SERPs as an indirect scraping channel.
EU AI Act (effective 2026) — requires AI providers to publicly summarize training data sources, easing rightsholders' ability to detect ingestion.
The most rigorous academic study on AI visibility — Aggarwal et al., "GEO: Generative Engine Optimization," peer-reviewed at KDD 2024 — tested nine optimization methods across a 10,000-query benchmark. Three techniques consistently produced 30–40% visibility lifts:

Statistics, citations, and quotations boost AI visibility 30–40%. Keyword stuffing actively hurts.
A subtler finding: lower-ranked pages (around position 5 in Google) saw 115% visibility improvement in AI engines after applying these methods, while position-1 pages saw little change. GEO levels the playing field for non-position-1 content — meaning sites that can't outrank an entrenched competitor on Google can still win on AI.
View Page Source on your three highest-traffic pages right now. If your text content isn't in the initial HTML response, ChatGPT and Claude can't see it. Fixes: Next.js (SSR / ISR / SSG), Nuxt, Astro, SvelteKit, Angular Universal, or pre-rendering services like Prerender.io.
Implement in tier order, JSON-LD only:
Tier 1 (highest leverage): Organization (entity disambiguation), Article with author / datePublished / dateModified, FAQPage, HowTo.
Tier 2: Product + Offer + Brand (commerce), Review / AggregateRating, Person (author bios), LocalBusiness.
Connect entities with @id and sameAs. Match schema content to visible page content — don't cloak.
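A minimal Tier-1 sketch: an Article tied to its publishing Organization through @id, with sameAs for entity disambiguation. All names, dates, and URLs are placeholders; adapt them to your own entities and keep every value consistent with the visible page.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://example.com/#org",
      "name": "Example Co",
      "url": "https://example.com/",
      "sameAs": ["https://www.linkedin.com/company/example-co"]
    },
    {
      "@type": "Article",
      "headline": "An Example Article",
      "author": { "@type": "Person", "name": "Jane Doe" },
      "datePublished": "2026-01-15",
      "dateModified": "2026-03-01",
      "publisher": { "@id": "https://example.com/#org" }
    }
  ]
}
</script>
```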
Reality check: Bing Copilot and Google AI Overviews explicitly use schema. ChatGPT, Claude, and Perplexity usage is inferred, not officially confirmed. Schema is necessary infrastructure but doesn't compensate for weak content.
Front-load. 44.2% of LLM citations come from the first 30% of a page. Strong claim-rich introductions get cited 2.1× more often.
Use bullet and numbered lists. 78% of AI answers include list formats.
Add visible "Last verified: [month year]" stamps. Pages refreshed within 90 days are 3.1× more likely to appear in AI answers (Ahrefs).
Keep FAQ answers in the 40–60 word range — the cleanest extraction window for AI engines.
Cite credible third-party sources. The Princeton study found that citing other authoritative sources increases your own citation rate.
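The 40–60 word window is easy to enforce in a content pipeline. A trivial helper, purely illustrative; the thresholds are the ones quoted above.

```python
def faq_extraction_check(answer: str, lo: int = 40, hi: int = 60) -> str:
    """Flag FAQ answers outside the 40-60 word extraction window."""
    n = len(answer.split())
    if n < lo:
        return f"too short ({n} words): add specifics"
    if n > hi:
        return f"too long ({n} words): trim to the extraction window"
    return f"ok ({n} words)"
```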
A defensible "allow search, opt out of training" robots.txt — copy, customize, and deploy:
```
# Allow AI search & retrieval
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Opt out of generative AI training
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /
```
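Before deploying, the policy is worth sanity-checking with Python's built-in robots.txt parser. The sketch below exercises a subset of the rules; note it only models compliant bots, so a Bytespider rule is a statement of intent, not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# A subset of the policy above: enough to exercise both allow and disallow.
POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

url = "https://example.com/pricing"  # example.com is a placeholder
assert rp.can_fetch("OAI-SearchBot", url)       # retrieval: allowed
assert rp.can_fetch("PerplexityBot", url)       # retrieval: allowed
assert not rp.can_fetch("GPTBot", url)          # training: opted out
assert not rp.can_fetch("Bytespider", url)      # training: opted out (on paper)
print("policy behaves as intended")
```

Swap in your full robots.txt and the bot list from your logs to catch a mistyped token before it silently allows (or blocks) the wrong crawler.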
You can't optimize what you don't measure. Here's the stack ranked by effort:
DIY (free): the regex grep above against your access logs. Pipe into wc -l for daily counts; export to CSV for trend analysis.
Cloudflare AI Crawl Control — free on all plans. Surfaces per-crawler request counts, robots.txt violations, and one-click block/allow decisions per bot.
Vercel — log drains plus middleware; Profound Agent Analytics is a one-click integration on Vercel Marketplace that parses logs and verifies user-agents against published IP ranges.
Fastly AI Bot Management — detect, block, intercept, or deceive options; integrates with TollBit for AI bot monetization.
Crawl volume per bot, per day or week
Crawl-to-refer ratio per platform — mirror Cloudflare's methodology.
Top URLs by AI bot traffic — reveals which content the AI considers authoritative.
404 rate per bot — high values signal stale sitemaps or broken redirect chains.
Training-to-retrieval ratio for each crawler family.
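The two ratio metrics reduce to simple division once you have the counts. A sketch with made-up numbers (placeholders, not benchmarks); referral counts would come from your web analytics, crawl counts from the log inventory above.

```python
# Illustrative 30-day counts; these are placeholders, not benchmarks.
crawls = {"GPTBot": 120_000, "OAI-SearchBot": 9_000, "ChatGPT-User": 4_500}
ai_referrals = 110  # human visits whose referrer was the AI platform

def crawl_to_refer(crawl_total: int, refer_total: int) -> float:
    """Crawls per referred human visit (the Cloudflare-style ratio)."""
    return crawl_total / max(refer_total, 1)

def training_to_retrieval(training: int, retrieval: int) -> float:
    """Training fetches per retrieval fetch within one crawler family."""
    return training / max(retrieval, 1)

platform_crawls = sum(crawls.values())
ratio = crawl_to_refer(platform_crawls, ai_referrals)
ttr = training_to_retrieval(
    crawls["GPTBot"], crawls["OAI-SearchBot"] + crawls["ChatGPT-User"]
)
```

Track both monthly: a falling crawl-to-refer ratio means a platform is starting to pay you back in traffic, and a falling training-to-retrieval ratio means its fetches are increasingly citation-driven.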
Pay-per-crawl economics. Cloudflare's HTTP 402 + crawler-price header model went live in private beta in July 2025. TollBit (with Fastly) is the publisher-side micropayment alternative. Whether AI companies pay at scale, or route around via grey-market scrapers, is the open question.
Web Bot Auth (built on HTTP Message Signatures, RFC 9421). Cryptographic verification displacing IP+UA checks. ChatGPT Agent already signs every request; expect Anthropic, Google, and Microsoft to follow. The Web Bot Auth profile itself will likely be ratified as an IETF RFC in 2026–2027.
IETF AIPREF Working Group. Standardizing AI usage preferences as a robots.txt extension. Cloudflare's Content Signals Policy (search / ai-input / ai-train) is the de facto starting point.
llms.txt. At ~0.3% adoption, with no major LLM publicly using it, this remains a low-cost speculative bet, not a confirmed signal.
Agent traffic. Both Cloudflare and Lumen publicly predict AI bots exceed human traffic by 2027.
The question is no longer whether to "let AI in." It's which AI bots, for what purpose, and how to structure your content so the ones that actually send traffic find and cite you. Here's the order of operations:
1. Audit your robots.txt against the per-bot tables in Part 2. Remove deprecated tokens (Claude-Web, anthropic-ai), add explicit per-bot rules, and split decisions by category.
2. Run the regex grep against the past 30 days of access logs to baseline who is actually crawling.
3. Test SSR. View Page Source on your three highest-traffic pages. If text is missing, fix that before anything else.
4. Implement Tier-1 schema (Organization, Article, FAQPage, HowTo) in JSON-LD. Validate with Google's Rich Results Test.
5. Apply Princeton GEO methods to top-priority pages: add specific statistics, embed expert quotations, cite credible sources, ensure clear heading hierarchy, front-load claim-rich intros.
6. Decide your Content Signals policy (search=yes, ai-train=no, ai-input=?) and turn on Cloudflare-managed robots.txt or your equivalent.
7. Set up bot analytics — Cloudflare AI Crawl Control, Profound, or DIY log drains. Track crawl-to-refer ratio per platform monthly.
8. Schedule a quarterly content refresh on top pages — freshness is a 3.1× citation multiplier.
9. Re-verify all bots via published IP ranges and reverse DNS. Deploy Web Bot Auth verification if your CDN supports it.