Your robots.txt file used to be simple. A few directives for Googlebot, maybe a block on an admin directory, and you were done. That era ended. The AI crawler landscape saw a significant shift between May 2024 and May 2025, with GPTBot surging from 5% to 30% of AI crawler share.
The monthly volume of AI-driven traffic nearly tripled over the course of 2025, with training crawlers accounting for the majority of it at 67.5%.
The problem isn't just volume. It's complexity. OpenAI now runs a three-tier structure with GPTBot, OAI-SearchBot, and ChatGPT-User.
Anthropic has updated its crawler documentation to formally list three separate web-crawling bots - ClaudeBot, Claude-User, and Claude-SearchBot - each with a distinct purpose and distinct consequences when blocked.
Perplexity operates a two-bot system: PerplexityBot for indexing and Perplexity-User for real-time retrieval. A blanket "block all AI crawlers" rule now carries real costs - and so does leaving everything open. This guide walks through the exact configuration decisions you need to make, bot by bot, directive by directive.
## The Three-Tier Crawler Architecture You Need to Understand
Before touching your robots.txt, you need to understand the structural shift happening across every major AI company. They're splitting their crawlers into distinct tiers, and each tier has different implications for your content.

**Training crawlers** collect content to build and improve AI foundation models. GPTBot and ClaudeBot fall here. ClaudeBot collects publicly available web content to contribute to the training and improvement of Anthropic's generative AI models. When a site restricts ClaudeBot access, it signals that the site's future material should be excluded from AI model training datasets - but it does not retroactively remove content already collected.

**Search crawlers** index your content so it can appear in AI-powered search results. Each setting is independent of the others: a webmaster can allow OAI-SearchBot in order to appear in search results while disallowing GPTBot to keep crawled content out of OpenAI's foundation-model training. Claude-SearchBot serves the same function in Anthropic's ecosystem.

**User-triggered fetchers** retrieve pages in real time when someone asks a question. ChatGPT-User and Claude-User activate only when a human makes a specific request. Here's where it gets tricky: OpenAI and Perplexity draw a sharper line for user-initiated fetchers, warning that robots.txt rules may not apply to ChatGPT-User and generally don't apply to Perplexity-User.
Anthropic says all three of its bots honor robots.txt, including Claude-User - a contrast with OpenAI and Perplexity's approach.
This three-tier split means your robots.txt now requires at least three separate decisions per AI company: whether to allow training, whether to allow search indexing, and whether to allow real-time retrieval.
## The Complete User-Agent Reference for 2026
Getting the user-agent strings right is non-negotiable. A misspelled token or a deprecated string renders your directive useless.
### OpenAI's Crawler Fleet
GPTBot generates the most AI crawler traffic among OpenAI's bots. OpenAI operates three distinct crawlers - GPTBot for training, OAI-SearchBot for search indexing, and ChatGPT-User for real-time RAG retrieval. Each crawler respects robots.txt directives independently.
- GPTBot - Training data collection. User-agent token: `GPTBot`
- OAI-SearchBot - Powers ChatGPT search results. User-agent token: `OAI-SearchBot`
- ChatGPT-User - Fires when a human asks ChatGPT to browse. User-agent token: `ChatGPT-User`
OpenAI updated its documentation in December 2024 to add a note that GPTBot and OAI-SearchBot share information to avoid duplicate crawling when both are permitted. This is worth knowing: if you allow both, OpenAI won't hit your server twice for the same page.
All can be verified through their published IP ranges at openai.com/gptbot.json, openai.com/searchbot.json, and openai.com/chatgpt-user.json. This gives you a secondary verification layer beyond user-agent strings.
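If you want to automate that check, the published files are JSON lists of CIDR prefixes, and the stdlib `ipaddress` module can test membership. This is a minimal sketch - the prefix below is a documentation placeholder, and the exact JSON field names should be confirmed against the live files:

```python
import ipaddress

# Illustrative stand-in for the published JSON; in production, fetch
# the live file (e.g. openai.com/gptbot.json) and use its real prefixes.
published = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}]}

def is_published_ip(ip: str, doc: dict) -> bool:
    """Return True if `ip` falls inside any published crawler prefix."""
    networks = [
        ipaddress.ip_network(entry["ipv4Prefix"])
        for entry in doc["prefixes"]
        if "ipv4Prefix" in entry
    ]
    return any(ipaddress.ip_address(ip) in net for net in networks)

print(is_published_ip("192.0.2.44", published))   # inside the range
print(is_published_ip("198.51.100.7", published)) # outside the range
```

A request claiming a crawler's user agent from an address outside the published ranges is almost certainly an impersonator.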
### Anthropic's Three-Bot Framework
Anthropic's crawler page now lists ClaudeBot (training data collection), Claude-User (fetching pages when Claude users ask questions), and Claude-SearchBot (indexing content for search results) as separate bots, each with its own robots.txt user-agent string.
- ClaudeBot - Training data. User-agent token: `ClaudeBot`
- Claude-SearchBot - Search indexing. User-agent token: `Claude-SearchBot`
- Claude-User - Real-time user-triggered retrieval. User-agent token: `Claude-User`
The previous version of Anthropic's crawler page referenced only ClaudeBot. Before ClaudeBot, Anthropic operated under the Claude-Web and Anthropic-AI user agents, both now deprecated. If your robots.txt still references Claude-Web or anthropic-ai, those directives are doing nothing for you against the current crawlers.

One notable distinction: Anthropic supports the non-standard Crawl-delay extension to robots.txt. If you allow ClaudeBot but want to limit server impact, you can add Crawl-delay: 10 to space requests 10 seconds apart. This is useful for websites with limited server resources. Not all crawlers respect crawl-delay, but ClaudeBot does.
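As a concrete example, an allow-with-throttle stanza for ClaudeBot looks like this:

```
User-agent: ClaudeBot
Allow: /
Crawl-delay: 10
```

Place the Crawl-delay line inside the ClaudeBot group; directives apply only to the group they follow.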
### Perplexity's Two-Bot System
- PerplexityBot - Indexes content for Perplexity's search engine. User-agent token: `PerplexityBot`
- Perplexity-User - Real-time retrieval for user queries. User-agent token: `Perplexity-User`
Perplexity requires extra scrutiny. Even if a page is blocked, Perplexity may still index the domain, headline, and a brief factual summary. PerplexityBot indexes pages similarly to other search engines, and Perplexity does not build foundation models.
However, Perplexity has faced serious trust issues. Cloudflare observed that Perplexity uses not only its declared user agents but also a generic browser user agent intended to impersonate Google Chrome on macOS when its declared crawler was blocked. Both the declared and undeclared crawlers attempted to access content, rotating through IPs and different ASNs to further evade website blocks.
Based on this behavior, Cloudflare de-listed Perplexity as a verified bot and added heuristics to their managed rules that block this stealth crawling.
This means robots.txt alone may not be sufficient for Perplexity. If blocking Perplexity is important to you, consider supplementing robots.txt with WAF rules or server-level blocks.
### Other AI Crawlers Worth Tracking
Your robots.txt shouldn't stop at the big three. In 2026, you need to target at least 10–15 user-agent strings to cover the major AI crawlers. Key additions include:
- Google-Extended - Controls whether your content trains Google's AI models (separate from Googlebot for search)
- Meta-ExternalAgent - Meta's AI data collection crawler; the second-highest volume AI crawler, yet it sends zero referral traffic back to publishers
- Bytespider - ByteDance's crawler
- CCBot - Common Crawl's bot, a frequent source for training datasets
- Applebot-Extended - Apple's AI training crawler
- Amazonbot - Amazon's data collection bot
## Making the Strategic Decision: Block, Allow, or Go Surgical
This is where most guides oversimplify. The right approach depends on your business model, and the data strongly favors a surgical strategy over a binary one.
### Why Blanket Blocking Is Costly
Research published December 31, 2025 by academics from Rutgers Business School and The Wharton School found that publishers blocking AI crawlers via robots.txt experienced a total traffic decline of 23.1% in monthly visits and a 13.9% decline in human-only browsing. Blocking appears to reduce traffic without reliably reducing AI citation rates. That last point is counterintuitive but critical. A BuzzStream study of 4 million AI citations found that among sites blocking ChatGPT-User, 70.6% still appeared in citations. Among sites blocking Google-Extended, 92.3% still appeared. Roughly 95% of ChatGPT citations came from sites blocking training bots. AI systems find ways to reference your content even when you block them - through cached data, third-party references, and other data sources.
### Why Blanket Allowing Has Costs Too
The crawl-to-refer ratio tells the other side of the story. ClaudeBot crawls 23,951 pages per referral. GPTBot's ratio is 1,276:1. For every visitor Anthropic sends back to your site, its crawler has consumed nearly 24,000 pages of your server resources.
According to Cloudflare Radar, 89.4% of AI crawler traffic is training or mixed-purpose. Only 8% is search-related, and just 2.2% responds to actual user queries in real time. Allowing everything means giving away bandwidth to crawlers that may never send a single visitor back.
### The Surgical Approach: Block Training, Allow Search
The most popular strategy in 2026 is blocking training-specific user agents like GPTBot, Google-Extended, CCBot, and ClaudeBot to prevent your content from entering model training datasets, while allowing retrieval user agents like ChatGPT-User, Claude-User, and PerplexityBot so your content can appear in real-time AI search answers.
Block training-focused bots while explicitly allowing search and user-action bots. You'll block 89.4% of the extractive traffic while preserving the 10.2% that could send actual visitors.
This framework serves most businesses. But content publishers with proprietary research or paywalled content may need to go further - blocking all AI crawlers except those from platforms where they have licensing agreements.
## Writing Your robots.txt: Syntax That Actually Works
Syntax errors in robots.txt are common, silent, and devastating. Even experienced webmasters make mistakes that render the file ineffective, and crawlers won't tell you they're ignoring your malformed directives.
### The Balanced Configuration
Here's a practical robots.txt for the surgical approach - blocking training crawlers, allowing search and retrieval:
```
# Traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# OpenAI: Block training, allow search and retrieval
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic: Block training, allow search and retrieval
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity: Allow (search-focused, no foundation model training)
User-agent: PerplexityBot
Allow: /

# Google AI training: Block
User-agent: Google-Extended
Disallow: /

# Other training crawlers: Block
User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Default
User-agent: *
Disallow: /admin/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml
```
### Critical Syntax Rules
Every User-agent line needs at least one Allow or Disallow directive following it. A User-agent line with nothing beneath it is silently ignored.
Don't insert blank lines between a User-agent line and its directives - some parsers treat a blank line as the end of that group, orphaning the rules below it. Also remember that bots treat folder and file names as case-sensitive. If your folder is named `/Blog/` but you block `/blog/`, the crawler ignores your directive entirely.
One common misconception is that rules are read in order from top to bottom. Modern crawlers like Googlebot evaluate rules according to their degree of specificity - more specific rules have priority over general ones.
Some people add `User-agent: *` with `Disallow: /` thinking it only blocks AI bots. This actually blocks ALL crawlers, including Googlebot, which destroys your traditional search rankings. Always use specific user agent names for AI bots.
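You can sanity-check these rules before deploying with Python's stdlib `urllib.robotparser`. It's a simplified parser (it substring-matches user-agent tokens, so it won't mirror every production crawler exactly), but it catches orphaned groups and blanket-block mistakes:

```python
from urllib import robotparser

robots_txt = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Training blocked, AI search allowed, traditional search untouched.
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/"))         # False
print(rp.can_fetch("OAI-SearchBot", "https://yoursite.com/blog/"))  # True
print(rp.can_fetch("Googlebot", "https://yoursite.com/blog/"))      # True
```

If the Googlebot check ever comes back False, you've accidentally written the blanket block described above.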
### The Path-Level Approach for Mixed Content
If you want AI search crawlers to see your blog and documentation but not your proprietary case studies or pricing, use path-specific rules:
```
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /docs/
Allow: /features/
Disallow: /pricing/
Disallow: /case-studies/
Disallow: /dashboard/
```
This gives you granular control without an all-or-nothing tradeoff.
## Beyond robots.txt: When You Need Stronger Enforcement {#beyond-robots-txt-when-you-need-stronger-enforcement}
The instructions in a robots.txt file cannot enforce crawler behavior; it's up to each crawler to obey them. Googlebot and other reputable web crawlers follow the instructions, but others might not. Robots.txt is a request, not a lock.
TollBit's report reveals that 13% of AI bot requests in Q4 2025 bypassed robots.txt files, and the share of bots ignoring these guardrails jumped 400% from Q2 to Q4 2025. The compliance gap is widening.
### Server-Level Blocks
For Apache servers, you can use `.htaccess` rules that return 403 responses to specific user agents. For Nginx, conditional blocks based on `$http_user_agent` can reject requests outright. These are harder to evade than robots.txt because they deny the HTTP request entirely rather than relying on voluntary compliance.
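As a sketch of the Nginx approach (the bot list here is illustrative - tune it to your own policy), a `map` on `$http_user_agent` plus an early `return 403` rejects flagged crawlers before any content is served:

```nginx
# In the http {} context: flag requests whose User-Agent matches
# known training crawlers (case-insensitive regex match).
map $http_user_agent $ai_training_bot {
    default       0;
    ~*gptbot      1;
    ~*claudebot   1;
    ~*bytespider  1;
    ~*ccbot       1;
}

server {
    listen 80;
    server_name yoursite.com;

    # Reject flagged crawlers outright - no voluntary compliance needed.
    if ($ai_training_bot) {
        return 403;
    }
}
```

Unlike a robots.txt directive, this denies the request at the HTTP layer, so only user-agent spoofing can get around it.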
### Cloudflare's Managed Approach
Managed robots.txt for AI crawlers is available on all Cloudflare plans. If your website does not have a robots.txt file, Cloudflare creates a new file with managed block directives and serves it for you. Cloudflare also offers WAF rules that can block or rate-limit AI crawlers at the network level - a more reliable enforcement mechanism than robots.txt alone.
### Automated robots.txt Services
Known Agents (formerly Dark Visitors) offers robots.txt files that update as new agents emerge, so you never have to manage them yourself or make manual edits. This solves a real problem: The AI industry launches new crawlers regularly, and existing ones sometimes change their user agent strings.
## Monitoring and Maintaining Your Configuration {#monitoring-and-maintaining-your-configuration}
Setting up your robots.txt is the first step. Keeping it current requires ongoing work.
### Check Your Server Logs
Run a grep across your access logs for the key AI user-agent strings. A command like `grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|google-extended" access.log` shows which AI crawlers are actually visiting and how frequently. If a bot you've blocked still appears, it's ignoring your directive and you need server-level enforcement.
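If you'd rather tally hits per crawler than eyeball raw grep output, a short script can do it. This is a minimal sketch assuming the common combined log format, with the user agent quoted at the end of each line (the sample lines are made up for illustration):

```python
import re
from collections import Counter

AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
]

def tally_ai_crawlers(log_lines):
    """Count access-log lines per AI crawler user-agent token."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if re.search(re.escape(bot), line, re.IGNORECASE):
                counts[bot] += 1
    return counts

sample = [
    '1.2.3.4 - - [05/Jan/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.1"',
    '5.6.7.8 - - [05/Jan/2026:10:00:01 +0000] "GET /docs/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0; ClaudeBot/1.0"',
]
print(tally_ai_crawlers(sample))  # GPTBot and ClaudeBot each counted once
```

Run it over the full log and compare the counts against your robots.txt: any blocked bot with a nonzero count is ignoring your directives.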
### Audit Quarterly
A quarterly review is the minimum. The AI industry launches new crawlers regularly, and existing ones sometimes change their user agent strings. Set a calendar reminder to check community resources like Known Agents and the official documentation from OpenAI, Google, and Anthropic. When a major new AI platform launches, add its crawler to your robots.txt within the first week.
Your audit should verify three things: deprecated strings have been removed, new crawlers have been added, and your existing directives still match your business goals. Both Claude-Web and Anthropic-AI are now deprecated. If your robots.txt still references those deprecated strings, you may have outdated and ineffective directives in place.
### Track Referral Traffic by Source
In January 2026, ChatGPT still generates about 80% of all AI referral traffic. But the landscape shifts fast. Google Gemini more than doubled its referral traffic to websites between November 2025 and January 2026, according to SE Ranking's analysis of more than 101,000 sites. Monitor your analytics for referral traffic from `chat.openai.com`, `perplexity.ai`, `gemini.google.com`, and `claude.ai` to see which platforms actually send visitors to your site. That data should drive your robots.txt decisions.
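A quick way to bucket that referral traffic in your own analytics pipeline - the hostname list is an assumption based on the platforms named above, so extend it as the landscape shifts:

```python
from urllib.parse import urlparse

# Hostname-to-platform map; illustrative, update as platforms change.
AI_REFERRERS = {
    "chat.openai.com": "ChatGPT",
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "claude.ai": "Claude",
}

def classify_referrer(referrer_url: str):
    """Return the AI platform behind a referrer URL, or None."""
    host = urlparse(referrer_url).hostname or ""
    return AI_REFERRERS.get(host)

print(classify_referrer("https://chat.openai.com/c/abc123"))  # ChatGPT
print(classify_referrer("https://www.google.com/search"))     # None
```

Aggregating these labels over time shows which platforms actually send visitors, which is the data that should drive your robots.txt decisions.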
## The Trade-Off That Defines Your Strategy {#the-trade-off-that-defines-your-strategy}
The AI crawler decision is fundamentally an economic one. Traffic from AI is still relatively small - only 0.1% of total web traffic - but it's growing fast, and early research showed that AI referral visitors convert better than traditional search visitors.
One study found AI search traffic converts at 14.2% compared to Google's 2.8%.
At the same time, crawl-to-refer ratios directly affect the viability of publishing content on the Internet, and the trend is toward ever more crawls per referral.
The right configuration depends on whether you sell content or sell through content. If your articles, data, or analysis *are* the product - as for news publishers, research firms, or data providers - blocking training crawlers while preserving search visibility is the defensible choice. For everyone else, unless you're the New York Times, the downside of invisibility far outweighs theoretical IP concerns. The sophisticated 2026 approach isn't binary - it's surgical.
If you sell products or services and content drives awareness, the math flips. Maximum AI visibility means maximum surface area for recommendation. Block the most aggressive training crawlers with the worst crawl-to-refer ratios (Meta-ExternalAgent, Bytespider). Allow everything that might put your name in an AI-generated answer. Whatever you choose, stop treating your robots.txt like a set-and-forget file. It's now a living policy document - one that determines whether 89% of AI crawler traffic hits your servers for nothing, or whether the 11% that matters can find you when it counts.
Ready to optimize for the AI era?
Get a free AEO audit and discover how your brand shows up in AI-powered search.
Get Your Free Audit