GEO · Sep 17, 2025 · 12 min read

The Difference Between AI Training and AI Retrieval - And Why It Matters for Your robots.txt

Capconvert Team

Content Strategy

TL;DR

Your robots.txt file used to do one job: tell search engine crawlers which pages to index and which to skip. That era is over. Right now, your site is being visited by a growing fleet of AI bots, and they don't all want the same thing.

Some of these bots are scraping your content to train large language models. Others are fetching your pages in real time to cite you in AI-powered search answers. Most AI companies now run both kinds: one set of crawlers gathers broad data for training the model, while another performs real-time crawling for Retrieval-Augmented Generation (RAG), pulling in fresh pages when the AI needs up-to-the-minute information.

The distinction isn't academic. A blanket "block all AI bots" directive in your robots.txt could erase your visibility from ChatGPT Search, Perplexity, and Google AI Overviews. A blanket "allow everything" approach hands your content to model training pipelines with no attribution, no link back, and no compensation. Training crawlers scrape content to feed large language model development: your articles, guides, and resources become part of the AI's knowledge base, and you don't get paid or, often, even credited. The publishers making smart decisions in this space treat these as two entirely separate questions and write their robots.txt rules accordingly.

Hostinger's analysis of 66.7 billion bot requests showed OpenAI's search crawler coverage growing from 4.7% to over 55% of sites in their sample, even as its training crawler coverage dropped from 84% to 12%. The gap is widening. Websites are choosing to welcome retrieval bots while shutting out training crawlers, and if you haven't made that choice deliberately, one is being made for you by default.

## What AI Training Crawlers Actually Do to Your Content

AI training crawlers have a singular goal: absorb your content into a model's weight parameters. When GPTBot, ClaudeBot, CCBot, or Google-Extended visit your pages, they're not indexing you for search. They're gathering text that will be compressed, mixed with billions of other documents, and baked into the next version of a language model.

AI models draw on two primary mechanisms: training data and real-time retrieval. Training data represents what the model learned during development, which is why models have a knowledge cutoff date. Once your content enters a training corpus, it becomes embedded knowledge: the model can generate answers influenced by your writing without ever referencing your URL again.

This is a one-way extraction. Your text becomes part of the model's knowledge, but you get no attribution, no link, and no traffic. Your expertise informs the model, but your brand gets none of the credit. For publishers and businesses that depend on organic traffic, this creates a measurable economic problem: you're investing in content that makes someone else's product more capable while your own traffic plateaus.

The scale of training crawls is significant. Requests generated by GPTBot and Claude in just one month of late 2024 amounted to about 20% of Googlebot's requests over the same timeframe. These aren't trivial visits. They consume bandwidth, generate server load, and deliver zero direct return.

## How Retrieval and Search Crawlers Work Differently

Retrieval crawlers operate on an entirely different model. Instead of bulk-scraping content for future model training, they fetch specific pages in real time to power AI search features: when someone asks an AI assistant a question, these crawlers retrieve the relevant pages to inform the response. This is where your traffic opportunity lives.

These crawlers function more like traditional search engine bots than training scrapers. They index your content so that when a user asks ChatGPT, Perplexity, or Claude a question related to your expertise, the system can retrieve your page, extract the relevant answer, and cite you as a source, often with a link back.

For those tracking AI visibility, the retrieval bot category is what to watch. Training blocks affect future models, while retrieval blocks affect whether your content shows up in AI answers right now.

The behavioral difference matters for your server too. Unlike training bots, these crawlers fetch content on demand to answer specific user queries; they serve users directly rather than building datasets, which may explain their expanding access. Retrieval crawls are targeted and user-triggered: they don't blanket-crawl your entire site, they request only the specific URLs relevant to the question being asked.

## The Three-Tier Bot Architecture You Need to Understand

The major AI companies have split their crawlers into distinct tiers, and understanding this architecture is the prerequisite for writing an effective robots.txt. OpenAI established the pattern with a three-bot system: GPTBot (training), OAI-SearchBot (search indexing), and ChatGPT-User (user-initiated retrieval). Anthropic mirrors it, and Perplexity does to a lesser extent.

OpenAI offers the cleanest separation. Each setting is independent of the others: a webmaster can allow OAI-SearchBot in order to appear in search results while disallowing GPTBot to signal that crawled content should not be used to train OpenAI's generative AI foundation models.

Anthropic now runs a similar three-bot system with ClaudeBot (training), Claude-SearchBot (search indexing), and Claude-User (real-time user requests). Anthropic says all three of its bots honor robots.txt, including Claude-User.

Perplexity operates a simpler two-bot structure: PerplexityBot for indexing and Perplexity-User for real-time retrieval. There is no separate training crawler because Perplexity does not maintain its own large-scale training pipeline in the same way. However, Perplexity's compliance record deserves scrutiny. Although Perplexity initially crawls from its declared user-agent, when presented with a network block it appears to obscure its crawling identity in an attempt to circumvent the website's preferences. Cloudflare documented this behavior and subsequently de-listed Perplexity as a verified bot.

Google presents the most complicated case. Google-Extended lets you opt out of Gemini training without affecting your search rankings: a site can block AI training via Google-Extended while still having its content used in AI Overviews every time a relevant query is searched. But there is no robots.txt mechanism to opt out of AI Overviews specifically while maintaining organic search visibility. Blocking Googlebot removes you from search entirely.

### User-Initiated Fetchers: The Gray Zone

The third tier, user-initiated fetchers like ChatGPT-User and Perplexity-User, occupies a legal and technical gray zone. The most consequential recent change involves ChatGPT-User, the crawler that executes user-initiated actions within ChatGPT: OpenAI removed language indicating this crawler would comply with robots.txt rules. The logic is that when a human asks ChatGPT to browse a specific page, it's functionally equivalent to a person clicking a link, not an automated crawl.

OpenAI and Perplexity draw a sharper line for user-initiated fetchers, warning that robots.txt rules may not apply to ChatGPT-User and generally don't apply to Perplexity-User. This means your robots.txt controls are strongest against training crawlers, moderately effective against search crawlers, and weakest against user-triggered fetchers.

## What Happens When You Block Everything: The Data

Many publishers responded to the AI crawling wave by blocking all AI user-agents. A BuzzStream study of the robots.txt files of 100 top news sites across the US and UK reveals the consequences: 79% block at least one training bot.

But 71% also block retrieval bots that affect AI citations.

The irony is that blocking everything doesn't fully prevent citations. 82.4% of sites that block OAI-SearchBot still appear in the dataset's AI citations, as do 88.2% of the sites that block GPTBot. BuzzStream found that blocking AI crawlers rarely stops citation entirely, and the reasons are complex: content already in training data persists, and AI systems may access content through other pathways, including cached versions, third-party reprints, and syndication.

Here's the uncomfortable nuance: being part of the training data might make AI systems more likely to cite you. If a model has already learned your brand, your frameworks, and your terminology through training, it's more likely to reference you in responses, even when it retrieves current information from other sources.

Blocking AI crawlers protects your content from being used as training data, but it also makes your publication progressively less visible in AI responses. Your historical content still exists in older training snapshots, but new reporting (your competitive advantage) becomes invisible.

## A Strategic robots.txt Configuration (Not a Template)

The right configuration depends on your business model, but the emerging best practice is clear: block training, allow retrieval. Websites are allowing search bots while blocking training bots, and the gap is widening.

Here's the decision framework, broken down by crawler function.

Block these (training crawlers):

- GPTBot - OpenAI's model training crawler
- ClaudeBot - Anthropic's training data collector
- Google-Extended - Gemini training (separate from search)
- CCBot - Common Crawl, used by many AI companies
- Bytespider - ByteDance/TikTok training crawler
- Meta-ExternalAgent - Meta's AI training crawler

Allow these (search and retrieval crawlers):

- OAI-SearchBot - ChatGPT search results
- ChatGPT-User - user-initiated browsing (note: may not respect robots.txt)
- Claude-SearchBot - Claude's search indexing
- PerplexityBot - Perplexity's search index
- DuckAssistBot - DuckDuckGo AI answers (doesn't train models)

Make a deliberate, category-by-category decision about which bots to allow and which to block-separating training bots from search and retrieval bots-based on your business goals and AI search visibility strategy.

A selective robots.txt configuration would look like this:

```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow AI search and retrieval crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
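Before deploying a configuration like this, you can sanity-check it with Python's standard-library robots.txt parser. A minimal sketch (the URL and the trimmed-down robots.txt are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The selective policy: block the training crawler, allow the search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The training crawler is blocked everywhere; the search crawler is allowed.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))        # False
print(parser.can_fetch("OAI-SearchBot", "https://example.com/blog/post")) # True
```

A bot with no matching `User-agent` group defaults to allowed, which is exactly why "absence means allowed" matters when you audit your file.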

### Partial Access: Protecting Premium Content

You don't have to make binary decisions. Allow/Disallow directives work at the directory level. A publisher might allow AI search crawlers access to blog content while blocking them from premium reports:

```
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /resources/
Disallow: /premium/
Disallow: /reports/
```


This approach lets your publicly available content generate AI citations and referral traffic while protecting content you monetize directly.
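The same stdlib parser can confirm the directory-level split behaves as intended (a quick sketch with placeholder paths; note that Python's parser applies rules in file order rather than by longest match, so overlapping rules can resolve differently than in Google's matcher):

```python
from urllib.robotparser import RobotFileParser

# Premium-content split for the search crawler, as in the example above.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /resources/
Disallow: /premium/
Disallow: /reports/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/blog/post", "/resources/guide", "/premium/report-2025", "/reports/q3"):
    allowed = parser.can_fetch("OAI-SearchBot", f"https://example.com{path}")
    print(path, "->", "allowed" if allowed else "blocked")
```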

## Beyond robots.txt: Content Signals and Enforcement

Robots.txt has a fundamental limitation: it expresses your preferences, it does not technically prevent crawling. Some crawler operators may simply disregard your directives and crawl your content regardless of what the file says.

Cloudflare's Content Signals Policy, launched in late 2025, attempts to solve this by adding a semantic layer on top of robots.txt. The policy defines three content signals (search, ai-input, and ai-train) and their relevance to crawlers. The `ai-input` signal is particularly significant because it addresses the gap between training and traditional search: it governs whether your content can be used as input for real-time AI-generated answers, like Google AI Overviews or ChatGPT search summaries.
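In practice, a signal declaration sits inside a robots.txt user-agent group. A sketch of what such a block might look like, following the policy's published three-signal vocabulary (the signal values here are illustrative, and the explanatory preamble comment Cloudflare's managed feature generates is omitted):

```
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```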

The default configuration is `search=yes` and `ai-train=no`.

Cloudflare customers have already turned on the managed robots.txt feature for over 3.8 million domains.

Content Signals represent where this space is heading, but they remain advisory: they express preferences, not technical countermeasures against scraping. Some companies might simply ignore them. For actual enforcement, you need network-level controls: CDN bot management, WAF rules based on IP verification, or dedicated anti-bot solutions from providers like Cloudflare, Fastly, or Akamai.

### The Invisible Crawlers Problem

Even perfect robots.txt configuration has blind spots. Some major crawlers are effectively invisible to robots.txt. ChatGPT Atlas (OpenAI), for example, uses a standard Chrome user-agent string with no identifying token; it blends in with normal browser traffic and cannot be distinguished via robots.txt.

Many agentic browsers intentionally resemble standard Chromium at the fingerprinting level, so user-agent checks often won't reliably differentiate agent-driven sessions from normal Chrome traffic. As AI agents become more browser-like, robots.txt becomes less effective as a sole control mechanism. Think of it as necessary but insufficient: the foundation of a strategy, not the entire strategy.

## How to Audit Your Current Setup

Most sites have never deliberately configured AI crawler access. The default state (no specific AI bot rules in robots.txt) means everything is allowed. If you're on Cloudflare, the default may have flipped: on July 1, 2025, Cloudflare flipped a switch that changed how 20% of the public web interacts with AI systems, and every new Cloudflare domain now blocks all known AI crawlers by default.

Start with a four-step audit:

1. **Check your current robots.txt.** Pull up `yoursite.com/robots.txt` and look for any AI-specific user-agent directives. Note which bots are blocked and which are absent (absence means allowed).
2. **Review your CDN settings.** If you use Cloudflare, check whether the managed robots.txt feature is active. It may be blocking bots you want to allow, and enabling AI crawler access requires changing the relevant settings.
3. **Check server logs.** GPTBot is consistently the highest-volume AI crawler across all site niches, followed by PerplexityBot and OAI-SearchBot on pages covering AI, SaaS, and technical topics. Look at your access logs to see which bots actually visit your site, how frequently, and which pages they request.
4. **Remove deprecated user-agents.** Before ClaudeBot, Anthropic operated under the Claude-Web and Anthropic-AI user-agents, both now deprecated. If your robots.txt still references `Claude-Web` or `Anthropic-AI`, update it to target the current bot names.
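For step 3, a small script can tally AI crawler hits from your access logs. A minimal sketch (the sample log lines are hypothetical, and the bot list is just the user-agents discussed in this article; extend it to match whatever appears in your own logs):

```python
from collections import Counter

# Crawler tokens discussed in this article; extend as new bots appear.
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-SearchBot", "Claude-User", "PerplexityBot",
           "Perplexity-User", "CCBot", "Bytespider", "Google-Extended")

def count_ai_bot_hits(log_lines):
    """Tally requests per AI crawler by matching tokens in the User-Agent field."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in AI_BOTS:
            if bot.lower() in lowered:
                hits[bot] += 1
                break  # count each request at most once
    return hits

# Hypothetical combined-format access log lines for illustration:
sample = [
    '203.0.113.5 - - [17/Sep/2025:10:01:02] "GET /blog/post HTTP/1.1" 200 512 "-" "GPTBot/1.2"',
    '203.0.113.6 - - [17/Sep/2025:10:02:03] "GET /guide HTTP/1.1" 200 811 "-" "OAI-SearchBot/1.0"',
    '203.0.113.7 - - [17/Sep/2025:10:03:04] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(count_ai_bot_hits(sample))
```

Run against a full day of logs, this gives you the crawl-volume picture: which bots visit, how often, and (with a small extension to parse the request path) which pages they target.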

## The Strategic Calculus: What This Means for Your Business

The training-versus-retrieval distinction isn't a technical curiosity. It's a business decision with real revenue implications.

Anthropic's ClaudeBot makes approximately 71,000 requests for every single referral click it sends back to websites. The crawl-to-referral ratio for training bots is dismal. Search and retrieval crawlers perform better, but AI search referral traffic remains modest compared to traditional organic search. The value proposition is forward-looking: as more users start their research in AI interfaces rather than Google's blue links, being absent from AI search answers becomes a growing liability.

Taken together, the data show that publishers are no longer treating all bots the same. Instead of focusing on company names, they are looking at what each crawler actually does. That's the right approach. The question isn't "should I block OpenAI?" It's "should I allow my content to train models I don't control, and separately, should I allow my content to be cited in AI-powered search?" For most businesses publishing original content, the answer is: block training, allow retrieval, monitor results, and adjust quarterly as the landscape shifts.

These are not permanent decisions. AI companies change their crawler structures, rename user-agents, and update compliance policies regularly. The blanket "block AI crawlers" strategy that many sites adopted in 2024 no longer works the way it did. Your robots.txt needs to evolve with the same cadence.

The publishers winning in AI search right now share three traits: they make deliberate crawler-by-crawler access decisions, they structure content for machine readability alongside human readability, and they treat robots.txt as a living document rather than a set-and-forget file. The distinction between training and retrieval is the axis that makes all of this possible. Get it right, and you protect your content while staying visible. Miss it, and you're either invisible to the fastest-growing discovery channel on the web or giving your work away for free.

Ready to optimize for the AI era?

Get a free AEO audit and discover how your brand shows up in AI-powered search.

Get Your Free Audit