GEO · Jan 8, 2026 · 12 min read

Allowing OAI-SearchBot While Blocking GPTBot: A Training Opt-Out Playbook

Capconvert Team

Content Strategy

TL;DR

Most publishers want training-corpus exclusion without losing ChatGPT search visibility. The rule is six lines of robots.txt: block GPTBot, allow OAI-SearchBot, allow ChatGPT-User. This playbook walks through the full deployment, including pre-flight checks, the three rule variants for different site shapes, the verification steps that catch silent CDN overrides, and the stakeholder narrative that gets legal and marketing aligned on the trade-off.

The training opt-out conversation usually starts in legal and ends in marketing. Legal wants to limit how the brand's content shows up in commercial AI training corpora. Marketing wants the brand cited when ChatGPT answers buyer-research questions in the category. The two requirements look opposed until you understand that OpenAI runs separate crawlers for each purpose. GPTBot handles training-corpus ingestion. OAI-SearchBot handles retrieval-time citation. Each respects its own robots.txt directive. Block one, leave the other allowed, and you get both outcomes at the same time.

The configuration that delivers this is six lines of robots.txt. The deployment that actually works the first time involves more than six lines because the failure modes sit at the CDN, the WAF, and the residual robots.txt rules that some other team deployed two years ago. This playbook is the version we run with clients who want the training-out, search-in pattern to land cleanly on the first deploy and stay landed.

Why This Configuration Is The Default For Most Publishers

For most content sites in 2026 the trade-off math is straightforward. Training-corpus inclusion is a long-horizon, low-direct-revenue contribution. Your content might shape a future GPT model's behavior, but the path from contribution to attributable business outcome is several years long and entirely indirect. Live ChatGPT search citations are the opposite. They drive measurable referral traffic, lift brand share-of-voice in AI surfaces that increasingly substitute for classical SERPs, and move buyer-research conversations toward your URL within weeks of a content investment.

Treating both inclusions as a single decision conflates two very different value propositions. Publishers who block all OpenAI activity because they read the GPTBot block guidance and stopped there end up sacrificing the live channel to defend a long-horizon position they did not actually need to defend. Publishers who block neither because they could not be bothered to parse the documentation end up contributing to the training corpus when they would not have if asked directly. The middle position, blocking GPTBot only, is the right default for almost every commercial publisher who is not actively building a moat around proprietary content.

This is true even for publishers with strong reservations about generative AI training as a category. Opting out of training shapes future model weights. It does not affect models that have already been trained, which already incorporate older versions of your content fetched before any block was in place. The forward-looking exclusion is real but bounded. The live citation channel runs against current model behavior in production today, and the volume of buyer-research traffic that flows through it grows month-over-month.

When To Pick A Different Configuration

There are categories where the default does not apply. Subscription publishers protecting paywalled archives, legal publishers concerned about adversarial discovery, and brands with proprietary research they want to keep out of competitor-accessible models have legitimate reasons to block both crawlers and accept the loss of citation visibility. Medical and pharmaceutical publishers operating under regulatory constraints sometimes need to opt out comprehensively. These are real cases, and the right answer for them is a full block, not the partial pattern in this playbook. The default assumption in this guide is that you are a commercial content site whose business model benefits from being citable in AI-generated answers.

The Robots.txt Block You Actually Want

The minimum viable configuration is two rule blocks, four lines of directives:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

That is the entire rule. GPTBot sees the Disallow and stops fetching pages for training. OAI-SearchBot sees the Allow and continues fetching pages for the live ChatGPT search index. ChatGPT search citations flow through. The training corpus contribution stops.

The explicit Allow rule under OAI-SearchBot is not strictly required by the robots.txt protocol when no other rule restricts the bot, but writing it explicitly removes ambiguity. Robots.txt parsers handle a missing group differently across vendors: some treat the absence of a matching rule as allow, others fall back to the closest matching User-agent group and inherit whatever restriction it carries. Writing what you mean costs two lines of plain text and saves a class of edge-case failures where some other rule in your file inadvertently restricts OAI-SearchBot.

For sites that also want ChatGPT-User to keep working (so users who paste your URL into a chat get a working summary instead of an error), add a third block:

User-agent: ChatGPT-User
Allow: /

ChatGPT-User is the user-proxy crawler that fires when a ChatGPT user takes a specific action involving your page. Its behavior since a December 2025 policy change is documented separately and the robots.txt directive functions as a signal rather than a hard control, but writing the Allow keeps your intent visible and complete.

The fourth named OpenAI crawler, OAI-AdsBot, handles landing-page policy for OpenAI's paid placements. If you are not running paid placements in ChatGPT, it will not visit your site. If you are, allowing it is necessary for ad approval. Most publishers leave OAI-AdsBot out of robots.txt entirely and let it inherit the default allow.

What This Configuration Is Not

It is not a full opt-out. OpenAI still has the right to fetch your site through OAI-SearchBot and ChatGPT-User. Your content can still be incorporated into ChatGPT answers via the retrieval index. The only thing this configuration prevents is GPTBot contributing your pages to a future training corpus. If your goal is broader than that, this playbook is not the right one. The companion piece on GPTBot vs OAI-SearchBot covers the full opt-out configuration and the granular per-path variant for sites that need section-specific rules.

Pre-Deployment: Verify Your Current Crawl State

Before you change robots.txt, document what the current state looks like. The pre-flight check is twenty minutes of work and saves an order of magnitude more time in post-deployment troubleshooting.

The first thing to confirm is which OpenAI bots are currently visiting your site. In your access logs (Nginx, Apache, CloudFront, or whatever your stack uses), grep the last 30 days of requests for each bot's user agent string:

grep -i "GPTBot" /var/log/nginx/access.log* | wc -l
grep -i "OAI-SearchBot" /var/log/nginx/access.log* | wc -l
grep -i "ChatGPT-User" /var/log/nginx/access.log* | wc -l

The counts give you a baseline. If GPTBot is hitting hundreds of URLs per week and OAI-SearchBot is hitting dozens, your site is being actively crawled by both and the configuration change will produce a visible shift in the logs. If GPTBot counts are zero, you may already be blocked at the CDN or WAF level and the robots.txt change will be cosmetic until you fix the upper layer. The diagnostic patterns and the right way to read the logs are covered in the AI crawler log analysis guide if you want the full reference.
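
If your logs rotate into gzip archives, the plain grep above misses the compressed files. A hedged variant that sums counts across rotated logs, assuming the same Nginx path placeholder (adjust for your stack):

for bot in GPTBot OAI-SearchBot ChatGPT-User; do
  # zgrep reads both plain and gzipped logs; awk sums the per-file counts
  echo "$bot: $(zgrep -ci "$bot" /var/log/nginx/access.log* 2>/dev/null | awk -F: '{s+=$NF} END {print s+0}')"
done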

The second thing to confirm is what your current robots.txt actually says, including any rules that might apply globally to all crawlers. Fetch your live robots.txt:

curl -sS https://your-site.com/robots.txt

Read the file carefully. Look for User-agent: * rules with Disallow paths. Look for any existing GPTBot or OAI-SearchBot blocks (some sites have leftover rules from earlier opt-out attempts). Look for CDN-injected directives that may have been added by your hosting provider's security tooling. The new rule you deploy has to coexist with everything already in the file.
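
To make the read-through easier, a one-liner that surfaces every directive line in the live file with its line number (the hostname is a placeholder):

curl -sS https://your-site.com/robots.txt | grep -niE '^[[:space:]]*(user-agent|allow|disallow|sitemap)[[:space:]]*:'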

The third thing to confirm is the CDN or WAF layer. Robots.txt rules are advisory at the protocol level. The actual enforcement runs through your CDN or WAF if you have one, and the dashboard settings can override or duplicate the robots.txt logic in ways that are not visible from the file itself. Log in to your Cloudflare, Fastly, AWS, or Akamai dashboard and check the bot management settings. Specifically check whether Cloudflare AI Audit or its equivalent in your provider is set to block OpenAI bots indiscriminately. If it is, your robots.txt rule will not be the binding control, and changing it without also adjusting the CDN setting will produce no behavior change.

The Baseline Document

We capture the pre-flight findings in a short baseline document before any change goes out. The document lists the current robots.txt contents, the 30-day fetch counts per bot, the current CDN bot-management settings, and a screenshot of a ChatGPT search query that cites the site as proof of current visibility. The baseline document is what we compare against in the verification phase after deployment. Without the baseline, the only signal you have is whether ChatGPT still cites you, and that signal is too slow and too coarse to trust.
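
A minimal sketch of the capture step, assuming the same placeholder hostname and Nginx log path as earlier; the ChatGPT citation screenshot still has to be taken by hand:

STAMP=$(date +%Y-%m-%d)
mkdir -p "baseline-$STAMP"
# Snapshot the live robots.txt exactly as the public internet serves it today
curl -sS https://your-site.com/robots.txt > "baseline-$STAMP/robots.txt"
# Record the per-bot fetch counts alongside it
for bot in GPTBot OAI-SearchBot ChatGPT-User; do
  echo "$bot: $(zgrep -ci "$bot" /var/log/nginx/access.log* 2>/dev/null | awk -F: '{s+=$NF} END {print s+0}')"
done > "baseline-$STAMP/bot-fetch-counts.txt"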

Deployment: Three Versions Of The Rule

The exact robots.txt you deploy depends on how complex your site is. Three patterns cover almost every case.

  1. The minimal pattern, for single-section sites with one set of rules across the whole domain. Add the GPTBot block and the OAI-SearchBot allow at the top of your existing robots.txt, above any User-agent: * rules. The most-specific User-agent block wins in well-behaved parsers, so positioning matters less than completeness, but putting OpenAI's rules at the top makes them easy to find and audit. Use the minimal configuration from the previous section as a drop-in.
  2. The full named-fleet pattern, for sites that want explicit rules for every OpenAI bot. Add Allow rules for OAI-SearchBot, ChatGPT-User, and OAI-AdsBot in addition to the GPTBot Disallow, eight to twelve lines total. This pattern eliminates ambiguity for every bot in the fleet and is what we recommend for higher-traffic sites where audit-trail clarity matters more than file brevity. A sketch appears after this list.
  3. The granular path-scoped pattern, for sites with mixed content categories. Combine the GPTBot block with path-specific rules for OAI-SearchBot that restrict it from sensitive sections (members areas, paid archives, admin paths) while leaving the marketing content open. This is the right pattern for publishers with subscription business models where the article archive should not be ingested but the marketing pages should remain reachable. A sketch also appears after this list.
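
Sketches of patterns 2 and 3, with the path rules in the granular version standing in as placeholders for whatever your sensitive sections actually are:

# Pattern 2: full named fleet
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-AdsBot
Allow: /

# Pattern 3: granular path scoping (paths are placeholders)
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /members/
Disallow: /archive/
Allow: /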

For all three patterns, the deployment mechanics are the same. Edit robots.txt, commit the change, push to your origin server, and verify that the file is being served correctly from the public URL. The fetch above (curl your own robots.txt) is also the post-deployment fetch. If the new contents do not appear there immediately, something is intercepting the request before it reaches your origin. Most commonly this is a CDN cache. Purge the cache and re-fetch.
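
One way to make the "is the edge serving what I deployed" check mechanical, assuming the deployed copy lives at a repository path like deploy/robots.txt (a placeholder):

diff <(curl -sS https://your-site.com/robots.txt) deploy/robots.txt \
  && echo "public file matches the deployed copy" \
  || echo "mismatch: likely a stale CDN cache or an intercepting layer"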

A Note On Deployment Timing

There is no urgency benefit to deploying off-hours. Robots.txt changes do not produce user-visible impact. The bot behavior change rolls in gradually as each crawler re-fetches the robots.txt and updates its internal queues. GPTBot and OAI-SearchBot pick up new rules within 24 to 72 hours of their next crawl. The post-deployment monitoring window is one to two weeks, not minutes. Deploy during normal business hours when the engineer who made the change is available to investigate if something goes wrong.

Verification: Watching The Bot Behavior Change

The first verification step is reading your robots.txt as each bot would see it. Curl with the bot's user agent string:

curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2)" https://your-site.com/robots.txt
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0)" https://your-site.com/robots.txt

The response should match what you deployed. If your CDN serves a different file based on user agent (some bot-management features do this), you will see the divergence here.

The second verification step is access log monitoring. For the next two weeks after deployment, watch GPTBot and OAI-SearchBot fetch volume in your logs daily. The expected pattern is a steady decline in GPTBot fetches over the first 72 hours followed by complete cessation. OAI-SearchBot fetches should remain at roughly the pre-change baseline or grow slightly. If GPTBot fetches do not decline, the bot has not yet picked up the new rule (give it another day) or your CDN is serving a stale robots.txt (purge the cache and re-check). If OAI-SearchBot fetches drop or stop, your rule has unintended scope and is also blocking the bot you wanted to keep.
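
For the daily watch, a per-day tally is easier to eyeball than raw request lines. A sketch that assumes the default combined log format with bracketed timestamps and the same Nginx path placeholder:

for bot in GPTBot OAI-SearchBot; do
  echo "== $bot =="
  # Pull the dd/Mon/yyyy portion out of the bracketed timestamp, then count requests per day
  zgrep -hi "$bot" /var/log/nginx/access.log* 2>/dev/null | awk -F'[][]' '{split($2, t, ":"); print t[1]}' | sort | uniq -c
done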

The third verification step is empirical citation testing. The whole point of the configuration is to stay in ChatGPT search results. Confirm that you are still being cited by running 10 to 20 buyer-intent queries about your category through ChatGPT and counting how often your domain appears in the source citations. If the citation rate matches your baseline, the configuration is working. If it drops materially, you may have a CDN-layer block in place that the robots.txt change exposed, or some other indirect effect that needs investigation. The OAI-SearchBot optimization playbook covers the citation-side workflow if you want a deeper diagnostic walkthrough.

Verifying The Bot Identity Is Real

The user agent string is the primary signal but is easily spoofed. To verify a request claiming to be GPTBot or OAI-SearchBot is actually from OpenAI, cross-check the request's IP address against OpenAI's published ranges. The GPTBot IP range JSON lives at openai.com/gptbot.json and the OAI-SearchBot equivalent at openai.com/searchbot.json. A request claiming to be one of these bots from an IP outside the published range is not actually from OpenAI and should be treated as suspect.
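
A hedged way to pull the published ranges for comparison. The jq paths below assume the files use the Google-style prefixes/ipv4Prefix layout; confirm the field names against the live JSON before scripting anything that enforces on them:

# List the CIDR blocks OpenAI publishes for each bot (requires jq)
curl -sS https://openai.com/gptbot.json | jq -r '.prefixes[].ipv4Prefix // empty'
curl -sS https://openai.com/searchbot.json | jq -r '.prefixes[].ipv4Prefix // empty'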

Common Pitfalls That Break The Pattern

The configuration is simple. The deployment failures are not. Across the client robots.txt rewrites we have run, five patterns produce most of the post-deployment incidents.

  1. Global User-agent: * Disallow rules left over from staging. These are the single most common cause of unexpected OAI-SearchBot blocks. A Disallow: / under User-agent: * applies to every bot that does not have a more specific rule, and if your OAI-SearchBot Allow block is missing or written incorrectly, the global rule applies. The diagnostic is to fetch robots.txt as OAI-SearchBot would see it and check your most-cited URLs against the rules that actually apply to that bot.
  2. CDN-layer bot management overriding robots.txt. Cloudflare, Akamai, Fastly, and AWS WAF can all be configured to challenge or block OpenAI bots regardless of what robots.txt says. The robots.txt change appears to work on paper but no behavior changes in production because the upper layer is the binding control. The fix requires dashboard access to the CDN, not changes to robots.txt.
  3. Path-scoped Disallow rules that catch important pages. The granular pattern (option 3 above) is powerful but error-prone. A Disallow: /blog/ rule meant to protect draft posts will silently exclude your top-cited content from OAI-SearchBot if your blog posts are the pages ChatGPT cites most. The diagnostic is to enumerate the top-cited URLs from your baseline document and check each one against the path rules.
  4. Noindex meta tags injected by themes or plugins. Some retrieval pipelines check noindex meta tags and skip citation for pages marked as such, even when robots.txt explicitly allows the bot. WordPress sites running an SEO plugin sometimes have noindex toggles that affect categories of pages, and the toggle propagates as meta tags that conflict with the intended robots.txt configuration.
  5. Stale CDN cache serving an old robots.txt for hours after deployment. Some CDN configurations cache robots.txt aggressively. The new rule is on the origin but the public URL still serves the old version. The diagnostic is to curl your robots.txt with cache-busting headers and compare to the origin file, as sketched after this list. The fix is a manual cache purge after deployment.
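
The cache-bust fetch from pitfall five; the query-string buster is a convenience rather than a guarantee, some CDNs ignore request-side cache headers entirely, and a dashboard purge remains the reliable fix:

# Ask for a fresh copy past intermediate caches, then compare by eye to the origin file
curl -sS -H 'Cache-Control: no-cache' "https://your-site.com/robots.txt?cb=$(date +%s)"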

The Order Of Diagnostic Operations

When a deployment looks like it failed, run the diagnostics in this order. First, curl your robots.txt and confirm the new contents are public. Second, curl with each bot's user agent and confirm parser behavior. Third, check your CDN dashboard for bot-management overrides. Fourth, inspect access logs for the expected fetch-volume shift. Fifth, run a citation spot-check against ChatGPT to confirm the live channel is still working. The order matters because each step rules out a category of failure before the next step is meaningful.

The Stakeholder Case For This Approach

The internal sell for this configuration is often harder than the deployment. Legal and marketing approach the trade-off with different vocabularies, and the engineer running the change usually ends up explaining the same trade-off twice in different language.

For legal, the relevant framing is that GPTBot is the bot that ships content into training corpora, and the GPTBot block stops that contribution. OpenAI's documentation explicitly supports the directive, OpenAI has acknowledged it in publisher statements, and the technical mechanism (separate user agents, separate IP ranges, separate robots.txt tokens) backs the policy claim. The brand's content stops contributing to future model weights from the deployment date forward. Content that was already ingested before the block was deployed remains in models that were already trained, which is a real but bounded consideration legal teams need to understand. The brand is not pretending the content was never used. It is opting out of further use going forward.

For marketing, the relevant framing is that OAI-SearchBot is the bot that powers ChatGPT search citations, and leaving OAI-SearchBot allowed preserves the channel. ChatGPT processes billions of queries per day. Every brand citation in those answers is a referral opportunity, a share-of-voice signal, and a buyer-research touchpoint. The configuration in this playbook keeps the channel open without compromising the training opt-out. The marketing team's job is unaffected. The same content investments that drive Google rankings continue to drive ChatGPT citations.

For executives, the relevant framing is the cost of getting it wrong. The default move (block all OpenAI activity) sacrifices a growing referral channel to defend a position the brand could have defended more narrowly. The alternative default (block nothing) contributes to training corpora the brand could have opted out of with one configuration change. The middle pattern is the only configuration that does what most brands actually want, and the deployment cost is two hours of engineering work plus one week of monitoring. The executive question is whether the brand should pay that cost. For most commercial publishers, the answer is yes.

When The Internal Sell Goes Sideways

Sometimes legal pushes back on any continued OpenAI access. Sometimes marketing pushes back on any restriction because they read a Twitter thread about AI citation lift. The right move is to bring both sides into the same conversation with the same data. The baseline document from the pre-flight step is the artifact that grounds the conversation. Show the current fetch volumes, the current citation rate, and the projected impact of each option. The decision usually lands on the middle pattern naturally because the data supports it. Where it does not, follow the tightest constraint any stakeholder requires and document the reasoning so the configuration is auditable later.

Frequently Asked Questions

How quickly will GPTBot stop fetching after I deploy the block?

OpenAI does not publish a guaranteed propagation window, but observable behavior in client deployments is that GPTBot picks up new robots.txt rules within 24 to 72 hours of its next scheduled crawl of your robots.txt file. For high-traffic sites the bot fetches robots.txt frequently and the change applies fast. For lower-traffic sites the cadence is slower. If you do not see a clear decline in GPTBot fetches after a week, run the diagnostic order above to identify whether the rule is being honored or another layer is interfering.

Does this configuration affect Google rankings or AI Overviews?

No. GPTBot and OAI-SearchBot are operated by OpenAI. Your Google rankings depend on Googlebot's crawl of your site, and Google's AI Overviews depend on the regular Googlebot index. Google's separate token, Google-Extended, controls Gemini training opt-out, and it is independent from the OpenAI directives. Deploying the OpenAI rules has zero effect on anything Google-controlled.

Is the same pattern available for Anthropic's ClaudeBot and Perplexity's bot?

Yes, with a different mechanism. Anthropic and Perplexity each operate separate bots for training and retrieval, and each respects its own robots.txt directives. ClaudeBot and PerplexityBot can be blocked at the training level the same way as GPTBot, while their retrieval counterparts stay allowed. The principle is identical across all three vendors: block the training crawler, allow the retrieval crawler.
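
As a hedged sketch for the Anthropic half only, using the user-agent tokens Anthropic documents at the time of writing; confirm the current tokens, and Perplexity's, against each vendor's own crawler documentation before deploying:

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /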

What if my site is on Shopify, WordPress, or another platform that controls robots.txt?

Shopify generates robots.txt automatically and allows custom rules through the robots.txt.liquid template, which is editable in the theme code. WordPress serves a virtual robots.txt that SEO plugins such as Yoast or RankMath can extend, and both support custom User-agent blocks. Webflow exposes robots.txt directly in the SEO settings. For all three platforms, the mechanics differ slightly from a static-site origin but the result is the same: you add the GPTBot Disallow and OAI-SearchBot Allow rules through whichever interface the platform provides, and you verify by curling your live robots.txt.

Can I roll this configuration back if it does not work as expected?

Yes, immediately. Robots.txt is advisory and stateless. Remove the GPTBot Disallow, redeploy, and GPTBot will resume fetching within 24 to 72 hours. No content was destroyed and no irreversible action was taken. The roll-back is as low-risk as the deployment was. The only consideration is that any pages added to the GPTBot training queue while the block was in place will not have been fetched, and OpenAI may need a fresh crawl after the rule is lifted to refresh those pages. For most publishers this is an acceptable cost.

The training opt-out, citation-in pattern is the right default for almost every commercial publisher in 2026. The configuration is simple, the deployment is reversible, and the trade-off math favors it for any brand that values its presence in ChatGPT search answers. Where the default does not apply, the alternatives (full opt-out, granular per-path rules) are documented in the broader robots.txt material and can be reached from the same foundation.

If your team wants the pre-flight baseline, the deployment, the verification window, and the CDN-layer audit run as a single engagement, that work sits inside our generative engine optimization program. The configuration is small enough to look trivial and important enough to be worth doing carefully the first time.

Ready to optimize for the AI era?

Get a free AEO audit and discover how your brand shows up in AI-powered search.

Get Your Free Audit