Most publishers who want strong control over AI crawler access start with robots.txt and end with disappointment. Robots.txt is a polite request. Compliance is voluntary. The protocol works fine for well-behaved scheduled crawlers like GPTBot that publish their compliance policy alongside the user agent, but it gives you nothing against bots that ignore the file (intentionally or not), spoofed crawlers that pretend to be OpenAI from a different origin, or your own infrastructure layers that override robots.txt at the network level.
The layer that actually enforces bot policy is your CDN or WAF. The substrate that makes the enforcement possible is the IP address. OpenAI publishes the IP ranges that each of its four named crawlers uses in JSON files at openai.com, and those files are the single source of truth for "is this request actually from OpenAI." Once you can answer that question with confidence, you can build any policy you want: block, challenge, rate-limit, path-scoped restriction, geographic carve-out. The user agent string is a hint. The IP range is the proof.
This guide is the practical walkthrough for setting up IP-range based rules on the three CDN platforms most Capconvert clients run on, plus the automation pattern that keeps the rules current as OpenAI rotates the ranges.
Why IP-Range Rules Beat User-Agent Rules
User-agent matching is the entry-level way to identify a bot. A request claiming to be GPTBot or OAI-SearchBot sends a user agent string identifying itself, and a WAF rule that matches on the string is straightforward to write. The problem is that the user agent header is a self-reported field with no integrity guarantees. Any client can send any string. The actual identity of the requester is the network address it came from, and the network address either belongs to OpenAI's published infrastructure or does not.
In practice the gap between user-agent matching and IP-range matching shows up in three patterns.
The first is competitor scraping under cover. Scrapers running through residential or commercial proxies routinely send AI-crawler user agents because most WAFs are configured to allowlist those crawlers. A scraper sending "GPTBot" from a residential IP gets through bot defenses that would otherwise have blocked it. Match on IP range and the scraper fails the check immediately. The competitor learns to use a different cover identity, but at least the OpenAI cover is no longer free.
The second is anti-bot tooling that erroneously blocks legitimate OpenAI activity. Cloudflare's bot management, Akamai's bot manager, and AWS WAF's bot control all use heuristic scoring that sometimes flags real OpenAI requests as suspicious because of behavior patterns (high request rate from a small IP set, atypical user agents, unusual referer chains). An explicit IP-range allowlist takes the heuristic out of the loop and tells your WAF that requests from OpenAI's range are legitimate by definition.
The third is precise scope when you want it. IP-range rules can be combined with URL pattern matching to permit OpenAI on most of the site while restricting it from sensitive paths. User-agent rules can do the same in principle but rely on the user agent being honest, which it might not be. The IP-range version is more durable because the network identity does not lie.
When User-Agent Matching Is Still Useful
User-agent rules remain the right tool for logging and analytics. Counting GPTBot fetches in your access logs requires user-agent matching because the IP address alone does not tell you which OpenAI bot made the request. The two layers complement each other: user-agent for observability, IP-range for enforcement. The companion piece on reading AI crawler activity in server logs walks the log-analysis side of the equation.
The Four JSON Files You Need
OpenAI publishes one JSON file per named crawler. Each file lists the IP CIDR blocks the bot uses. The URLs are stable; the contents are refreshed periodically as OpenAI's infrastructure scales. The current four:
- openai.com/gptbot.json - the training corpus crawler that ingests pages for future GPT model training
- openai.com/searchbot.json - OAI-SearchBot, the retrieval crawler that maintains the index ChatGPT search draws on
- openai.com/chatgpt-user.json - ChatGPT-User, the user-proxy bot that fetches pages on behalf of specific ChatGPT users
- A fourth file covers OAI-AdsBot, which validates landing-page policy for OpenAI's paid placements
The structure of each file is straightforward. It contains a JSON object with a "prefixes" array, and each prefix is an object with an "ipv4Prefix" or "ipv6Prefix" field listing the CIDR block. Fetching and parsing the file is a one-liner in any reasonable language:
```bash
curl -s https://openai.com/gptbot.json \
  | jq -r '.prefixes[] | (.ipv4Prefix // .ipv6Prefix)'
```
Output is one CIDR per line, ready to paste into your WAF or pipe into an automated update. The same command works for each of the four files. The Anthropic, Perplexity, and Google equivalents follow the same pattern at different URLs, which lets you build a single fetch-and-merge script that maintains your entire AI-bot allowlist.
The official OpenAI bot documentation describes the user agents, the JSON URLs, and the bot purposes in one place. It is the canonical reference and worth reading once even if you mostly use the JSON files directly.
How Often The Ranges Change
OpenAI does not publish a change cadence, but in client deployments the prefix lists shift modestly every few weeks. Most changes are additions as OpenAI provisions more capacity; occasional changes are removals as old ranges are retired. The safe assumption is that any static IP allowlist will drift out of sync within a month and needs an automated refresh to remain accurate. The "Keeping The Rules Fresh" section below covers the automation pattern.
Cloudflare: The Most Common Configuration
Cloudflare is where most Capconvert clients sit, and Cloudflare's WAF is the most ergonomic of the major CDN platforms for IP-range based rules. The platform supports IP Lists (custom collections of CIDR blocks) that can be referenced from WAF Custom Rules, and the combination produces clean, auditable bot policy.
The basic setup is three steps. First, create an IP List in the Cloudflare dashboard under Manage Account > Configurations > Lists. Choose IP List as the type, name it something descriptive like "openai_gptbot_prefixes" (list names may contain only lowercase letters, numbers, and underscores), and add the CIDR blocks from gptbot.json either manually or via the API. Repeat for each bot you want to manage separately.
Second, create a WAF Custom Rule that references the list. Under Security > WAF > Custom Rules, add a new rule with the expression:
```
(ip.src in $openai_gptbot_prefixes)
```
Choose the action: Block, Managed Challenge, JS Challenge, or Skip (for an allowlist pattern). For most publishers wanting to block GPTBot while allowing OAI-SearchBot, the action is Block and the rule scope is limited to GPTBot's IP list. The OAI-SearchBot list is referenced in a separate rule with Skip or no rule at all (allowing default behavior).
Third, deploy the rule and monitor. The Cloudflare WAF analytics show the rule's hit rate in real time, which lets you confirm immediately that the rule is firing against actual GPTBot traffic. Cloudflare's broader AI Audit feature provides separate visibility into AI crawler activity across the zone and is worth enabling alongside the custom rules.
For sites that want path-scoped restrictions, the expression supports compound conditions:
```
(ip.src in $openai_gptbot_prefixes) and (http.request.uri.path matches "^/members/")
```
This blocks GPTBot only from the /members/ path tree while leaving the rest of the site open. The same pattern works for any URL prefix, regex match, or query-string condition Cloudflare's expression language supports.
Automating Cloudflare IP List Updates
The Cloudflare API supports list updates, which means you can automate the refresh from the JSON files. A small script that fetches each OpenAI JSON, extracts the prefixes, and PUTs them to the corresponding Cloudflare list keeps the rules in sync. The Cloudflare Worker pattern (small JavaScript scripts that run on the edge) can do the fetch and update on a cron schedule, eliminating the need for external infrastructure. The "Keeping The Rules Fresh" section covers the script structure.
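The push step can be sketched in a few lines of shell. This is a sketch under stated assumptions, not Cloudflare's reference implementation: it assumes jq is installed, the sample CIDRs are documentation ranges rather than real OpenAI prefixes, and the account ID, list ID, and token in the commented curl are placeholders you substitute from your own zone.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shape a newline-separated CIDR list into the [{"ip": "..."}] items
# payload that a Cloudflare IP List replace call expects.
to_cf_items() {
  jq -R '{ip: .}' | jq -s .
}

# Documentation ranges standing in for the live gptbot.json output.
printf '192.0.2.0/24\n203.0.113.0/24\n' | to_cf_items

# Live push (sketch; account ID, list ID, and token are placeholders).
# PUT on the items endpoint replaces the list contents wholesale:
#   curl -sS https://openai.com/gptbot.json \
#     | jq -r '.prefixes[] | (.ipv4Prefix // .ipv6Prefix)' \
#     | to_cf_items \
#     | curl -sS -X PUT \
#         "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/rules/lists/${LIST_ID}/items" \
#         -H "Authorization: Bearer ${CF_API_TOKEN}" \
#         -H "Content-Type: application/json" --data @-
```

Because the PUT replaces the whole list, added and retired prefixes are handled in a single call with no diffing logic on your side.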
AWS WAF: IPSet And Rule Groups
AWS WAF uses IPSets and Rule Groups for IP-based filtering, with a different mental model than Cloudflare's IP Lists but equivalent functionality. The configuration is more verbose but the same logical building blocks apply.
The basic setup is four steps. First, create an IPSet in the AWS WAF console with the CIDR blocks from gptbot.json. The IPSet must be created in the same scope (CloudFront or regional) as the WAF you are protecting. Repeat for OAI-SearchBot, ChatGPT-User, and OAI-AdsBot if you want separate enforcement per bot.
Second, create a Rule that references the IPSet. The rule uses an "Originates from an IP address in" statement, pointing at the IPSet, with an action of Block, Allow, or Count. Block stops the request entirely. Count flags the request for monitoring without blocking, which is useful during the deployment verification phase before you commit to enforcement.
Third, add the rule to a Web ACL and associate the Web ACL with your CloudFront distribution or ALB. The association is where the rule becomes load-bearing. Until the Web ACL is associated with a resource, the rules do not run.
Fourth, monitor through CloudWatch metrics and the WAF sampled requests feature. The latter shows individual blocked requests with their IP, user agent, and path, which is the cleanest way to verify the rule is matching real OpenAI traffic.
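The IPSet from step one can be kept current from the command line. A hedged sketch, assuming jq and the AWS CLI; the IPSet name, ID, and scope are placeholders, and the live calls are shown as comments because they require credentials:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shape a newline-separated CIDR list into the JSON array form that
# `aws wafv2 update-ip-set --addresses` accepts.
to_ipset_addresses() {
  jq -R . | jq -s .
}

# Documentation ranges standing in for the live gptbot.json output.
printf '192.0.2.0/24\n203.0.113.0/24\n' | to_ipset_addresses

# Live update (sketch; name, ID, and scope are placeholders). UpdateIPSet
# replaces the address list wholesale and needs the current lock token:
#   LOCK=$(aws wafv2 get-ip-set --name openai-gptbot --scope CLOUDFRONT \
#     --id "$IPSET_ID" --query LockToken --output text)
#   aws wafv2 update-ip-set --name openai-gptbot --scope CLOUDFRONT \
#     --id "$IPSET_ID" --lock-token "$LOCK" \
#     --addresses "$(curl -sS https://openai.com/gptbot.json \
#       | jq -r '.prefixes[] | (.ipv4Prefix // .ipv6Prefix)' | to_ipset_addresses)"
```

The lock token is AWS's optimistic-concurrency guard: if another process updated the IPSet since your GetIPSet, the update is rejected and the script should re-fetch and retry.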
For path-scoped restrictions, AWS WAF supports compound rules using ANDed statements. A rule combining an IPSet match with a URI Path match expression produces the same scoped-block behavior as Cloudflare's compound expression. The expression UI is more cumbersome but the result is equivalent.
AWS WAF Cost Considerations
AWS WAF prices per Web ACL per month plus per million requests evaluated. For low-traffic sites the cost is trivial. For high-traffic sites the per-request charges add up, especially if the WAF evaluates rules that match a small percentage of traffic. Right-sizing the rule set matters. The other consideration is that IPSets have a maximum capacity (10,000 CIDRs per IPSet at last check), which is well above OpenAI's range count but worth being aware of when you start combining ranges from multiple AI vendors into a single list.
Fastly And Other CDNs: The Generic Pattern
Fastly's VCL (Varnish Configuration Language) supports IP-range matching through ACLs declared at the service level. The pattern is more code-first than Cloudflare or AWS but offers the same control surface. Other CDNs (Akamai, Vercel, Netlify edge functions, Cloudflare Workers if you go fully programmatic) follow a similar pattern.
The generic configuration shape is the same regardless of platform:
- Declare or import the CIDR list for each OpenAI bot
- Match incoming requests against the list at the edge
- Take an action (block, allow, rate-limit, log) based on the match and any additional conditions
- Update the list periodically from the JSON source
In Fastly VCL, the syntax looks roughly like:
```vcl
acl openai_gptbot {
  "192.0.2.0"/24;
  "203.0.113.0"/24;
  # ... populated from openai.com/gptbot.json
}

sub vcl_recv {
  if (client.ip ~ openai_gptbot) {
    error 403 "Forbidden";
  }
}
```
The ACL contents are populated from a fetch script that runs periodically. Fastly's dynamic ACLs allow updates through the API without redeploying VCL, which is the production-grade pattern.
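Fastly's batch entries endpoint takes operations rather than a full snapshot, so the update script splits each CIDR into address and subnet parts. A sketch using documentation ranges; the service ID, ACL ID, and token are placeholders, and a complete sync would also submit delete operations for entries no longer in the JSON:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shape CIDRs into the batch-operations payload that Fastly's
# /service/{service_id}/acl/{acl_id}/entries endpoint accepts.
to_fastly_entries() {
  jq -R 'split("/") | {op: "create", ip: .[0], subnet: (.[1] | tonumber)}' \
    | jq -s '{entries: .}'
}

# Documentation ranges standing in for the live prefix list.
printf '192.0.2.0/24\n203.0.113.0/24\n' | to_fastly_entries

# Live push (sketch; service ID, ACL ID, and token are placeholders):
#   cat /var/cache/openai/gptbot.txt | to_fastly_entries \
#     | curl -sS -X PATCH \
#         "https://api.fastly.com/service/${SERVICE_ID}/acl/${ACL_ID}/entries" \
#         -H "Fastly-Key: ${FASTLY_TOKEN}" \
#         -H "Content-Type: application/json" --data @-
```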
Akamai's equivalent is Custom Bot Categories in the Bot Manager product, paired with Network Lists for the CIDR data. The mental model is the same: define the bot, define the network list, define the policy that combines them.
Vercel's edge functions and Cloudflare Workers occupy a slightly different position because they let you write the rule entirely in code rather than configuring it through a dashboard. The trade-off is more flexibility (any logic you want) against more responsibility (you maintain the code, the cron, the failure handling). For teams already maintaining edge functions for other purposes, adding OpenAI IP-range enforcement is a small extension. For teams that do not have edge functions for other reasons, a dashboard-configured WAF is the lower-overhead option.
Keeping The Rules Fresh
The IP ranges shift over time. A WAF rule built on a static snapshot of openai.com/gptbot.json will drift out of date within weeks, at which point new OpenAI fetches from new IP ranges will not match the rule and the policy will silently fail. The mitigation is an automated refresh job.
The minimum viable refresh is a cron-scheduled script that fetches each OpenAI JSON, extracts the prefix list, and updates the corresponding WAF IP list or IPSet. The script structure is identical across platforms:
```bash
#!/usr/bin/env bash
set -euo pipefail

for bot in gptbot searchbot chatgpt-user; do
  curl -sS "https://openai.com/${bot}.json" \
    | jq -r '.prefixes[] | (.ipv4Prefix // .ipv6Prefix)' \
    > "/var/cache/openai/${bot}.txt"
  # Then push the list to the WAF via the platform's API
done
```
The "push the list to the WAF via the platform's API" step is the platform-specific part. Cloudflare's IP Lists API accepts a full replacement snapshot of the list items. AWS's WAF UpdateIPSet API likewise replaces the address set wholesale (and requires the current lock token). Fastly's Dynamic ACL API is operation-based, so the script diffs the new snapshot against the existing entries and submits the creates and deletes. In every case the end state is the same: the live list matches the current JSON, with new prefixes added and stale ones removed.
Schedule the script to run daily or every few hours. The refresh cost is minimal (a few HTTP requests, a few API calls), and the protection against silent staleness is worth the small added complexity. Production-grade scripts add a few quality-of-life features: change detection (only update the WAF if the prefix list actually changed), Slack or email alerts on significant changes (a 50% prefix-list growth in one fetch is worth investigating), and audit logging (which prefixes were added, which removed, when).
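The change-detection feature reduces to comparing the fresh fetch against the previous run's cached copy. A minimal illustration, with a temp directory standing in for /var/cache/openai:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Only push to the WAF when the prefix list actually differs from
# the cached copy written by the previous run.
refresh_needed() {
  local new=$1 cache=$2
  # A missing cache (first run) counts as a change.
  [ ! -f "$cache" ] || ! cmp -s "$new" "$cache"
}

cache_dir=$(mktemp -d)   # stand-in for /var/cache/openai
printf '192.0.2.0/24\n' > "$cache_dir/gptbot.new"

if refresh_needed "$cache_dir/gptbot.new" "$cache_dir/gptbot.txt"; then
  echo "prefixes changed -- push to WAF and update the cache"
  # (platform-specific API call goes here)
  cp "$cache_dir/gptbot.new" "$cache_dir/gptbot.txt"
else
  echo "no change -- skip the API call"
fi
# prints "prefixes changed -- push to WAF and update the cache"
```

Sorting both files before the comparison (sort -u) makes the check robust against OpenAI reordering prefixes between fetches.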
For teams not running their own automation, several third-party services maintain the OpenAI prefix lists as a service and expose them through CDN-native integrations. Cloudflare's AI Audit, for example, includes preconfigured bot lists that are kept fresh upstream. If your CDN offers a managed version, using it eliminates the script entirely at the cost of less control over which bots are tracked. The same trade-off principles that informed your overall robots.txt policy in our training opt-out playbook apply here too.
The Failure Mode That Matters Most
The worst failure pattern is a refresh script that silently breaks (a 404 from the JSON URL, a parsing error, an API auth failure) and leaves the WAF rules referencing a stale list. The site continues to "work" from the publisher's perspective but the policy is no longer enforced. Add monitoring on the refresh job itself: dead-man's-switch alerting that pages someone if the script does not run successfully within its expected window. The cost of the monitoring is a small fraction of the cost of a silent policy failure.
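A minimal local sketch of the dead-man's-switch idea: the refresh job records a timestamp only after a fully successful run, and a separate monitor alerts when that timestamp ages past the expected window. Production setups usually replace the local file with a check-in URL on a hosted monitoring service; the file version here is just the smallest testable form.

```shell
#!/usr/bin/env bash
set -euo pipefail

HEARTBEAT="${HEARTBEAT:-$(mktemp)}"   # e.g. /var/run/openai-refresh.ok

# Called only at the end of a fully successful refresh run.
record_success() { date +%s > "$HEARTBEAT"; }

# Monitor-side check: is the last recorded success recent enough?
heartbeat_fresh() {
  local max_age=$1
  [ -f "$HEARTBEAT" ] && \
    [ $(( $(date +%s) - $(cat "$HEARTBEAT") )) -le "$max_age" ]
}

record_success
if heartbeat_fresh 86400; then
  echo "refresh job healthy"
else
  echo "ALERT: refresh job stale -- page someone"
fi
# prints "refresh job healthy"
```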
Choosing Block, Challenge, Or Rate-Limit
Once the IP-range identification is in place, the policy decision is what to do with the matched traffic. Three actions cover almost every reasonable case.
Block returns a 4xx response (typically 403) and the request is rejected. This is the right choice when you want zero crawl activity from the matched range. Use it for full opt-outs and for sensitive path restrictions where any leakage is unacceptable.
Challenge returns an interstitial page or JavaScript challenge that the client must solve before being permitted. This is the right choice when you want to discourage a bot without refusing it outright, typically when your tolerance for its traffic is partial rather than zero. Most AI crawlers do not solve challenges, so the practical effect is similar to Block, but the action is more polite and easier to roll back if you change your mind.
Rate-limit caps the requests per minute per IP, allowing some access while preventing burst traffic. This is the right choice when you want OpenAI's bots to keep working but limit their crawl bandwidth. The threshold depends on your origin's capacity and the bot's typical pattern; for most sites, 10 to 60 requests per minute per IP is a reasonable starting point. Rate-limiting is the most permissive of the three options and rarely the right answer when the actual goal is a clean policy decision.
For most publishers in the training opt-out, citation-in pattern (block GPTBot, allow OAI-SearchBot), Block is the right action for the GPTBot range and no rule at all is the right "action" for the OAI-SearchBot range (default allow). The rules can coexist in the same Web ACL or rule group without interference, and the bot fleet sees consistent enforcement.
What Not To Do
A few patterns produce more problems than they solve. Do not configure user-agent matching as a redundant layer above the IP-range rule expecting both to fire; the user agent layer adds noise to the analytics without strengthening the enforcement. Do not write rules that block all traffic from a country or ISP rather than from OpenAI's specific ranges; the over-block hits legitimate users and produces support tickets without improving bot defense. And do not enable rule-based blocks without first running the rule in Count or Log mode for a few days; the verification phase catches false positives before they affect production traffic.
Frequently Asked Questions
How do I know which OpenAI bot is actually visiting my site?
The user agent string identifies the bot (GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot), and you can grep your access logs to confirm. The IP range JSON files tell you which IPs are valid for each bot, so you can cross-reference a specific request against the relevant JSON to confirm both the user agent and the network identity match. Spoofed requests claim a user agent without coming from the published IP range; legitimate requests do both. The cross-check is one of the practical reasons to publish IP ranges in the first place, and it is the same verification mechanism that powers the policy decisions covered in GPTBot vs OAI-SearchBot.
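The cross-check can be scripted. A minimal IPv4-only bash sketch (the ranges shown are documentation addresses, not real OpenAI prefixes); production scripts more often reach for a purpose-built tool such as grepcidr, but the arithmetic is simple enough to inline:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Convert a dotted-quad IPv4 address to its 32-bit integer form.
ip_to_int() {
  local IFS=.
  local a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Does address $1 fall inside CIDR $2?
ip_in_cidr() {
  local ip=$1 cidr=$2
  local net=${cidr%/*} bits=${cidr#*/}
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Example with documentation ranges:
ip_in_cidr 192.0.2.57 192.0.2.0/24 && echo "inside published range"
# prints "inside published range"
```

To vet a suspect request, loop the claimed bot's cached prefix file through ip_in_cidr; a user agent claiming GPTBot with no matching prefix is a spoof.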
What happens if I block an IP range that OpenAI removes later?
The blocked range simply has no traffic. Removing a CIDR block from your WAF takes no manual cleanup; if the range is no longer in OpenAI's JSON, the automated refresh removes it from your list at the next run. Stale entries do not cause functional issues, only minor analytics noise.
Will IP-range blocks affect ChatGPT-User the same way as GPTBot and OAI-SearchBot?
Yes. ChatGPT-User is a documented OpenAI crawler with its own published IP range. A rule that matches the chatgpt-user.json prefixes catches it the same way the gptbot.json prefixes catch GPTBot. The behavior considerations are different (ChatGPT-User fires on user-proxy actions rather than scheduled crawls), and the full implications are covered in our ChatGPT-User explainer, but the IP-range enforcement mechanism is the same.
Should I also block Anthropic and Perplexity at the IP layer?
Anthropic and Perplexity each publish IP ranges for their crawlers, and the same configuration pattern applies. The decision of whether to block depends on the same trade-off math as the OpenAI decision: whether the value of being citable in those AI engines outweighs the value of opting out. Most publishers we work with land on a consistent posture across all three vendors, which means writing the IP-range rules for all of them once and maintaining them through the same refresh script.
What is the right starting point if I have never configured WAF rules before?
Start with monitoring, not blocking. Configure your WAF to Count (AWS) or Log (Cloudflare) requests from OpenAI's IP ranges without taking action. Watch the counts for a week. Compare to your access logs. Verify that the rule matches the volume of OpenAI activity you actually see. Once the rule is firing correctly on every legitimate OpenAI request, switch the action to Block or Challenge. The verification phase catches misconfigurations before they affect production. The same staged-rollout approach works whether you are deploying on Cloudflare, AWS, Fastly, or any other platform.
The IP-range based control surface is the difference between bot policy you can audit and bot policy you can only hope is being honored. Robots.txt is the polite request. The WAF rule grounded in OpenAI's published prefixes is the enforcement that backs the request up. Building both layers in alignment, with the automation that keeps the rules fresh, is the configuration that survives OpenAI's infrastructure changes, scraper evolution, and the inevitable WAF dashboard changes that happen when teams rotate.
If your team wants the full audit (current rules, fetch automation, alerting, and the verification window that confirms the policy is doing what it says), that work sits inside our generative engine optimization program. The configuration is platform-specific in the details and identical in the principles. Once it is in place, the maintenance burden is small enough to disappear into the background, which is exactly where bot policy belongs.