A common conversation we have with brands new to AI optimization goes like this. The marketing team shows the CMO a report claiming AI Search referral traffic is up 40% month over month. The CMO asks how much of that is real users versus bots. The marketing team says they think most of it is real but they have not actually separated the two. The CMO asks again. The team checks their analytics and discovers that some of what they have been calling AI Search traffic is actually OpenAI's bots crawling the site, not users clicking through from ChatGPT answers. The numbers were not wrong exactly, but they were not what the report implied either.
The cleanup is straightforward once you understand which bots visit your site, how each analytics platform classifies them, and where the gaps are. This piece walks through the practical audit: identifying the four named OpenAI bots in your data, configuring filters at the right layers, and producing reports that distinguish bot activity from user activity cleanly enough that executive audiences trust the numbers.
The Four Named OpenAI Bots And Where They Show Up
OpenAI operates at least four named bots in 2026. Each has a distinct purpose and a distinct presence in your analytics.
GPTBot is the training corpus crawler. It fetches pages periodically to ingest content for future GPT model training. The visits are scheduled, the volume is moderate, and the bot identifies itself with the GPTBot user agent string. The fetches are programmatic; no human is waiting on the other end. The bot is documented in OpenAI's bot reference.
OAI-SearchBot is the retrieval index crawler. It maintains the index ChatGPT search draws on, fetching pages more frequently than GPTBot to keep retrieval data fresh. The visits are programmatic too; the bot is building an index, not delivering content to a specific user.
ChatGPT-User is the user-proxy bot. It fires when a specific ChatGPT user takes an action that requires fetching a specific URL, like pasting a link into chat or clicking an inline citation. The fetches are user-initiated even though the fetcher is technically a bot.
OAI-AdsBot is the ad system landing page validator. It fetches pages running OpenAI ads to verify policy compliance. The visits are programmatic and infrequent for most publishers.
All four bots leave traces in your server logs, identifiable by user agent string and verifiable against OpenAI's published IP ranges. The question of how each one shows up in higher-level analytics platforms (GA4, Plausible, Mixpanel, Heap) is where the cleanup work happens.
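The verification half can be scripted. A minimal sketch, assuming OpenAI's published range file follows a prefixes-array JSON layout (verify against the live file) and that jq and grepcidr are installed:

```
# Fetch OpenAI's published GPTBot IP ranges, then find access-log lines
# whose client IP falls inside them. The JSON layout (a "prefixes" array
# of CIDR entries) is an assumption; check the live file before relying on it.
curl -s https://openai.com/gptbot.json \
  | jq -r '.prefixes[] | .ipv4Prefix // .ipv6Prefix' > /tmp/gptbot-cidrs.txt
grepcidr -f /tmp/gptbot-cidrs.txt /var/log/nginx/access.log
```

Log lines that carry an OpenAI user agent but no matching IP are spoofed traffic and belong in the ordinary bot bucket.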
Why The Distinction Matters
The four bots have different implications for your reporting. GPTBot, OAI-SearchBot, and OAI-AdsBot are pure bot traffic; they should not appear in user-behavior reports because no user is taking the action. ChatGPT-User sits in a gray zone; the visit is initiated by a real user even though the fetch is performed by a bot, and reasonable people disagree about how to count it. Most agencies and publishers treat the intent behind ChatGPT-User as real user demand while keeping the fetch itself out of user-behavior reports (the click-through that follows is what gets counted), but the practice varies. The companion piece on ChatGPT-User covers the distinction in more depth.
Why The Bot-Versus-User Split Matters For Reporting
Mixed bot and user data in your reports produces three classes of problem.
The first is misleading volume metrics. A 40% month-over-month increase in AI Search traffic looks great until you realize 30 of those percentage points came from OAI-SearchBot ramping up its crawl cadence on your site, not from more users clicking citations. The actual user-driven growth might be 10%. The bigger number lands better in a presentation but does not reflect the underlying buyer behavior change.
The second is broken funnel and conversion analysis. Bot visits do not convert; they fetch the page and leave. If bot visits are mixed into your AI Search channel, your conversion rate calculations are diluted by the bot share. A channel that appears to convert at 1.8% might actually convert at 2.6% once bot traffic is excluded. The lower number understates the channel's quality and can lead to misallocated investment.
The third is unreliable per-page analytics. Some bots crawl exhaustively, hitting every URL on your site including obscure pages that real users rarely see. The page-level traffic reports show these pages as more popular than they are. Content strategy decisions based on the inflated page rankings end up prioritizing pages that bots like rather than pages users actually engage with.
The cleanup is worth doing not for its own sake but because every downstream report depends on the underlying classification being correct. Once the bot-versus-user split is clean, the entire stack of metrics (volume, conversion, page popularity, time on site) becomes more trustworthy.
When The Cleanup Is Most Urgent
Brands that have crossed material AI traffic thresholds (say, 1-5% of total site sessions identifiable as AI-related) feel the impact of misclassification most acutely because the bot share can distort headline metrics meaningfully. Brands with minimal AI traffic can defer the cleanup, but the right time to start is before the AI traffic share crosses the threshold where misclassification produces visible reporting problems.
What GA4 Filters And What It Misses
GA4 has automatic bot filtering enabled in all properties; it is always on and cannot be switched off. The filter identifies known bots using a combination of Google research and the IAB's International Spiders and Bots List. The coverage is reasonably comprehensive for major bots but has known gaps.
What GA4 filters reliably:
- Googlebot in all its variants (Mobile, Desktop, Smartphone, Images, Video, News, AdsBot, etc.)
- Bingbot
- Most commercial scraping tools that identify themselves honestly (Ahrefs, Semrush, Moz, etc.)
- GPTBot, OAI-SearchBot, ChatGPT-User, and OAI-AdsBot: all four are on Google's list as of 2026 and are filtered by default
What GA4 misses:
- Bots that spoof user agents to appear as real browsers (intentional evasion)
- New bots that have not yet been added to Google's internal list (lag effect)
- Bots that GA4 identifies but classifies inconsistently across reports (rare but possible)
- Hits sent server-side (for example, via the Measurement Protocol) that bypass the client-side script and its user-agent checks entirely; these can produce phantom sessions
For typical commercial sites, GA4's default bot filtering catches the OpenAI bot fleet without any custom configuration. The remaining bot traffic that bypasses the filter is small and usually represents scrapers using browser fingerprinting evasion techniques, which is a different problem from AI bot management.
The diagnostic to confirm GA4's filtering is working: compare GA4 session counts against a server-log count for the same time period and segment. Server logs count requests rather than sessions, so expect the GA4 number to be substantially lower, with bot filtering as one of the reasons. If the counts are comparable, or GA4 is higher, something is configured wrong and the bot filter may not be applying correctly.
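A rough log-side proxy for the comparison, as a sketch; the date and log path are placeholder assumptions to adjust for your environment:

```
# Successful, non-asset, non-bot page requests for one day: a rough
# server-side stand-in for pageviews to sanity-check against GA4.
grep "12/Jan/2026" /var/log/nginx/access.log \
  | grep -viE "bot|crawl|spider" \
  | awk '$9 == 200 && $7 !~ /\.(css|js|png|jpe?g|gif|svg|ico|woff2?)$/' \
  | wc -l
```

The absolute numbers will never match GA4's; the ratio between the two views, tracked over time, is the signal.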
Custom Filters For Edge Cases
For brands that want belt-and-suspenders filtering, GA4 supports data filters under Admin > Data Settings > Data Filters. The native filter types cover internal traffic (defined by IP address rules in the data stream's tag settings) and developer traffic; GA4 does not offer user-agent or referrer-based data filters, so excluding a specific bot that bypasses the default filter requires tag-level logic or report-level segmentation instead.
The practical use case for the native filters is excluding internal traffic (employees, contractors, automated testing tools). Most brands do not need custom filtering for OpenAI bots specifically because the default filtering covers them.
Server Log Classification: The Source Of Truth
Server logs see every request that reaches your origin. Unlike GA4, which only captures sessions where the client-side JavaScript executes, server logs capture every fetch, including bot traffic that does not run JavaScript. The full bot picture lives in your server logs.
The classification workflow in server logs:
- Identify the user agent of each request.
- Cross-reference against published IP ranges for verification (covered in our CDN and WAF guide).
- Tag each request as bot or user based on the verified identification.
- Aggregate the tagged data into reports that show bot traffic and user traffic separately.
For Nginx or Apache using the standard combined log format, the user agent appears in the final quoted field of each log line, and a grep or awk pipeline extracts it. For more sophisticated analysis, log analyzers like GoAccess and AWStats or commercial platforms like Datadog provide pre-built bot classification.
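A minimal awk sketch of the tag-and-aggregate steps in the workflow above, assuming the nginx combined format (user agent in the sixth quote-delimited field):

```
# Tag each request as openai-bot or other by user agent, then aggregate.
awk -F'"' '{
  tag = ($6 ~ /GPTBot|OAI-SearchBot|ChatGPT-User|OAI-AdsBot/) ? "openai-bot" : "other"
  counts[tag]++
}
END { for (t in counts) printf "%-11s %d\n", t, counts[t] }' /var/log/nginx/access.log
```

Extending the regex to other engines' user agents turns the same pipeline into a general AI-bot report.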
The output of server-log classification is the highest-fidelity view of what is actually hitting your site. The numbers from this source can be reconciled against GA4 numbers to confirm filtering is working as expected. Differences between the two views are usually attributable to: GA4's client-side JavaScript not running on bot requests (which keeps most bots out of GA4 before the filter is even consulted), legitimate user traffic that did not execute the script (rare, often privacy-blocked browsers), or genuine misclassification on either side.
For brands with material AI traffic, server-log analysis is the right tool for the monthly bot share check. The data exists. The classification is reliable when paired with IP verification. The reports it produces are the definitive answer to "how much of our AI Search traffic is bots versus users."
The Reproducible Daily Sanity Check
A one-line command that produces a quick bot share view:
grep -E "GPTBot|OAI-SearchBot|ChatGPT-User|OAI-AdsBot" /var/log/nginx/access.log | wc -l
Compare to total request volume to compute the OpenAI bot share. Run daily as part of a server-log review routine. Anomalies (sudden spikes, unexpected user agents) surface immediately rather than weeks later.
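A slightly fuller version of the same check that prints the share directly; the log path is an assumption to adjust for your environment:

```
#!/usr/bin/env bash
# Daily OpenAI bot share: named-bot requests as a percentage of all requests.
LOG=/var/log/nginx/access.log
total=$(wc -l < "$LOG")
bots=$(grep -cE "GPTBot|OAI-SearchBot|ChatGPT-User|OAI-AdsBot" "$LOG")
awk -v b="$bots" -v t="$total" \
  'BEGIN { printf "OpenAI bots: %d of %d requests (%.1f%%)\n", b, t, t ? 100*b/t : 0 }'
```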
Configuring CDN-Level Bot Classification
CDN-level bot management adds a layer between server logs and the GA4 client. Cloudflare, Fastly, AWS WAF, and Akamai all offer bot management features that classify traffic before it reaches your origin and tag the requests with classification data.
The advantages of CDN-level classification:
- Earlier in the request lifecycle than server-log analysis, which means decisions can be made before the request reaches origin.
- More sophisticated than user-agent matching alone because the CDNs use behavioral fingerprinting, IP reputation, and TLS fingerprinting.
- Configurable actions (challenge, block, allow, rate-limit) that go beyond just classification.
- Native integration with analytics dashboards that show bot share over time without manual log analysis.
The disadvantages:
- CDN dashboards can misclassify OpenAI bots, especially if configured aggressively. The bots get blocked or challenged, which breaks your AI search visibility.
- The classification logic is opaque; you do not see exactly why the CDN classified a specific request the way it did.
- Costs add up at high traffic volumes.
For brands using Cloudflare, the AI Audit feature specifically tracks AI bot activity and produces dedicated reports. The feature is the right starting point for brands on Cloudflare who want CDN-level visibility into the bot share without custom configuration.
For brands using other CDNs, equivalent features exist but require more setup. AWS WAF supports custom rules that match on user agent and IP, with optional CloudWatch logging that produces analyzable data. Fastly's VCL supports the same pattern with more programmatic control. Akamai's Bot Manager is a paid product that handles AI bot classification natively but requires an enterprise plan tier.
The Layered Strategy
The most robust setup uses three layers. GA4 handles client-side bot filtering for user-behavior reporting. CDN bot management handles edge-level classification for traffic-shape analysis. Server logs serve as the source of truth for reconciliation and audit. Each layer has its strengths and the three together produce a complete picture that no single layer covers alone.
The Monthly Audit That Keeps Numbers Honest
Bot classification drifts over time. New bots emerge, existing bots change behavior, CDN classifications get updated, GA4 internal lists evolve. The monthly audit catches the drift before it produces stale or misleading reports.
A monthly audit checklist:
- Compare GA4 session counts to server-log session counts for the AI Search channel specifically. Document the ratio and watch for changes over time.
- Review the top user agents hitting your site by volume in server logs (a one-liner for this appears after the list). Confirm that no unfamiliar user agent has appeared in the top 20 without identification.
- Spot-check 5-10 requests claiming to be ChatGPT bots and verify they originated from OpenAI's published IP ranges. Cross-reference against openai.com/gptbot.json and the equivalent files for the other bots.
- Confirm any custom GA4 data filters are still in the Active state (Admin > Data Settings > Data Filters); filter states can revert in property migrations or admin changes. GA4's known-bot exclusion itself is always on and needs no toggle check.
- Review your CDN's bot management dashboard for any new rules that may have been added or modified since the last audit.
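The one-liner referenced in the user-agent review, again assuming the combined log format:

```
# Top 20 user agents by request volume.
awk -F'"' '{print $6}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -20
```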
The audit takes 30-60 minutes monthly once the process is set up. The output is a short report stating the current bot share, any anomalies detected, and any classification changes worth communicating to the broader team.
What Anomalies To Watch For
The patterns that warrant deeper investigation:
- A sudden spike in OpenAI bot fetch volume not explained by your own publishing cadence.
- A drop in identifiable AI user traffic that does not correspond to a drop in AI bot crawl activity (suggests user-side issue rather than bot-side issue).
- A new user agent string appearing in your server logs that matches a known AI engine but is not yet in your classification lists.
- CDN dashboard alerts about bot traffic from OpenAI's IP ranges that suggest the CDN may be misclassifying legitimate AI activity.
Catching these patterns early spares you the broader investigation that becomes necessary when downstream reports start producing unexpected numbers.
Reporting The Bot Share Without Burying The Signal
The right way to report bot share in monthly stakeholder views depends on the audience.
For technical and SEO audiences (in-house technical leads, agency teams), the bot share is part of the standard reporting. Include the bot-versus-user breakdown in the AI Search section of the monthly report. Note any classification changes in the past month. Show the trend over time to surface gradual drifts.
For marketing and executive audiences, the bot share is methodological rather than the primary signal. The primary metric is user traffic (the thing that produces revenue), not bot crawl activity. Include the bot share as a footnote or methodological caveat rather than a headline number. The framing: "Our AI Search user traffic was X sessions this month, up Y% month over month. Bot activity contributed an additional Z sessions, excluded from the user traffic counts."
For finance audiences, the bot share matters specifically for revenue attribution defense. Finance teams probe assumptions in reports they receive, and "we exclude bot traffic from our channel volume calculations" is the answer that demonstrates the underlying user volume is real. The bot share number itself does not need to be on the main report, but the methodology behind the exclusion should be available if asked.
The unifying principle is that bot share is methodological context, not a primary KPI. Reports that lead with bot share statistics confuse the audience about what matters. Reports that exclude bot traffic correctly from the primary KPIs and explain the methodology when relevant produce the right understanding.
Sample Footnote Language
"AI Search session counts in this report exclude identified bot traffic from GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot, and other AI engine crawlers. Bot traffic is filtered automatically by GA4 with verification against server logs and OpenAI's published IP ranges. Bot share in the same period was approximately 12% of all OpenAI-identifiable activity, separated from the user numbers shown."
The footnote lives in the methodology section and addresses the question most likely to come up in a careful review of the report.
Frequently Asked Questions
Does GA4 actually filter ChatGPT-User as a bot?
GA4 filters ChatGPT-User as a bot by default, which is a defensible default for client-side analytics even if it is not the framing every brand wants. ChatGPT-User is technically a bot in that it is a programmatic fetcher, but it is initiated by a real user. For most reporting purposes the cleaner approach is to let the fetch itself be filtered as bot traffic and count the actual user click that opens the resulting URL in a browser, which preserves the user-initiated signal without double-counting. Brands that want to treat the ChatGPT-User fetches themselves as user activity can override the filter, but the override is unusual and requires care to avoid double-counting.
What if my CDN is blocking GPTBot and I want it to allow the bot?
Check your CDN's bot management settings. Cloudflare's super-bot-fight mode can be configured to block AI bots indiscriminately. AWS WAF's bot control has similar options. The fix is at the CDN dashboard rather than in robots.txt or in GA4. Once the CDN is configured to allow GPTBot (and the other OpenAI bots), the bot traffic resumes and your AI search visibility recovers within OpenAI's normal propagation window. Cloudflare's AI Audit documentation covers the configuration for that platform; the same principle applies to other CDNs.
How do I tell which specific OpenAI bot is responsible for a given log entry?
The user agent string identifies the bot. GPTBot identifies as "GPTBot" plus version information. OAI-SearchBot identifies as "OAI-SearchBot" plus version. ChatGPT-User identifies as "ChatGPT-User" plus version. OAI-AdsBot identifies as "OAI-AdsBot" plus version. The companion piece on reading OAI-SearchBot crawl logs walks the per-bot identification patterns in detail.
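A quick per-bot tally from server logs, as a sketch against the default nginx log path:

```
# Request count per named OpenAI bot.
for bot in GPTBot OAI-SearchBot ChatGPT-User OAI-AdsBot; do
  printf "%-14s %s\n" "$bot" "$(grep -c "$bot" /var/log/nginx/access.log)"
done
```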
Will GA4's bot filtering get better at distinguishing user-proxy bots from scheduled crawlers over time?
Probably yes. Google has been improving GA4's bot classification continuously since the product launched, and AI bot detection has been one of the focus areas in recent updates. The trajectory is toward more nuanced classification that distinguishes scheduled crawlers from user-proxy bots, which would let GA4 count ChatGPT-User as user traffic by default if desired. For now, brands wanting that classification have to override the default; the workaround is becoming less necessary over time but is still required as of 2026.
How does this interact with our broader analytics privacy posture?
The bot filtering work is orthogonal to privacy considerations. Bot filtering removes non-human traffic from your reports, which is unrelated to whether you are honoring user consent, GDPR/CCPA, or other privacy requirements for the human traffic that remains. The two streams of work can be tackled independently. Privacy-friendly analytics tools (Plausible, Fathom, Simple Analytics) typically include bot filtering similar to their commercial competitors', so the choice of analytics platform does not hinge on the bot management strategy.
Bot classification is one of those operational details that does not feel strategic until misclassification produces an executive report nobody trusts. The work to get it right is modest: GA4's default filtering plus monthly server-log reconciliation plus CDN-level visibility for brands at meaningful scale. The payoff is reports that withstand scrutiny and metrics that drive correct decisions.
If your team wants the full classification audit (with the GA4 review, the server-log reconciliation, the CDN configuration check, and the monthly reporting template), that work sits inside our generative engine optimization program. The bots are real, the user traffic is real, and the cleanup is what separates the two so the team can act on each appropriately.
Ready to optimize for the AI era?
Get a free AEO audit and discover how your brand shows up in AI-powered search.
Get Your Free Audit