GEO · Aug 7, 2025 · 12 min read

HTTP Headers That Influence AI Crawl: X-Robots-Tag, Cache-Control, And Vary

Capconvert Team

GEO Strategy

TL;DR

  • HTTP response headers shape how AI crawlers (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, GoogleOther, Bytespider, CCBot) fetch, cache, and index pages; misconfigurations silently throttle months of content investment before any of it reaches the retrieval index.
  • X-Robots-Tag controls per-page indexing at the HTTP layer (noindex, nofollow, noarchive, nosnippet) and works on non-HTML resources like PDFs, images, and JSON files where the robots meta tag cannot reach; bot-specific syntax like X-Robots-Tag: googlebot: noindex enables per-crawler exclusion.
  • Cache-Control governs how often crawlers re-fetch: max-age values of 3600 to 86400 seconds (1 hour to 1 day) suit blog content with occasional updates; long-cache values measured in months delay indexing of updates, while CDN defaults of private or no-store actively prevent AI bot indexing.
  • Vary: User-Agent is the underused header that enables differential serving (Markdown to AI bots, HTML to humans) without cache poisoning; the technique is foundational to llms.txt and llms-full.txt workflows and is implementable through Cloudflare Workers, Vercel Edge Functions, or AWS Lambda@Edge.
  • Content-Type with an explicit UTF-8 charset, accurate Content-Encoding (gzip or Brotli), and consistent Last-Modified/ETag headers for conditional revalidation round out the technical foundation.
  • Six recurring mistakes consistently throttle AI visibility: default CDN no-store or private Cache-Control, missing Vary: User-Agent on differentially served pages, accidental noindex headers leaking from staging or CMS plugins, inconsistent Content-Type across CDN nodes, missing charset declarations on international pages, and aggressive long-cache values that prevent content updates from propagating.
  • User-Agent detection should combine string matching with published IP range verification (OpenAI, Anthropic, Google, and Perplexity each publish their crawler IPs) rather than trusting the User-Agent header alone. Audit production headers with curl -I before scaling content investment.

A SaaS company has been investing heavily in content marketing. The articles are well-written, technically optimized, and properly structured. Six months in, their AI citation rates have barely moved. An audit reveals the cause: their CDN's default cache configuration sends headers that mark pages as private and not cacheable, telling AI crawlers (which respect these signals) that the content should not be indexed for downstream use. The crawlers fetched, then dropped. The brand had been invisible despite the content investment because the HTTP headers had been silently blocking the channel.

HTTP headers are the underweighted technical layer of GEO. Most discussions of AI visibility focus on content, schema, and link structure. The headers your server sends shape what happens before any of that content gets read. Misconfigured headers can silently sabotage months of content work. Correct headers serve as the technical foundation that lets everything else compound.

This guide unpacks the specific headers that matter for AI crawl behavior, what each does, and how to configure them for visibility without exposing content you intend to gate.

Why HTTP Headers Matter For AI Crawl Behavior

HTTP headers travel with every response from your server. They communicate metadata to the client (the browser, the search engine bot, the AI crawler) about the response itself: how to cache it, how to index it, what content type it is, when it was last modified, what other versions exist.

AI crawlers respect headers similarly to how search engine bots do. The header conventions developed for SEO purposes carry over to AI crawling because the crawlers share architecture with search engine crawlers. The robots meta tags and X-Robots-Tag header tell crawlers whether to index a page. The Cache-Control header tells crawlers how long their cached version is valid. The Vary header tells them which request attributes affect the response.

The implications for AI visibility are substantial. A header that says "do not index this page" prevents the page from entering the AI engine's retrieval index regardless of how good the content is. A header that says "this response varies by Accept header" enables content negotiation that can serve AI bots Markdown while serving humans HTML. A header that sets aggressive cache-control values reduces the frequency of crawler re-fetches, which affects how quickly content updates propagate.

Most brands ship default headers from their hosting platform or CDN without auditing them. The defaults are often suboptimal for AI visibility. Auditing and correcting the headers is one-time work with long-tail benefits.

X-Robots-Tag: Page-Level Indexing Control

X-Robots-Tag is the HTTP header equivalent of the robots meta tag. It controls whether and how crawlers index pages. Unlike the meta tag, the X-Robots-Tag header works on non-HTML resources (PDFs, images, JSON files, plain text files) too.

The most relevant values are noindex (do not include in search index), nofollow (do not follow links on this page), noarchive (do not cache or display cached version), and nosnippet (do not display a snippet in search results). Combinations like noindex,nofollow are common for pages the brand wants to exclude entirely.

For AI engines, X-Robots-Tag operates similarly to how it does for search engines. A page marked noindex by the header is excluded from the engine's retrieval index. The exclusion is the same as the robots meta tag noindex but works at the HTTP layer.

The practical use cases include: excluding draft or preview pages from indexing, excluding internal-facing pages that happen to be publicly accessible, excluding paginated archive pages that duplicate content, and excluding utility pages (sitemaps in HTML form, search result pages, filtered listing pages).

Bot-specific X-Robots-Tag is supported by Google and some other engines. A header like X-Robots-Tag: googlebot: noindex excludes the page from Google specifically while allowing other crawlers to index it. The same pattern can be used with GPTBot, ClaudeBot, or other named AI crawlers.
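
As a concrete illustration, here is a minimal sketch of how the header could be attached at the edge, assuming a Cloudflare Worker sits in front of the origin; the PDF rule and the /internal/ path prefix are illustrative assumptions, not prescriptions.

```ts
// Minimal Cloudflare Worker sketch: attach X-Robots-Tag at the edge.
// The PDF rule and the /internal/ prefix are illustrative assumptions.
export default {
  async fetch(request: Request): Promise<Response> {
    const originResponse = await fetch(request);
    // Re-wrap the response so its headers are mutable.
    const response = new Response(originResponse.body, originResponse);
    const { pathname } = new URL(request.url);

    // Keep PDFs out of indexes at the HTTP layer, where meta tags cannot reach.
    if (pathname.endsWith(".pdf")) {
      response.headers.set("X-Robots-Tag", "noindex, nofollow");
    }

    // Bot-specific exclusion: hide /internal/ pages from GPTBot only.
    if (pathname.startsWith("/internal/")) {
      response.headers.append("X-Robots-Tag", "GPTBot: noindex");
    }

    return response;
  },
};
```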

Misconfigured X-Robots-Tag is the most common header mistake we see in audits. Pages intended to be indexed are accidentally marked noindex. Catch this through the audit techniques in our piece on robots.txt for AI crawlers and through tools like Google Search Console's URL Inspection.

Cache-Control And The AI Crawl Refresh Interval

Cache-Control is the header that tells clients (including crawlers) how to cache the response. The header has many possible directives; the ones most relevant to AI crawl behavior are max-age, s-maxage, must-revalidate, private, public, and no-store.

For AI engine retrieval, the practical effects are:

  • max-age tells the crawler how long its cached version is valid. A page with max-age=86400 (24 hours) tells the crawler not to re-fetch for at least a day; max-age=2592000 (30 days) holds it off for a month.
  • s-maxage applies to shared caches (CDNs, proxies). It is often the more relevant directive for AI crawlers that are routed through shared infrastructure.
  • must-revalidate tells the crawler to verify with the server before reusing its cached version after expiration. This produces conditional requests that confirm the page has not changed.
  • private tells shared caches not to store the response. Setting private can prevent some AI bots from indexing the page; use it only when the page genuinely should not sit in shared caches.
  • public is the inverse of private: it explicitly allows shared caching.
  • no-store tells caches not to store the response at all. AI bots that respect no-store will not keep the content for downstream indexing.

The strategic implication is that Cache-Control affects how quickly content updates propagate to AI engines. A site with aggressive long-cache values has slower update propagation. A site with short or no-cache values has faster propagation but more crawler load.

The typical optimization is moderate cache values (1 to 24 hours for content pages, longer for static assets) combined with conditional revalidation. The pattern balances update speed with crawler load.
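
A sketch of that pattern, written here as Next.js middleware on Vercel; the path prefixes and cache values are assumptions to adapt, not recommendations.

```ts
// Hedged sketch: moderate cache values for content pages, long cache for
// fingerprinted static assets, expressed as Next.js middleware.
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";

export function middleware(request: NextRequest): NextResponse {
  const response = NextResponse.next();
  const { pathname } = request.nextUrl;

  if (pathname.startsWith("/blog/")) {
    // Content pages: cacheable for an hour by clients, a day by shared
    // caches, then revalidated before reuse.
    response.headers.set(
      "Cache-Control",
      "public, max-age=3600, s-maxage=86400, must-revalidate"
    );
  } else if (pathname.startsWith("/static/")) {
    // Fingerprinted static assets: safe to cache for a year.
    response.headers.set("Cache-Control", "public, max-age=31536000, immutable");
  }

  return response;
}
```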

For brands shipping frequent updates (news sites, ecommerce with frequent inventory changes), shorter caches and proper conditional revalidation are essential. For brands shipping infrequent updates (corporate sites with static content), longer caches reduce crawler overhead without hurting visibility.

Vary: Content Negotiation And AI Bot Differential Serving

Vary is one of the most underused headers for AI optimization. The header tells caches and clients which request headers affect the response. Vary values commonly include Accept, Accept-Encoding, Accept-Language, and User-Agent.

The relevance to AI bots is that Vary: User-Agent enables differential serving based on the user agent. A server can return different content to ChatGPT-User, ClaudeBot, Googlebot, and human browsers. The technique allows specific optimizations for AI bots without changing the human user experience.

The common AI optimization patterns this enables include: serving Markdown to AI bots while serving HTML to humans (Accept-based or User-Agent-based negotiation), serving llms.txt-friendly responses to specific user agents, serving simplified pages without modal interference to AI bots, and serving structured data more prominently to crawlers.

Implementing Vary correctly is important. A response that varies by User-Agent without setting Vary: User-Agent can produce cache poisoning where different bots see the wrong cached version. The header must accurately describe what the response varies by.

The practical implementation is at the CDN or origin server level. Cloudflare Workers, Vercel Edge Functions, and AWS Lambda@Edge all support response differentiation based on user agent with proper Vary header setting.
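
For example, a minimal Cloudflare Worker sketch of the pattern might look like the following; the bot pattern and the /md/ mirror path are assumptions, and a production version would also verify crawler IPs as covered in the bot detection section below.

```ts
// Minimal sketch of User-Agent-based differential serving with a correct
// Vary header. The bot pattern and the /md/ mirror path are assumptions.
const AI_BOT_PATTERN = /GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|PerplexityBot/i;

export default {
  async fetch(request: Request): Promise<Response> {
    const userAgent = request.headers.get("User-Agent") ?? "";
    const url = new URL(request.url);

    // AI crawlers get a pre-rendered Markdown mirror; everyone else gets the HTML page.
    const upstream = AI_BOT_PATTERN.test(userAgent)
      ? await fetch(`${url.origin}/md${url.pathname}.md`)
      : await fetch(request);

    // Declare that the response differs by User-Agent so shared caches never
    // hand a crawler the human variant or vice versa.
    const response = new Response(upstream.body, upstream);
    response.headers.set("Vary", "User-Agent");
    return response;
  },
};
```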

We have written about serving Markdown to AI bots in more depth elsewhere. The Vary mechanism is the technical foundation of that workflow.

Why Vary Matters For CDN-Cached Sites

Sites served through CDNs without proper Vary headers may serve cached human responses to AI bots or vice versa. The mis-serving can produce confusing data: an AI bot that received the human-targeted modal-heavy page concludes the site is unreadable, even though a properly differentiated response would have served clean content. The cost of misconfigured Vary is often invisible until an audit surfaces it.

User-Agent Headers And Bot Detection

User-Agent is a request header (sent by the client), not a response header (sent by the server), but it is so consequential for AI bot interaction that it warrants discussion.

The major AI bots identify themselves with specific User-Agent strings. The principal ones as of mid-2025 include:

  • GPTBot (OpenAI's training crawler)
  • OAI-SearchBot (OpenAI's retrieval crawler for ChatGPT search)
  • ChatGPT-User (OpenAI's user-driven retrieval, including Atlas)
  • ClaudeBot (Anthropic's training crawler)
  • Claude-User (Anthropic's user-driven retrieval crawler, when distinct from ClaudeBot)
  • PerplexityBot (Perplexity's indexing crawler)
  • Perplexity-User (Perplexity's in-session retrieval)
  • Google-Extended (Google's AI training opt-out token)
  • GoogleOther (Google's miscellaneous crawler including AI features)
  • Bytespider (ByteDance's crawler)
  • Diffbot (third-party crawler used by many engines)
  • CCBot (Common Crawl's crawler, feeds many training corpora)

Server-side User-Agent detection allows the brand to apply specific behavior per crawler. The most common use cases are differential content (Markdown for one, HTML for another), differential headers (longer cache for some, shorter for others), and access control (blocking some, allowing others).

The detection should rely on multiple signals beyond User-Agent alone. User-Agent strings can be spoofed; checking the source IP against published crawler ranges confirms legitimate bot traffic. OpenAI, Anthropic, Google, and Perplexity all publish their crawler IP ranges in their respective developer documentation.

For brands implementing differential serving at scale, treating User-Agent detection as a multi-input classifier (User-Agent string plus IP range plus behavior pattern) produces more reliable bot identification than User-Agent alone.
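
A rough sketch of that classifier in TypeScript, using a simple IPv4 CIDR check; the profile below uses a reserved documentation range as a placeholder, and the real range lists must come from each vendor's published documentation.

```ts
// Rough sketch of multi-signal bot verification: User-Agent match plus a
// source IP check against published ranges. The CIDR below is a placeholder.
interface BotProfile {
  name: string;
  uaPattern: RegExp;
  cidrs: string[]; // published IPv4 ranges, refreshed periodically
}

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + parseInt(octet, 10), 0) >>> 0;
}

function ipInCidr(ip: string, cidr: string): boolean {
  const [range, bitsStr] = cidr.split("/");
  const bits = parseInt(bitsStr, 10);
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(range) & mask);
}

function verifyBot(userAgent: string, clientIp: string, profiles: BotProfile[]): string | null {
  for (const profile of profiles) {
    const uaMatches = profile.uaPattern.test(userAgent);
    const ipMatches = profile.cidrs.some((cidr) => ipInCidr(clientIp, cidr));
    if (uaMatches && ipMatches) {
      return profile.name; // the UA claim is confirmed by the source IP
    }
  }
  return null; // unverified: treat as a regular client or rate-limit
}

// Example profile only; 192.0.2.0/24 is a reserved documentation range.
const profiles: BotProfile[] = [
  { name: "GPTBot", uaPattern: /\bGPTBot\b/i, cidrs: ["192.0.2.0/24"] },
];
```

At the edge, the client IP typically comes from the platform (for example, the CF-Connecting-IP header on Cloudflare), and the behavior-pattern signal mentioned above can gate verification further.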

Compression And Content-Type Headers For AI Bots

Content-Type and Content-Encoding headers affect how AI bots parse responses. The correct settings ensure bots interpret the content as intended.

Content-Type should accurately reflect the response format: text/html for HTML pages, text/markdown for Markdown files (used for llms.txt and llms-full.txt), application/ld+json for JSON-LD served as a standalone resource, and application/xml for XML sitemaps. Mismatched Content-Type headers can cause bots to misinterpret or skip responses.

Content-Encoding tells the bot how the response is compressed. gzip and br (Brotli) are the most common. AI bots typically handle both, but the server should honor the request's Accept-Encoding header; serving an encoding the client did not advertise can produce parsing failures.

Charset specification in Content-Type matters for international sites. Content-Type: text/html; charset=utf-8 explicitly declares UTF-8 encoding. Missing charset declarations can produce misencoded text that AI bots misread, particularly for non-Latin scripts.

The headers should be consistent across the site. Production environments should not return different Content-Type for the same resource based on which CDN node served it. Audit headers with curl or browser developer tools to verify consistency.
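
As one concrete case, a worker that fronts the llms.txt files could pin the type and charset explicitly. This is a sketch under the assumption that the files are proxied from the origin; text/markdown is one reasonable choice of type (text/plain is also common for llms.txt).

```ts
// Sketch: pinning Content-Type and charset for llms.txt files at the edge.
// Assumes the files themselves live at the origin.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/llms.txt" || url.pathname === "/llms-full.txt") {
      const upstream = await fetch(request);
      return new Response(upstream.body, {
        status: upstream.status,
        headers: {
          "Content-Type": "text/markdown; charset=utf-8",
          "Cache-Control": "public, max-age=3600",
        },
      });
    }

    return fetch(request);
  },
};
```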

Six Header Mistakes That Hurt AI Crawl Quality

Six recurring header mistakes consistently affect AI crawl quality.

  1. Default CDN no-store or private Cache-Control. Many CDN defaults set Cache-Control to private or no-store. The settings prevent caching and reduce AI bot indexing. Set explicit public, max-age values on content pages.
  2. Missing Vary: User-Agent on differentially-served pages. Sites that serve different content based on User-Agent without setting Vary: User-Agent risk cache poisoning. Set the header explicitly.
  3. Accidental X-Robots-Tag: noindex on production pages. CMS plugins or staging configurations sometimes leak noindex headers into production. Audit headers with curl on production URLs to confirm intended values.
  4. Inconsistent Content-Type for the same URL. Headers should be deterministic per URL. Inconsistencies across CDN nodes or origin servers confuse bots.
  5. Missing charset in Content-Type for international pages. UTF-8 should be explicit in the Content-Type header for any page with non-Latin content.
  6. Aggressive long-cache headers preventing content updates. max-age values measured in months prevent timely indexing of content updates. Use moderate values with conditional revalidation.

Frequently Asked Questions

How do I audit the HTTP headers on my production site?

curl is the simplest tool. curl -I https://yourdomain.com/some-page returns the headers for that URL. The -I flag sends a HEAD request, so only the headers come back. Compare the headers to what you intend. Browser developer tools (Chrome DevTools Network tab) show headers for any loaded resource. For systematic auditing across hundreds of pages, scripted curl runs or specialized tools like Sitebulb produce the inventory.
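
For a scripted version, a rough TypeScript sketch (Node 18+ or any runtime with a global fetch, run as an ES module; the URL list is a placeholder) could look like this:

```ts
// Rough audit sketch: send HEAD requests to a list of URLs and print the
// headers that matter for AI crawl. The URL list is a placeholder.
const urls = [
  "https://yourdomain.com/",
  "https://yourdomain.com/blog/some-post",
];

const headersToCheck = [
  "x-robots-tag",
  "cache-control",
  "vary",
  "content-type",
  "content-encoding",
  "etag",
  "last-modified",
];

for (const url of urls) {
  const response = await fetch(url, { method: "HEAD", redirect: "manual" });
  console.log(`\n${url} -> ${response.status}`);
  for (const name of headersToCheck) {
    console.log(`  ${name}: ${response.headers.get(name) ?? "(not set)"}`);
  }
}
```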

Does Cloudflare or Vercel automatically configure AI-friendly headers?

Both have moved toward more AI-friendly defaults through 2025. Cloudflare's AI bot management features include header configuration options. Vercel's Edge Network handles common cases well. That said, defaults are not always optimal for every brand; auditing and customizing is still worthwhile.

Should I use X-Robots-Tag instead of robots meta tag?

For HTML pages, either works and they are equivalent. The advantage of X-Robots-Tag is that it works on non-HTML resources (PDFs, images, plain text files) where meta tags are not available. Many brands use the meta tag for HTML and X-Robots-Tag for everything else, which is reasonable.

What is the right max-age value for a typical blog post?

It depends on update frequency. For posts that may receive substantive updates (clarifications, fact corrections, formatting improvements), a max-age of 3600 to 86400 (1 hour to 1 day) keeps the content fresh in AI engine caches without excessive crawler load. For posts that rarely change, a longer max-age (1 week to 1 month) reduces crawler load.

Does my Last-Modified or ETag header affect AI bot behavior?

Yes. AI bots use these headers for conditional revalidation. When a bot has a cached version and re-fetches, the server can respond with 304 Not Modified if the Last-Modified or ETag matches, avoiding the full response transfer. Correct implementation of these headers reduces bandwidth without reducing crawl coverage.
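
A minimal sketch of the mechanism, assuming the handler computes its own ETag (most stacks delegate this to the origin server or CDN); the page body and hashing approach are illustrative.

```ts
// Sketch of ETag-based conditional revalidation in an edge handler.
export default {
  async fetch(request: Request): Promise<Response> {
    const body = await renderPage(request); // assumed page-rendering helper
    const etag = `"${await sha256Hex(body)}"`;

    // If the crawler already holds this exact version, confirm it without
    // re-sending the body.
    if (request.headers.get("If-None-Match") === etag) {
      return new Response(null, { status: 304, headers: { ETag: etag } });
    }

    return new Response(body, {
      headers: {
        ETag: etag,
        "Cache-Control": "public, max-age=3600, must-revalidate",
        "Content-Type": "text/html; charset=utf-8",
      },
    });
  },
};

// Placeholder for whatever actually produces the page.
async function renderPage(_request: Request): Promise<string> {
  return "<!doctype html><html><body>…</body></html>";
}

async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}
```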

Should I block AI bots from accessing my staging or development sites?

Yes. Staging and development sites should either require authentication (sit behind a login) or send X-Robots-Tag: noindex, nofollow plus a robots.txt disallow. AI bots indexing staging content can leak unreleased information into training corpora.

HTTP headers are the technical layer that controls how AI bots interact with your site before they ever read the content. Misconfigurations are invisible until they are not, at which point months of content investment may have been silently throttled.

The audit and correction work is one-time effort with long-tail benefits. Verify X-Robots-Tag settings on production pages. Tune Cache-Control for update frequency. Set Vary headers properly when serving differential content. Identify bots with multi-input classifiers, not User-Agent alone. The work compounds because every content publication after the correction inherits the corrected header configuration.

If your team wants help auditing your HTTP header configuration for AI crawl optimization, including the differential serving setup if relevant, that work sits inside our generative engine optimization program. The brands cited well by AI engines are the brands whose technical foundation lets the content reach the engines reliably.

Ready to optimize for the AI era?

Get a free AEO audit and discover how your brand shows up in AI-powered search.

Get Your Free Audit