Most published commentary on ChatGPT ranking factors comes from vendor statements, speculation, or small-scale anecdotes. Few publishers run the kind of systematic citation analysis that could distinguish actual patterns from noise. Over the past several months, we logged and analyzed 1,000 ChatGPT search citations across 200 buyer-research queries spanning ecommerce, B2B SaaS, professional services, and consumer research categories. The findings are not a definitive list of ranking factors, because ChatGPT's retrieval system is opaque from the outside. They are empirical patterns that replicated across multiple test runs and produced statistically meaningful effects large enough to inform operational decisions.
Seven findings stood out. Some confirmed conventional wisdom (Bing ranking matters a lot). Some refined it (the 87% Bing-overlap figure widely cited in industry commentary turns out to be closer to 80% across our broader query set). Some contradicted popular claims (FAQ schema does not appear to move citations independently of content quality). This piece walks the methodology, the findings, and the operational implications, with the honest caveats about what 1,000 citations can and cannot establish.
The companion piece on the OpenAI ranking black box walks the methodology side of this work in more depth, and the search_context_size technical guide covers one of the underlying mechanisms that produces several of the findings below.
The Research Methodology
The data collection ran from late 2025 through early 2026 with the following structure.
- Query set - 200 queries spanning four category buckets (50 each): ecommerce comparison queries ("best electric toothbrush for sensitive teeth"), B2B SaaS evaluation queries ("Slack vs Microsoft Teams for 50-person companies"), professional services research queries ("how to choose a financial advisor"), and consumer research queries ("are smart watches worth it in 2026"). Queries were selected to span commercial intent and topical diversity rather than to bias toward any particular outcome.
- Citation logging - Each query was run in ChatGPT through the standard consumer interface and through the OpenAI API with web search enabled, with the source citation list logged per response. The dual collection helps separate session-specific variance (which is real and observable) from underlying ranking patterns.
- Source profiling - For each cited URL, we logged the source domain, its Bing ranking for the originating query, its content length, presence and quality of structured data, author attribution, publication date, and several other features. The features were chosen to align with hypotheses about what might predict citation outcomes.
- Total records - The 1,000-citation figure represents approximately 5 citations per query averaged across 200 queries, with variance from queries that produced fewer citations (some short answers cited only 1-2 sources) to queries that produced many (some longer answers cited 10+). The data was cleaned to remove duplicates and obvious errors before analysis.
- Statistical approach - For each hypothesized factor, we computed the correlation between the feature and citation outcomes (cited or not, citation density when cited). Where possible, we ran A/B comparisons (matched pages with and without a specific feature) to isolate causal effects from correlations. The findings below distinguish between correlational evidence and the smaller number of cases where controlled testing produced cleaner causal inference.
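To make that comparison step concrete, here is a minimal sketch of the kind of per-factor computation involved, assuming a flat log of profiled URLs. The field names and toy rows are hypothetical rather than drawn from our dataset; the sketch compares citation rates for pages with and without a given feature and adds a simple two-proportion z-statistic as a rough significance check.

```python
import math

# Hypothetical per-URL log: one row per profiled URL with its citation outcome
# and the feature under test (field names and toy rows are illustrative).
records = [
    {"url": "https://example.com/a", "cited": True,  "named_author": True},
    {"url": "https://example.com/b", "cited": True,  "named_author": True},
    {"url": "https://example.com/c", "cited": False, "named_author": True},
    {"url": "https://example.com/d", "cited": True,  "named_author": False},
    {"url": "https://example.com/e", "cited": False, "named_author": False},
    {"url": "https://example.com/f", "cited": False, "named_author": False},
]

def compare_citation_rates(rows, feature):
    """Citation rate for pages with vs. without a feature, plus a
    two-proportion z-statistic (normal approximation) as a rough check."""
    with_feature = [r for r in rows if r[feature]]
    without_feature = [r for r in rows if not r[feature]]
    n1, n2 = len(with_feature), len(without_feature)
    p1 = sum(r["cited"] for r in with_feature) / n1
    p2 = sum(r["cited"] for r in without_feature) / n2
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return {
        "rate_with": round(p1, 3),
        "rate_without": round(p2, 3),
        "lift": round(p1 / p2, 2) if p2 else None,
        "z_stat": round((p1 - p2) / se, 2) if se else None,
    }

print(compare_citation_rates(records, "named_author"))
```

The same pattern extends to any binary feature in a source profile; continuous features such as content length or review counts need correlations or bucketed comparisons instead.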
What This Data Cannot Establish
A 1,000-citation dataset is large enough for many findings but not for all. Effects smaller than 10-15% are hard to detect with statistical confidence at this sample size. Effects that interact (where two factors together produce a result that neither produces alone) are difficult to disentangle without more sophisticated modeling. Patterns specific to non-English queries, regional content, or highly specialized verticals may not generalize from our sample. Reading the findings as informative-but-not-definitive is the right framing.
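For readers who want to gut-check the detectability point, a back-of-envelope version of the standard two-proportion power calculation looks like the sketch below. The group sizes, baseline rate, and thresholds are illustrative assumptions, not figures from the study.

```python
import math

def min_detectable_difference(n_per_group, baseline_rate, z_alpha=1.96, z_power=0.84):
    """Smallest absolute citation-rate difference detectable with roughly 80% power
    at the 5% significance level, using a normal approximation for two proportions."""
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_power) * se

# Illustrative inputs only: two comparison groups of ~500 pages each
# and a hypothetical 30% baseline citation rate.
mde = min_detectable_difference(n_per_group=500, baseline_rate=0.30)
print(f"Minimum detectable absolute difference: {mde:.1%}")  # around 8 percentage points here
```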
Finding 1: The Bing Ranking Correlation
Across the 1,000 citations, 80% of cited URLs ranked in the top 10 Bing results for the corresponding query at the time of the test. The widely circulated 87% figure from earlier published analyses turns out to be on the high end. Our broader query set produced a slightly lower correlation, but the direction of the finding is unambiguous: Bing top-10 inclusion is strongly correlated with ChatGPT citation, and Bing top-3 inclusion correlates even more strongly (roughly 60% of cited URLs were Bing top-3 for the query).
The 20% of citations that came from outside Bing's top 10 fell into two patterns. Wikipedia citations were heavily represented (about 8% of total citations came from Wikipedia, often ranked outside Bing's top 10 for the same query). High-authority aggregator and reference sites (G2, Capterra, TrustRadius, large encyclopedia-style sites) accounted for another 6%. The remaining 6% was scattered across specialty publishers that had unusual authority for specific topics.
The operational implication is clear: Bing indexation and ranking matter enormously for ChatGPT visibility. Brands that are not in Bing's top 10 for their target queries face an uphill battle for ChatGPT citations even when their content quality is high. The companion piece on why Bing indexation matters again covers the strategic implications of this dependency.
What "Ranking In Bing" Actually Requires
The Bing dependency is operational, not just analytical. Brands that have neglected Bing for the past decade often discover that they rank poorly in Bing relative to Google for the same queries. Bing's algorithm weights some signals differently from Google's (more weight on exact-match keywords and backlink quantity, less on backlink quality nuances and topical authority depth), which means content optimized purely for Google can underperform in Bing. Recalibrating for both engines is one of the foundational investments for ChatGPT visibility.
Finding 2: Content Length And Multiple Citations Per Source
URLs that received multiple citations within the same ChatGPT response were dramatically longer on average than URLs that received single citations. The median page length for single-citation URLs was around 1,200 words. The median for URLs cited 2 or more times in the same response was around 3,400 words. The pattern was even more pronounced for URLs cited 3+ times, where median length exceeded 4,500 words.
The mechanism is consistent with what we know about search_context_size: the per-source extraction budget allows the model to read multiple sections of a long page and pull citations from each. Short pages provide fewer extractable sections, so they tend to produce a single citation when cited at all. Long pages with multiple substantive sections produce multiple citation hooks per source.
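The comparison behind this finding is straightforward to reproduce on your own citation log. A minimal sketch, assuming a per-response record of how many times each URL was cited within a single answer and the page's word count (field names and rows are illustrative):

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-response rows: (citations the URL earned within one answer, page word count).
rows = [
    (1, 1100), (1, 1350), (1, 900),
    (2, 3200), (2, 3600),
    (3, 4800),
]

lengths_by_bucket = defaultdict(list)
for citations_in_response, word_count in rows:
    # Bucket pages by how many citations they earned inside a single response.
    bucket = "3+" if citations_in_response >= 3 else str(citations_in_response)
    lengths_by_bucket[bucket].append(word_count)

for bucket in sorted(lengths_by_bucket):
    print(f"{bucket} citation(s): median length {median(lengths_by_bucket[bucket])} words")
```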
The operational implication is that content depth on high-priority topics produces compounding citation outcomes. A single 4,000-word pillar piece is not just longer than four 1,000-word pieces; it can earn 2-3x more citations per ChatGPT response while also winning regular search engagement that the shorter pieces would not capture. The depth investment compounds across the citation pipeline.
The pattern also explains why the same brands often appear multiple times in the source list of a single Deep Research report. Long-form substantive content on a topic gives the synthesis system multiple distinct citation hooks, and a brand with deep coverage on a topic ends up over-represented in the eventual source list.
The Caveat About Padding
Length without depth does not produce the effect. URLs that were long but generic (rambling content with little specific information) did not earn multiple citations any more often than short pages on the same topic. The depth is what matters; length is a byproduct. Brands trying to game the pattern by adding filler content to existing pages will not see the citation lift because the underlying citation hooks are still missing.
Finding 3: Named-Author Bylines And Citation Rate
Pages with named-author bylines and verifiable author profiles (LinkedIn, public conference appearances, prior publications) earned citations at approximately 1.4x the rate of comparable pages with anonymous or generic ("Editorial Team") attribution. The effect was visible across all four category buckets in our test set, with the strongest manifestation in professional services and B2B SaaS queries.
The mechanism appears to be the retrieval system's trust scoring. Author authority is one of the signals the system uses to weight source credibility, and verifiable author profiles produce stronger trust signals than anonymous attribution. The effect compounds across multiple articles by the same author, because the author's verified profile strengthens with each cited article.
The operational implication is that named-author programs produce citation lift that justifies the editorial investment. The work to identify named authors (typically 1-3 expert voices per content category), set up author pages on the site, link to their external credentials, and consistently attribute their articles takes weeks but produces ongoing citation benefits across the entire content footprint they create.
The largest effect we observed within the named-author bucket was for authors with strong, topic-specific external authority. An author who has published technical research, spoken at industry conferences, and is recognized in their specific field by other authorities produces stronger trust signals than an author who is a generalist writer for hire. The signal is not just "named author" but "credibly authoritative author."
Anonymous Attribution Costs
A subset of our test pages used author attribution that was technically present but generic ("By the [Brand] Team" or "Editorial Staff"). These performed only marginally better than fully anonymous pages and well below pages with individual named authors. The signal is binary in a useful sense: either you have a specific identifiable expert behind the byline or you do not, and the citation algorithm treats vague attribution much like no attribution.
Finding 4: Schema Markup And Modest But Reliable Lift
Pages with valid Schema.org JSON-LD markup for Article, Author, and Organization types earned citations at approximately 1.1-1.2x the rate of pages with no structured data. The effect is smaller than the named-author effect and smaller than the content depth effect, but it replicates reliably across test runs and across category buckets.
The marginal lift is not from any specific schema type doing magic. It comes from the cumulative trust and classification signal that proper schema provides. The retrieval system uses schema to identify content types, authors, publication dates, and topic classifications. Pages with complete and valid schema produce stronger signals than pages with absent or invalid schema, and stronger signals correlate with higher citation rates.
The specific Schema.org types that mattered most:
- Article (or BlogPosting, NewsArticle, etc.). Defines the content as editorial content, identifies headline, author, publication date.
- Author (typically a Person entity with linked external profiles). Defines the author identity and credentials.
- Organization. Defines the publishing entity, often with logo, sameAs links to social profiles, and other identity signals.
- BreadcrumbList. Provides hierarchical context that helps classify the page within the site structure.
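As a reference point, here is a minimal sketch of what that markup stack can look like, written as a short Python snippet that emits the JSON-LD blocks; every property value is a placeholder, and the exact fields a given site needs will vary.

```python
import json

# Illustrative JSON-LD covering the types discussed above; all values are placeholders.
article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example headline",
    "datePublished": "2026-01-15",
    "dateModified": "2026-02-01",
    "author": {
        "@type": "Person",
        "name": "Jane Example",
        "url": "https://example.com/authors/jane-example",
        "sameAs": ["https://www.linkedin.com/in/jane-example"],
    },
    "publisher": {
        "@type": "Organization",
        "name": "Example Co",
        "logo": {"@type": "ImageObject", "url": "https://example.com/logo.png"},
        "sameAs": ["https://twitter.com/exampleco"],
    },
}

breadcrumbs = {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
        {"@type": "ListItem", "position": 1, "name": "Guides", "item": "https://example.com/guides/"},
        {"@type": "ListItem", "position": 2, "name": "Example headline", "item": "https://example.com/guides/example/"},
    ],
}

# Each object is embedded in the page as its own <script type="application/ld+json"> block.
for block in (article_schema, breadcrumbs):
    print(f'<script type="application/ld+json">\n{json.dumps(block, indent=2)}\n</script>')
```

However the markup is generated, the point from the data above is completeness and validity rather than any single magic type.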
Several schema types we tested showed no measurable effect on citation rates. FAQPage schema did not lift citations when the content was not actually FAQ-formatted. Product schema was helpful for ecommerce-specific pages but not broadly. HowTo schema produced no measurable effect across our test set, which contradicts some of the published optimization advice that emphasizes it heavily.
The Pragmatic Reading
The schema effect is small enough that it should not be the primary focus of optimization work. The 1.1-1.2x lift is real and worth capturing, but the same effort allocated to content depth or named-author programs produces 2-3x larger effects. The right operational approach is to ensure schema is correctly implemented as part of the technical foundation but not to expect dramatic citation movements from schema-specific work alone.
Finding 5: The Wikipedia And Aggregator Anomaly
Wikipedia and major aggregator sites (G2, Capterra, TrustRadius, Crunchbase, SimilarWeb) earn citations at rates well above what their Bing rankings would predict. Wikipedia in particular shows up in approximately 8% of all citations across our query set, often when the Wikipedia page does not rank in Bing's top 10 for the originating query.
The mechanism appears to be a structural trust override. The retrieval system treats certain known-authoritative sources as preferred even when their ranking signal is weaker than competitors. Wikipedia is the clearest case (citations frequently appear for queries where Wikipedia is not even in Bing's top 20), but the same pattern is visible at smaller scale for the major aggregators.
The operational implication for commercial publishers is mixed. The override creates a structural disadvantage for non-Wikipedia, non-aggregator sites trying to compete on broadly-researched queries. Wikipedia entries on the same topic will often beat your content for citation share even when your content is more current and more substantive. The competitive response is not to fight Wikipedia directly but to compete on the queries and topics where Wikipedia is weaker (specialized buyer-research queries, recent commercial information, vendor-specific evaluations).
For brands that can earn Wikipedia article presence through legitimate notability (companies, organizations, products meeting Wikipedia's notability bar), the citation lift can be substantial. Building Wikipedia presence is a slow process bound by Wikipedia's editorial standards, but for eligible brands it is one of the highest-leverage AI visibility investments available.
The Aggregator Implication For B2B
For B2B SaaS specifically, aggregator presence on G2, Capterra, and TrustRadius produces dual benefits. The aggregators themselves get cited by ChatGPT for B2B comparison queries, and pages with strong aggregator profiles (high review counts, recent positive reviews, complete vendor data) build authority signals that influence ChatGPT's selection. Brands without aggregator presence miss both pathways.
Finding 6: News Publisher Over-Indexing And Licensing
Licensed news publishers (Conde Nast properties, Financial Times, AP, Reuters, several others) earn citations at approximately 1.5-2.5x their fair-share rate based on Bing ranking. The over-indexing is most pronounced on breaking news queries and current events, where licensed partners dominate. The over-indexing decreases on evergreen content and commercial buyer-research queries, where specialist publishers compete more effectively.
The operational implication for non-partner publishers is that competing for citation share on breaking news queries is structurally difficult. The licensing arrangements between OpenAI and major news publishers produce a citation advantage that organic optimization cannot fully overcome. The strategic response is to compete in categories where licensing partnerships matter less (specialty topics, niche expertise, vendor-specific evaluations) rather than to fight the partner advantage on its strongest ground.
The companion piece on earning ChatGPT citations without a licensing deal covers the independent-path strategy in depth. For brands considering whether to pursue an OpenAI partnership, the over-indexing data is one input to the cost-benefit math, alongside the multi-year licensing fees and editorial constraints partnerships impose.
What Over-Indexing Does Not Eliminate
Over-indexed news publishers still need to produce relevant content for the specific queries they want to win. A licensed partner with weak content quality, slow update cadences, or poor technical accessibility still loses citations to independent publishers with strong fundamentals. The advantage is a thumb on the scale rather than a binary lock-in, and the same investments in content quality, author authority, and technical foundation matter regardless of partnership status.
Finding 7: What Did Not Correlate With Citations
Several factors that appear frequently in published optimization advice produced no measurable citation effect in our analysis, which is worth flagging because it can help reallocate attention away from low-leverage interventions.
FAQ schema did not produce citation lift when the page content was not actually FAQ-formatted. FAQ schema can help when the underlying content matches the format (legitimate Q&A pages with clearly delineated questions and answers), but adding FAQ schema to non-FAQ content does not move the needle. The widespread advice to add FAQ schema indiscriminately is not supported by our data.
Keyword density and exact-match keyword placement showed only weak correlations with citation outcomes. The retrieval system clearly handles semantic relevance rather than keyword matching, and pages that win citations do so through topical relevance rather than through keyword optimization patterns from the early SEO era.
Updating dateModified without actual content changes showed no measurable effect on citation rates. The retrieval system appears to detect when a date update reflects substantive content change versus when it is purely cosmetic, and the cosmetic updates do not produce citation lift. The practice of updating dateModified on stale content to "freshen" it is not supported by our data.
Internal linking density (number of internal links from other pages on the site to the target page) showed weak correlations with citation outcomes, smaller than the named-author or content depth effects. Internal linking matters for traditional SEO but appears to be a secondary signal for AI citation specifically.
Page speed and Core Web Vitals scores showed surprisingly weak correlation with citation rates within reasonable performance ranges. Pages with excellent Core Web Vitals scores did not measurably out-cite pages with merely adequate scores. The retrieval pipeline appears to penalize pages that completely fail to render but does not differentiate finely above the basic accessibility threshold.
The Implication For Optimization Priorities
These null findings are useful for budget allocation. Time spent adding FAQ schema to non-FAQ content, optimizing keyword density, or chasing minor Core Web Vitals improvements is producing little citation lift. The same time invested in content depth, named-author programs, or Bing visibility produces much larger returns. The right operational response is to redirect optimization energy toward the high-impact areas the data identifies and away from the low-impact areas it does not.
Frequently Asked Questions
How does this analysis compare to publicly released data from other sources?
The Seer Interactive analysis of 500 citations (published in 2025) reached similar directional conclusions about the Bing dependency, with their headline figure of 87% overlap matching ours directionally but coming in slightly higher than the 80% we measured on a broader query set. The differences are likely due to query mix and timing rather than methodological disagreement. Other published analyses from smaller sample sizes have produced more variable results, which is expected given the noise at small N. The right interpretation is that the Bing dependency is real and substantial, with the exact figure varying based on query composition.
Will these findings hold over time as OpenAI updates its retrieval system?
Probably partially. The structural findings (Bing dependency, content depth effect, named-author lift) are unlikely to reverse because they reflect underlying mechanisms (the role of upstream indexes, the per-source extraction budget, the trust scoring) that are integral to how retrieval-augmented systems work. The specific quantitative figures will likely shift as the system evolves. Brands should re-validate findings every 6-12 months rather than assuming any specific number remains current indefinitely.
Can I run this analysis on my own data?
Yes, with modest investment. The methodology requires citation logging, source profiling, and statistical analysis. Manual logging of 200 queries takes 4-6 hours and produces a usable sample. Programmatic logging through the OpenAI API takes 2-3 hours of engineering setup plus minimal compute cost. The analysis can be done in any spreadsheet or analytics tool. The companion piece on running empirical citation tests walks the methodology.
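For the programmatic route, here is a minimal sketch of what citation logging can look like against the OpenAI Responses API with web search enabled. The model name, tool type, and annotation fields shown are assumptions that may differ by API version, so verify them against the current OpenAI API reference before relying on the output.

```python
import csv
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

queries = [
    "best electric toothbrush for sensitive teeth",
    "how to choose a financial advisor",
    # ... the rest of the query set
]

def log_citations(query):
    """Run one query with web search enabled and collect cited URLs.
    Tool type and annotation fields are assumptions; check the current
    OpenAI API reference, since they change between versions."""
    response = client.responses.create(
        model="gpt-4o",
        tools=[{"type": "web_search_preview"}],
        input=query,
    )
    cited = []
    for item in response.output:
        for part in getattr(item, "content", []) or []:
            for annotation in getattr(part, "annotations", []) or []:
                if getattr(annotation, "type", "") == "url_citation":
                    cited.append({"query": query, "url": annotation.url, "title": annotation.title})
    return cited

with open("citations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "url", "title"])
    writer.writeheader()
    for q in queries:
        writer.writerows(log_citations(q))
```

The resulting CSV feeds directly into the per-factor comparison sketched in the methodology section once each URL is profiled for the features you want to test.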
Why does the 80% Bing-overlap figure differ from the 87% widely cited?
Sample composition and timing. Earlier published analyses used smaller query sets concentrated in news-and-current-events categories where the Bing overlap is particularly high. Our broader query set spanning four diverse category buckets produced a slightly lower aggregate overlap, with category-level variation: news-and-current-events queries showed roughly 85% overlap, commercial buyer-research queries showed roughly 75%, and specialty topical queries showed roughly 70%. Both figures are correct for their respective scopes: the higher one reflects news contexts, while the lower one is more representative of broader commercial visibility.
Does this analysis apply to Anthropic's Claude, Perplexity, or Google Gemini?
Partially. The general patterns (importance of upstream index inclusion, content depth effect, authority signals) likely generalize across AI engines, but the specific quantitative findings are ChatGPT-specific. Other engines have different upstream indexes (Perplexity uses a hybrid stack, Claude uses Anthropic's own crawling, Gemini uses Google Search directly), so the equivalent Finding 1 would have different numbers and possibly different dominant upstream sources. Brands optimizing across engines should run per-engine analyses to characterize each system separately.
The 1,000-citation dataset is one of the larger empirical analyses of ChatGPT ranking patterns publicly available as of mid-2026. The findings are not definitive, but they are directionally robust and informed by enough data to support operational decisions. Brands willing to invest in their own citation testing can extend the analysis to their specific categories and produce findings tailored to their competitive landscape. The work is rigorous, the methodology is replicable, and the strategic value is substantial in a market where most competitors are still operating on untested guidance.
If your team wants the full empirical research engagement (with the test design, the data collection, the analysis, and the strategic recommendations specific to your category), that work sits inside our generative engine optimization program. The patterns are real. The methodology is well-defined. The competitive advantage from understanding the actual ranking dynamics rather than the speculated ones compounds over the multi-year horizon during which AI visibility will keep mattering.