Most of the published guidance on ranking in ChatGPT search comes from two sources: vendor statements, which are corporate-careful and necessarily generic, and vendor-adjacent commentary, which extrapolates from limited public information and is often wrong in specific ways that matter for publisher strategy. The combined result is a discourse where confidence outruns evidence. Practitioners cite "best practices" that nobody has actually tested. Conference talks repeat hypotheses as facts. The brands that take these untested claims seriously misallocate effort and waste time on optimizations that do not produce measurable citation lift.
The alternative is empirical testing. ChatGPT's retrieval system is a black box from the outside, but it produces observable outputs (which URLs get cited) for observable inputs (which queries you run). With enough test queries, controlled variation, and patient log-keeping, you can extract reasonably confident statements about what moves citations and what does not. The work is not glamorous and the experimental design takes care, but the resulting insight is grounded in your own data rather than someone else's untested theory.
This piece is the methodology: the test design, the statistical considerations, the patterns that have already emerged from large-scale testing across the agency engagements where we have run this systematically, and the operational discipline that turns occasional experiments into sustained ranking-factors research.
Why Empirical Testing Is The Only Path
The structural reason ChatGPT's ranking is a black box is straightforward. OpenAI has commercial incentives to keep the retrieval system opaque. Publishing the ranking factors would create gameability, would invite the kind of optimization arms race that has plagued Google search for two decades, and would limit OpenAI's ability to adjust the system over time without producing user-visible disruptions. The opacity is intentional and likely permanent.
The implication for publishers is that anything anyone says about "what ChatGPT rewards" or "what the model prioritizes" is at best informed speculation. The OpenAI bot documentation describes the bot fleet and the technical fetch behavior. It does not describe how the model decides which sources to cite. The gap between "we know how the bots fetch" and "we know how the model ranks" is huge, and the gap is precisely where the operational decisions about content strategy have to be made.
Empirical testing fills the gap. You cannot observe the ranking function directly, but you can observe the rankings it produces. Across enough queries, with enough sources, the patterns become visible. The methodology is statistical rather than analytical: you cannot derive the ranking algorithm, but you can characterize its behavior to a useful approximation.
The patterns that emerge from empirical testing have already produced operational insight that vendor-statement guidance does not. The role of Bing's index in citation selection (covered in our Bing pipeline piece), the per-source extraction budget mechanics (covered in our search_context_size piece), and the conversion-rate lift over non-branded organic (covered in our referrals versus organic piece) all came from systematic citation testing rather than from OpenAI's own documentation.
What Empirical Testing Cannot Do
Empirical testing cannot tell you the underlying mechanism of the ranking algorithm. You can know that certain features predict citation outcomes without knowing why. The "why" might be a specific weight in the retrieval system, a learned pattern in the model, a structural artifact of the training process, or some combination. Practitioners often want explanations and find correlations frustrating; the right framing is that correlations are sufficient for operational decisions even when the underlying mechanism remains opaque.
The Test Design Foundations
A useful citation test has four design elements: a query set, a target observation, a control structure, and a statistical plan.
The query set is the inputs you will run against ChatGPT to observe citations. For most ranking-factor research the right set is 50-200 queries spanning the categories where your business cares about visibility. Queries should be varied (different intent types, different specificity levels, different domains within your topical area) so the findings generalize rather than reflecting one narrow query pattern.
The target observation is what you record from each query response. The minimum is whether your domain appears as an inline citation, whether it appears in the sources panel, whether competitors appear. More sophisticated observations include citation count per domain (some queries cite a domain multiple times), the rank position of citations within the source list, and the specific quoted text when applicable.
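As a concrete illustration, the observation can be captured as one structured record per query run. The sketch below uses field names that are our assumptions about what is worth capturing, not a prescribed schema.

```python
# Illustrative observation record for one query run; field names are
# assumptions about what to capture, not a prescribed schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CitationObservation:
    query: str
    run_at: datetime
    our_domain_inline: bool                  # cited inline in the answer text
    our_domain_in_panel: bool                # listed in the sources panel
    competitor_domains_cited: list[str] = field(default_factory=list)
    citation_rank: int | None = None         # position within the source list, if cited
    quoted_text: str | None = None           # quoted passage, when applicable
```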
The control structure is how you ensure the observations reflect the variable you are testing rather than ambient noise. If you are testing whether a content change affected citations, you need a baseline measurement before the change and a re-measurement after, with the same query set. If you are comparing your domain to competitors, you need to control for query-specific factors that affect all competitors equally. The right control structure depends on the hypothesis you are testing.
The statistical plan is how you will interpret the data. With small query sets and binary outcomes (cited or not), the signal-to-noise ratio is poor. You need either a large enough query set (200+ queries) or a strong enough effect (citation rate change of 30%+ from intervention) for findings to be meaningful. Single-query observations are anecdotes, not evidence.
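To make those thresholds concrete, a rough significance check on before/after citation rates can be done with a two-proportion z-test. The sketch below is a simplification (it treats the two runs as independent samples, whereas a paired test such as McNemar's is strictly more appropriate when the same query set is measured twice), but it is enough to show why small query sets with modest effects are inconclusive. The function name and example counts are ours.

```python
# Back-of-envelope significance check on a citation rate change measured
# on the same query set before and after an intervention. Pure Python so
# the arithmetic is explicit; treats the runs as independent samples.
from math import sqrt, erf

def two_proportion_z(cited_before: int, cited_after: int, n_queries: int) -> float:
    """Return the two-sided p-value for a change in citation rate."""
    p1 = cited_before / n_queries
    p2 = cited_after / n_queries
    pooled = (cited_before + cited_after) / (2 * n_queries)
    se = sqrt(2 * pooled * (1 - pooled) / n_queries)
    if se == 0:
        return 1.0
    z = (p2 - p1) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 200-query set, 12% -> 24%: p is well under 0.05, so the lift is likely real.
print(two_proportion_z(24, 48, 200))
# 50-query set, 12% -> 18%: p is around 0.4, so this could easily be noise.
print(two_proportion_z(6, 9, 50))
```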
The companion piece on citation rate measurement covers the broader instrumentation framework where citation testing sits within a complete reporting pipeline.
A Reproducible Starting Test
For brands wanting to start the research program without overinvesting upfront, a useful first test takes the following shape. Pick 50 buyer-intent queries from your category. Run each one in ChatGPT manually, recording your domain's citation status (inline, panel, or not cited) plus the top 5 competitor domains' status for the same queries. The single run takes 90-120 minutes of careful work. The result is a 50x6 matrix that becomes the baseline for any subsequent intervention testing. After making content changes, re-run the same query set monthly and watch for movements. The discipline produces actionable signal within 60-90 days.
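One way to keep the manual log consistent is to record one row per query-domain pair and pivot it into the matrix afterwards. The sketch below assumes pandas; the column names and status values are illustrative.

```python
# Illustrative only: pivoting per-query observations into the 50x6 baseline
# matrix. Status values ("inline", "panel", "not_cited") are assumptions.
import pandas as pd

rows = [
    {"query": "best crm for small agencies", "domain": "yourbrand.com", "status": "inline"},
    {"query": "best crm for small agencies", "domain": "competitor-a.com", "status": "panel"},
    {"query": "best crm for small agencies", "domain": "competitor-b.com", "status": "not_cited"},
    # ... one row per query/domain pair across all 50 queries and 6 domains
]

matrix = (
    pd.DataFrame(rows)
    .pivot(index="query", columns="domain", values="status")
    .fillna("not_cited")
)
print(matrix)
```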
What To Vary And What To Hold Constant
Most citation tests fail because too many variables change at once. The brand updates content, refreshes the sitemap, ships a CDN configuration change, and runs PR for the same week. When citations move (or do not), there is no way to attribute the change to any specific intervention. The work has to be more disciplined than that.
The right structure is one-variable-at-a-time: change one specific thing, hold everything else constant, run the test, measure the result. Interventions to consider:
- Content depth. Take a moderate-length article (1,500 words) on a topic and expand it to 3,500 words with substantive additional analysis. Re-run the relevant query set 6-8 weeks after publication. Measure the citation rate change.
- Specific claim density. Take an existing article with mostly generic claims and add 5-10 specific factual claims with numbers, named entities, and methodological notes. Re-run the query set 4-6 weeks after publication.
- Author attribution. Take an anonymously-attributed article and replace the byline with a named expert who has verifiable external credentials. Run the relevant query set 8-12 weeks after the change.
- Schema markup addition. Take a high-priority page without JSON-LD Article schema and add it. Re-run the query set 4-6 weeks after.
- External authority signal. Pursue digital PR for an article and earn 3-5 backlinks from authoritative sources in the topic area. Run the relevant query set 12-16 weeks after the backlinks land.
Each test isolates one variable, and each takes weeks to months to produce reliable results because of the propagation delays in OpenAI's retrieval system. The combined research program might run 8-12 tests over a 12-month period, each producing one defensible finding about which interventions move citations.
The discipline is the hard part. Most teams find the time horizon frustrating and want faster results. The alternative (running multiple changes simultaneously and assuming the citation changes are attributable to the work) produces unreliable claims that do not survive scrutiny. The patience pays off in research findings the team can actually trust.
What Definitely Should Not Be Varied
Random variations in OpenAI's retrieval system (model updates, infrastructure changes, query interpretation tweaks) can produce citation changes that look like intervention effects but are not. To control for these, run a baseline query set that you are not making intentional changes to. Track citations for the baseline set over time. If the baseline citation rate moves materially, something at OpenAI's end changed and any observed effect on your intervention set might be coincident rather than causal.
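A minimal way to apply that control, assuming you track cited-query counts for both sets, is to compare the baseline set's movement against the intervention set's movement before attributing anything to the work. The function name and the drift threshold below are illustrative assumptions, not standards.

```python
# Sketch of the baseline-vs-intervention comparison. Each measurement is a
# (queries_cited, total_queries) pair; the 5-point drift threshold is an
# illustrative assumption to tune per program.
def rate(cited: int, total: int) -> float:
    return cited / total

def interpret(
    baseline_before: tuple[int, int],
    baseline_after: tuple[int, int],
    test_before: tuple[int, int],
    test_after: tuple[int, int],
    drift_threshold: float = 0.05,
) -> str:
    baseline_delta = rate(*baseline_after) - rate(*baseline_before)
    test_delta = rate(*test_after) - rate(*test_before)
    if abs(baseline_delta) > drift_threshold:
        return (f"Baseline moved {baseline_delta:+.0%}; the test set's {test_delta:+.0%} "
                "may reflect an ambient change rather than the intervention.")
    return (f"Baseline stable ({baseline_delta:+.0%}); the test set's {test_delta:+.0%} "
            "is more plausibly attributable to the intervention.")

# Baseline held near 20% while the intervention set moved from 12% to 24%.
print(interpret((20, 100), (21, 100), (12, 100), (24, 100)))
```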
The Five Most Replicable Findings So Far
Across the systematic testing we and other practitioners have published, five findings have emerged repeatedly across independent replications. None should be treated as universal laws, but each has enough evidentiary support to inform operational decisions.
- Substantive long-form content is cited more often than scannable short content for research queries. Pieces in the 2,500-5,000-word range with structured sections earn higher citation rates than pieces under 1,500 words on the same topic. The effect is strongest in Deep Research mode and present but smaller in regular ChatGPT search.
- Specific factual claims earn citations more reliably than generic content. Pieces with numbers, named entities, and original analysis are cited 1.5-3x more often than pieces of equivalent length that summarize generic information. The effect is robust across topic areas and replicates consistently.
- Bing index inclusion is necessary but not sufficient. Pieces absent from Bing's top results for a query rarely get cited regardless of content quality. Pieces present in Bing's top results may or may not get cited depending on other factors. The Bing dependency is structural and has been confirmed empirically through multiple independent tests.
- Named-author bylines with verifiable expertise lift citation rates over anonymous attribution. The effect is smaller than the content-quality effect but real, typically 10-25% lift on the same content with author attribution added. The effect compounds over multiple articles by the same author as the author's verified profile strengthens.
- Schema markup with proper Schema.org Article, Author, and Organization types produces modest but reliable citation rate improvements. The effect is in the 5-15% range across the testing we have done, which is small per intervention but reliably positive. Worth doing for the small lift plus the dual benefit to traditional SEO.
What Has Not Replicated
Several claims that appear in vendor-adjacent commentary have not replicated under controlled testing. The claim that "AI engines prefer FAQ schema" has not survived systematic testing; FAQ schema produces modest citation effects when the content matches FAQ formats but does not lift citations when force-fit to non-FAQ content. The claim that "stuffing pages with named entities boosts citations" has not replicated either; the effect comes from authentic specificity rather than from named-entity density specifically. Practitioners should be skeptical of these and similar untested claims that circulate in the discourse.
Running The Tests At Scale
Manual citation testing produces accurate data but consumes a lot of human time. Scale comes from programmatic testing through APIs.
OpenAI's web search capability is available through the Responses API. Developers can issue queries programmatically with web search enabled and receive structured responses that include the citation list. A small Python script can run 200 queries in 30-45 minutes of compute time, log the citations to a database, and produce the same matrix that took 90-120 minutes of manual work.
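A minimal sketch of that loop follows, assuming the official openai Python SDK and the Responses API with the web search tool enabled. The tool type name, model name, and annotation fields shown are assumptions to verify against current OpenAI documentation before relying on them.

```python
# Sketch: run one query with web search enabled and collect the cited URLs.
# The tool type ("web_search_preview"), model name, and url_citation
# annotation shape are assumptions; check the current OpenAI docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cited_urls(query: str, model: str = "gpt-4o") -> list[str]:
    response = client.responses.create(
        model=model,
        tools=[{"type": "web_search_preview"}],
        input=query,
    )
    urls: list[str] = []
    for item in response.output:
        if item.type != "message":
            continue
        for part in item.content:
            for annotation in getattr(part, "annotations", []) or []:
                if annotation.type == "url_citation":
                    urls.append(annotation.url)
    return urls

if __name__ == "__main__":
    print(cited_urls("best project management software for small agencies"))
```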
The API access has a cost (cents per query at most) but produces dramatically more consistent data than manual testing. Manual testing introduces session-by-session variation (the same query asked in different sessions can produce different citation lists), while programmatic testing through fresh API sessions produces more reproducible results.
For brands without engineering resources, several third-party citation tracking platforms automate the testing. Profound, Otterly.ai, Bluetick, and similar tools run citation queries on a schedule and produce dashboards that summarize trends. The tooling is paid but eliminates the engineering investment.
The right scale depends on the research program. For ad-hoc validation of specific hypotheses, 50-100 queries is sufficient. For systematic monitoring of competitive position, 200-500 queries run monthly produces a strong baseline. For deep ranking-factor research where statistical power matters, 1,000+ queries can be useful.
Across all scales, the principle is the same. More queries reduce noise. More controlled variation isolates effects. More patience over months produces signals that are not visible in shorter windows. The tradeoff is between research rigor and time-to-insight; teams should pick the level appropriate to the decision they are trying to inform.
A Sample Programmatic Setup
A reproducible pattern: a Python script that maintains a local SQLite database of queries and citations, runs the query set monthly via the OpenAI API with web search enabled, parses the citation lists, and writes the results to the database with timestamps. The setup takes 2-3 hours for an engineer comfortable with Python and the OpenAI SDK. The output is a clean longitudinal dataset that can be visualized in any BI tool or analyzed directly with pandas or similar.
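A minimal sketch of that setup, assuming a cited_urls helper like the one in the earlier API example; the table layout, file name, and query list are illustrative rather than a prescribed schema.

```python
# Sketch of the monthly snapshot runner. Assumes cited_urls(query) from the
# earlier API sketch is importable or defined alongside; schema and names
# are illustrative.
import sqlite3
from datetime import date

QUERIES = [
    "best project management software for small agencies",
    "how to choose a crm for a 10-person sales team",
    # ... the rest of the standing query set
]

def run_monthly_snapshot(db_path: str = "citations.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS citations (
               run_date  TEXT,
               query     TEXT,
               cited_url TEXT       -- NULL when the query cited nothing
           )"""
    )
    today = date.today().isoformat()
    for query in QUERIES:
        urls = cited_urls(query) or [None]   # keep a row even when nothing is cited
        for url in urls:
            conn.execute(
                "INSERT INTO citations (run_date, query, cited_url) VALUES (?, ?, ?)",
                (today, query, url),
            )
        conn.commit()
    conn.close()
```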
Interpreting The Data Without Overclaiming
The hard part of citation testing is not collecting the data. It is interpreting it without overstating what the data supports. Four interpretation traps to watch for:
- Correlation is not causation - Observing that pages with feature X are cited more often than pages without feature X does not prove that feature X causes citations. The pages with X might also have other characteristics (length, authority, topical relevance) that produce the citation effect. Controlled intervention testing is what distinguishes correlation from causation, and many of the public claims about ChatGPT ranking factors are based on correlations without the controlled testing to back them up.
- Statistical significance versus practical significance - A 5% citation rate lift on a 200-query test might be statistically significant but not practically meaningful for a business decision. A 50% lift on a 20-query test might be practically interesting but statistically noisy. The right interpretation distinguishes between effects that are likely real (statistical significance) and effects that matter for business outcomes (practical significance).
- Generalization across contexts - A finding that holds in one topic area might not generalize to another. A finding that holds for B2B SaaS queries might not hold for consumer ecommerce queries. The right framing is to characterize findings narrowly ("for B2B SaaS comparison queries, X correlates with citations") rather than universally ("X is a ranking factor").
- Time stability - ChatGPT's ranking system evolves. A finding that holds in May 2026 might not hold in November 2026 if OpenAI updates the retrieval logic. The right approach is to re-validate findings periodically and to flag findings that are old enough to need reconfirmation.
The honest interpretation often involves saying "we found that X correlates with citations across our test set, the effect appears reliable as of [date], and we recommend testing in your specific context before treating it as a universal pattern." The honesty earns more trust than confident overclaiming, and the discipline produces better strategic decisions over time.
Reporting To Stakeholders
When reporting findings to executive or client audiences, the framing matters. "Our citation rate doubled" is dramatic but provides no context for whether the change is meaningful or sustainable. "Our citation rate moved from 12% to 24% on our 100-query test set after we added specific data to 30 pages, sustained over 60 days, suggesting the content change produced the lift" is more informative and more defensible. The longer framing requires writing skill but produces stakeholder confidence that the shorter version does not.
Building The Ongoing Research Program
Citation testing becomes most valuable when it is sustained rather than episodic. A research program with monthly or quarterly testing rhythms produces compounding insight as findings accumulate.
The right structure for most brands has three components.
- Baseline monitoring - A standard query set (100-200 queries) run monthly to track competitive citation share over time. The monthly cadence catches both intentional changes (your content investments) and ambient changes (OpenAI updates, competitor moves). The baseline is the canvas against which interventions are evaluated; a sketch of the share computation follows this list.
- Targeted experiments - Specific A/B-style tests of content changes, technical interventions, or authority-signal investments. Each experiment isolates one variable, has a defined hypothesis, and runs for the appropriate time horizon. Experiments are scheduled deliberately rather than running ad-hoc.
- Exploratory testing - Periodic broader queries to look for patterns the team has not yet hypothesized: new query types, new topic areas, new comparison points with competitors. Exploratory testing surfaces opportunities the systematic program might miss because it relies on curiosity rather than a predefined hypothesis.
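For the baseline-monitoring component, a sketch of the monthly share computation on top of the SQLite log from the earlier setup; the regex-based domain extraction and column names are illustrative assumptions.

```python
# Sketch: monthly citation share per domain from the citations table logged
# by the earlier snapshot runner. Assumes that illustrative schema.
import sqlite3
import pandas as pd

def citation_share(db_path: str = "citations.db") -> pd.DataFrame:
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query("SELECT run_date, query, cited_url FROM citations", conn)
    # Crude domain extraction; uncited queries (NULL cited_url) become NaN.
    df["domain"] = df["cited_url"].str.extract(r"https?://(?:www\.)?([^/]+)", expand=False)
    totals = df.groupby("run_date")["query"].nunique().rename("total_queries")
    cited = (
        df.dropna(subset=["domain"])
        .groupby(["run_date", "domain"])["query"].nunique()
        .rename("queries_cited")
        .reset_index()
        .merge(totals.reset_index(), on="run_date")
    )
    cited["citation_share"] = cited["queries_cited"] / cited["total_queries"]
    return cited.sort_values(["run_date", "citation_share"], ascending=[True, False])
```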
The three components together produce a program that informs strategic decisions and builds organizational competence in ranking-factor research. After 12-18 months of operation, the program produces a body of internal knowledge about what works for your category that no external advisor can match.
The Documentation Discipline
The research program is only as useful as its documentation. A simple structure: a wiki page per experiment with hypothesis, methodology, query set, baseline measurements, intervention details, follow-up measurements, and conclusions. The documentation outlives the team members who ran the experiment and produces institutional memory that compounds over years. Brands that document their citation research carefully build a moat that is hard for competitors to replicate.
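Where the team wants the same structure machine-readable as well (for example, to join experiment metadata against the citation log), a record like the following mirrors those fields. The field names are illustrative, not a required format.

```python
# Illustrative per-experiment record mirroring the wiki-page structure above.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExperimentRecord:
    name: str
    hypothesis: str
    variable_changed: str                     # the single variable under test
    query_set: list[str]
    baseline_run: date
    baseline_citation_rate: float
    intervention_shipped: date
    intervention_details: str
    followup_run: date | None = None
    followup_citation_rate: float | None = None
    conclusion: str = ""
    notes: list[str] = field(default_factory=list)
```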
Frequently Asked Questions
How long does a typical citation test take to produce reliable findings?
Most intervention tests need 6-16 weeks between baseline measurement and reliable follow-up measurement, because OpenAI's retrieval system has propagation delays at every stage (Bing re-indexing, OAI-SearchBot re-crawling, retrieval system updates, model updates). Tests that try to produce findings in less than 4 weeks usually pick up noise rather than signal. The patience requirement is one of the reasons systematic citation testing is rare; teams that internalize the time horizon get the value.
What size query set is statistically sufficient for confident findings?
For binary outcomes (cited or not), with effect sizes in the 20-30% range, around 200 queries provide reasonable statistical confidence. Smaller effect sizes (10-15%) need 500+ queries. Very large effect sizes (50%+) can be detected with as few as 50 queries. The right size depends on the effect you are trying to measure and the confidence level you need for the business decision the finding will inform.
Should I use ChatGPT directly or use the OpenAI API for citation testing?
Both work, with tradeoffs. Direct ChatGPT use produces results closest to what real users experience, but requires manual time and introduces session variation. API use is faster and more reproducible, but the API context may differ slightly from the consumer surface. The right approach is usually API for systematic testing (because of efficiency and reproducibility) supplemented with occasional manual testing to validate that API results match consumer ChatGPT results for your queries.
How do I distinguish ranking effects from random noise in small test sets?
The honest answer is you often cannot. Small test sets (under 50 queries) are not statistically reliable for detecting modest effects. The right approach for budget-constrained teams is to focus on hypotheses where you expect large effects (substantive content overhauls, structural authority changes) rather than small-effect hypotheses (minor schema additions, single-paragraph rewrites). Large-effect interventions can be detected with smaller test sets; small-effect interventions cannot.
Will OpenAI ever publish its actual ranking factors?
Probably not in detail. The same incentives that produce the current opacity will probably continue producing opacity. OpenAI may publish high-level guidance ("we reward original analysis, specific claims, and authoritative sources") but is unlikely to specify the weights, the model architecture, or the precise scoring functions. Empirical testing will likely remain the primary way to understand the ranking behavior for the foreseeable future of AI search.
The black-box nature of OpenAI's ranking is a constraint, but it is not an obstacle. The empirical alternative is rigorous, replicable, and produces operational insight that vendor-adjacent commentary cannot match. Brands that build the citation testing discipline now have a multi-year head start on understanding what actually moves citations for their specific category, and the insight compounds as the testing program matures.
If your team wants the full research program built and operated (with the query set, the testing automation, the monthly analysis, and the experimental design support), that work sits inside our generative engine optimization program. The system is opaque from outside. The behavior is observable. The brands that observe systematically beat the brands that rely on speculation.
Ready to optimize for the AI era?
Get a free AEO audit and discover how your brand shows up in AI-powered search.
Get Your Free Audit