Every week, another SEO article makes the same claim: add schema markup and watch your brand get cited by AI. The logic sounds airtight: structured data helps machines understand content, and LLMs are machines, so schema must help LLMs. But when you look at the actual evidence, the picture is far messier than the pitch. Some practitioners have run controlled tests showing that ChatGPT and Perplexity read JSON-LD no differently than plain text. A peer-reviewed study found FAQPage schema correlated with 13x higher odds of ChatGPT visibility. And Microsoft has openly confirmed that schema helps its LLMs, while OpenAI has said nothing at all. If you're making decisions about where to invest engineering and content resources for Generative Engine Optimization, you need a clearer map than "schema good." You need to understand what each platform actually does with structured data, where the evidence is strong, where it's speculative, and what to prioritize right now. This piece examines the evidence across three dimensions: what AI platforms have officially said, what independent tests reveal, and how structured data enters AI systems through training pipelines most practitioners never think about.
What Structured Data Actually Means in an LLM Context
Before evaluating whether ChatGPT or Perplexity "use" schema, it helps to distinguish between two fundamentally different ways structured data could matter. The first is at training time. Structured data can be converted into linguistic sentences via Data-to-Text processes, which then flow into LLM training corpora and form model knowledge. Web Data Commons automatically extracts structured data from the Common Crawl, and the Schema.org markup obtained there is often converted into linguistic datasets that are potentially part of the training sources from which models like Gemini or ChatGPT derive their world knowledge.
The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013. The latest release consists of 106 billion RDF quads describing 3.1 billion entities originating from 12.8 million different websites, providing a large pool of training data for tasks such as product matching, information extraction, or question answering.
The second is at inference time-when a user prompts ChatGPT or Perplexity and the system retrieves web content to generate an answer. Modern LLM-based systems that answer queries at scale typically use a retrieval-augmented generation (RAG) approach. RAG combines a retriever with an LLM to supply relevant context for generation. If your structured data is parsed and indexed as fielded attributes or converted into vectorized representations, the retriever can surface those attributes to the model.
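The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not any platform's actual pipeline: real retrievers use BM25 and/or dense embeddings rather than the naive term-overlap scoring below, and all names here (`retrieve`, the sample documents) are hypothetical.

```python
def retrieve(query, documents, k=2):
    """Toy lexical retriever: rank documents by query-term overlap.
    Production RAG systems use BM25 and dense embeddings; this only
    illustrates the shape of the retrieve-then-generate flow."""
    q_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_terms & set(d["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    {"url": "https://example.com/a", "text": "Acme Corp sells red t-shirts"},
    {"url": "https://example.com/b", "text": "weather forecast for Berlin"},
]

# Retrieved context is assembled into the prompt; the LLM only
# synthesizes from what the retriever surfaced.
context = retrieve("who sells t-shirts", docs, k=1)
prompt = "Answer using only this context:\n" + context[0]["text"]
print(context[0]["url"])  # → https://example.com/a
```

The point of the sketch: if your attributes never make it into the retrieved context, the generation step never sees them, regardless of what the model learned in training.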
These are not the same mechanism, and conflating them leads to bad strategy. Your schema could shape what an LLM "knows" about your brand from training data while having zero special treatment during live retrieval. Or vice versa.
The Official Statements: Who Has Said What
Only one major AI platform has made an explicit public statement about structured data. Fabrice Canel, Principal Product Manager at Microsoft Bing, confirmed in March 2025 at SMX Munich that schema markup helps Microsoft's LLMs understand content. This is an official statement from Microsoft confirming that Microsoft uses structured data to support how its large language models interpret web content, specifically for Bing's Copilot AI.
That confirmation matters. It's unambiguous. But it covers one platform-Bing Copilot.
We have no comparable statement about whether ChatGPT, Gemini, or Perplexity use schema, so the short answer is: we don't know for sure. OpenAI has not published documentation specifying how its crawlers handle JSON-LD during inference, though it has said it gets shopping results data from structured data feeds. And John Mueller from Google said "it depends" when asked whether schema helps with LLMs.
This silence is itself informative. OpenAI runs three separate crawler systems. GPTBot is an offline, asynchronous bot that crawls websites to collect information for training AI and language models. ChatGPT-User fires when a real user query makes ChatGPT fetch a website in real time for up-to-date content; requests from this bot are the strongest signal of live visibility. (The third, OAI-SearchBot, indexes pages for ChatGPT's search feature.) But none of the documentation specifies whether the retrieval pipeline parses JSON-LD blocks as structured objects or merely reads them as text.
What the Crawlers Can and Cannot See
One important technical detail: Unlike Googlebot, which fetches, parses, and executes scripts to render dynamic content, OpenAI's ecosystem of bots only sees what's present in the initial HTML. Recent data from Vercel and MERJ tracked over half a billion GPTBot fetches and found zero evidence of JavaScript execution. Even when GPTBot downloads JS files-about 11.5% of the time-it doesn't run them.
This has a direct implication for schema: The most common AI visibility issue is relying on client-side structured data. When data loads via JavaScript, most AI platforms skip it entirely. Embed schema directly in the server-rendered HTML response, working with your development team to ensure JSON-LD or microdata is delivered server-side-not injected post-load. If your JSON-LD is injected via JavaScript frameworks after initial page load, OpenAI's crawlers-and PerplexityBot-will never see it.
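A quick way to audit this is to check the raw HTML response, before any JavaScript runs, for a JSON-LD script tag. The sketch below is a simplified illustration (the helper name and sample markup are hypothetical); a production audit would fetch the page with a non-rendering client and use a real HTML parser rather than a regex.

```python
import re

def jsonld_in_raw_html(html: str) -> bool:
    """Return True if a JSON-LD block is present in the raw HTML string.

    AI crawlers that skip JavaScript only see this initial payload, so
    schema injected client-side after page load will not match here.
    """
    pattern = re.compile(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>',
        re.IGNORECASE,
    )
    return bool(pattern.search(html))

# Server-rendered schema: present in the initial HTML payload.
server_side = ('<html><head><script type="application/ld+json">'
               '{"@type":"Organization"}</script></head></html>')

# Client-side injection: the raw HTML only ships the JS that would add it later.
client_side = '<html><head><script src="/static/inject-schema.js"></script></head></html>'

print(jsonld_in_raw_html(server_side))  # True
print(jsonld_in_raw_html(client_side))  # False
```

If the check fails on your raw page source but the schema appears in the rendered DOM, that's exactly the failure mode described above: Googlebot sees it, GPTBot and PerplexityBot don't.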
The Practitioner Tests: What Independent Experiments Show
The most widely cited test comes from SEO practitioner Mark Williams-Cook, who set out to check reports that AI engines like ChatGPT and Perplexity are not using structured data in any special way. He created a fake company, DUCKYEA t-shirts, and deliberately left the company's address out of the visible page content, placing it only inside made-up JSON-LD schema markup. He then waited and prompted both ChatGPT and Perplexity. Both read the fabricated schema and surfaced the address. Since the schema was not valid, he concluded these AI engines were simply reading it like any other page of text on the web.
This test deserves careful interpretation. It proves two things: first, that both ChatGPT and Perplexity do read the content inside JSON-LD blocks. They aren't ignoring it. Second, that neither platform validates schema structure-they extracted information from invalid markup the same way they would extract it from a paragraph.
When content containing schema is tokenized, the structural relationships-for example, JSON-LD keys like "@type": "Organization"-are turned into sequences that are indistinguishable from ordinary token strings. The model remembers phrases and token probabilities, not the JSON object graph. From a tokenization perspective, this makes sense. The LLM doesn't "see" key-value pairs as a database sees them. It sees tokens. But dismissing schema based on this test alone misses the broader picture. As Williams-Cook himself noted: "The takeaway: Schema, good. Repackaging the basics as some magical new GEO formula, bad."
The Correlation Evidence
While the Williams-Cook test examines mechanism, a peer-reviewed study by Peter Schanbacher examines outcomes at scale. The study asks whether implementing structured metadata (JSON-LD Schema.org markup) on websites can improve a business's visibility in ChatGPT responses, using real estate agencies as a case study. Schanbacher gathered public data on 1,508 real estate agents in Germany and identified which of those agents ChatGPT could provide information about.
The results were striking. Agents whose websites included FAQPage schema markup were far more likely to be visible on ChatGPT (6.2% of visible agents had FAQ schema vs. only 0.8% of non-visible; p = 0.002). Presence of Product schema also strongly correlated with visibility (17.2% vs 1.8%; p < 0.001).
A multivariate logistic regression confirmed FAQPage schema as the strongest positive predictor of ChatGPT visibility (odds ratio ≈ 13, p < 0.001), followed by Product schema (OR ≈ 4, p < 0.001).
A 13x odds ratio is a powerful signal. But correlation studies require caution. Sites implementing FAQPage schema likely also have better content structure, more comprehensive answers, and stronger technical SEO overall. The schema may be a proxy for the kind of well-organized, information-rich content that LLMs prefer regardless of markup. Still, we are now seeing verified data from experiments. In an experiment by Aiso, sites utilising schema markup saw a "30% improvement in accuracy, completeness and presentation quality" of data provided about marked-up sites vs. unmarked-up sites from ChatGPT.
How Structured Data Enters LLM Knowledge at Scale
Even if ChatGPT's real-time retrieval treats JSON-LD as plain text, structured data shapes LLM knowledge through a pipeline most SEOs never consider.
According to the current state of research, the training processes of foundation models typically use parallel data streams. Large text corpora like C4 and The Pile serve as the primary textual basis, while structured web data extracted from Common Crawl via projects like Web Data Commons is processed through Data-to-Text mechanisms: facts obtained from Microdata and Schema.org sources are serialized into natural language sentences and mixed into pre-training as an additional knowledge source, enriching the models with factual knowledge.
Consider what this means practically. WDC data potentially feeds the machine world knowledge of many LLMs. During pre-training, domain references are removed, so knowledge transitions into the model anonymously. Through structured, consistent entity data-via schema.org and @id-one can specifically influence how permanently brands, products, or organizations remain anchored in the model's knowledge.
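To make the Data-to-Text idea concrete, here is a deliberately simplified sketch of how a schema.org record could be serialized into a training-ready sentence. Real pipelines are far more sophisticated (often learned models, not templates); the `verbalize_organization` helper and the "Acme Corp" entity are hypothetical.

```python
def verbalize_organization(entity: dict) -> str:
    """Serialize a schema.org Organization record into a natural-language
    sentence, roughly how Data-to-Text pipelines turn extracted facts
    into training text. Illustrative template only."""
    parts = [f"{entity['name']} is an organization"]
    if "foundingDate" in entity:
        parts.append(f"founded in {entity['foundingDate']}")
    if "founder" in entity:
        parts.append(f"by {entity['founder']}")
    return " ".join(parts) + "."

# Hypothetical extracted schema.org record.
org = {
    "@type": "Organization",
    "name": "Acme Corp",
    "foundingDate": "2012",
    "founder": "Jane Doe",
}
print(verbalize_organization(org))
# Acme Corp is an organization founded in 2012 by Jane Doe.
```

Note what survives the conversion: the facts, not the markup. The sentence that enters the corpus carries your founding year and founder's name, with no trace of the JSON-LD that supplied them.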
Your Organization schema may not get special treatment during a ChatGPT search query. But the facts it contains-your founding year, your CEO's name, your service categories-may have entered the model's parametric memory through Data-to-Text conversion from Common Crawl data. If your schema is the most complete and consistent source of those facts on the web, the model's "knowledge" of your brand reflects that schema. This isn't speculation. The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013, laying the foundation for analyzing adoption while providing machine learning training data for tasks such as product matching, product categorization, information extraction, or question answering.
How Perplexity Differs from ChatGPT-And Why It Matters
Treating all AI search platforms as identical is a common mistake. Perplexity's architecture creates fundamentally different citation dynamics.
The retrieval system is primary in Perplexity's pipeline. The system searches, filters, ranks, deduplicates, and assembles a structured prompt with citations all before the LLM is invoked. The LLM acts as a synthesizer bound by retrieved evidence, not the primary knowledge source. This means Perplexity's citations depend more on whether your page passes the retrieval filter than on what the LLM "knows" from training.
Perplexity generates cited answers through a multi-stage Retrieval-Augmented Generation pipeline: query intent parsing, real-time web retrieval using hybrid methods (BM25 + dense embeddings), multi-layer ML ranking with a three-tier reranker, structured prompt assembly with pre-embedded citations, and LLM synthesis constrained by retrieved evidence. Each stage filters candidate sources further, meaning a document must pass semantic relevance, freshness, structural quality, authority, and engagement checkpoints before it earns a citation.
Structured data's role in this pipeline is indirect but meaningful. Perplexity can use structured data to better understand content's meaning, relationships, and metadata. While the platform hasn't published details on exactly how much weight schema markup carries in citation decisions, implementing it makes your content more machine-readable for any AI system.
The platforms also differ dramatically in content sourcing. Within ChatGPT's top 10 most-cited sources, Wikipedia accounts for nearly half (47.9%) of citations among leading sources. Reddit emerges as the leading source for both Google AI Overviews (2.2%) and Perplexity (6.6%). Optimizing for one platform doesn't guarantee results on another.
What Actually Moves the Needle: Structure Over Markup
Here's the uncomfortable truth that emerges from all of this evidence: well-structured content matters more than schema markup itself, but schema reinforces and amplifies that structure.
Schema markup does not directly cause AI to cite your page, but it helps AI understand what your page is and what it contains. Think of it as labelling your content so AI does not have to infer the structure.
The practitioners seeing the best results are doing several things simultaneously:
- Leading with extractable answers. 55% of AI Overview citations come from the top 30% of a page. The data is unambiguous. Put your most citable statement in the first 50 words after each H2.
- Using FAQ structures with matching schema. Sites with structured data see up to 30% higher visibility in AI overviews. FAQPage schema is particularly effective because it pre-formats content as question-answer pairs that AI systems can easily extract and cite.
- Ensuring server-side delivery. If your schema loads via JavaScript, it's invisible to GPTBot, PerplexityBot, ClaudeBot, and every other major AI crawler. This isn't optional-it's a hard technical gate.
- Maintaining entity consistency across platforms. AI systems cross-reference signals from multiple sources and formats. Your brand description on LinkedIn should align with what appears on your site. When signals are consistent across sources, AI systems can categorize and reference your brand with greater confidence. When they conflict, confidence drops.
- Implementing Article schema with author attribution. Article and BlogPosting schema clarify key content attributes like publication date, author, and topic. These signals reinforce content authority and freshness, which can influence how LLMs surface content in responses.
Schema Types Worth Prioritizing
Not all schema types carry equal weight for AI visibility. Based on available evidence, here's the priority order:

1. FAQPage - The highest-performing schema type across every study and test. FAQPage schema was confirmed as the strongest positive predictor of ChatGPT visibility with an odds ratio of approximately 13.
2. Organization with sameAs - Connects your brand entity across platforms, feeding knowledge graph construction.
3. Article with author and datePublished - Supports freshness and E-E-A-T signals that both traditional search and AI systems evaluate.
4. Product with AggregateRating - Gives AI systems entity clarity by defining relationships between entities. Structured fields like datePublished, price, ratingValue, and address are unambiguous; the model doesn't need to parse a sentence to find the publication date.
5. HowTo - Step-by-step formats align naturally with how users query AI systems.
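As a concrete example of the top-priority type, here's a small sketch that assembles FAQPage JSON-LD from question-answer pairs, following the Question/acceptedAnswer structure documented on schema.org. The `build_faq_jsonld` helper is a hypothetical name for illustration.

```python
import json

def build_faq_jsonld(qa_pairs):
    """Build a schema.org FAQPage JSON-LD object from (question, answer)
    pairs, using the Question -> acceptedAnswer -> Answer structure."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": q,
                "acceptedAnswer": {"@type": "Answer", "text": a},
            }
            for q, a in qa_pairs
        ],
    }

faq = build_faq_jsonld([
    ("Does ChatGPT read JSON-LD?",
     "Yes, but it appears to treat it as plain text rather than a "
     "validated object graph."),
])

# Emit as a script tag, to be rendered server-side into the page head.
print('<script type="application/ld+json">')
print(json.dumps(faq, indent=2))
print("</script>")
```

Pair each markup block with visible on-page Q&A text that says the same thing; markup that contradicts or exceeds the visible content undercuts the consistency signal discussed above.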
The Practitioner's Decision Framework
Given the evidence gaps, how should you allocate resources? Here's a framework based on what we actually know: If you're not doing schema at all, start immediately. Even if ChatGPT and Perplexity treat JSON-LD as flat text, they still read it. Your structured data doubles as additional on-page content that these systems ingest. OpenAI parses static HTML, and it's likely that schema embedded as JSON-LD can be processed by crawlers like GPTBot. Structured data is not a shortcut to AI visibility, but it's a vital support mechanism. It helps models understand what each part of the page is, which increases the chances your content will be cited in AI-generated answers.
If you already have basic schema, audit for three things. First, confirm server-side rendering-check your page source, not the rendered DOM. Second, verify that your schema facts match your visible content exactly. Ensure consistency between body text and markup, use unique IDs and stable entities, and maintain your content so that it is machine-readable, unambiguous, and human-citable. Third, add FAQPage schema to your highest-traffic informational pages. If you want to measure impact, adding schema without measuring the impact is guessing. After adding schema, check metrics at 30, 60, and 90 days. Track at the page level, not the site level-a site-wide average will dilute the signal. You want to see whether specific pages that received schema markup are getting cited more than they were before.
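The "schema facts match visible content" check can be roughed out in code. This is a simplified sketch assuming flat, string-valued JSON-LD on a single page; a real audit would use a proper HTML parser and walk nested schema objects. All names and the sample page are hypothetical.

```python
import json
import re

def schema_facts_in_visible_text(html: str) -> dict:
    """For each string-valued fact in a page's JSON-LD, report whether it
    also appears in the visible body text. Simplified sketch: assumes one
    flat JSON-LD block and uses crude regex-based tag stripping."""
    match = re.search(
        r'<script[^>]*application/ld\+json[^>]*>(.*?)</script>',
        html, re.DOTALL | re.IGNORECASE,
    )
    if not match:
        return {}
    data = json.loads(match.group(1))
    # Remove script blocks first so the JSON-LD itself doesn't count as
    # "visible", then strip remaining tags.
    visible = re.sub(r"<script.*?</script>", " ", html,
                     flags=re.DOTALL | re.IGNORECASE)
    visible = re.sub(r"<[^>]+>", " ", visible)
    return {
        key: (value in visible)
        for key, value in data.items()
        if isinstance(value, str) and not key.startswith("@")
    }

page = """<html><head>
<script type="application/ld+json">
{"@type": "Organization", "name": "Acme Corp", "foundingDate": "2012"}
</script></head>
<body><p>Acme Corp was founded in 2012.</p></body></html>"""

print(schema_facts_in_visible_text(page))
# {'name': True, 'foundingDate': True}
```

Any `False` in the result flags a fact asserted in markup but absent from the visible page, which is exactly the kind of body-text/markup mismatch the audit should catch.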
Don't over-index on schema at the expense of content quality. Structured data may influence LLM rankings, but the overall quality of your content has a far more significant impact. AI-powered search tools prioritize unique, high-quality content over competitors, even if the page doesn't implement structured data.
Where the Evidence Points Next
The honest answer to "Do ChatGPT and Perplexity use structured data?" is: yes, but not the way most articles claim. Both platforms read JSON-LD content. Neither appears to parse it as a true structured object with validated key-value relationships during live retrieval. The Williams-Cook tests confirm this. But the Schanbacher study confirms that pages with schema-particularly FAQ and Product types-correlate with dramatically higher ChatGPT visibility. And the Web Data Commons pipeline confirms that schema.org data feeds into LLM training corpora at massive scale, shaping parametric model knowledge whether or not the LLM "understands" the JSON format during inference.
The broader 2025 trend is that LLM systems integrated with structured data sources don't just generate schema; they use it as grounding for real-world reasoning. What really matters is how LLMs use structured data to improve accuracy, reduce hallucinations, and enhance decision-making.
For practitioners, the strategic takeaway is clear: don't implement schema because it's a magic signal for AI. Implement it because it forces you to define your entities clearly, structure your content deliberately, and maintain machine-readable facts that feed into every system consuming your content-from Google's Knowledge Graph to Common Crawl to every AI retrieval pipeline built on top of web data. Schema isn't the shortcut. It's the infrastructure. And the sites treating it as infrastructure-not as a GEO tactic-are the ones showing up consistently across ChatGPT, Perplexity, and everywhere else that matters.
Ready to optimize for the AI era?
Get a free AEO audit and discover how your brand shows up in AI-powered search.
Get Your Free Audit