- Dataset is the schema for any structured data file or table - CSV, JSON, SQL dumps, GeoJSON, research data. Powers Google Dataset Search at datasetsearch.research.google.com.
- Required: name, description. Realistic floor: creator, distribution, license, temporalCoverage, spatialCoverage, identifier, citation.
- distribution is the critical property. A DataDownload object with contentUrl (the file URL) and encodingFormat (the MIME type, e.g. text/csv or application/json).
- License URL must be machine-readable. Creative Commons, ODbL, public domain dedication, or your own permissive license URL.
- Notify Google Dataset Search by submitting an XML sitemap or by adding the page to your main sitemap. Google indexes Dataset-tagged pages into Dataset Search automatically.
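For the sitemap route, a minimal sketch that assumes the standard sitemaps.org format. The `build_sitemap` helper and the example URL are hypothetical; swap in your own landing pages.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for (landing_page_url, lastmod) pairs.

    List the HTML landing pages that carry the Dataset schema,
    not the raw data files.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    [("https://www.example.com/data/us-ecommerce-2025", "2026-05-20")]
)
print(sitemap_xml)
```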
Chapter 1. Before you start
Dataset schema is what gets your structured data file or table indexed by Google Dataset Search - a dedicated search experience used by researchers, data journalists, and increasingly by AI engines for grounded factual answers. Without Dataset schema, your data files exist on your site but are effectively invisible to anyone searching specifically for data.
- Confirm the page is the canonical dataset landing page, not the data file itself. Dataset schema goes on an HTML page that describes the dataset and links to one or more download URLs.
- Pull a clear license URL. Open data should be CC-BY, CC-0, ODbL, or similar. Commercial-only datasets need a publisher's terms-of-use URL.
- Identify all distribution channels. Some datasets ship in multiple formats (CSV + JSON + SQL); each is a separate DataDownload.
- Decide on identifier. DOIs for research datasets; arbitrary stable URLs for everything else.
- Capture temporal and spatial coverage. Time range and geographic region the data covers.
Chapter 2. What does Dataset schema actually do for SEO + AI search?
- Google Dataset Search inclusion. The dedicated search experience at datasetsearch.research.google.com indexes Dataset-tagged pages and is the primary way researchers find data online.
- AI engine grounded answers. Perplexity, ChatGPT, and Gemini cite Dataset-tagged sources heavily when answering factual / statistical queries because the data is structured and citable.
- Knowledge Graph data backing. Datasets feed Google's Knowledge Graph for fact-based entities (population figures, economic indicators, scientific measurements).
- Standard SEO ranking signals. Dataset pages typically rank well for data-specific queries ("X dataset", "X statistics", "X data download") even without targeting them.
Chapter 3. Required and recommended properties
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "@id": "https://www.example.com/data/us-ecommerce-2025#dataset",
  "name": "US Ecommerce Sales 2020-2025",
  "description": "Quarterly US ecommerce sales by category, segmented by device and traffic source, covering 2020 Q1 through 2025 Q4. Includes 142 SKU categories across 8 industries.",
  "url": "https://www.example.com/data/us-ecommerce-2025",
  "sameAs": "https://doi.org/10.5281/zenodo.0000000",
  "identifier": "DOI:10.5281/zenodo.0000000",
  "keywords": ["ecommerce", "US sales", "quarterly data", "device segmentation"],
  "creator": {
    "@type": "Organization",
    "name": "Capconvert Research",
    "url": "https://www.capconvert.com",
    "sameAs": "https://www.linkedin.com/company/capconvert"
  },
  "datePublished": "2026-01-15",
  "dateModified": "2026-05-20",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "temporalCoverage": "2020-01-01/2025-12-31",
  "spatialCoverage": {
    "@type": "Place",
    "name": "United States"
  },
  "variableMeasured": [
    "Quarterly sales (USD)",
    "Device share (desktop / mobile / tablet)",
    "Traffic source share (organic / paid / direct / email / social)"
  ],
  "citation": "Capconvert Research (2026). US Ecommerce Sales 2020-2025. https://www.example.com/data/us-ecommerce-2025",
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv",
      "contentUrl": "https://www.example.com/data/us-ecommerce-2025.csv",
      "name": "Full CSV"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "application/json",
      "contentUrl": "https://www.example.com/data/us-ecommerce-2025.json",
      "name": "JSON snapshot"
    }
  ]
}
Each field earns its place: variableMeasured tells Google what the data describes, temporalCoverage and spatialCoverage drive scoping filters in Dataset Search, and citation is what consumers see when they want to credit the source.
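If the landing page is rendered from a template, one way to produce a record like the one above is to build it as a dictionary and serialize it into a JSON-LD script tag. This is a sketch under assumptions: `dataset_jsonld` and `to_script_tag` are hypothetical helper names, not part of any library.

```python
import json


def dataset_jsonld(name, description, landing_url, **optional):
    """Build a Dataset record; optional properties set to None are dropped."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": landing_url,
    }
    record.update({key: value for key, value in optional.items() if value is not None})
    return record


def to_script_tag(record):
    """Serialize a record into a JSON-LD script tag for the landing page."""
    # Escape "</" so a stray "</script>" inside a string cannot close the tag early.
    payload = json.dumps(record, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


example = dataset_jsonld(
    "US Ecommerce Sales 2020-2025",
    "Quarterly US ecommerce sales by category.",
    "https://www.example.com/data/us-ecommerce-2025",
    license="https://creativecommons.org/licenses/by/4.0/",
    temporalCoverage="2020-01-01/2025-12-31",
)
print(to_script_tag(example))
```

Dropping `None` values keeps empty properties out of the markup rather than shipping them as null.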
Chapter 4. DataDownload distribution
Each downloadable file is a separate DataDownload in the distribution array. Required properties on DataDownload: contentUrl (the file URL) and encodingFormat (MIME type).
Common encodingFormat values:
- text/csv - CSV
- application/json - JSON
- text/tab-separated-values - TSV
- application/vnd.ms-excel - XLS
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX
- application/xml - XML
- application/geo+json - GeoJSON
- application/vnd.apache.parquet - Parquet
- application/sql - SQL dump
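For API-accessible data without a file download, either point a DataDownload's contentUrl at the API endpoint and link the API documentation in the description, or skip distribution and rely on url alone with the API documentation in the description. DataCatalog describes a collection of datasets, so it is not a substitute for distribution.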
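As a sketch of how distribution entries could be generated from file URLs, the helper below maps a file extension to an encodingFormat value. The `data_download` function and the extension table are assumptions for illustration; GeoJSON uses application/geo+json, the media type registered in RFC 7946.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed extension-to-MIME mapping; extend it for the formats you publish.
ENCODING_FORMATS = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".tsv": "text/tab-separated-values",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xml": "application/xml",
    ".geojson": "application/geo+json",
    ".sql": "application/sql",
}


def data_download(content_url, name=None):
    """Build one DataDownload entry, inferring encodingFormat from the URL path."""
    suffix = PurePosixPath(urlparse(content_url).path).suffix.lower()
    if suffix not in ENCODING_FORMATS:
        raise ValueError(f"no encodingFormat known for {suffix!r}")
    entry = {
        "@type": "DataDownload",
        "encodingFormat": ENCODING_FORMATS[suffix],
        "contentUrl": content_url,
    }
    if name:
        entry["name"] = name
    return entry


distribution = [
    data_download("https://www.example.com/data/us-ecommerce-2025.csv", "Full CSV"),
    data_download("https://www.example.com/data/us-ecommerce-2025.json", "JSON snapshot"),
]
```

Raising on an unknown extension is a deliberate choice here: a wrong MIME type is harder to spot later than a failed build.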
Chapter 5. License, citation, and coverage
License
Machine-readable URL only. Common open-data licenses: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/), CC0 (https://creativecommons.org/publicdomain/zero/1.0/, a public domain dedication), ODbL (https://opendatacommons.org/licenses/odbl/). Commercial-only: link your own terms-of-use page.
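A quick way to catch free-text licenses before they ship is a URL check like the sketch below. `is_license_url` is a hypothetical helper; it only tests that the value has the shape of an http(s) URL, not that the page exists or describes a real license.

```python
from urllib.parse import urlparse


def is_license_url(value):
    """Return True when value looks like an absolute http(s) URL."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_license_url("https://creativecommons.org/licenses/by/4.0/"))
print(is_license_url("Creative Commons"))
```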
Citation
A plain-text citation in the schema that consumers can copy-paste to credit the source. Format: Creator (Year). Title. URL or DOI.
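The format above is simple enough to generate from the same fields that feed the rest of the record. A minimal sketch, with `format_citation` as a hypothetical helper name:

```python
def format_citation(creator, year, title, locator):
    """Render 'Creator (Year). Title. URL or DOI.' as a plain-text citation."""
    # Strip a trailing period from the title so the output never ends up with "..".
    return f"{creator} ({year}). {title.rstrip('.')}. {locator}"


citation = format_citation(
    "Capconvert Research",
    2026,
    "US Ecommerce Sales 2020-2025",
    "https://www.example.com/data/us-ecommerce-2025",
)
print(citation)
```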
Temporal and spatial coverage
temporalCoverage uses ISO 8601 interval format: 2020-01-01/2025-12-31. For a single point in time, use just the date. Open-ended intervals use two dots in place of the end date: 2020-01-01/.. (from 2020 onward).
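The three temporalCoverage shapes can be produced from date objects so the string is always well formed. A sketch, assuming a hypothetical `temporal_coverage` helper:

```python
from datetime import date


def temporal_coverage(start, end=None, open_ended=False):
    """Format temporalCoverage as a date, a closed interval, or an open interval."""
    if open_ended:
        return f"{start.isoformat()}/.."
    if end is None or end == start:
        return start.isoformat()
    if end < start:
        raise ValueError("end date precedes start date")
    return f"{start.isoformat()}/{end.isoformat()}"


print(temporal_coverage(date(2020, 1, 1), date(2025, 12, 31)))
print(temporal_coverage(date(2020, 1, 1), open_ended=True))
print(temporal_coverage(date(2020, 1, 1)))
```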
spatialCoverage uses a Place object with name (country, region, city) or geo with GeoCoordinates / GeoShape for specific polygons.
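A sketch of the three spatialCoverage variants: name only, a bounding box, or a single point. The `spatial_coverage` helper is hypothetical, and the latitude-first ordering of the box string is an assumption based on the examples in Google's Dataset documentation; confirm it against the current docs before relying on it.

```python
def spatial_coverage(name, box=None, point=None):
    """Build a Place for spatialCoverage.

    box   -- (south, west, north, east) in decimal degrees
    point -- (latitude, longitude) in decimal degrees
    """
    place = {"@type": "Place", "name": name}
    if box is not None:
        south, west, north, east = box
        # GeoShape.box is two corner points separated by spaces.
        place["geo"] = {"@type": "GeoShape", "box": f"{south} {west} {north} {east}"}
    elif point is not None:
        latitude, longitude = point
        place["geo"] = {
            "@type": "GeoCoordinates",
            "latitude": latitude,
            "longitude": longitude,
        }
    return place


print(spatial_coverage("United States"))
```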
Chapter 6. Where do you place Dataset schema on the site?
One Dataset record per dataset landing page. Convention: /data/{dataset-slug} or /datasets/{id}. The schema lives in the <head> of the landing page that describes the dataset.
For a data catalog with many datasets, ship a DataCatalog root with each dataset linked via dataset property, plus individual Dataset records on each detail page.
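The catalog root can be sketched as below. `data_catalog` is a hypothetical helper, and the catalog name, catalog URL, and second dataset are made up for illustration; each detail page still carries its own full Dataset record.

```python
import json


def data_catalog(name, catalog_url, datasets):
    """Build a DataCatalog that links each dataset by name and landing-page URL."""
    return {
        "@context": "https://schema.org",
        "@type": "DataCatalog",
        "name": name,
        "url": catalog_url,
        "dataset": [
            {"@type": "Dataset", "name": dataset_name, "url": dataset_url}
            for dataset_name, dataset_url in datasets
        ],
    }


catalog = data_catalog(
    "Example Data Catalog",  # hypothetical catalog
    "https://www.example.com/data",
    [
        ("US Ecommerce Sales 2020-2025", "https://www.example.com/data/us-ecommerce-2025"),
        ("Example Second Dataset", "https://www.example.com/data/example-second"),
    ],
)
print(json.dumps(catalog, indent=2))
```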
Chapter 7. The breakages we see most often
Ranked by frequency across 11 dataset-publishing site audits:
- No distribution with DataDownload, so consumers can't programmatically access the file. 8 of 11.
- License as free text ("Creative Commons") instead of a URL. 6 of 11.
- No temporalCoverage, missing time-range filtering in Dataset Search. 5 of 11.
- No variableMeasured, leaving Google to infer what columns the data describes. 4 of 11.
- Dataset schema on the data file URL itself instead of an HTML landing page. 3 of 11.
- No identifier (DOI or stable URL). Citation reliability suffers. 2 of 11.
We track these on running sites through our Sentry structured-data rule set.
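The six breakages translate into straightforward checks. The sketch below is an illustration only, not the rule set mentioned above: `audit_dataset` is a hypothetical function, and it assumes you pass in the parsed JSON-LD record plus the Content-Type of the page it was found on.

```python
from urllib.parse import urlparse


def _is_url(value):
    return isinstance(value, str) and urlparse(value).scheme in ("http", "https")


def audit_dataset(record, served_content_type="text/html"):
    """Return findings for the six breakages, in the ranked order above."""
    findings = []

    distribution = record.get("distribution") or []
    if isinstance(distribution, dict):  # a single object instead of an array
        distribution = [distribution]
    has_download = any(
        isinstance(d, dict) and d.get("@type") == "DataDownload" and d.get("contentUrl")
        for d in distribution
    )
    if not has_download:
        findings.append("distribution: no DataDownload with contentUrl")

    if not _is_url(record.get("license")):
        findings.append("license: not a URL")
    if not record.get("temporalCoverage"):
        findings.append("temporalCoverage: missing")
    if not record.get("variableMeasured"):
        findings.append("variableMeasured: missing")
    if not served_content_type.lower().startswith("text/html"):
        findings.append("page: schema is not on an HTML landing page")
    if not record.get("identifier"):
        findings.append("identifier: missing")
    return findings


print(audit_dataset({"@type": "Dataset", "name": "x", "license": "Creative Commons"}))
```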
FAQ
Do I need a DOI for Dataset schema to work?
No. DOIs are best practice for research datasets but optional. For non-academic datasets, a stable URL in identifier or sameAs is sufficient.
Can I ship Dataset for an API-only data source?
Yes, but the rich result is weaker without a downloadable file. Use distribution with the API endpoint as contentUrl and link the API documentation in description. For best results, also expose a static snapshot file (CSV, JSON) and link both.
Should I ship Dataset for a Looker Studio / Tableau dashboard?
Borderline. If the dashboard is the canonical data view and visitors can export the underlying data, yes. If it's just a visualization without raw-data access, no - that's a WebPage with about referencing your Dataset entity.
How does Dataset schema relate to JSON-LD on the data file itself?
Dataset schema goes on the HTML landing page. The data file (CSV / JSON) doesn't have a place for JSON-LD itself - that's why distribution exists, to link the HTML metadata to the file binary.
What if my dataset is private / behind a paywall?
You can still ship Dataset schema describing it. Set isAccessibleForFree: false and link the access page in distribution.contentUrl. Users searching Dataset Search see the listing but understand access requires payment / registration.
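A sketch of what a gated record could look like, following the answer above. `gated_dataset` is a hypothetical helper and the name, description, and URLs are placeholders.

```python
import json


def gated_dataset(name, description, landing_url, access_url, encoding_format):
    """Describe a dataset whose files sit behind payment or registration."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": landing_url,
        "isAccessibleForFree": False,  # serialized as JSON false
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": access_url,  # the access page, not the raw file
                "encodingFormat": encoding_format,
            }
        ],
    }


gated = gated_dataset(
    "Example Gated Dataset",
    "Placeholder description of a paid dataset.",
    "https://www.example.com/data/example-gated",
    "https://www.example.com/data/example-gated/access",
    "text/csv",
)
print(json.dumps(gated, indent=2))
```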
How do I update the schema when the data is refreshed?
Update dateModified in the schema. For monthly or quarterly updates with stable structure, leave datePublished as the original release date. For brand-new versions with different structure, consider a separate Dataset record with its own identifier and a link to the previous version.
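For the routine-refresh case, a sketch of a helper that bumps dateModified and leaves everything else alone. `refresh_dataset` is a hypothetical name, and the new date below is a made-up example.

```python
import copy
from datetime import date


def refresh_dataset(record, modified_on):
    """Return a copy of the record with dateModified set; datePublished is untouched."""
    updated = copy.deepcopy(record)
    updated["dateModified"] = modified_on.isoformat()
    return updated


original = {
    "@type": "Dataset",
    "name": "US Ecommerce Sales 2020-2025",
    "datePublished": "2026-01-15",
    "dateModified": "2026-05-20",
}
refreshed = refresh_dataset(original, date(2026, 8, 20))
print(refreshed["datePublished"], refreshed["dateModified"])
```

Working on a copy keeps the previously published record available for comparison when you diff releases.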
References
- Schema.org. "Dataset." schema.org/Dataset
- Schema.org. "DataDownload." schema.org/DataDownload
- Google Search Central. "Dataset (Dataset) structured data." developers.google.com/search/docs/appearance/structured-data/dataset
- Google. "Dataset Search." datasetsearch.research.google.com
- Creative Commons. "About the licenses." creativecommons.org/licenses
- Schema.org. "Schema Markup Validator." validator.schema.org