The Cortex Corpus · primary-source index
151,055
Documents indexed · hybrid BM25 + voyage-3-large dense
1,528,654
Indexed passages
68
Source platforms
38ms
Retrieval p50

One library. Every primary source that decides whether a brand gets shown by search engines, AI engines, and ad platforms.

Index footprint
13.2 GB
Embedded passages
99.98%
Last write
14 seconds ago
Reranker
cohere-rerank-v3.5
I. Scale

The corpus covers every platform that gates visibility - from search engines to AI engines to ad platforms.

Not a generic web crawl. Not a scraped competitor index. A deliberate library of canonical platform documentation, technical references, and editorial coverage - curated so Cortex can answer with citations, not guesses.

Platform docs
111,039

Google, Microsoft, Shopify, WordPress, Meta, Amazon, MDN, WooCommerce, HubSpot, Klaviyo, Magento, BigCommerce - every platform that ranks or serves.

Editorial authority
23,993

Moz, Search Engine Land, Search Engine Journal, Search Engine Roundtable, Backlinko, Aleyda Solis + community signal from Reddit - independent voices that surface changes before official docs catch up.

AI & ML providers
9,104

Anthropic, OpenAI, Mistral, Cohere, Perplexity, You.com, xAI, Apple Intelligence, DeepSeek, Brave, Mojeek.

Standards & registry
6,919

Schema.org, W3C, IETF, IAB, ICANN, IANA, HTTP Archive, Common Crawl - bodies whose specs the entire web operates against.

II. Quality

Scale without curation is noise. Every passage passes four gates before Cortex can use it.

A primary-source corpus only earns the name if you can trust each chunk in it. The Cortex ingest pipeline rejects fabricated content, reranks for authority, and quarantines anything that fails embedding QC.

01 SOURCE

Tier-weighted selection

Every URL is tagged with a tier before crawl. Tier 1: platform-owned policy and API specs. Tier 2: official partner docs. Tier 3: curated editorial. Tier 4: community signal (used as context, never as citation).

97% Tier 1 + 2 citations
0 Untagged sources
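A minimal sketch of how the tier gate above could be enforced in code. The `Tier`, `Source`, and `citable` names are hypothetical, chosen for illustration and not taken from Cortex itself; the two URLs are sample inputs only.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    PLATFORM = 1   # platform-owned policy and API specs
    PARTNER = 2    # official partner docs
    EDITORIAL = 3  # curated editorial
    COMMUNITY = 4  # community signal: context only, never a citation


@dataclass(frozen=True)
class Source:
    url: str
    tier: Tier

    def __post_init__(self):
        # Reject untagged or unknown tiers before the URL is ever crawled.
        if not isinstance(self.tier, Tier):
            raise ValueError(f"untagged source: {self.url}")


def citable(sources):
    """Keep only the sources that may appear as citations (Tiers 1-3)."""
    return [s for s in sources if s.tier < Tier.COMMUNITY]


# Illustrative inputs, not real corpus rows.
sources = [
    Source("https://schema.org/Product", Tier.PLATFORM),
    Source("https://www.reddit.com/r/SEO/", Tier.COMMUNITY),
]
cited = citable(sources)
```

Under these assumptions the community source stays available as context, but only the Tier 1 source survives the citation filter.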
02 CHUNK

300-token passages, with overlap

Each document is split with semantic boundaries preserved. Headers, lists, code blocks held intact. Overlap maintains continuity so a single answer never depends on a half-sentence chunk.

10.1 Chunks per doc (median)
300 Token target
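One way to read the chunking gate above, as a rough sketch rather than the actual Cortex splitter: whole structural blocks are packed into passages up to the token target, and the last block of each passage is repeated at the start of the next. The `chunk` function, its parameters, and the whitespace token count are all assumptions made for illustration.

```python
def chunk(blocks, target_tokens=300, overlap_blocks=1):
    """Pack whole blocks into passages of roughly `target_tokens` tokens.

    `blocks` are structural units (a paragraph, a list, a code block)
    that are never split. Token counts use whitespace splitting as a
    stand-in for a real tokenizer. The last `overlap_blocks` blocks of
    each passage are repeated at the start of the next passage.
    """
    passages, current, size = [], [], 0
    for block in blocks:
        n = len(block.split())
        if current and size + n > target_tokens:
            passages.append(current)
            # Carry the tail of the finished passage forward as overlap.
            current = current[-overlap_blocks:] if overlap_blocks else []
            size = sum(len(b.split()) for b in current)
        current.append(block)
        size += n
    if current:
        passages.append(current)
    return ["\n\n".join(p) for p in passages]


# Four synthetic 120-token blocks, labeled p0..p3.
blocks = [" ".join([f"p{i}"] * 120) for i in range(4)]
passages = chunk(blocks)
```

With these synthetic blocks each passage holds two blocks (240 tokens), and every passage after the first begins with the block that closed the previous one.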
03 EMBED

voyage-3-large, halfvec(3072)

Dense embeddings are generated by voyage-3-large and stored as halfvec (16-bit floats), which cuts the vector footprint roughly in half relative to 32-bit storage while keeping recall sharp. Every chunk is also indexed by Postgres BM25 for lexical match.

99.98% Embed success rate
0 Unembedded served
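The half-precision trade-off can be checked with plain arithmetic. The sketch below uses Python's `struct` module to pack one random 3072-dimension vector as 32-bit and as 16-bit floats; it illustrates the float16 idea only and is not pgvector's on-disk layout or the Cortex pipeline. The random vector is a stand-in for a real embedding.

```python
import math
import random
import struct

DIMS = 3072  # dimension count quoted in the gate above

random.seed(0)
vec = [random.gauss(0.0, 1.0) for _ in range(DIMS)]  # stand-in embedding

full = struct.pack(f"<{DIMS}f", *vec)  # float32: 4 bytes per dimension
half = struct.pack(f"<{DIMS}e", *vec)  # float16: 2 bytes per dimension


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Round-trip through float16 and compare against the original vector.
restored = struct.unpack(f"<{DIMS}e", half)
similarity = cosine(vec, restored)
```

The packed half-precision vector is exactly half the size of the float32 one, and the round-tripped vector stays very close to the original in cosine terms, which is the intuition behind trading precision for footprint.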
04 RERANK

cohere-rerank-v3.5

BM25 and dense candidates fused with Reciprocal Rank Fusion, then reranked by Cohere rerank-v3.5. The model only sees the top 15 passages - quality over breadth at retrieval time.

38ms Retrieval p50
94% Cache hit
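Reciprocal Rank Fusion itself is small enough to sketch. The version below scores each passage as the sum of 1 / (k + rank) across the BM25 and dense rankings, then truncates to 15 candidates, which is one reading of the "top 15" figure above. The `rrf` function name, the passage IDs, and k = 60 (the constant commonly used for RRF) are assumptions; the source does not state which k Cortex uses.

```python
def rrf(rankings, k=60, top_n=15):
    """Fuse several ranked lists of passage IDs with Reciprocal Rank Fusion.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by passage ID so the output
    is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical candidate lists from the lexical and dense retrievers.
bm25 = ["p3", "p1", "p7"]
dense = ["p1", "p9", "p3"]
fused = rrf([bm25, dense])  # this list would then go to the reranker
```

Passages that appear in both lists (`p1`, `p3`) rise above those found by only one retriever, which is the point of fusing before the reranker sees anything.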
CITE-READY
web.dev / Core Web Vitals
How CWV thresholds were defined
The worst-performing 10-30% of origins are classified as 'poor'. Mobile and desktop usage typically have very different characteristics.
Tier 1 / Google · doc_type / documentation
CITE-READY
schema.org
Product type, required properties
At least one of: review, aggregateRating, or offers is required. The offers property must contain a Price or PriceSpecification with priceCurrency in ISO 4217.
Tier 1 / Standards · doc_type / api_reference
CITE-READY
help.shopify.com
Improving SEO of a Shopify store
Shopify automatically generates a robots.txt file. You can edit it using the robots.txt.liquid template, but doing so can affect how search engines crawl your store.
Tier 1 / Shopify · doc_type / help_article
III. Active Learning

And then Cortex learns from it - not just on launch, but every day the corpus changes.

A corpus that is queried is just a search box. A corpus that is studied becomes a memory layer. Cortex runs a five-stage loop on the chunks: ingest, distill, audit, act, refine. Outcomes feed back. Weak rules get pruned. The library does not just stay current - it gets better at being read.

INGEST 01 · DISTILL 02 · AUDIT 03 · ACT 04 · REFINE 05
CORTEX · decision engine
01

Ingest - crawl + chunk

The daily crawler hits every monitored source on its freshness schedule. New documents are split into 300-token passages, with BM25 index entries and voyage-3-large embeddings written in one pass.

1,200 new docs / day
02

Distill - mine the rules

Learnings layer reads new chunks and writes structured rules, benchmarks, and patterns into a table separate from the raw passages. Each learning cites at least one chunk.

30 new learnings / day
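The "cites at least one chunk" rule can be expressed as a constraint on the record itself. This is a hedged sketch: the `Learning` class, its field names, and the chunk ID format are hypothetical, and the statement text is a placeholder rather than a real rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Learning:
    key: str          # e.g. "cwv-inp"
    statement: str    # the distilled rule, benchmark, or pattern
    chunk_ids: tuple  # IDs of the raw passages that support it

    def __post_init__(self):
        # A learning with no supporting passage is rejected outright.
        if not self.chunk_ids:
            raise ValueError(f"learning {self.key!r} cites no chunk")


# Placeholder values for illustration only.
learning = Learning("cwv-inp", "placeholder rule text", ("chunk-001",))
```

Keeping the citation requirement in the constructor means an uncited learning never reaches the table in the first place, under these assumptions.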
03

Audit - sentries query

Sentries pull pre-distilled rules from the learnings layer first, then attach supporting chunks for citation. Verdicts arrive in seconds with primary sources attached.

cite + verdict
04

Act - optimizers ship

Optimizers convert verdicts into concrete edits: schema, copy, redirects, ad changes. Every action records what it changed, why, and where the rule came from.

trace + revert path
05

Refine - outcomes feed back

Optimizer outcomes tell the learnings layer which rules survived contact with reality. Sharp rules stay. Weak ones get pruned. The next cycle starts smarter than the last.

self-correcting
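A rough sketch of what pruning on outcomes might look like. The source does not say how Cortex scores a rule, so the `refine` function, the win-rate criterion, both thresholds, and the sample tallies are assumptions invented for illustration.

```python
def refine(outcomes, min_trials=20, min_win_rate=0.5):
    """Split rules into kept and pruned lists based on recorded outcomes.

    `outcomes` maps a rule key to (wins, trials). Rules with fewer than
    `min_trials` trials are kept, since there is not yet enough evidence
    to judge them either way.
    """
    kept, pruned = [], []
    for key, (wins, trials) in sorted(outcomes.items()):
        if trials >= min_trials and wins / trials < min_win_rate:
            pruned.append(key)
        else:
            kept.append(key)
    return kept, pruned


# Invented sample tallies, not real Cortex data.
outcomes = {
    "cwv-inp": (41, 50),
    "meta_desc_length": (12, 60),
    "new_rule": (1, 3),
}
kept, pruned = refine(outcomes)
```

Here the rule with a long losing record is pruned, while the barely tested one is kept until it has enough trials to be judged.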
IV. This week in the corpus

The library announces what changed. Nothing learned in silence.

2 days ago · + rule

Google tightens INP threshold

Search Central guidance moved 'Good' INP from 200ms to 175ms. Cortex updated cwv-inp learning; 14 active audits reference the new threshold.

web.dev / articles / inp / threshold-update
3 days ago · + pattern

Schema.org Product v28 ships

New 'hasCertification' property and revised priceValidUntil semantics. Cortex re-distilled 7 product-schema rules; 3 changed thresholds.

schema.org / Product / version / 28.0
4 days ago · + pattern

PMax CPA benchmark recalibrated

Cross-account pass on 142 Cortex-managed PMax accounts dropped the apparel-vertical CPA benchmark to $47.12 (from $52.40).

cortex.learnings / pmax / apparel / cpa_benchmark
5 days ago · + rule

OpenAI structured outputs guide

New JSON-schema enforcement docs. Two LLM-output Sentries adopted the citation pattern; one Optimizer ships with the spec embedded.

platform.openai.com / docs / guides / structured-outputs
6 days ago · + pattern

Microsoft Copilot grounding rules

Bing webmaster team published new IndexNow + Copilot grounding interplay docs. 4 new GEO learnings; sentries.bingbot updated.

about.ads.microsoft.com / copilot-grounding
7 days ago · - rule

Retired: meta description length

Three months of outcome data showed no CTR delta from 155-160 char descriptions. Learning archived; supporting chunks remain in the audit trail.

cortex.learnings.archived / 2025 / meta_desc_length

151,055 documents. One engine that reads them all.

Cortex queries the corpus on every audit, learns from every outcome, and refines the rules that come next. The decision engine you can read along with.