The Cortex Corpus · primary-source index

151,055

Documents indexed · hybrid BM25 + voyage-3-large dense

1,528,654

Indexed passages

Source platforms

38ms

Retrieval p50

One library. Every primary source that decides whether a brand gets shown by search engines, AI engines, and ad platforms.

Index footprint: 13.2 GB
Embedded passages: 99.98%
Last write: 14 seconds ago
Reranker: cohere-rerank-v3.5

I. Scale

The corpus covers every platform that gates visibility - from search engines to AI engines to ad platforms.

Not a generic web crawl. Not a scraped competitor index. A deliberate library of canonical platform documentation, technical references, and editorial coverage - curated so Cortex can answer with citations, not guesses.

Platform docs

111,039

Google, Microsoft, Shopify, WordPress, Meta, Amazon, MDN, WooCommerce, HubSpot, Klaviyo, Magento, BigCommerce - every platform that ranks or serves.

Editorial authority

23,993

Moz, Search Engine Land, Search Engine Journal, Search Engine Roundtable, Backlinko, Aleyda Solis + community signal from Reddit - independent voices that surface changes before official docs catch up.

AI & ML providers

9,104

Anthropic, OpenAI, Mistral, Cohere, Perplexity, You.com, xAI, Apple Intelligence, DeepSeek, Brave, Mojeek.

Standards & registry

6,919

Schema.org, W3C, IETF, IAB, ICANN, IANA, HTTP Archive, Common Crawl - bodies whose specs the entire web operates against.

II. Quality

Scale without curation is noise. Every passage passes four gates before Cortex can use it.

A primary-source corpus only earns the name if you can trust each chunk in it. The Cortex ingest pipeline rejects fabricated content, reranks for authority, and quarantines anything that fails embedding QC.

01 SOURCE

Tier-weighted selection

Every URL is tagged with a tier before crawl. Tier 1: platform-owned policy and API specs. Tier 2: official partner docs. Tier 3: curated editorial. Tier 4: community signal (used as context, never as citation).

97% Tier 1 + 2 citations

0 Untagged sources
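
A minimal sketch of how tagging a URL with a tier before crawl might look. The `Tier` enum, the `TIER_BY_HOST` table, and the `tag_url` and `citable` helpers are hypothetical names chosen for illustration; the page describes only the four tiers, not the pipeline's actual code.

```python
from enum import IntEnum
from urllib.parse import urlparse


class Tier(IntEnum):
    PLATFORM = 1   # platform-owned policy and API specs
    PARTNER = 2    # official partner docs
    EDITORIAL = 3  # curated editorial
    COMMUNITY = 4  # community signal: context only, never cited


# Hypothetical host-to-tier table; a real one would be curated by hand.
TIER_BY_HOST = {
    "web.dev": Tier.PLATFORM,
    "schema.org": Tier.PLATFORM,
    "help.shopify.com": Tier.PLATFORM,
    "moz.com": Tier.EDITORIAL,
    "reddit.com": Tier.COMMUNITY,
}


def tag_url(url: str) -> Tier:
    """Return the tier for a URL, refusing anything that is untagged."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    if host not in TIER_BY_HOST:
        raise ValueError(f"untagged source, not crawled: {url}")
    return TIER_BY_HOST[host]


def citable(tier: Tier) -> bool:
    """Tier 4 may inform an answer but is never used as a citation."""
    return tier < Tier.COMMUNITY
```

Raising on an unknown host, rather than defaulting to a low tier, is one way to keep the untagged-source count at zero.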

02 CHUNK

300-token passages, with overlap

Each document is split with semantic boundaries preserved. Headers, lists, code blocks held intact. Overlap maintains continuity so a single answer never depends on a half-sentence chunk.

10.1 Chunks per doc (median)

300 Token target
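
One way to picture boundary-preserving chunking with overlap, as a sketch only. It assumes blocks are separated by blank lines, counts whitespace-separated words as a stand-in for model tokens, and carries one whole block forward as overlap; the overlap size Cortex uses is not stated on this page, and `chunk_blocks` is a hypothetical name.

```python
def chunk_blocks(text: str, target: int = 300) -> list[str]:
    """Pack blank-line-separated blocks into chunks of roughly `target` tokens.

    Blocks are never split, so a header, list, or code block stays intact.
    Each new chunk starts with the last block of the previous chunk (overlap).
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        n = len(block.split())  # whitespace tokens stand in for model tokens
        if current and size + n > target:
            chunks.append("\n\n".join(current))
            current = [current[-1]]  # carry one block forward as overlap
            size = len(current[0].split())
        current.append(block)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because blocks are kept whole, a chunk can run over the target when a single block is large; the target is a goal, not a hard cap.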

03 EMBED

voyage-3-large, halfvec(3072)

Dense embeddings are generated by voyage-3-large and stored as halfvec (16-bit floats), which halves the vector footprint relative to 32-bit storage while keeping recall sharp. Every chunk is also indexed with BM25 in Postgres for lexical match.

99.98% Embed success rate

0 Unembedded served
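
The halving comes down to simple arithmetic: two bytes per dimension instead of four. A rough sketch using Python's `struct` module, counting raw vector payload only; per-row headers and index structures are ignored, and `packed_size` is a hypothetical helper.

```python
import struct

DIMS = 3072  # dimension taken from the halfvec(3072) column above


def packed_size(values: list[float], fmt_char: str) -> int:
    """Bytes needed to pack `values` little-endian with the given float format."""
    return len(struct.pack(f"<{len(values)}{fmt_char}", *values))


vec = [0.125] * DIMS
full = packed_size(vec, "f")  # 32-bit floats
half = packed_size(vec, "e")  # 16-bit floats
print(full, half)  # 12288 6144
```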

04 RERANK

cohere-rerank-v3.5

BM25 and dense candidates are fused with Reciprocal Rank Fusion, then reranked by Cohere rerank-v3.5. The reranker sees only the top 15 passages - quality over breadth at retrieval time.

38ms Retrieval p50

94% Cache hit
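
Reciprocal Rank Fusion scores each passage by summing 1 / (k + rank) across the ranked lists it appears in. A minimal sketch follows; the constant `k = 60` is a commonly used default and an assumption here, not a documented Cortex setting, and `rrf_fuse` is a hypothetical name. The reranker call itself is left out.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 15) -> list[str]:
    """Fuse ranked lists of passage ids and keep the top_n for the reranker."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    ordered = sorted(scores, key=lambda p: (-scores[p], p))
    return ordered[:top_n]


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
print(rrf_fuse([bm25_hits, dense_hits]))  # ['b', 'c', 'a', 'd']
```

Passages that both retrievers agree on rise to the top, without having to put BM25 scores and cosine similarities on a common scale.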

CITE-READY

web.dev / Core Web Vitals

How CWV thresholds were defined

“The worst-performing 10-30% of origins are classified as 'poor'. Mobile and desktop usage typically have very different characteristics.”

Tier 1 / Google · doc_type / documentation

CITE-READY

schema.org

Product type, required properties

“At least one of: review, aggregateRating, or offers is required. The offers property must contain a Price or PriceSpecification with priceCurrency in ISO 4217.”

Tier 1 / Standards · doc_type / api_reference

CITE-READY

help.shopify.com

Improving SEO of a Shopify store

“Shopify automatically generates a robots.txt file. You can edit it using the robots.txt.liquid template, but doing so can affect how search engines crawl your store.”

Tier 1 / Shopify · doc_type / help_article

III. Active Learning

And then Cortex learns from it - not just on launch, but every day the corpus changes.

A corpus that is queried is just a search box. A corpus that is studied becomes a memory layer. Cortex runs a five-stage loop on the chunks: ingest, distill, audit, act, refine. Outcomes feed back. Weak rules get pruned. The library does not just stay current - it gets better at being read.

CORTEX · decision engine

Ingest - crawl + chunk

The daily crawler hits every monitored source on its freshness schedule. New documents are split into 300-token passages, with BM25 index entries and voyage-3-large embeddings written in one pass.

1,200 new docs / day

Distill - mine the rules

Learnings layer reads new chunks and writes structured rules, benchmarks, and patterns into a table separate from the raw passages. Each learning cites at least one chunk.

30 new learnings / day
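
The "each learning cites at least one chunk" rule can be enforced at construction time. A sketch, assuming a hypothetical `Learning` record; the field names and the example ids are illustrative and not the actual learnings schema.

```python
from dataclasses import dataclass

KINDS = {"rule", "benchmark", "pattern"}


@dataclass(frozen=True)
class Learning:
    key: str                   # e.g. "cwv-inp"
    kind: str                  # one of KINDS
    statement: str             # the distilled rule, benchmark, or pattern
    chunk_ids: tuple[str, ...]  # ids of the supporting passages

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown learning kind: {self.kind}")
        if not self.chunk_ids:
            raise ValueError("a learning must cite at least one chunk")


ok = Learning("cwv-inp", "rule", "Keep INP within the 'Good' range.", ("chunk-001",))
```

Keeping learnings in their own record type, with only chunk ids pointing back at the passages, mirrors the separation between the learnings table and the raw passages.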

Audit - sentries query

Sentries pull pre-distilled rules from the learnings layer first, then attach supporting chunks for citation. Verdicts arrive in seconds with primary sources attached.

cite + verdict

Act - optimizers ship

Optimizers convert verdicts into concrete edits: schema, copy, redirects, ad changes. Every action records what it changed, why, and where the rule came from.

trace + revert path

Refine - outcomes feed back

Optimizer outcomes tell the learnings layer which rules survived contact with reality. Sharp rules stay. Weak ones get pruned. The next cycle starts smarter than the last.

self-correcting
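
A sketch of what pruning on outcomes could look like. The `RuleStats` record, the `prune` function, and both thresholds (`min_trials`, `min_win_rate`) are assumptions made for illustration; the page does not state how Cortex scores a rule as weak.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    wins: int = 0    # optimizer actions where the rule held up
    losses: int = 0  # optimizer actions where it did not

    @property
    def trials(self) -> int:
        return self.wins + self.losses


def prune(
    stats: dict[str, RuleStats],
    min_trials: int = 20,
    min_win_rate: float = 0.5,
) -> tuple[dict[str, RuleStats], dict[str, RuleStats]]:
    """Split rules into (kept, archived) based on observed outcomes."""
    kept: dict[str, RuleStats] = {}
    archived: dict[str, RuleStats] = {}
    for key, s in stats.items():
        # Rules with too little evidence are kept until more outcomes arrive.
        weak = s.trials >= min_trials and s.wins / s.trials < min_win_rate
        (archived if weak else kept)[key] = s
    return kept, archived
```

Weak rules are moved to an archive rather than deleted, so the evidence behind a retired rule stays available for audit.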

IV. This week in the corpus

The library announces what changed. Nothing learned in silence.

2 days ago · + rule

Google tightens INP threshold

Search Central guidance moved 'Good' INP from 200ms to 175ms. Cortex updated cwv-inp learning; 14 active audits reference the new threshold.

web.dev / articles / inp / threshold-update

3 days ago · + pattern

Schema.org Product v28 ships

New 'hasCertification' property and revised priceValidUntil semantics. Cortex re-distilled 7 product-schema rules; 3 changed thresholds.

schema.org / Product / version / 28.0

4 days ago · + pattern

PMax CPA benchmark recalibrated

Cross-account pass on 142 Cortex-managed PMax accounts dropped the apparel-vertical CPA benchmark to $47.12 (from $52.40).

cortex.learnings / pmax / apparel / cpa_benchmark

5 days ago · + rule

OpenAI structured outputs guide

New JSON-schema enforcement docs. Two LLM-output Sentries adopted the citation pattern; one Optimizer ships with the spec embedded.

platform.openai.com / docs / guides / structured-outputs

6 days ago · + pattern

Microsoft Copilot grounding rules

Bing webmaster team published new IndexNow + Copilot grounding interplay docs. 4 new GEO learnings; sentries.bingbot updated.

about.ads.microsoft.com / copilot-grounding

7 days ago · - rule

Retired: meta description length

Three months of outcome data showed no CTR delta from 155-160 char descriptions. Learning archived; supporting chunks remain in the audit trail.

cortex.learnings.archived / 2025 / meta_desc_length

151,055 documents. One engine that reads them all.

Cortex queries the corpus on every audit, learns from every outcome, and refines the rules that come next. The decision engine you can read along with.
