One library. Every primary source that decides whether a brand gets shown by search engines, AI engines, and ad platforms.
- Index footprint: 13.2 GB
- Embedded passages: 99.98%
- Last write: 14 seconds ago
- Reranker: cohere-rerank-v3.5
The corpus covers every platform that gates visibility - from search engines to AI engines to ad platforms.
Not a generic web crawl. Not a scraped competitor index. A deliberate library of canonical platform documentation, technical references, and editorial coverage - curated so Cortex can answer with citations, not guesses.
Google, Microsoft, Shopify, WordPress, Meta, Amazon, MDN, WooCommerce, HubSpot, Klaviyo, Magento, BigCommerce - every platform that ranks or serves.
Moz, Search Engine Land, SEJ, Roundtable, Backlinko, Aleyda Solis + community signal from Reddit - independent voices that surface changes before official docs catch up.
Anthropic, OpenAI, Mistral, Cohere, Perplexity, You.com, xAI, Apple Intelligence, DeepSeek, Brave, Mojeek.
Schema.org, W3C, IETF, IAB, ICANN, IANA, HTTP Archive, Common Crawl - bodies whose specs the entire web operates against.
Scale without curation is noise. Every passage passes four gates before Cortex can use it.
A primary-source corpus only earns the name if you can trust each chunk in it. The Cortex ingest pipeline rejects fabricated content, reranks for authority, and quarantines anything that fails embedding QC.
Tier-weighted selection
Every URL is tagged with a tier before crawl. Tier 1: platform-owned policy and API specs. Tier 2: official partner docs. Tier 3: curated editorial. Tier 4: community signal (used as context, never as citation).
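A minimal sketch of how the tier rule might be enforced at answer time, assuming each passage carries its tier as a tag. The names `Tier`, `is_citable`, and `split_for_answer` are hypothetical and not part of Cortex.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The four tiers described above (member names are illustrative)."""
    PLATFORM_SPEC = 1  # platform-owned policy and API specs
    PARTNER_DOCS = 2   # official partner docs
    EDITORIAL = 3      # curated editorial
    COMMUNITY = 4      # community signal


def is_citable(tier: Tier) -> bool:
    """Tier 4 may inform an answer as context but is never cited."""
    return tier != Tier.COMMUNITY


def split_for_answer(passages):
    """Separate (tier, text) pairs into citable sources and context-only material."""
    citations = [p for p in passages if is_citable(p[0])]
    context = [p for p in passages if not is_citable(p[0])]
    return citations, context
```

Called with one Tier 1 passage and one Tier 4 passage, `split_for_answer` returns the first as a citation and the second as context only.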
300-token passages, with overlap
Each document is split with semantic boundaries preserved. Headers, lists, code blocks held intact. Overlap maintains continuity so a single answer never depends on a half-sentence chunk.
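The overlap idea can be illustrated with a plain sliding window. This sketch assumes the document is already tokenized and uses a 50-token overlap, which is a placeholder: the page does not state the overlap size. It also ignores the semantic-boundary handling described above, so treat it as the simplest possible baseline rather than the actual splitter.

```python
def chunk_tokens(tokens, size=300, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

With 700 tokens and the defaults, this yields three passages (300, 300, and 200 tokens), and each passage begins with the last 50 tokens of the one before it.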
voyage-3-large, halfvec(3072)
Dense embeddings generated by voyage-3-large, stored as halfvec (16-bit floats) to keep recall sharp while halving the vector footprint compared with 32-bit storage. Every chunk is also indexed by Postgres BM25 for lexical match.
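The halving follows from the element width alone. As a rough illustration, the sketch below packs one 3072-dimension vector as 32-bit and as 16-bit floats using Python's standard `struct` module. The vector values are placeholders, and per-row storage overhead in the database is ignored.

```python
import struct

DIMS = 3072  # dimensionality named in halfvec(3072)

vector = [0.125] * DIMS  # placeholder values, not a real embedding

as_float32 = struct.pack(f"<{DIMS}f", *vector)  # 4 bytes per dimension
as_float16 = struct.pack(f"<{DIMS}e", *vector)  # 2 bytes per dimension

print(len(as_float32), len(as_float16))  # prints: 12288 6144
```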
cohere-rerank-v3.5
BM25 and dense candidates fused with Reciprocal Rank Fusion, then reranked by Cohere rerank-v3.5. The model only sees the top 15 passages - quality over breadth at retrieval time.
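For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the fusion step. Each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the conventional default and is an assumption here, since the page does not state the value used. The function name `rrf_fuse` is hypothetical, and the rerank call itself is left out.

```python
def rrf_fuse(rankings, k=60, top_n=15):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


bm25_hits = ["a", "b", "c"]   # lexical candidates, best first
dense_hits = ["b", "d", "a"]  # embedding candidates, best first
fused = rrf_fuse([bm25_hits, dense_hits])
```

Here `fused` is `["b", "a", "d", "c"]`: the two documents found by both retrievers rise above those found by only one. A reranker would then rescore this shortlist.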
“The worst-performing 10-30% of origins are classified as 'poor'. Mobile and desktop usage typically have very different characteristics.”
“At least one of: review, aggregateRating, or offers is required. The offers property must contain a Price or PriceSpecification with priceCurrency in ISO 4217.”
“Shopify automatically generates a robots.txt file. You can edit it using the robots.txt.liquid template, but doing so can affect how search engines crawl your store.”
And then Cortex learns from it - not just on launch, but every day the corpus changes.
A corpus that is queried is just a search box. A corpus that is studied becomes a memory layer. Cortex runs a five-stage loop on the chunks: ingest, distill, audit, act, refine. Outcomes feed back. Weak rules get pruned. The library does not just stay current - it gets better at being read.
Ingest - crawl + chunk
The daily crawler hits every monitored source on its freshness schedule. New documents are split into 300-token passages, with BM25 entries and voyage-3-large embeddings written in one pass.
1,200 new docs / day
Distill - mine the rules
Learnings layer reads new chunks and writes structured rules, benchmarks, and patterns into a table separate from the raw passages. Each learning cites at least one chunk.
30 new learnings / day
Audit - sentries query
Sentries pull pre-distilled rules from the learnings layer first, then attach supporting chunks for citation. Verdicts arrive in seconds with primary sources attached.
cite + verdict
Act - optimizers ship
Optimizers convert verdicts into concrete edits: schema, copy, redirects, ad changes. Every action records what it changed, why, and where the rule came from.
trace + revert path
Refine - outcomes feed back
Optimizer outcomes tell the learnings layer which rules survived contact with reality. Sharp rules stay. Weak ones get pruned. The next cycle starts smarter than the last.
self-correcting
The library announces what changed. Nothing learned in silence.
Google tightens INP threshold
Search Central guidance moved 'Good' INP from 200ms to 175ms. Cortex updated cwv-inp learning; 14 active audits reference the new threshold.
Schema.org Product v28 ships
New 'hasCertification' property and revised priceValidUntil semantics. Cortex re-distilled 7 product-schema rules; 3 changed thresholds.
PMax CPA benchmark recalibrated
Cross-account pass on 142 Cortex-managed PMax accounts dropped the apparel-vertical CPA benchmark to $47.12 (from $52.40).
OpenAI structured outputs guide
New JSON-schema enforcement docs. Two LLM-output Sentries adopted the citation pattern; one Optimizer ships with the spec embedded.
Microsoft Copilot grounding rules
Bing webmaster team published new IndexNow + Copilot grounding interplay docs. 4 new GEO learnings; sentries.bingbot updated.
Retired: meta description length
Three months of outcome data showed no CTR delta from 155-160 char descriptions. Learning archived; supporting chunks remain in the audit trail.
151,055 sources. One engine that reads them all.
Cortex queries the corpus on every audit, learns from every outcome, and refines the rules that come next. The decision engine you can read along with.