Spectron combines two statistical retrieval mechanisms – RAKE keyword extraction and BM25 full-text search – to complement dense vector retrieval. Together, they ensure that exact terms, product identifiers, technical phrases, and domain vocabulary are reliably findable even when their embedding representation is weak.
RAKE keyword extraction
During document ingestion, Spectron applies RAKE (Rapid Automatic Keyword Extraction) to each document. RAKE is a statistical, language-agnostic algorithm that identifies multi-word keyphrases by exploiting word co-occurrence and word frequency without requiring a language model or training data.
Spectron extracts up to five keyphrases per document, keeping the highest-scoring phrases with a RAKE score of 0.8 or above. These keyphrases are stored as keyword nodes in the knowledge graph and linked to their source documents via knowledge_has_keyword edges. A co-occurrence edge (keyword_cooccurs_with) is also created between keywords that appear in the same document, weighted by pointwise mutual information (PMI).
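The extraction and selection steps described above can be sketched in a few lines of Python. This is a minimal illustration of the textbook RAKE scoring rule (sum of degree/frequency over a phrase's words), not Spectron's implementation: the stopword list, the function names, and the use of raw, unnormalised scores against the 0.8 threshold are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); a real pipeline would use a full one.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "in", "is", "it", "of", "on", "or", "the", "to", "with", "within"}

def candidate_phrases(text):
    """Split text into candidate keyphrases at stopwords and punctuation."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each candidate phrase as the sum of degree(w) / frequency(w)."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences inside the phrase, including the word itself.
            degree[word] += len(phrase)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}

def top_keyphrases(text, top_k=5, min_score=0.8):
    """Keep at most top_k phrases whose score meets the threshold (hypothetical helper)."""
    ranked = sorted(rake_scores(text).items(), key=lambda kv: (-kv[1], kv[0]))
    return [(phrase, score) for phrase, score in ranked if score >= min_score][:top_k]
```

Longer phrases built from rarely repeated words score highest, which is why RAKE tends to surface multi-word technical terms rather than single common words.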
Because RAKE runs without an LLM, it adds negligible latency to the ingestion pipeline and incurs no token cost.
Why keyword extraction matters
Embedding models represent meaning holistically. A query for "MLWK3LL/A" (a product SKU) may produce a low-similarity result even against a document that contains that exact string, because the embedding model has not learned to weight arbitrary identifier strings meaningfully. BM25 and keyword graph lookups have no such blind spot – they operate on token identity, not semantic proximity.
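To make the token-identity point concrete, here is a minimal Okapi BM25 scorer in Python. It is a sketch of the standard formula only: the parameter values k1=1.5 and b=0.75 are common defaults rather than values documented for Spectron, and the pre-tokenised sample documents are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query tokens."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Invented sample corpus, already lowercased and split on whitespace.
docs = [
    "the mlwk3ll/a ships with a usb-c cable".split(),
    "our headphones ship with a charging case".split(),
    "returns are accepted within thirty days".split(),
]
sku_scores = bm25_scores(["mlwk3ll/a"], docs)
```

Only the first document receives a non-zero score, because scoring depends solely on whether the exact token occurs, regardless of how an embedding model would represent the string.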
Keywords extracted by RAKE also serve as seeds for hybrid_graph retrieval. When a query matches a keyword node, the retrieval pass can expand to all documents linked to that keyword before the final scoring step.
Keyword endpoints
Keywords for a document
Retrieve the keyphrases extracted from a specific document:
List keywords across the corpus
List all keywords in the Context, optionally filtered by minimum document count:
Get a keyword by text
Vector search over keywords
Find keywords semantically similar to a query string. This uses the keyword text embeddings rather than the chunk embeddings, making it useful for building keyword-expansion pipelines:
Co-occurrence neighbours
Retrieve keywords that frequently co-occur with a given keyword across the document corpus. Co-occurrence is weighted by PMI: pairs that appear together more than their individual frequencies would predict receive higher scores.
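As a rough illustration of the weighting, the sketch below computes document-level PMI for keyword pairs in Python. The function name, the input shape (one keyword list per document), and the base-2 logarithm are assumptions for the example; the text above does not specify how Spectron estimates the probabilities.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(doc_keywords):
    """Compute PMI for each keyword pair that shares at least one document."""
    n_docs = len(doc_keywords)
    keyword_df = Counter()  # documents containing each keyword
    pair_df = Counter()     # documents containing both keywords of a pair
    for keywords in doc_keywords:
        unique = sorted(set(keywords))
        keyword_df.update(unique)
        pair_df.update(combinations(unique, 2))
    edges = {}
    for (a, b), joint in pair_df.items():
        p_ab = joint / n_docs
        p_a = keyword_df[a] / n_docs
        p_b = keyword_df[b] / n_docs
        # Positive when the pair co-occurs more often than independence predicts.
        edges[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return edges

# Invented sample: keywords extracted from four documents.
sample = [
    ["return policy", "refund"],
    ["return policy", "refund"],
    ["return policy", "shipping"],
    ["shipping", "warranty"],
]
edges = pmi_edges(sample)
```

In this sample, "return policy" and "refund" get a positive PMI, while "return policy" and "shipping" get a negative one even though they share a document, because "return policy" is frequent enough that a single co-occurrence is less than chance would predict.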
BM25 full-text search
Spectron indexes all chunk text in a full-text index using the spectron_analyzer. The analyser applies two tokenisation passes:
Blank tokeniser: splits on whitespace only, preserving punctuation within tokens (useful for product codes like MLWK3LL/A)
Class tokeniser: splits on character class boundaries (letter/digit transitions), which separates alphanumeric identifiers without requiring spaces
Both passes feed through two token filters:
Lowercase: case-normalisation for consistent matching
Snowball: English stemming (e.g. returning → return, purchases → purchas)
This combination means that BM25 matches both whole identifiers such as MLWK3LL/A (via the blank tokeniser) and inflected natural-language terms (via the class tokeniser and Snowball). Because both passes are lowercased, matching is case-insensitive.
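The two tokenisation passes and the lowercase filter can be approximated in plain Python as follows. This is a sketch under assumptions, not the spectron_analyzer itself: the function names are hypothetical, the class tokeniser here drops punctuation rather than emitting it as tokens, and stemming is left as a pluggable callable (identity by default) so that any Snowball implementation could be supplied.

```python
import re

def blank_tokenise(text):
    """Split on whitespace only, so punctuation stays inside tokens."""
    return text.split()

def class_tokenise(text):
    """Split wherever the character class changes between letters and digits."""
    # Assumption: punctuation acts as a separator and is discarded.
    return re.findall(r"[^\W\d_]+|\d+", text)

def analyse(text, stem=lambda token: token):
    """Run both passes, then lowercase and stem every token."""
    tokens = blank_tokenise(text) + class_tokenise(text)
    return [stem(token.lower()) for token in tokens]

tokens = analyse("Returning MLWK3LL/A")
# The blank pass keeps "mlwk3ll/a" whole; the class pass yields "mlwk", "3", "ll", "a".
```

Indexing the output of both passes is what lets a query match either the full identifier or one of its alphanumeric fragments.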
BM25 is invoked automatically when you use mode="bm25" or mode="hybrid" in the query endpoint. In hybrid mode, the BM25 ranking is combined with the vector ranking through reciprocal rank fusion (RRF).
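For readers unfamiliar with reciprocal rank fusion (RRF), the following Python sketch shows the usual formulation, in which each ranked list contributes 1 / (k + rank) to an item's fused score. The constant k=60 is the value commonly used in the literature, not a documented Spectron setting, and the chunk IDs are invented.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists; each list adds 1 / (k + rank) to an item's score."""
    fused = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            fused[item] = fused.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented rankings from a vector pass and a BM25 pass.
vector_ranking = ["chunk_a", "chunk_b", "chunk_c"]
bm25_ranking = ["chunk_c", "chunk_a", "chunk_d"]
fused = rrf_fuse([vector_ranking, bm25_ranking])
```

Because RRF works on rank positions rather than raw scores, the BM25 and vector scores never need to be normalised onto a common scale. Items that rank well in both lists (chunk_a and chunk_c here) rise above items that appear in only one.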
Keyword graph in hybrid_graph retrieval
When mode="hybrid_graph" is used, keywords serve as intermediate nodes in the graph-density rerank. The retrieval pipeline:
Runs hybrid (vector + BM25) to produce initial candidates.
Embeds the query and finds the top-k matching keyword nodes.
Expands from those keyword nodes through knowledge_has_keyword edges to include additional chunks that share relevant keywords.
Boosts the score of candidates that are reachable through high-PMI co-occurrence chains from the matched keywords.
This means a query containing "returns" will boost any chunk connected to the keywords RETURN POLICY, RETURN WINDOW, and REFUND – even if the chunk's embedding or term frequency score would not have ranked it highly on its own.
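The expansion and boost steps above can be sketched over plain dictionaries. Everything in this Python example is an assumption made for illustration: the function name, the one-hop traversal, the min_pmi cutoff, and the additive boost are stand-ins, since the exact rerank formula is not described here. Keyword matching is taken as already done.

```python
def graph_rerank(candidates, matched_keywords, keyword_to_chunks, cooccurrence,
                 min_pmi=1.0, boost=0.1):
    """Expand candidates through keyword links, then boost via high-PMI neighbours."""
    scores = dict(candidates)
    # Expansion: add chunks linked to a matched keyword (knowledge_has_keyword).
    for keyword in matched_keywords:
        for chunk in keyword_to_chunks.get(keyword, ()):
            scores.setdefault(chunk, 0.0)
    # Boost: reward candidates linked to strongly co-occurring keywords.
    for keyword in matched_keywords:
        for neighbour, pmi in cooccurrence.get(keyword, {}).items():
            if pmi < min_pmi:
                continue  # weak co-occurrence edges are ignored
            for chunk in keyword_to_chunks.get(neighbour, ()):
                if chunk in scores:
                    scores[chunk] += boost * pmi
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented data: hybrid scores, keyword links, and PMI-weighted co-occurrence edges.
candidates = {"chunk_1": 0.030, "chunk_2": 0.028}
keyword_to_chunks = {
    "return policy": {"chunk_1", "chunk_3"},
    "refund": {"chunk_2", "chunk_3"},
    "shipping": {"chunk_1"},
}
cooccurrence = {"return policy": {"refund": 1.5, "shipping": 0.2}}
reranked = graph_rerank(candidates, ["return policy"], keyword_to_chunks, cooccurrence)
```

In this toy run, chunk_3 enters the candidate set only through the keyword link, and both chunk_2 and chunk_3 overtake chunk_1 because they are connected to "refund", a strong co-occurrence neighbour of the matched keyword.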
When to use keyword versus vector search
| Query type | Recommended mode |
|---|---|
| Product SKU, model number, API key fragment | bm25 |
| Exact phrase from a document | bm25 |
| Natural-language question (paraphrase likely) | vector or hybrid |
| Mixed: named entities + prose context | hybrid |
| Graph-connected topic exploration | hybrid_graph |
| Keyword discovery and corpus analysis | Keyword endpoints |
In practice, hybrid or hybrid_graph is the right default for most production queries. Use bm25 directly only when you have a strong reason to believe exact-match precision matters more than recall – for example, in lookup-style queries where the answer is a specific document known to contain an exact string.