Spectron's processing pipeline is divided into six distinct stages, each with a different latency profile, quality requirement, and cost envelope. Because these requirements differ significantly, Spectron allows you to assign a separate model to each stage rather than applying a single model uniformly.
This lets you keep latency-critical paths on fast, inexpensive models while routing quality-sensitive background work to larger, more capable ones.
The six stages
Extraction
The extraction stage runs synchronously on every turn. When your agent submits a conversation turn, Spectron immediately classifies the content and extracts structured facts – entities, attributes, relations, instructions, and uncertainties – before returning the result to your application. Because this sits in the critical path between your user's message and your agent's response, latency matters more here than anywhere else in the pipeline.
Query understanding
Query understanding runs when a recall or search request arrives. Spectron classifies the incoming query – determining whether it is seeking a fact, a document chunk, an instruction, or a broader contextual summary – and dispatches it to the appropriate retrieval tier. This stage is also synchronous and latency-sensitive.
Response
When Spectron's /recall endpoint generates a natural-language answer from retrieved context rather than returning raw records, the response stage handles that synthesis. It receives the retrieved facts and assembles them into a coherent reply. Response generation can be synchronous or buffered depending on how your application consumes the recall endpoint.
Reflection
Reflection is the reconciliation pass that runs in the background after extraction. It compares new facts against existing memory, resolves contradictions, merges duplicates, and synthesises higher-order insights from accumulated context. Because reflection is not in the user-facing critical path, quality matters more than speed here. A larger, more reasoning-capable model typically produces better memory consolidation.
Background
Background covers auxiliary reconciliation sweeps – importance decay, retention-policy enforcement, lifecycle expiry, and index maintenance – that run on a schedule rather than per-request. These jobs are low-priority and tolerate higher latency.
Embedding
The embedding stage converts text into dense vectors for similarity search. Spectron uses these vectors for the semantic response cache, HNSW chunk retrieval, and query expansion. The embedding model must match the dimensionality expected by the index. Changing this setting after initial ingestion requires re-embedding all stored content.
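The dimensionality constraint can be checked before a model change is attempted. The following is a minimal sketch, not Spectron's own validation logic: the helper name is_compatible and the lookup table are hypothetical, and the dimensions listed are the default output sizes of the named OpenAI embedding models.

```python
# Default output dimensions of some OpenAI embedding models.
# The table and helper below are illustrative, not part of Spectron.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def is_compatible(model: str, index_dimension: int) -> bool:
    """Return True if the model's default vector size matches the index."""
    dimension = DEFAULT_DIMENSIONS.get(model)
    if dimension is None:
        raise ValueError(f"unknown embedding model: {model!r}")
    return dimension == index_dimension
```

A check like this only catches a size mismatch. Per the paragraph above, changing the embedding setting after ingestion still requires re-embedding stored content.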
Configuring models per stage
Model assignment is per-Context. A single Spectron deployment can therefore serve multiple products with different model configurations.
You may update any subset of stages. Omitting a key leaves the current assignment unchanged.
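The partial-update behaviour can be modelled in a few lines. This is a sketch of the semantics only, assuming the stage keys match the names used in the trade-off table. The function apply_model_update is hypothetical and does not call the Spectron API.

```python
# Stage keys as they appear in the trade-off table.
STAGES = (
    "extraction",
    "query_understanding",
    "response",
    "reflection",
    "background",
    "embedding",
)


def apply_model_update(current: dict[str, str], update: dict[str, str]) -> dict[str, str]:
    """Merge a partial update into the current per-stage assignment.

    Stages missing from `update` keep their current model.
    """
    unknown = set(update) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    merged = dict(current)  # copy so the caller's dict is not mutated
    merged.update(update)
    return merged
```

For example, updating only reflection leaves the extraction assignment as it was.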
Supported providers
Spectron supports two LLM providers: OpenAI and Anthropic. Provider keys are stored per-Context using the provider configuration endpoint.
Provider keys are write-only on the API. Reading the context configuration returns a providers_configured array rather than the keys themselves.
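One way to picture the write-only behaviour is as a redaction step applied before a configuration is returned. The sketch below is illustrative: providers_configured comes from the text above, while the stored field provider_keys, the models field, and the function public_view are assumptions made for the example.

```python
def public_view(stored: dict) -> dict:
    """Build a read-safe copy of a context configuration.

    Key values are dropped; only the names of the providers that
    have a key are exposed, under `providers_configured`.
    """
    # `provider_keys` is a hypothetical internal field name.
    view = {k: v for k, v in stored.items() if k != "provider_keys"}
    view["providers_configured"] = sorted(stored.get("provider_keys", {}))
    return view


stored = {
    "models": {"extraction": "gpt-4o-mini"},
    "provider_keys": {"openai": "sk-example", "anthropic": "sk-ant-example"},
}
view = public_view(stored)
```

Here view contains the model assignment and the list of provider names, but none of the key values.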
To use an Anthropic model for a stage, reference it by its model identifier.
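As a sketch, a mixed-provider assignment might look like the mapping below, with a small pre-flight check that every referenced provider has a key configured. The prefix-based provider_for heuristic and the missing_providers helper are hypothetical and not part of Spectron. claude-3-5-sonnet-20241022 is an Anthropic model identifier used purely as an example.

```python
def provider_for(model_id: str) -> str:
    """Guess the provider from the identifier prefix (illustrative heuristic)."""
    if model_id.startswith("claude-"):
        return "anthropic"
    if model_id.startswith(("gpt-", "text-embedding-")):
        return "openai"
    raise ValueError(f"cannot infer provider for {model_id!r}")


def missing_providers(models: dict[str, str], providers_configured: list[str]) -> set[str]:
    """Return providers that the assignment needs but that have no key."""
    needed = {provider_for(model_id) for model_id in models.values()}
    return needed - set(providers_configured)


# Fast OpenAI model on the critical path, Anthropic model for reflection.
models = {
    "extraction": "gpt-4o-mini",
    "reflection": "claude-3-5-sonnet-20241022",
}
```

Calling missing_providers(models, ["openai"]) would report that an Anthropic key still needs to be stored for the Context.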
Latency versus quality trade-offs
The table below summarises the trade-off profile for each stage and a reasonable default strategy:
| Stage | Synchronous | Latency sensitivity | Quality sensitivity | Default |
|---|---|---|---|---|
| extraction | Yes | High | Medium | Fast model (e.g. gpt-4o-mini) |
| query_understanding | Yes | High | Medium | Fast model |
| response | Yes | Medium | Medium | Fast model |
| reflection | No | Low | High | Capable model (e.g. gpt-4o) |
| background | No | Low | Low | Fast model |
| embedding | Both | Medium | Fixed | Embedding model (e.g. text-embedding-3-small) |
A common pattern is to keep extraction, query understanding, and response on the fastest available model for a given provider, and to use a larger model only for reflection – where consolidation quality has a lasting effect on the accuracy of all future recalls.
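The default strategy in the table can be written down as a small helper that takes one fast model, one capable model, and one embedding model. This is a sketch: default_assignment is a hypothetical name, and the resulting dictionary is only an assumed shape for a per-stage configuration.

```python
def default_assignment(fast: str, capable: str, embedding: str) -> dict[str, str]:
    """Map each stage to a model following the table's default strategy."""
    return {
        "extraction": fast,           # synchronous, latency-critical
        "query_understanding": fast,  # synchronous, latency-critical
        "response": fast,
        "reflection": capable,        # background, quality-sensitive
        "background": fast,           # low priority
        "embedding": embedding,
    }


assignment = default_assignment("gpt-4o-mini", "gpt-4o", "text-embedding-3-small")
```

Only reflection ends up on the larger model. The other LLM stages share the fast one.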
Reading the current configuration
Changing the embedding model
The embedding model determines the vector dimensionality stored in the HNSW index. text-embedding-3-small produces 1536-dimensional vectors. If you switch to a model with a different output dimension, you must re-embed all stored knowledge chunks before the new model can be used for retrieval.
Re-indexing is asynchronous. Poll memory.knowledge.reindex_status() until status is complete before issuing queries against the new index.
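The polling step can be wrapped in a small helper with a timeout. The sketch below makes no assumption about the shape of the value returned by memory.knowledge.reindex_status(): it takes any zero-argument callable that returns the status string. The helper name wait_for_reindex and its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_for_reindex(
    get_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until get_status() returns "complete" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == "complete":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("re-indexing did not complete in time")
        sleep(poll_interval)  # wait before polling again
```

If the status call returns an object with a status attribute (an assumption), the helper could be invoked as wait_for_reindex(lambda: memory.knowledge.reindex_status().status) before any query is sent to the new index.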