Spectron makes LLM calls during extraction, query resolution, and reflection. Each Context can have a token budget and per-minute rate limits to control costs and prevent runaway usage.
Token tracking
Every LLM call is tracked in the decision_trace table. The token_cost field records the total tokens (input + output) consumed by each operation.
Querying token usage
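One way to inspect usage is to sum token_cost per stage for a single Context. The sketch below assumes decision_trace is reachable over SQL. Only the table name and the token_cost field come from this page; the context_id, stage, and created_at columns, and the in-memory SQLite stand-in, are hypothetical.

```python
import sqlite3

def tokens_by_stage(conn: sqlite3.Connection, context_id: str) -> dict:
    """Sum token_cost per stage for one Context (column names are assumed)."""
    rows = conn.execute(
        "SELECT stage, SUM(token_cost) FROM decision_trace "
        "WHERE context_id = ? GROUP BY stage ORDER BY stage",
        (context_id,),
    ).fetchall()
    return {stage: total for stage, total in rows}

def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in for the real table, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE decision_trace ("
        "context_id TEXT, stage TEXT, token_cost INTEGER, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO decision_trace VALUES (?, ?, ?, ?)",
        [
            ("ctx_1", "extraction", 450, "2024-05-01T10:00:00"),
            ("ctx_1", "extraction", 300, "2024-05-01T10:05:00"),
            ("ctx_1", "reflection", 2100, "2024-05-02T09:00:00"),
            ("ctx_2", "extraction", 999, "2024-05-02T09:30:00"),
        ],
    )
    return conn

usage = tokens_by_stage(demo_connection(), "ctx_1")
print(usage)  # {'extraction': 750, 'reflection': 2100}
```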
Token breakdown by stage
| Stage | Model used | Typical cost |
|---|---|---|
| Turn extraction (Stage 1) | models.extraction | 200–800 tokens per turn |
| Turn extraction (Stage 2) | models.extraction_strong | 500–2000 tokens per turn |
| Query understanding | models.query_understanding | 50–200 tokens per query |
| Response synthesis | models.response | 200–1500 tokens per query |
| Reflection | models.reflection | 500–5000 tokens per reflection |
| Embedding | models.embedding | ~100 tokens per chunk |
Stage 2 extraction only runs when Stage 1 confidence falls below the configured threshold. Setting the threshold lower reduces Stage 2 usage at the cost of lower extraction precision on complex turns.
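The escalation rule can be sketched as a single comparison. This is a minimal illustration, assuming Stage 1 reports a confidence between 0 and 1; the function name and the sample confidences are hypothetical.

```python
def needs_stage2(stage1_confidence: float, threshold: float) -> bool:
    """Return True when Stage 1 confidence falls below the threshold."""
    return stage1_confidence < threshold

# Hypothetical Stage 1 confidences for three turns.
confidences = [0.95, 0.85, 0.60]

for threshold in (0.9, 0.7):
    escalated = [c for c in confidences if needs_stage2(c, threshold)]
    print(threshold, escalated)
```

Under this rule, the 0.9 threshold escalates two of the three sample turns to Stage 2 and the 0.7 threshold escalates one.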
Setting a token limit
Set a monthly token limit per Context via the management API:
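A minimal sketch of such a request follows. The endpoint path, the PATCH method, the host, and the bearer-token header are assumptions made for illustration; only the token_limit field name comes from this page. The request is built but not sent.

```python
import json
import urllib.request

def build_token_limit_request(base_url, context_id, token_limit, api_key):
    """Build (but do not send) a request that sets a Context's token limit.

    The path, method, and auth header are hypothetical. Pass
    token_limit=None to send JSON null, which removes the limit.
    """
    body = json.dumps({"token_limit": token_limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/contexts/{context_id}",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = build_token_limit_request(
    "https://spectron.example", "ctx_1", 5_000_000, "API_KEY"
)
print(req.get_method(), req.full_url, req.data.decode())
```

Sending the request with `urllib.request.urlopen(req)` raises `urllib.error.HTTPError` on a 4xx response, so a caller can check for `err.code == 429` to detect an exhausted budget.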
When the limit is reached, new extraction and synthesis requests return 429 Too Many Requests. Read-only operations (direct attribute lookups, cache hits) that do not involve LLM calls are not blocked.
Set token_limit: null to remove the limit.
Monitoring approaching limits
Query the current month's usage relative to the limit:
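A sketch of the calculation, assuming usage rows have already been fetched from decision_trace as (timestamp, token_cost) pairs. The row shape and the function name are hypothetical.

```python
from datetime import datetime, timezone

def monthly_usage_ratio(trace_rows, token_limit, now=None):
    """Return month-to-date tokens divided by the limit.

    Returns None when token_limit is None (no limit configured).
    trace_rows is an iterable of (timestamp, token_cost) pairs.
    """
    if token_limit is None:
        return None
    now = now or datetime.now(timezone.utc)
    used = sum(
        cost
        for ts, cost in trace_rows
        if ts.year == now.year and ts.month == now.month
    )
    return used / token_limit

rows = [
    (datetime(2024, 4, 28, tzinfo=timezone.utc), 900_000),  # previous month
    (datetime(2024, 5, 3, tzinfo=timezone.utc), 400_000),
    (datetime(2024, 5, 10, tzinfo=timezone.utc), 350_000),
]
ratio = monthly_usage_ratio(
    rows, 1_000_000, now=datetime(2024, 5, 15, tzinfo=timezone.utc)
)
print(f"{ratio:.0%} of the monthly limit used")
```

An application could alert once the ratio crosses a chosen fraction, such as 0.8, before requests start returning 429.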
Rate limiting
Rate limits are enforced per Context at two levels:
| Limit | Default | Config field |
|---|---|---|
| Requests per minute | 600 | rate_limit.requests_per_minute |
| LLM tokens per minute | 100,000 | rate_limit.tokens_per_minute |
When the per-minute token limit is hit, new LLM-requiring requests return 429 until the window resets. Direct lookups and cache hits are not counted against the token rate limit.
Configure limits:
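The sketch below expresses the two fields as a nested mapping and pairs them with a simple fixed-window counter to illustrate the 429 behavior. The dict layout and the MinuteWindow class are assumptions; the actual configuration syntax and the windowing scheme Spectron uses are not specified on this page.

```python
import time

# Field names and defaults come from the table above; the layout is assumed.
config = {
    "rate_limit": {
        "requests_per_minute": 600,
        "tokens_per_minute": 100_000,
    }
}

class MinuteWindow:
    """Fixed one-minute window counter (illustrative only)."""

    def __init__(self, limit, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def try_consume(self, amount):
        """Record usage and return True if it fits; False models a 429."""
        now = self.clock()
        if now - self.window_start >= 60:
            # The window has elapsed, so start counting from zero again.
            self.window_start = now
            self.used = 0
        if self.used + amount > self.limit:
            return False
        self.used += amount
        return True

# A fake clock makes the demo deterministic.
fake_now = [0.0]
tokens = MinuteWindow(
    config["rate_limit"]["tokens_per_minute"], clock=lambda: fake_now[0]
)
first = tokens.try_consume(60_000)    # fits within 100,000
second = tokens.try_consume(50_000)   # would exceed the window: rejected
fake_now[0] = 61.0
third = tokens.try_consume(50_000)    # window has reset
print(first, second, third)
```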
Reducing costs
Cache tuning
The semantic response cache serves repeated or similar queries without LLM synthesis. Increase the cache TTL to reduce synthesis calls:
Lowering the similarity threshold (e.g. to 0.94) allows more queries to hit the cache, at the cost of cached responses that match the incoming query less closely.
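How the TTL and the similarity threshold interact can be sketched with a toy cache keyed on query embeddings. The class, the cosine-similarity comparison, and the 0.97 starting threshold are assumptions for illustration, not Spectron's actual cache implementation.

```python
import math
import time

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Toy cache: returns a stored response for sufficiently similar queries."""

    def __init__(self, ttl_seconds, similarity_threshold, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.similarity_threshold = similarity_threshold
        self.clock = clock
        self.entries = []  # list of (embedding, response, stored_at)

    def put(self, embedding, response):
        self.entries.append((embedding, response, self.clock()))

    def get(self, embedding):
        """Return the most similar unexpired response, or None on a miss."""
        now = self.clock()
        best, best_sim = None, self.similarity_threshold
        for emb, response, stored_at in self.entries:
            if now - stored_at > self.ttl_seconds:
                continue  # entry is older than the TTL
            sim = cosine(embedding, emb)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best

fake_now = [0.0]
strict = SemanticCache(3600, 0.97, clock=lambda: fake_now[0])
loose = SemanticCache(3600, 0.94, clock=lambda: fake_now[0])
for cache in (strict, loose):
    cache.put([1.0, 0.0], "cached answer")

query = [0.96, 0.28]  # cosine similarity to the stored entry is about 0.96
print(strict.get(query), loose.get(query))
```

With the stricter threshold the query misses and would trigger synthesis; with 0.94 it is served from the cache.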
Extraction stage tuning
Reduce Stage 2 extraction by lowering the confidence threshold below which Stage 2 is triggered. A value of 0.9 means Stage 2 runs whenever Stage 1 is less than 90% confident, so a lower value escalates fewer turns:
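The effect of the threshold on cost can be estimated from the per-turn ranges in the stage table. This sketch assumes that every turn pays for Stage 1 and that escalated turns pay for Stage 2 in addition; the sample confidences and the function name are hypothetical.

```python
# Token ranges per turn (low, high), taken from the stage table.
STAGE1_TOKENS = (200, 800)
STAGE2_TOKENS = (500, 2000)

def extraction_cost_bounds(confidences, threshold):
    """Return (low, high) total extraction tokens for a batch of turns.

    Assumes Stage 2 cost is added on top of Stage 1 for every turn whose
    Stage 1 confidence is below the threshold.
    """
    turns = len(confidences)
    escalated = sum(1 for c in confidences if c < threshold)
    low = turns * STAGE1_TOKENS[0] + escalated * STAGE2_TOKENS[0]
    high = turns * STAGE1_TOKENS[1] + escalated * STAGE2_TOKENS[1]
    return low, high

sample = [0.95, 0.88, 0.72, 0.65, 0.91]  # hypothetical Stage 1 confidences

print(extraction_cost_bounds(sample, 0.9))  # (2500, 10000)
print(extraction_cost_bounds(sample, 0.7))  # (1500, 6000)
```

For this sample, moving the threshold from 0.9 to 0.7 cuts the number of escalated turns from three to one.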
Model selection
Assign cheaper models to latency-sensitive stages:
Reserve the stronger (more expensive) model for reflection, where synthesis quality matters most and latency is less critical.
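A possible assignment is sketched below as a plain mapping. The keys mirror the models.* fields from the stage table; the model names are placeholders, and treating query understanding and response synthesis as the latency-sensitive stages is an assumption.

```python
# Placeholder model names; substitute the identifiers your provider uses.
CHEAP = "small-model"
STRONG = "large-model"

models = {
    "extraction": CHEAP,
    "extraction_strong": STRONG,
    "query_understanding": CHEAP,  # assumed latency-sensitive (query path)
    "response": CHEAP,             # assumed latency-sensitive (query path)
    "reflection": STRONG,          # quality matters most, latency less so
    "embedding": "embedding-model",
}

strong_stages = sorted(stage for stage, name in models.items() if name == STRONG)
print(strong_stages)  # ['extraction_strong', 'reflection']
```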
Retention policies
Context-category memories auto-expire after retention_days. Expiring stale context reduces the number of memory items retrieved and embedded at query time, lowering synthesis token usage:
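The expiry rule can be sketched as a filter over memory items. The dict shape with category and created_at keys is hypothetical; only retention_days and the context category come from this page.

```python
from datetime import datetime, timedelta, timezone

def unexpired(memories, retention_days, now=None):
    """Drop context-category memories older than retention_days.

    Memories in other categories are kept regardless of age.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [
        m for m in memories
        if m["category"] != "context" or m["created_at"] >= cutoff
    ]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
memories = [
    {"id": 1, "category": "context", "created_at": now - timedelta(days=45)},
    {"id": 2, "category": "context", "created_at": now - timedelta(days=5)},
    {"id": 3, "category": "preference", "created_at": now - timedelta(days=400)},
]
kept = [m["id"] for m in unexpired(memories, retention_days=30, now=now)]
print(kept)  # [2, 3]
```

Fewer surviving items means fewer candidates to retrieve and pass to synthesis for each query.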
Per-tenant billing in multi-tenant deployments
In shared-Context deployments, track per-tenant token consumption via scope:
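A sketch of the aggregation, assuming each decision_trace row carries a scope value that identifies the tenant. The row shape, tenant names, and budget figures are hypothetical.

```python
from collections import Counter

def tokens_per_tenant(trace_rows):
    """Sum token_cost per scope (assumed here to identify the tenant)."""
    totals = Counter()
    for row in trace_rows:
        totals[row["scope"]] += row["token_cost"]
    return dict(totals)

def over_budget(totals, budgets):
    """Return tenants whose usage exceeds their application-level budget."""
    return sorted(
        tenant
        for tenant, used in totals.items()
        if tenant in budgets and used > budgets[tenant]
    )

rows = [
    {"scope": "tenant_a", "token_cost": 1200},
    {"scope": "tenant_b", "token_cost": 300},
    {"scope": "tenant_a", "token_cost": 900},
]
totals = tokens_per_tenant(rows)
print(totals)  # {'tenant_a': 2100, 'tenant_b': 300}
print(over_budget(totals, {"tenant_a": 2000, "tenant_b": 2000}))  # ['tenant_a']
```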
Use this to implement per-tenant billing or to enforce per-tenant token budgets at the application layer.