The chat endpoint lets Spectron manage the entire LLM conversation loop in a single call. Rather than recording a turn, retrieving context, calling your LLM, and recording the assistant's reply separately, you hand all of that to Spectron and receive the model's reply along with a memory diff in one response.
Two integration patterns
Spectron supports two ways to integrate memory into a conversation:
| Pattern | Use when |
|---|---|
| Spectron drives the loop (session.chat()) | You want minimal integration surface. Spectron handles context retrieval, model invocation, and memory persistence. |
| Caller drives the loop (session.turn() + session.context()) | You need control over the model call – custom prompts, streaming, tool use, multi-step chains, or your own model infrastructure. |
The two patterns are not mutually exclusive. A single session can use chat for simple turns and turn + context for turns that require tool use or custom prompting.
What Spectron does in chat
When you call session.chat(), Spectron executes the following steps:
1. Appends the user turn – stores the message with `role: "user"` and runs the extraction pipeline (identical to a manual `session.turn()` call).
2. Loads the session transcript window – the last 10 user/assistant turns (excluding the turn just written), capped at 8k characters of the newest content. Rendered as a sanitised `# Conversation so far` block inside the system prompt (one line per turn, inside the existing untrusted-data framing).
3. Retrieves context – queries the memory layer for relevant facts, applying the tiered retrieval strategy (direct scope → hybrid → full context) and formats them into an injection-ready string.
4. Calls the response model – invokes the configured `response` model with retrieved context, the transcript window (when non-empty), and the current question.
5. Persists the assistant turn – stores the model's reply as an `assistant` turn and runs extraction on it.
6. Returns the reply text and the combined memory diff (entities, attributes, relations, and corrections produced by both the user and assistant turns).
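The transcript-window step can be sketched in a few lines. This is an illustrative approximation, not Spectron's implementation: the `Turn` type and both function names are hypothetical, "8k" is assumed to mean 8000 characters, and the cap is assumed to drop whole older turns rather than truncate mid-turn.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


def transcript_window(turns, max_turns=10, max_chars=8000):
    """Select prior turns for the '# Conversation so far' block.

    `turns` is the session history excluding the turn just written.
    Assumption: the character cap drops whole older turns.
    """
    recent = turns[-max_turns:] if max_turns > 0 else []
    kept, used = [], 0
    # Walk newest-to-oldest so the newest content survives the cap.
    for turn in reversed(recent):
        if used + len(turn.content) > max_chars:
            break
        kept.append(turn)
        used += len(turn.content)
    kept.reverse()  # restore chronological order
    return kept


def render_window(window):
    """Render one line per turn; collapsing whitespace stands in for sanitisation."""
    if not window:
        return ""
    lines = ["# Conversation so far"]
    for turn in window:
        lines.append(f"{turn.role}: {' '.join(turn.content.split())}")
    return "\n".join(lines)
```

Collapsing each turn onto one line keeps a multi-line message from breaking out of the one-line-per-turn layout; real sanitisation is likely stricter than this.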
When the transcript window is non-empty, tier-2 response reuse is bypassed: the cache is keyed by query embedding and scope, not session, so transcript-dependent replies must always be freshly synthesised. First-turn and sessionless calls remain cache-eligible.
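To see why the bypass is needed, consider a sketch of a cache key built only from the query embedding and scope. Both functions below are hypothetical illustrations of the rule, not Spectron internals, and the rounding is just one way to make a float vector hashable.

```python
def cache_key(query_embedding, scope):
    """Build a reuse key from the query embedding and scope only.

    The session is deliberately absent, mirroring the keying described above.
    """
    # Rounding makes the floats hashable; it is illustrative, not Spectron's method.
    return (tuple(round(x, 4) for x in query_embedding), scope)


def is_cache_eligible(transcript_window):
    """A cached reply may be reused only when no prior turns shaped it."""
    return len(transcript_window) == 0


# Two different sessions asking the same question in the same scope share a key,
# so a reply that depended on one session's transcript must not be served to the other.
key_a = cache_key([0.12, 0.98], "team-a")
key_b = cache_key([0.12, 0.98], "team-a")
same_key = key_a == key_b
```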
Multi-turn chat pins a session across messages. Under scope enforcement, each turn's session is scope-tagged to the caller's write region so later messages in the same session authorise correctly.
The chat endpoint
Response:
reply
The plain-text response from the configured response model. This is ready to display directly in your UI.
memory_updates
The combined extraction diff from both the user turn and the assistant turn. The shape matches the extraction result from session.turn(). Use this to update UI state or trigger any downstream logic that depends on changes to the memory graph.
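As a rough illustration of consuming the response, the sketch below counts changes per category. It assumes the response is a dict with `reply` and `memory_updates` keys, and that `memory_updates` maps each category to a list; the category key names are inferred from the diff description and the example payload is invented for illustration.

```python
def summarise_updates(response):
    """Count changes per category in a chat response.

    Assumes `memory_updates` maps category names to lists; the exact keys
    used by Spectron may differ from the ones guessed here.
    """
    updates = response.get("memory_updates") or {}
    categories = ("entities", "attributes", "relations", "corrections")
    return {name: len(updates.get(name, [])) for name in categories}


# Hypothetical payload, for illustration only.
example = {
    "reply": "Noted, I'll remember that you moved to Lisbon.",
    "memory_updates": {
        "entities": [{"name": "Lisbon"}],
        "attributes": [],
        "relations": [{"from": "user", "type": "lives_in", "to": "Lisbon"}],
        "corrections": [],
    },
}

counts = summarise_updates(example)
changed = any(counts.values())  # True when the turn touched the memory graph
```

A check like `changed` is one way to decide whether a UI panel showing remembered facts needs to refresh after a turn.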
Python SDK
JavaScript SDK
Full loop example
The following pattern shows a minimal chat loop using session.chat():
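A minimal sketch of such a loop, assuming `session.chat()` takes the user message and returns a dict with `reply` and `memory_updates`. `StubSession` is a hypothetical stand-in so the example runs without the SDK; the real session class, its constructor, and its return type may differ.

```python
class StubSession:
    """Hypothetical stand-in for a Spectron session; not the real SDK class."""

    def __init__(self):
        self.turns = []

    def chat(self, message):
        # A real session would run extraction, retrieval, and the response model here.
        self.turns.append(("user", message))
        reply = f"(stub reply to: {message})"
        self.turns.append(("assistant", reply))
        return {"reply": reply, "memory_updates": {}}


def run_chat_loop(session, messages, on_reply=print):
    """Send each message through session.chat() and collect the memory diffs."""
    diffs = []
    for message in messages:
        result = session.chat(message)
        on_reply(result["reply"])            # display the reply
        diffs.append(result["memory_updates"])  # keep the diff for downstream logic
    return diffs


diffs = run_chat_loop(StubSession(), ["Hi, I'm Ada.", "What's my name?"])
```

Reusing the same session object across iterations is what lets later messages see the earlier turns.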
Configuring the response model
The model used for response generation is configured at the context level in config.models.response. This is separate from the extraction model (config.models.extraction) used to parse turns into structured facts.
You may use any provider supported by your Spectron deployment. The extraction and response models can differ – a faster, cheaper model is often suitable for extraction while a more capable model handles final response generation.
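A configuration that splits the two roles might look roughly like the following. Only the `models.extraction` and `models.response` keys come from the description above; the model identifiers are placeholders and `check_models` is a hypothetical helper, not part of Spectron.

```python
# Hypothetical context configuration; the model identifiers are placeholders.
config = {
    "models": {
        "extraction": "fast-small-model",   # parses turns into structured facts
        "response": "capable-large-model",  # generates the chat reply
    }
}


def check_models(cfg):
    """Return the names of any model roles missing from the config."""
    models = cfg.get("models", {})
    return [role for role in ("extraction", "response") if not models.get(role)]


missing = check_models(config)  # an empty list means both roles are set
```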
See Models and providers for the full list of supported providers and how to configure API keys.
When to use session.turn() instead
Use session.turn() plus your own model call when you need:
- Streaming responses – `session.chat()` waits for the full completion before returning. If you require token-level streaming to your UI, drive the LLM loop yourself.
- Tool use – if your agent needs to call external tools mid-conversation, you need control over the turn-taking loop to interleave tool calls and results.
- Custom prompting – if you maintain your own system prompt structure, persona configuration, or prompt template that cannot be expressed through Spectron's context injection format.
- Multiple models per turn – if your architecture uses a router, ensemble, or chain of models.
- Explicit context control – if you want to inspect or modify the retrieved context string before it reaches the model.
The session.context() method retrieves the formatted context string that Spectron would have injected, so you can replicate the retrieval step in the manual pattern. See Retrieving context for details.
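The caller-driven pattern could be sketched as follows. The signatures `turn(role, content)` and `context(query)` are assumptions made for illustration, `StubSession` is a hypothetical stand-in for the real session, and `echo_model` is a placeholder for your own LLM client.

```python
class StubSession:
    """Hypothetical stand-in for a Spectron session; not the real SDK class."""

    def __init__(self):
        self.turns = []

    def turn(self, role, content):
        # A real session would run the extraction pipeline and return its diff.
        self.turns.append((role, content))
        return {}

    def context(self, query):
        # A real session would return retrieved facts as an injection-ready string.
        return "Known facts: (none in this stub)"


def manual_turn(session, user_message, call_model):
    """One caller-driven turn: record, retrieve, call your model, record the reply."""
    session.turn("user", user_message)
    context = session.context(user_message)
    # The context string can be inspected or edited here before the model sees it.
    prompt = f"{context}\n\nUser: {user_message}"
    reply = call_model(prompt)  # your own model call: streaming, tools, routing
    session.turn("assistant", reply)
    return reply


def echo_model(prompt):
    # Placeholder for a real LLM client: echoes the last prompt line.
    return "echo: " + prompt.splitlines()[-1]


session = StubSession()
reply = manual_turn(session, "Where do I live?", echo_model)
```

Recording the assistant reply with a second `turn` call mirrors what `session.chat()` does on your behalf, so memory stays consistent across both patterns.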