Uploading documents

The knowledge layer holds authoritative material – manuals, policies, product data, and files your agents should treat as curated sources. Documents enter through an asynchronous upload pipeline: bytes land in object storage, then Spectron extracts, chunks, and embeds the content and indexes the resulting structured state in SurrealDB.

Supported formats

Text-first formats

These MIME types are accepted on POST /api/v1/{context_id}/documents under every ingestion profile:

Format	MIME type
Plain text	`text/plain`
Markdown	`text/markdown`
JSON	`application/json`
HTML	`text/html`
PDF	`application/pdf`

Multimodal formats

The upload endpoint also accepts image, audio, and video MIME types. The default Context profile is MultimodalFull, so OCR, transcription, captioning, and modality-native embeddings run when providers are configured. Dial the Context down to StandardMultimodal, TextPlusKeyword, or TextOnly to skip the richer stages.

Format	MIME types (examples)
Images	`image/png`, `image/jpeg`, `image/webp`, `image/gif`
Audio	`audio/wav`, `audio/mpeg`, `audio/ogg`, `audio/flac`, `audio/aac`
Video	`video/mp4`, `video/webm`, `video/quicktime`

Under TextOnly, multimodal uploads may be accepted but image/audio/video processing stages are skipped. See Multimodal content for profile details.
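
A client can check a file's MIME type against the tables above before uploading. The following is a minimal sketch, not part of Spectron: classify_mime is a hypothetical helper, and because the multimodal rows are listed as examples, an "unknown" result only means the type does not appear on this page.

```python
# Accepted MIME types copied from the tables above. The multimodal rows are
# given as examples in the docs, so these sets are not exhaustive.
TEXT_FIRST = {
    "text/plain",
    "text/markdown",
    "application/json",
    "text/html",
    "application/pdf",
}
MULTIMODAL_EXAMPLES = {
    "image": {"image/png", "image/jpeg", "image/webp", "image/gif"},
    "audio": {"audio/wav", "audio/mpeg", "audio/ogg", "audio/flac", "audio/aac"},
    "video": {"video/mp4", "video/webm", "video/quicktime"},
}


def classify_mime(mime_type: str) -> str:
    """Return 'text', 'image', 'audio', 'video', or 'unknown' (hypothetical helper)."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base = mime_type.split(";", 1)[0].strip().lower()
    if base in TEXT_FIRST:
        return "text"
    for modality, examples in MULTIMODAL_EXAMPLES.items():
        if base in examples:
            return modality
    return "unknown"
```

A caller might use the result to warn that an image, audio, or video upload into a TextOnly Context will skip its modality-specific stages.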

Uploading a document

The upload endpoint is asynchronous. It returns 202 Accepted with a document id and initial status; processing continues on the worker tier.

REST

POST /api/v1/{context_id}/documents
Content-Type: multipart/form-data

file=<binary>
metadata={"title":"Returns Policy","scopes":[["org/acme/team/eng"]],"labels":["team=eng"]}

The metadata part is JSON. scopes is a DNF selector: an OR of clauses, each clause a conjunction of slash-paths. Omit scopes to tag the document with the caller's full memory:write region. labels are descriptive key=value tags with the same validation as fact ingest (keys must not start with _; exceeding a count cap returns 409). Optional observedAt (RFC 3339) sets the known time of facts derived from this document. It is essential for page-by-page or episode-by-episode ingest, where later plot points must stay hidden until the reader reaches them (spoiler-safe narrative memory).

Note

observedAt is caller-supplied metadata — Spectron does not scan document headers or body text to infer a narrative timeline automatically. For serial fiction or journals, split uploads per chapter or episode and set observedAt (or ingest turns with observed_at) explicitly. Bulk single-file upload stamps derived facts at ingest time unless you provide the field.

Response:

{
  "id": "doc:01hx9…",
  "status": "queued",
  "content_hash": "blake3:4f3c…",
  "deduplicated": false
}
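
For application code that assembles the metadata part by hand, the sketch below shows one way to build it. build_metadata is a hypothetical helper, not a Spectron API. It applies only the client-side rules stated above (key=value labels, no leading underscore on keys) and leaves count caps to the server.

```python
import json
from datetime import datetime, timezone


def build_metadata(title, scopes=None, labels=None, observed_at=None):
    """Serialise the metadata form part (hypothetical helper, not a Spectron API)."""
    meta = {"title": title}
    if scopes is not None:
        # DNF: the outer list is an OR, each inner list is an AND of slash-paths.
        meta["scopes"] = [list(clause) for clause in scopes]
    if labels:
        for label in labels:
            key, sep, _value = label.partition("=")
            if not sep or not key:
                raise ValueError(f"label must be key=value: {label!r}")
            if key.startswith("_"):
                raise ValueError(f"label key must not start with '_': {label!r}")
        meta["labels"] = list(labels)
    if observed_at is not None:
        if observed_at.tzinfo is None:
            raise ValueError("observed_at must be timezone-aware")
        # isoformat() on an aware datetime yields an RFC 3339-compatible string.
        meta["observedAt"] = observed_at.astimezone(timezone.utc).isoformat()
    return json.dumps(meta)
```

For chapter-by-chapter ingest, call it once per chapter file with that chapter's observed_at, so derived facts carry the narrative time rather than the ingest time.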

CLI

spectron documents upload ./returns-policy.pdf \
  --scope org/acme/team/eng \
  --label team=eng \
  --url "$SPECTRON_URL" \
  --api-key "$SPECTRON_API_KEY" \
  --context-id "$SPECTRON_CONTEXT_ID"

Use the generated OpenAPI clients (Python, TypeScript) for application code – method names follow the spec.

Polling for status

Poll GET /api/v1/{context_id}/documents/{id} until status is ready or failed.
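
The polling step above can be sketched as a small loop. This is an illustration under stated assumptions: fetch_document is a caller-supplied callable that performs the GET and returns the decoded JSON body as a dict, and the interval and timeout defaults are arbitrary values rather than documented recommendations.

```python
import time

# The two terminal statuses named in the polling instruction.
TERMINAL = {"ready", "failed"}


def poll_until_terminal(fetch_document, interval_s=2.0, timeout_s=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_document() until status is ready or failed, or time runs out.

    fetch_document is an assumption of this sketch: it should issue
    GET /api/v1/{context_id}/documents/{id} and return the body as a dict.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        doc = fetch_document()
        if doc.get("status") in TERMINAL:
            return doc
        if clock() >= deadline:
            raise TimeoutError(
                f"document still {doc.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

On a failed result, inspect the error field of the returned record before deciding whether to re-upload.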

Pipeline stages

Status	Description
`queued`	Waiting to enter the pipeline
`extracting`	Reading content from the uploaded bytes
`chunking`	Splitting content into overlapping segments
`embedding`	Generating dense vectors for chunks
`rendering`	Building document summaries and section metadata
`transcribing`	Transcribing audio or video (multimodal profiles)
`captioning`	Generating captions for images (multimodal profiles)
`keywording`	RAKE keyword extraction
`ready`	Fully indexed and available for retrieval
`failed`	Pipeline error; inspect `error` on the document record

Chunk writes that exceed a SurrealDB transaction write-set limit or a WebSocket message cap are classified as permanent size errors. They are dead-lettered immediately instead of burning multi-minute re-parse retries. Operators tuning large corpora should keep the Spectron client WebSocket cap (SPECTRON_DB_WS_MAX_MESSAGE_BYTES) aligned with the SurrealDB server's SURREAL_WEBSOCKET_MAX_MESSAGE_SIZE; see Configuration.

Content addressing and deduplication

Every document is identified by a BLAKE3 hash of its raw bytes. Re-uploading identical content returns the existing document with deduplicated: true and skips reprocessing.

When a second uploader in a different scope hits the same hash, Spectron unions their scope clause onto the existing document (and related index records) instead of trapping them with a deduplicated id they cannot read. Each union emits a document.scope_widen audit event; operators can watch the documents.scope_clause_count histogram (values above ~32 clauses on one document warrant investigation).
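
The deduplication and scope-union behaviour described above can be modelled with a toy in-memory index. This sketch is not Spectron's data model. DocumentIndex and the scope_widened flag are hypothetical, and because the Python standard library has no BLAKE3, BLAKE2b stands in for the digest purely for illustration.

```python
import hashlib


def content_hash(raw: bytes) -> str:
    # Stand-in digest: Spectron uses BLAKE3, which is not in the standard
    # library, so BLAKE2b is used here only to illustrate content addressing.
    return "blake2b:" + hashlib.blake2b(raw, digest_size=32).hexdigest()


class DocumentIndex:
    """Toy in-memory index keyed by content hash (hypothetical, for illustration)."""

    def __init__(self):
        self._docs = {}

    def upload(self, raw: bytes, scope_clause: tuple) -> dict:
        key = content_hash(raw)
        doc = self._docs.get(key)
        if doc is None:
            # First upload of these bytes: create the record.
            self._docs[key] = doc = {"content_hash": key, "scopes": [scope_clause]}
            deduplicated, widened = False, False
        else:
            # Same bytes again: reuse the record and union in a new clause.
            deduplicated = True
            widened = scope_clause not in doc["scopes"]
            if widened:
                doc["scopes"].append(scope_clause)
        return {
            "content_hash": key,
            "scopes": list(doc["scopes"]),  # copy so callers cannot mutate the index
            "deduplicated": deduplicated,
            "scope_widened": widened,
        }
```

In this model a widened result corresponds to the point where the real system would emit a document.scope_widen audit event.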

Outbound links

During ingest, Spectron extracts outbound hyperlinks from the raw bytes and stores them as typed knowledge_links_to edges:

Source	Link kind
Markdown `[]()` syntax	`markdown_link`
HTML `<a href="…">` attributes	`html_link`
PDF page link annotations	`pdf_annotation`

These edges feed hybrid_graph reranking (document-link density) and citation-style navigation between corpus documents.
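
To make the link kinds concrete, here is a rough sketch of how Markdown and HTML links could be pulled out of text with the Python standard library. It is not Spectron's extractor. The regular expression handles only simple inline links (no nested parentheses in URLs, no reference-style links), and PDF annotations are omitted because they need a PDF parser.

```python
import re
from html.parser import HTMLParser

# Inline Markdown links. The lookbehind skips image syntax such as ![alt](src).
MD_LINK = re.compile(r"(?<!!)\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)")


def markdown_links(text: str):
    """Return (kind, target) pairs for simple inline Markdown links."""
    return [("markdown_link", m.group(1)) for m in MD_LINK.finditer(text)]


class _AnchorCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def html_links(markup: str):
    """Return (kind, target) pairs for <a href="..."> attributes."""
    parser = _AnchorCollector()
    parser.feed(markup)
    parser.close()
    return [("html_link", href) for href in parser.hrefs]
```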

Scope and labels

When you omit an explicit scope on upload, documents and their chunks inherit the caller's resolved write scope from the API key. Scoped keys rely on this to recall their own uploads.

You can narrow tagging with scopes on POST /documents, spectron documents upload --scope …, or MCP upload — the path must lie within the caller's memory:write region (out-of-region scope returns 403). A document's scope is fixed at upload — reprocess rejects a non-empty scopes field with 400.
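
A client-side pre-check can catch an out-of-region scope before the server returns 403. The sketch below assumes that "within the region" means a segment-wise path prefix, which this page does not spell out. Both function names are hypothetical, and the server remains the authority.

```python
def _segments(path: str):
    # Split a slash-path into non-empty segments.
    return [part for part in path.strip("/").split("/") if part]


def path_within(path: str, region: str) -> bool:
    """True when region is a segment-wise prefix of path, or equal to it.

    Assumption of this sketch: containment is by whole path segments, so
    "org/acmecorp" is NOT inside "org/acme".
    """
    p, r = _segments(path), _segments(region)
    return len(r) <= len(p) and p[: len(r)] == r


def scopes_within_region(scopes, write_region) -> bool:
    """Check every path in every DNF clause against the caller's write region.

    scopes is a list of clauses (each a list of slash-paths); write_region is
    a list of slash-paths the key may write under.
    """
    return all(
        any(path_within(path, region) for region in write_region)
        for clause in scopes
        for path in clause
    )
```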

Optional labels (key=value strings) are stamped on the document, chunks, and sections. They follow the same validation rules as fact ingest and are not copied onto reconciled graph rows.

Note

MCP upload accepts optional scope and labels arguments with the same semantics as REST — scoped keys produce scope-tagged documents and chunks visible under memory:read within that region.

Document management

GET /documents/{id} – status and metadata
GET /documents – list with filters
GET /documents/{id}/chunks – parsed segments
GET /documents/{id}/raw – original bytes
DELETE /documents/{id} – remove document, chunks, graph edges, and object-store bytes

See REST API for request shapes.