Uploading documents

How to ingest documents into the Spectron knowledge layer.

The knowledge layer holds authoritative material – manuals, policies, product data, and files your agents should treat as curated sources. Documents enter through an asynchronous upload pipeline: bytes land in object storage, then Spectron extracts, chunks, embeds, and indexes structured state in SurrealDB.

These MIME types are accepted on POST /api/v1/{context_id}/documents and processed under the default TextOnly or StandardMultimodal profiles:

| Format | MIME type |
| --- | --- |
| Plain text | text/plain |
| Markdown | text/markdown |
| JSON | application/json |
| HTML | text/html |
| PDF | application/pdf |

The upload endpoint also accepts image, audio, and video MIME types when the Context uses StandardMultimodal or MultimodalFull. OCR, transcription, captioning, and modality-native embeddings run only when the selected profile enables them:

| Format | MIME types (examples) |
| --- | --- |
| Images | image/png, image/jpeg, image/webp, image/gif |
| Audio | audio/wav, audio/mpeg, audio/ogg, audio/flac, audio/aac |
| Video | video/mp4, video/webm, video/quicktime |

Under TextOnly, multimodal uploads may be accepted but image/audio/video processing stages are skipped. See Multimodal content for profile details.

The upload endpoint is asynchronous. It returns 202 Accepted with a document id and initial status; processing continues on the worker tier.

POST /api/v1/{context_id}/documents
Content-Type: multipart/form-data

file=<binary>
metadata={"title":"Returns Policy","scopes":[["org/acme/team/eng"]],"labels":["team=eng"]}

The metadata part is JSON:

  • scopes is a DNF selector: an OR of clauses, each a conjunction of slash paths. Omit it to tag the document with the caller's full memory:write region.

  • labels are descriptive key=value tags. They follow the same validation as fact ingest: keys must not start with _, and exceeding the label count cap returns 409.

  • observedAt (optional, RFC 3339) sets the known time of facts derived from this document. Set it for page-by-page or episode-by-episode ingest, where later plot points must stay hidden until the reader reaches them (spoiler-safe narrative memory).
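
As a rough illustration of the metadata part described above, the sketch below assembles the JSON string in Python. The helper name `build_metadata` is hypothetical and not part of any Spectron client. The local label check only mirrors the one rule stated here (keys must not start with `_`); the server remains the authority on validation.

```python
import json
from datetime import datetime, timezone


def build_metadata(title, scopes=None, labels=None, observed_at=None):
    """Return the JSON string for the `metadata` multipart field (hypothetical helper)."""
    meta = {"title": title}
    if scopes is not None:
        # DNF: the outer list is an OR, each inner list is an AND of slash paths.
        meta["scopes"] = [list(clause) for clause in scopes]
    if labels:
        for label in labels:
            key, sep, _value = label.partition("=")
            # Local pre-check only: key=value shape, key must not start with "_".
            if not sep or not key or key.startswith("_"):
                raise ValueError(f"invalid label: {label!r}")
        meta["labels"] = list(labels)
    if observed_at is not None:
        if observed_at.tzinfo is None:
            raise ValueError("observed_at must be timezone-aware for RFC 3339")
        meta["observedAt"] = observed_at.isoformat()
    return json.dumps(meta, separators=(",", ":"))


example = build_metadata(
    "Returns Policy",
    scopes=[["org/acme/team/eng"]],
    labels=["team=eng"],
    observed_at=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
)
print(example)
```

Leaving `scopes` as `None` omits the field entirely, which matches the documented default of tagging with the caller's full memory:write region.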

Response:

{
  "id": "doc:01hx9…",
  "status": "queued",
  "content_hash": "blake3:4f3c…",
  "deduplicated": false
}

From the CLI:

spectron documents upload ./returns-policy.pdf \
  --scope org/acme/team/eng \
  --label team=eng \
  --url "$SPECTRON_URL" \
  --api-key "$SPECTRON_API_KEY" \
  --context-id "$SPECTRON_CONTEXT_ID"

Use the generated OpenAPI clients (Python, TypeScript) for application code – method names follow the spec.

Poll GET /api/v1/{context_id}/documents/{id} until status is ready or failed.

| Status | Description |
| --- | --- |
| queued | Waiting to enter the pipeline |
| extracting | Reading content from the uploaded bytes |
| chunking | Splitting content into overlapping segments |
| embedding | Generating dense vectors for chunks |
| rendering | Building document summaries and section metadata |
| transcribing | Transcribing audio or video (multimodal profiles) |
| captioning | Generating captions for images (multimodal profiles) |
| keywording | RAKE keyword extraction |
| ready | Fully indexed and available for retrieval |
| failed | Pipeline error; inspect error on the document record |
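
The polling step can be sketched as a small loop that stops on either terminal status. This is a minimal sketch, not the generated client: `wait_for_document` and its `fetch_status` parameter are hypothetical names, and `fetch_status` stands for whatever call performs GET /api/v1/{context_id}/documents/{id} and returns the decoded document record.

```python
import time

# The two terminal statuses from the table above.
TERMINAL_STATUSES = {"ready", "failed"}


def wait_for_document(fetch_status, interval=2.0, timeout=300.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the document is ready or failed, or raise TimeoutError.

    fetch_status: zero-argument callable returning the document record as a
    dict with at least a "status" key (an assumption of this sketch).
    """
    deadline = clock() + timeout
    while True:
        doc = fetch_status()
        if doc["status"] in TERMINAL_STATUSES:
            # The caller decides what to do with "failed", e.g. read doc["error"].
            return doc
        if clock() >= deadline:
            raise TimeoutError(f"document still {doc['status']!r} after {timeout}s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; in application code the defaults are usually enough.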

Every document is identified by a BLAKE3 hash of its raw bytes. Re-uploading identical content returns the existing document with deduplicated: true and skips reprocessing.

When a second uploader in a different scope hits the same hash, Spectron unions their scope clause onto the existing document (and related index records) instead of trapping them with a deduplicated id they cannot read. Each union emits a document.scope_widen audit event; operators can watch the documents.scope_clause_count histogram (values above ~32 clauses on one document warrant investigation).
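
To make the deduplication and scope-union behavior concrete, here is a toy in-memory model. It is an illustration only and says nothing about how Spectron stores documents. The class `DedupIndex` is hypothetical, and because Python's standard library has no BLAKE3, the sketch substitutes BLAKE2b as a stand-in hash.

```python
import hashlib


class DedupIndex:
    """Toy model: one record per content hash, scope clauses unioned on re-upload."""

    def __init__(self):
        self._docs = {}  # content hash -> document record

    def upload(self, data, scope_clause):
        # Stand-in hash: BLAKE2b from hashlib, NOT the BLAKE3 the service uses.
        digest = hashlib.blake2b(data, digest_size=32).hexdigest()
        doc = self._docs.get(digest)
        deduplicated = doc is not None
        if doc is None:
            doc = {"id": f"doc:{len(self._docs) + 1}",
                   "content_hash": digest,
                   "scopes": [scope_clause]}
            self._docs[digest] = doc
        elif scope_clause not in doc["scopes"]:
            # Same bytes from a different scope: widen instead of duplicating.
            # (The real service emits a document.scope_widen audit event here.)
            doc["scopes"].append(scope_clause)
        # Return a snapshot so callers cannot mutate the stored record.
        return {"id": doc["id"], "content_hash": doc["content_hash"],
                "scopes": list(doc["scopes"]), "deduplicated": deduplicated}
```

In this model, uploading the same bytes twice from two scopes yields a single id whose `scopes` list holds both clauses, and only the second response carries `deduplicated: True`.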

During ingest, Spectron extracts outbound hyperlinks from the raw bytes and stores them as typed knowledge_links_to edges:

| Source | Link kind |
| --- | --- |
| Markdown []() syntax | markdown_link |
| HTML <a href="…"> attributes | html_link |
| PDF page link annotations | pdf_annotation |

These edges feed hybrid_graph reranking (document-link density) and citation-style navigation between corpus documents.
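
For intuition, link extraction for the two text formats might look like the sketch below, using only the Python standard library. It is an approximation under stated assumptions, not Spectron's extractor: `extract_links` is a hypothetical name, the Markdown regex handles only inline `[text](url)` links, and PDF annotations are left out because they need a PDF library.

```python
import re
from html.parser import HTMLParser

# Inline Markdown links only; the lookbehind skips image syntax ![alt](src).
MARKDOWN_LINK = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)[^)]*\)")


class _AnchorCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(text, mime_type):
    """Return (target, link_kind) pairs using the link kinds from the table above."""
    if mime_type == "text/markdown":
        return [(m.group(1), "markdown_link") for m in MARKDOWN_LINK.finditer(text)]
    if mime_type == "text/html":
        parser = _AnchorCollector()
        parser.feed(text)
        return [(href, "html_link") for href in parser.hrefs]
    return []  # other formats are out of scope for this sketch
```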

When you omit an explicit scope on upload, documents and their chunks inherit the caller's resolved write scope from the API key. Scoped keys rely on this to recall their own uploads.

You can narrow tagging with scopes on POST /documents, spectron documents upload --scope …, or MCP upload. Each path must lie within the caller's memory:write region; an out-of-region scope returns 403. A document's scope is fixed at upload: reprocess rejects a non-empty scopes field with 400.
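
A client can catch out-of-region scopes before the server answers 403. The sketch below assumes that "within the region" means the path equals a region path or descends from it at a slash boundary; that reading is an assumption, and the server's check is authoritative. Both function names are hypothetical.

```python
def path_within(path, region):
    """True if path equals region or is a descendant of it (assumed semantics)."""
    p = path.strip("/").split("/")
    r = region.strip("/").split("/")
    # Compare whole segments so "org/acmecorp" is not treated as under "org/acme".
    return p[:len(r)] == r


def out_of_region_paths(scopes, write_regions):
    """Return every path in the DNF scopes that lies outside all write regions."""
    return [
        path
        for clause in scopes
        for path in clause
        if not any(path_within(path, region) for region in write_regions)
    ]
```

An empty result suggests the upload should pass the region check; any returned path is one that would likely trigger the 403.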

Optional labels (key=value strings) are stamped on the document, chunks, and sections. They follow the same validation rules as fact ingest and are not copied onto reconciled graph rows.

  • GET /documents/{id} – status and metadata

  • GET /documents – list with filters

  • GET /documents/{id}/chunks – parsed segments

  • GET /documents/{id}/raw – original bytes

  • DELETE /documents/{id} – remove document, chunks, graph edges, and object-store bytes

See REST API for request shapes.