Unstructured cleans raw files (PDF, HTML, Office docs …) and, if you wish, embeds each chunk. SurrealDB then keeps those embeddings together with any metadata or graph relations—so your whole retrieval pipeline lives in one database.
```bash
pip install unstructured surrealdb                   # ingestion + SurrealDB Python SDK
surreal start --log trace --user root --pass root    # local server (or connect to your cluster)
```
Below we extract chunks from a PDF annual report, embed them, save the results as JSONL, and then bulk-insert that file into SurrealDB.
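A minimal extraction script might look like the following. The file names match the ones used in this walkthrough; the sentence-transformers model (`all-MiniLM-L6-v2`, 384-dimensional vectors) is an illustrative assumption, and any embedding provider that returns fixed-size vectors works just as well.

```python
# extract_and_embed.py -- a minimal sketch; the embedding model is an
# illustrative choice, swap in whichever provider you already use.
import json
from pathlib import Path

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# 1. Parse the PDF into structured elements (titles, paragraphs, tables, ...)
elements = partition_pdf(filename="annual-report-2024.pdf")

# 2. Group elements into retrieval-sized chunks
chunks = chunk_by_title(elements, max_characters=1000)

# 3. Embed each chunk's text (384-dimensional vectors for this model)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([chunk.text for chunk in chunks])

# 4. Write one JSON object per line: {text, page, embedding}
out_path = Path("chunks_jsonl/annual-report-2024.jsonl")
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w") as f:
    for chunk, embedding in zip(chunks, embeddings):
        f.write(json.dumps({
            "text": chunk.text,
            "page": chunk.metadata.page_number,
            "embedding": embedding.tolist(),
        }) + "\n")
```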
A tiny script can read each line of chunks_jsonl/annual-report-2024.jsonl and push it into SurrealDB (see “Importer” below).
Running `python ingest_to_surreal.py` will:

- insert `{text, page, embedding}` rows into the `AnnualChunks` table, and
- define an HNSW index on the `embedding` field for fast vector search.
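Here is a sketch of what `ingest_to_surreal.py` could look like, assuming the SurrealDB Python SDK's synchronous WebSocket client (older SDK releases expose an async API with slightly different sign-in keys). The namespace and database names are placeholders, and the index dimension must match your embedding model.

```python
# ingest_to_surreal.py -- a minimal sketch assuming the SurrealDB Python SDK's
# synchronous client; namespace/database names and index parameters are illustrative.
import json

from surrealdb import Surreal

JSONL_PATH = "chunks_jsonl/annual-report-2024.jsonl"

with Surreal("ws://localhost:8000/rpc") as db:
    db.signin({"username": "root", "password": "root"})
    db.use("docs", "docs")  # namespace, database -- pick your own

    # Define an HNSW vector index on the embedding field.
    # DIMENSION must match the embedding model (384 for all-MiniLM-L6-v2).
    db.query("""
        DEFINE INDEX idx_embedding
            ON AnnualChunks FIELDS embedding
            HNSW DIMENSION 384 DIST COSINE;
    """)

    # Insert one row per JSONL line: {text, page, embedding}
    # (one create() call per record keeps the sketch simple)
    with open(JSONL_PATH) as f:
        for line in f:
            db.create("AnnualChunks", json.loads(line))
```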
SurrealDB gives you the chunk text, the original page number, and a similarity score—all filterable and join-able with any other table.
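A retrieval query could then look like the sketch below: embed the question with the same model used at ingest time, run a nearest-neighbour search through the HNSW index with SurrealDB's `<|K, EF|>` operator, and score results with `vector::similarity::cosine`. The connection details and example question are placeholders.

```python
# query_chunks.py -- a minimal sketch; embed the question with the same model
# used at ingest time, then run a KNN search over the HNSW index.
from sentence_transformers import SentenceTransformer  # assumed embedding backend
from surrealdb import Surreal

model = SentenceTransformer("all-MiniLM-L6-v2")
question_vec = model.encode("How did revenue change year over year?").tolist()

with Surreal("ws://localhost:8000/rpc") as db:
    db.signin({"username": "root", "password": "root"})
    db.use("docs", "docs")

    results = db.query(
        """
        SELECT text, page,
               vector::similarity::cosine(embedding, $vec) AS score
        FROM AnnualChunks
        WHERE embedding <|5, 40|> $vec   -- top-5 neighbours, ef = 40
        ORDER BY score DESC;
        """,
        {"vec": question_vec},
    )
    print(results)
```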
Swap `annual-report-2024.pdf` for your own corporate documents. By pairing Unstructured’s document intelligence with SurrealDB’s multi-model engine, you get a fully self-contained pipeline for RAG, semantic search, and analytics: no extra vector stores, no ETL headaches.