Retrieval-augmented generation (RAG) combines retrieval from your own data with generation from a language model. Instead of relying only on the model’s training data, you ground answers in chunks of text (and sometimes structured facts) that you control. SurrealDB acts as the storage and retrieval layer, letting you keep chunk text, metadata, access rules, and vector embeddings in one database and query them with SurrealQL.
What RAG changes
A plain LLM call answers from parametric memory only. RAG adds a knowledge plane: documents are split into chunks, each chunk gets an embedding from a model you choose, and at query time you retrieve the most relevant chunks before (or while) the LLM writes an answer.
Typical pipeline stages
Ingest: Load sources (files, web pages, tickets, database exports).
Chunk: Split text into segments with stable boundaries (headings, paragraphs, token limits) and optional overlap so ideas are not split awkwardly.
Embed: Call an embedding API or local model; store the resulting vector with each chunk. See Embedding pipelines and the embeddings integrations for options outside the database.
Index: Define vector indexes (for example HNSW) on embedding fields, and optionally full-text indexes on chunk text for lexical search.
Retrieve: For a user question, embed the query with the same model (or model family) used for the corpus, run similarity search, and optionally fuse with keyword results (see Hybrid search).
Generate: Pass the top chunks as context to the LLM, with instructions to cite or stay within that context.
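The stages above can be sketched end to end. This is a minimal, self-contained illustration, not SurrealDB code: a hashed bag-of-words function stands in for a real embedding model, and a plain Python list stands in for the indexed chunk table:

```python
import math
import re
import zlib
from collections import Counter

def chunk(text: str, max_chars: int = 200) -> list[str]:
    """Split on blank lines, then merge paragraphs up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text: str, dim: int = 512) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word, count in Counter(tokens(text)).items():
        vec[zlib.crc32(word.encode()) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Rank chunks by cosine similarity (vectors are pre-normalised, so dot product)."""
    qv = embed(query)
    scored = sorted(index, key=lambda item: -sum(a * b for a, b in zip(qv, item[1])))
    return [text for text, _ in scored[:k]]

corpus = ("SurrealDB stores chunk text and vectors.\n\n"
          "HNSW indexes speed up similarity search.\n\n"
          "Full-text search handles exact keywords.")
index = [(c, embed(c)) for c in chunk(corpus, max_chars=60)]
top = retrieve("similarity search with an HNSW index", index, k=1)
```

In a real pipeline the embed step calls an external model and the retrieve step is a SurrealQL KNN query against a vector index; only the shape of the flow is meant to carry over.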
Retrieval patterns
Dense retrieval: KNN over embeddings with vector::distance::knn() and the patterns in Similarity search. Good for paraphrases and conceptual match.
Hybrid retrieval: Combine dense scores with full-text search when exact product names, error codes, or legislation matter; use search::rrf() or related helpers as described under Hybrid search.
Re-ranking (a second model that scores the top k candidates) is often implemented in the application layer; SurrealDB supplies the candidate set efficiently.
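Fusion can also happen in the application layer. A minimal sketch of reciprocal rank fusion (the scheme behind RRF-style helpers), assuming the dense and keyword result lists have already been fetched as lists of record ids:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over result lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk:a", "chunk:b", "chunk:c"]    # e.g. from vector KNN
keyword = ["chunk:c", "chunk:a", "chunk:d"]  # e.g. from full-text search
fused = rrf([dense, keyword])
```

Documents that appear near the top of both lists (here chunk:a, then chunk:c) rise above documents strong in only one; the constant k damps the influence of any single ranking.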
Chunking, metadata, and citations
There is no single best chunk size: smaller chunks improve precision but lose surrounding context; larger chunks add context but dilute embeddings. Many teams store title, heading path, or summary fields to improve retrieval without inflating the embedded body.
For citations, persist enough metadata to map a chunk back to a human-readable source (page, anchor, ticket id). The LLM should only “see” what you retrieved; clear provenance reduces hallucinated references.
Operations and quality
Embedding model changes usually require re-embedding the corpus or tracking a model version alongside each vector; plan migrations before switching dimensions or distance metrics.
Staleness: when sources update, replace or invalidate affected chunks so answers do not quote obsolete text.
Evaluation: track retrieval hit rate, user thumbs-up/down, or offline benchmarks on labelled questions.
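A sketch of one way to handle both staleness and model migrations, under the assumption that each stored chunk keeps a fingerprint combining a content hash with the embedding model version; changed text and model upgrades then both surface as chunks to re-embed:

```python
import hashlib

def fingerprint(text: str, model_version: str) -> str:
    """Identity of an embedded chunk: content hash plus embedding model version."""
    return hashlib.sha256(f"{model_version}:{text}".encode()).hexdigest()

def stale_chunks(stored: dict[str, str],
                 current_texts: dict[str, str],
                 model_version: str) -> list[str]:
    """Chunk ids whose stored fingerprint no longer matches the source text,
    either because the content changed or because it was embedded under an
    older model version."""
    return [
        cid for cid, text in current_texts.items()
        if stored.get(cid) != fingerprint(text, model_version)
    ]

stored = {
    "chunk:1": fingerprint("old text", "v1"),
    "chunk:2": fingerprint("same text", "v1"),
}
current = {"chunk:1": "new text", "chunk:2": "same text"}
needs_reembedding = stale_chunks(stored, current, "v1")
```

Bumping the model version invalidates the whole corpus at once, which turns a model migration into the same re-embed loop as ordinary content updates.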