While this implementation uses Python, the architectural patterns and SurrealDB integration strategies apply to any language with SurrealDB SDK support.
Introduction
Standard Retrieval-Augmented Generation (RAG) systems are powerful but operate on a single-shot principle: retrieve, then generate. This approach often fails when faced with complex, multi-faceted queries, producing incomplete answers that lack depth. The core limitation is the absence of a feedback loop.
This article introduces the Reflexion RAG Engine, a production-ready system that overcomes these limitations through a multi-cycle, self-correcting architecture powered by SurrealDB. It employs a Multi-LLM strategy for Generation, Evaluation, and Synthesis to iteratively refine answers, ensuring higher accuracy and comprehensiveness. SurrealDB serves as a unified data store that houses both vector embeddings and document metadata.
The architectural core: a unified vector and document store with SurrealDB
Traditional RAG architectures are often fragile combinations of multiple technologies: a vector database (e.g. Pinecone, ChromaDB), a separate document store for metadata (e.g. MongoDB, Postgres), and a caching layer (e.g. Redis). This fragmentation introduces complexity in data synchronisation and security.
Our engine leverages SurrealDB to consolidate these functions into a single, ACID-compliant database. This unified approach provides specific advantages:
Simplified architecture: Documents, their metadata and their 3072-dimensional vector embeddings coexist in the same table. This eliminates data duplication and synchronisation issues.
Transactional integrity: Ingesting a document and its vector is an atomic operation, guaranteeing consistency.
High-performance search: SurrealDB’s native HNSW indexing on vector fields enables millisecond-latency similarity searches, even at scale.
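The transactional-integrity point can be sketched in SurrealQL. `BEGIN`/`COMMIT TRANSACTION` and `CREATE ... SET` are standard SurrealQL; the field values below are placeholders, not taken from the engine’s code:

```surql
-- Document text, metadata, and its vector are written atomically:
-- either all three land together, or none do.
BEGIN TRANSACTION;
CREATE documents SET
    content   = $content,
    metadata  = { source: 'whitepaper.pdf', chunk: 1 },
    embedding = $embedding;  -- 3072-dimensional array<float>
COMMIT TRANSACTION;
```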
The 3072-dimensional vectors from OpenAI’s text-embedding-3-large model, served via Azure, provide superior semantic understanding compared to standard 1536-dimensional embeddings. SurrealDB’s HNSW (Hierarchical Navigable Small World) indexing enables millisecond-level similarity search across millions of vectors.
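At 32-bit precision, the raw storage cost of these embeddings is easy to estimate. This quick calculation covers vector data only and excludes HNSW graph overhead:

```python
# Raw storage for one F32 embedding, excluding HNSW index overhead.
DIMENSIONS = 3072   # text-embedding-3-large
BYTES_PER_F32 = 4   # TYPE F32 in the index definition

bytes_per_vector = DIMENSIONS * BYTES_PER_F32
million_vectors_gib = bytes_per_vector * 1_000_000 / 1024**3

print(f"{bytes_per_vector} bytes per vector "
      f"(~{million_vectors_gib:.1f} GiB per million vectors)")
```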
The schema in detail
We define a single index, hnsw_embedding, with a dimension of 3072 to match OpenAI’s text-embedding-3-large model. The use of cosine distance is standard for normalised embeddings. The EFC (entry point candidates) and M (maximum connections per node) parameters are tuned to balance high recall against query speed, making the index suitable for production environments.
```surql
-- Define a schema-full table named 'documents'
DEFINE TABLE documents SCHEMAFULL;

-- Define a 'content' field of type string
DEFINE FIELD content ON documents TYPE string;

-- Define a 'metadata' field of flexible type object
-- This allows for storing schemaless data
DEFINE FIELD metadata ON documents TYPE object FLEXIBLE;

-- Define an 'embedding' field of type array of floats
-- This is typically used for storing vector embeddings
DEFINE FIELD embedding ON documents TYPE array<float>;

-- Create an HNSW index on the 'embedding' field
-- This index is used for efficient vector similarity searches
DEFINE INDEX hnsw_embedding ON documents
    FIELDS embedding
    HNSW
    DIMENSION 3072 -- The number of dimensions in the embedding vectors
    DIST COSINE    -- Use cosine distance for measuring similarity
    TYPE F32       -- Use 32-bit floating point numbers
    EFC 500        -- Entry point candidate size for search performance
    M 16;          -- Maximum number of connections per node in the graph
```
Deep dive: multi-LLM reflexion engine
The core architecture is based on a loop that mimics critical thinking. For each query, a series of steps occurs, guided by specialised LLMs to ensure a detailed and accurate response.
Generate: The Generation LLM produces an initial answer based on retrieved context.
Evaluate: The Evaluation LLM assesses this answer, generates a confidence score and identifies knowledge gaps or uncertainties.
Decide: Based on the evaluation, the system makes a decision using the ReflexionDecision enum:
COMPLETE: If confidence is above the threshold (e.g., 0.90), the loop terminates.
REFINE_QUERY: If specific gaps are found, a new, more targeted query is generated to fill them.
CONTINUE: If the answer is incomplete but on the right track, the loop proceeds.
INSUFFICIENT_DATA: If the knowledge base cannot answer, the loop terminates gracefully.
Synthesise: If the loop runs through multiple cycles without terminating early, the Summary LLM synthesises the final response from the partial answers generated along the way.
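The decision step can be sketched as a small rule over the evaluation output. The enum values come from the engine; the function name, signature, and exact ordering of the checks are illustrative assumptions:

```python
from enum import Enum

class ReflexionDecision(Enum):
    COMPLETE = "complete"
    REFINE_QUERY = "refine_query"
    CONTINUE = "continue"
    INSUFFICIENT_DATA = "insufficient_data"

def decide(confidence: float, gaps: list[str], retrieved_any: bool,
           threshold: float = 0.90) -> ReflexionDecision:
    """Map an evaluation result onto the next action in the reflexion loop."""
    if not retrieved_any:
        # The knowledge base cannot answer at all: terminate gracefully.
        return ReflexionDecision.INSUFFICIENT_DATA
    if confidence >= threshold:
        # Confident enough: stop the loop.
        return ReflexionDecision.COMPLETE
    if gaps:
        # Specific gaps found: generate a more targeted follow-up query.
        return ReflexionDecision.REFINE_QUERY
    # Incomplete but on the right track: run another cycle.
    return ReflexionDecision.CONTINUE
```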
Multi-LLM orchestration strategy
The system strategically assigns different models to specialised tasks:
```python
# From src/rag/reflexion_engine.py
class ReflexionRAGEngine:
    """Advanced RAG engine with dynamic reflexion loop architecture and web search integration"""

    def __init__(self, ...):
        # Initialise different LLMs for different purposes
        self.generation_llm = generation_llm or GitHubLLM(
            model_override=settings.llm_model,
            temperature_override=settings.llm_temperature,
            max_tokens_override=settings.llm_max_tokens,
        )
```
The reflexion cycle can be seen in the following diagram.
It works as follows:
Cache check: Looks for a cached answer to the question and streams it if found.
Reflexion loop: For a set number of cycles, it:
Decides whether to use web search in addition to database retrieval.
Retrieves relevant documents and/or web results.
Prepares a prompt and streams the LLM’s partial answer as it is generated.
If the answer is truncated, attempts to continue it.
Evaluates the answer’s quality and confidence.
Decides whether to stop, continue, or generate a follow-up query for the next cycle.
Finalisation: If no confident answer is found after all cycles, synthesises a final answer from all cycles.
Streaming: Streams the final comprehensive answer with metadata.
Error handling: If any step fails, falls back to a simpler RAG approach and streams that answer.
Features:
Streams answers in real time.
Integrates both database and web search.
Self-evaluates and iteratively improves answers.
Handles prompt/answer truncation and errors robustly.
Caches results for future queries.
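The steps and features above can be condensed into a skeleton of the loop. The helper signatures (retrieve, generate, evaluate, synthesise, fallback) and the cache interface are illustrative assumptions, not the engine’s actual API:

```python
async def answer_with_reflexion(query, *, cache, retrieve, generate, evaluate,
                                synthesise, fallback, max_cycles=3, threshold=0.90):
    """Condensed sketch of the reflexion loop (helper signatures are illustrative)."""
    if (hit := cache.get(query)) is not None:       # 1. Cache check
        return hit
    try:
        partials, current_query = [], query
        for cycle in range(max_cycles):             # 2. Reflexion loop
            context = await retrieve(current_query)
            draft = await generate(current_query, context)
            partials.append(draft)
            confidence, follow_up = await evaluate(query, draft)
            if confidence >= threshold:             # Confident answer: stop early
                cache.set(query, draft)
                return draft
            current_query = follow_up or current_query
        final = await synthesise(query, partials)   # 3. Finalisation
        cache.set(query, final)
        return final
    except Exception:                               # 4. Error handling
        return await fallback(query)                # Fall back to simpler RAG
```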
Prompt engineering: YAML-based prompt management
Prompts must be easy to maintain, update, and tune as requirements evolve. To maintain flexibility and enable rapid iteration, all prompts are managed in versioned YAML files located in the ./prompts directory. This separates prompt engineering from the application code, allowing easier updates.
```yaml
prompt_template: |
  You are an expert AI assistant providing detailed, accurate answers with proper source citations.
  ...
  Question: {{query}}

  Available Documents:
  {{context}}
  ...

  RESPONSE STRUCTURE:
  ...

  IMPORTANT GUIDELINES:
  ...

  Answer:
```
```yaml
# A snippet from prompts/evaluation/response_evaluation.yaml
RESPONSE FORMAT (JSON):
{
  "confidence_score": 0.35,
  "decision": "continue|refine_query|complete|insufficient_data",
  "reasoning": "Detailed explanation of the assessment",
  ...
}
```
The Prompt Manager
The Prompt Manager is responsible for loading prompts from the YAML files and inserting variables such as the user query, refined queries, previous answers, and confidence scores in their dedicated places.
```python
def render(self, **kwargs) -> str:
    """Renders the prompt template with the given variables."""
    ...

# --- Core Manager ---
class PromptManager:
    """Manages loading, caching, and rendering of YAML-based prompt templates."""

    def __init__(self, prompts_dir: Optional[str] = "prompts"):
        """Initialises the manager and loads all prompts from disk."""
        self._prompt_cache: Dict[str, PromptTemplate] = {}
        # ... implementation for loading all prompts on startup ...

    def get_prompt(self, name: str) -> Optional[PromptTemplate]:
        """Retrieves a parsed prompt template from the cache by its name."""
        ...

    def render_prompt(self, name: str, **kwargs) -> str:
        """Finds a prompt by name and renders it with provided variables."""
        ...

    def list_prompts(self) -> List[str]:
        """Returns a list of all loaded prompt names."""
        ...

    def reload_prompts(self):
        """Clears the cache and reloads all prompts from the directory."""
        ...

# --- Global Singleton Instance ---
# Provides a single, shared instance of the prompt manager across the application.
prompt_manager = PromptManager()
```
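Given the {{query}}/{{context}} placeholder style used in the YAML prompt files, the render step can be approximated with a simple regex substitution. This minimal version is a sketch of the idea, not the engine’s actual implementation:

```python
import re

def render(template: str, **variables) -> str:
    """Replace {{name}} placeholders with the supplied variables."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"Missing prompt variable: {key}")
        return str(variables[key])
    # Allow optional whitespace inside the braces, e.g. {{ query }}
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)
```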
Hybrid retrieval: combining local documents and live web search
A simple RAG system is limited by the freshness of its data. The Reflexion Engine implements a hybrid retrieval strategy to enhance its knowledge base with real-time information from the web.
The SurrealDBVectorStore implements a similarity_search_combined method. Instead of a single vector search, it executes two parallel queries against the documents and web_search tables, each with its own limit (k_docs and k_web). The results are merged after the vector search and re-ranked by similarity score. This ensures that the most relevant information, whether from a local PDF or a recent news article, is available to the generation model.
Web search is configurable via the WEB_SEARCH_MODE setting, which can be off, initial_only, or every_cycle. For complex queries, running web search in every reflexion cycle allows the engine to dynamically seek external information, filling the knowledge gaps identified during the evaluation phase.
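This gating is simple to express. The sketch below assumes the three mode strings above and zero-based cycle numbering; the function name is illustrative:

```python
def should_use_web_search(mode: str, cycle: int) -> bool:
    """Decide whether a reflexion cycle should hit the web in addition to SurrealDB."""
    if mode == "off":
        return False
    if mode == "initial_only":
        return cycle == 0  # only the first retrieval pass goes to the web
    if mode == "every_cycle":
        return True
    raise ValueError(f"Unknown WEB_SEARCH_MODE: {mode!r}")
```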
Hybrid search: documents + web results
The engine’s hybrid search combines local documents with real-time web results:
```python
async def similarity_search_combined(self, query: str, k_docs: int = 3, k_web: int = 2) -> List[Document]:
    """
    Perform a combined similarity search over local documents and web search results.

    - Embeds the query.
    - Retrieves top-k similar documents and web results (with limits).
    - Combines results, enforcing a max token budget.
    - Sorts all results by similarity score (descending).
    - Returns a list of Document objects.
    """
    ...

    # Query local documents table
    docs_query = f"""
        SELECT id, content, metadata,
            vector::similarity::cosine(embedding, {query_embedding}) AS score
        FROM documents
        WHERE embedding <|300,COSINE|> {query_embedding}
        ORDER BY score DESC
        LIMIT {k_docs};
    """

    # Search web results with limit
    web_query = f"""
        SELECT id, content, metadata,
            vector::similarity::cosine(embedding, {query_embedding}) AS score
        FROM web_search
        WHERE embedding <|300,COSINE|> {query_embedding}
        ORDER BY score DESC
        LIMIT {k_web};
    """

    ...

    # Add document results first
    for result in docs_results or []:
        ...

    # Add web results if token budget allows
    for result in web_results or []:
        ...

    # Sort by similarity score
    all_documents.sort(key=lambda x: x.metadata.get("similarity_score", 0), reverse=True)
    ...
```
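The merge-and-rerank step can be illustrated in isolation. The ~4-characters-per-token heuristic and the dict shape of the rows are assumptions made for this sketch:

```python
def merge_and_rerank(doc_hits: list[dict], web_hits: list[dict],
                     max_tokens: int = 4000) -> list[dict]:
    """Combine document and web rows under a token budget, ranked by score."""
    combined, used = [], 0
    for row in doc_hits + web_hits:      # document results first, as in the engine
        cost = len(row["content"]) // 4  # rough ~4 chars per token heuristic
        if used + cost > max_tokens:
            continue                     # skip rows that would blow the budget
        used += cost
        combined.append(row)
    combined.sort(key=lambda r: r.get("score", 0), reverse=True)
    return combined
```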
```bash
# 3. Configure your environment
cp .env.example .env
# Edit the .env file with your credentials for SurrealDB, GitHub, etc.
```
`uv sync`: This single command installs all production dependencies, including the SurrealDB Python SDK, Azure AI Inference, Crawl4AI for web scraping, and all LLM-related libraries.
Environment configuration
```bash
# Create environment file
# .env.example is included in the repository!
cp .env.example .env
```
Get a GitHub Models access token (github.com/models)
Define document and web search tables (run as a query or run in Surrealist)
The engine uses two specialised tables, each optimised for different data types while sharing the same vector search capabilities:
```surql
-- Define a schema-full table named 'documents'
DEFINE TABLE documents SCHEMAFULL;

-- Define a 'content' field of type string
DEFINE FIELD content ON documents TYPE string;

-- Define a 'metadata' field of flexible type object
-- This allows for storing schemaless data
DEFINE FIELD metadata ON documents TYPE object FLEXIBLE;

-- Define an 'embedding' field of type array of floats
-- This is typically used for storing vector embeddings
DEFINE FIELD embedding ON documents TYPE array<float>;

-- Create an HNSW index on the 'embedding' field
-- This index is used for efficient vector similarity searches
DEFINE INDEX hnsw_embedding ON documents
    FIELDS embedding
    HNSW
    DIMENSION 3072 -- The number of dimensions in the embedding vectors
    DIST COSINE    -- Use cosine distance for measuring similarity
    TYPE F32       -- Use 32-bit floating point numbers
    EFC 500        -- Entry point candidate size for search performance
    M 16;          -- Maximum number of connections per node in the graph

-- Define a schema-full table named 'web_search'
DEFINE TABLE web_search SCHEMAFULL;

-- Define a 'content' field of type string
-- This field will store the textual content of the web search results
DEFINE FIELD content ON web_search TYPE string;

-- Define a 'metadata' field of flexible type object
-- This allows for storing additional information in a schemaless manner
DEFINE FIELD metadata ON web_search TYPE object FLEXIBLE;

-- Define an 'embedding' field of type array of floats
-- This field is used for storing vector embeddings of the content
DEFINE FIELD embedding ON web_search TYPE array<float>;

-- Create an HNSW index on the 'embedding' field
-- This index is used for efficient vector similarity searches
DEFINE INDEX hnsw_embedding ON web_search
    FIELDS embedding
    HNSW
    DIMENSION 3072 -- The number of dimensions in the embedding vectors
    DIST COSINE    -- Use cosine distance for measuring similarity
    TYPE F32       -- Use 32-bit floating point numbers
    EFC 500        -- Entry point candidate size for search performance
    M 16;          -- Maximum number of connections per node in the graph
```
Custom SurrealQL functions for RAG operations
The system includes specialised SurrealQL functions that demonstrate SurrealDB’s programmability:
```surql
-- A function to count the number of user documents stored
DEFINE FUNCTION OVERWRITE fn::countdocs() -> int {
    count(SELECT * FROM documents);
} PERMISSIONS FULL;

-- A function to count the number of web documents stored
DEFINE FUNCTION OVERWRITE fn::count_web() -> int {
    count(SELECT * FROM web_search);
} PERMISSIONS FULL;

-- A function to delete user documents with confirmation
DEFINE FUNCTION OVERWRITE fn::deldocs($confirm: string) -> string {
    LET $word = 'CONFIRM';
    RETURN IF $confirm == $word { DELETE documents; 'DELETED' } ELSE { 'NOT DELETED' }
} PERMISSIONS FULL;

-- A function to delete web search documents with confirmation
DEFINE FUNCTION OVERWRITE fn::delweb($confirm: string) -> string {
    LET $word = 'CONFIRM';
    RETURN IF $confirm == $word { DELETE web_search; 'DELETED' } ELSE { 'NOT DELETED' }
} PERMISSIONS FULL;

-- Define a function to perform similarity search on user documents
DEFINE FUNCTION OVERWRITE fn::similarity_search($query_embedding: array<float>, $k: int) -> any {
    -- Set a default limit for the number of results if $k is not provided
    LET $limit = $k ?? 5;
    -- Perform a similarity search using cosine similarity
    RETURN (
        SELECT id, content, metadata,
            vector::similarity::cosine(embedding, $query_embedding) AS score
        FROM documents
        WHERE embedding <|300, COSINE|> $query_embedding -- Use KNN operator with cosine distance
        ORDER BY score DESC -- Order results by similarity score in descending order
        LIMIT $limit        -- Limit the number of results to the specified limit
    );
} PERMISSIONS FULL; -- Grant full permissions for this function

-- Define a function to perform similarity search on web search results
DEFINE FUNCTION OVERWRITE fn::similarity_search_web($query_embedding: array<float>, $k: int) -> any {
    -- Set a default limit for the number of results if $k is not provided
    LET $limit = $k ?? 5;
    -- Perform a similarity search using cosine similarity
    RETURN (
        SELECT id, content, metadata,
            vector::similarity::cosine(embedding, $query_embedding) AS score
        FROM web_search
        WHERE embedding <|300, COSINE|> $query_embedding -- Use KNN operator with cosine distance
        ORDER BY score DESC -- Order results by similarity score in descending order
        LIMIT $limit        -- Limit the number of results to the specified limit
    );
} PERMISSIONS FULL; -- Grant full permissions for this function
```
These functions showcase SurrealDB’s ability to handle complex vector operations with custom logic directly in the database layer, reducing network overhead and improving performance and security.
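Once defined, these functions can be invoked from Surrealist or from any SDK query call. For example ($query_embedding must be bound to a 3072-dimensional array):

```surql
-- Count stored documents
RETURN fn::countdocs();

-- Top 5 most similar documents to a query embedding
RETURN fn::similarity_search($query_embedding, 5);

-- Delete all user documents, guarded by the confirmation word
RETURN fn::deldocs('CONFIRM');
```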
Set up Google custom search
1. Obtain a Google custom search API key
The API key authenticates your project's requests to Google's services.
Go to the Google Cloud Console: Navigate to the Google Cloud Console and create a new project if you don't have one already.
Enable the API: In your project's dashboard, go to the "APIs & Services" section. Find and enable the Custom Search API.
Create credentials: Go to the "Credentials" tab within "APIs & Services". Click "Create Credentials" and select "API key".
Copy and secure the key: A new API key will be generated. Copy this key and store it securely. It is recommended to restrict the key's usage to only the "Custom Search API" for security purposes.
2. Create a programmable search engine and get the CSE ID
The CSE ID (also called the Search Engine ID or cx) tells Google what to search (e.g., the entire web or specific sites you define).
Go to the Programmable Search Engine Page: Visit the Google Programmable Search Engine website and sign in with your Google account.
Create a new search engine: Click "Add" or "New search engine" to start the setup process.
Configure your engine:
Give your search engine a name.
Under "Sites to search," you can specify particular websites or enable the option to "Search the entire web."
Click "Create" when you are done.
Find your search engine ID (CSE ID): After creating the engine, go to the "Setup" or "Overview" section of its control panel. Your Search engine ID will be displayed there. Copy this ID.
3. Update your project configuration
Finally, take the two values you have obtained and place them in your project's .env file:
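The exact key names depend on the project’s settings module, so check .env.example for the authoritative names. A typical pair of entries might look like this (the variable names below are assumptions, not taken from the repository):

```
# Hypothetical key names - confirm against .env.example
GOOGLE_API_KEY=your-api-key
GOOGLE_CSE_ID=your-search-engine-id
```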
The ingestion pipeline handles multiple formats: PDF (`.pdf`), Text (`.txt`), Markdown (`.md`), and Word (`.docx`).
These formats are handled by the system's DocumentLoader, which is configured to recognise and parse these specific file extensions when you ingest documents from a directory.
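A minimal sketch of that extension filter is shown below; the function name and list-based interface are illustrative, since the real DocumentLoader walks a directory:

```python
from pathlib import Path

# The four formats the ingestion pipeline accepts
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx"}

def filter_supported(filenames: list[str]) -> list[str]:
    """Keep only files the ingestion pipeline can parse, case-insensitively."""
    return [f for f in filenames if Path(f).suffix.lower() in SUPPORTED_EXTENSIONS]
```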
Interactive chat and CLI usage
The system provides a rich command-line interface for all operations:
Start interactive chat session
```bash
# Launch interactive chat with reflexion engine
uv run rag.py chat
```
Document management commands
```bash
# View current configuration and status
uv run rag.py config
```

Sample output:

```
RAG Engine Configuration
========================
LLM Model: meta/Meta-Llama-3.1-405B-Instruct
Evaluation Model: cohere/Cohere-command-r
SurrealDB URL: wss://your-instance.surreal.cloud
Documents in store: 127
Web results in store: 45
Memory cache: Enabled (hit rate: 73%)
```
```bash
# Delete all documents (requires confirmation)
uv run rag.py delete

# Ingest new documents with progress tracking
uv run rag.py ingest --docs_path=./new_docs
```
Sample output: reflexion in action
Here we see some real output from the reflexion engine processing a complex query about SurrealDB’s vector capabilities.
```
Query: compare SurrealDB's vector search with traditional vector databases

🔄 Cycle 1 - initial generation
📚 Retrieved: 3 documents (similarity: 0.85-0.91)
💭 SurrealDB offers several advantages over traditional vector databases:

1. **Unified data model**: unlike Pinecone or Weaviate, SurrealDB combines document
   storage, graph relationships, and vector search in one system [Source: architecture.md]
2. **Native SQL integration**: SurrealQL enables complex queries that join vector
   similarity with traditional filters [Source: surrealdb_guide.md]...

🔍 Self-evaluation - confidence: 0.78 (Below threshold: 0.85)
❓ Gap identified: missing performance benchmarks and indexing details

**Performance characteristics**:
- HNSW indexing provides ~10x faster queries than brute force [Source: benchmarks.md]
- Memory usage: ~4 bytes per dimension per vector for F32 type
- EFC parameter (500) balances accuracy vs speed...

📊 Final metrics:
- Total cycles: 2
- Processing time: 6.8s
- Documents used: 8
- Final confidence: 0.89
- Cache hits: 3/8
```
Conclusion
The Reflexion RAG Engine demonstrates SurrealDB’s power as a unified platform for modern AI applications. By combining vector search, document storage, and real-time web integration in a single database, it eliminates architectural complexity while delivering superior performance.
Ready to build your own self-correcting AI system? Clone the repository, follow this setup guide, and start experimenting with SurrealDB’s vector capabilities. The future of AI is reflexive, multimodal, and built on unified data platforms.
Any questions about implementing reflexion loops with SurrealDB? Join the SurrealDB Discord, star the project repository, or share your own SurrealDB vector search experiments with the community!