# DeepEval
DeepEval by Confident AI is an open-source framework for testing large language model systems. Similar to Pytest but designed for LLM outputs, it evaluates metrics such as G-Eval, hallucination, and answer relevancy.
DeepEval can be integrated with SurrealDB to evaluate RAG pipelines — ensuring your LLM applications return relevant, grounded, and faithful responses based on retrieved vector search context.
SurrealDB's native vector engine allows you to store vectors, documents, and metadata in the same database that already stores the rest of your application data.
## Install & run
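A minimal setup might look like the following. The package names and the `surreal start` flags are assumptions based on current releases; check your installed versions, and never use root credentials like these outside local development.

```shell
# Install DeepEval and the SurrealDB Python SDK
pip install deepeval surrealdb

# Start a local, in-memory SurrealDB instance (dev credentials only)
surreal start --user root --pass root memory

# Optional: authenticate so evaluation runs upload to the Confident AI dashboard
deepeval login
```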
## Set up a vector table & index (one-time)
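A sketch of the one-time SurrealQL DDL. The table and index names are illustrative, and `DIMENSION 384` assumes a MiniLM-style embedder; set it to whatever your embedding model produces.

```surql
-- Schemaless table so document metadata stays flexible
DEFINE TABLE document SCHEMALESS;

-- HNSW index over the embedding field; DIMENSION must match your embedder
DEFINE INDEX idx_document_embedding ON document
    FIELDS embedding HNSW DIMENSION 384 DIST COSINE;
```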
## Python helper: Surreal client with add/query
## Unique DeepEval example
We'll test whether an LLM correctly answers "Which fruit is botanically a berry but commonly mistaken for a vegetable?" using information fetched from SurrealDB.
Running this script prints a local score report and uploads the run to the Confident AI dashboard for historical tracking (after you've logged in with `deepeval login`).
Because the context objects already include score and source, DeepEval can show traceability back to the exact SurrealDB rows that justified the answer.
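A sketch of such a test script, assuming DeepEval's `LLMTestCase`, `AnswerRelevancyMetric`, and `FaithfulnessMetric`. The sample row, thresholds, and source path are illustrative stand-ins for real SurrealDB query results; the formatting helper embeds each row's score and source so failures trace back to specific records.

```python
def rows_to_context(rows: list[dict]) -> list[str]:
    """Format SurrealDB result rows so score and source stay traceable."""
    return [
        f"{r['text']} (score={r['score']:.3f}, source={r['metadata']['source']})"
        for r in rows
    ]


def main():
    # DeepEval imports live here so the helper above has no hard dependency
    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
    from deepeval.test_case import LLMTestCase

    # Rows as returned by the SurrealDB KNN query (illustrative values)
    rows = [
        {"text": "Tomatoes are botanically berries.", "score": 0.12,
         "metadata": {"source": "botany.csv#41"}},
    ]

    test_case = LLMTestCase(
        input=("Which fruit is botanically a berry but commonly "
               "mistaken for a vegetable?"),
        actual_output="The tomato: botanically a berry, culinarily a vegetable.",
        retrieval_context=rows_to_context(rows),
    )
    # Thresholds are assumptions; both metrics call out to an LLM judge
    evaluate([test_case], [AnswerRelevancyMetric(threshold=0.7),
                           FaithfulnessMetric(threshold=0.7)])


if __name__ == "__main__":
    main()
```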
## Scaling up
- Replace the quick `rag.add()` list with a real corpus (CSV, PDFs, etc.).
- Encapsulate the embed + insert logic inside a Dagster or Airflow asset if you already orchestrate ETL.
- Use SurrealDB's metadata fields and SurrealQL predicates (e.g. `WHERE metadata.topic = 'law'`) to test retrieval recall for specific slices of your knowledge base.
- Evaluate hundreds of examples by looping through a Hugging Face dataset and appending each `LLMTestCase` to a list before calling `evaluate()`.
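Slicing the knowledge base can be sketched as a single SurrealQL query that combines a metadata predicate with the KNN operator; the table and field names follow the earlier setup and the topic value is illustrative.

```surql
-- Restrict the KNN search to one slice of the corpus before scoring retrieval
SELECT text, metadata, vector::distance::knn() AS score
FROM document
WHERE metadata.topic = 'law' AND embedding <|5|> $vec
ORDER BY score;
```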
## Why SurrealDB + DeepEval?
| Benefit | Why it matters |
|---|---|
| Single data plane | Store documents, vectors and relational metadata together – fewer moving parts. |
| Built-in ANN index | Define HNSW with one DDL statement; no external vector service to deploy. |
| SurrealQL | Flexible `SELECT … WHERE …` queries mix Boolean filters with vector similarity. |
| DeepEval dashboards | Track how retrieval quality + answer faithfulness change as you tweak prompts or embeddings. |
With these snippets you can drop SurrealDB into any DeepEval-based RAG test harness and keep the rest of your metric logic unchanged. Happy evaluating!