I have not failed. I've just found 10,000 ways that won't work.
– Thomas A. Edison
In the rapidly evolving field of AI, most of the time is spent optimizing: you are either maximizing accuracy or minimizing latency. It's therefore easy to find yourself running experiments and iterating whenever you build a RAG solution.
This blog post presents an example of such a process, which helped me play around with some LangChain components, test some prompt engineering tricks, and identify specific use-case challenges (like time awareness).
I also wanted to test some of the ideas in LightRAG. Although I built a much simpler graph (inferring only keywords and not the relationships), the process of reverse engineering LightRAG into a simpler architecture was very insightful.
Before we start, you may want to look at our integrations page for more AI frameworks.
Use case
What data do I have easily available, that I know well? What type of questions could I ask a chatbot?
I chose to download my WhatsApp group chats, and tried to ask questions about things we talked about.
But since I'm too ashamed to show you my private conversations, I asked Gemini to generate some chat conversations about topics I provided (favorite food, movie plan, books, AI). This is one of those conversations (from chattest.txt):
RAG architecture
Ingestion
Let's take a look at how all of this was put together, starting with the ingestion into a vector and a graph store.
DB connection and store instances:
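The connection snippet itself isn't reproduced here, so below is a minimal sketch of what this step looks like. It assumes the surrealdb Python SDK and the langchain-surrealdb integration; the import paths, constructor signatures and connection details are assumptions rather than the exact code.

```python
# Sketch: connect to SurrealDB and create the vector and graph store instances.
# The import paths and constructor signatures for the SurrealDB integration are
# assumptions; check the integrations page for the exact API.
from langchain_ollama import OllamaEmbeddings
from langchain_surrealdb.vectorstores import SurrealDBVectorStore  # assumed path
from langchain_surrealdb.experimental.surrealdb_graph import SurrealDBGraph  # assumed path
from surrealdb import Surreal

# Connection details are placeholders for a local SurrealDB instance
conn = Surreal("ws://localhost:8000/rpc")
conn.signin({"username": "root", "password": "root"})
conn.use("langchain", "demo")

# all-minilm:22m is the embedding model mentioned below, served by Ollama
embeddings = OllamaEmbeddings(model="all-minilm:22m")

vector_store = SurrealDBVectorStore(embeddings, conn)  # assumed signature
graph_store = SurrealDBGraph(conn)                     # assumed signature
```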
Then, I used LangChain document loaders to load my backup files:
Next, the full group chat backup is split into chunks. Ideally each chunk is a different conversation topic; to identify them heuristically, I start a new chunk whenever there is a silence of more than 3 hours between messages (look for max_gap_in_s in ingest.py).
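The loader and splitter code isn't shown here, but the heuristic is easy to sketch. The snippet below uses LangChain's generic TextLoader and a hypothetical split_by_silence helper; the WhatsApp timestamp format and the helper name are assumptions, the real logic lives in ingest.py.

```python
# Sketch of the 3-hour-silence chunking heuristic described above.
# The WhatsApp export line format and helper names are assumptions.
import re
from datetime import datetime

from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document

MAX_GAP_IN_S = 3 * 60 * 60  # 3 hours of silence starts a new chunk

# Example line: "12/08/2025, 18:03 - Liam: See you at 6pm!"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ")

def split_by_silence(text: str, max_gap_in_s: int = MAX_GAP_IN_S) -> list[Document]:
    chunks: list[list[str]] = [[]]
    previous_ts = None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%d/%m/%Y, %H:%M")
            if previous_ts and (ts - previous_ts).total_seconds() > max_gap_in_s:
                chunks.append([])  # long silence: start a new conversation chunk
            previous_ts = ts
        chunks[-1].append(line)
    return [Document(page_content="\n".join(c)) for c in chunks if c]

docs = TextLoader("chattest.txt").load()
conversation_chunks = split_by_silence(docs[0].page_content)
```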
Then, for each chunk, the script populates the vector and graph stores like this:
Destination: vector store
The SurrealDBVectorStore LangChain component generates the corresponding vector embeddings and stores the chunks. It was set up to use the all-minilm:22m embedding model from Ollama when creating its instance.
Destination: graph store
infer keywords (prompts in llm.py)
generate vector embeddings and insert in vector store (same as above)
create a graph that relates keywords with conversations: conversation -> described_by -> keyword (using the SurrealDBGraph LangChain component; the code to generate the graph is in ingest.py, and a sketch follows after this list)
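To make those three steps concrete, here is a rough sketch of the loop over chunks. The infer_keywords helper is hypothetical (a possible shape for it appears in the next section), and I'm assuming SurrealDBGraph accepts LangChain GraphDocument objects the way other LangChain graph stores do; the actual code is in ingest.py.

```python
# Sketch: populate both stores for each conversation chunk.
# `infer_keywords` is a hypothetical helper wrapping the prompts in llm.py,
# and the GraphDocument-based API for SurrealDBGraph is an assumption.
from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship

for chunk in conversation_chunks:
    # 1. vector store: the store's embedding model generates the vectors
    ids = vector_store.add_documents([chunk])

    # 2. infer keywords for this conversation with the LLM (prompts in llm.py)
    keywords = infer_keywords(chunk.page_content)  # e.g. ["movie", "food"]

    # 3. graph store: conversation -> described_by -> keyword
    conversation_node = Node(id=ids[0], type="Conversation")
    keyword_nodes = [Node(id=kw, type="Keyword") for kw in keywords]
    relationships = [
        Relationship(source=conversation_node, target=kw_node, type="described_by")
        for kw_node in keyword_nodes
    ]
    graph_store.add_graph_documents(
        [GraphDocument(nodes=[conversation_node, *keyword_nodes],
                       relationships=relationships, source=chunk)]
    )
```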
Here's the full ingest.py code.
Inferring keywords
To come up with the prompts for the LLM, I compared prompts from different LangChain and Ollama examples, and from the appendix of the LightRAG paper. Then I wrote my own prompts, and after some iterations they ended up looking like this:
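The real prompts are in llm.py; the snippet below is only an illustrative reconstruction of their shape, a few-shot prompt asking a local model for a comma-separated list of keywords (the example conversations and keyword lists are made up).

```python
# Illustrative few-shot keyword-extraction prompt, not the exact one in llm.py.
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

keyword_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Extract the main topics of the conversation as a short, comma-separated "
     "list of lowercase keywords. Return only the keywords."),
    # Few-shot examples: show the model the exact output format we expect
    ("human", "Ben: Fancy pizza tonight?\nChloe: Yes! The usual place at 8?"),
    ("ai", "food, pizza, dinner plan"),
    ("human", "Liam: Has anyone read the new sci-fi novel?\nBen: Halfway through, loving it."),
    ("ai", "books, science fiction"),
    ("human", "{conversation}"),
])

llm = ChatOllama(model="llama3.2", temperature=0.0)
keyword_chain = keyword_prompt | llm

def infer_keywords(conversation: str) -> list[str]:
    response = keyword_chain.invoke({"conversation": conversation})
    return [kw.strip() for kw in response.content.split(",") if kw.strip()]
```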
Here's the full and latest llm.py code.
The use of examples in the prompt is called "few-shot prompting". It's very powerful when a few examples are enough for the LLM to understand the pattern. Learn more about other techniques for prompt optimization in this LangChain blog post.
To visualize what the document -> keyword graph looks like, run the following query using Surrealist:

Retrieval
From the vector store: user query -> generate embedding -> vector search conversations
Source: retrieve.py.
At this step, it's worth checking your k and threshold values.
k: how many chunks we want
threshold: the minimum acceptable similarity score. This is the opposite of "distance", which you could get if you are using a different vector search function.
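In LangChain terms, the sketch below shows where those two values plug in; it assumes the vector store implements similarity_search_with_relevance_scores with scores normalized to the 0–1 range (the actual code is in retrieve.py).

```python
# Sketch: vector search with k and a minimum similarity threshold.
K = 5            # how many chunks to fetch
THRESHOLD = 0.3  # ~30%, a reasonable starting point for this use case

def vector_search(query: str):
    # Relevance scores are normalized to [0, 1]: higher means more similar
    results = vector_store.similarity_search_with_relevance_scores(query, k=K)
    return [doc for doc, score in results if score >= THRESHOLD]

docs = vector_search("Are we going to eat before going to the movies?")
```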
Try out some searches and see the results. What values define the line between good and bad? A threshold that is too low will pollute your context, but if you set it too high you'll leave out documents that contained the knowledge you needed but just didn't match the user's query that well (for example when users like to talk too much, and say please and thank you in the prompt). Again, it depends on the use case, but values around 30% to 55% are a good place to start.
This use case requires "low" threshold values. In the example test above, because I know the data, only the first result is acceptable. Ideally, you should have a list of questions and acceptable results, and programmatically run all the experiments using a real data set. This will let you re-run the same experiments with a different embedding model too.
From the graph: user query -> infer keywords -> search keywords in graph -> get related conversations
The code is in retrieve.py.
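Sketching that path with the hypothetical infer_keywords helper from before: a SurrealQL traversal walks back from keyword nodes to the conversations that are described_by them. The table and field names mirror the ingestion sketch and are assumptions, as is the exact shape of the query response, which varies between SDK versions.

```python
# Sketch: infer keywords from the question, then walk the graph from keyword
# nodes back to related conversations. Names and response shape are assumptions.
def graph_search(query: str) -> set:
    keywords = infer_keywords(query)  # e.g. ["movie", "food", "hunger"]
    # Reverse traversal: keyword <- described_by <- conversation.
    # How keywords are stored (record id vs a name field) depends on the graph
    # component, so this query is a best guess.
    rows = conn.query(
        "SELECT <-described_by<-Conversation AS conversations "
        "FROM Keyword WHERE record::id(id) IN $keywords",
        {"keywords": keywords},
    )
    related: set = set()
    for row in rows:
        related.update(row["conversations"])
    return related
```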
Generation
With both retrieval methods available, there are more than two ways of generating a context. This is what I explored while trying different questions and looking at the retrieved documents and the generated answers.
Here's another example prompt: Are we going to eat before going to the movies?
1. From vector search context
Only the first result is used, since the threshold is set at 30%.
With that first chunk as context, the LLM generated a good answer:
You had previously agreed that meeting at 6pm for food would be a good plan, so it's likely that eating before the movie will be part of the plan. You mentioned grabbing a bite on the South Bank and could get there by 6pm.
llama3.2 (temperature 0.8)
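For reference, the generation step is just the retrieved chunk stuffed into a prompt for llama3.2; roughly something like the sketch below (the prompt wording is illustrative, not the exact one from the repo).

```python
# Sketch: answer generation from the retrieved context.
# Prompt wording is illustrative; the temperature matches the runs shown here.
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

answer_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer the user's question using only the chat excerpts below.\n\n{context}"),
    ("human", "{question}"),
])
chat_llm = ChatOllama(model="llama3.2", temperature=0.8)

def generate_answer(question: str, docs) -> str:
    context = "\n\n".join(doc.page_content for doc in docs)
    response = (answer_prompt | chat_llm).invoke(
        {"context": context, "question": question}
    )
    return response.content

question = "Are we going to eat before going to the movies?"
print(generate_answer(question, vector_search(question)))
```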
2. From graph context
The first step is to generate keywords from the user's question using the LLM. The result for this particular test was: movie, film, food, and hunger, which pointed to three different conversations.
The question was crafted on purpose to get a bad context, because it includes keywords that relate to different conversations. This shows that a semantic knowledge graph requires more than just inferring keywords.
LLM generated:
Hello Liam! Yes, you mentioned that you thought it would be great to grab food beforehand when you go to see "Apex" at the BFI IMAX on Tuesday, 12th August. You suggested meeting at 6 pm for food and then heading to the movie showing at 7:45 pm.
llama3.2 (temperature 0.8)
This doesn't mean simple graphs are useless! For example, with this specific use case, I got bad contexts when asking questions that were time- or location-specific. This could be fixed by representing those implicit links in the graph, so we can trim down the context and remove chunks about the same topic that relate to a different event in the past.
Here I let the LLM infer only the keywords (which are nodes in my graph), but you can also let the LLM infer the edges.
3. From documents that are found in both
This alternative is a way to make both previous methods work together, keeping each other in check: if vector search retrieves documents that are not really relevant, maybe the graph can help trim them out.
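A minimal sketch of that intersection, reusing the two hypothetical helpers above and assuming each chunk's SurrealDB record id is available in its metadata:

```python
# Sketch: keep only the documents that both retrieval methods agree on.
# Assumes each chunk's record id is stored in the document metadata.
def combined_search(query: str):
    vector_docs = vector_search(query)
    graph_ids = graph_search(query)  # set of related conversation record ids
    return [doc for doc in vector_docs if doc.metadata.get("id") in graph_ids]
```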
LLM generated:
You're right, it's your name that I'm responding to now. Yes, you mentioned earlier that when going to see "Apex" at the BFI IMAX, you suggested meeting for food beforehand with Ben and Chloe. So, yes, we are planning to eat before the movie!
llama3.2 (temperature 0.8)
During my tests, it was actually easier to make the graph fail: it takes more resources, it's therefore slower, and it retrieves the wrong documents. But I don't blame the graph here; the issue is that keywords by themselves are not enough for semantic retrieval. LightRAG, for example, also infers relationships, generating a semantic graph. For this use case, I can imagine better ways to use a graph store to improve the results.
Future work
What would be the effects of overlapping chunks?
Define a set of user questions that currently generate non-optimal contexts, then run new experiments against it to optimize accuracy.
Add more nodes and edges to the graph to represent time, location, members, conversation topic, etc.
Conclusions
Set yourself up for success by structuring your code in a way that lets you test variations and measure the results, so you can confidently choose the right options. This applies to prompt engineering, vector indexes, graph edges, model temperatures, and any other tweakable option you think can impact the metric you are optimizing (accuracy, latency, costs, etc.).
A multi-model RAG architecture can either be general (e.g. LightRAG) or use-case specific (a graph that links chats, members, dates, pre-defined edges, etc.). In both cases, the aim is to find better source documents to build a context, and to provide a faster and more cost-effective solution. That said, some use cases don't need to be over-engineered, so make sure you are clear on the metric you are optimizing, and that you try at least one alternative.
Ready to build?
Get started for free with Surreal Cloud.
Any questions or thoughts about this or semantic search using SurrealDB? Feel free to drop by our community to get in touch.
