Version: 1.x

Vector Search

SurrealDB supports Full-Text Search and Vector Search. Full-Text Search (FTS) involves indexing documents using the FTS index and breaking down the content of each document into smaller tokens with the help of analyzers and tokenizers.

Vector Search in SurrealDB is introduced to support efficient and accurate searching of high-dimensional data. This guide will walk you through the essentials of working with vectors in SurrealDB, from storing vector embeddings to performing computations and optimizing searches with various indexing strategies.

Vector search is a search mechanism that goes beyond traditional keyword matching and text-based search methods to capture deeper characteristics and similarities between data.

It converts data such as text, images, or sounds into numerical vectors, called vector embeddings.

You can think of vector embeddings as cells. Just as cells form the basic structural and biological unit of all known living organisms, vector embeddings serve as the basic units of data representation in vector search.

Vector search isn't new to the world of data science. Gerard Salton, known as the Father of Information Retrieval, introduced the Vector Space Model, cosine similarity, and TF-IDF-style term weighting for information retrieval in the 1960s and 1970s.

If you’re interested in understanding vector search in depth, check out this academic paper on Vector Retrieval written by Sebastian Bruch.

Google search for the word “lead”

The image above shows a Google search for the word “lead”. The search has pulled up different definitions of the word: “lead” can mean taking initiative, and it is also the chemical element with the symbol Pb.

Now let’s add some context to the word. Consider a database of liquid samples that records the harmful chemicals found in each sample.

In the example below, we have a table called liquids with a sample field and a content field. We can create a Full-Text Search index on the content field by first defining an analyzer called liquid_analyzer, and then defining an index on the content field of the liquids table that uses our custom analyzer (liquid_analyzer) to search through the index.
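A minimal sketch of that setup might look like the following (the sample data, tokenizers, and filters here are illustrative choices, not a prescribed configuration):

-- Illustrative sample data for the liquids table
INSERT INTO liquids [
    { sample: 'Sea water', content: 'The sea water contains some amount of lead' },
    { sample: 'Tap water', content: 'The team lead by Dr. Rose found out that the tap water was potable' }
];

-- Define a custom analyzer and a Full-Text Search index on the content field
DEFINE ANALYZER liquid_analyzer TOKENIZERS class FILTERS lowercase, snowball(english);
DEFINE INDEX liquid_content ON TABLE liquids FIELDS content SEARCH ANALYZER liquid_analyzer BM25;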

Then, using a SELECT statement to retrieve all the samples containing the chemical lead will also bring up samples that merely mention the word lead.
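As a sketch, the query could look like this (using the illustrative index defined above); the @@ operator matches documents against the Full-Text Search index:

-- Full-Text Search: matches every document whose content contains the token 'lead'
SELECT sample, content FROM liquids WHERE content @@ 'lead';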

If you read through the content of the tap water sample, you’ll notice that it does not contain any lead; it only mentions the word lead in the phrase “The team lead by Dr. Rose…”, meaning that the team was guided by Dr. Rose.

The search pulled up both records even though the tap water sample had no lead in it. This example shows us that while Full-Text Search does a great job at matching query terms with indexed documents, it may not be the best solution for use cases where the query terms carry deeper context and scope for ambiguity.

Vector Search in SurrealDB


The Vector Search feature of SurrealDB will help you do more and dig deeper into your data.

For example, building on the same liquids data, you can store the chemical composition of the liquid samples in a vector format in a liquidsVector table.

-- Insert records with sample, content, and embedding fields into a liquidsVector table
INSERT INTO liquidsVector [
    { sample: 'Sea water', content: 'The sea water contains some amount of lead', embedding: [0.1, 0.2, 0.3, 0.4] },
    { sample: 'Tap water', content: 'The team lead by Dr. Rose found out that the tap water was potable', embedding: [1.0, 0.1, 0.4, 0.3] },
    { sample: 'Sewage water', content: 'High amounts of a were found in Sewage water', embedding: [0.4, 0.3, 0.2, 0.1] }
];

Notice that we have added an embedding field to the table. This field will store the vector embeddings of the content field so we can perform vector searches on it.

You have three different approaches for performing a vector search: a brute-force K-nearest-neighbour query, the HNSW index, or the M-Tree index.

As you want to perform a nearest neighbour search rather than an exact match, you would typically use an index such as HNSW or M-Tree.
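As a minimal sketch (the query vector, index name, and parameters below are illustrative, and cosine is only one of the supported distance metrics), you could rank the samples by cosine similarity directly, or define an HNSW index and use the K-nearest-neighbour operator:

-- Brute force: rank the samples by cosine similarity against an illustrative query vector
SELECT sample, content,
    vector::similarity::cosine(embedding, [0.15, 0.25, 0.35, 0.45]) AS similarity
FROM liquidsVector
ORDER BY similarity DESC
LIMIT 1;

-- Indexed: define an HNSW index over the 4-dimensional embeddings...
DEFINE INDEX idx_hnsw_embedding ON TABLE liquidsVector FIELDS embedding HNSW DIMENSION 4 DIST COSINE;

-- ...then ask for the single approximate nearest neighbour to the query vector
SELECT sample, content, vector::distance::knn() AS distance
FROM liquidsVector
WHERE embedding <|1,40|> [0.15, 0.25, 0.35, 0.45];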

In the example above, you can see that the results are more accurate: the search pulled up only the Sea water sample, which actually contains the harmful substance lead. This is the advantage of using Vector Search over Full-Text Search.

Another use case for Vector Search is in the field of facial recognition. For example, if you wanted to search for an actor or actress who looked like you from an extensive dataset of movie artists, you would first use an ML model to convert the artists' images and details into vector embeddings, and then use SurrealQL to find the artist whose embeddings most closely resemble your face's vector embeddings. The more characteristics you include in your vector embeddings, the higher the dimensionality of your vectors will be, potentially improving the accuracy of the matches but also increasing the complexity of the vector search.

Now that you know how to handle a vector search query in SurrealDB, let's take a step back and understand some of its terms and concepts.

How to store vector embeddings

To store vectors in SurrealDB, you typically define a field within your data schema dedicated to holding the vector data. These vectors represent data points in a high-dimensional space and can be used for various applications, from recommendation systems to image recognition. Below is an example of how to create records with vector embeddings:

CREATE Document:1 CONTENT {
    "items": [
        {
            "content": "apple",
            "embedding": [0.00995, -0.02680, -0.01881, -0.08697]
        }
    ]
};

The vector is represented as an array of numbers, commonly using floating-point values. However, depending on the model, it can also be an array of integers. The exact type can be defined in the index.
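A sketch of pinning down the type, both on the field (for a schemafull table) and on the index, might look like this (the table, field name, index name, and types are illustrative):

-- Declare the embedding field as an array of floating-point numbers
DEFINE FIELD embedding ON TABLE liquidsVector TYPE array<float>;

-- A vector index can also state the numeric type used to store the vectors (here 32-bit floats)
DEFINE INDEX idx_mtree_embedding ON TABLE liquidsVector FIELDS embedding MTREE DIMENSION 4 TYPE F32 DIST EUCLIDEAN;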

There are no strict rules or limitations regarding the length of the embeddings, and they can be as large as needed. Just keep in mind that larger embeddings mean more data to process, which can affect performance and query times depending on your physical hardware.

Computation on Vectors: "vector::" Package of Functions

SurrealDB provides Vector Functions for most of the major numerical computations done on vectors. They include functions for element-wise addition, division and even normalisation.

They also include similarity and distance functions, which help in understanding how similar or dissimilar two vectors are. Usually, the vector with the smallest distance or the largest cosine similarity value (closest to 1) is deemed the most similar to the item you are trying to search for.
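For a feel of these functions, here are a few standalone calls (the input vectors are arbitrary; the expected results are shown as comments):

RETURN vector::add([1, 2, 3], [4, 5, 6]);            -- [5, 7, 9]
RETURN vector::normalize([3, 4]);                     -- [0.6, 0.8]
RETURN vector::similarity::cosine([1, 1], [2, 2]);    -- 1 (same direction)
RETURN vector::distance::euclidean([1, 2], [4, 6]);   -- 5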


The choice of distance or similarity function depends on the nature of your data and the specific requirements of your application.

In the liquids examples, we assumed that the embeddings represented the harmfulness of lead (as a substance). We used the vector::similarity::cosine function because cosine similarity is typically preferred when absolute distances are less important, but proportions and direction matter more.

In the facial recognition example, this means we are less concerned with two noses being exactly the same size and more interested in two faces having similar proportions. As cosine similarity measures the orientation of vectors rather than their absolute magnitudes, it helps capture the overall similarity of facial features and proportions while being less sensitive to factors like scale or image brightness.

Vector Indexes

When it comes to search, you can always use brute force. In SurrealDB, you can use the brute-force approach to search through your vector embeddings without defining any index.

Brute-force search compares a query vector against every vector in the dataset to find the closest matches; because the entire dataset is scanned, no index needs to be created.
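A sketch of a brute-force query (no index required; the distance metric, K value, and query vector are illustrative):

-- Compare the query vector against every stored embedding and keep the 2 closest by Euclidean distance
SELECT sample, content, vector::distance::knn() AS distance
FROM liquidsVector
WHERE embedding <|2,EUCLIDEAN|> [0.15, 0.25, 0.35, 0.45];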

The brute force approach for finding the nearest neighbour is generally preferred in the following use cases:

  • Small Datasets / Limited Query vectors: For applications with small datasets, the overhead of building and maintaining an index might outweigh its benefits. In such cases, the brute force approach is optimal.

  • Guaranteed Accuracy: Since the brute force method compares the query vector against every vector in the dataset, it guarantees finding the exact nearest vectors based on the chosen distance metric (like Euclidean, Manhattan, etc.).

  • Benchmarking Models: The brute force approach can be used as a reference to benchmark the accuracy and performance of approximate alternatives such as HNSW.

While brute force can give you exact results, it's computationally expensive for large datasets.

In most cases, you do not need a 100% exact match, and you can trade that exactness for faster searches in high-dimensional spaces that find the approximate nearest neighbours to a query vector.

This is where Vector indexes come in.

In SurrealDB, you can perform a vector search using one of two secondary indexes (a definition sketch for both follows the list):

  1. M-Tree Index:
    • The M-Tree index is a metric tree-based index suitable for similarity search in metric spaces.
    • The M-Tree index can be configured with parameters such as the distance function to compare the vectors.
  2. Hierarchical Navigable Small World (HNSW) Index:
    • HNSW (Hierarchical Navigable Small World) is a state-of-the-art algorithm for approximate nearest neighbour search in high-dimensional spaces. It offers a balance between search efficiency and accuracy.
    • The HNSW index is a proximity graph-based index.
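As a hedged sketch of how each index might be defined and queried (index names, dimensions, distance metrics, and tuning parameters such as EFC and M are illustrative; in practice you would pick one index per field):

-- M-Tree index over the embeddings, compared with Euclidean distance
DEFINE INDEX mt_embedding ON TABLE liquidsVector FIELDS embedding MTREE DIMENSION 4 DIST EUCLIDEAN;
-- With an M-Tree index, the KNN operator only needs the number of neighbours
SELECT sample FROM liquidsVector WHERE embedding <|2|> [0.15, 0.25, 0.35, 0.45];

-- HNSW index; EFC and M trade build/search effort against accuracy
DEFINE INDEX hnsw_embedding ON TABLE liquidsVector FIELDS embedding HNSW DIMENSION 4 DIST EUCLIDEAN EFC 150 M 12;
-- With HNSW, the second number in the operator is the search effort (EF)
SELECT sample FROM liquidsVector WHERE embedding <|2,40|> [0.15, 0.25, 0.35, 0.45];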

By design, HNSW currently operates as an "in-memory" structure. Introducing persistence to this feature, while beneficial for retaining index states, is an ongoing area of development. Our goal is to balance the speed of data ingestion with the advantages of persistence.

You can also use the REBUILD statement, which allows for the manual rebuilding of indexes as needed (e.g. if you restart the server). This approach ensures that while we explore persistence options, we maintain the optimal performance that users expect from HNSW and M-Tree, providing flexibility and control over the indexing process.
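For example, a sketch of rebuilding an index after a restart (using the illustrative index name from the sketch above):

-- Rebuild the HNSW index so its in-memory structure is reconstructed from the stored data
REBUILD INDEX hnsw_embedding ON TABLE liquidsVector;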

Both indexes are designed to handle the challenges of searching in spaces where traditional indexing methods become inefficient. The choice between HNSW and M-Tree would depend on the application's specific requirements, such as the need for an exact versus approximate nearest neighbour search, the distance metric used, and the nature of the data.

Conclusion

Vector Search does not need to be complicated or overwhelming. Once you have your embeddings available, you can try out different vector functions in combination with your query vector to see what works best for your use case. As discussed in this reference guide, you have three options to perform Vector Search in SurrealDB, and you can choose between them based on the complexity of your data and your accuracy expectations. You can design your SELECT statements to query your search results along with filters and conditions, and to avoid recalculating the KNN distance for every single query, you also have the vector::distance::knn() function.

Due to GenAI, most applications today deal with intricate data with layered meanings and characteristics. Vector search plays a big role in analyzing such data to find what you’re looking for or to make informed decisions.

You can start using Vector Search in SurrealDB by installing SurrealDB on your machines or by using Surrealist. And if you’re looking for a quick video explaining Vector Search, check out our YouTube channel.