Version: 1.x

Vector Search

SurrealDB supports Full-Text Search and Vector Search. Full-Text Search (FTS) involves indexing documents using the FTS index and breaking down the content of each document into smaller tokens with the help of analyzers and tokenizers.

Vector Search in SurrealDB is introduced to support efficient and accurate searching of high-dimensional data. This guide will walk you through the essentials of working with vectors in SurrealDB, from storing vector embeddings to performing computations and optimizing searches with various indexing strategies.

Vector search is a search mechanism that goes beyond traditional keyword matching and text-based search methods to capture deeper characteristics and similarities between data.

It converts data such as text, images, or sounds into numerical vectors, called vector embeddings.

You can think of vector embeddings as cells. Just as cells form the basic structural and biological unit of all known living organisms, vector embeddings serve as the basic units of data representation in vector search.

Vector search isn't new to the world of data science. Gerard Salton, known as the Father of Information Retrieval, introduced the Vector Space Model, cosine similarity, and TF-IDF-style term weighting for information retrieval in the 1960s and 1970s.

If you’re interested in understanding vector search in depth, check out this academic paper on Vector Retrieval written by Sebastian Bruch.

Google search for the word “lead”

The image above shows a Google search for the word “lead”. The search has pulled up different definitions of the word: “lead” can mean taking initiative, and it is also the chemical element with the symbol Pb.

Now let’s add some context to the word. Consider a database of liquid samples that records the harmful chemicals found in each sample.

In the example below, we have a table called liquids with a sample field and a content field. We can create a Full-Text Search index on the content field by first defining an analyzer called liquid_analyzer, and then defining an index on the content field of the liquids table that uses our custom analyzer (liquid_analyzer) to search through the index.
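A minimal sketch of that setup might look like the following (the sample data, tokenizers, and filters here are illustrative choices, not a prescribed configuration):

-- Illustrative sample data for the liquids table
INSERT INTO liquids [
    { sample: 'Sea water', content: 'The sea water contains some amount of lead' },
    { sample: 'Tap water', content: 'The team lead by Dr. Rose found out that the tap water was potable' }
];

-- Define a custom analyzer and a Full-Text Search index on the content field
DEFINE ANALYZER liquid_analyzer TOKENIZERS class FILTERS lowercase, snowball(english);
DEFINE INDEX liquid_content ON TABLE liquids FIELDS content SEARCH ANALYZER liquid_analyzer BM25;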

Then, using a SELECT statement to retrieve all the samples containing the chemical lead will also bring up samples that merely mention the word lead.
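As a sketch, the query could look like this (using the illustrative index defined above); the @@ operator matches documents against the Full-Text Search index:

-- Full-Text Search: matches every document whose content contains the token 'lead'
SELECT sample, content FROM liquids WHERE content @@ 'lead';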

If you read through the content of the tap water sample, you’ll notice that it does not contain any lead; it only mentions the word lead in the phrase “The team lead by Dr. Rose…”, meaning that the team was guided by Dr. Rose.

The search pulled up both records even though the tap water sample had no lead in it. This example shows us that while Full-Text Search does a great job at matching query terms with indexed documents, it may not be the best solution for use cases where the query terms carry deeper context and scope for ambiguity.

Vector Search in SurrealDB


The Vector Search feature of SurrealDB will help you do more and dig deeper into your data.

For example, building on the same liquids data, you can store the chemical composition of the liquid samples in a vector format in a liquidsVector table.

-- Insert records with sample, content, and embedding fields into a liquidsVector table
INSERT INTO liquidsVector [
    { sample: 'Sea water', content: 'The sea water contains some amount of lead', embedding: [0.1, 0.2, 0.3, 0.4] },
    { sample: 'Tap water', content: 'The team lead by Dr. Rose found out that the tap water was potable', embedding: [1.0, 0.1, 0.4, 0.3] },
    { sample: 'Sewage water', content: 'High amounts of a were found in Sewage water', embedding: [0.4, 0.3, 0.2, 0.1] }
];

Notice that we have added an embedding field to the table. This field will store the vector embeddings of the content field so we can perform vector searches on it.

You have three different approaches for performing a vector search: a brute-force K-nearest-neighbour query, the HNSW index, or the M-Tree index.

As you want to perform a nearest neighbour search rather than an exact match, you would typically use an index such as HNSW or M-Tree.
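As a minimal sketch (the query vector, index name, and parameters below are illustrative, and cosine is only one of the supported distance metrics), you could rank the samples by cosine similarity directly, or define an HNSW index and use the K-nearest-neighbour operator:

-- Brute force: rank the samples by cosine similarity against an illustrative query vector
SELECT sample, content,
    vector::similarity::cosine(embedding, [0.15, 0.25, 0.35, 0.45]) AS similarity
FROM liquidsVector
ORDER BY similarity DESC
LIMIT 1;

-- Indexed: define an HNSW index over the 4-dimensional embeddings...
DEFINE INDEX idx_hnsw_embedding ON TABLE liquidsVector FIELDS embedding HNSW DIMENSION 4 DIST COSINE;

-- ...then ask for the single approximate nearest neighbour to the query vector
SELECT sample, content, vector::distance::knn() AS distance
FROM liquidsVector
WHERE embedding <|1,40|> [0.15, 0.25, 0.35, 0.45];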

In the example above, you can see that the results are more accurate: the search pulled up only the Sea water sample, which actually contains the harmful substance lead. This is the advantage of using Vector Search over Full-Text Search.

Another use case for Vector Search is in the field of facial recognition. For example, if you wanted to search for an actor or actress who looked like you from an extensive dataset of movie artists, you would first use an ML model to convert the artists' images and details into vector embeddings, and then use SurrealQL to find the artist whose embeddings most closely resemble your face's vector embeddings. The more characteristics you include in your vector embeddings, the higher the dimensionality of your vectors will be, potentially improving the accuracy of the matches but also increasing the complexity of the vector search.

Now that you know how to handle a vector search query in SurrealDB, let's take a step back and understand some of its terms and concepts.

How to store vector embeddings

To store vectors in SurrealDB, you typically define a field within your data schema dedicated to holding the vector data. These vectors represent data points in a high-dimensional space and can be used for various applications, from recommendation systems to image recognition. Below is an example of how to create records with vector embeddings:

CREATE Document:1 CONTENT {
    "items": [
        {
            "content": "apple",
            "embedding": [0.00995, -0.02680, -0.01881, -0.08697]
        }
    ]
};

The vector is represented as an array of numbers, commonly using floating-point values. However, depending on the model, it can also be an array of integers. The exact type can be defined in the index.
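A sketch of pinning down the type, both on the field (for a schemafull table) and on the index, might look like this (the table, field name, index name, and types are illustrative):

-- Declare the embedding field as an array of floating-point numbers
DEFINE FIELD embedding ON TABLE liquidsVector TYPE array<float>;

-- A vector index can also state the numeric type used to store the vectors (here 32-bit floats)
DEFINE INDEX idx_mtree_embedding ON TABLE liquidsVector FIELDS embedding MTREE DIMENSION 4 TYPE F32 DIST EUCLIDEAN;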

There are no strict rules or limitations regarding the length of the embeddings, and they can be as large as needed. Just keep in mind that larger embeddings mean more data to process, which can affect performance and query times depending on your physical hardware.

Computation on Vectors: "vector::" Package of Functions

SurrealDB provides Vector Functions for most of the major numerical computations done on vectors. They include functions for element-wise addition, division and even normalisation.

They also include similarity and distance functions, which help in understanding how similar or dissimilar two vectors are. Usually, the vector with the smallest distance or the largest cosine similarity value (closest to 1) is deemed the most similar to the item you are trying to search for.
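For a feel of these functions, here are a few standalone calls (the input vectors are arbitrary; the expected results are shown as comments):

RETURN vector::add([1, 2, 3], [4, 5, 6]);            -- [5, 7, 9]
RETURN vector::normalize([3, 4]);                     -- [0.6, 0.8]
RETURN vector::similarity::cosine([1, 1], [2, 2]);    -- 1 (same direction)
RETURN vector::distance::euclidean([1, 2], [4, 6]);   -- 5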


The choice of distance or similarity function depends on the nature of your data and the specific requirements of your application.

In the liquids examples, we assumed that the embeddings represented the harmfulness of lead (as a substance). We used the vector::similarity::cosine function because cosine similarity is typically preferred when absolute distances are less important, but proportions and direction matter more.

In the facial recognition example, this means we are less concerned with two noses being exactly the same size and more interested in two faces having similar proportions. As cosine similarity measures the orientation of vectors rather than their absolute magnitudes, it helps capture the overall similarity of facial features and proportions while being less sensitive to factors like scale or image brightness.

Vector Indexes

When it comes to search, you can always use brute force. In SurrealDB, you can use the brute-force approach to search through your vector embeddings without defining any index.

Brute-force search compares a query vector against every vector in the dataset to find the closest matches; because the entire dataset is scanned, no index needs to be created.
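A sketch of a brute-force query (no index required; the distance metric, K value, and query vector are illustrative):

-- Compare the query vector against every stored embedding and keep the 2 closest by Euclidean distance
SELECT sample, content, vector::distance::knn() AS distance
FROM liquidsVector
WHERE embedding <|2,EUCLIDEAN|> [0.15, 0.25, 0.35, 0.45];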

The brute force approach for finding the nearest neighbour is generally preferred in the following use cases:

  • Small Datasets / Limited Query vectors: For applications with small datasets, the overhead of building and maintaining an index might outweigh its benefits. In such cases, the brute force approach is optimal.

  • Guaranteed Accuracy: Since the brute force method compares the query vector against every vector in the dataset, it guarantees finding the exact nearest vectors based on the chosen distance metric (like Euclidean, Manhattan, etc.).

  • Benchmarking Models: The brute force approach can be used as a reference to benchmark the accuracy and performance of approximate alternatives such as HNSW.

While brute force can give you exact results, it's computationally expensive for large datasets.

In most cases, you do not need a 100% exact match, and you can trade that exactness for faster searches in high-dimensional spaces that find the approximate nearest neighbours to a query vector.

This is where Vector indexes come in.

In SurrealDB, you can perform a vector search using one of two secondary indexes (a definition sketch for both follows the list):

  1. M-Tree Index:
    • The M-Tree index is a metric tree-based index suitable for similarity search in metric spaces.
    • The M-Tree index can be configured with parameters such as the distance function to compare the vectors.
  2. Hierarchical Navigable Small World (HNSW) Index:
    • HNSW (Hierarchical Navigable Small World) is a state-of-the-art algorithm for approximate nearest neighbour search in high-dimensional spaces. It offers a balance between search efficiency and accuracy.
    • The HNSW index is a proximity graph-based index.
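As a hedged sketch of how each index might be defined and queried (index names, dimensions, distance metrics, and tuning parameters such as EFC and M are illustrative; in practice you would pick one index per field):

-- M-Tree index over the embeddings, compared with Euclidean distance
DEFINE INDEX mt_embedding ON TABLE liquidsVector FIELDS embedding MTREE DIMENSION 4 DIST EUCLIDEAN;
-- With an M-Tree index, the KNN operator only needs the number of neighbours
SELECT sample FROM liquidsVector WHERE embedding <|2|> [0.15, 0.25, 0.35, 0.45];

-- HNSW index; EFC and M trade build/search effort against accuracy
DEFINE INDEX hnsw_embedding ON TABLE liquidsVector FIELDS embedding HNSW DIMENSION 4 DIST EUCLIDEAN EFC 150 M 12;
-- With HNSW, the second number in the operator is the search effort (EF)
SELECT sample FROM liquidsVector WHERE embedding <|2,40|> [0.15, 0.25, 0.35, 0.45];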

By design, HNSW currently operates as an "in-memory" structure. Introducing persistence to this feature, while beneficial for retaining index states, is an ongoing area of development. Our goal is to balance the speed of data ingestion with the advantages of persistence.

You can also use the REBUILD statement, which allows for the manual rebuilding of indexes as needed (e.g. if you restart the server). This approach ensures that while we explore persistence options, we maintain the optimal performance that users expect from HNSW and M-Tree, providing flexibility and control over the indexing process.
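For example, a sketch of rebuilding an index after a restart (using the illustrative index name from the sketch above):

-- Rebuild the HNSW index so its in-memory structure is reconstructed from the stored data
REBUILD INDEX hnsw_embedding ON TABLE liquidsVector;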

Both indexes are designed to handle the challenges of searching in spaces where traditional indexing methods become inefficient. The choice between HNSW and M-Tree would depend on the application's specific requirements, such as the need for an exact versus approximate nearest neighbour search, the distance metric used, and the nature of the data.

Conclusion

Vector Search does not need to be complicated or overwhelming. Once you have your embeddings available, you can try out different vector functions in combination with your query vector to see what works best for your use case. As discussed in this reference guide, you have three options to perform Vector Search in SurrealDB, and you can choose between them based on the complexity of your data and your accuracy expectations. You can design your SELECT statements to query your search results along with filters and conditions, and to avoid recalculating the KNN distance for every single query, you also have the vector::distance::knn() function.

Due to GenAI, most applications today deal with intricate data with layered meanings and characteristics. Vector search plays a big role in analyzing such data to find what you’re looking for or to make informed decisions.

You can start using Vector Search in SurrealDB by installing SurrealDB on your machines or by using Surrealist. And if you’re looking for a quick video explaining Vector Search, check out our YouTube channel.