Fundamental to performance in any database is how it handles indexing.
In this lesson, we’re focusing on indexing in SurrealDB, which offers a range of indexing capabilities designed to optimise data retrieval and search efficiency.
We’ll go over traditional indexing examples and a full-text search example, but leave vector search for another time. If you’re interested in vector search right now, you can check out our vector search reference guide, which explains how it works with full-text search and vector functions.
A user-defined index is called a secondary index because it’s a secondary data structure that maintains a copy of part of your data. It copies that data from the primary index, the primary data structure that contains all of your data. You can change the primary index by changing the storage layer, as we’ll explore in the next lesson.
That’s enough computer science for now; for the rest of this lesson we are just going to refer to secondary indexes simply as indexes.
The important thing to remember is simply that when you add an index, you are creating a copy of part of your data, which the database has to keep up to date anytime you do a write operation (Create, Update, Delete) to the tables you created the index on.
This means adding an index is always a compromise: you impact your write performance to improve your read performance. That is why we would never just add an index to everything.
So, where should we add indexes? There is no one-size-fits-all answer to this, as it depends on how you are querying the database.
We might not even need an index for certain queries as it wouldn’t improve the read performance.
That is generally the case for simple CRUD operations that rely on using Record IDs.
SELECT name, email FROM person:01FTP9H7BG8VDANQPN8J3Y857R;
SELECT name, email FROM person:01FTP9H7BG8VDANQPN8J3Y857R..=01HG9EFC0R8DA8F87VNYP0CD8A;
Such as this SELECT query, which selects two fields from a single record or from a range of records.
The reason it doesn’t need an index is that it’s already using one: the primary index of the storage engine you chose for your storage layer.
SELECT name, email FROM person WHERE id = person:01FTP9H7BG8VDANQPN8J3Y857R;
SELECT name, email FROM person WHERE time.created_at >= d'2022-01-30T20:06:30Z' AND time.created_at <= d'2023-11-27T22:30:23Z';
When we write our queries the way we normally might in a relational database, we suddenly need an index, because we’re doing table scans instead of fetching data directly the way Record IDs allow, resulting in dramatically slower performance.
DEFINE INDEX person_time ON TABLE person FIELDS time.created_at;
SELECT name, email FROM person WHERE time.created_at >= d'2022-01-30T20:06:30Z' AND time.created_at <= d'2023-11-27T22:30:23Z' EXPLAIN;
To improve this, we can define an index on the field we are using in our query with the DEFINE INDEX statement. Then we can make sure the query is actually using the index by adding the EXPLAIN clause to the SELECT statement.
If we see Iterate Index instead of Iterate Table, then we know we are using the index.
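For reference, EXPLAIN returns the query plan as data. For the indexed range query above, the output looks roughly like this (a sketch only; the exact fields inside detail vary between SurrealDB versions):

[
    {
        detail: {
            plan: {
                index: 'person_time'
            },
            table: 'person'
        },
        operation: 'Iterate Index'
    }
]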
Looking at the performance of the query now with the added index, we can see that there is a huge improvement in performance. However, it’s still much slower than our original query using Record IDs!
SELECT name, email FROM person WHERE time.created_at >= d'2022-01-30T20:06:30Z' AND time.created_at <= d'2023-11-27T22:30:23Z';
This is why Record IDs are so important, as you’ve heard me say all throughout the course.
The process we just went through is a good way to start optimising your queries, which is as follows:

1. Look for queries that use WHERE, GROUP BY or ORDER BY on fields other than our Record ID.
2. DEFINE INDEX for those fields.
3. Use the EXPLAIN clause to check if the index is working, or if we need to adjust it.
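For example, suppose we often look up people by email, a field other than the Record ID. A minimal sketch of the process (the email address is made up, and the UNIQUE constraint assumes every person has their own email):

DEFINE INDEX person_email ON TABLE person FIELDS email UNIQUE;
SELECT name, email FROM person WHERE email = 'tobie@example.com' EXPLAIN;
-- The plan should now report Iterate Index instead of Iterate Table.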
DEFINE INDEX unique_order_relationships ON TABLE order FIELDS in, out UNIQUE;
Moving on to our next example, here we have a composite index, meaning an index that has more than one field. It’s also a UNIQUE index, meaning it has a UNIQUE constraint. This is not only for improving read performance: the UNIQUE constraint also ensures that the order table always contains a unique combination of the in and out fields. Since the in and out fields signify a relationship, this means we cannot insert a duplicate relationship into the order table.

Another way of looking at this is that the UNIQUE index we are creating here functions in a similar way to a multi-field schema constraint, because, as we saw in part 3 where we made things schemafull, you can only DEFINE a single TABLE or a single FIELD at a time.

Since a person might legitimately order the same product more than once, a more appropriate index would be one on the wishlist, such that each person can only wishlist a product once.

DEFINE INDEX unique_wishlist_relationships ON TABLE wishlist FIELDS in, out UNIQUE;
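To see the constraint in action, here’s a minimal sketch (the record IDs are invented for illustration, and the exact error text may differ between versions):

RELATE person:tobie->wishlist->product:phone;
-- Running the exact same statement a second time now fails with an error
-- along the lines of: Database index `unique_wishlist_relationships`
-- already contains [person:tobie, product:phone]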
DEFINE ANALYZER IF NOT EXISTS blank_snowball TOKENIZERS blank FILTERS lowercase, snowball(english);
DEFINE INDEX IF NOT EXISTS review_content ON TABLE review FIELDS review_text SEARCH ANALYZER blank_snowball BM25;
In addition to traditional indexes, SurrealDB also supports full-text search indexes. We won’t cover these in detail here, as they’re highly configurable with many advanced features, such as basic and advanced text matching, proximity searches, result ranking, and keyword highlighting. If you want to dive deeper, check out our reference guide for full-text search.
The definition is made up of two parts. First, we define the ANALYZER, which uses TOKENIZERS to split our text into words; here we are using blank to split based on spaces. The FILTERS then normalise the resulting tokens, in this case lowercasing them and applying the English snowball stemmer.
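To get a feel for what the analyzer does, we can run it against a sample string with the search::analyze function (a sketch; the sample sentence is made up and the output shown is approximate):

RETURN search::analyze('blank_snowball', 'Wearing them nonstop');
-- ['wear', 'them', 'nonstop']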
Then we also need to DEFINE INDEX like we normally would, on either a single text FIELD or multiple ones, and add the SEARCH ANALYZER we defined previously, selecting the BM25 ranking algorithm (BM simply meaning Best Match).
SELECT id, rating, review_text FROM review WHERE review_text @@ 'wear nonstop';
Once we’ve done this, we can use the double ‘at’ (@@) characters, also known as the matches operator, in our SELECT statement to perform the search.
We can see from the results that it’s not just doing an exact search. If we take this review as an example (our sample reviews were generated using the llama3 LLM), the query matches variations of our search terms, thanks to the stemming in our analyzer.
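The same index also powers the result ranking and keyword highlighting features mentioned earlier. Here’s a brief sketch building on the query above: the number in the @1@ operator is a match reference, tying the predicate to the search::score and search::highlight functions.

SELECT id, rating, search::score(1) AS score, search::highlight('<b>', '</b>', 1) AS review_text FROM review WHERE review_text @1@ 'wear nonstop' ORDER BY score DESC;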
Moving on to data modelling, it’s important to know that the perfect data model doesn’t exist.
The best way to model your data sits on a spectrum. At one end is One Big Table (OBT), which, as the name suggests, advocates for putting as much as possible in a single table. At the other end is third normal form (3NF), the traditional data modelling approach for relational databases since the 1970s, which advocates for putting as little as possible in a single table.
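To make the two ends concrete, here’s a sketch of the same purchase modelled both ways (the table and field names are made up for illustration):

-- One Big Table: the product details are embedded in the purchase record
CREATE purchase SET product = { name: 'Phone', price: 599 }, quantity = 1;

-- Third normal form: the purchase stores only a record link to a separate product table
CREATE product:phone SET name = 'Phone', price = 599;
CREATE purchase SET product = product:phone, quantity = 1;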
I cannot tell you exactly how to model your data as the most performant data model is generally the one that most closely matches the business logic of your application.
Therefore it’s important to be able to prototype and iterate quickly, then lock down your schema once you have found what works. That is the reason this course follows this particular journey, from schemaless prototyping to schemafull constraints.
Database performance is all about finding the right balance.
Adding an index is a balance between impacting your write performance and improving your read performance.
Finding the most performant data model is a balance between putting everything in as few tables as possible and spreading it across as many tables as possible.
Therefore it’s important to enjoy the journey and always be improving.
I hope you have enjoyed the journey so far and I’ll see you in our last lesson.