Indexing & data model considerations

Fundamental to performance in any database is how it handles indexing.

In this lesson, we’re focusing on:

  • User-defined indexes, the indexes you can create as a user of the database, usually referred to as secondary indexes.
  • General data modelling and query optimisation tips.

Creating Indexes

SurrealDB offers a range of indexing capabilities designed to optimise data retrieval and search efficiency.

  • Traditional indexes, such as single-field indexes, composite indexes, and unique indexes.
  • Full-text search indexes and vector search indexes.

We’ll go over traditional indexing examples and a full-text search example, but leave vector search for another time. If you’re interested in vector search right now, you can check out our vector search reference guide, which explains how it works with full-text search and vector functions.

Traditional indexes

A user-defined index is called a secondary index because it’s a secondary data structure that maintains a copy of part of your data. It copies this data from the primary index, the primary data structure that contains all of your data. You can change the primary index by changing the storage layer, as we’ll explore in the next lesson.

That’s enough computer science for now; for the rest of this lesson, we’ll simply refer to secondary indexes as indexes.

The important thing to remember is simply that when you add an index, you are creating a copy of part of your data, which the database has to keep up to date anytime you do a write operation (Create, Update, Delete) to the tables you created the index on.

This means adding an index is always a trade-off: you impact your write performance to improve your read performance. This is why we would never just add an index to everything.

So, where should we add indexes? There is no one-size-fits-all answer, as it depends on how you are querying the database.

We might not even need an index for certain queries as it wouldn’t improve the read performance.

That is generally the case for simple CRUD operations that rely on using Record IDs.

SELECT name, email FROM person:01FTP9H7BG8VDANQPN8J3Y857R;
SELECT name, email FROM person:01FTP9H7BG8VDANQPN8J3Y857R..=01HG9EFC0R8DA8F87VNYP0CD8A;

Such as these SELECT queries, which select two fields from a single record and from a range of records.

The reason it doesn’t need an index is that it’s already using one: the primary index of the storage engine you chose for your storage layer.

SELECT name, email FROM person WHERE id = person:01FTP9H7BG8VDANQPN8J3Y857R;
SELECT name, email FROM person WHERE time.created_at >= d'2022-01-30T20:06:30Z' AND time.created_at <= d'2023-11-27T22:30:23Z';

When we write queries the way we normally might in a relational database, we suddenly need an index: instead of fetching data directly, as Record IDs allow, we’re doing full table scans, resulting in dramatically slower performance.

DEFINE INDEX person_time ON TABLE person FIELDS time.created_at;
SELECT name, email FROM person WHERE time.created_at >= d'2022-01-30T20:06:30Z' AND time.created_at <= d'2023-11-27T22:30:23Z' EXPLAIN;

To improve this, we can define an index on the field we are using in our query with the DEFINE INDEX statement. We can then verify that the query is actually using the index by adding the EXPLAIN clause to the SELECT statement.

If the query plan shows Iterate Index instead of Iterate Table, then we know we are using the index.
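For illustration, the output of an EXPLAIN query is a list of plan steps roughly like the one below. The exact fields vary between SurrealDB versions, so treat this as a sketch; the operation value is the part we’re checking.

[
    {
        detail: { plan: { index: 'person_time' }, table: 'person' },
        operation: 'Iterate Index'
    }
]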

Looking at the performance of the query now, with the index added, we can see a huge improvement. However, it’s still much slower than our original query using Record IDs!

SELECT name, email FROM person WHERE time.created_at >= d'2022-01-30T20:06:30Z' AND time.created_at <= d'2023-11-27T22:30:23Z';

This is why Record IDs are so important, as you’ve heard me say all throughout the course.

The process we just went through is a good way to start optimising your queries (see the sketch after this list):

  • Start by using Record IDs whenever possible
  • If that’s not possible, such as needing a WHERE, GROUP BY, or ORDER BY on fields other than our Record ID
  • Then we look at the common fields that we are using and DEFINE INDEX for those fields
  • We then use the EXPLAIN clause to check if the index is working, or if we need to adjust it.
  • Also, importantly, check whether the index is actually improving performance, as there might not be a noticeable difference for small tables.
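As a compact sketch of this workflow, assuming a hypothetical email field on the person table:

-- 1. No Record ID available, so we filter on a field; EXPLAIN shows 'Iterate Table'
SELECT * FROM person WHERE email = 'tobie@example.com' EXPLAIN;

-- 2. Define an index on the field we filter by
DEFINE INDEX person_email ON TABLE person FIELDS email;

-- 3. Re-run with EXPLAIN and confirm the plan now shows 'Iterate Index',
--    then measure whether the query actually got faster
SELECT * FROM person WHERE email = 'tobie@example.com' EXPLAIN;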

Unique indexes

Moving on to our next example, here we have a composite index, meaning an index that has more than one field. It’s also a UNIQUE index, meaning it has a UNIQUE constraint.

DEFINE INDEX unique_order_relationships ON TABLE order FIELDS in, out UNIQUE;

This is not only for improving read performance: the UNIQUE constraint also ensures that the order table only ever contains a unique combination of the in and out fields. Since the in and out fields signify a relationship, this means we cannot insert a duplicate relationship into the order table.

Another way of looking at this is that the UNIQUE index functions in a similar way to a multi-field schema constraint, because as we saw in part 3, where we made things schemafull, you can only DEFINE a single TABLE or a single FIELD at a time.

So why does this statement fail? What we are effectively doing here is making sure each person can only order the same product once. However, since we already have persons who ordered the same product multiple times, building the index gives us an error. This shows that indexes apply to your entire table, not just to data written after the index was created. Adding an index like this doesn’t really make sense unless such an order limit actually exists on the business side.

A more appropriate index is one on the wishlist table, such that each person can only wishlist a product once:

DEFINE INDEX unique_wishlist_relationships ON TABLE wishlist FIELDS in, out UNIQUE;
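As a minimal sketch of what this constraint does in practice (the record IDs here are hypothetical):

-- The first wishlist relationship succeeds
RELATE person:tobie->wishlist->product:keyboard;

-- Repeating the exact same relationship now fails with an index violation,
-- because this (in, out) pair already exists in the unique index
RELATE person:tobie->wishlist->product:keyboard;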

Full-Text Search Indexes

DEFINE ANALYZER IF NOT EXISTS blank_snowball TOKENIZERS blank FILTERS lowercase, snowball(english);
DEFINE INDEX IF NOT EXISTS review_content ON TABLE review FIELDS review_text SEARCH ANALYZER blank_snowball BM25;

In addition to traditional indexes, SurrealDB also supports full-text search indexes. We won’t cover these in detail here, as they’re highly configurable, with many advanced features like basic and advanced text matching, proximity searches, result ranking, and keyword highlighting. If you want to dive deeper, check out our reference guide for full-text search.

The definition is made up of two parts.

First, we define the ANALYZER, which uses TOKENIZERS to split our text into words. Here we are using blank to split on spaces, while the lowercase and snowball(english) FILTERS lowercase and stem each word.

Then we DEFINE INDEX like we normally would, on either a single text FIELD or multiple, add the SEARCH ANALYZER we defined previously, and select the BM25 ranking algorithm, BM simply meaning Best Match.

SELECT id, rating, review_text FROM review
WHERE review_text @@ 'wear nonstop';

Once we’ve done this, we can use the double @ (@@) matches operator in our SELECT statement to perform the search.

From the results we can see it’s not just doing an exact match: thanks to the stemming filter, searching for ‘wear’ also matches related word forms. Take this review as an example, which was generated using the llama3 LLM.
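If we also want ranking or highlighting on top of matching, the search functions covered in the reference guide pair with a numbered version of the matches operator. A minimal sketch, reusing the index defined above:

SELECT id, rating, review_text,
    search::score(1) AS score,
    search::highlight('<b>', '</b>', 1) AS highlighted
FROM review
WHERE review_text @1@ 'wear nonstop'
ORDER BY score DESC;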

One big table (OBT) vs third normal form (3NF)

Moving on to data modelling, it’s important to know that the perfect data model doesn’t exist.

The best way to model your data sits on a spectrum. At one end is One Big Table (OBT), which, as the name suggests, advocates for putting as much as possible in a single table.

At the other end of the spectrum is third normal form (3NF), the traditional data modelling approach for relational databases since the 1970s. 3NF advocates for putting as little as possible in a single table.
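As a tiny illustration of the two ends of the spectrum (the table and record IDs here are hypothetical):

-- OBT-leaning: embed the product details inside each order record
CREATE order:one SET product = { name: 'Keyboard', price: 59.99 }, quantity = 2;

-- 3NF-leaning: keep products in their own table and link to them
CREATE product:keyboard SET name = 'Keyboard', price = 59.99;
CREATE order:two SET product = product:keyboard, quantity = 2;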

I cannot tell you exactly how to model your data as the most performant data model is generally the one that most closely matches the business logic of your application.

Therefore it’s important to be able to prototype and iterate quickly, then lock down your schema once you have found what works. That is the reason this course follows this particular journey, from schemaless prototyping to schemafull constraints.

Summary

Database performance is all about finding the right balance.

Adding an index is a balance between impacting your write performance and improving your read performance.

Finding the most performant data model is a balance between putting your data in as few tables as possible and splitting it across as many as possible.

Therefore it’s important to enjoy the journey and always be improving.

I hope you have enjoyed the journey so far and I’ll see you in our last lesson.