Skip to main content

DEFINE INDEX statement

Just like in other databases, SurrealDB uses indexes to help optimize query performance. An index can consist of one or more fields in a table and can enforce a uniqueness constraint. If you don't intend for your index to have a uniqueness constraint, then the fields you select for your index should have a high degree of cardinality, meaning that there is a high amount of diversity between the data in the indexed table records.

Requirements

Statement syntax

SurrealQL Syntax
DEFINE INDEX [ OVERWRITE | IF NOT EXISTS ] @name ON [ TABLE ] @table [ FIELDS | COLUMNS ]
@fields
[ UNIQUE
| SEARCH ANALYZER @analyzer [ BM25 [(@k1, @b)] ] [ HIGHLIGHTS ]
| MTREE DIMENSION @dimension [ TYPE @type ] [ DIST @distance ] [ CAPACITY @capacity]
| HNSW DIMENSION @dimension [ TYPE @type ] [DIST @distance] [ EFC @efc ] [ M @m ]
]
[ COMMENT @string ]
[ CONCURRENTLY ]

Index Types

SurrealDB offers a range of indexing capabilities designed to optimize data retrieval and search efficiency.

Unique Index

Ensures each value in the index is unique. A unique index helps enforce uniqueness across records by preventing duplicate entries in fields such as user IDs, email addresses, and other unique identifiers.

Let's create a unique index for the email address field on a user table.

-- Makes sure that the email address in the user table is always unique
DEFINE INDEX userEmailIndex ON TABLE user COLUMNS email UNIQUE;

The created index can be tested using the INFO statement.

INFO FOR TABLE user;

The INFO statement will help you understand what indexes are defined in your TABLE.

{
"events": {},
"fields": {},
"indexes": {
"userEmailIndex": {
sql: "DEFINE INDEX userEmailIndex ON user FIELDS email UNIQUE"
}
},
"lives": {},
"tables": {}
}

As we defined a UNIQUE index on the email column, a duplicate entry for that column or field will throw an error.

-- Create a user record and set an email ID.
CREATE user:1 SET email = 'test@surrealdb.com';
Response
[
{
"email": "test@surrealdb.com",
"id": "user:1"
}
]

Creating another record with the same email ID will throw an error.

-- Create another user record and set the same email ID.
CREATE user:2 SET email = 'test@surrealdb.com';
Response
Database index `userEmailIndex` already contains 'test@surrealdb.com',
with record `user:1`

To set the same email for user:2, the original record must be deleted

DELETE user:1;
CREATE user:2 SET email = 'test@surrealdb.com'
[
{
"email": "test@surrealdb.com",
"id": "user:2"
}
]

Non-Unique Index

This allows for the indexing of attributes that may have non-unique values, facilitating efficient data retrieval. Non-unique indexes help index frequently appearing data in queries that do not require uniqueness, such as categorization tags or status indicators.

Let's create a non-unique index for an age field on a user table.

-- optimise queries looking for users of a given age
DEFINE INDEX userAgeIndex ON TABLE user COLUMNS age;

Composite Index

The composite index spans multiple fields and columns of a table. Similar to a single field index, composite indexes can also be UNIQUE

-- Create an index on the account and email fields of the user table
DEFINE INDEX test ON user FIELDS account, email;

Full-Text Search Index

Enables efficient searching through textual data, supporting advanced text-matching features like proximity searches and keyword highlighting.

The Full-Text search index helps implement comprehensive search functionalities in applications, such as searching through articles, product descriptions, and user-generated content.

Let's create a full-text search index for a name field on a user table.

-- Allow full-text search queries on the name of the user
DEFINE INDEX userNameIndex ON TABLE user COLUMNS name SEARCH ANALYZER ascii BM25 HIGHLIGHTS;
  • SEARCH: By using the SEARCH keyword, you enable full-text search on the specified column.
  • ANALYZER ascii: Uses a custom analyzer called ascii for tokenizing and analyzing the text.
  • BM25: Ranking algorithm used for relevance scoring.
  • HIGHLIGHTS: Allows keyword highlighting in search results output when using the search::highlight function

Vector Search Indexes

Vector search indexes in SurrealDB support efficient k-nearest neighbors (kNN) and Approximate Nearest Neighbor (ANN) operations, which are pivotal in performing similarity searches within complex, high-dimensional datasets and data types.

Types

When defining a vector index with MTREE or HNSW, you can define the types the vector will be stored in. The TYPE clause is optional and can be used to specify the data type of the vector. SurrealDB supports the following types: F64 | F32 | I64 | I32 | I16

  • F64: Represents 64-bit floating-point numbers (double precision floating-point numbers).
  • F32: Represents 32-bit floating-point numbers (single precision floating-point numbers).
  • I64: Represents 64-bit signed integers.
  • I32: Represents 32-bit signed integers.
  • I16: Represents 16-bit signed integers.

Note: In SurrealDB the default type for vectors isF64.

For example, to define a vector index with 64-bit signed integers, you can use the following query:

DEFINE INDEX idx_mtree_embedding ON Document FIELDS items.embedding MTREE DIMENSION 4 TYPE I64;

M-Tree index

Available since: v1.3.0

The M-Tree index is suitable for the task of finding exact nearest neighbors based on a distance metric (like Euclidean distance). Mtree currently supports Euclidean, Cosine, Manhattan and Minkowski distance functions.

Note: When no function is specified it chooses Euclidean for the default distance function.

Capacity

The CAPACITY clause is used to specify the maximum number of records that can be stored in the index. This is useful when you want to limit the number of records stored in the index to optimize performance by default the capacity is set to 40.

Example usage: Define a M-Tree index on a table with MANHATTAN and COSINE distance

For example, consider an index made for records with 4 dimensional vectors using the MANHATTAN and COSINE distance function:

DEFINE INDEX idx_mtree_embedding_manhattan ON Document FIELDS items.embedding MTREE DIMENSION 4 DIST MANHATTAN;
DEFINE INDEX idx_mtree_embedding_cosine ON Document FIELDS items.embedding MTREE DIMENSION 4 DIST COSINE;

Because the Document table has a unique index on the items.embedding field, you can use the index to perform vector searches to find records based on the distance from a given point. Any vector dimensions that don't match 4 dimensions will throw an error. 'Incorrect vector dimension (3). Expected a vector of 4 dimension.'

Example usage: Perform a vector search using an M-Tree index with just a Distance

Another example is to create an M-tree index on a table with 3-dimensional vectors.

CREATE pts:1 SET point = [1,2,3];
CREATE pts:2 SET point = [4,5,6];
CREATE pts:3 SET point = [8,9,10];

To define an M-tree index on the point field of the pts table, you can use the query below:

DEFINE INDEX mt_pt ON pts FIELDS point MTREE DIMENSION 3;

In the above example, the MTREE DIMENSION clause specifies the dimension of the vector. The point field is a 3-dimensional vector array, so we set the dimension to 3. If this is successfully created, you can use the index to perform vector searches to find records based on the distance from a given point.

HNSW (Hierarchical Navigable Small World)

Available since: v1.5.0

This method uses a graph-based approach to efficiently navigate and search in high-dimensional spaces. While it is an approximate technique, it offers a high-performance balance between speed and accuracy, making it ideal for very large datasets.

Note: Keep in mind the in-memory nature of HNSW when considering system resource allocation.

In the example above, you may notice the EFC and M parameters. These are optional to your query but are parameters of the HNSW algorithm and can be used to tune the index for better performance. At a glance

  • M (Max Connections per Element): Defines the maximum number of bi-directional links (neighbors) per node in each layer of the graph, except for the lowest layer. This parameter controls the connectivity and overall structure of the network. Higher values of MM generally improve search accuracy but increase memory usage and construction time.

  • EFC (EF construction): Stands for "exploration factor during construction." This parameter determines the size of the dynamic list for the nearest neighbor candidates during the graph construction phase. A larger efConstruction value leads to a more thorough construction, improving the quality and accuracy of the search but increasing construction time. The default value is 150.

  • M0 (Max Connections in the Lowest Layer): Similar to M, but specifically for the bottom layer (the base layer) of the graph. This layer contains the actual data points. M0 is often set to twice the value of M to enhance search performance and connectivity at the base layer, at the cost of increased memory usage.

  • LM (Multiplier for Level Generation): Used to determine the maximum level ll for a new element during its insertion into the hierarchical structure. It is used in the formula l←⌊−ln⁡(unif(0..1))⋅mL⌋, where unif(0..1) is a uniform random variable between 0 and 1. This parameter influences the distribution of elements across different levels, impacting the overall balance and efficiency of the search structure.

Note: You can only provide TYPE, M, and EFC. SurrealDB automatically computes M0 and LM with the most appropriate value. If not specified, M AND EFC are set to 12 and 150, respectively.

Brute Force method

The Brute Force method is suitable for tasks with smaller datasets or when the highest accuracy is required. Brute Force currently supports Euclidean, Cosine, Manhattan and Minkowski distance functions.

In the example below, the query searches for points closest to the vector [2,3,4,5] and uses vector functions to calculate the distance between two points, indicated by <|2|>.

Verifying Index Utilization in Queries

The EXPLAIN clause from SurrealQL helps you understand the execution plan of the query and provides transparency around index utilization.

SELECT * FROM user WHERE email='test@surrealdb.com' EXPLAIN FULL;

It also reveals details about which operation was used by the query planner and how many records matched the search criteria.

[
{
"detail": {
"plan": {
"index": "userEmailIndex",
"operator": "=",
"value": "test@surrealdb.com"
},
"table": "user"
},
"operation": "Iterate Index"
},
{
"detail": {
"count": 1
},
"operation": "Fetch"
}
]

Rebuilding Indexes

Available since: v1.5.0

Indexes can be rebuilt using the REBUILD statement. This can be useful when you want to update the index definition or when you want to rebuild the index to optimize performance.

You may want to rebuild an index overtime to ensure that the index is up-to-date with the latest data in the table.

REBUILD INDEX userEmailIndex ON user;

Using IF NOT EXISTS clause

Available since: v1.3.0

The IF NOT EXISTS clause can be used to define an index only if it does not already exist. You should use the IF NOT EXISTS clause when defining a index in SurrealDB if you want to ensure that the index is only created if it does not already exist. If the index already exists, the DEFINE INDEX statement will return an error.

It's particularly useful when you want to safely attempt to define a index without manually checking its existence first.

On the other hand, you should not use the IF NOT EXISTS clause when you want to ensure that the index definition is updated regardless of whether it already exists. In such cases, you might prefer using the OVERWRITE clause, which allows you to define a index and overwrite an existing one if it already exists, ensuring that the latest version of the index definition is always in use

-- Create a INDEX if it does not already exist
DEFINE INDEX IF NOT EXISTS example ON example FIELDS example;

Using OVERWRITE clause

Available since: v2.0.0

The OVERWRITE clause can be used to define an index and overwrite an existing one if it already exists. You should use the OVERWRITE clause when you want to modify an existing index definition. If the index already exists, the DEFINE INDEX statement will overwrite the existing definition with the new one.

-- Create an INDEX and overwrite if it already exists
DEFINE INDEX OVERWRITE example ON example FIELDS example;

Using CONCURRENTLY clause

Available since: v2.0.0

Building indexes can be lengthy and may time out before they're completed. Use the CONCURRENTLY option to build the index without blocking operations. The statement will return immediately, allowing you to monitor the index-building progress by executing the INFO statement.

-- Create an INDEX concurrently
DEFINE INDEX test ON user FIELDS email CONCURRENTLY;
-- Check the indexing status
INFO FOR TABLE example;
{
events: {},
fields: {},
tables: {},
indexes: {
test: {
building: { status: 'built' },
sql: 'DEFINE INDEX example ON example FIELDS example'
}
},
lives: {},
}

Performance Implications

When defining indexes, it's essential to consider the fields most frequently queried or used to optimize performance.

Indexes may improve the performance of SurrealQL statements. This may not be noticeable with small tables but it can be significant for large tables; especially when the indexed fields are used in the WHERE clause of a SELECT statement.

Indexes can also impact the performance of write operations (INSERT, UPDATE, DELETE) since the index needs to be updated accordingly. Therefore, it's essential to balance the need for read performance with write performance.