
DEFINE INDEX statement

Just like in other databases, SurrealDB uses indexes to help optimize query performance. An index can consist of one or more fields in a table and can enforce a uniqueness constraint. If you don't intend for your index to have a uniqueness constraint, then the fields you select for your index should have a high degree of cardinality, meaning that there is a high amount of diversity between the data in the indexed table records.
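One rough way to gauge the cardinality of a candidate field is to count how many records share each of its values. The query below is a minimal sketch that assumes a user table with an age field (both names are illustrative). Many small groups suggest high cardinality, while a few large groups suggest low cardinality.

```surql
-- Count how many records share each value of the candidate field
SELECT age, count() AS total FROM user GROUP BY age;
```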

Requirements

Statement syntax

SurrealQL Syntax
DEFINE INDEX [ IF NOT EXISTS ] @name ON [ TABLE ] @table [ FIELDS | COLUMNS ] @fields
[ UNIQUE
| SEARCH ANALYZER @analyzer [ BM25 [(@k1, @b)] ] [ HIGHLIGHTS ]
| MTREE DIMENSION @dimension [ TYPE @type ] [ DIST @distance ] [ CAPACITY @capacity ]
| HNSW DIMENSION @dimension [ TYPE @type ] [ DIST @distance ] [ EFC @efc ] [ M @m ]
]

Index Types

SurrealDB offers a range of indexing capabilities designed to optimize data retrieval and search efficiency.

Unique Index

Ensures each value in the index is unique. A unique index helps enforce uniqueness across records by preventing duplicate entries in fields such as user IDs, email addresses, and other unique identifiers.

Let's create a unique index for the email address field on a user table.

-- Makes sure that the email address in the user table is always unique
DEFINE INDEX userEmailIndex ON TABLE user COLUMNS email UNIQUE;

The created index can be verified using the INFO statement.

INFO FOR TABLE user;

The INFO statement will help you understand what indexes are defined in your TABLE.

{
"events": {},
"fields": {},
"indexes": {
"userEmailIndex": "DEFINE INDEX userEmailIndex ON user FIELDS email UNIQUE"
},
"lives": {},
"tables": {}
}

As we defined a UNIQUE index on the email column, a duplicate entry for that column or field will throw an error.

-- Create a user record and set an email ID. 
CREATE user:1 SET email = 'test@surrealdb.com';
Response
[
{
"email": "test@surrealdb.com",
"id": "user:1"
}
]

Creating another record with the same email ID will throw an error.

-- Create another user record and set the same email ID.
CREATE user:2 SET email = 'test@surrealdb.com';
Response
Database index `userEmailIndex` already contains 'test@surrealdb.com', 
with record `user:1`

To set the same email for user:2, the original record must be deleted first.

DELETE user:1;
CREATE user:2 SET email = 'test@surrealdb.com';
Response
[
{
"email": "test@surrealdb.com",
"id": "user:2"
}
]

Non-Unique Index

A non-unique index covers fields whose values may repeat across records, while still allowing efficient data retrieval. Non-unique indexes suit data that appears frequently in queries but does not need to be unique, such as categorization tags or status indicators.

Let's create a non-unique index for an age field on a user table.

-- optimise queries looking for users of a given age
DEFINE INDEX userAgeIndex ON TABLE user COLUMNS age;
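As a minimal sketch of how such an index might be exercised, the query below filters on the indexed age field and appends EXPLAIN so that the query plan is returned instead of the records. The value 30 is an arbitrary, illustrative choice.

```surql
-- Filter on the indexed field and ask for the execution plan
SELECT * FROM user WHERE age = 30 EXPLAIN;
```

If the planner picks up userAgeIndex, the plan should name that index rather than report a full table scan.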

Composite Index

A composite index spans multiple fields (columns) of a table. Like a single-field index, a composite index can also be UNIQUE.

-- Create an index on the account and email fields of the user table
DEFINE INDEX test ON user FIELDS account, email;
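To illustrate a composite index that is also UNIQUE, here is a hedged sketch on a hypothetical member table. The table name, index name, and sample values are assumptions made for illustration. With this index, uniqueness applies to the combination of account and email rather than to each field separately.

```surql
-- Hypothetical composite UNIQUE index: the (account, email) pair must be unique
DEFINE INDEX accountEmailIndex ON TABLE member FIELDS account, email UNIQUE;

CREATE member:1 SET account = 'acme', email = 'a@example.com';

-- Allowed: same email, but a different account
CREATE member:2 SET account = 'globex', email = 'a@example.com';

-- Expected to be rejected: both account and email duplicate member:1
CREATE member:3 SET account = 'acme', email = 'a@example.com';
```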

Full-Text Search Index

Enables efficient searching through textual data, supporting advanced text-matching features like proximity searches and keyword highlighting.

The Full-Text search index helps implement comprehensive search functionalities in applications, such as searching through articles, product descriptions, and user-generated content.

Let's create a full-text search index for a name field on a user table.

-- Allow full-text search queries on the name of the user
DEFINE INDEX userNameIndex ON TABLE user COLUMNS name SEARCH ANALYZER ascii BM25 HIGHLIGHTS;
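The index above refers to an analyzer named ascii, which has to exist before the index is defined. The sketch below shows one plausible analyzer definition and a query that uses the index. The tokenizer and filter choices, the sample record, and the search term are all illustrative assumptions.

```surql
-- Define the analyzer first; run this before the DEFINE INDEX statement
DEFINE ANALYZER ascii TOKENIZERS class FILTERS lowercase, ascii;

-- Sample record (illustrative data)
CREATE user SET name = 'Tobie Morgan Hitchcock';

-- The 1 in @1@ ties this predicate to search::score(1)
-- and to the last argument of search::highlight
SELECT id,
    search::score(1) AS score,
    search::highlight('<b>', '</b>', 1) AS name
FROM user
WHERE name @1@ 'tobie'
ORDER BY score DESC;
```

search::score relies on the BM25 clause and search::highlight relies on the HIGHLIGHTS clause of the index definition.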

Vector Search Indexes

Vector search indexes in SurrealDB support efficient k-nearest neighbors (kNN) and Approximate Nearest Neighbor (ANN) operations, which are pivotal in performing similarity searches within complex, high-dimensional datasets and data types.

Types

When defining a vector index with MTREE or HNSW, you can specify the numeric type in which the vector is stored. The TYPE clause is optional. SurrealDB supports the following types: F64 | F32 | I64 | I32 | I16

  • F64: Represents 64-bit floating-point numbers (double precision floating-point numbers).
  • F32: Represents 32-bit floating-point numbers (single precision floating-point numbers).
  • I64: Represents 64-bit signed integers.
  • I32: Represents 32-bit signed integers.
  • I16: Represents 16-bit signed integers.

Note: In SurrealDB the default type for vectors is F32.

For example, to define a vector index with 64-bit signed integers, you can use the following query:

DEFINE INDEX idx_mtree_embedding ON Document FIELDS items.embedding MTREE DIMENSION 4 TYPE I64;

M-Tree index (since 1.3.0)

The M-Tree index is suitable for finding exact nearest neighbors based on a distance metric (such as Euclidean distance). M-Tree currently supports the Euclidean, Cosine, Manhattan and Minkowski distance functions.

Note: When no distance function is specified, Euclidean is used by default.

Capacity

The CAPACITY clause is used to specify the maximum number of records that can be stored in the index. This is useful when you want to limit the number of records stored in the index to optimize performance. By default, the capacity is set to 40.

Example usage: Define an M-Tree index on a table with MANHATTAN and COSINE distance

For example, consider indexes made for records with 4-dimensional vectors using the MANHATTAN and COSINE distance functions:

DEFINE INDEX idx_mtree_embedding_manhattan ON Document FIELDS items.embedding MTREE DIMENSION 4 DIST MANHATTAN;
DEFINE INDEX idx_mtree_embedding_cosine ON Document FIELDS items.embedding MTREE DIMENSION 4 DIST COSINE;

Because the Document table has an M-Tree index on the items.embedding field, you can use the index to perform vector searches to find records based on the distance from a given point. Any vector whose dimension is not 4 will throw an error such as 'Incorrect vector dimension (3). Expected a vector of 4 dimension.'

Example usage: Perform a vector search using an M-Tree index with just a Distance

Another example is to create an M-tree index on a table with 3-dimensional vectors.

CREATE pts:1 SET point = [1,2,3];
CREATE pts:2 SET point = [4,5,6];
CREATE pts:3 SET point = [8,9,10];

To define an M-tree index on the point field of the pts table, you can use the query below:

DEFINE INDEX mt_pt ON pts FIELDS point MTREE DIMENSION 3;

In the above example, the MTREE DIMENSION clause specifies the dimension of the vector. The point field is a 3-dimensional vector array, so we set the dimension to 3. Once the index is successfully created, you can use it to perform vector searches to find records based on the distance from a given point.
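As a minimal sketch of such a search, the query below asks for the two records nearest to a query point using the kNN operator <|2|>, and reports the Euclidean distance to each match. The query vector [2,3,4] is an illustrative assumption.

```surql
-- Find the 2 nearest neighbours of [2,3,4] on the indexed point field
SELECT id, vector::distance::euclidean(point, [2,3,4]) AS dist
FROM pts
WHERE point <|2|> [2,3,4];
```

With the three records created above, pts:1 (distance about 1.73) and pts:2 (distance about 3.46) are the two closest points, while pts:3 is about 10.39 away.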

HNSW (Hierarchical Navigable Small World) (since 1.5.0)

This method uses a graph-based approach to efficiently navigate and search in high-dimensional spaces. While it is an approximate technique, it offers a high-performance balance between speed and accuracy, making it ideal for very large datasets.

Note: Keep in mind the in-memory nature of HNSW when considering system resource allocation.

The syntax above includes the EFC and M parameters. These are optional, but they are parameters of the HNSW algorithm and can be used to tune the index for better performance. At a glance:

  • M (Max Connections per Element): Defines the maximum number of bi-directional links (neighbors) per node in each layer of the graph, except for the lowest layer. This parameter controls the connectivity and overall structure of the network. Higher values of M generally improve search accuracy but increase memory usage and construction time.

  • EFC (EF construction): Stands for "exploration factor during construction." This parameter determines the size of the dynamic list of nearest neighbor candidates during the graph construction phase. A larger EFC value leads to a more thorough construction, improving the quality and accuracy of the search but increasing construction time. The default value is 150.

  • M0 (Max Connections in the Lowest Layer): Similar to M, but specifically for the bottom layer (the base layer) of the graph. This layer contains the actual data points. M0 is often set to twice the value of M to enhance search performance and connectivity at the base layer, at the cost of increased memory usage.

  • LM (Multiplier for Level Generation): Used to determine the maximum level l for a new element during its insertion into the hierarchical structure. It is used in the formula l ← ⌊−ln(unif(0..1)) · mL⌋, where unif(0..1) is a uniform random variable between 0 and 1 and mL is the multiplier LM. This parameter influences the distribution of elements across different levels, impacting the overall balance and efficiency of the search structure.

Note: You can only provide TYPE, M and EFC. M0 and LM are automatically computed by SurrealDB with the most appropriate value. If not specified, M and EFC are set to 12 and 150 respectively.
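A minimal sketch of an HNSW definition, following the statement syntax and writing the default EFC and M values out explicitly, might look like this. The table name pts_hnsw, the index name, and the sample vectors are illustrative assumptions.

```surql
-- Sample records with 4-dimensional vectors (illustrative data)
CREATE pts_hnsw:1 SET point = [1,2,3,4];
CREATE pts_hnsw:2 SET point = [4,5,6,7];
CREATE pts_hnsw:3 SET point = [8,9,10,11];

-- HNSW index on the point field; EFC 150 and M 12 match the documented defaults
DEFINE INDEX hnsw_pts ON pts_hnsw FIELDS point HNSW DIMENSION 4 DIST EUCLIDEAN EFC 150 M 12;
```

Raising M or EFC trades memory and build time for accuracy, as described in the list above.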

Brute Force method

The Brute Force method is suitable for smaller datasets or when the highest accuracy is required. Brute Force currently supports the Euclidean, Cosine, Manhattan and Minkowski distance functions.

For example, a query can search for the two points closest to the vector [2,3,4,5] with the kNN operator <|2|>, using vector::distance::euclidean to calculate the distance between two points.
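The following is a hedged sketch of such a query on a table without any vector index. The table name pts_bf and the sample vectors are illustrative assumptions. The operator form follows the description above. Depending on the SurrealDB release, the operator may need the distance named explicitly (for example <|2,EUCLIDEAN|>), so treat the exact form as an assumption. The second query avoids the operator entirely and performs an exhaustive scan with ORDER BY and LIMIT.

```surql
-- Sample records with 4-dimensional vectors (illustrative data, no index defined)
CREATE pts_bf:1 SET point = [1,2,3,4];
CREATE pts_bf:2 SET point = [4,5,6,7];
CREATE pts_bf:3 SET point = [8,9,10,11];

LET $pt = [2,3,4,5];

-- kNN operator form: return the 2 points nearest to $pt
SELECT id, vector::distance::euclidean(point, $pt) AS dist
FROM pts_bf
WHERE point <|2|> $pt;

-- Exhaustive alternative: compute every distance, sort, keep the 2 smallest
SELECT id, vector::distance::euclidean(point, $pt) AS dist
FROM pts_bf
ORDER BY dist
LIMIT 2;
```

For this data, the Euclidean distances from $pt are 2 for pts_bf:1, 4 for pts_bf:2 and 12 for pts_bf:3, so the first two records are the nearest neighbours.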

Verifying Index Utilization in Queries

The EXPLAIN clause from SurrealQL helps you understand the execution plan of the query and provides transparency around index utilization.

SELECT * FROM user WHERE email = 'test@surrealdb.com' EXPLAIN FULL;

It also reveals details about which operation was used by the query planner and how many records matched the search criteria.

[
{
"detail": {
"plan": {
"index": "userEmailIndex",
"operator": "=",
"value": "test@surrealdb.com"
},
"table": "user"
},
"operation": "Iterate Index"
},
{
"detail": {
"count": 1
},
"operation": "Fetch"
}
]

Using IF NOT EXISTS clause (since 1.3.0)

The IF NOT EXISTS clause can be used to define an index only if it does not already exist. If the index already exists, the DEFINE INDEX statement will return an error.

-- Create an INDEX if it does not already exist
DEFINE INDEX IF NOT EXISTS example ON example FIELDS example;

Performance Implications

When defining indexes, it's essential to consider the fields most frequently queried or used to optimize performance.

Indexes may improve the performance of SurrealQL statements. This may not be noticeable with small tables, but it can be significant for large tables, especially when the indexed fields are used in the WHERE clause of a SELECT statement.

Indexes can also impact the performance of write operations (INSERT, UPDATE, DELETE) since the index needs to be updated accordingly. Therefore, it's essential to balance the need for read performance with write performance.