
Analyzers and tokenizers

Learn how analyzers turn raw text into searchable tokens: tokenizers, filters, and testing with search::analyze before you build an index.

Full-text search does not compare your query string to a document byte for byte. Instead, SurrealDB tokenizes the text into terms, optionally filters those terms (case folding, stemming, and more), and indexes what comes out.

If you are new to FTS in SurrealDB, read the overview first. This guide walks through analyzers from the ground up; exact grammar, every clause, and diagrams live under DEFINE ANALYZER.

Roughly, processing flows like this:

  1. Function — an optional FUNCTION clause transforms the raw input string once (for example, normalising punctuation or stripping markup) via a user-defined function that accepts and returns a string.

  2. Tokenizers — split the string into tokens (words, symbols, or other chunks) using one or more built-in tokenizers.

  3. Filters — transform each token (lowercase, strip accents, stem, n-grams, and so on).
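As a sketch of the optional first step, a pre-processing function can be attached with the FUNCTION clause. The function name fn::normalise and its body here are hypothetical; any user-defined function that takes and returns a string will do:

DEFINE FUNCTION fn::normalise($input: string) -> string {
    -- Hypothetical pre-processing: fold curly apostrophes to plain ones
    RETURN $input.replace('’', "'");
};

DEFINE ANALYZER cleaned FUNCTION fn::normalise TOKENIZERS blank FILTERS lowercase;

The function runs once over the whole input string, before any tokenizer sees it.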

The same analyzer is used when indexing and matching queries, so spending time here pays off for relevance and performance.

Use search::analyze() to print the token array an analyzer would produce, which is ideal for experimentation.

Start with the simplest split, whitespace-only tokenization:

DEFINE ANALYZER words TOKENIZERS blank;

RETURN search::analyze("words", "hello world");

Output

[
'hello',
'world'
]

Once you are happy with the tokens, attach the analyzer name to a full-text index and query with @@ (covered in Search indexes and Scoring and ranking).
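For example, an index using the words analyzer above might look like the following; the table name article and field content are placeholders for your own schema:

DEFINE INDEX article_content ON TABLE article FIELDS content SEARCH ANALYZER words BM25;

SELECT * FROM article WHERE content @@ 'hello';

Because the index records the analyzer name, the same tokenization and filtering is applied to both the indexed field and the query term.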

Tokenizers answer: where are the boundaries between tokens? Some examples of tokenizers are blank, camel, and class.

DEFINE ANALYZER example_blank TOKENIZERS blank;
RETURN search::analyze("example_blank", "hello world");

DEFINE ANALYZER example_camel TOKENIZERS camel;
RETURN search::analyze("example_camel", "helloWorld");

DEFINE ANALYZER example_class TOKENIZERS class;
RETURN search::analyze("example_class", "123abc!XYZ");

Filters answer: what should each token look like before indexing?

Some examples of filters are ascii, snowball, and ngram.

DEFINE ANALYZER example_ascii TOKENIZERS class FILTERS ascii;
RETURN search::analyze("example_ascii", "résumé café");
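Filters run in the order they are listed, so they are commonly chained. A sketch that lowercases before stripping accents, so "Résumé" and "resume" index to the same term:

DEFINE ANALYZER example_folded TOKENIZERS class FILTERS lowercase, ascii;
RETURN search::analyze("example_folded", "Résumé CAFÉ");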

DEFINE ANALYZER english_snowball TOKENIZERS class FILTERS snowball(english);
DEFINE ANALYZER german_snowball TOKENIZERS class FILTERS snowball(german);

RETURN [
search::analyze("english_snowball", "Looking at some running cats"),
search::analyze("german_snowball", "Sollen wir was trinken gehen?")
];

DEFINE ANALYZER example_ngram TOKENIZERS class FILTERS ngram(1, 3);
RETURN search::analyze("example_ngram", "apple banana");
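A related filter is edgengram, which keeps only the n-grams anchored at the start of each token, making it a common choice for autocomplete-style prefix matching. A sketch:

DEFINE ANALYZER example_edge TOKENIZERS class FILTERS lowercase, edgengram(2, 5);
RETURN search::analyze("example_edge", "apple");

Unlike ngram, which emits substrings from every position, edgengram emits only prefixes, keeping the index considerably smaller.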

The mapper(path) filter rewrites tokens using a tab-separated file: canonical form first, variant second, one pair per line. That supports lemmatisation beyond what stemming alone catches, or normalising arbitrary phrases (for example mapping multilingual error strings to a single code).

Point path at a file the server can read. Here is a very short example dictionary:

drive	driven
drive	drives
drive	drove
swim	swam

An analyzer making use of this dictionary can be defined as follows:

DEFINE ANALYZER lemme_english TOKENIZERS blank, class FILTERS lowercase, mapper('/path/to/lemmatization-en.txt');

RETURN search::analyze("lemme_english", "He drove and swam");

To add an analyzer only if it is missing, or to replace an existing definition, use IF NOT EXISTS or OVERWRITE on DEFINE ANALYZER. Examples and caveats are in the statement reference.
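The two forms look like this:

DEFINE ANALYZER IF NOT EXISTS words TOKENIZERS blank;

DEFINE ANALYZER OVERWRITE words TOKENIZERS blank FILTERS lowercase;

The first is a no-op when the analyzer already exists; the second replaces the existing definition outright.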
