
Analyzers and tokenizers

Learn how analyzers turn raw text into searchable tokens: tokenizers, filters, and testing with search::analyze before you build an index.

Full-text search does not compare your query string to a document byte for byte. Instead, SurrealDB tokenizes the text into terms, optionally filters those terms (case folding, stemming, and more), and indexes what comes out.

If you are new to FTS in SurrealDB, read the overview first. This guide walks through analyzers from the ground up; exact grammar, every clause, and diagrams live under DEFINE ANALYZER.

Roughly, processing flows like this:

  1. Function — an optional FUNCTION clause transforms the raw input string once (for example, normalising punctuation or stripping markup) via a user-defined function that accepts and returns a string.

  2. Tokenizers — split the string into tokens (words, symbols, or other chunks) using one or more built-in tokenizers.

  3. Filters — transform each token (lowercase, strip accents, stem, n-grams, and so on).
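As a sketch of the optional first step, a pre-processing function can be attached with the FUNCTION clause. The function name fn::normalise and its body here are hypothetical; any user-defined function that takes and returns a string will do:

DEFINE FUNCTION fn::normalise($input: string) -> string {
    -- Hypothetical pre-processing: fold curly apostrophes to plain ones
    RETURN $input.replace('’', "'");
};

DEFINE ANALYZER cleaned FUNCTION fn::normalise TOKENIZERS blank FILTERS lowercase;

The function runs once over the whole input string, before any tokenizer sees it.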

The same analyzer is used when indexing and matching queries, so spending time here pays off for relevance and performance.

Use search::analyze() to print the token array an analyzer would produce, which is ideal for experimentation.

Start with the simplest split, whitespace-only tokenization:

DEFINE ANALYZER words TOKENIZERS blank;

RETURN search::analyze("words", "hello world");

Output

[
'hello',
'world'
]

Once you are happy with the tokens, attach the analyzer name to a full-text index and query with @@ (covered in Search indexes and Scoring and ranking).
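For example, an index using the words analyzer above might look like the following; the table name article and field content are placeholders for your own schema:

DEFINE INDEX article_content ON TABLE article FIELDS content SEARCH ANALYZER words BM25;

SELECT * FROM article WHERE content @@ 'hello';

Because the index records the analyzer name, the same tokenization and filtering is applied to both the indexed field and the query term.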

Tokenizers answer: where are the boundaries between tokens? Some examples of tokenizers are blank, camel, and class.

DEFINE ANALYZER example_blank TOKENIZERS blank;
RETURN search::analyze("example_blank", "hello world");

DEFINE ANALYZER example_camel TOKENIZERS camel;
RETURN search::analyze("example_camel", "helloWorld");

DEFINE ANALYZER example_class TOKENIZERS class;
RETURN search::analyze("example_class", "123abc!XYZ");

Filters answer: what should each token look like before indexing?

Some examples of filters are ascii, snowball, and ngram.

DEFINE ANALYZER example_ascii TOKENIZERS class FILTERS ascii;
RETURN search::analyze("example_ascii", "résumé café");
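Filters run in the order they are listed, so they are commonly chained. A sketch that lowercases before stripping accents, so "Résumé" and "resume" index to the same term:

DEFINE ANALYZER example_folded TOKENIZERS class FILTERS lowercase, ascii;
RETURN search::analyze("example_folded", "Résumé CAFÉ");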

DEFINE ANALYZER english_snowball TOKENIZERS class FILTERS snowball(english);
DEFINE ANALYZER german_snowball TOKENIZERS class FILTERS snowball(german);

RETURN [
search::analyze("english_snowball", "Looking at some running cats"),
search::analyze("german_snowball", "Sollen wir was trinken gehen?")
];

DEFINE ANALYZER example_ngram TOKENIZERS class FILTERS ngram(1, 3);
RETURN search::analyze("example_ngram", "apple banana");
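A related filter is edgengram, which keeps only the n-grams anchored at the start of each token, making it a common choice for autocomplete-style prefix matching. A sketch:

DEFINE ANALYZER example_edge TOKENIZERS class FILTERS lowercase, edgengram(2, 5);
RETURN search::analyze("example_edge", "apple");

Unlike ngram, which emits substrings from every position, edgengram emits only prefixes, keeping the index considerably smaller.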

The mapper(path) filter rewrites tokens using a tab-separated file: canonical form first, variant second, one pair per line. That supports lemmatisation beyond what stemming alone catches, or normalising arbitrary phrases (for example mapping multilingual error strings to a single code).

Point path at a file the server can read. Here is a very short example dictionary:

drive	driven
drive	drives
drive	drove
swim	swam

An analyzer making use of this dictionary can be defined as follows:

DEFINE ANALYZER lemme_english TOKENIZERS blank, class FILTERS lowercase, mapper('/path/to/lemmatization-en.txt');

RETURN search::analyze("lemme_english", "He drove and swam");

To add an analyzer only if it is missing, or to replace an existing definition, use IF NOT EXISTS or OVERWRITE on DEFINE ANALYZER. Examples and caveats are in the statement reference.
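The two forms look like this:

DEFINE ANALYZER IF NOT EXISTS words TOKENIZERS blank;

DEFINE ANALYZER OVERWRITE words TOKENIZERS blank FILTERS lowercase;

The first is a no-op when the analyzer already exists; the second replaces the existing definition outright.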
