Skip to main content

DEFINE ANALYZER statement

In the context of a database, an analyzer plays a crucial role in text processing and searching. It is defined by its name, a set of tokenizers, and a collection of filters.

Requirements

Statement syntax

SurrealQL Syntax
DEFINE ANALYZER [ IF NOT EXISTS ] @name [ TOKENIZERS @tokenizers ] [ FILTERS @filters ]

Tokenizers

Tokenizers are responsible for breaking down a given text into individual tokens based on a set of instructions. There are a couple of tokenizers that can be used while defining an analyzer as seen below:

blank

The blank tokenizer breaks down a text into tokens by creating a new token each time it encounters a space, tab, or newline character. It's a straightforward way to split text into words or chunks based on whitespace.

For example, if you had the text "hello world", the blank tokenizer would create two tokens, ["hello" and "world"]. Below is an example of how to use the blank tokenizer:

DEFINE ANALYZER example_blank TOKENIZERS blank;

camel

The camel tokenizer is used for identifying and creating tokens when the next character in the text is uppercase. This is particularly useful for processing camelCase or PascalCase text, common in programming, to split them into meaningful words.

For example, if you had the text "helloWorld", the camel tokenizer would create two tokens, ["hello", "World"]. Below is an example of how to use the camel tokenizer:

DEFINE ANALYZER example_camel TOKENIZERS camel;

class

The class tokenizer segments text into tokens by detecting changes (digit, letter, punctuation, blank) in the Unicode class of characters. It creates a new token when the character class changes, distinguishing between digits, letters, punctuation, and blanks. This allows for flexible tokenization based on character types.

For example, if you had the text "123abc!XYZ", the class tokenizer would create four tokens, ["123", "abc", "!", "XYZ"]. Below is an example of how to use the class tokenizer:

DEFINE ANALYZER example_class TOKENIZERS class;

punct

The punct tokenizer generates tokens by breaking the text whenever a punctuation character is encountered. It's suitable for tokenizing sentences or breaking text into smaller units based on punctuation marks.

For example, if you had the text "Hello, World!", the punct tokenizer would create four tokens, ["Hello", ",", "World", "!"]. Below is an example of how to use the punct tokenizer:

DEFINE ANALYZER example_punct TOKENIZERS punct;

Filters

Filters take on the task of transforming these tokens for further processing and analysis.

ascii

The ascii filter is responsible for processing tokens by replacing or removing diacritical marks (accents and special characters) from the text. It helps standardize text by converting accented characters to their basic ASCII equivalents, making it more suitable for various text analysis tasks.

For example, if you had the text "résumé café", the ascii filter would create two tokens, ["resume", "cafe"]. Below is an example of how to use the ascii filter:

DEFINE ANALYZER example_ascii TOKENIZERS class FILTERS ascii;

ngram(min,max)

The ngram filter is used to create a sequence of 'n' tokens from a given sample of text or speech. These items can be syllables, letters, words or base pairs according to the application. It accepts two parameters min and max which indicates that you want to create n-grams starting from min to size of max.

For example, if you had the text "apple banana", the ngram filter would create these tokens: ["a", "p", "p", "l", "e", "ap", "pp", "pl", "le", "app", "ppl", "ple", "b", "a", "n", "a", "n", "a", "ba", "an", "na", "an", "na", "ban", "ana", "nan", "ana"]. Below is an example of how to use the ngram filter:

DEFINE ANALYZER example_ngram TOKENIZERS class FILTERS ngram(1,3);

edgengram(min,max)

The edgengram filter is used to create tokens that represent prefixes of terms. It generates a sequence of tokens that gradually build up a term, which can be useful for autocomplete or searching based on partial words. It accepts two parameters min and max which define the minimum and maximum amount of characters in the prefix.

For example, if you had the text "apple banana", the edgengram filter would create six tokens, ["a", "ap", "app", "b", "ba", "ban"]. Below is an example of how to use the edgengram filter:

DEFINE ANALYZER example_edgengram TOKENIZERS class FILTERS edgengram(1,3);

lowercase

The lowercase filter converts tokens to lowercase, ensuring that text is consistently in lowercase format. This is often used to make text case-insensitive for search and analysis purposes.

For example, if you had the text "Hello World", the lowercase filter would create two tokens, ["hello", "world"]. Below is an example of how to use the lowercase filter:

DEFINE ANALYZER example_lowercase TOKENIZERS class FILTERS lowercase;

snowball(language)

The snowball filter applies Snowball stemming to tokens, reducing them to their root form and converts the case to lowercase. The following supported languages can be passed as a parameter in snowball: Arabic, Danish, Dutch, English, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.

For example, if you had the text "running cats", the snowball filter would create two tokens, ["run", "cat"]. Below is an example of how to use the snowball filter with the English language:

DEFINE ANALYZER example_snowball TOKENIZERS class FILTERS snowball(english);

uppercase

The uppercase filter converts tokens to uppercase, ensuring text consistency in uppercase format. It can be useful when case-insensitivity is required for specific analysis or search operations.

For example, if you had the text "Hello World", the uppercase filter would create two tokens, ["HELLO", "WORLD"]. Below is an example of how to use the uppercase filter:

DEFINE ANALYZER example_uppercase TOKENIZERS class FILTERS uppercase;

Using IF NOT EXISTS clause Since 1.3.0

The IF NOT EXISTS clause can be used to create an analyzer only if it does not already exist. If the analyzer already exists, the DEFINE ANALYZER statement will return an error.

-- Create a ANALYZER if it does not already exist
DEFINE ANALYZER IF NOT EXISTS example TOKENIZERS blank;

More examples

Examples on application of analyzers to indexes can be found in the documenation on DEFINE INDEX statement

This example creates an analyzer that tokenizes text based on the class of characters and then applies the lowercase filter to the tokens.

-- Creates a simple analyzer removing diacritics marks
DEFINE ANALYZER ascii TOKENIZERS class FILTERS lowercase,ascii;

This command statement creates an analyzer specifically designed for processing English texts.

-- Creates an analyzer suitable for English text
DEFINE ANALYZER english TOKENIZERS class FILTERS snowball(english);

This statement creates an analyzer specifically designed for auto-completion tasks.

-- Creates an analyzer suitable for auto-completion.
DEFINE ANALYZER autocomplete FILTERS lowercase,edgengram(2,10);

This command statement creates an analyzer specifically designed for source code analysis.

-- Creates an analyzer suitable for source code analysis.
DEFINE ANALYZER code TOKENIZERS class,camel FILTERS lowercase,ascii;