DEFINE ANALYZER statement
In a database, an analyzer determines how text is split and normalized for searching. It is defined by its name, a set of tokenizers, and a collection of filters.
Requirements
- You must be authenticated as a root, namespace, or database user before you can use the DEFINE ANALYZER statement.
- You must select your namespace and database before you can use the DEFINE ANALYZER statement.
Statement syntax
DEFINE ANALYZER @name [ TOKENIZERS @tokenizers ] [ FILTERS @filters ]
Tokenizers
Tokenizers are responsible for breaking down a given text into individual tokens.
- blank: creates a new token each time a space, tab, or newline character is encountered.
- camel: creates a new token when the next character is uppercase.
- class: creates a new token when the Unicode class of the next character changes (digit, letter, punctuation, blank).
- punct: creates a new token each time a punctuation character is encountered.
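One way to compare the tokenizers is to define one analyzer per tokenizer and inspect what each produces for the same input. The sketch below assumes the `search::analyze` function is available in your SurrealDB version. The analyzer names (`demo_blank` and so on) and the sample string are made up for illustration.

```surql
-- Hypothetical analyzers, one per tokenizer, for side-by-side comparison
DEFINE ANALYZER demo_blank TOKENIZERS blank;
DEFINE ANALYZER demo_camel TOKENIZERS camel;
DEFINE ANALYZER demo_class TOKENIZERS class;
DEFINE ANALYZER demo_punct TOKENIZERS punct;

-- Assumption: search::analyze(analyzer, text) returns the resulting tokens as an array
RETURN search::analyze("demo_blank", "parseHttpRequest v2, done.");
RETURN search::analyze("demo_camel", "parseHttpRequest v2, done.");
RETURN search::analyze("demo_class", "parseHttpRequest v2, done.");
RETURN search::analyze("demo_punct", "parseHttpRequest v2, done.");
```

Running the four queries against the same string makes it easier to decide which tokenizer, or combination of tokenizers, suits a given field.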
Filters
Filters transform the tokens produced by the tokenizers for further processing and analysis.
- ascii: replaces or removes diacritical marks.
- edgengram: useful for finding a term by its prefix.
- lowercase: converts the token to lowercase.
- snowball: applies snowball stemming to the token. The following languages are supported: Arabic, Danish, Dutch, English, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
- uppercase: converts the token to uppercase.
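The effect of the filters can be checked the same way as the tokenizers, by chaining a few of them and looking at the resulting tokens. This is a minimal sketch that assumes `search::analyze` is available in your SurrealDB version. The analyzer names and sample strings are hypothetical, and the edgengram bounds (2,10) are simply reused from the auto-completion example on this page.

```surql
-- Hypothetical analyzer: split on whitespace, then lowercase,
-- strip diacritics, and stem each token as English
DEFINE ANALYZER demo_normalize TOKENIZERS blank FILTERS lowercase,ascii,snowball(english);

-- Hypothetical analyzer: split on whitespace, then lowercase
-- and apply edgengram for prefix matching
DEFINE ANALYZER demo_prefix TOKENIZERS blank FILTERS lowercase,edgengram(2,10);

-- Assumption: search::analyze(analyzer, text) returns the resulting tokens as an array
RETURN search::analyze("demo_normalize", "Café Owners Were Running");
RETURN search::analyze("demo_prefix", "Surreal");
```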
Example usage
This example creates an analyzer that splits text wherever the Unicode class changes, converts the tokens to lowercase, and removes diacritical marks.
-- Creates a simple analyzer that removes diacritical marks
DEFINE ANALYZER ascii TOKENIZERS class FILTERS lowercase,ascii;
This statement creates an analyzer designed for processing English text.
-- Creates an analyzer suitable for English text
DEFINE ANALYZER english TOKENIZERS class FILTERS snowball(english);
This statement creates an analyzer specifically designed for auto-completion tasks.
-- Creates an analyzer suitable for auto-completion.
DEFINE ANALYZER autocomplete FILTERS lowercase,edgengram(2,10);
This statement creates an analyzer designed for source code analysis.
-- Creates an analyzer suitable for source code analysis.
DEFINE ANALYZER code TOKENIZERS class,camel FILTERS lowercase,ascii;
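An analyzer has no effect until something uses it, typically a full-text search index. The sketch below attaches the `english` analyzer from the example above to an index and runs a match query. It assumes the `SEARCH ANALYZER` clause of DEFINE INDEX as found in SurrealDB 1.x and 2.x. The clause may be spelled differently in other versions, so check the DEFINE INDEX reference for the version you run. The table name `article`, the field `content`, and the index name are hypothetical.

```surql
-- Hypothetical table, field, and index names; assumes the 1.x/2.x SEARCH ANALYZER syntax
DEFINE INDEX article_content ON TABLE article FIELDS content SEARCH ANALYZER english BM25;

-- The @@ operator performs a full-text match against the indexed field
SELECT * FROM article WHERE content @@ 'running';
```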