SurrealDB offers a large variety of ways to work with text, from simple operators to fuzzy searching, customized ordering, full-text search and more. In this guide, we will cover comparing and sorting text, contains functions and operators, equality and fuzzy equality, regex matching, string functions, and full-text search.
This will give you a comprehensive overview of the different ways to work with text in SurrealDB and which one to use in your specific use case.
SELECT
queriesThe following example shows a few records created from an array of strings in an order that is sorted to the human eye: lowest to highest numbers, then A to Z.
FOR $word IN ['1', '2', '11', 'Ábaco', 'kitty', 'Zoo'] { CREATE data SET val = $word; };
Inside a SELECT
query, an ORDER BY
clause can be used to order the output by one or more field names. For the above data, an ordered SELECT
query looks like this.
SELECT VALUE val FROM data ORDER BY val;
However, in the case of strings, sorting is done by Unicode rank which often leads to output that seems out of order to the human eye. The output of the above query shows the following:
Output[ '1', '11', '2', 'Zoo', 'kitty', 'Ábaco' ]
This is because:
To sort strings in a more natural manner to the human eye, the keywords NUMERIC
and COLLATE
(or both) can be used. NUMERIC
will instruct strings that parse into numbers to be treated as such.
SELECT VALUE val FROM data ORDER BY val NUMERIC;
Numberic strings now sorted as numbers[ '1', '2', '11', 'Zoo', 'kitty', 'Ábaco' ]
COLLATE
instructs unicode strings to sort by alphabetic order, rather than Unicode order.
SELECT VALUE val FROM data ORDER BY val COLLATE;
Output[ '1', '11', '2', 'Ábaco', 'kitty', 'Zoo' ]
And for the data in this example, COLLATE NUMERIC
is likely what will be desired.
SELECT VALUE val FROM data ORDER BY val COLLATE NUMERIC;
Output[ '1', '2', '11', 'Ábaco', 'kitty', 'Zoo' ]
As of SurrealDB 2.2.2
, the functions array::sort_natural()
, array::sort_lexical()
, and array::sort_lexical_natural()
can be used on ad-hoc data to return the same output as the COLLATE
and NUMERIC
clauses in a SELECT
statement.
The most basic way to see if one string is contained inside another is to use the IN
operator, or the string::contains()
function.
-- false "Umple" IN "Rumplestiltskin"; string::contains("Rumplestiltskin", "Umple"); -- Same function using method syntax "Rumplestiltskin".contains("Umple"); -- true "umple" IN "Rumplestiltskin"; string::contains("Rumplestiltskin", "umple"); "Rumplestiltskin".contains("umple");
SurrealDB has a number of operators to determine if all or some of the values of one array are contained in another, such as CONTAINSALL
and CONTAINSANY
, or ALLINSIDE
and ANYINSIDE
. The operators with CONTAINS
and INSIDE
perform the same behaviour, just in the opposite order.
-- If 1,2,3 contains each item in 1,2... [1,2,3] CONTAINSALL [1,2]; -- then each item in 1,2 is inside 1,2,3 [1,2] ALLINSIDE [1,2,3];
Because strings are essentially arrays of characters, these operators work with strings as well.
NoteThe above capability was added in SurrealDB version
2.2.2
.
Both of these queries will return true
.
"Rumplestiltskin" CONTAINSALL ["umple", "kin"]; "kin" ALLINSIDE "Rumplestiltskin"; ["kin", "someotherstring"] ANYINSIDE "Rumplestiltskin";
While strings can be compared for strict equality in the same way as with other values, fuzzy searching can also be used to return true
if two strings are approximately equal. The fuzzy operators are:
~
to check if two strings have fuzzy equality!~
to check if two strings do not have fuzzy equality?~
to check if any strings have fuzzy equality*~
to check if all strings have fuzzy equalityAll of the following will return true.
"big" ~ "Big"; "big" !~ "small"; ["Big", "small"] ?~ "big"; ["Big", "big"] *~ "big";
Fuzzy matching is based on the Smith-Waterman algorithm that requires some time to understand. It is a convenient option due to the ~
operator, but can sometimes produce surprising results.
-- true "United Kingdom" ~ "United kingdom"; -- true (second string entirely contained in first) "United Kingdom" ~ "ited"; -- Also true! "United Kingdom" ~ "i"; -- false "United Kingdom" ~ "United Kingdóm";
The string::similarity::fuzzy
function can be useful in this case, as it returns a number showing the similarity between strings, not just whether they count as a fuzzy match. In the following example, while the strings ited
and i
do have a similarity score above 0, they are ranked much lower than the better matches United kingdom
and United Kingdom
.
SELECT $this AS word, string::similarity::fuzzy("United Kingdom", $this) AS similarity FROM ["United Kingdom", "United kingdom", "ited", "United Kingdóm", "i"] ORDER BY similarity DESC;
Output[ { similarity: 295, word: 'United Kingdom' }, { similarity: 293, word: 'United kingdom' }, { similarity: 75, word: 'ited' }, { similarity: 15, word: 'i' }, { similarity: 0, word: 'United Kingdóm' } ]
SurrealDB offers quite a few other algorithms inside the string functions module for distance or similarity comparison. They are:
string::distance::damerau_levenshtein()
string::distance::normalized_damerau_levenshtein()
string::distance::hamming()
string::distance::levenshtein()
string::distance::normalized_levenshtein()
string::distance::osa_distance()
string::similarity::jaro()
string::similarity::jaro_winkler()
Which of these functions to choose depends on your personal use case.
For example, fuzzy similarity and distance scores are not a measure of absolute equality and ordered similarity scores should only be used in comparisons against the same string. Take the following queries for example which return the score for the string “United” and “Unite”:
-- return 131 and 111 string::similarity::fuzzy("United Kingdom", "United"); string::similarity::fuzzy("United Kingdom", "Unite"); -- also return 131 and 111 string::similarity::fuzzy("United", "United"); string::similarity::fuzzy("United", "Unite");
While the word “Unite” is clearly closer to the word “United” than it is to “United Kingdom”, the algorithm used for this function only considers how much of the second string is found in the first string.
However, the string::similarity::jaro()
function returns an output that approaches 1 if two strings are equal, making it a more apt solution when the first and second string may be entirely different. Using the same input strings as above shows that “Unite” is clearly the most similar of the strings that are not outright equal to “United”.
-- 0.8095238095238096f string::similarity::jaro("United Kingdom", "United"); -- 0.7857142857142857f string::similarity::jaro("United Kingdom", "Unite"); -- 1 string::similarity::jaro("United", "United"); -- 0.9444444444444445f string::similarity::jaro("United", "Unite");
Another example of the large difference between algorithms is the Hamming distance algorithm, which only compares strings of equal length.
-- Error: strings have different length string::distance::hamming("United Kingdom", "United"); -- Returns 0 string::distance::hamming("United", "United"); -- Returns 1 string::distance::hamming("United", "Unitéd"); -- Returns 6 string::distance::hamming("United", "uNITED");
The string::matches()
function can be used to perform regex matching on a string.
-- true string::matches("Cat", "[HC]at"); -- Also true string::matches("Hat", "[HC]at");
SurrealDB has a large number of string functions that can be used manually to refine string searching, such as string::lowercase()
, string::starts_with()
, and string::ends_with()
.
SELECT $this AS word, $this.lowercase() = "sleek" AS is_sleek FROM ["sleek", "SLEEK", "Sleek", "sleeek"];
Output[ { is_sleek: true, word: 'sleek' }, { is_sleek: true, word: 'SLEEK' }, { is_sleek: true, word: 'Sleek' }, { is_sleek: false, word: 'sleeek' } ]
For more customized text searching, full-text search can be used.
Full-Text search supports text matching, proximity searches, result ranking, and keyword highlighting, making it a much more comprehensive solution when precise text searching is required.
It is also ACID-compliant, which ensures data integrity and reliability.
The first step to using full-text search is to define an analyzer using a DEFINE ANALYZER
statement. An analyzer is not defined on a table, but a set of tokenizers (to break up text) and filters (to modify text).
The DEFINE ANALYZER
page contains a detailed explanation of each type of tokenizer and analyzer to choose from. To define the analyzer that most suits your needs, it is recommended to use the search::analyze
function which returns the output of an analyzer for an input string.
Take the following analyzer for example, which uses blank
to split a string by whitespace, and edgengram(3, 10)
to output all of the instances of the first three to ten letters of a word.
DEFINE ANALYZER blank_edgengram TOKENIZERS blank FILTERS edgengram(3, 10); search::analyze("blank_edgengram", "The Wheel of Time turns, and Ages come and pass, leaving memories that become legend.");
The output includes strings like ‘turns,’ and ‘legend.’, which include punctuation marks.
Output['The', 'Whe', 'Whee', 'Wheel', 'Tim', 'Time', 'tur', 'turn', 'turns', 'turns,', 'and', 'Age', 'Ages', 'com', 'come', 'and', 'pas', 'pass', 'pass,', 'lea', 'leav', 'leavi', 'leavin', 'leaving', 'mem', 'memo', 'memor', 'memori', 'memorie', 'memories', 'tha', 'that', 'bec', 'beco', 'becom', 'become', 'leg', 'lege', 'legen', 'legend', 'legend.']
If this is not desired, some looking through the DEFINE ANALYZER
page will turn up another tokenizer called punct
that can be included, now creating an analyzer that splits on whitespace as well as punctuation.
DEFINE ANALYZER blank_edgengram TOKENIZERS blank, punct FILTERS edgengram(3, 10); search::analyze("blank_edgengram", "The Wheel of Time turns, and Ages come and pass, leaving memories that become legend.");
Output['The', 'Whe', 'Whee', 'Wheel', 'Tim', 'Time', 'tur', 'turn', 'turns', 'and', 'Age', 'Ages', 'com', 'come', 'and', 'pas', 'pass', 'lea', 'leav', 'leavi', 'leavin', 'leaving', 'mem', 'memo', 'memor', 'memori', 'memorie', 'memories', 'tha', 'that', 'bec', 'beco', 'becom', 'become', 'leg', 'lege', 'legen', 'legend']
The available tokenizers and filters are as follows:
blank
, camel
, punct
to split by whitespace, camelcase, and punctuation. The class
tokenizer splits when a class change is detected, such as letter to number, space to letter, punctuation to letter, and so on.ascii
, lowercase
, uppercase
to change to ASCII, lowercase, and uppercase.The ngram
filter is similar to the edgengram
filter above in that it takes a minimum and maximum length, but instead moves from character to character inside a string as it attempts to find all the possible outputs in between these two lengths.
DEFINE ANALYZER example_analyzer FILTERS ngram(1,4); search::analyze("example_analyzer", "cars!");
Here is the output modified slightly to show the output of the ngram
filter at each step of the way.
Output[ 'c', 'ca', 'car', 'cars', 'a', 'ar', 'ars', 'ars!', 'r', 'rs', 'rs!', 's', 's!', '!' ]
The snowball
and mapper
filters are the most complex and versatile.
snowball
filterThe snowball filter is used to perform stemming: the reduction of a word to as basic and universal a form as possible. It is available for the languages Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish.
Stemming involves using an algorithm to reduce a word, but is unable to incorporate complex changes like the plural and verbal vowel changes in English.
DEFINE ANALYZER snowball_test TOKENIZERS blank,punct FILTERS snowball(english); search::analyze("snowball_test", " manager managing management running ran foot feet introspective introspection introspected ");
Output[ 'manag', 'manag', 'manag', 'run', 'ran', 'foot', 'feet', 'introspect', 'introspect', 'introspect' ]
Stemming is particularly useful in languages with complex but regular declension, such as Finnish. In the following example, the snowball filter is able to turn all declined forms of the word “talo” (house) into its root form.
DEFINE ANALYZER snowball_test TOKENIZERS blank,punct FILTERS snowball(finnish); search::analyze("snowball_test", "talo talon taloa talossa talostani taloonsa talolla talolta talolle talona taloksi taloin talotta taloineen");
Output['talo', 'talo', 'talo', 'talo', 'talo', 'talo', 'talo', 'talo', 'talo', 'talo', 'talo', 'talo', 'talot', 'talo']
mapper
filterThe mapper
filter is the most customizable of all, involving a list of strings and the strings they are to be mapped to. This filter requires a path to a text file, inside which each base form is followed by a word to map to it, separated by a comma.
mapper.txtrun ran foot feet
In the case of the above example, the mapper
will allow the output to show the base forms of the words “ran” and “feet”.
DEFINE ANALYZER mapper TOKENIZERS blank FILTERS snowball(english),mapper('mapper.txt'); search::analyze("mapper", " manager managing management running ran foot feet introspective introspection introspected ");
[ 'manag', 'manag', 'manag', 'run', 'run', 'foot', 'foot', 'introspect', 'introspect', 'introspect' ]
The word mapper
was intentionally chosen to be ambiguous, as this feature can be used to map any string to another string. It could be used for example to map cities to provinces, planets to stars, hieroglyphs to English words, and so on.
mapper.txtseated_man 𓀀 man_with_hand_to_mouth 𓀁 seated_woman 𓁐 goddess_with_feather 𓁦
DEFINE ANALYZER mapper TOKENIZERS blank FILTERS mapper('mapper.txt'); search::analyze("mapper", "𓀀 𓁦");
Output[ 'seated_man', 'goddess_with_feather' ]
Once a search analyzer is defined, it can be applied to the fields of a table to make them searchable by defining an index that uses the SEARCH ANALYZER
clause. Having a full-text index in place makes it possible to use the @@
operator (the MATCHES
operator).
DEFINE ANALYZER my_analyzer TOKENIZERS class FILTERS lowercase, ascii; DEFINE INDEX body_index ON TABLE article FIELDS body SEARCH ANALYZER my_analyzer; DEFINE INDEX title_index ON TABLE article FIELDS title SEARCH ANALYZER my_analyzer; CREATE article SET title = "Machine Learning!", body = "Machine learning, or ML, is all the rage these days. Developers are..."; CREATE article SET title = "History of machines", body = "The earliest 'machine' used by our ancestors was a simple sharpened stone tool. It was..."; SELECT body, title FROM article WHERE body @@ "machine" OR title @@ "machine";
Output[ { body: 'Machine learning, or ML, is all the rage these days. Developers are...', title: 'Machine Learning!' }, { body: "The earliest 'machine' used by our ancestors was a simple sharpened stone tool. It was...", title: 'History of machines' } ]
To use highlighting and best match scoring on searches, the BM25
and HIGHLIGHTS
clauses can be added to the DEFINE INDEX
statement. These enable you use the search::highlight
and search::score
functions.
Inside a query, the @@
operator takes a number that is matched with the same number passed into one of these functions. In the example below, the WHERE text @0@ "night"
part of the query will match with search::highlight("->", "<-", 0)
and search::score(0) AS text_score
, while title @1@ "hound"
will match with search::score(1) AS title_score
.
DEFINE ANALYZER my_analyzer TOKENIZERS class, blank FILTERS lowercase, ascii; DEFINE INDEX text_index ON TABLE article FIELDS text SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS; DEFINE INDEX title_index ON TABLE article FIELDS title SEARCH ANALYZER my_analyzer BM25 HIGHLIGHTS; INSERT INTO article (title, text) VALUES ("A Study in Scarlet", "IN the year 1878 I took my degree of Doctor of Medicine of the University of London, and proceeded to Netley to go through the course prescribed for surgeons in the army."), ("A Study in Scarlet", "Having completed my studies there, I was duly attached to the Fifth Northumberland Fusiliers as Assistant Surgeon."), ("The Sign of the Four", "SHERLOCK HOLMES took his bottle from the corner of the mantel-piece and his hypodermic syringe from its neat morocco case."), ("The Hound of the Baskervilles", "MR. SHERLOCK HOLMES, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table."), ("The Hound of the Baskervilles", "I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before."); SELECT text, title, search::highlight("->", "<-", 0) AS title, search::score(0) AS text_score, search::score(1) AS title_score FROM article WHERE text @0@ "night" OR title @1@ "hound";
Output[ { text: 'MR. SHERLOCK HOLMES, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table.', text_score: 0.30209195613861084f, title: 'MR. SHERLOCK HOLMES, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all ->night<-, was seated at the breakfast table.', title_score: 0.32491400837898254f }, { text: 'I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before.', text_score: 0.35619309544563293f, title: 'I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the ->night<- before.', title_score: 0.32491400837898254f } ]