A full-text search database is designed to index and retrieve text-based data (like articles, messages, or comments) based on the meaning and structure of the text itself, rather than exact, literal matches. This allows you to:
Find documents containing certain keywords.
Search for phrases or words with variants (e.g., “run,” “runs,” “running”).
Rank results by relevance, not just by literal string matches.
Traditional relational databases have some limited text search capabilities (with varying degrees of support for text indexes), but more specialized systems, such as Elasticsearch and other Lucene-based solutions, excel at performing complex, scalable full-text queries.
However, new multi-model databases, including SurrealDB, are developing integrated full-text search capabilities so that you can store your data (as documents, graphs, or tables) and query it with advanced text search features. This guide will explain how to “think” in a full-text search model and show how SurrealDB helps you implement these concepts seamlessly.
Tokenization: Splits text into smaller units (“tokens”). Depending on your use case, tokens may be entire words, word stems, or even n-grams. For example, tokenizing the sentence “The quick brown fox.” might produce [“the”, “quick”, “brown”, “fox”].
Normalization and Filtering: After splitting text into tokens, these tokens might be converted to lowercase, stripped of punctuation, or transformed (e.g., removing accents). Additional filters can include stemming or lemmatization, which reduce words to a base form (“running” -> “run”).
Indexing: Full-text search engines build specialized data structures (inverted indexes, suffix arrays, etc.) to let you quickly locate documents containing certain tokens. These indexes often store frequency, position, and other details that enable relevance scoring.
Ranking / Scoring: Once matches are found, an FTS engine ranks them to show the most relevant results first. Algorithms such as BM25 or TF-IDF weigh how often a term appears in a document against how common that term is across the whole collection. An engine may also weight matches differently depending on where they occur, for example in the title vs. the body.
Highlighting: A good search experience shows where in the text the matches occur, often by wrapping matched terms in HTML tags or otherwise emphasizing them.
Analyzers: In a robust full-text search system, analyzers define how the text is processed. An analyzer typically includes tokenizers (which split text) and filters (which modify tokens). Different languages or data types need different analyzers.
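The concepts above can be sketched in a few lines of Python. This is a minimal illustration, not how SurrealDB or any particular engine implements them: the function names (`tokenize`, `strip_accents`, `naive_stem`, `analyze`, `build_inverted_index`) are hypothetical, and the stemmer is a toy suffix stripper rather than a real stemming algorithm.

```python
import re
import unicodedata
from collections import defaultdict


def tokenize(text):
    """Tokenizer: split text into word tokens, dropping punctuation and spaces."""
    return re.findall(r"\w+", text)


def lowercase(token):
    """Filter: normalize case."""
    return token.lower()


def strip_accents(token):
    """Filter: decompose characters and drop combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Filter: toy stemmer for illustration only (hypothetical, not Porter)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Collapse a trailing doubled consonant: "runn" -> "run".
    if len(token) >= 3 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token


def analyze(text, filters=(lowercase, strip_accents, naive_stem)):
    """Analyzer: one tokenizer followed by a chain of token filters."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = [token_filter(t) for t in tokens]
    return tokens


def build_inverted_index(docs):
    """Index: map each token to {doc_id: [positions where it occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(analyze(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index


docs = {"a": "The quick brown fox.", "b": "A fox is running"}
index = build_inverted_index(docs)
print(analyze("The quick brown fox."))
print(index["fox"])
```

Looking up a term is then a dictionary access rather than a scan over every document, and the stored positions are what make phrase matching and highlighting possible.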
In traditional databases, you might do something like:
SELECT * FROM articles WHERE title = 'fox'
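This approach matches only rows whose title is exactly 'fox'. It misses titles that merely contain the word, ignores variants such as “foxes,” and gives no indication of how relevant each match is.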
Full-text search, by contrast, uses an inverted index or other specialized structures for fast lookups and can handle a variety of linguistic transformations. It can highlight results and rank them by how relevant or frequent the terms are.
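To make the contrast concrete, here is a rough Python sketch of relevance ranking with BM25. It assumes documents are already tokenized and non-empty, uses the common defaults `k1=1.2` and `b=0.75`, and uses the Lucene-style IDF variant (with `1 +` inside the logarithm). The function name `bm25_scores` and the sample documents are hypothetical, and real engines compute these statistics from the index rather than by rescanning documents.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score tokenized documents against a query with Okapi BM25.

    docs maps a document id to its (non-empty) list of tokens.
    Only documents matching at least one query term are returned.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores


docs = {
    "exact": ["fox"],
    "long": ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    "none": ["machine", "learning"],
}
scores = bm25_scores(["fox"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

An equality filter would return only the `exact` document here. The scored search also returns `long`, ranked lower because the single occurrence of the term is diluted by the document's length.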
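SurrealDB combines multi-model data storage (document, graph, vector, etc.) with an integrated full-text search engine. This means you can store your data and run full-text queries against it in the same database, without maintaining a separate search system.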
Below is an example of how you might define a table for “articles,” define an analyzer that lowercases and removes accents, and create full-text indexes on the “title” and “body” fields:
USE NAMESPACE myapp DB content;
-- Define a table for articles (schemaless, or define fields explicitly)
DEFINE TABLE articles SCHEMALESS;
-- Define a custom analyzer
DEFINE ANALYZER my_custom_analyzer TOKENIZERS class FILTERS lowercase, ascii;
-- Create one full-text search index per field
DEFINE INDEX articles_title_ft_index ON TABLE articles COLUMNS title SEARCH ANALYZER my_custom_analyzer BM25 HIGHLIGHTS;
DEFINE INDEX articles_body_ft_index ON TABLE articles COLUMNS body SEARCH ANALYZER my_custom_analyzer BM25 HIGHLIGHTS;
With this setup:
my_custom_analyzer splits text by Unicode class changes (letters, digits, punctuation) and then lowercases and removes accents.
HIGHLIGHTS are enabled, which means SurrealDB can highlight matched terms in queries.
Assume you want to search for articles about “machine learning.” SurrealDB syntax might look like:
SELECT *,
    search::highlight('<b>', '</b>', 0) AS titleHighlights,
    search::highlight('<b>', '</b>', 1) AS bodyHighlights,
    search::score(0) + search::score(1) AS score
FROM articles
WHERE title @0@ "machine learning" OR body @1@ "machine learning"
ORDER BY score DESC
LIMIT 10;
Explanation:
title @0@ "machine learning": SurrealDB checks whether the tokens “machine” and “learning” appear in the title field’s FTS index. The number between the @ signs is a predicate reference, which ties this match to the search functions in the SELECT clause.
search::highlight('<b>', '</b>', 0): wraps the terms matched by predicate 0 in the given prefix and suffix.
search::score(0): SurrealDB’s built-in function that returns the relevance score of the document for predicate 0.
DEFINE INDEX
DEFINE ANALYZER
search::highlight: Highlights the matching keywords for the predicate reference number.
search::offsets: Returns the position of the matching keywords for the predicate reference number.
search::score: Helps with scoring and ranking the search results based on their relevance to the search terms.
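As a rough illustration of what highlighting and offsets mean, the Python sketch below finds the character ranges of matching words and wraps them in a prefix and suffix. It is a conceptual sketch only: `match_offsets` and `highlight` are hypothetical helpers, matching here is a simple case-insensitive word comparison, and the return shapes are not those of SurrealDB's functions.

```python
import re


def match_offsets(text, query_terms):
    """Return (start, end) character offsets of words matching a query term."""
    wanted = {term.lower() for term in query_terms}
    return [
        (m.start(), m.end())
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in wanted
    ]


def highlight(text, query_terms, prefix="<b>", suffix="</b>"):
    """Wrap each matching word in prefix/suffix, leaving other text untouched."""
    pieces, cursor = [], 0
    for start, end in match_offsets(text, query_terms):
        pieces.append(text[cursor:start])               # text before the match
        pieces.append(prefix + text[start:end] + suffix)
        cursor = end
    pieces.append(text[cursor:])                        # trailing text
    return "".join(pieces)


title = "Machine learning is fun"
print(match_offsets(title, ["machine", "learning"]))
print(highlight(title, ["machine", "learning"]))
```

Returning offsets instead of pre-wrapped text leaves the choice of markup to the client, which is useful when the same result is rendered in more than one place.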