Using SurrealDB as a Vector Database

A vector database is specialized for storing high-dimensional vectors and for efficiently running queries over them, such as nearest neighbor search or similarity search. Rather than searching on exact values or keywords, vector databases let you search based on semantic similarity. For instance, in a text embedding scenario, you can find documents that are semantically similar to a given query, even if they do not share the same keywords.

But how do you “think” in a vector database? Unlike relational or document models, where the focus is on well-defined schemas and relationships, vector databases revolve around embeddings—numerical representations of objects (like text, images, audio snippets, etc.) in a continuous vector space. You have to design data structures and queries to exploit these embeddings for similarity search or AI-driven retrieval.

In this guide, you’ll learn about the basics of vector databases, how they differ from other data models, and how SurrealDB—a multi-model database—can store and query vector data alongside documents, graphs, and more.

What Are Embeddings?

Embeddings are typically dense vectors of real numbers that capture the semantic or contextual meaning of data. For instance, in Natural Language Processing (NLP), a word or sentence can be transformed into a vector with 128, 256, 768, or more dimensions. The idea is that objects with similar meaning end up with similar vector representations, so you can measure how close two objects are by computing the distance between their vectors.
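To make the idea of "closeness" concrete, here is a minimal sketch in plain Python. The three vectors are hand-made toy values, not output from a real embedding model, and the names `cat`, `kitten`, and `car` are purely illustrative. It uses cosine similarity, one common way to compare embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (assumed values for illustration only;
# a real model would produce hundreds of dimensions).
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.95]

sim_cat_kitten = cosine_similarity(cat, kitten)  # close to 1.0
sim_cat_car = cosine_similarity(cat, car)        # much lower

print(sim_cat_kitten > sim_cat_car)  # True
```

The two vectors that point in nearly the same direction score close to 1, while the unrelated one scores much lower. A similarity search relies on exactly this property.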

Why Store Embeddings?

Embeddings enable a range of powerful use cases:

Semantic Text Search: Retrieve documents similar in meaning to a query, regardless of exact keyword overlap.
Recommendation Systems: Suggest items (movies, products, etc.) similar to what a user has liked based on learned embeddings.
Image or Audio Search: Identify images or audio clips semantically similar to a given sample.
Clustering and Classification: Perform unsupervised clustering of data points or quickly identify which category a vector is close to.

When these embeddings need to be indexed and queried at scale, a specialized data structure—often called an Approximate Nearest Neighbor (ANN) index or vector index—becomes necessary. This is where vector databases come in.
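To see why an index becomes necessary, consider what an exact search has to do without one. The sketch below is a brute-force nearest neighbor scan in plain Python; the record keys (`doc:1` and so on) and the two-dimensional vectors are made-up sample data, and the function name `brute_force_knn` is hypothetical.

```python
import heapq
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(query, vectors, k):
    """Exact k-nearest neighbors: computes the distance to every stored vector."""
    scored = ((euclidean(query, vec), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)  # list of (distance, key), closest first

# Made-up sample data for illustration.
vectors = {
    "doc:1": [0.0, 0.0],
    "doc:2": [1.0, 1.0],
    "doc:3": [0.2, 0.1],
    "doc:4": [5.0, 5.0],
}

nearest = brute_force_knn([0.1, 0.1], vectors, k=2)
print([key for _, key in nearest])  # ['doc:3', 'doc:1']
```

Every query touches every vector, so the cost grows linearly with both the number of stored items and the number of dimensions. An ANN index avoids the full scan by accepting a small chance of missing a true nearest neighbor in exchange for much faster lookups.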

Core Concepts of Vector-Oriented Modeling

When transitioning to a vector database, you need to adapt your mindset to revolve around embeddings and distance metrics:

Embeddings: Each data item (a text document, an image, a user profile, or an event) can be represented as a high-dimensional vector. These embeddings are often produced by machine learning models (like a Transformer for text).
Similarity Metrics: Vectors are compared using a distance or similarity metric, such as cosine similarity, dot product, or Euclidean distance.
Nearest Neighbor Search: The most common operation is to find the k-nearest neighbors of a query vector. Rather than matching on keywords or IDs, the system retrieves the items with the smallest distance from the query in the vector space.
Hybrid Approach: Some queries combine classical filters (e.g., a user’s location or a date range) with vector similarity. This means you might need a system that can do both similarity-based ranking and typical attribute-based filtering, ideally in one seamless environment.
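The concepts above can be sketched together in a few lines of plain Python. This is an in-memory illustration under stated assumptions, not SurrealDB's implementation: the `records` list, its `city` field, and the `hybrid_search` function are all hypothetical. The first part shows that the choice of metric matters, and the second part combines an attribute filter with a similarity ranking.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Metrics can disagree: dot product rewards magnitude, cosine ignores it.
query = [1.0, 0.0]
long_vec = [3.0, 3.0]     # large magnitude, 45 degrees away from the query
aligned_vec = [0.9, 0.1]  # small magnitude, almost parallel to the query
dot_prefers_long = dot(query, long_vec) > dot(query, aligned_vec)
cosine_prefers_aligned = (
    cosine_similarity(query, aligned_vec) > cosine_similarity(query, long_vec)
)

# Hypothetical records mixing a plain attribute with an embedding.
records = [
    {"id": "venue:1", "city": "London", "embedding": [0.9, 0.1]},
    {"id": "venue:2", "city": "Paris", "embedding": [0.95, 0.05]},
    {"id": "venue:3", "city": "London", "embedding": [0.1, 0.9]},
    {"id": "venue:4", "city": "London", "embedding": [0.7, 0.3]},
]

def hybrid_search(query_vec, city, k):
    """Filter by attribute first, then rank the survivors by cosine similarity."""
    candidates = [r for r in records if r["city"] == city]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vec, r["embedding"]),
        reverse=True,
    )
    return [r["id"] for r in ranked[:k]]

top = hybrid_search(query, city="London", k=2)
print(top)  # ['venue:1', 'venue:4']
```

Note that `venue:2` is the closest vector overall but is excluded by the city filter. That is the point of a hybrid query: similarity decides the ranking, while ordinary attributes decide which records are eligible at all.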

SurrealDB as a Vector Database

Although SurrealDB is commonly discussed for its multi-model capabilities (documents, tables, and graph relationships), it can also store and index vectors. This makes SurrealDB a powerful “one-stop shop” if you need:

Document/graph capabilities for typical data interactions, plus
Vector search to power semantic queries, AI recommendations, or advanced content retrieval.

With SurrealDB’s SurrealQL, you can define vector fields, store numeric arrays as embeddings, create indexes, and perform similarity queries. SurrealDB’s approach aims to unify these features so you don’t have to maintain separate data stores (e.g., a dedicated vector database plus a separate document or graph DB) if you prefer a single integrated system.
