Our new demo dataset has a lot in store for you!

releases tutorials

Jun 13, 2024

Alexander Fridriksson

Our new demo dataset has a lot in store for you!

With our 2.0-alpha release, you’ll find a lot of exciting updates, one of which is a new and improved demo dataset!

If you’ve been using our Surreal Deal dataset, you’ll see some familiar tables and fields as this dataset is also based around an e-commerce store. However, unlike the previous Surreal Deal, this dataset is based on an actual e-commerce store!

That store is none other than our very own swag store - SurrealDB.store!

The real and the surreal

Before we dive into some exciting examples, it’s important to clarify what it means to be based on a real e-commerce store.

It means that

The products are real, and you can buy them on the SurrealDB.store!
Everything else is fake dummy data that has been generated from scratch with no connection to the actual real store.
There is nothing in this demo data that was not publicly available on SurrealDB.store at the time of this dataset’s creation.

Now that we have that out of the way, let’s explore the new dataset.

What’s in store

Define all the things!

There have been massive improvements and a lot of new features added since the dataset was first generated for 1.0-beta.9 last year.

The dataset has been completely overhauled to showcase many new features and more examples of data modelling patterns.

You’ll now find the schema definitions for all the tables and fields, including more advanced definitions like assertions and default values.

-- Define a field with the object data type
DEFINE FIELD images
    ON TABLE product
    TYPE array<object>;

-- Define the subfields of the string data type Assert that the URL is a valid URL
DEFINE FIELD images.*.url
    ON TABLE product
    TYPE string
    ASSERT string::is::url($value);

-- Define the subfields of type number
DEFINE FIELD images.*.position
    ON TABLE product
    TYPE number;

-- Define time field on product table type object with subfields: created_at and updated_at with the datetime data type
DEFINE FIELD time
    ON TABLE product
    TYPE object;

DEFINE FIELD time.created_at
    ON TABLE product
    TYPE datetime
    VALUE $before OR time::now()
    DEFAULT time::now();

DEFINE FIELD time.updated_at
    ON TABLE product
    TYPE datetime
    VALUE time::now()
    DEFAULT time::now();

You’ll find tables with record links in both directions.

-- from person to address_history
DEFINE FIELD person
    ON TABLE address_history
    TYPE record<person>;

-- from address_history to person
DEFINE FIELD address_history
    ON TABLE person
    TYPE record<address_history>;

As well as a graph relation example with just 2 tables instead of 3.

DEFINE TABLE product_sku TYPE RELATION FROM product TO product

You might notice its using the new TYPE RELATION instead of defining the in and out fields.

There are also various indexes defined.

-- Unique indexes
DEFINE INDEX unique_wishlist_relationships
    ON TABLE wishlist
    COLUMNS in, out UNIQUE;

-- Index on nested fields or record links
DEFINE INDEX person_country
    ON TABLE person
    COLUMNS address.country;

-- Analyzer and index for full-text search
DEFINE ANALYZER blank_snowball
    TOKENIZERS blank
    FILTERS lowercase, snowball(english);

DEFINE INDEX review_content
    ON TABLE review
    COLUMNS review_text
    SEARCH ANALYZER blank_snowball BM25 HIGHLIGHTS;

As well as functions

-- A function can encapsulate any valid SurrealQL logic
DEFINE FUNCTION fn::pound_to_usd($price: number) {
    RETURN $price * 1.26f;
};

-- Which means it can also be used like stored procedure
DEFINE FUNCTION fn::number_of_unfulfilled_orders() {
    RETURN (
        SELECT count()
        FROM order
        WHERE order_status NOTINSIDE ['processed', 'shipped']
        GROUP ALL
 );
};

Time sortable random ids, using ULIDs

While our random ids are a great default option, they aren’t the only option.

For an e-commerce application where things are often based on time, it can make sense to have a time sortable id, as was the case for Shopify switching from UUID v4 to ULID.

SurrealDB also supports UUID v7, which does pretty much the same thing. The honest reason ULID was chosen here is just because its shorter and looks better.

Anyway, these identifiers naturally sort your data based on creation time, which enables you to better use our range query pattern for highly efficient operations regardless of table size.

-- Select the sum of sales for each product in the order table
SELECT
 product_name,
    math::sum(price * quantity) as sum_sales
FROM order:01FS426M489J4RVK2AGSN9HVP0..01FTQB6ZMG9RMB9300Z6GBDXNR
GROUP BY product_name;

Realistic product reviews generated with llama3

It has been a fun experiment seeing what it takes to use a local LLM for fake data generation.

Now our fake reviews are no longer just a collection of words chosen at random.

review_text: "agriculture artwork feed name along xhtml putting photos much meal costs spring"

They are a collection of semi-random words chosen with statistics!

review_text: "The Voyager Wool-Like Jacket has become my new go-to for cold winter days. It's that good!",

As expected, it was relatively quick and easy to write prompts and get some outputs, you’ve been doing this with ChatGPT and friends for a while now.

The hard part was actually integrating it into a system and building the guardrails around it which makes it safe, useful, predictable and repeatable.

That is why we’re building out various AI/ML workflows here at SurrealDB which make it easier for you to build intelligent applications.

What this means for this dataset however is that it allows us to have better examples such as this full-text search example looking for the products which people mention they are wearing nonstop.

-- Select product name and review text where the review text contains the phrase "wearing nonstop"
SELECT ->product.name, review_text
FROM review
WHERE review_text @@ 'wearing nonstop';

What’s next?

A lot of thought has been put into to making the dataset more realistic and useful for learning the various features and patterns everyone should be aware of. This is however just the beginning.

As we get closer to the finalised 2.0 release, the dataset will be updated with more features and improvements. As well as more examples of how to use it.

What you can do now is:

If you think something is missing or could be done better, let us know! We would love to hear from you.