

In this page we will start turning the existing data into more useful SurrealQL data types.
Since we can make a datetime from the format YYYY-MM-DD, we just need to turn the existing "03 Jul 1991" format into "1991-07-03", and put <datetime> in front.
The first thing we can do is use the string::split function. Typing .split(' ') will split by whitespace.
We can assign this output to a parameter that we'll call $split. Then we can rearrange the order by first getting the 2nd index, then the 1st, and finally the 0th. In between we'll use + '-' to add a hyphen.
That's almost good enough. Let's define a function to hold this code.
A function is made using a DEFINE FUNCTION statement. A function's input and return value can have a type name (like string) to make it more type safe.
By the way, you can use the RETURN keyword if you prefer. But it's not necessary since the last expression will automatically be the return value of a function.
The only time when using RETURN makes a difference is when you use it to return a value early. For example, the following query will only show "This value shows up" and never reach the next line.
Now let's define another function so that we can get the output '1991-07-03' instead of '1991-Jul-03'.
Fortunately, all the dates in our naive_movie records all follow the same format, with months like Jan, Feb, Mar, and so on. Changing them from this format to a number requires some typing, but looks clean enough. At the end of the function we have a THROW statement to return an error if the input is invalid.
Now that the function fn::month_to_num() is done, we can redefine the existing fn::date_to_datetime() function with the OVERWRITE clause.
With those functions defined, we can now give this a try. Let's pick a single naive_movie using this SELECT statement. This statement has the keyword ONLY, which tells the database to return just a single record instead of an array of records. And to make the statement work, we need to add LIMIT 1 so that only up to one record will be selected.
Then we can display the movie's Title field and modified Release field. One way to do this is by using . and then {} to open up a new space inside which we choose which fields and how to represent them. Or we can just do a SELECT statement on the $one_movie parameter that we just created. This time FROM ONLY doesn't need a LIMIT 1 because the database already knows that $one_movie is a single record.
Here's the output! It's Aliens. Or Good Will Hunting, or anything else. Since each naive_movie has a random ID, the first movie returned with LIMIT 1 will depend on which IDs were generated in your case.
On the other hand, if you had specified predictible IDs like 0 to 100, then the first record returned will be the 0. The following query creates 100 records of table a with IDs from a:0 to a:100.
The output for the second statement above will always be this record.
The data on ratings is a bit more challenging, as it involves turning an object like this into separate fields with each Score cast into an integer. Each source has a different way to represent a score.
In addition, many Ratings objects don't have ratings from all three websites, so we can't guarantee that every movie will have three scores. We can show this by seeing how many NONE results show up at indexes 1 and 2 of the Ratings field. NONE is the value that shows up when no value is present.
By the way, here is how you can calculate the number of movies that don't have ratings from all three of these websites: run the query above on index 2, enclose it in parentheses, add a [WHERE Ratings = NONE] filter, and then use array::len() to get the length of the array.
We are going to work with this data by using a SELECT query with a WHERE clause, like WHERE Source = 'Metacritic'. First we will give this data the parameter $ratings.
If we were to use SELECT Score... on this, the return value would be an array with the matching objects and their Score field.
Since we only have one field and only care about its value, we can add the VALUE clause. Now it will return an array of the values, which in this case is just this string.
Finally, we can finish off with ONLY and LIMIT 1 as we learned above. This will only return a single string and nothing else.
Then we can use the string::replace() function to replace the '/100' part with nothing.
That output will be good enough to cast into a string.
With this logic, we can create three functions to do this with each type of reviewer. The functions will take an array<object> and return an option<number>. An option is a type that can either be something, or NONE. With this as the return type, we can safely return NONE if there is no review, and a value otherwise.
All the functions together look like this.
Turning the movie runtime values into durations will not be difficult, as they are all expressed in minutes with the same format:
The string::replace() function will be enough to turn the output from something like '133 min' to '133m', which SurrealDB will be able to cast into a datetime. A change from min to m will do the trick.
This will return '6m'.
Now let's give this a try with a single movie from the naive_movie records in the database.
These fields each contain a single string that holds the names of one or more people. We will use this later on to create person records that will be linked to these movies. In the meantime, we will use .split(', ') on these values to have a single string for each person and make use of it later.
Let's move on by taking a look at the data for genre, languages, and rated (the age suitability ranking for each movie). A quick query returning the languages and genre for three movies shows that this data is less than ideal.
The language and genre data is just a string - readable for human eyes, but not very useful at all for actual data analysis. Ideally, we would like to be able to use a query that shows us all the languages or genres present in all of our movies.
The first function we would like to use in such a query is one called array::distinct() that removes all duplicates. Here is a quick example of its behaviour:
But this function alone won't do the trick, because it will only check for distinct strings, not individual languages. Let's give this a try with eight movies:
Unfortunately, so many Genre values are unique strings made up of a combination of genres that array::distinct() hasn't filtered out all that much.
What we really want is to have each genre treated as a separate data point instead of these big combined strings.
Fortunately, each genre is separated by ', ', making it possible to use the string::split() function to turn this into an array.
Now, we can't call .split(', ') here because that would be trying to split on the entire return value of the SELECT statement, which is an array<string>.
Instead, we can use the .map() method to do something on each string inside. After .map you can see a small section with |$m| where you give each item a parameter name, after which .split() can be called on it. This is what is known as a closure, and you can read more about them here.
The output is a big array<array<string>> (an array of arrays of strings).
To flatten them into a single array<string> we can use a method called .flatten().
And then to remove the duplicate items we can call .distinct() at the end.
There is also a method called .group() that does the same thing by flattening and removing duplicates in a single call!
These values will be useful later on, because we can use them to ensure that any new movies added in the future will be one of these genres. We can use DEFINE PARAM followed by the parameter name $GENRE, followed by VALUE to set the value. And its value will be the query that we just used.
A database parameter is just a value that you can access using its name, so just typing this is enough to display all of the available genres.
We can do this with the age-related ratings for movies too. As the following query shows, there is only a small number of choices here:
We can use this as an assertion too. Let's put it into a parameter in the same way we did with $GENRES.
These two fields are not a huge challenge. We already made our own fn::date_to_datetime function to convert date formats like '21 Dec 1999' to a datetime, and the dollar values like '$136,381,073' for BoxOffice can be converted to numbers after removing the dollar sign and commas. But there is one more small item to note. You might have noticed this already when eyeballing the data, but this query will make it more obvious:
We can see that the database we got this information from uses N/A to represent a lack of data instead of something like NULL or NONE. So we will need to do a quick check for these fields to see if they are equal to 'N/A', and set them as NONE if that is the case.
Let's give this a try now by creating some movie records with just these two fields (plus a title).
We can also wrap this in a query of its own so that we can add ORDER BY box_office and LIMIT 5 to return only the five movies that made the most money.
That takes care of all of the fields that need work! There are a lot of other fields that don't need any modification such as title, plot, and poster (a url to an image of the movie's poster). But we will work with these fields soon on the schema level.
Putting everything together that we have so far, this single query will allow us to take each of the naive_movie records and turn them into movie records that are much easier to use for real data analysis.
Let's grab one movie to see what the new format looks like!
The output is much better than the naive_movie data. Each person involved is part of an array of strings, box office data is a proper number, release dates are datetimes, and the runtime is a duration.
That means that our attempt to order by runtime will now work. Let's give it a try!