# Recommendations
Typesense can generate recommendations based on the actions users take in a given session, using the Vector Search feature.
This involves building a Machine Learning model to generate embeddings, storing them in Typesense, and then doing a nearest-neighbor search in Typesense.
In this article, we'll talk about how to use the [Starspace](https://github.com/facebookresearch/Starspace) ML model to generate embeddings.
Transformers4Rec is another ML model that can be used for this use case, among others.
# Scenario
We'll use an e-commerce products dataset in this article for illustration, but the concepts below can be applied to any domain (e.g. recommending articles, movies, or any type of records stored in Typesense).
# Step 1: Prepare training dataset
We'll be using Starspace to build our ML model.
Starspace expects the training dataset in the following format: one line per user session, listing the set of items the user interacted with in that session.
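For example, a training file containing two sessions (the ones described below) would look like this:

```
apple orange banana broccoli mango
cereals soda bread nuts cookies
```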
In the example above, the first line indicates that a certain user interacted with (viewed, bought, added to cart, etc.) the products `apple`, `orange`, `banana`, `broccoli` and `mango` in a single session. Another user (or maybe even the same user as above) interacted with the products `cereals`, `soda`, `bread`, `nuts` and `cookies` in another session.
TIP

We're using the `product_name` in this example to make this article easier to read. In a production setting, you'd want to use the product's ID or SKU in the training dataset.
# Step 2: Setup Starspace
# Install system dependencies
Ensure that you have a C++11 compiler (gcc-4.6.3 or newer, clang-3.3 or newer, or Visual Studio 2015).
On macOS you'd need to install Xcode, and on Linux distros you'd need to install the `build-essential` package from your distro's package manager.
# Clone Starspace source code
```bash
git clone https://github.com/facebookresearch/Starspace.git
cd Starspace
```
# Setup Boost
Boost is a library required by Starspace.
From inside the `Starspace` directory above, run the following:
```bash
curl -LO https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/boost_1_82_0.tar.gz
tar -xzvf boost_1_82_0.tar.gz
```
# Compile Starspace
From inside the `Starspace` directory above, run the following:
```bash
make -e BOOST_DIR=boost_1_82_0 && \
  make embed_doc -e BOOST_DIR=boost_1_82_0
```
To verify that Starspace is working fine, run `./starspace`. You should see output similar to the following:
```bash
$ ./starspace
Usage: need to specify whether it is train or test.

       "starspace train ..."  or  "starspace test ..."
...
```
# Step 3: Train Starspace model
Name the file with your training dataset from Step 1 `session-data.txt`.
Then run the following command to train your model:
```bash
./starspace train \
  -trainFile <path/to/session-data.txt> \
  -model productsModel \
  -label '' \
  -trainMode 1 \
  -epoch 25 \
  -dim 100
```
Once this command runs, it will generate two files: a binary file called `productsModel` and a TSV file with model weights called `productsModel.tsv`.
You'll want to re-run this training step periodically as you collect new session data from your site / app.
# Step 4: Generate embeddings
First, extract all the unique products from our training dataset:
```bash
export unique_items=$(tr ' ' '\n' < session-data.txt | sed '/^$/d' | sort -u)
export output_jsonl_file="products-with-embeddings.jsonl"
```
For each product, generate embeddings and store them in a JSONL file:
```bash
echo -n > ${output_jsonl_file}

while read -r item; do
  embedding=$(echo "${item}" | ./embed_doc productsModel | tail -1 | tr ' ' ',')
  echo "{\"id\":\"${item}\",\"embedding\":[${embedding%?}]}" >> "${output_jsonl_file}"
done <<< "${unique_items}"
```
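A note on the `${embedding%?}` expansion above: `embed_doc` prints the embedding's values separated (and terminated) by whitespace, so after `tr ' ' ','` the string ends with a stray comma, and `%?` strips that final character. A minimal, self-contained sketch with made-up values:

```bash
# Illustrative values, not real model output; note the trailing comma
item="apple"
embedding="0.1,0.2,0.3,"

# ${embedding%?} removes the last character (the trailing comma),
# producing a valid JSON array
echo "{\"id\":\"${item}\",\"embedding\":[${embedding%?}]}"
```

This prints `{"id":"apple","embedding":[0.1,0.2,0.3]}`.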
This will generate a JSONL file that looks like this:
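Each line holds one product's `id` and its 100-dimensional `embedding`; the values below are made up and truncated for readability:

```
{"id":"apple","embedding":[-0.0862,0.1279,0.0558,...]}
{"id":"banana","embedding":[0.0478,-0.1310,0.0886,...]}
```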
We can now ingest this JSONL file into Typesense.
# Step 5: Index the embeddings in Typesense
```bash
export TYPESENSE_API_KEY=xyz
export TYPESENSE_URL='https://xyz.a1.typesense.net'
```
Create a collection:
```bash
curl "${TYPESENSE_URL}/collections" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
    "name": "products",
    "fields": [
      {"name": "name", "type": "string", "optional": true},
      {"name": "description", "type": "string", "optional": true},
      {"name": "price", "type": "float", "optional": true},
      {"name": "categories", "type": "string[]", "optional": true},
      {"name": "embedding", "type": "float[]", "num_dim": 100}
    ]
  }'
```
Notice how we've set `num_dim: 100`. This corresponds to the `-dim 100` parameter we set when training our Starspace model.
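If the number of values in an imported document's `embedding` doesn't match `num_dim`, the import will fail, so it can be worth sanity-checking the JSONL file first. A minimal sketch using standard shell tools, run here against a hypothetical shortened line rather than the real file:

```bash
# Hypothetical JSONL line with a 3-dimensional embedding;
# for the real file you'd read lines from products-with-embeddings.jsonl
line='{"id":"apple","embedding":[0.1,0.2,0.3]}'

# Extract the array contents and count the comma-separated values
dims=$(echo "${line}" | sed 's/.*\[\(.*\)\].*/\1/' | tr ',' '\n' | wc -l)
echo "${dims}"   # 3 here; should be 100 for the real data
```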
Import the JSONL file with embeddings into the collection:
```bash
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -X POST \
  -T products-with-embeddings.jsonl \
  "${TYPESENSE_URL}/collections/products/documents/import?action=emplace"
```
We're only inserting the embeddings for each product here. You can also import other fields like name, description, price, and categories separately, using the same `id` field as a reference to populate the rest of each product record.
# Step 6: Generate recommendations
Now let's say a user lands on our site / app and interacts with the following products in their session:
```
mango broccoli milk
```
To generate recommendations based on this session data, let's first generate embeddings:
```bash
export embedding=$(echo "mango broccoli milk" | ./embed_doc productsModel | tail -1 | tr ' ' ',')

echo ${embedding}
```

```
-0.0862846,0.127956,0.0558543,0.0745331,0.02449,-0.131018,0.0886827,-0.0571893,-0.0398686,-0.0116799,-0.0164978,-0.173818,0.0478985,0.109211,-0.0826394,-0.177671,-0.219366,0.180478,-0.0140154,-0.0237589,-0.010896,0.115979,-0.044924,0.129452,-0.0111529,-0.0978542,-0.121468,-0.0700872,-0.0190036,0.116127,0.0617186,-0.0463324,-0.172141,0.0302211,0.0610366,-0.0831281,0.04558,-0.00370933,-0.107602,-0.0394414,0.0334175,0.0429023,0.133572,-0.124658,0.225743,-0.0156787,-0.284864,0.148183,-0.0508378,0.175489,-0.0417769,-0.0920536,-0.0443016,-0.0838343,-0.0694042,-0.0333535,-0.108574,-0.0894618,-0.022049,-0.0500605,-0.0234268,0.00732048,0.0817547,0.00764651,0.0285933,0.100818,-0.229398,0.0508415,0.117766,-0.0289333,-0.0493134,0.167664,0.0696889,0.115228,-0.0609508,-0.12562,-0.0450054,-0.0648439,0.0817176,0.169663,0.133255,-0.111001,-0.0467052,-0.0373238,0.005385,0.111311,-0.0171787,0.0311545,0.0474074,-0.0301008,-0.0555648,0.0776044,-0.0287841,-0.162136,-0.0511268,0.174767,-0.0169033,-0.0223623,-0.140496,0.154727
```
Now we can send the embedding generated for this user's session data to Typesense as a `vector_query` to do a nearest-neighbor search, which will return the list of products to recommend to this user:
```bash
curl "${TYPESENSE_URL}/multi_search" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
    "searches": [
      {
        "q": "*",
        "collection": "products",
        "vector_query": "embedding:(['${embedding%?}'], k:10, distance_threshold: 1)"
      }
    ]
  }' | jq '.results[0].hits[] | .document.id'
```
This will return the following recommendations, which we can show this user in our UI, after filtering out items they've already seen in this session:
```
"broccoli"
"mango"
"banana"
"apple"
"orange"
"tissue"
"detergent"
"cheese"
"milk"
"butter"
```
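Filtering out items the user has already seen can be done client-side. A minimal sketch in shell, with the recommendation list and session items hardcoded from the example above (for brevity, only the first few recommendations are included):

```bash
# Post-process recommendations: drop items the user has already
# interacted with in this session.
recommendations='broccoli
mango
banana
apple'
session_items='mango broccoli milk'

# -F fixed strings, -x whole-line match, -v invert the match;
# printf turns the space-separated session items into one pattern per line
echo "${recommendations}" | grep -Fxv "$(printf '%s\n' ${session_items})"
```

This prints `banana` and `apple`, the recommendations the user hasn't seen yet.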
TIP
We're using the product's name in this example to make this article easier to read. In a production setting, you'd want to use the product's ID or SKU in the training dataset and to generate embeddings, like this:

```
sku_1 sku_4 sku_5
sku_5 sku_8 sku_1 sku_2 sku_10
sku_5 sku_1 sku_4 sku_21 sku_22
```