# Recommendations

Typesense can generate recommendations based on the actions users take in a given session, using the Vector Search feature.

This involves building a Machine Learning model to generate embeddings, storing them in Typesense, and then doing a nearest-neighbor search in Typesense.

In this article, we'll talk about how to use the Starspace ML model to generate embeddings.

Transformers4Rec is another ML model that can be used for this use-case, among others.

# Scenario

We'll use an e-commerce products dataset in this article for illustration, but the concepts below can be applied to any domain (e.g. recommending articles, movies, or any other type of record stored in Typesense).

# Step 1: Prepare training dataset

We'll be using Starspace to build our ML model.

Starspace expects the training dataset in the following format: one line per user session, listing the items the user interacted with in that session, separated by spaces.

apple orange banana broccoli mango
cereals soda bread nuts cookies

In the example above, the first line indicates that a certain user interacted with (viewed, bought, added to cart, etc) the products apple, orange, banana, broccoli and mango in a single session.

Another user (or maybe even the same user as above) interacted with the products cereals, soda, bread, nuts and cookies in another session.

TIP

We're using the product_name in this example to make this article easier to read. In a production setting, you'd want to use the product's ID or SKU in the training dataset.
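
If your interaction data is logged one event per row, it has to be grouped by session before Starspace can use it. The following is a minimal sketch, assuming a hypothetical events.csv file with `session_id,product_id` rows and no header. The file name, the column layout, and the `build_sessions` helper are illustrative assumptions, not part of Starspace or Typesense.

```shell
# Hypothetical helper: turn "session_id,product_id" rows into the
# one-session-per-line format described above.
# Assumes IDs contain no commas or spaces.
build_sessions() {
  awk -F',' '
    {
      if ($1 in items) {
        items[$1] = items[$1] " " $2      # append to an existing session
      } else {
        order[++n] = $1                   # remember first-seen session order
        items[$1] = $2
      }
    }
    END { for (i = 1; i <= n; i++) print items[order[i]] }
  ' "$1"
}

# Usage (events.csv is an assumed file name):
#   build_sessions events.csv > session-data.txt
```

Sessions are printed in the order they first appear in the input, and items within a session keep their original order.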

# Step 2: Set up Starspace

# Install system dependencies

Ensure that you have a C++11 compiler (gcc-4.6.3 or newer or Visual Studio 2015 or clang-3.3 or newer).

On macOS, you'd need to install Xcode, and on Linux distros, you'd need to install the build-essential package from your distro's package manager.

# Clone Starspace source code

git clone https://github.com/facebookresearch/Starspace.git
cd Starspace

# Set up Boost

Boost is a library required by Starspace.

From inside the Starspace directory above, run the following:

curl -LO https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/boost_1_82_0.tar.gz
tar -xzvf boost_1_82_0.tar.gz

# Compile Starspace

From inside the Starspace directory above, run the following:

make -e BOOST_DIR=boost_1_82_0 && \
  make embed_doc -e BOOST_DIR=boost_1_82_0

To verify that Starspace compiled correctly, run ./starspace. You should see output similar to the following:

$ ./starspace
Usage: need to specify whether it is train or test.

"starspace train ..."  or "starspace test ..."
...

# Step 3: Train Starspace model

Save the training dataset from Step 1 in a file named session-data.txt.

Then run the following command to train your model:

./starspace train \
  -trainFile <path/to/session-data.txt> \
  -model productsModel \
  -label '' \
  -trainMode 1 \
  -epoch 25 \
  -dim 100

Once this command completes, it will have generated two files: a binary file called productsModel and a TSV file with model weights called productsModel.tsv.

You'll want to re-run this training step periodically as you collect new session data from users of your site / app.

# Step 4: Generate embeddings

First extract all the unique products from our training dataset:

export unique_items=$(tr ' ' '\n' < session-data.txt | sed '/^$/d' | sort -u)
export output_jsonl_file="products-with-embeddings.jsonl"

For each product, generate embeddings and store them in a JSONL file:

echo -n > ${output_jsonl_file}

while read -r item; do
    embedding=$(echo "${item}" | ./embed_doc productsModel | tail -1 | tr ' ' ',')
    echo "{\"id\":\"${item}\",\"embedding\":[${embedding%?}]}" >> "${output_jsonl_file}"
done <<< "${unique_items}"

This will generate a JSONL file with one line per product, each of the form {"id":"<product>","embedding":[<100 comma-separated floats>]}.

We can now ingest this JSONL file into Typesense.

# Step 5: Index the embeddings in Typesense

export TYPESENSE_API_KEY=xyz
export TYPESENSE_URL='https://xyz.a1.typesense.net'

Create a collection:

curl "${TYPESENSE_URL}/collections" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '{
         "name": "products",
         "fields": [
           {"name": "name", "type": "string", "optional": true},
           {"name": "description", "type": "string", "optional": true},
           {"name": "price", "type": "float", "optional": true},
           {"name": "categories", "type": "string[]", "optional": true},
           {"name": "embedding", "type": "float[]", "num_dim": 100 }
         ]
       }'

Notice how we've set num_dim: 100. This corresponds to the -dim 100 parameter we set when training our Starspace model; the two values must match.
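
Documents whose embedding length differs from num_dim are likely to be rejected at import time, so it can help to check the JSONL file first. The following is a minimal sketch. The `check_embedding_dims` helper is a hypothetical name, and it assumes the exact line layout produced in Step 4 rather than parsing arbitrary JSON.

```shell
# Hypothetical helper: report lines whose embedding does not have the
# expected number of values. Assumes the exact line layout from Step 4.
check_embedding_dims() {
  # $1: JSONL file, $2: expected number of dimensions
  awk -v dim="$2" '
    {
      key = "\"embedding\":["
      start = index($0, key)
      if (start == 0) {
        print "line " NR ": no embedding field"
        bad = 1
        next
      }
      values = substr($0, start + length(key))
      sub(/\].*/, "", values)             # drop the closing bracket and the rest
      n = split(values, parts, ",")
      if (n != dim) {
        print "line " NR ": expected " dim " values, found " n
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Usage:
#   check_embedding_dims products-with-embeddings.jsonl 100
```

The function prints one message per offending line and returns a non-zero exit status if any line fails the check, so it can gate the import step in a script.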

Import the JSONL file with embeddings into the collection:

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
      -X POST \
      -T products-with-embeddings.jsonl \
      "${TYPESENSE_URL}/collections/products/documents/import?action=emplace"

We're only inserting the embeddings for each product here. You can import other values such as name, description, price, and categories separately, using the same id as the reference to populate the rest of the product record.
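
As a rough illustration of importing the remaining fields, the sketch below converts a hypothetical tab-separated products.tsv file (columns: id, name, price) into JSONL that reuses the same id values. The file names and the `products_tsv_to_jsonl` helper are assumptions made for this example; the import call itself is the same one used above.

```shell
# Hypothetical helper: convert tab-separated "id<TAB>name<TAB>price" rows
# into JSONL documents keyed by the same id used for the embeddings.
# Assumes values contain no double quotes, backslashes, or tabs.
products_tsv_to_jsonl() {
  awk -F'\t' '{ printf "{\"id\":\"%s\",\"name\":\"%s\",\"price\":%s}\n", $1, $2, $3 }' "$1"
}

# Usage (products.tsv and products-metadata.jsonl are assumed file names):
#   products_tsv_to_jsonl products.tsv > products-metadata.jsonl
#   curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
#        -X POST \
#        -T products-metadata.jsonl \
#        "${TYPESENSE_URL}/collections/products/documents/import?action=emplace"
```

Because the embedding field is not optional in the schema above, this only works as intended for ids that already have an embedding in the collection.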

# Step 6: Generate recommendations

Now let's say a user lands on our site / app and interacts with the following products in their session:

mango broccoli milk

To generate recommendations based on this session data, let's first generate embeddings:

export embedding=$(echo "mango broccoli milk" | ./embed_doc productsModel | tail -1 | tr ' ' ',')
echo ${embedding}
-0.0862846,0.127956,0.0558543,0.0745331,0.02449,-0.131018,0.0886827,-0.0571893,-0.0398686,-0.0116799,-0.0164978,-0.173818,0.0478985,0.109211,-0.0826394,-0.177671,-0.219366,0.180478,-0.0140154,-0.0237589,-0.010896,0.115979,-0.044924,0.129452,-0.0111529,-0.0978542,-0.121468,-0.0700872,-0.0190036,0.116127,0.0617186,-0.0463324,-0.172141,0.0302211,0.0610366,-0.0831281,0.04558,-0.00370933,-0.107602,-0.0394414,0.0334175,0.0429023,0.133572,-0.124658,0.225743,-0.0156787,-0.284864,0.148183,-0.0508378,0.175489,-0.0417769,-0.0920536,-0.0443016,-0.0838343,-0.0694042,-0.0333535,-0.108574,-0.0894618,-0.022049,-0.0500605,-0.0234268,0.00732048,0.0817547,0.00764651,0.0285933,0.100818,-0.229398,0.0508415,0.117766,-0.0289333,-0.0493134,0.167664,0.0696889,0.115228,-0.0609508,-0.12562,-0.0450054,-0.0648439,0.0817176,0.169663,0.133255,-0.111001,-0.0467052,-0.0373238,0.005385,0.111311,-0.0171787,0.0311545,0.0474074,-0.0301008,-0.0555648,0.0776044,-0.0287841,-0.162136,-0.0511268,0.174767,-0.0169033,-0.0223623,-0.140496,0.154727

Now we can send the embedding generated for this user's session data to Typesense as a vector_query to do a nearest neighbor search, which will return the list of products to recommend to this user:

curl "${TYPESENSE_URL}/multi_search" \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '{
          "searches": [
            {
              "q": "*",
              "collection": "products",
              "vector_query": "embedding:(['${embedding%?}'], k:10, distance_threshold: 1)"
            }
          ]
        }' | jq '.results[0].hits[] | .document.id'

This will return the following recommendations, which we can show this user in our UI, after filtering out items they've already seen in this session:

"broccoli"
"mango"
"banana"
"apple"
"orange"
"tissue"
"detergent"
"cheese"
"milk"
"butter"

TIP

We're using the product's name in this example to make this article easier to read. In a production setting, you'd want to use the product's ID or SKU, both in the training dataset and when generating embeddings, like this:

sku_1 sku_4 sku_5
sku_5 sku_8 sku_1 sku_2 sku_10
sku_5 sku_1 sku_4 sku_21 sku_22
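
Step 6 mentions filtering out items the user has already seen in the session before showing recommendations. The following is a minimal sketch of that filter. The `filter_seen` helper is a hypothetical name, and it assumes the IDs arrive one per line without surrounding quotes (for example, by using `jq -r` instead of `jq` in the search command).

```shell
# Hypothetical helper: drop items the user already interacted with.
# $1: space-separated items from the current session
# stdin: one recommended ID per line, best match first
filter_seen() {
  awk -v session="$1" '
    BEGIN {
      n = split(session, s, " ")
      for (i = 1; i <= n; i++) seen[s[i]] = 1
    }
    !($0 in seen)                         # print only unseen items, keeping order
  '
}

# Usage with the example from Step 6:
#   printf "%s\n" broccoli mango banana apple orange tissue detergent cheese milk butter \
#     | filter_seen "mango broccoli milk"
# prints banana, apple, orange, tissue, detergent, cheese, butter (one per line)
```

The ranking returned by Typesense is preserved; only the items in the current session are removed.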

Last Updated: 8/31/2023, 11:03:18 AM