# Collections

In Typesense, every record you index is called a Document and a group of documents with similar fields is called a Collection. A Collection is roughly equivalent to a table in a relational database.

# Create a collection

Before we can add documents to Typesense, we first need to create a Collection: we give it a name and describe the fields that will be indexed from our Documents. We call this definition the collection's schema, which is just a fancy term for the fields (and their data types) in your documents.

TIP

It might help to think of defining a collection "schema" as being similar to defining "types" in a strongly-typed programming language like TypeScript, C, Java, Dart, Rust, etc. This ensures that the documents you add to your collection have consistent data types and are validated, which helps prevent a whole class of errors you might typically see with mismatched or inconsistent data types across documents.

Organizing Collections

Read more on how to organize data into collections in this dedicated guide article: Organizing Collections.

There are two ways to specify a schema:

  1. Pre-define all the fields to be indexed from your documents OR
  2. Have Typesense automatically detect your fields and data types based on the documents you index.

The simplest option is #2 where you don't have to worry about defining an explicit schema. But if you need more fine-grained control and/or validation, you want to use #1 or even mix both together.

# With pre-defined schema

Let's first create a collection with an explicit, pre-defined schema.

This option gives you fine-grained control over your document fields' data types and configures your collection to reject documents that don't match the data types defined in your schema (by default).

If you want Typesense to automatically detect your schema for you, skip over to auto-schema detection.

See Schema Parameters for all available options, and Field Types for all available data types.

Definition

POST ${TYPESENSE_HOST}/collections
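As a rough sketch of what a request against this endpoint could look like, the snippet below builds (but does not send) a `POST /collections` call using only Python's standard library. The host and the `TYPESENSE_API_KEY` environment variable mirror the curl examples elsewhere on this page; the `company_name` field and the `build_create_request` helper are hypothetical names chosen for illustration.

```python
import json
import os
import urllib.request

# Assumptions: the host matches the curl examples on this page, and the API
# key is read from the TYPESENSE_API_KEY environment variable.
TYPESENSE_HOST = "http://localhost:8108"
API_KEY = os.environ.get("TYPESENSE_API_KEY", "xyz")

# Illustrative schema with explicitly typed fields (field list is hypothetical).
schema = {
    "name": "companies",
    "fields": [
        {"name": "company_name", "type": "string"},
        {"name": "num_employees", "type": "int32"},
        {"name": "country", "type": "string", "facet": True},
    ],
    "default_sorting_field": "num_employees",
}


def build_create_request(schema: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /collections request."""
    return urllib.request.Request(
        f"{TYPESENSE_HOST}/collections",
        data=json.dumps(schema).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-TYPESENSE-API-KEY": API_KEY,
        },
    )


request = build_create_request(schema)

# To actually create the collection against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping the request construction separate from sending it makes the payload easy to inspect before it reaches the server.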

IMPORTANT NOTE & TIP

All fields you mention in a collection's schema will be indexed in memory.

There might be cases where you don't intend to search / filter / facet / group by a particular field and just want it to be stored (on disk) and returned as is when a document is a search hit. For example, you can store image URLs in every document that you might use when displaying search results, but you might not want to text-search the actual URLs.

For such fields, either leave them out of the collection's schema or set index: false on them (see the fields schema parameter below) to mark them as unindexed. You can have any number of these additional unindexed fields in the documents you add to a collection: they will just be stored on disk and will not take up any memory.

# With auto schema detection

If your field names are dynamic and not known upfront, or if you just want to keep things simple and index all fields you send in your documents by default, auto-schema detection should help you.

You can define a wildcard field with the name .* and type auto to let Typesense automatically detect the type of the fields when you add documents to the collection. In fact, you can use any regular expression (RegEx) to define a field name.

When a .* field is defined this way, all the fields in a document are automatically indexed for searching and filtering.
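A minimal sketch of such a schema, expressed as a Python dictionary and serialized to the JSON body you would send when creating the collection. The collection name `companies` is only a placeholder.

```python
import json

# Hypothetical collection name; the single wildcard field is the one
# described above: name ".*" with type "auto".
auto_schema = {
    "name": "companies",
    "fields": [
        {"name": ".*", "type": "auto"},
    ],
}

# JSON body for the collection-creation request.
payload = json.dumps(auto_schema, indent=2)
print(payload)
```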

# Data Coercion

Say you've set type: auto for a particular field (or fields), e.g. popularity_score, in a collection, and the first document you send has an integer value for it, for example {"popularity_score": 100}.

Since popularity_score has type: auto, the data type will automatically be set to int64 internally.

What happens when the next document's popularity_score field is not an integer but a string, for example {"popularity_score": "200"}?

By default, Typesense will try to coerce (convert) the value to the previously inferred type. So in this example, since the first document had a numeric data-type for popularity_score, the second document's popularity_score field will be coerced to an integer from string.

However, this may not always work: for example, if the value contains letters, it can't be coerced to an integer. In such cases, when Typesense is unable to coerce the field value to the previously inferred type, indexing will fail with an appropriate error.
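The behavior described above can be pictured with a small toy model. This is only an illustration of the idea (a digit-only string can become an integer, a string with letters cannot); it is not Typesense's implementation, and the `coerce_to_int` helper and sample documents are hypothetical.

```python
def coerce_to_int(value):
    """Toy model of coercing a value to a previously inferred integer type.

    Illustration only: this is not how Typesense is implemented.
    """
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            pass
    raise ValueError(f"cannot coerce {value!r} to an integer")


first = {"popularity_score": 100}     # integer: the inferred type
second = {"popularity_score": "200"}  # string holding only digits
third = {"popularity_score": "high"}  # string containing letters

coerced = coerce_to_int(second["popularity_score"])

try:
    coerce_to_int(third["popularity_score"])
    rejected = False
except ValueError:
    # In the toy model, this corresponds to indexing failing with an error.
    rejected = True
```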

TIP

You can control this default coercion behavior at write-time with the dirty_values parameter.

# Faceting fields with auto-schema detection

Faceting is not enabled for a wildcard field {"name": ".*" , ...}, since that can consume a lot of memory, especially for large text fields. However, you can still explicitly define specific fields (with or without RegEx names) to facet by setting facet: true for them.

For example, when you define a schema like this:

{
  "name": "companies",
  "fields": [
    {
      "name": ".*_facet",
      "type": "auto",
      "facet": true
    }
  ]
}

With this schema, only fields whose names end with _facet will be set as facets.

# Geopoint and auto-schema detection

A geopoint field requires an explicit type definition, as the geo field value is represented as a 2-element float field and we cannot differentiate between a lat/long definition and an actual float array.

# Indexing all but some fields

If you want to index all fields in a document except for a few, you can use the {"index": false, "optional": true} settings to exclude those fields.

Note: it is not currently possible to exclude a mandatory field from indexing, which is why the field must also be set as optional.

For example, to index all fields except those whose names start with description_, define a wildcard .* field of type auto alongside a description_.* field that sets "index": false and "optional": true.
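A sketch of one such schema, written as a Python dictionary and serialized to JSON. The collection name and the explicit `country` field are illustrative; `country` is included to show an explicit definition sitting next to the wildcard.

```python
import json

# Illustrative schema: index everything, except fields whose names start
# with "description_", and give "country" an explicit type.
schema = {
    "name": "companies",  # hypothetical collection name
    "fields": [
        {"name": ".*", "type": "auto"},
        {
            "name": "description_.*",
            "type": "auto",
            "index": False,     # stored on disk, not indexed
            "optional": True,   # unindexed fields must be optional
        },
        {"name": "country", "type": "string", "facet": True},
    ],
}

payload = json.dumps(schema, indent=2)
print(payload)
```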

TIP

You can mix auto-schema detection with explicit field definitions.

If an explicit definition is available for a field (country in the example above), Typesense will give preference to that before falling back to the wildcard definition.

When such an explicit field definition is not available, the first document that contains a field with a given name determines the type of that field.

For example, if you index a document with a field named title and it is a string, then the next document that contains the field named title will be expected to have a string too.

# Schema parameters

Parameter Required Description
name yes Name of the collection you wish to create.
fields yes A list of fields that you wish to index for querying, filtering and faceting. For each field, you have to specify at least its name and type.

Eg: {"name": "title", "type": "string", "facet": false, "index": true}

name can be a simple string like "name": "score". You can also use a RegEx to specify field names matching a pattern. For example, if you want to specify that all fields starting with score_ should be an integer, you can set name as "name": "score_.*".

Declaring a field as optional
A field can be declared as optional by setting "optional": true.

Declaring a field as a facet
A field can be declared as a facetable field by setting "facet": true. Faceted fields are indexed verbatim without any tokenization or preprocessing. For example, if you are building a product search, color and brand could be defined as facet fields. Once a field is enabled for faceting in the schema, it can be used in the facet_by search parameter.

Declaring a field as un-indexed
You can set a field as un-indexed by setting "index": false. This is useful when used along with auto schema detection and you need to exclude certain fields from indexing.

Configuring language-specific tokenization:
The default tokenizer that Typesense uses works for most languages, especially ones that separate words by spaces. However, based on feedback from users, we've added locale-specific customizations for the following languages. You can enable these customizations for a field by setting a property called locale inside the field definition. For example, {name: 'title', type: 'string', locale: 'ja'} will enable the Japanese locale customizations for the field named title.

Here's the list of all language-specific customizations:
  • ja - Japanese
  • zh - Chinese
  • ko - Korean
  • th - Thai
  • el - Greek
  • ru - Russian
  • sr - Serbian / Cyrillic
  • uk - Ukrainian
  • be - Belarusian
For all other languages, you don't have to set the locale field.

token_separators no List of symbols or special characters to be used for splitting the text into individual words, in addition to space and new-line characters.

For example, you can add - (hyphen) to this list so that a word like non-stick is split on the hyphen and indexed as two separate words.

symbols_to_index no List of symbols or special characters to be indexed.

For example, you can add + to this list to make the word c++ indexable verbatim.

default_sorting_field no The name of an int32 / float field that determines the order in which the search results are ranked when a sort_by clause is not provided during searching. This field must indicate some kind of popularity. For example, in a product search application, you could define the num_reviews field as the default_sorting_field.

Additionally, when a word in a search query matches multiple possible words (either because of a typo or during a prefix search), this parameter is used to rank such equally matching tokens. For example, both "john" and "joan" are one typo away from "jofn". Similarly, in a prefix search, both "apple" and "apply" would match the prefix "app". In these cases, the default_sorting_field is used as the tie-breaker for ranking.

# Field types

Typesense allows you to index the following types of fields:

| type | Description |
| --- | --- |
| string | String values |
| string[] | Array of strings |
| int32 | Integer values up to 2,147,483,647 |
| int32[] | Array of int32 |
| int64 | Integer values larger than 2,147,483,647 |
| int64[] | Array of int64 |
| float | Floating point / decimal numbers |
| float[] | Array of floating point / decimal numbers |
| bool | true or false |
| bool[] | Array of booleans |
| geopoint | Latitude and longitude specified as [lat, lng] |
| geopoint[] | Arrays of latitude and longitude specified as [[lat1, lng1], [lat2, lng2]] |
| object | Nested objects |
| object[] | Arrays of nested objects |
| string* | Special type that automatically converts values to a string or string[]. |
| auto | Special type that automatically attempts to infer the data type based on the documents added to the collection. See automatic schema detection. |

# Cloning a collection schema

Here's how you can clone an existing collection's schema (documents are not copied), overrides and synonyms.

curl -k "http://localhost:8108/collections?src_name=existing_coll" -X POST -H "Content-Type: application/json" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
        "name": "new_coll"
      }'

The above API call will create a new collection called new_coll that contains the schema, overrides and synonyms of the collection existing_coll. The actual documents in the existing_coll collection are not copied, so this is primarily useful for creating new collections from an existing reference template.

TIP

Cloning a collection this way does not copy the data.

# Notes on indexing common types of data

Here's how to index other common types of data, using the basic primitives in the table above:

# Indexing nested fields

Typesense supports indexing nested objects (and arrays of objects) from v0.24.

You must first enable nested fields at a collection level via the enable_nested_fields schema property:

curl -k "http://localhost:8108/collections" -X POST -H "Content-Type: application/json" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
        "name": "docs", 
        "enable_nested_fields": true,
        "fields": [
          {"name": ".*", "type": "auto"}
        ]
      }'

The schema can also explicitly index specific object fields or object arrays, e.g.:

curl -k "http://localhost:8108/collections" -X POST -H "Content-Type: application/json" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
        "name": "docs", 
        "enable_nested_fields": true,
        "fields": [
          {"name": "person", "type": "object"},
          {"name": "details", "type": "object[]"}
        ]
      }'

When you now search on an object field name, all sub-fields will be automatically searched. Use a dot notation to refer to specific sub-fields, e.g. person.last_name or person.school.name.

Indexing nested objects via flattening

You can also flatten objects and arrays of objects into top-level keys before sending the data into Typesense.

For example, a document like this containing nested objects:

{
  "nested_field": {
    "field1": "value1",
    "field2": ["value2", "value3", "value4"],
    "field3": {
      "fieldA": "valueA",
      "fieldB": ["valueB", "valueC", "valueD"]
    }
  }
}  

would need to be flattened as:

{
  "nested_field.field1": "value1",
  "nested_field.field2":  ["value2", "value3", "value4"],
  "nested_field.field3.fieldA": "valueA",
  "nested_field.field3.fieldB": ["valueB", "valueC", "valueD"]
}

before indexing it into Typesense.

To simplify traversing the data in the results, you might want to send both the flattened and unflattened version of the nested fields into Typesense, and only set the flattened keys as indexed in the collection's schema and use them for search/filtering/faceting. At display time when parsing the results, you can then use the nested version.
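The flattening step can be sketched with a short recursive helper. The `flatten` function below is a hypothetical illustration, not part of any Typesense client: it joins nested dictionary keys with dots and leaves arrays untouched, which matches the example above. Arrays of objects would need extra handling that is not shown here.

```python
def flatten(document: dict, parent_key: str = "") -> dict:
    """Flatten nested dictionaries into dot-separated top-level keys.

    Lists are kept as-is; only nested dictionaries are expanded.
    """
    flat = {}
    for key, value in document.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat


document = {
    "nested_field": {
        "field1": "value1",
        "field2": ["value2", "value3", "value4"],
        "field3": {
            "fieldA": "valueA",
            "fieldB": ["valueB", "valueC", "valueD"],
        },
    }
}

flattened = flatten(document)

# Send both versions: flattened keys for search / filtering / faceting,
# and the original nested object for display.
combined = {**document, **flattened}
```

In `combined`, the original `nested_field` object sits next to the four dotted keys, so only the dotted keys need to appear in the collection's schema.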

# Indexing Dates

Dates need to be converted into Unix timestamps and stored as int64 fields in Typesense. Most languages have libraries that help do this conversion for you.

You'll then be able to use numerical operators like <, >, etc to filter records that are before or after or between dates.
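As one possible way to do the conversion in Python, the sketch below uses the standard `datetime` module. The `to_unix_timestamp` helper and the `published_at` field name are hypothetical, and treating naive datetimes as UTC is an assumption you may want to change for your data.

```python
from datetime import datetime, timezone


def to_unix_timestamp(dt: datetime) -> int:
    """Convert a datetime to whole seconds since the Unix epoch.

    Assumption: naive datetimes (no tzinfo) are interpreted as UTC.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


published_at = to_unix_timestamp(datetime(2024, 1, 1, tzinfo=timezone.utc))
# published_at is 1704067200

# Hypothetical document with the date stored as an integer (int64 field).
document = {"title": "Example", "published_at": published_at}

# A filter expression for "after 1 Jan 2024" could then compare numbers.
filter_by = f"published_at:>{published_at}"
```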

# Indexing other types of data

Read our dedicated guide article on how to index other common types of data like emails, phone numbers, SKUs, model numbers, etc here.

# Retrieve a collection

Retrieve the details of a collection, given its name.

Definition GET ${TYPESENSE_HOST}/collections/:collection
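A minimal sketch of building this request with Python's standard library. The `collection_request` helper is hypothetical; the host and API key handling are assumptions that mirror the curl examples on this page, and nothing is sent until `urlopen` is called.

```python
import json
import os
import urllib.parse
import urllib.request

# Assumptions: local server as in the curl examples, API key from the env.
TYPESENSE_HOST = "http://localhost:8108"
API_KEY = os.environ.get("TYPESENSE_API_KEY", "xyz")


def collection_request(method, name=None):
    """Build (but do not send) a request against /collections[/:collection]."""
    path = "/collections"
    if name is not None:
        path += "/" + urllib.parse.quote(name, safe="")
    return urllib.request.Request(
        TYPESENSE_HOST + path,
        method=method,
        headers={"X-TYPESENSE-API-KEY": API_KEY},
    )


retrieve = collection_request("GET", "companies")

# To actually fetch the collection details from a running server:
# with urllib.request.urlopen(retrieve) as response:
#     print(json.load(response))
```

The same helper also covers the other Definitions on this page that share this path: `collection_request("GET")` for listing all collections and `collection_request("DELETE", "companies")` for dropping one.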

# List all collections

Returns a summary of all your collections. The collections are returned sorted by creation date, with the most recent collections appearing first.

Definition GET ${TYPESENSE_HOST}/collections

# Drop a collection

Permanently drops a collection. This action cannot be undone. For large collections, this might have an impact on read latencies.

Definition DELETE ${TYPESENSE_HOST}/collections/:collection

# Update or alter a collection

Typesense supports adding or removing fields to a collection's schema in-place.

TIP

Typesense supports updating all fields except the id field (since it's a special field within Typesense).

Let's see how we can add a new company_category field to the companies collection and also drop the existing num_employees field.

Definition PATCH ${TYPESENSE_HOST}/collections/:collection
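A sketch of the change described above (adding company_category and dropping num_employees) as a PATCH request built with Python's standard library. The `build_update_request` helper is hypothetical, the host and API key handling are assumptions taken from the curl examples on this page, and the `string` type for company_category follows the curl example further down.

```python
import json
import os
import urllib.request

# Assumptions: local server as in the curl examples, API key from the env.
TYPESENSE_HOST = "http://localhost:8108"
API_KEY = os.environ.get("TYPESENSE_API_KEY", "xyz")

# Add one field and drop another in a single change set.
schema_change = {
    "fields": [
        {"name": "company_category", "type": "string"},
        {"name": "num_employees", "drop": True},
    ]
}


def build_update_request(collection: str, change: dict) -> urllib.request.Request:
    """Build (but do not send) the PATCH /collections/:collection request."""
    return urllib.request.Request(
        f"{TYPESENSE_HOST}/collections/{collection}",
        data=json.dumps(change).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "X-TYPESENSE-API-KEY": API_KEY,
        },
    )


request = build_update_request("companies", schema_change)

# To apply the change, send it with a generous timeout (in seconds):
# with urllib.request.urlopen(request, timeout=600) as response:
#     print(json.load(response))
```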

TIP

The schema update is a synchronous blocking operation. When the update is in progress, all incoming writes and reads to that particular collection will wait for the schema update to finish. So, we recommend updating fields one at a time, especially for large collections and during off-peak hours.

Alternatively, you can also use the alias feature to do zero downtime schema changes.

The update operation consists of an initial validation step where the records on-disk are assessed to ensure that they are compatible with the proposed schema change. For example, let's say there is a string field A which is already present in the documents on-disk but is not part of the schema. If you try to update the collection schema by adding a field A with type integer, the validation step will reject this change as it's incompatible with the type of data already present.

If the validation is successful, the actual schema change is done and the records are indexed / re-indexed / dropped as per the requested change. The process is complete as soon as the API call returns (make sure you use a large client timeout value).

# Modifying an existing field

Since Typesense currently only supports adding/deleting a field, any modifications to an existing field should be expressed as a drop + add operation. All fields except the id field can be modified.

For example, to add a facet property to the company_category field, we will drop + add it in the same change set:

curl "http://localhost:8108/collections/companies" \
       -X PATCH \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '{
         "fields": [
           {"name": "company_category", "drop": true },
           {"name": "company_category", "type": "string", "facet": true }   
         ]
       }'

# Using an alias

If you need to do zero-downtime schema changes, you could also re-create the collection fully and use the Collection Alias feature to do a zero-downtime switch over to the new collection:

  1. Create your collection as usual with a timestamped name. For example: movies_jan_1
  2. Create an alias pointing to your collection. For example: create an alias called movies pointing to movies_jan_1
  3. Use the collection alias in your application to search / index documents in your collection.
  4. When you need to make schema changes, create a new timestamped collection with the updated collection schema, for example movies_feb_1, and reindex your data in it.
  5. Update the collection alias to point to the new collection. For example: update movies to point to movies_feb_1.
  6. Drop the old collection, movies_jan_1 in our example.

Once you update the alias, any search / indexing operations will go to the new collection (eg: movies_feb_1) without you having to do any application-side changes.
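The alias steps above could be sketched as follows. This assumes the alias is created or updated with `PUT /aliases/:alias` and a `collection_name` body, so please confirm against the Collection Alias reference before relying on it. The `build_alias_request` helper is hypothetical, and the host and API key handling mirror the curl examples on this page.

```python
import json
import os
import urllib.request

# Assumptions: local server as in the curl examples, API key from the env.
TYPESENSE_HOST = "http://localhost:8108"
API_KEY = os.environ.get("TYPESENSE_API_KEY", "xyz")


def build_alias_request(alias: str, collection: str) -> urllib.request.Request:
    """Build (but do not send) a request pointing `alias` at `collection`.

    Assumed endpoint: PUT /aliases/:alias with a collection_name body.
    """
    return urllib.request.Request(
        f"{TYPESENSE_HOST}/aliases/{alias}",
        data=json.dumps({"collection_name": collection}).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-TYPESENSE-API-KEY": API_KEY,
        },
    )


# Step 2: point the alias at the first timestamped collection.
initial = build_alias_request("movies", "movies_jan_1")

# Step 5: after reindexing into movies_feb_1, repoint the same alias.
switch = build_alias_request("movies", "movies_feb_1")

# Each request would be sent with urllib.request.urlopen(...) against a
# running server; the application keeps using the name "movies" throughout.
```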

# Dynamic field additions

If you only need to add new fields to the schema on the fly, we recommend using auto-schema detection when creating the collection. You can essentially define RegEx field names and when documents containing new field names that match the RegEx come in, the new fields will automatically be added to the schema.

Last Updated: 3/31/2024, 9:38:47 AM