# Conversational Search (RAG)

Typesense can respond to free-form questions with conversational answers, while maintaining context for follow-up questions and answers.

Think of this feature as a ChatGPT-style Q&A interface, but with the data you've indexed in Typesense. 

Typesense uses a technique called [Retrieval Augmented Generation](https://www.promptingguide.ai/techniques/rag) (RAG) to enable this style of conversational searches. 

Instead of having to build your own RAG pipeline, Typesense has RAG built in, using its [Vector Store](./vector-search.md) for [semantic search](../../guide/semantic-search.md) and its pre-built integration with LLMs for formulating conversational responses.

## Create a Conversation History collection

Let's start by creating a Typesense collection called `conversation_store` that will store the conversation history generated by the conversational search feature.

The collection can be named anything, but it must use the exact schema below, since Typesense populates this collection internally as conversations happen.

```shell
curl "http://localhost:8108/collections" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '{
        "name": "conversation_store",
        "fields": [
            {
                "name": "conversation_id",
                "type": "string"
            },
            {
                "name": "model_id",
                "type": "string"
            },
            {
                "name": "timestamp",
                "type": "int32"
            },
            {
                "name": "role",
                "type": "string",
                "index": false
            },
            {
                "name": "message",
                "type": "string",
                "index": false
            }
        ]
    }'
```

## Create a Conversation Model

Once we've created the conversation history collection above, we can then create a conversation model resource using a large language model (LLM) of our choice.

Typesense currently supports the following LLM platforms:

- [OpenAI](https://platform.openai.com/docs/models/models-overview)
- [Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models)
- [Google](https://ai.google.dev/gemini-api/docs/text-generation)
- [Cloudflare Workers AI](https://developers.cloudflare.com/workers-ai/models/#text-generation)
- [vLLM](https://github.com/vllm-project/vllm) (useful when running local LLMs)

Here's how to use each of these platforms to create a Conversational Model. (Use the tabs in the code snippet below to navigate between each platform).

<Tabs :tabs="['OpenAI', 'Azure', 'Google', 'Cloudflare', 'vLLM']">

<template v-slot:OpenAI>

```shell
curl 'http://localhost:8108/conversations/models' \
  -X POST \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
        "id": "conv-model-1",
        "model_name": "openai/gpt-3.5-turbo",
        "history_collection": "conversation_store",
        "api_key": "OPENAI_API_KEY",
        "system_prompt": "You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.",
        "max_bytes": 16384
      }'
```

</template>

<template v-slot:Azure>

```shell
curl 'http://localhost:8108/conversations/models' \
  -X POST \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
        "id": "conv-model-1",
        "model_name": "azure/gpt-35-turbo",
        "history_collection": "conversation_store",
        "api_key": "AZURE_OPENAI_API_KEY",
        "url": "https://your_resource.openai.azure.com/openai/deployments/your_deployment/chat/completions?api-version=2024-02-15-preview",
        "system_prompt": "You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.",
        "max_bytes": 16384
      }'
```

</template>

<template v-slot:Google>

```shell
curl 'http://localhost:8108/conversations/models' \
  -X POST \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
        "id": "conv-model-1",
        "model_name": "google/gemini-2.0-flash",
        "history_collection": "conversation_store",
        "api_key": "GEMINI_API_KEY",
        "system_prompt": "You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.",
        "max_bytes": 16384
      }'
```

</template>

<template v-slot:Cloudflare>

```shell
curl 'http://localhost:8108/conversations/models' \
  -X POST \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
        "id": "conv-model-1",
        "model_name": "cloudflare/@cf/mistral/mistral-7b-instruct-v0.1",
        "history_collection": "conversation_store",
        "api_key": "CLOUDFLARE_API_KEY",
        "account_id": "CLOUDFLARE_ACCOUNT_ID",
        "system_prompt": "You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.",
        "max_bytes": 16384
      }'
```

</template>

<template v-slot:vLLM>

```shell
curl 'http://localhost:8108/conversations/models' \
  -X POST \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
        "id": "conv-model-1",
        "model_name": "vllm/NousResearch/Meta-Llama-3-8B-Instruct",
        "history_collection": "conversation_store",
        "vllm_url": "http://localhost:8000",
        "system_prompt": "You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.",
        "max_bytes": 16384
      }'
```

</template>
</Tabs>

#### Parameters

| Parameter          | Description                                                                                                                                               |
|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| model_name         | Name of the LLM model offered by OpenAI, Azure OpenAI, Google, Cloudflare or vLLM                                                                         |
| api_key            | The LLM service's API Key                                                                                                                                 |
| history_collection | Typesense collection that stores the historical conversations                                                                                             |
| account_id         | LLM service's account ID (only applicable for Cloudflare)                                                                                                 |
| url                | The Azure OpenAI endpoint URL (only applicable for Azure OpenAI)                                                                                          |
| system_prompt      | The system prompt that contains special instructions to the LLM                                                                                           |
| ttl                | Time interval in seconds after which the messages would be deleted. Default: `86400` (24 hours)                                                           |
| max_bytes          | The maximum number of bytes to send to the LLM in every API call. Consult the LLM's documentation on the number of bytes supported in the context window. |
| vllm_url           | URL of vLLM service                                                                                                                                       |
| openai_url         | Base URL of OpenAI API endpoint (only applicable for OpenAI)                                                                                              |
| openai_path        | URL path of OpenAI API endpoint (only applicable for OpenAI)                                                                                              |
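
Note that `max_bytes` caps the number of *bytes* (not characters) sent to the LLM per call, so multi-byte text consumes the budget faster than its character count suggests. Here's a minimal sketch of checking a payload against that budget (the helper below is our own illustration, not part of any Typesense client):

```typescript
// Illustrative helper (not a Typesense API): does a serialized context
// payload fit within a model's configured max_bytes budget?
// Byte length is measured after UTF-8 encoding, so multi-byte characters
// count for more than one byte each.
function fitsInContextBudget(payload: string, maxBytes: number): boolean {
  return new TextEncoder().encode(payload).length <= maxBytes;
}

fitsInContextBudget("a".repeat(10000), 16384); // true: 10,000 bytes
fitsInContextBudget("é".repeat(10000), 16384); // false: 20,000 bytes in UTF-8
```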

**Response:**

```json
{
  "api_key": "sk-7K**********************************************",
  "id": "conv-model-1",
  "max_bytes": 16384,
  "model_name": "openai/gpt-3.5-turbo",
  "history_collection": "conversation_store",
  "system_prompt": "You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic."
}
```

:::tip
If you don't pass an explicit `id` for the model, the API will return a response with an auto-generated conversation model `id`, which you can use in subsequent search queries.
:::

## Start a Conversation

Once we've created a conversation model, we can start a conversation using the [search](./federated-multi-search.md) endpoint and the following _search_ parameters:

- `conversation = true`
- `conversation_model_id = X`
- `q = <Any conversational question>`
- `query_by = <an auto-embedding field>`

Where `X` is the auto-generated Conversation Model ID returned by Typesense in the step above and `query_by` is an [auto-embedding field](./vector-search.md#option-b-auto-embedding-generation-within-typesense).
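
Putting those parameters together, the search URL can be assembled with standard `URLSearchParams` (the helper below is our own sketch, not part of any client library):

```typescript
// Sketch: assemble the multi_search URL for a conversational query.
// conversationId is optional; pass it only when continuing a conversation
// (see "Follow-up Questions" below).
function conversationalSearchUrl(
  base: string,
  q: string,
  modelId: string,
  conversationId?: string
): string {
  const params = new URLSearchParams({
    q,
    conversation: "true",
    conversation_model_id: modelId,
  });
  if (conversationId) params.set("conversation_id", conversationId);
  return `${base}/multi_search?${params.toString()}`;
}

conversationalSearchUrl("http://localhost:8108", "can you suggest an action series", "conv-model-1");
// → "http://localhost:8108/multi_search?q=can+you+suggest+an+action+series&conversation=true&conversation_model_id=conv-model-1"
```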

Here's an example, where we ask the question "Can you suggest an action series?" in the `q` parameter, using data we've indexed in a collection called `tv_shows` in Typesense.

```shell
curl 'http://localhost:8108/multi_search?q=can+you+suggest+an+action+series&conversation=true&conversation_model_id=conv-model-1' \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '{
              "searches": [
                {
                  "collection": "tv_shows",
                  "query_by": "embedding",
                  "exclude_fields": "embedding"
                }
              ]
            }'
```

:::warning IMPORTANT
It's important to specify `"exclude_fields": "embedding"` in the search request, so that the raw floating point numbers aren't sent to the LLM unnecessarily, which would consume most of the context window.

You also want to remove any other fields that might not be relevant to generating a conversational response by specifying them in `exclude_fields`.
:::

**Response:**

Typesense will now return a new field in the search API response called `conversation`.

You'd display the `conversation.answer` key to your users as the response to their question.

```json
{
  "conversation": {
    "answer": "I would suggest \"Starship Salvage\", a sci-fi action series set in a post-war galaxy where a crew of salvagers faces dangers and ethical dilemmas while trying to rebuild.",
    "conversation_history": {
      "conversation": [
        {
          "user": "can you suggest an action series"
        },
        {
          "assistant": "I would suggest \"Starship Salvage\", a sci-fi action series set in a post-war galaxy where a crew of salvagers faces dangers and ethical dilemmas while trying to rebuild."
        }
      ],
      "id": "771aa307-b445-4987-b100-090c00a13f1b",
      "last_updated": 1694962465,
      "ttl": 86400
    },
    "conversation_id": "771aa307-b445-4987-b100-090c00a13f1b",
    "query": "can you suggest an action series"
  },
  "results": [
    {
      "facet_counts": [],
      "found": 10,
      "hits": [
        ...
      ],
      "out_of": 47,
      "page": 1,
      "request_params": {
        "collection_name": "tv_shows",
        "per_page": 10,
        "q": "can you suggest an action series"
      },
      "search_cutoff": false,
      "search_time_ms": 3908
    }
  ]
}
```

:::tip Excluding Conversation History
You can exclude conversation history from the search API response by setting `exclude_fields: conversation_history` as a search parameter.
:::

:::tip Multi-Search
When using the `multi_search` endpoint with the Conversations feature, the `q` parameter has to be set as a query parameter and not as a body parameter inside a particular search.

You can search multiple collections within the `multi_search` endpoint, and Typesense will use the top results from each collection when communicating with the LLM.
If your `multi_search` request includes multiple searches with `conversation=true` (either through the common query parameters or within individual searches), only a single conversation object will be returned in the response. Typesense will use results from all searches in the `multi_search` to generate one answer based on data from multiple collections.
:::

:::tip Auto-Embedding Model
In our experience, we've found that models that are specifically meant for Q&A use-cases (like the `ts/all-MiniLM-L12-v2` S-BERT model) perform well for conversations. You can also use OpenAI's text embedding models.
:::

## Streaming Conversations

You can enable streaming responses from the LLM by setting `conversation_stream=true` as a query parameter. This allows you to build interactive chat experiences where responses appear gradually as they are generated.

Here's an example of how to use streaming:

```shell
curl 'http://localhost:8108/multi_search?q=can+you+suggest+an+action+series&conversation=true&conversation_model_id=conv-model-1&conversation_stream=true' \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '{
              "searches": [
                {
                  "collection": "tv_shows",
                  "query_by": "embedding",
                  "exclude_fields": "embedding"
                }
              ]
            }'
```

When streaming is enabled, the response will be sent as a Server-Sent Events (SSE) stream. Each event will contain a chunk of the response as it's being generated by the LLM. The events will have the following format:

```
data: {"conversation": {"message": "I would suggest", "conversation_id": "771aa307-b445-4987-b100-090c00a13f1b"}}

data: {"conversation": {"message": " \"Starship Salvage\"", "conversation_id": "771aa307-b445-4987-b100-090c00a13f1b"}}

data: {"conversation": {"message": ", a sci-fi action series", "conversation_id": "771aa307-b445-4987-b100-090c00a13f1b"}}

...

data: [DONE]
```

Each event contains:

- `message`: A piece of the response being streamed.
- `conversation_id`: The ID of the conversation, which you can use for follow-up questions.

The final response will still include the complete conversation object with the full answer and conversation history, just like in non-streaming mode.
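
If you're consuming the raw SSE stream yourself (without a client library), each event is a `data:` line containing JSON, terminated by a `data: [DONE]` sentinel. Here's a minimal sketch of folding the streamed chunks into the full answer (the parsing helper is our own illustration, not part of any Typesense client):

```typescript
// Sketch: accumulate streamed SSE events into the complete answer.
// Assumes each event is a single `data: <json>` line, as shown above.
function accumulateAnswer(events: string[]): string {
  let answer = "";
  for (const event of events) {
    const payload = event.replace(/^data:\s*/, "").trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const parsed = JSON.parse(payload);
    answer += parsed.conversation.message;
  }
  return answer;
}

const events = [
  'data: {"conversation": {"message": "I would suggest", "conversation_id": "771aa307"}}',
  'data: {"conversation": {"message": " \\"Starship Salvage\\"", "conversation_id": "771aa307"}}',
  "data: [DONE]",
];
accumulateAnswer(events); // → 'I would suggest "Starship Salvage"'
```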

:::tip
If you're using the [TypeScript client](https://github.com/typesense/typesense-js), you can use the `streamConfig` parameter to handle streaming events:

```typescript
const searchParameters = {
  q: 'can you suggest an action series',
  conversation: true,
  conversation_model_id: 'conv-model-1',
  conversation_stream: true,
  streamConfig: {
    onChunk: chunk => {
      // Handle each chunk of the response
      console.log(chunk.conversation.message)
    },
    onError: error => {
      // Handle any errors
      console.error(error)
    },
    onComplete: response => {
      // Handle the complete response
      console.log(response)
    },
  },
}
```

For multisearch requests, the `streamConfig` parameter is part of the `commonParams` object:

```typescript
type Product = {
  id: string
  title: string
  description: string
}

type Store = {
  id: string
  name: string
}

await client.multiSearch.perform<[Product, Store]>(
  {
    searches: [
      {
        collection: 'products',
        query_by: 'title',
      },
      {
        collection: 'stores',
        query_by: 'name',
      },
    ],
  },
  {
    per_page: 10,
    q: 'Raisin',
    conversation: true,
    conversation_stream: true,
    conversation_model_id: 'conv-model-1',
    streamConfig: {
      onChunk: result => {
        console.log(result)
      },
      onComplete: results => {
        console.log(results.results[0].hits) // SearchResponseHit<Product>[] | undefined
        console.log(results.results[1].hits) // SearchResponseHit<Store>[] | undefined
      },
    },
  },
)
```

:::

## Follow-up Questions

We can continue a conversation that we started previously and ask follow-up questions, using the `conversation_id` parameter returned by Typesense when the conversation was started.

Continuing our example from above, let's ask the follow-up question - "How about another one" in the `q` parameter:

```shell
curl 'http://localhost:8108/multi_search?q=how+about+another+one&conversation=true&conversation_model_id=conv-model-1&conversation_id=771aa307-b445-4987-b100-090c00a13f1b' \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '{
              "searches": [
                {
                  "collection": "tv_shows",
                  "query_by": "embedding",
                  "exclude_fields": "embedding"
                }
              ]
            }'
```

Notice the addition of the `conversation_id` as a query parameter above. 

This parameter causes Typesense to include prior context when communicating with the LLM.

**Response:**

```json
{
  "conversation": {
    "answer": "Sure! How about \"Galactic Quest\"? It could follow a group of intergalactic adventurers as they travel through different planets and encounter various challenges and mysteries along the way.",
    "conversation_history": {
      "conversation": [
        {
          "user": "can you suggest an action series"
        },
        {
          "assistant": "I would suggest \"Starship Salvage\", a sci-fi action series set in a post-war galaxy where a crew of salvagers faces dangers and ethical dilemmas while trying to rebuild."
        },
        {
          "user": "how about another one"
        },
        {
          "assistant": "Sure! How about \"Galactic Quest\"? It follows a group of intergalactic adventurers as they travel through different planets and encounter various challenges and mysteries along the way."
        }
      ],
      "id": "771aa307-b445-4987-b100-090c00a13f1b",
      "last_updated": 1694963173,
      "ttl": 86400
    },
    "conversation_id": "771aa307-b445-4987-b100-090c00a13f1b",
    "query": "how about another one"
  },
  "results": [
    {
      "facet_counts": [],
      "found": 10,
      "hits": [
        ...
      ],
      "out_of": 47,
      "page": 1,
      "request_params": {
        "collection_name": "tv_shows",
        "per_page": 10,
        "q": "how about another one"
      },
      "search_cutoff": false,
      "search_time_ms": 3477
    }
  ]
}
```

Under the hood, for each follow-up question, Typesense makes an API call to the LLM to generate a standalone question that captures all relevant context from the conversation history, using the following prompt:

```markdown
Rewrite the follow-up question on top of a human-assistant conversation history as a standalone question that encompasses all pertinent context.

<Conversation history>
{actual conversation history without system prompts}

<Question>
{follow up question}

<Standalone Question>
```

The generated standalone question will be used for semantic/hybrid search within the collection, and the results will then be forwarded to the LLM as context for answering the generated standalone question.
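
To make the template concrete, here's an illustrative sketch of how it could be filled in from a conversation history (this helper is our own approximation for explanation, not Typesense's internal implementation):

```typescript
// Sketch: fill in the standalone-question prompt template shown above.
// History entries alternate between user and assistant turns.
type Turn = { role: "user" | "assistant"; message: string };

function standaloneQuestionPrompt(history: Turn[], followUp: string): string {
  const rendered = history.map((t) => `${t.role}: ${t.message}`).join("\n");
  return [
    "Rewrite the follow-up question on top of a human-assistant conversation history as a standalone question that encompasses all pertinent context.",
    "",
    "<Conversation history>",
    rendered,
    "",
    "<Question>",
    followUp,
    "",
    "<Standalone Question>",
  ].join("\n");
}

standaloneQuestionPrompt(
  [
    { role: "user", message: "can you suggest an action series" },
    { role: "assistant", message: 'I would suggest "Starship Salvage" ...' },
  ],
  "how about another one"
);
```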

:::tip Context Window Limits
Although we retain the entire conversation history in Typesense, only the most recent 3000 tokens (approximately 1200 characters) of the conversation history will be sent for generating the standalone question due to the context limit.

Similar to the conversation history, only the top search results, limited to 3000 tokens, will be sent along with the standalone question.
:::

## Managing Past Conversations

Typesense stores all questions and answers (the conversation history) in the Typesense collection specified by the 
`history_collection` parameter. Each conversation has a `conversation_id` associated with it and is persisted 
for 24 hours by default to support follow-ups. This persistence period can be configured via the model's `ttl` parameter.

You can use the collection APIs to manage these messages. Since the conversation history is not tied to a 
specific search collection, it can be reused across different search collections at any time.
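
For example, with the default `ttl`, a conversation's messages expire 24 hours after its `last_updated` timestamp. A small sketch of checking this client-side before attempting a follow-up (the helper is our own, not a Typesense API):

```typescript
// Sketch: is a conversation still within its ttl window?
// lastUpdated and now are Unix timestamps in seconds, matching the
// last_updated and ttl fields in the search API response.
function conversationAlive(lastUpdated: number, ttl: number, now: number): boolean {
  return now < lastUpdated + ttl;
}

conversationAlive(1694962465, 86400, 1694962465 + 3600);  // true: one hour later
conversationAlive(1694962465, 86400, 1694962465 + 90000); // false: past 24 hours
```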

## Managing Conversation Models

### Retrieve all models

```shell
curl 'http://localhost:8108/conversations/models' \
  -X GET \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
```

### Retrieve a single model

```shell
curl 'http://localhost:8108/conversations/models/conv-model-1' \
  -X GET \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
```

### Update a model

You can update the model parameters like this:

```shell
curl 'http://localhost:8108/conversations/models/conv-model-1' \
  -X PUT \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -d '{
        "id": "conv-model-1",
        "model_name": "openai/gpt-3.5-turbo",
        "history_collection": "conversation_store",
        "api_key": "OPENAI_API_KEY",
        "system_prompt": "Hey, you are an **intelligent** assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.",
        "max_bytes": 16384
      }'
```

### Delete a model

```shell
curl 'http://localhost:8108/conversations/models/conv-model-1' \
  -X DELETE \
  -H 'Content-Type: application/json' \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
```
