# Search for Documentation Sites

The good folks over at Algolia have built and open-sourced DocSearch, a suite of tools built specifically to index data from a documentation site and then quickly add a search bar to the site.

This article will show you how to use a customized version of DocSearch that works with Typesense. In fact, the search bar you see on Typesense's own documentation site is built with this customized version of DocSearch.

Typesense's customized version of DocSearch is made up of two components:

  1. typesense-docsearch-scraper - A web scraper that scans your documentation site and indexes the content in Typesense.
  2. typesense-docsearch.js - A JavaScript library that adds a search bar to your documentation site. When end-users start typing into the search bar, it queries the content index built by the DocSearch scraper.

TIP: Usage on Non-Documentation Sites

Even though DocSearch was originally built for documentation sites, it can actually be used for any site that has structured, hierarchical, and consistent HTML markup across pages.

# Step 1: Set up DocSearch Scraper

First, we need to set up the scraper to point to your documentation site. Running the scraper builds a search index from the content of your website and uploads it to your Typesense server. That's what makes your website searchable!

# Create a DocSearch Scraper Config File

Follow one of the templates below to create your own config.json file, pointing to your documentation site:

Here's the official DocSearch Scraper documentation that describes all the available config options.

Note

Algolia's DocSearch repositories are archived because Algolia migrated to its proprietary closed-source crawler in February 2022 and no longer maintains the open-source version.

Given this, Typesense intends to maintain and develop a fork, so you can safely ignore the deprecation warnings in Algolia's documentation.

In the long term, we intend to move all the documentation into Typesense repositories to avoid this confusion.

# Make necessary changes to the DocSearch Scraper config file

After starting with one of the templates, you will want to change a few fields in the configuration:

  • index_name - Should be a unique string that identifies your website. This corresponds to typesenseCollectionName in the front-end configuration further down below.
  • start_urls - This corresponds to the URL for your website.
  • stop_urls - An array of URLs to ignore. For example, if you have a change log on your website, you might want to ignore it so that it does not interfere with the search results for actual content.
  • sitemap_urls - (Docusaurus-only) You will need to change this URL to match, just like you changed the start_urls. (This XML file is automatically generated by Docusaurus during its build process.)
  • lvl1 - (Docusaurus-only) Change header h1 to article h1, header h1.
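
As a rough illustration of the fields above, the following Node.js sketch assembles a Docusaurus-style config object and prints it as JSON. The helper name buildScraperConfig, the example URLs, and the stop_urls entry are hypothetical, and the sitemap location assumes the site is served from the root path. Start from one of the official templates for the full set of selectors.

```javascript
// Hypothetical helper: fills in only the fields discussed above for a Docusaurus site.
// The remaining keys from the template (other selectors, etc.) are omitted here.
function buildScraperConfig({ indexName, siteUrl, stopUrls = [] }) {
  // Normalize the base URL so that it ends with exactly one slash.
  const base = siteUrl.replace(/\/+$/, '') + '/';
  return {
    index_name: indexName,                // must match typesenseCollectionName in the front-end
    start_urls: [base],
    stop_urls: stopUrls,
    sitemap_urls: [base + 'sitemap.xml'], // assumed location of the generated sitemap
    selectors: {
      lvl1: 'article h1, header h1',      // the Docusaurus-only change described above
    },
  };
}

const exampleConfig = buildScraperConfig({
  indexName: 'my-docs',
  siteUrl: 'https://docs.example.com',
  stopUrls: ['https://docs.example.com/changelog'],
});
console.log(JSON.stringify(exampleConfig, null, 2));
```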

Tip: Scraping a site running on localhost

If your documentation site is running on localhost and you're using Docker to run the scraper, you will need to change a few things in your config.json file.

In start_urls and sitemap_urls, target the host.docker.internal hostname so that the scraper finds the site on your host machine instead of looking for it inside the container.

You will also need to serve your site on port 80, because the scraper can behave unexpectedly when the site is hosted on another port.
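
A minimal sketch of that URL rewrite, assuming Node.js and its built-in URL class. The function name toDockerHostUrl is hypothetical and is only meant for preparing the values that go into start_urls and sitemap_urls.

```javascript
// Rewrites a localhost URL so that a scraper running inside Docker reaches the host machine.
// URLs that do not point at localhost are returned unchanged.
function toDockerHostUrl(url) {
  const parsed = new URL(url);
  if (parsed.hostname === 'localhost' || parsed.hostname === '127.0.0.1') {
    parsed.hostname = 'host.docker.internal';
  }
  return parsed.href;
}

console.log(toDockerHostUrl('http://localhost/docs/'));
console.log(toDockerHostUrl('http://localhost/sitemap.xml'));
```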

TIP

There is a mismatch between index_name in the scraper config and typesenseCollectionName in the front-end config. This is because Algolia calls a collection of documents an "index" and Typesense calls a collection of documents a collection. The scraper was originally forked from Algolia and the name was deliberately kept to maintain backwards compatibility with the ecosystem.

TIP

If you look at the logs of your Typesense instance, you might see that it reports the index/collection name as something like foo_1675838072 instead of foo. This is because every time the crawler runs:

  • It creates a new collection called: foo_<current_unix_timestamp>
  • It creates/updates an alias called foo that points to: foo_<current_unix_timestamp>
  • It deletes the previously scraped version of the docs, stored in: foo_<previous_timestamp>

For this reason, when configuring your front-end search engine, you should specify the index/collection name as foo instead of foo_<unix_timestamp>.
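
The three steps can be pictured with a small in-memory model. This is an illustration only, not the scraper's actual code. The function name simulateScraperRun and the shape of the state object are assumptions made for this sketch.

```javascript
// In-memory illustration only. `state` holds the collection names
// and a map from alias name to the collection it points to.
function simulateScraperRun(state, indexName, unixSeconds) {
  const newCollection = `${indexName}_${unixSeconds}`;
  const previous = state.aliases[indexName]; // undefined on the first run
  const collections = state.collections
    .filter((name) => name !== previous)     // step 3: drop the previously scraped version
    .concat(newCollection);                  // step 1: add the new timestamped collection
  const aliases = { ...state.aliases, [indexName]: newCollection }; // step 2: repoint the alias
  return { collections, aliases };
}

let state = { collections: [], aliases: {} };
state = simulateScraperRun(state, 'foo', 1675838072);
state = simulateScraperRun(state, 'foo', 1675924472);
console.log(state);
```

After both runs only one timestamped collection remains, and the alias foo points to it, which is why the front-end can keep using the stable name foo.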

# Add DocSearch Meta Tags (optional)

The scraper automatically extracts information from the DocSearch meta tags and attaches the content value to all records extracted on the page. This is a great way to filter searches on custom attributes.

<meta name="docsearch:{$NAME}_tag" content="{$CONTENT}" />

Example: If you have the following markup across a certain set of pages:

<meta name="docsearch:language_tag" content="en" />
<meta name="docsearch:version_tag" content="1.2.4" />

All extracted records on these pages will have a language_tag attribute with a value of en and a version_tag attribute with a value of 1.2.4, which you can use in filter_by to restrict the search to particular sets of records.

TIP

_tag must be appended to the end of the $NAME variable for the attribute to be saved in the schema.
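
As a sketch of how these attributes could feed a search filter, the helper below turns tag values into a filter_by expression using exact matches (:=) joined with &&. The name buildFilterBy is hypothetical, and values containing special characters may need additional escaping that is not handled here.

```javascript
// Builds a filter_by expression from tag attributes.
// The "_tag" suffix is appended when the given name lacks it.
function buildFilterBy(tags) {
  return Object.entries(tags)
    .map(([name, value]) => {
      const field = name.endsWith('_tag') ? name : `${name}_tag`;
      return `${field}:=${value}`;
    })
    .join(' && ');
}

console.log(buildFilterBy({ language: 'en', version: '1.2.4' }));
// prints: language_tag:=en && version_tag:=1.2.4
```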

# Run the Scraper

The easiest way to run the scraper is using Docker.

  1. Install Docker.

  2. Install jq.

  3. Make sure your Typesense server is operational.

  4. Create a .env file with the following contents, replacing them with the correct values for your particular situation:

    TYPESENSE_API_KEY=xyz
    TYPESENSE_HOST=xxx.a1.typesense.net
    TYPESENSE_PORT=443
    TYPESENSE_PROTOCOL=https
    

    TIP

    If you are self-hosting Typesense, then you can usually find your API key and port number in the /etc/typesense/typesense-server.ini file.

    The host will be equal to the FQDN or IP address of your server.

    By default, self-hosted Typesense uses HTTP, so you might need to change https to http. (Unless of course you specified ssl-certificate and ssl-certificate-key in your ini file.)

    TIP

    If you are running Typesense on localhost and you're using Docker to run the scraper, using TYPESENSE_HOST=localhost will not work because localhost in this context refers to localhost within the container. Instead, you want the scraper running inside the Docker container to be able to connect to Typesense running outside the Docker container on your host. Follow the instructions here to use the appropriate hostname to refer to your Docker host. For example, on macOS you want to use: TYPESENSE_HOST=host.docker.internal

  5. Run the scraper:

    docker run -it --env-file=/path/to/your/.env -e "CONFIG=$(cat config.json | jq -r tostring)" typesense/docsearch-scraper:0.9.1
    

This will scrape your documentation site and index it into Typesense.

TIP

The Docker command above will run the scraper in interactive mode, outputting logs to stdout.

If needed, you can send the output to both stdout and a file at the same time by adding | tee scraper-output.txt to the end of the command. This is helpful because the output can be very verbose.

You can also run the scraper as a daemon by substituting the -it flags with -d (detached mode).

# Tips for common challenges or more complex use-cases

Below are some tips for common challenges when running the scraper inside a Docker container:

# Passing a config file location, rather than a config string

The example above uses the jq tool to parse the config file into a JSON string prior to passing it as the CONFIG environment variable.

If you don't have jq available, it's good to know that you can also pass the location of the config file to the CONFIG variable, and then the file will be read from this location.

Just make sure that the config is available inside the container. In other words, you'll need to volume mount it, like in the example below:

docker run -it \
  -v "/path/to/config/dir/on/your/machine:/tmp/search" \
  -e "CONFIG=/tmp/search/typesense.json" \
  typesense/docsearch-scraper:0.9.1
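
If jq is unavailable but Node.js is, the compact single-line string can also be produced with a few lines of JavaScript. This is a sketch under that assumption. The function name toConfigString and the sample config text are hypothetical.

```javascript
// Produces the same kind of compact, single-line JSON that `jq -r tostring` emits,
// so the result can be used as the value of the CONFIG environment variable.
// JSON.parse throws on malformed input, which surfaces config typos early.
function toConfigString(configFileText) {
  return JSON.stringify(JSON.parse(configFileText));
}

const fileText = `{
  "index_name": "docs",
  "start_urls": ["https://docs.example.com/"]
}`;
console.log(toConfigString(fileText));
```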

# Trusting certificates from internal CAs

If you're trying to scrape a website that is secured with a certificate from an internal CA (common for corporate intranets, for example), you will need to make the container trust this CA. To do so, you can mount a file with trusted CAs and then pass its path on the command line.

In the example below, a file in the current folder named ca-chain.crt will be added to the trusted CA list:

docker run -it \
  --mount type=bind,source="$(pwd)/ca-chain.crt",target=/etc/ssl/certs/ca-certificates.crt \
  --env "REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" \
  --env-file=/path/to/your/.env  \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  typesense/docsearch-scraper:0.9.1

# Set environment variables on the command line, rather than using a .env file

If you don't want to use a .env file or cannot use one in your setup, you can also pass all variables on the command line:

docker run -it \
  -e "TYPESENSE_API_KEY=xyz" \
  -e "TYPESENSE_HOST=xxx.a1.typesense.net" \
  -e "TYPESENSE_PORT=443" \
  -e "TYPESENSE_PROTOCOL=https" \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  typesense/docsearch-scraper:0.9.1
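
When the command is assembled by a script rather than typed by hand, the repeated -e options can be generated from a settings object. The sketch below assumes Node.js. The helper name envToDockerFlags is hypothetical.

```javascript
// Turns a plain object of settings into repeated `-e NAME=value` arguments for `docker run`.
function envToDockerFlags(env) {
  return Object.entries(env).flatMap(([name, value]) => ['-e', `${name}=${value}`]);
}

const flags = envToDockerFlags({
  TYPESENSE_API_KEY: 'xyz',
  TYPESENSE_HOST: 'xxx.a1.typesense.net',
  TYPESENSE_PORT: 443,
  TYPESENSE_PROTOCOL: 'https',
});
console.log(flags.join(' '));
```

Keeping the arguments as an array, rather than one long string, avoids shell-quoting problems if the array is later handed to a process launcher.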

# Resolving hosts

If your scraper depends on host resolution that is not available inside the container, you can add a host entry on the command line:

docker run -it \
  --add-host intranet.company.com:10.1.2.3 \
  --env-file=/path/to/your/.env  \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  typesense/docsearch-scraper:0.9.1

# Authentication

If you're looking to scrape content that requires authentication, there are a number of options that are supported out of the box:

# Basic HTTP authentication

To use this authentication, set these environment variables:

  • DOCSEARCH_BASICAUTH_USERNAME
  • DOCSEARCH_BASICAUTH_PASSWORD

# Cloudflare Zero Trust (CF)

To use this authentication, set these environment variables:

  • CF_ACCESS_CLIENT_ID
  • CF_ACCESS_CLIENT_SECRET

# Google Identity-Aware Proxy (IAP)

To use this authentication, set these environment variables:

  • IAP_AUTH_CLIENT_ID
  • IAP_AUTH_SERVICE_ACCOUNT_JSON

# Keycloak (KC)

To use this authentication, set these environment variables:

  • KC_URL
  • KC_REALM
  • KC_CLIENT_ID
  • KC_CLIENT_SECRET
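
Before launching the scraper, it can help to check that every variable for the chosen method is set. The following is a minimal sketch. The method keys (basic, cloudflare, iap, keycloak) and the function name missingAuthVariables are labels invented for this example, while the variable names are the ones listed above.

```javascript
// Environment variables per authentication method, as listed above.
const AUTH_VARIABLES = {
  basic: ['DOCSEARCH_BASICAUTH_USERNAME', 'DOCSEARCH_BASICAUTH_PASSWORD'],
  cloudflare: ['CF_ACCESS_CLIENT_ID', 'CF_ACCESS_CLIENT_SECRET'],
  iap: ['IAP_AUTH_CLIENT_ID', 'IAP_AUTH_SERVICE_ACCOUNT_JSON'],
  keycloak: ['KC_URL', 'KC_REALM', 'KC_CLIENT_ID', 'KC_CLIENT_SECRET'],
};

// Returns the names of required variables that are unset or empty in `env`.
function missingAuthVariables(method, env) {
  const required = AUTH_VARIABLES[method];
  if (!required) {
    throw new Error(`Unknown authentication method: ${method}`);
  }
  return required.filter((name) => !env[name]);
}

const missing = missingAuthVariables('keycloak', {
  KC_URL: 'https://sso.example.com', // hypothetical value
  KC_REALM: 'docs',                  // hypothetical value
});
console.log(missing); // [ 'KC_CLIENT_ID', 'KC_CLIENT_SECRET' ]
```

In a real script, process.env would be passed in place of the literal object.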

# Integrate With CI / Deploy It to a Server

If you are setting up Typesense for the first time, then skip down to the next section. But once you have confirmed that the scraper works and confirmed that your website has coherent search results, you should set things up so that your website can get continually scraped.

TIP

In Typesense Cloud, we only host your Typesense cluster for you. You are still responsible for running the scraper to update your index in your CI pipeline / infrastructure.

The scraper Docker container is stateless, so it can run on any platform that supports running stateless Docker containers.

We recommend running the scraper in CI so that your search index always stays up-to-date (as opposed to, for example, a cron job that runs once a day).

# Step 2: Add a Search Bar to your Documentation Site

# Option A: Docusaurus-powered sites

If you use Docusaurus as your documentation framework, use the docusaurus-theme-search-typesense plugin to add a search bar to your Docusaurus site.

$ npm install docusaurus-theme-search-typesense@next --save

# or

$ yarn add docusaurus-theme-search-typesense@next

# or

$ pnpm add docusaurus-theme-search-typesense@next

Add the following to your docusaurus.config.js file:

{
  themes: ['docusaurus-theme-search-typesense'],
  themeConfig: {
    typesense: {
      // Replace this with the name of your index/collection.
      // It should match the "index_name" entry in the scraper's "config.json" file.
      typesenseCollectionName: 'docusaurus-2',

      typesenseServerConfig: {
        nodes: [
          {
            host: 'xxx-1.a1.typesense.net',
            port: 443,
            protocol: 'https',
          },
          {
            host: 'xxx-2.a1.typesense.net',
            port: 443,
            protocol: 'https',
          },
          {
            host: 'xxx-3.a1.typesense.net',
            port: 443,
            protocol: 'https',
          },
        ],
        apiKey: 'xyz',
      },

      // Optional: Typesense search parameters: https://typesense.org/docs/0.24.0/api/search.html#search-parameters
      typesenseSearchParameters: {},

      // Optional
      contextualSearch: true,
    },
  }
}

Style your search component following these instructions.

# Option B: Vuepress-powered sites

If you use Vuepress for a documentation framework (like Typesense's own documentation site), here's a Vue Component you can use.

Copy that component into .vuepress/components/TypesenseSearchBox.vue and edit it as needed.

Then add a key called typesenseDocsearch to your .vuepress/config.js file with these contents:

{
  themeConfig: {
    typesenseDocsearch: {
      typesenseServerConfig: {
        nearestNode: {
          host: 'xxx.a1.typesense.net',
          port: 443,
          protocol: 'https',
        },
        nodes: [
          {
            host: 'xxx-1.a1.typesense.net',
            port: 443,
            protocol: 'https',
          },
          {
            host: 'xxx-2.a1.typesense.net',
            port: 443,
            protocol: 'https',
          },
          {
            host: 'xxx-3.a1.typesense.net',
            port: 443,
            protocol: 'https',
          },
        ],
        apiKey: '<your-search-only-api-key>',
      },
      typesenseCollectionName: 'docs', // Should match the collection name you use in the scraper configuration
      typesenseSearchParams: {
        num_typos: 1,
        drop_tokens_threshold: 3,
        typo_tokens_threshold: 1,
        per_page: 6,
      },
    },
  }
}

Reference

Here's the docsearch-scraper configuration we use for Typesense's own Vuepress-powered documentation site.

# Option C: Custom Docs Framework with DocSearch.js v3 (modal layout)

Add the following DocSearch.js snippet to all your documentation pages:

<!-- Somewhere in your doc site's navigation -->
<div id="searchbar"></div>

<!-- Before the closing head -->
<link
  rel="stylesheet"
  href="https://cdn.jsdelivr.net/npm/typesense-docsearch-css@0.3.0"
/>

<!-- Before the closing body -->
<script src="https://cdn.jsdelivr.net/npm/typesense-docsearch.js@3.4"></script>

<script>
  docsearch({
    container: '#searchbar',
    typesenseCollectionName: 'docs', // Should match the index_name in the DocSearch scraper config.json
    typesenseServerConfig: { 
      nodes: [{
        host: 'localhost', // For Typesense Cloud use xxx.a1.typesense.net
        port: '8108',      // For Typesense Cloud use 443
        protocol: 'http'   // For Typesense Cloud use https
      }],
      apiKey: '<SEARCH_API_KEY>', // Use API Key with only Search permissions
    },
    typesenseSearchParameters: { // Optional.
      filter_by: 'version_tag:=0.21.0' // Useful when you have versioned docs
    },
  });
</script>

# Option D: Custom Docs Framework with DocSearch.js v2 (Dropdown layout)

Add the following DocSearch.js snippet to all your documentation pages:

<!-- Somewhere in your doc site's navigation -->
<input type="search" id="searchbar">

<!-- Before the closing head -->
<link
  rel="stylesheet"
  href="https://cdn.jsdelivr.net/npm/typesense-docsearch.js@1/dist/cdn/docsearch.min.css"
/>

<!-- Before the closing body -->
<script src="https://cdn.jsdelivr.net/npm/typesense-docsearch.js@1/dist/cdn/docsearch.min.js"></script>

<script>
  docsearch({
    inputSelector: '#searchbar',
    typesenseCollectionName: 'docs', // Should match the index_name in the DocSearch scraper config.json
    typesenseServerConfig: { 
      nodes: [{
        host: 'localhost', // For Typesense Cloud use xxx.a1.typesense.net
        port: '8108',      // For Typesense Cloud use 443
        protocol: 'http'   // For Typesense Cloud use https
      }],
      apiKey: '<SEARCH_API_KEY>', // Use API Key with only Search permissions
    },
    typesenseSearchParams: { // Optional.
      filter_by: 'version_tag:=0.21.0' // Useful when you have versioned docs
    },
  });
</script>

# Styling

You can override the following styles as needed:


.algolia-autocomplete .ds-dropdown-menu {
  width: 500px;
}

.algolia-autocomplete .typesense-docsearch-suggestion--category-header {
  color: darkgray;
  border: 1px solid gray;
}

.algolia-autocomplete .typesense-docsearch-suggestion--subcategory-column {
  color: gray;
}

.algolia-autocomplete .typesense-docsearch-suggestion--title {
  font-weight: bold;
  color: black;
}

.algolia-autocomplete .typesense-docsearch-suggestion--text {
  font-size: 0.8rem;
  color: gray;
}

.algolia-autocomplete .typesense-docsearch-suggestion--highlight {
  color: blue;
}

Note that you still need to use the .algolia-autocomplete class names, since we use autocomplete.js unmodified. The DocSearch-specific class names, however, are .typesense-docsearch-*, since this is a modified version of DocSearch.js.

Debugging CSS

In order to inspect and debug your CSS without having the searchbar close when you click on the devtool panels, you can initialize the docsearch library with the debug: true option!

# Option E: Sphinx Documentation Generator

Here's a guide written by a Typesense user on how to integrate Sphinx with Typesense DocSearch.

# Semantic Search

Typesense supports built-in Semantic Search as of v0.25.1 of Typesense Server and v0.9.1 of the typesense-docsearch-scraper.

Semantic search uses Machine Learning models to provide users with conceptually related results, even if the exact keyword they are searching for doesn't exist in your documentation site.

For example, if a user searches for "hard disk" and your documentation contains "hard drive", semantic search will still be able to pull these results up.

Step 1: To enable Semantic Search, first update your scraper config file to include the custom_settings section shown below. The key addition is the embedding field at the end of field_definitions:

{
  "index_name": "your_docs",
  "start_urls": ["..."],
  "selectors": {},
  "custom_settings": {
    "field_definitions": [
      {"name": "anchor", "type": "string", "optional": true},
      {"name": "content", "type": "string", "optional": true},
      {"name": "url", "type": "string", "facet": true},
      {"name": "url_without_anchor", "type": "string", "facet": true, "optional": true},
      {"name": "version", "type": "string[]", "facet": true, "optional": true},
      {"name": "hierarchy.lvl0", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl1", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl2", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl3", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl4", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl5", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl6", "type": "string", "facet": true, "optional": true},
      {"name": "type", "type": "string", "facet": true, "optional": true},
      {"name": ".*_tag", "type": "string", "facet": true, "optional": true},
      {"name": "language", "type": "string", "facet": true, "optional": true},
      {"name": "tags", "type": "string[]", "facet": true, "optional": true},
      {"name": "item_priority", "type": "int64"},
      {
        "name": "embedding",
        "type": "float[]",
        "embed": {
          "from": [
            "content",
            "hierarchy.lvl0",
            "hierarchy.lvl1",
            "hierarchy.lvl2",
            "hierarchy.lvl3",
            "hierarchy.lvl4",
            "hierarchy.lvl5",
            "hierarchy.lvl6",
            "tags"
          ],
          "model_config": {
            "model_name": "ts/all-MiniLM-L12-v2"
          }
        }
      }
    ]
  }
}

This instructs Typesense to automatically generate an embedding field using the contents of the content, hierarchy.* and tags fields.

If you have custom tags, you can edit the schema above to include those custom fields in embed.from.
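
One way to make that edit programmatically is sketched below, assuming the config has been loaded as a JavaScript object. The helper name addEmbedSources and the custom field product_tag are hypothetical. Whether a custom field also needs its own explicit entry in field_definitions is not covered here.

```javascript
// Returns a copy of the scraper config whose "embedding" field
// also embeds the given extra fields. The input object is not modified.
function addEmbedSources(config, extraFields) {
  const fieldDefinitions = config.custom_settings.field_definitions.map((field) => {
    if (field.name !== 'embedding') return field;
    const from = [...new Set([...field.embed.from, ...extraFields])]; // avoid duplicates
    return { ...field, embed: { ...field.embed, from } };
  });
  return {
    ...config,
    custom_settings: { ...config.custom_settings, field_definitions: fieldDefinitions },
  };
}

// Shortened version of the schema above, for illustration.
const semanticConfig = {
  index_name: 'your_docs',
  custom_settings: {
    field_definitions: [
      { name: 'content', type: 'string', optional: true },
      {
        name: 'embedding',
        type: 'float[]',
        embed: {
          from: ['content', 'tags'],
          model_config: { model_name: 'ts/all-MiniLM-L12-v2' },
        },
      },
    ],
  },
};

const updatedConfig = addEmbedSources(semanticConfig, ['product_tag']);
console.log(updatedConfig.custom_settings.field_definitions[1].embed.from);
```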

Step 2: Now, update your DocSearch initialization code in your frontend to set the following custom query_by field, to include the embedding field:

docsearch({
    //... Other parameters as described above 
    typesenseSearchParameters: { // In some docsearch plugins (see above), this might be called `typesenseSearchParams` 
      // ... 
      query_by:
        'hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,embedding',
      vector_query: 'embedding:([], k: 5, distance_threshold: 1.0, alpha: 0.2)' // Optional vector search fine-tuning
    },
  });
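
The long query_by string is easy to mistype, so it can be generated instead. The sketch below assumes the seven hierarchy levels from the schema above. The helper name buildQueryBy is hypothetical.

```javascript
// Builds the query_by value: hierarchy.lvl0 through hierarchy.lvl<maxLevel>,
// followed by the content and embedding fields.
function buildQueryBy(maxLevel = 6) {
  const levels = Array.from({ length: maxLevel + 1 }, (_, i) => `hierarchy.lvl${i}`);
  return [...levels, 'content', 'embedding'].join(',');
}

const semanticSearchParameters = {
  query_by: buildQueryBy(),
  // Optional fine-tuning, copied from the example above.
  vector_query: 'embedding:([], k: 5, distance_threshold: 1.0, alpha: 0.2)',
};
console.log(semanticSearchParameters.query_by);
```

The resulting object can be passed as typesenseSearchParameters (or typesenseSearchParams, depending on the plugin).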

And that's it!

You now have a DocSearch setup with semantic search enabled.

Tip: ML Model options

The example above uses one of the built-in ML models in Typesense, but you can use OpenAI, PaLM API or any other built-in ML model as described here.

Note: CPU Usage

Built-in Machine Learning models are computationally intensive.

Depending on the size of your documentation site, when you enable semantic search with a built-in ML model, even a few thousand records can take tens of minutes to embed and index.

To speed this process up, you can enable GPU Acceleration in Typesense.

When you use a remote embedding service like OpenAI within Typesense, then you do not need a GPU, since the model runs on OpenAI's servers.

Last Updated: 3/31/2024, 9:38:47 AM