# Search for Documentation Sites
The good folks over at Algolia have built and open-sourced DocSearch, which is a suite of tools specifically built to index data from a documentation site and then add a search bar to the site quickly.
This article will show you how to use a customized version of DocSearch that works with Typesense. In fact, the search bar you see on Typesense's own documentation site is built with this customized version of DocSearch.
Typesense's customized version of DocSearch is made up of two components:
- typesense-docsearch-scraper - A web-scraper that scans your documentation site and indexes the content in Typesense.
- typesense-docsearch.js - A JavaScript library that adds a search bar to your documentation site. When end-users start typing into the search bar, it queries the content index built by the DocSearch scraper.
TIP: Usage on Non-Documentation Sites
Even though DocSearch was originally built for documentation sites, it can actually be used for any site that has structured, hierarchical, and consistent HTML markup across pages.
# Step 1: Set up DocSearch Scraper
First, we need to set up the scraper to point to your documentation site. Running the scraper crawls your website, builds a search index from its content, and then uploads that index to your Typesense server. That's what makes your website searchable!
# Create a DocSearch Scraper Config File
Follow one of the templates below to create your own `config.json` file, pointing to your documentation site:
- Docusaurus (see changes required below).
- Vuepress
- More templates can be found in Algolia's repo.
Here's the official DocSearch Scraper documentation that describes all the available config options.
Note
Algolia's DocSearch repositories are archived because Algolia migrated to their proprietary closed-source crawler in February 2022 and no longer maintains the open-source version.
Given this, Typesense intends to maintain and develop a fork. Thus, you can safely ignore the deprecation warnings in their documentation.
In the long term, we intend to update all the documentation to point to Typesense repositories, to avoid this confusion.
# Make necessary changes to the DocSearch Scraper config file
After starting with one of the templates, you will want to change a few fields in the configuration:
- `index_name` - Should be a unique string that identifies your website. This corresponds to `typesenseCollectionName` in the front-end configuration further down below.
- `start_urls` - This corresponds to the URL for your website.
- `stop_urls` - An array of URLs to ignore. For example, if you have a change log on your website, you might want to ignore it so that it does not interfere with the search results for actual content.
- `sitemap_urls` - (Docusaurus-only) You will need to change this URL to match, just like you changed the `start_urls`. (This XML file is automatically generated by Docusaurus during its build process.)
- `lvl1` - (Docusaurus-only) Change `header h1` to `article h1, header h1`.
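As a rough illustration of the fields described above, here is a minimal sketch of a config expressed as a JavaScript object, with a small sanity check before it is written out as `config.json`. The site URLs, the `my-docs` name, and the `checkScraperConfig` helper are hypothetical and not part of DocSearch. Only the `lvl1` selector is shown, so keep the other selectors from your template.

```javascript
// Hypothetical example values: replace every URL and name with your own.
const scraperConfig = {
  index_name: 'my-docs', // must match typesenseCollectionName in the front-end
  start_urls: ['https://docs.example.com/'],
  stop_urls: ['https://docs.example.com/changelog'],
  sitemap_urls: ['https://docs.example.com/sitemap.xml'], // Docusaurus-only
  selectors: {
    // Docusaurus-only change described above; keep the template's other selectors.
    lvl1: 'article h1, header h1',
  },
};

// Hypothetical helper: returns a list of problems found in the fields discussed above.
function checkScraperConfig(config) {
  const problems = [];
  if (typeof config.index_name !== 'string' || config.index_name.trim() === '') {
    problems.push('index_name must be a non-empty string');
  }
  if (!Array.isArray(config.start_urls) || config.start_urls.length === 0) {
    problems.push('start_urls must be a non-empty array');
  }
  if (config.stop_urls !== undefined && !Array.isArray(config.stop_urls)) {
    problems.push('stop_urls must be an array when present');
  }
  return problems;
}

console.log(checkScraperConfig(scraperConfig));
// The JSON below is what would go into config.json.
console.log(JSON.stringify(scraperConfig, null, 2));
```

The check only covers the fields mentioned in this section. The official DocSearch Scraper documentation remains the reference for everything else.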
Tip: Scraping a site running on localhost
If your site is running on `localhost` and you're using Docker to run the scraper, you will need to change a few things in your `config.json` file.
In `start_urls` and `sitemap_urls`, target the `host.docker.internal` URL so that the scraper finds the site on your host machine instead of looking for it inside the container.
You will also need to serve your site on port `80`, because the scraper can behave unexpectedly when the site is hosted on another port.
TIP
There is a mismatch between `index_name` in the scraper config and `typesenseCollectionName` in the front-end config. This is because Algolia calls a collection of documents an "index" and Typesense calls a collection of documents a collection. The scraper was originally forked from Algolia and the name was deliberately kept to maintain backwards compatibility with the ecosystem.
TIP
If you look at the logs of your Typesense instance, you might see that it reports the index/collection name as something like `foo_1675838072` instead of `foo`. This is because every time the crawler runs:
- It creates a new collection called `foo_<current_unix_timestamp>`
- It creates/updates an alias called `foo` that points to `foo_<current_unix_timestamp>`
- It deletes the previously scraped version of the docs, stored in `foo_<previous_timestamp>`
For this reason, when configuring your front-end search engine, you should specify the index/collection name as `foo` instead of `foo_<unix_timestamp>`.
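The three steps above can be modeled with a small in-memory sketch. This only illustrates the naming scheme and is not the scraper's actual implementation. The `simulateCrawlerRun` function and the shape of the state object are assumptions made for the example.

```javascript
// Models one crawler run: create a timestamped collection, repoint the alias,
// and drop the collection the alias pointed to before.
function simulateCrawlerRun(state, indexName, unixTimestamp) {
  const newCollection = `${indexName}_${unixTimestamp}`;
  const previousCollection = state.aliases[indexName]; // undefined on the first run
  return {
    // Steps 1 and 3: add the new collection, remove the previous one.
    collections: state.collections
      .filter((name) => name !== previousCollection)
      .concat(newCollection),
    // Step 2: point the alias at the new collection.
    aliases: { ...state.aliases, [indexName]: newCollection },
  };
}

let crawlState = { collections: [], aliases: {} };
crawlState = simulateCrawlerRun(crawlState, 'foo', 1675838072);
crawlState = simulateCrawlerRun(crawlState, 'foo', 1675924472);
console.log(crawlState);
```

After both runs only the newest timestamped collection remains, while the alias `foo` stays stable. That is why the front-end should always refer to `foo`.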
# Add DocSearch Meta Tags (optional)
The scraper automatically extracts information from the DocSearch meta tags and attaches the `content` value to all records extracted on the page. This is a great way to filter searches on custom attributes.
<meta name="docsearch:{$NAME}_tag" content="{$CONTENT}" />
Example: If you have the following markup across a certain set of pages:
<meta name="docsearch:language_tag" content="en" />
<meta name="docsearch:version_tag" content="1.2.4" />
All extracted records on these pages will have a `language_tag` attribute with a value of `en` and a `version_tag` attribute with a value of `1.2.4`, which you can use in `filter_by` to restrict the search to particular sets of records.
TIP
`_tag` must be appended to the end of the `$NAME` variable for the attribute to be saved in the schema.
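To make the link between the meta tags and `filter_by` concrete, here is a minimal sketch that turns a set of tag attributes into a filter string for the front-end search parameters. The `buildFilterBy` helper is hypothetical. It assumes simple values without special characters and uses the `field:=value` form that also appears in the `version_tag` example further down this page.

```javascript
// Hypothetical helper: joins exact-match clauses for each *_tag attribute.
function buildFilterBy(tags) {
  return Object.entries(tags)
    .map(([field, value]) => `${field}:=${value}`)
    .join(' && ');
}

// Attribute names match the meta tags in the example above.
const filterBy = buildFilterBy({ language_tag: 'en', version_tag: '1.2.4' });
console.log(filterBy);
```

The resulting string could then be passed as `filter_by` in the search parameters of whichever front-end option you choose below.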
# Run the Scraper
The easiest way to run the scraper is using Docker.
Create a `.env` file with the following contents, replacing them with the correct values for your particular situation:
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=xxx.a1.typesense.net
TYPESENSE_PORT=443
TYPESENSE_PROTOCOL=https
TIP
If you are self-hosting Typesense, then you can usually find your API key and port number in the `/etc/typesense/typesense-server.ini` file.
The host will be equal to the FQDN or IP address of your server.
By default, self-hosted Typesense uses HTTP, so you might need to change `https` to `http`. (Unless of course you specified `ssl-certificate` and `ssl-certificate-key` in your ini file.)
TIP
If you are running Typesense on `localhost` and you're using Docker to run the scraper, using `TYPESENSE_HOST=localhost` will not work because localhost in this context refers to localhost within the container. Instead you want the scraper running inside the Docker container to be able to connect to Typesense running outside the Docker container on your host. Follow the instructions here to use the appropriate hostname to refer to your Docker host. For example, on macOS you want to use: `TYPESENSE_HOST=host.docker.internal`
Run the scraper:
docker run -it --env-file=/path/to/your/.env -e "CONFIG=$(cat config.json | jq -r tostring)" typesense/docsearch-scraper:0.11.0
This will scrape your documentation site and index it into Typesense.
TIP
The Docker command above will run the scraper in interactive mode, outputting logs to stdout.
If needed, you can send the output to both stdout and a file at the same time by adding `| tee scraper-output.txt` to the end of the command. This is helpful because the output can be very verbose.
You can also run the scraper as a daemon by substituting the `-it` flags with `-d` (detached mode).
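Before running the scraper, it can help to sanity-check the values in your `.env` file. The sketch below parses `KEY=value` lines and assembles the base URL the scraper would connect to. Both `parseEnv` and `typesenseBaseUrl` are hypothetical helpers written for this example. The parser is deliberately simplified and is not a full implementation of Docker's `--env-file` format.

```javascript
// Parses KEY=value lines; ignores blank lines and lines starting with '#'.
function parseEnv(text) {
  const env = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.startsWith('#')) continue;
    const separator = trimmed.indexOf('=');
    if (separator === -1) continue; // not a KEY=value line
    env[trimmed.slice(0, separator)] = trimmed.slice(separator + 1);
  }
  return env;
}

// Throws if any of the four variables from this guide is missing.
function typesenseBaseUrl(env) {
  const required = ['TYPESENSE_API_KEY', 'TYPESENSE_HOST', 'TYPESENSE_PORT', 'TYPESENSE_PROTOCOL'];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing variables: ${missing.join(', ')}`);
  }
  return `${env.TYPESENSE_PROTOCOL}://${env.TYPESENSE_HOST}:${env.TYPESENSE_PORT}`;
}

// Same placeholder values as the .env example above.
const sampleEnv = parseEnv(`
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=xxx.a1.typesense.net
TYPESENSE_PORT=443
TYPESENSE_PROTOCOL=https
`);
console.log(typesenseBaseUrl(sampleEnv));
```

Requesting Typesense's `/health` endpoint on the resulting base URL is one way to confirm that the host, port, and protocol are reachable from where the scraper will run.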
# Tips for common challenges or more complex use-cases
Below are some tips for common challenges when running the scraper inside a Docker container:
# Passing a config file location, rather than a config string
The example above uses the `jq` tool to parse the config file into a JSON string prior to passing it as the `CONFIG` environment variable.
If you don't have `jq` available, it's good to know that you can also pass the location of the config file to the `CONFIG` variable, and then the file will be read from this location.
Just make sure that the config is available inside the container. In other words, you'll need to volume mount it, like in the example below:
docker run -it \
-v "/path/to/config/dir/on/your/machine:/tmp/search" \
-e "CONFIG=/tmp/search/typesense.json" \
typesense/docsearch-scraper:0.11.0
# Trusting certificates from internal CAs
If you're trying to scrape a website that is secured with a certificate from an internal CA — common for corporate intranets for example — you will need to somehow make the container trust this CA. To do so, you can mount a file with trusted CAs and then pass it as a command line option.
In the example below, a file in the current folder named `ca-chain.crt` will be added to the trusted CA list:
docker run -it \
--mount type=bind,source="$(pwd)/ca-chain.crt",target=/etc/ssl/certs/ca-certificates.crt \
--env "REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" \
--env-file=/path/to/your/.env \
-e "CONFIG=$(cat config.json | jq -r tostring)" \
typesense/docsearch-scraper:0.11.0
# Set environment variables on the command line, rather than using a .env file
If you don't want to use a `.env` file or cannot use one in your setup, you can also pass all variables on the command line:
docker run -it \
-e "TYPESENSE_API_KEY=xyz" \
-e "TYPESENSE_HOST=xxx.a1.typesense.net" \
-e "TYPESENSE_PORT=443" \
-e "TYPESENSE_PROTOCOL=https" \
-e "CONFIG=$(cat config.json | jq -r tostring)" \
typesense/docsearch-scraper:0.11.0
# Resolving hosts
If your scraper depends on host resolution that is not available inside the container, you can add a host entry on the command line:
docker run -it \
--add-host intranet.company.com:10.1.2.3 \
--env-file=/path/to/your/.env \
-e "CONFIG=$(cat config.json | jq -r tostring)" \
typesense/docsearch-scraper:0.11.0
# Authentication
If you're looking to scrape content that requires authentication, there are a number of options that are supported out of the box:
# Basic HTTP authentication
To use this authentication, set these environment variables:
DOCSEARCH_BASICAUTH_USERNAME
DOCSEARCH_BASICAUTH_PASSWORD
# Cloudflare Zero Trust (CF)
To use this authentication, set these environment variables:
CF_ACCESS_CLIENT_ID
CF_ACCESS_CLIENT_SECRET
# Google Identity-Aware Proxy (IAP)
To use this authentication, set these environment variables:
IAP_AUTH_CLIENT_ID
IAP_AUTH_SERVICE_ACCOUNT_JSON
# Keycloak (KC)
To use this authentication, set these environment variables:
KC_URL
KC_REALM
KC_CLIENT_ID
KC_CLIENT_SECRET
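Since each method needs its whole group of variables, a small pre-flight check can catch a partially configured method before the scraper runs. The sketch below is a hypothetical helper, not part of the scraper. The variable names are the ones listed above, and the sample values are made up.

```javascript
// Variable groups as listed in this section.
const AUTH_VARIABLES = {
  'Basic HTTP authentication': ['DOCSEARCH_BASICAUTH_USERNAME', 'DOCSEARCH_BASICAUTH_PASSWORD'],
  'Cloudflare Zero Trust': ['CF_ACCESS_CLIENT_ID', 'CF_ACCESS_CLIENT_SECRET'],
  'Google Identity-Aware Proxy': ['IAP_AUTH_CLIENT_ID', 'IAP_AUTH_SERVICE_ACCOUNT_JSON'],
  Keycloak: ['KC_URL', 'KC_REALM', 'KC_CLIENT_ID', 'KC_CLIENT_SECRET'],
};

// For every method with at least one variable set, reports which variables
// are still missing. An empty array means the method is fully configured.
function checkAuthVariables(env) {
  const report = {};
  for (const [method, names] of Object.entries(AUTH_VARIABLES)) {
    const missing = names.filter((name) => !env[name]);
    if (missing.length === names.length) continue; // method not in use
    report[method] = missing;
  }
  return report;
}

// Hypothetical environment with an incomplete Keycloak setup.
const authReport = checkAuthVariables({
  KC_URL: 'https://sso.example.com',
  KC_REALM: 'docs',
});
console.log(authReport);
```

In a real setup you could pass `process.env` instead of the sample object.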
# Integrate With CI / Deploy It to a Server
If you are setting up Typesense for the first time, then skip down to the next section. But once you have confirmed that the scraper works and confirmed that your website has coherent search results, you should set things up so that your website can get continually scraped.
TIP
In Typesense Cloud, we only host your Typesense cluster for you. You are still responsible for running the scraper to update your index in your CI pipeline / infrastructure.
The scraper Docker container is stateless and so can be run on any platform that allows you to run stateless Docker containers like:
- GitHub Actions (here's a pre-built action)
- CircleCI
- AWS Fargate
- Google Cloud Run
- Heroku
- Render
- Railway
And many more. We recommend running the scraper in CI so that your search index will always stay up-to-date (as opposed to e.g. a cron job that runs every day).
# Step 2: Add a Search Bar to your Documentation Site
# Option A: Docusaurus-powered sites
If you use Docusaurus as your documentation framework, use the docusaurus-theme-search-typesense plugin to add a search bar to your Docusaurus site.
$ npm install docusaurus-theme-search-typesense@next --save
# or
$ yarn add docusaurus-theme-search-typesense@next
# or
$ pnpm add docusaurus-theme-search-typesense@next
Add the following to your `docusaurus.config.js` file:
{
themes: ['docusaurus-theme-search-typesense'],
themeConfig: {
typesense: {
// Replace this with the name of your index/collection.
// It should match the "index_name" entry in the scraper's "config.json" file.
typesenseCollectionName: 'docusaurus-2',
typesenseServerConfig: {
nodes: [
{
host: 'xxx-1.a1.typesense.net',
port: 443,
protocol: 'https',
},
{
host: 'xxx-2.a1.typesense.net',
port: 443,
protocol: 'https',
},
{
host: 'xxx-3.a1.typesense.net',
port: 443,
protocol: 'https',
},
],
apiKey: 'xyz',
},
// Optional: Typesense search parameters: https://typesense.org/docs/0.24.0/api/search.html#search-parameters
typesenseSearchParameters: {},
// Optional
contextualSearch: true,
},
}
}
Style your search component following these instructions.
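The three `nodes` entries in the configuration above differ only in their hostname. If you prefer not to repeat them, a small helper can generate the list. This is a sketch only: `buildNodes` is a hypothetical function, and the hostnames and API key are the same placeholders used above.

```javascript
// Hypothetical helper: builds one node entry per hostname.
function buildNodes(hosts, port = 443, protocol = 'https') {
  return hosts.map((host) => ({ host, port, protocol }));
}

const typesenseServerConfig = {
  nodes: buildNodes([
    'xxx-1.a1.typesense.net',
    'xxx-2.a1.typesense.net',
    'xxx-3.a1.typesense.net',
  ]),
  apiKey: 'xyz', // placeholder, as in the configuration above
};

console.log(typesenseServerConfig.nodes);
```

The resulting object has the same shape as the `typesenseServerConfig` block shown above.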
# Option B: Vuepress-powered sites
If you use Vuepress for a documentation framework (like Typesense's own documentation site), here's a Vue Component you can use.
Copy that component into `.vuepress/components/TypesenseSearchBox.vue` and edit it as needed.
Then add a key called `typesenseDocsearch` to your `.vuepress/config.js` file with these contents:
{
themeConfig: {
typesenseDocsearch: {
typesenseServerConfig: {
nearestNode: {
host: 'xxx.a1.typesense.net',
port: 443,
protocol: 'https',
},
nodes: [
{
host: 'xxx-1.a1.typesense.net',
port: 443,
protocol: 'https',
},
{
host: 'xxx-2.a1.typesense.net',
port: 443,
protocol: 'https',
},
{
host: 'xxx-3.a1.typesense.net',
port: 443,
protocol: 'https',
},
],
apiKey: '<your-search-only-api-key>',
},
typesenseCollectionName: 'docs', // Should match the collection name you use in the scraper configuration
typesenseSearchParams: {
num_typos: 1,
drop_tokens_threshold: 3,
typo_tokens_threshold: 1,
per_page: 6,
},
},
}
}
Reference
Here's the docsearch-scraper configuration we use for Typesense's own Vuepress-powered documentation site.
# Option C: Custom Docs Framework with DocSearch.js v3 (modal layout)
Add the following DocSearch.js snippet to all your documentation pages:
<!-- Somewhere in your doc site's navigation -->
<div id="searchbar"></div>
<!-- Before the closing head -->
<link
rel="stylesheet"
href="https://cdn.jsdelivr.net/npm/typesense-docsearch-css@0.3.0"
/>
<!-- Before the closing body -->
<script src="https://cdn.jsdelivr.net/npm/typesense-docsearch.js@3.4"></script>
<script>
docsearch({
container: '#searchbar',
typesenseCollectionName: 'docs', // Should match the index_name you set in the docsearch scraper's config.json
typesenseServerConfig: {
nodes: [{
host: 'localhost', // For Typesense Cloud use xxx.a1.typesense.net
port: '8108', // For Typesense Cloud use 443
protocol: 'http' // For Typesense Cloud use https
}],
apiKey: '<SEARCH_API_KEY>', // Use API Key with only Search permissions
},
typesenseSearchParameters: { // Optional.
// filter_by: 'version_tag:=0.21.0' // Useful when you have versioned docs
},
});
</script>
# Reference:
- Read the Authentication Section for all possible options under the `typesenseServerConfig` key.
- Read the Search Parameters Section for all possible options under the `typesenseSearchParameters` key.
- Read the official DocSearch documentation for information about additional options.
# Option D: Custom Docs Framework with DocSearch.js v2 (Dropdown layout)
Add the following DocSearch.js snippet to all your documentation pages:
<!-- Somewhere in your doc site's navigation -->
<input type="search" id="searchbar">
<!-- Before the closing head -->
<link
rel="stylesheet"
href="https://cdn.jsdelivr.net/npm/typesense-docsearch.js@1/dist/cdn/docsearch.min.css"
/>
<!-- Before the closing body -->
<script src="https://cdn.jsdelivr.net/npm/typesense-docsearch.js@1/dist/cdn/docsearch.min.js"></script>
<script>
docsearch({
inputSelector: '#searchbar',
typesenseCollectionName: 'docs', // Should match the index_name you set in the docsearch scraper's config.json
typesenseServerConfig: {
nodes: [{
host: 'localhost', // For Typesense Cloud use xxx.a1.typesense.net
port: '8108', // For Typesense Cloud use 443
protocol: 'http' // For Typesense Cloud use https
}],
apiKey: '<SEARCH_API_KEY>', // Use API Key with only Search permissions
},
typesenseSearchParams: { // Optional.
// filter_by: 'version_tag:=0.21.0' // Useful when you have versioned docs
},
});
</script>
# Reference:
- Read the Authentication Section for all possible options under the `typesenseServerConfig` key.
- Read the Search Parameters Section for all possible options under the `typesenseSearchParams` key.
- Read the official DocSearch documentation for information about additional options.
# Styling
You can override the following styles as needed:
.algolia-autocomplete .ds-dropdown-menu {
width: 500px;
}
.algolia-autocomplete .typesense-docsearch-suggestion--category-header {
color: darkgray;
border: 1px solid gray;
}
.algolia-autocomplete .typesense-docsearch-suggestion--subcategory-column {
color: gray;
}
.algolia-autocomplete .typesense-docsearch-suggestion--title {
font-weight: bold;
color: black;
}
.algolia-autocomplete .typesense-docsearch-suggestion--text {
font-size: 0.8rem;
color: gray;
}
.algolia-autocomplete .typesense-docsearch-suggestion--highlight {
color: blue;
}
Notice that you still need to use the `.algolia-autocomplete` class names, since we use autocomplete.js unmodified. The DocSearch class names, however, are `.typesense-docsearch-*`, since this is a modified version of DocSearch.js.
Debugging CSS
In order to inspect and debug your CSS without having the search bar close when you click on the devtools panels, you can initialize the docsearch library with the `debug: true` option.
# Option E: Sphinx Documentation Generator
Here's a guide written by a Typesense user on how to integrate Sphinx with Typesense DocSearch.
# Semantic Search
Typesense supports built-in Semantic Search as of `v0.25.1` of Typesense Server and `v0.9.1` of the typesense-docsearch-scraper.
Semantic search uses Machine Learning models to provide users with conceptually related results, even if the exact keyword they are searching for doesn't exist in your documentation site.
For example, if a user searches for "hard disk" and your documentation contains "hard drive", semantic search will still be able to pull these results up.
Step 1: To enable Semantic Search, first update your scraper config file to include the following `custom_settings` section:
{
"index_name": "your_docs",
"start_urls": ["..."],
"selectors": {},
"custom_settings": {
"field_definitions": [
{"name": "anchor", "type": "string", "optional": true},
{"name": "content", "type": "string", "optional": true},
{"name": "url", "type": "string", "facet": true},
{"name": "url_without_anchor", "type": "string", "facet": true, "optional": true},
{"name": "version", "type": "string[]", "facet": true, "optional": true},
{"name": "hierarchy.lvl0", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl1", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl2", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl3", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl4", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl5", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl6", "type": "string", "facet": true, "optional": true},
{"name": "type", "type": "string", "facet": true, "optional": true},
{"name": ".*_tag", "type": "string", "facet": true, "optional": true},
{"name": "language", "type": "string", "facet": true, "optional": true},
{"name": "tags", "type": "string[]", "facet": true, "optional": true},
{"name": "item_priority", "type": "int64"},
{
"name": "embedding",
"type": "float[]",
"embed": {
"from": [
"content",
"hierarchy.lvl0",
"hierarchy.lvl1",
"hierarchy.lvl2",
"hierarchy.lvl3",
"hierarchy.lvl4",
"hierarchy.lvl5",
"hierarchy.lvl6",
"tags"
],
"model_config": {
"model_name": "ts/all-MiniLM-L12-v2"
}
}
}
]
}
}
This instructs Typesense to automatically generate an `embedding` field using the contents of the `content`, `hierarchy.*` and `tags` fields.
If you have custom tags, you can edit the schema above to include those custom fields in `embed.from`.
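If you generate the scraper config programmatically, adding custom fields to `embed.from` could look like the sketch below. The `addEmbedFields` helper is hypothetical, the config is abbreviated to two field definitions, and `language_tag` is only an example of a custom tag. Whether a given field is suitable for embedding depends on your own schema.

```javascript
// Hypothetical helper: returns a copy of the config whose "embedding" field
// also embeds the given extra fields. The input config is left untouched.
function addEmbedFields(config, extraFields) {
  const fieldDefinitions = config.custom_settings.field_definitions.map((field) => {
    if (field.name !== 'embedding') return field;
    // A Set removes duplicates while keeping insertion order.
    const from = [...new Set([...field.embed.from, ...extraFields])];
    return { ...field, embed: { ...field.embed, from } };
  });
  return {
    ...config,
    custom_settings: { ...config.custom_settings, field_definitions: fieldDefinitions },
  };
}

// Abbreviated config: see the full field_definitions list above.
const semanticConfig = {
  index_name: 'your_docs',
  custom_settings: {
    field_definitions: [
      { name: 'content', type: 'string', optional: true },
      {
        name: 'embedding',
        type: 'float[]',
        embed: {
          from: ['content', 'tags'],
          model_config: { model_name: 'ts/all-MiniLM-L12-v2' },
        },
      },
    ],
  },
};

const updatedConfig = addEmbedFields(semanticConfig, ['language_tag', 'tags']);
console.log(updatedConfig.custom_settings.field_definitions[1].embed.from);
```

Returning a copy keeps the original template reusable, for example when several sites share one base config.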
Step 2: Now, update your DocSearch initialization code in your frontend to set the following custom `query_by` parameter, so that it includes the `embedding` field:
docsearch({
//... Other parameters as described above
typesenseSearchParameters: { // In some docsearch plugins (see above), this might be called `typesenseSearchParams`
// ...
query_by:
'hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,embedding',
vector_query: 'embedding:([], k: 5, distance_threshold: 1.0, alpha: 0.2)' // Optional vector search fine-tuning
},
});
And that's it!
You now have a DocSearch setup with semantic search enabled.
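The long `query_by` string is easy to mistype, so one option is to build it from its parts. The sketch below is a hypothetical helper that reproduces the parameters shown above. The `vector_query` value is copied from that example and remains optional fine-tuning.

```javascript
// Hypothetical helper: builds the search parameters from Step 2.
function buildSemanticSearchParameters(maxLevel = 6) {
  // hierarchy.lvl0 ... hierarchy.lvl<maxLevel>
  const hierarchy = Array.from({ length: maxLevel + 1 }, (_, level) => `hierarchy.lvl${level}`);
  return {
    query_by: [...hierarchy, 'content', 'embedding'].join(','),
    vector_query: 'embedding:([], k: 5, distance_threshold: 1.0, alpha: 0.2)',
  };
}

const semanticSearchParameters = buildSemanticSearchParameters();
console.log(semanticSearchParameters.query_by);
```

The returned object can be passed as `typesenseSearchParameters` (or `typesenseSearchParams`, depending on the plugin) when initializing DocSearch.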
Tip: ML Model options
The example above uses one of the built-in ML models in Typesense, but you can use OpenAI, PaLM API or any other built-in ML model as described here.
Note: CPU Usage
Built-in Machine Learning models are computationally intensive.
So depending on the size of your documentation site, when you enable semantic search and use a built-in ML model, even a few thousand records could take tens of minutes to generate embeddings and index.
If you want to speed this process up, you can enable GPU Acceleration in Typesense.
When you use a remote embedding service like OpenAI within Typesense, then you do not need a GPU, since the model runs on OpenAI's servers.