What is fuzzy search?

Fuzzy search is the process of finding strings that approximately match a given string. A fuzzy matching search algorithm will be able to find relevant strings even if your original string contains typo errors and misspellings.

Ever seen Google correcting you like this?

Google search correcting a search query with typo

That’s Google using fuzzy search to match your misspelled query to the word with the correct spelling.

Fuzzy search is effective, when it can reliably detect the intent behind a given search term. However, it can also return bad results if it wrongly assumes a query to be badly formed, as is often the case with obscure words and band names!

This is why most search engines allow you to wrap your query “with quotes” to tell the search engine that you know what you are doing.

Searching google with exact query

Measuring string similarity

When two strings don’t match each other exactly, we need a metric to indicate how similar they are to each other.

Such a metric is called as the edit distance, and there are several algorithms for measuring the edit distance between two strings.

Levenshtein distance

Arguably the most popular string similarity metric, it computes the number of deletions, insertions and substitutions needed to transform one string into another.

insertion:     bat → boat
deletion:      coat → cot
substitution:  coat → cost

If there are multiple ways to transform a string to another, the Levenshtein distance is the smallest number of operations required to do the transformation.

Damerau-Levenshtein distance

Damerau-Levenshtein is an extension to the Levenshtein distance metric. In addiiton to insertion, deletion and substitution, it also treats transposition as a primitive operation.

Transposition is the swapping of two adjacent letters in a string.

transposition: caot → coat

Longest common subsequence (LCS)

LCS is a similarity metric that considers only insertion and deletion. Given two strings, LCS finds the longest subsequence between them. Such a subsequence is not required to occupy consecutive positions within the original sequence.

In the following example,

ABCD → ACBAD

The two strings share two common subsequences of length 3: ABD and ACD.

LCS is used in the Linux diff utility and in revision control systems such as Git for displaying the changes between two versions of a file.

Measuring string similarity#

Measuring string similarity