Fuzzy search is the process of finding strings that approximately match a given string. A fuzzy matching search algorithm will be able to find relevant strings even if your original string contains typo errors and misspellings.
Ever seen Google correcting you like this?
That’s Google using fuzzy search to match your misspelled query to the word with the correct spelling.
Fuzzy search is effective, when it can reliably detect the intent behind a given search term. However, it can also return bad results if it wrongly assumes a query to be badly formed, as is often the case with obscure words and band names!
This is why most search engines allow you to wrap your query “with quotes” to tell the search engine that you know what you are doing.
Measuring string similarity
When two strings don’t match each other exactly, we need a metric to indicate how similar they are to each other.
Such a metric is called as the edit distance, and there are several algorithms for measuring the edit distance between two strings.
Levenshtein distance
Arguably the most popular string similarity metric, it computes the number of deletions, insertions and substitutions needed to transform one string into another.
insertion: bat → boat
deletion: coat → cot
substitution: coat → cost
If there are multiple ways to transform a string to another, the Levenshtein distance is the smallest number of operations required to do the transformation.
Damerau-Levenshtein distance
Damerau-Levenshtein is an extension to the Levenshtein distance metric. In addiiton to insertion, deletion and substitution, it also treats transposition as a primitive operation.
Transposition is the swapping of two adjacent letters in a string.
transposition: caot → coat
Longest common subsequence (LCS)
LCS is a similarity metric that considers only insertion and deletion. Given two strings, LCS finds the longest subsequence between them. Such a subsequence is not required to occupy consecutive positions within the original sequence.
In the following example,
ABCD → ACBAD
The two strings share two common subsequences of length 3: ABD
and ACD
.
LCS is used in the Linux diff
utility and in revision control systems such as Git for displaying the changes between
two versions of a file.