In information systems, it is common for the same entity to be represented by slightly varying strings. Consider the following:

``````
Joe Biden
Joseph Biden
Joseph R Biden
``````

All three strings refer to the same person, but in slightly different ways.

Fuzzy search is the process of finding strings that approximately match a given string.

Let’s explore how we can utilize various fuzzy string matching algorithms in Python to compute similarity between pairs of strings.

## SequenceMatcher from difflib#

`SequenceMatcher` is available as part of the Python standard library. It uses the Ratcliff/Obershelp string-matching algorithm, which calculates the similarity between two strings as:

Twice the number of matching (overlapping) characters between the two strings, divided by the total number of characters in the two strings. In other words, `ratio = 2.0 * M / T`, where `M` is the number of matching characters and `T` is the total number of characters in both strings.

``````
from difflib import SequenceMatcher as SM

str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'

similarity_ratio = SM(None, str1, str2).ratio() # 0.9032258064516129
``````

Here, we can see that the two strings are about 90% similar based on the similarity ratio calculated by `SequenceMatcher`.

The `difflib` module contains many useful string matching functions that you should certainly explore further.
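For example, `difflib.get_close_matches` finds the best matches for a word from a list of candidates using the same similarity ratio under the hood (the query string and cutoff below are illustrative values):

```python
from difflib import get_close_matches

names = ['Joe Biden', 'Joseph Biden', 'Joseph R Biden']

# Return up to two candidates whose similarity ratio to the
# (misspelled) query is at least the cutoff, best match first
matches = get_close_matches('Joe Bidan', names, n=2, cutoff=0.6)
print(matches)
```

This is handy when you want quick fuzzy lookups without installing any third-party packages.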

## Levenshtein distance#

The Levenshtein distance between two strings is the minimum number of single-character deletions, insertions, and substitutions needed to transform one string into the other.
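To make the definition concrete, here is a minimal dynamic-programming sketch of the distance (a naive pure-Python illustration; the package below is far faster):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# The classic textbook example: kitten -> sitting takes 3 edits
print(levenshtein('kitten', 'sitting'))
```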

Once you install the `python-Levenshtein` package:

``````pip install python-Levenshtein
``````

You can compute both the Levenshtein edit distance and similarity ratio between two strings.

``````import Levenshtein as levenshtein

str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'

edit_distance = levenshtein.distance(str1, str2)
similarity_ratio = levenshtein.ratio(str1, str2)
``````

## Fuzzywuzzy#

Fuzzywuzzy is a more feature-rich library for computing string similarity and performing fuzzy string matching in Python.

Like the `python-Levenshtein` library, it also has a ratio function:

``````from fuzzywuzzy import fuzz

str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'

ratio = fuzz.ratio(str1, str2)
``````

The library also provides advanced functions for handling other complex string matching scenarios.

### Handling substrings#

Let’s take an example of a string which is a substring of another. Depending on the context, some text-matching scenarios require us to treat a substring match as a complete match.

``````from fuzzywuzzy import fuzz

str1 = 'California, USA'
str2 = 'California'

ratio = fuzz.ratio(str1, str2)
partial_ratio = fuzz.partial_ratio(str1, str2)

print(ratio)
print(partial_ratio)
``````
``````80
100
``````

As you can see, the `partial_ratio` is `100` while the plain `ratio` is `80`, so relying on `partial_ratio` comes in handy in some scenarios.

### Handling out-of-order words#

What if we wanted to ignore how the words are ordered in strings?

Let’s take the example of these two strings:

``````Chocolate Strawberry Cake
Strawberry Chocolate Cake
``````

Semantically, these two strings are the same. However, if you were to calculate the ratio of these strings, you would end up with a similarity ratio score of only `60`.

Thankfully, `fuzzywuzzy` has got your back. You can use the `token_set_ratio` function to treat the individual words as order independent tokens.

``````from fuzzywuzzy import fuzz

str1 = 'Chocolate Strawberry Cake'
str2 = 'Strawberry Chocolate Cake'

token_set_ratio = fuzz.token_set_ratio(str1, str2)
print(token_set_ratio)
``````
``````100
``````

The token set ratio of those two strings is now `100`. Perfect.

## Best matches of a given string#

So far, we have been looking at calculating pair-wise string similarity.

However, one typically wants to find the closest matches to a given string from a whole list of candidates. Once again, `fuzzywuzzy` has convenience functions for doing just that.

``````from fuzzywuzzy import process
from fuzzywuzzy import fuzz

str_list = ['Joe Biden', 'Joseph Biden', 'Joseph R Biden']

match_ratios = process.extract('joe r biden', str_list, scorer=fuzz.token_sort_ratio)
print(match_ratios)

best_match = process.extractOne('joe r biden', str_list, scorer=fuzz.token_sort_ratio)
print(best_match)
``````
``````[('Joe Biden', 90), ('Joseph R Biden', 88), ('Joseph Biden', 78)]
('Joe Biden', 90)
``````

Note how we’re passing a `scorer` function to the extraction functions. Depending on the context, we can also use `fuzz.partial_ratio` or `fuzz.ratio` scoring functions.