In information systems, it is common to have the same entity being represented by slightly varying strings. Consider the following:

Joe Biden
Joseph Biden
Joseph R Biden

All three strings refer to the same person, but in slightly different ways.

Fuzzy search is the process of finding strings that approximately match a given string.

Let’s explore how we can utilize various fuzzy string matching algorithms in Python to compute similarity between pairs of strings.

SequenceMatcher from difflib

SequenceMatcher is available as part of the Python standard library. It uses the Ratcliff/Obershelp string matching algorithm which calculates the similarity metric between two strings as:

Twice the number of matching (overlapping) characters between the two strings divided by the total number of characters in the two strings.

Ratcliff/Obershelp string matching formula

from difflib import SequenceMatcher as SM

str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'

similarity_ratio = SM(isjunk=None, str1, str2).ratio() # 0.9032258064516129

Here, we can see that the two string are about 90% similar based on the similarity ratio calculated by SequenceMatcher.

The difflib module contains many useful string matching functions that you should certainly explore further.

Levenshtein distance

The Levenshtein distance between two strings is the number of deletions, insertions and substitutions needed to transform one string into another.

Once you install the python-Levenshtein package:

pip install python-Levenshtein

You can compute both the Levenshtein edit distance and similarity ratio between two strings.

import Levenshtein as levenshtein

str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'

edit_distance = levenshtein.distance(str1, str2)
similarity_ratio = levenshtein.ratio(str1, str2)

Fuzzywuzzy

Fuzzywuzzy is a more feature-rich library for computing string similarity and performing fuzzy string matching in Python.

Like the python-Levenshtein library, it also has a ratio function:

from fuzzywuzzy import fuzz

str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'

ratio = fuzz.ratio(str1, str2)

The library also provides advanced functions for handling other complex string matching scenarios.

Handling sub-strings

Let’s take an example of a string which is a substring of another. Depending on the context, some text matching will require us to treat substring matches as complete match.

from fuzzywuzzy import fuzz

str1 = 'California, USA'
str2 = 'California'

ratio = fuzz.ratio(str1, str2)
partial_ratio = fuzz.partial_ratio(str1, str2)

print(ratio)
print(partial_ratio)
80
100

As you can see, the partial ratio is 100 while the plain ratio is 80 so relying on partial ratio in handy in some scenarios.

Handling out of order words

What if we wanted to ignore how the words are ordered in strings?

Let’s take the example of these two strings:

Chocolate Strawberry Cake
Strawberry Chocolate Cake 

Semantically, these two strings are the same. However, if you were to calculate the ratio of these strings, you will end up with a similarity ratio score of only 60.

Thankfully, fuzzywuzzy has got your back. You can use the token_set_ratio function to treat the individual words as order independent tokens.

from fuzzywuzzy import fuzz

str1 = 'Chocolate Strawberry Cake'
str2 = 'Strawberry Chocolate Cake'

token_set_ratio = fuzz.token_set_ratio(str1, str2)
print(token_set_ratio)
100

The token set ratio of those two strings is now 100. Perfect.

Best matches of a given string

So far, we have been looking at calculating pair-wise string similarity.

However, one typically wants to find the closest matching strings of a given string. Once again, fuzzywuzzy has got a convenience function for doing just that.

from fuzzywuzzy import process
from fuzzywuzzy import fuzz

str_list = ['Joe Biden', 'Joseph Biden', 'Joseph R Biden']

match_ratios = process.extract('joe r biden', str_list, scorer=fuzz.token_sort_ratio)
print(match_ratios)

best_match = process.extractOne('joe r biden', str_list, scorer=fuzz.token_sort_ratio)
print(best_match)
[('Joe Biden', 90), ('Joseph R Biden', 88), ('Joseph Biden', 78)]
('Joe Biden', 90)

Note how we’re passing a scorer function to the extraction functions. Depending on the context, we can also use fuzz.partial_ratio or fuzz.ratio scoring functions.