In information systems, it is common for the same entity to be represented by slightly varying strings. Consider the following:
Joe Biden
Joseph Biden
Joseph R Biden
All three strings refer to the same person, but in slightly different ways.
Fuzzy search is the process of finding strings that approximately match a given string.
Let’s explore how we can utilize various fuzzy string matching algorithms in Python to compute similarity between pairs of strings.
SequenceMatcher from difflib
SequenceMatcher is available as part of the Python standard library. It uses the Ratcliff/Obershelp string matching algorithm which calculates the similarity metric between two strings as:
Twice the number of matching (overlapping) characters between the two strings divided by the total number of characters in the two strings.
from difflib import SequenceMatcher as SM
str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'
similarity_ratio = SM(None, str1, str2).ratio() # 0.9032258064516129
Here, we can see that the two strings are about 90% similar based on the similarity ratio calculated by SequenceMatcher.
The difflib module contains many useful string matching functions that you should certainly explore further.
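For intuition, here is a small sketch that recomputes that 0.90 ratio directly from the formula above, using SequenceMatcher’s get_matching_blocks method (the variable names are my own):
from difflib import SequenceMatcher
str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'
matcher = SequenceMatcher(None, str1, str2)
# M: total number of characters in the matching (overlapping) blocks
matching_chars = sum(block.size for block in matcher.get_matching_blocks())
# 2 * M / T, where T is the total number of characters in both strings
manual_ratio = 2 * matching_chars / (len(str1) + len(str2))
print(manual_ratio) # same value as matcher.ratio()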
Levenshtein distance
The Levenshtein distance between two strings is the minimum number of deletions, insertions, and substitutions needed to transform one string into the other.
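To make the definition concrete, here is a minimal (unoptimized) dynamic-programming sketch of the edit distance; in practice you would use the library below instead of rolling your own:
def levenshtein_distance(a, b):
    # previous_row[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]
        for j, char_b in enumerate(b, start=1):
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            substitution = previous_row[j - 1] + (char_a != char_b)
            current_row.append(min(insertion, deletion, substitution))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein_distance('kitten', 'sitting')) # 3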
Once you install the python-Levenshtein package:
pip install python-Levenshtein
You can compute both the Levenshtein edit distance and similarity ratio between two strings.
import Levenshtein as levenshtein
str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'
# Minimum number of single-character edits needed to turn str1 into str2
edit_distance = levenshtein.distance(str1, str2)
# Normalized similarity score between 0.0 and 1.0
similarity_ratio = levenshtein.ratio(str1, str2)
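If you only need the edit distance, a common way to turn it into a similarity score is to normalize by the length of the longer string. Note that this simple normalization is not necessarily the exact formula behind Levenshtein.ratio, so treat it as a rough sketch:
import Levenshtein as levenshtein
str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'
distance = levenshtein.distance(str1, str2)
# 1.0 for identical strings, 0.0 for strings with nothing in common
normalized_similarity = 1 - distance / max(len(str1), len(str2))
print(distance, normalized_similarity)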
Fuzzywuzzy
Fuzzywuzzy is a more feature-rich library for computing string similarity and performing fuzzy string matching in Python.
Like the python-Levenshtein library, it also has a ratio function:
from fuzzywuzzy import fuzz
str1 = 'But I have promises to keep, and miles to go before I sleep.'
str2 = 'But I have many promises to keep, and miles to run before sleep.'
ratio = fuzz.ratio(str1, str2)
The library also provides advanced functions for handling other complex string matching scenarios.
Handling sub-strings
Let’s take an example of a string that is a substring of another. Depending on the context, some text matching tasks will require us to treat a substring match as a complete match.
from fuzzywuzzy import fuzz
str1 = 'California, USA'
str2 = 'California'
ratio = fuzz.ratio(str1, str2)
partial_ratio = fuzz.partial_ratio(str1, str2)
print(ratio)
print(partial_ratio)
80
100
As you can see, the partial ratio is 100 while the plain ratio is 80, so relying on the partial ratio comes in handy in some scenarios.
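A typical use case is matching a short name against a longer free-text field. The strings below are made-up examples of mine, but they show when partial_ratio is the better choice:
from fuzzywuzzy import fuzz
# Hypothetical example: a short query against a longer address field
query = 'Golden Gate Bridge'
address = 'Golden Gate Bridge, San Francisco, CA, USA'
print(fuzz.ratio(query, address)) # penalized for the extra text
print(fuzz.partial_ratio(query, address)) # scores the best-matching substring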
Handling out-of-order words
What if we wanted to ignore how the words are ordered in strings?
Let’s take the example of these two strings:
Chocolate Strawberry Cake
Strawberry Chocolate Cake
Semantically, these two strings are the same. However, if you were to calculate the ratio of these strings, you would end up with a similarity ratio score of only 60.
Thankfully, fuzzywuzzy has got your back. You can use the token_set_ratio function to treat the individual words as order-independent tokens.
from fuzzywuzzy import fuzz
str1 = 'Chocolate Strawberry Cake'
str2 = 'Strawberry Chocolate Cake'
token_set_ratio = fuzz.token_set_ratio(str1, str2)
print(token_set_ratio)
100
The token set ratio of those two strings is now 100. Perfect.
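fuzzywuzzy also offers fuzz.token_sort_ratio, which sorts the words alphabetically before comparing rather than treating them as a set; it is the scorer used in the next example. With the same two strings, it also ignores word order:
from fuzzywuzzy import fuzz
str1 = 'Chocolate Strawberry Cake'
str2 = 'Strawberry Chocolate Cake'
# Both strings sort to the same sequence of tokens, so the score is 100
print(fuzz.token_sort_ratio(str1, str2)) # 100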
Best matches of a given string
So far, we have been looking at calculating pair-wise string similarity. However, one typically wants to find the strings that most closely match a given string. Once again, fuzzywuzzy has got a convenience function for doing just that.
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
str_list = ['Joe Biden', 'Joseph Biden', 'Joseph R Biden']
match_ratios = process.extract('joe r biden', str_list, scorer=fuzz.token_sort_ratio)
print(match_ratios)
best_match = process.extractOne('joe r biden', str_list, scorer=fuzz.token_sort_ratio)
print(best_match)
[('Joe Biden', 90), ('Joseph R Biden', 88), ('Joseph Biden', 78)]
('Joe Biden', 90)
Note how we’re passing a scorer function to the extraction functions. Depending on the context, we can also use the fuzz.partial_ratio or fuzz.ratio scoring functions.
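For instance, here is a sketch of the same lookup scored with plain fuzz.ratio and a minimum score threshold; score_cutoff is a parameter of extractOne, and the cutoff value of 75 is an arbitrary choice for this example:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
str_list = ['Joe Biden', 'Joseph Biden', 'Joseph R Biden']
# extractOne returns None if no choice scores at least score_cutoff
best_match = process.extractOne('joe r biden', str_list, scorer=fuzz.ratio, score_cutoff=75)
print(best_match)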