In information systems, it is common to have the same entity being represented by slightly varying strings. Consider the following:
Joe Biden Joseph Biden Joseph R Biden
All three strings refer to the same person, but in slightly different ways.
Fuzzy search is the process of finding strings that approximately match a given string.
Let’s explore how we can utilize various fuzzy string matching algorithms in Python to compute similarity between pairs of strings.
SequenceMatcher from difflib
SequenceMatcher is available as part of the Python standard library. It uses the Ratcliff/Obershelp string matching algorithm which calculates the similarity metric between two strings as:
Twice the number of matching (overlapping) characters between the two strings divided by the total number of characters in the two strings.
from difflib import SequenceMatcher as SM str1 = 'But I have promises to keep, and miles to go before I sleep.' str2 = 'But I have many promises to keep, and miles to run before sleep.' similarity_ratio = SM(isjunk=None, str1, str2).ratio() # 0.9032258064516129
Here, we can see that the two string are about 90% similar based on the similarity ratio calculated by
difflib module contains many useful string matching functions that you should
certainly explore further.
The Levenshtein distance between two strings is the number of deletions, insertions and substitutions needed to transform one string into another.
Once you install the
pip install python-Levenshtein
You can compute both the Levenshtein edit distance and similarity ratio between two strings.
import Levenshtein as levenshtein str1 = 'But I have promises to keep, and miles to go before I sleep.' str2 = 'But I have many promises to keep, and miles to run before sleep.' edit_distance = levenshtein.distance(str1, str2) similarity_ratio = levenshtein.ratio(str1, str2)
Fuzzywuzzy is a more feature-rich library for computing string similarity and performing fuzzy string matching in Python.
python-Levenshtein library, it also has a ratio function:
from fuzzywuzzy import fuzz str1 = 'But I have promises to keep, and miles to go before I sleep.' str2 = 'But I have many promises to keep, and miles to run before sleep.' ratio = fuzz.ratio(str1, str2)
The library also provides advanced functions for handling other complex string matching scenarios.
Let’s take an example of a string which is a substring of another. Depending on the context, some text matching will require us to treat substring matches as complete match.
from fuzzywuzzy import fuzz str1 = 'California, USA' str2 = 'California' ratio = fuzz.ratio(str1, str2) partial_ratio = fuzz.partial_ratio(str1, str2) print(ratio) print(partial_ratio)
As you can see, the
partial ratio is
100 while the plain
80 so relying on
partial ratio in handy in
Handling out of order words
What if we wanted to ignore how the words are ordered in strings?
Let’s take the example of these two strings:
Chocolate Strawberry Cake Strawberry Chocolate Cake
Semantically, these two strings are the same. However, if you were to calculate the ratio of these strings, you will end
up with a similarity ratio score of only
fuzzywuzzy has got your back. You can use the
token_set_ratio function to treat the individual words
as order independent tokens.
from fuzzywuzzy import fuzz str1 = 'Chocolate Strawberry Cake' str2 = 'Strawberry Chocolate Cake' token_set_ratio = fuzz.token_set_ratio(str1, str2) print(token_set_ratio)
The token set ratio of those two strings is now
Best matches of a given string
So far, we have been looking at calculating pair-wise string similarity.
However, one typically wants to find the closest matching strings of a given string. Once again,
got a convenience function for doing just that.
from fuzzywuzzy import process from fuzzywuzzy import fuzz str_list = ['Joe Biden', 'Joseph Biden', 'Joseph R Biden'] match_ratios = process.extract('joe r biden', str_list, scorer=fuzz.token_sort_ratio) print(match_ratios) best_match = process.extractOne('joe r biden', str_list, scorer=fuzz.token_sort_ratio) print(best_match)
[('Joe Biden', 90), ('Joseph R Biden', 88), ('Joseph Biden', 78)] ('Joe Biden', 90)
Note how we’re passing a
scorer function to the extraction functions. Depending on the context, we can also use
fuzz.ratio scoring functions.