I am trying to look for potential matches in a pandas column full of organization names. I am currently using iterrows(), but it is extremely slow on a dataframe with ~70,000 rows. After looking through Stack Overflow I have tried implementing a lambda/apply approach, but that seems to barely speed things up, if at all.
The first four rows of the dataframe look like this:
index org_name
0 cliftonlarsonallen llp minneapolis MN
1 loeb and troper llp newyork NY
2 dauby o'connor and zaleski llc carmel IN
3 wegner cpas llp madison WI
The following code block works but took around five days to process:
org_list = df['org_name']
from fuzzywuzzy import process

for index, row in df.iterrows():
    x = process.extract(row['org_name'], org_list, limit=2)[1]
    if x[1] > 93:
        df.loc[index, 'fuzzy_match'] = x[0]
        df.loc[index, 'fuzzy_match_score'] = x[1]
In effect, for each row I am comparing the organization name against the list of all organization names, taking the top two matches, then selecting the second-best match (because the top match will be the identical name), and then setting a condition that the score must be higher than 93 in order to create the new columns. The reason I'm creating additional columns is that I do not want to simply replace values -- I'd like to double-check the results first.
Is there a way to speed this up? I read several blog posts and Stack Overflow questions about 'vectorizing' this code, but my attempts at that failed. I also considered simply creating a 70,000 x 70,000 Levenshtein distance matrix and then extracting information from there. Is there a quicker way to generate the best match for each element in a list or pandas column?
Given your task, you are comparing 70k strings with each other using fuzz.WRatio, for a total of 4,900,000,000 comparisons, with each of these comparisons using the Levenshtein distance inside fuzzywuzzy, which is an O(N*M) operation. fuzz.WRatio is a combination of multiple different string matching ratios that have different weights; it then selects the best ratio among them, so it even has to calculate the Levenshtein distance multiple times. One goal should therefore be to reduce the search space by excluding some possibilities using a much faster matching algorithm.

Another issue is that the strings are preprocessed to remove punctuation and to lowercase them. While this is required for the matching (so that e.g. an uppercased word becomes equal to a lowercased one), we can do it ahead of time, so we only have to preprocess the 70k strings once.

I will use RapidFuzz instead of FuzzyWuzzy here, since it is quite a bit faster (I am the author).
The following version performs more than 10 times as fast as your previous solution in my experiments and applies the following improvements:
it preprocesses the strings ahead of time
it passes a score_cutoff to extractOne so it can skip calculations where it already knows they cannot reach this ratio
import pandas as pd, numpy as np
from rapidfuzz import process, utils

org_list = df['org_name']
processed_orgs = [utils.default_process(org) for org in org_list]

for (i, processed_query) in enumerate(processed_orgs):
    # None is skipped by extractOne, so we set the current element to None and
    # revert this change after the comparison
    processed_orgs[i] = None
    match = process.extractOne(processed_query, processed_orgs, processor=None, score_cutoff=93)
    processed_orgs[i] = processed_query
    if match:
        df.loc[i, 'fuzzy_match'] = org_list[match[2]]
        df.loc[i, 'fuzzy_match_score'] = match[1]
Here is a list of the most relevant improvements in RapidFuzz that make it faster than FuzzyWuzzy in this example:
It is implemented fully in C++, while a big part of FuzzyWuzzy is implemented in Python.
When calculating the Levenshtein distance it takes the score_cutoff into account to choose an optimized implementation. E.g. when the length difference between the strings is too big, it can exit in O(1).
FuzzyWuzzy uses python-Levenshtein to calculate the similarity between two strings, which uses a weighted Levenshtein distance with a weight of 2 for substitutions, implemented using Wagner-Fischer. RapidFuzz on the other hand uses a bit-parallel implementation for this based on BitPal, which is faster.
fuzz.WRatio combines the results of multiple other string matching algorithms like fuzz.ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio and takes the maximum result after weighting them. While fuzz.ratio has a weighting of 1, fuzz.token_sort_ratio and fuzz.token_set_ratio have one of 0.95. When the score_cutoff is bigger than 95, fuzz.token_sort_ratio and fuzz.token_set_ratio are not calculated anymore, since their results are guaranteed to be smaller than the score_cutoff.
In process.extractOne, RapidFuzz avoids calls through Python whenever possible and preprocesses the query once ahead of time. E.g. the BitPal algorithm requires one of the two compared strings to be stored in a bit vector, which takes a big part of the algorithm's runtime. In process.extractOne the query is stored in this bit vector only once and the bit vector is reused afterwards, making the algorithm a lot faster.
Since extractOne only searches for the best match, it uses the ratio of the current best match as score_cutoff for the next elements. This way it can quickly discard many elements using the optimized Levenshtein implementation described above. When it finds an element with a similarity of 100 it exits early, since there can't be a better match afterwards.
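As a small illustration of the score_cutoff behaviour described above (a hedged sketch; the strings come from the question's sample data and the only scores asserted are 0 for a below-cutoff pair and 100 for identical strings):

from rapidfuzz import fuzz

# with score_cutoff=95 only fuzz.ratio has to be evaluated, and pairs whose
# length difference already rules out such a high score can be rejected
# without running the full Levenshtein calculation; anything below the
# cutoff is reported as 0
print(fuzz.WRatio("cliftonlarsonallen llp minneapolis MN",
                  "wegner cpas llp madison WI",
                  score_cutoff=95))   # 0.0, clearly below the cutoff
print(fuzz.WRatio("wegner cpas llp madison WI",
                  "wegner cpas llp madison WI",
                  score_cutoff=95))   # 100.0, identical strings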
This solution leverages apply() and should demonstrate reasonable performance improvements. Feel free to play around with the scorer and change the threshold to meet your needs:
import pandas as pd, numpy as np
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
                   ['loeb and troper llp newyork NY'],
                   ["dauby o'connor and zaleski llc carmel IN"],
                   ['wegner cpas llp madison WI']],
                  columns=['org_name'])

org_list = df['org_name']
threshold = 40

def find_match(x):
    match = process.extract(x, org_list, limit=2, scorer=fuzz.partial_token_sort_ratio)[1]
    match = match if match[1] > threshold else np.nan
    return match

df['match found'] = df['org_name'].apply(find_match)
Returns:
org_name match found
0 cliftonlarsonallen llp minneapolis MN (wegner cpas llp madison WI, 50, 3)
1 loeb and troper llp newyork NY (wegner cpas llp madison WI, 46, 3)
2 dauby o'connor and zaleski llc carmel IN NaN
3 wegner cpas llp madison WI (cliftonlarsonallen llp minneapolis MN, 50, 0)
If you would just like to return the matching string itself, then you can modify as follows:
match = match[0] if match[1]>threshold else np.nan
I've added @user3483203's comment about using a list comprehension as an alternative option as well:
df['match found'] = [find_match(row) for row in df['org_name']]
Note that process.extract() is designed to handle a single query string and apply the passed scoring algorithm to that query and the supplied match options. For that reason, you have to evaluate each query against all 70,000 match options (the way you currently have your code set up), so you end up evaluating len(match_options)**2 (or 4,900,000,000) string comparisons. I therefore think the best performance improvements could be achieved by limiting the potential match options via more extensive logic in the find_match() function, e.g. enforcing that the match options start with the same letter as the query, etc.
Using iterrows() is not recommended on dataframes; you could use apply() instead, but that probably wouldn't speed things up by much. What is slow is fuzzywuzzy's extract method, where your input is compared with all 70k rows (string distance methods are computationally expensive). So if you intend to stick with fuzzywuzzy, one solution would be to limit your search, for example to only the entries with the same first letter, or to use another column in your data as a hint (State, City, ...).
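As a rough sketch of that first-letter blocking idea (my own illustration, not code from the question; find_match_in_bucket and the 93 threshold are just example choices):

from collections import defaultdict
from fuzzywuzzy import process

# group the names by first letter so each query is only compared against
# candidates that share that letter
buckets = defaultdict(list)
for name in df['org_name']:
    buckets[name[0].lower()].append(name)

def find_match_in_bucket(name, threshold=93):
    candidates = [c for c in buckets[name[0].lower()] if c != name]
    if not candidates:
        return None
    best = process.extractOne(name, candidates)   # (match, score)
    return best if best[1] > threshold else None

df['fuzzy_match'] = df['org_name'].apply(find_match_in_bucket)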
My initial approach was to solve this with Spark on a cluster or with multiprocessing in parallel, but I found that we can also improve the implementation of the algorithm itself, as RapidFuzz does.
import pandas as pd
from rapidfuzz import process, utils, fuzz
import multiprocessing
from multiprocessing import Process, Pool

# `choices` (the list of candidate names) and `checktokens` (the query strings)
# are assumed to be defined at module level so the worker processes can see them

def checker(wrong_option):
    if wrong_option in choices:
        # exact match, no fuzzy search needed
        return wrong_option, wrong_option, 100
    else:
        x = process.extractOne(wrong_option, choices, processor=None, score_cutoff=0)
        return wrong_option, x[0], x[1]

if __name__ == '__main__':
    # set up one worker per cpu core
    pool = Pool(multiprocessing.cpu_count())
    print("cpu counts:" + str(multiprocessing.cpu_count()))
    # map the queries across the workers
    pool_outputs = pool.map(checker, checktokens)
    # create DataFrame using the results
    df = pd.DataFrame(pool_outputs, columns=['Name', 'matched', 'Score'])
    # output
    print(df)
I have two dataframes.
player_stats:
player minutes total_points assists
1 Erling Haaland 77 13 0
2 Kevin De Bruyne 90 6 1
and, season_gw1:
player position gw team
10449 Erling Håland 4 1 Manchester City
10453 Kevin De Bruyne 3 1 Manchester City
I want to merge these two dataframes by player, but as you can see, for the first player (Haaland), the word is not spelled exactly the same on both dfs.
This is the code I'm using:
season_gw1_stats = season_gw1.merge(player_stats, on = 'player')
And the resulting df (season_gw1_stats) is this one:
player position gw team minutes total_points assists
10453 Kevin de Bruyne 3 1 Manchester City 90 6 1
How do I merge dfs by similar values? (This is not the full dataframe - I also have some other names that are spelled differently in both dfs but are very similar).
In order to use standard pandas to "merge dataframes" you will pretty much need to eliminate "similar" from the problem statement. So we're faced with mapping to "matching" values that are identical across dataframes. Here's a pair of plausible approaches:
normalize each name in isolation
examine quadratic pairwise distances
1. normalize
Map variant spellings down to a smaller universe of spellings where collisions (matches) are more likely. There are many approaches:
case smash to lower
map accented vowels to [aeiou]
discard all vowels
use simplifying regexes like s/sch/sh/ and s/de /de/
use Soundex or later competitors like Metaphone
manually curate a restricted vocabulary of correct spellings
Cost is O(N), linear with the total length of the dataframes.
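A minimal sketch of one such normalizer, combining a few of the options above (lowercasing, accent stripping, punctuation removal); the function name is mine:

import re
import unicodedata

def normalize(name):
    # case smash to lower and strip accents, so "Håland" and "Haland" collide
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c)).lower()
    # collapse punctuation and repeated whitespace
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return re.sub(r"\s+", " ", name).strip()

print(normalize("Erling Håland"))    # erling haland
print(normalize("Kevin De Bruyne"))  # kevin de bruyne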
2. pairwise distances
We wish to canonicalize, that is, to boil down multiple variant spellings to a distinguished canonical spelling. Begin by optionally normalizing, then sort at a cost of O(N log N), and finally make a linear pass that outputs only unique names. This trivial pre-processing step reduces N, which helps a lot when dealing with the O(N^2) quadratic cost.
Define a distance metric which accepts two names. When given a pair of identical names it must report a distance of zero; otherwise it deterministically reports a positive real number. You might use Levenshtein, or MRA.
Use nested loops to compare all names against all names. If the distance between two names is less than the threshold, arbitrarily declare name1 the winner, overwriting the 2nd name with that 1st value. The effect is to cluster multiple variant spellings down to a single winning spelling. Cost is O(N^2), quadratic.
Perhaps you're willing to tinker with the distance function a bit. You might give the initial letter(s) a heavier weight, such that a mismatched prefix guarantees the distance exceeds the threshold. In that case sorted names will help out, and the nested loop can be confined to just a small window of similarly prefixed names, with early termination once it sees the prefix has changed. Noting the distance between adjacent sorted names can help with manually choosing a sensible threshold parameter.
Finally, with adjusted names in hand, you're in a position to .merge() using exact equality tests.
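A compact sketch of this clustering followed by an exact merge, assuming the two dataframes from the question; canonicalize, the threshold of 3 edits and the player_key column are illustrative choices:

import Levenshtein

def canonicalize(names, threshold=3):
    # map every name to the first previously kept name within `threshold`
    # edits of it; O(N^2) in the worst case over the deduplicated names
    mapping, kept = {}, []
    for name in sorted(set(names)):
        for winner in kept:
            if Levenshtein.distance(name, winner) <= threshold:
                mapping[name] = winner
                break
        else:
            kept.append(name)
            mapping[name] = name
    return mapping

mapping = canonicalize(list(player_stats["player"]) + list(season_gw1["player"]))
player_stats["player_key"] = player_stats["player"].map(mapping)
season_gw1["player_key"] = season_gw1["player"].map(mapping)
season_gw1_stats = season_gw1.merge(player_stats, on="player_key")

Normalizing first (step 1) would bring "Erling Haaland" and "Erling Håland" within a single edit of each other, making the threshold choice less critical.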
I'm writing a program that compares a smaller list of game titles to a master list of many games to see which games in the smaller list more closely match with the titles of the games in the master list than others. In order to do this, I've been checking the Levenshtein distance (in percent form) between each game in the smaller list and every game in the master list and taking the maximum of all of these values (the lower the maximum percentage, the more unique the game has to be) using both the difflib and the fuzzywuzzy modules. The problem that I'm having is that a typical search using either process.extractOne() or difflib.get_close_matches() takes about 5+ seconds per game (with 38000+ strings in the master list), and I have about 4500 games to search through (5 * 4500 is about 6 hours and 15 minutes, which I don't have time for).
In hopes of finding a better and faster method of searching through a list of strings, I'm asking here what the fastest method in python of searching for the highest percent Levenshtein distance between a string and a list of strings is. If there is no better way than by using the two functions above or writing some other looping code, then please say so.
The two functions I used in specific to search for the highest distance are these:
metric = process.extractOne(name, master_names)[1] / 100
metric = fuzz.ratio(name, difflib.get_close_matches(name, master_names, 1, 0)[0]) / 100
Through experimentation and further research I discovered that the fastest method of checking the Levenshtein ratio is through the python-Levenshtein library itself. The function Levenshtein.ratio() is significantly faster (for one game the entire search takes only 0.05 seconds on average) compared to using any function in fuzzywuzzy or difflib, likely because of its simplicity and C implementation. I used this function in a for loop iterating over every name in the master list to get the best answer:
from Levenshtein import ratio

metric = 0
for master_name in master_names:
    new_metric = ratio(name, master_name)
    if new_metric > metric:
        metric = new_metric
In conclusion I say that the fastest method of searching for the highest percent Levenshtein distance between a string and a list of strings is to iterate over the list of strings, use Levenshtein.ratio() to get the ratio of each string compared with the first string, and then check for the highest value ratio on each iteration.
As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.
from StringIO import StringIO
myst="""01/01/2016 8781262
01/02/2016 8958598
01/03/2016 8787628
01/04/2016 9770861
01/05/2016 8409410
01/06/2016 8924784
01/07/2016 8597500
01/08/2016 6436862
01/09/2016 1542677
"""
u_cols=['month', 'count']
myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep='\t', names = u_cols)
Is there a mathematical formula that can define this "way below or too high" (ambiguous) concept?
This is easy if I define a limit (for e.g. 9 or 10%). But I want the script to decide that for me and return the values if the difference between the lowest and second last lowest value is more than overall 5%. In this case the September month count should be returned.
A very common approach to filtering outliers is to use the standard deviation. In this case, we will calculate a z-score, which quickly identifies how many standard deviations away from the mean each observation is. We can then filter out the observations whose absolute z-score is greater than 2. For normally distributed random variables, this should happen only about 5% of the time.
Define a zscore function
import numpy as np

def zscore(s):
    return (s - np.mean(s)) / np.std(s)
Apply it to the count column
zscore(df['count'])
0 0.414005
1 0.488906
2 0.416694
3 0.831981
4 0.256946
5 0.474624
6 0.336390
7 -0.576197
8 -2.643349
Name: count, dtype: float64
Notice that the September observation is 2.6 standard deviations away.
Use abs and gt to identify outliers
zscore(df['count']).abs().gt(2)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
Name: count, dtype: bool
Again, September comes back true.
Tie it all together to filter your original dataframe
df[zscore(df['count']).abs().gt(2)]
filter the other way
df[zscore(df['count']).abs().le(2)]
First of all, the "way below or too high" concept you refer to is known as an outlier, and quoting Wikipedia (not the best source):
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
But on the other side:
In general, if the nature of the population distribution is known a priori, it is possible to test if the number of outliers deviate significantly from what can be expected.
So in my opinion this boils down to the question of whether it is possible to make assumptions about the nature of your data, so that such decisions can be automated.
STRAIGHTFORWARD APPROACH
If you are lucky enough to have a relatively big sample size, and your different samples aren't correlated, you can apply the central limit theorem, which states that your values will follow a normal distribution (see this for a python-related explanation).
In this context, you may be able to quickly get the mean value and standard deviation of the given dataset. And by applying the corresponding function (with these two parameters) to each given value you can calculate its probability of belonging to the "cluster" (see this stackoverflow post for a possible python solution).
Then you do have to put a lower bound, since this distribution returns a 0% probability only when a point is infinitely far away from the mean value. But the good thing is that (if the assumptions are true) this bound nicely adapts to each different dataset, because of its exponential, normalized nature. This bound is typically expressed in sigma units, and is widely used in science and statistics. As a matter of fact, the 2013 Nobel Prize in Physics, dedicated to the discovery of the Higgs boson, was granted after a 5-sigma level was reached; quoting the link:
High-energy physics requires even lower p-values to announce evidence or discoveries. The threshold for "evidence of a particle," corresponds to p=0.003, and the standard for "discovery" is p=0.0000003.
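As a minimal sketch of that sigma/p-value bound, assuming the df built in the question and approximate normality of the counts (the 0.05 cutoff is only an example):

from scipy import stats

counts = df["count"].astype(float)
z = (counts - counts.mean()) / counts.std(ddof=0)

# two-sided tail probability of landing at least this far from the mean
p = 2 * stats.norm.sf(z.abs())
print(df[p < 0.05])   # flags the September row, roughly a 2-sigma bound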
ALTERNATIVES
If you cannot make such simple assumptions about how your data should look, you can always let a program infer them. This approach is a core feature of most machine learning algorithms, which can nicely adapt to strongly correlated and even skewed data if fine-tuned properly. If this is what you need, Python has many good libraries for that purpose that can even fit in a small script (the one I know best is TensorFlow from Google).
In this case I would consider two different approaches, depending again on what your data looks like:
Supervised learning: In case you have a training set at your disposal that states which samples belong and which ones don't (known as labeled data), there are algorithms like the support vector machine that, although lightweight, can adapt to highly non-linear boundaries amazingly well.
Unsupervised learning: This is probably what I would try first, when you simply have an unlabeled dataset. The "straightforward approach" I mentioned before is the simplest case of an anomaly detector, and thus can be highly tweaked and customized to also account for correlations in even an infinite number of dimensions, thanks to the kernel trick. To understand the motivations and approach of an ML-based anomaly detector, I would suggest taking a look at Andrew Ng's videos on the matter; a small sketch using one such detector follows below.
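For the unsupervised case, here is a tiny sketch using scikit-learn's OneClassSVM, one kernel-based anomaly detector (my choice for illustration; the nu value is arbitrary and with only nine monthly counts this is purely a toy example):

from sklearn.svm import OneClassSVM

X = df[["count"]].to_numpy(dtype=float)
X = (X - X.mean()) / X.std()          # scale before using an RBF kernel

detector = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
labels = detector.fit_predict(X)      # -1 marks the points treated as outliers
print(df[labels == -1])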
I hope it helps!
Cheers
One way to filter outliers is the interquartile range (IQR, Wikipedia), which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
Outliers are defined as points that fall below Q1 - k * IQR or above Q3 + k * IQR.
You can select the constant k based on your domain knowledge (a common choice is 1.5).
Given the data, a filter in pandas could look like this:
iqr_filter = pd.DataFrame(df["count"].quantile([0.25, 0.75])).T
iqr_filter["iqr"] = iqr_filter[0.75]-iqr_filter[0.25]
iqr_filter["lo"] = iqr_filter[0.25] - 1.5*iqr_filter["iqr"]
iqr_filter["up"] = iqr_filter[0.75] + 1.5*iqr_filter["iqr"]
df_filtered = df.loc[(df["count"] > iqr_filter["lo"][0]) & (df["count"] < iqr_filter["up"][0]), :]
I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the first k columns, 1 ≤ k ≤ n. Each subproblem can
be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
Ok, so for part a: There are 8 possible solutions.
For part b, I'm unsure, but this is where I'm headed:
Split into sub-problems. Assume 1 ≤ i ≤ n.
1. Define Cj[i] to be the optimal value by pebbling columns 0,...,i, such that column i has pattern type j.
2. Create 8 separate arrays of n elements for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.
You're on the right track. As you examine each new column, you will end up computing all possible best-scores up to that point.
Let's say you built your compatibility list (a 2D array) and called it Li[y] such that for each pattern i there are one or more compatible patterns Li[y].
Now, you examine column j. First, you compute that column's isolated scores for each pattern i. Call it Sj[i]. For each pattern i and compatible
pattern x = Li[y], you need to maximize the total score Cj such that Cj[x] = Cj-1[i] + Sj[x]. This is a simple array test and update (if bigger).
In addition, you store the pebbling pattern that led to each score. When you update Cj[x] (ie you increase its score from its present value) then remember the initial and subsequent patterns that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".
When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern from each column that led to this result.
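Here is a compact Python sketch of that scheme, enumerating the 8 legal column patterns, precomputing compatibility, and maximizing column by column; max_pebbling and the board layout (one list of 4 integers per column) are my own choices, and for brevity it returns only the optimal value rather than keeping the predecessor table Pj needed to recover the placement:

from itertools import combinations

def max_pebbling(board):
    # board: list of n columns, each a list of 4 integers (rows 0..3)
    # the 8 legal column patterns: subsets of rows with no two vertically adjacent
    patterns = [frozenset(s)
                for r in range(5)
                for s in combinations(range(4), r)
                if all(b - a > 1 for a, b in zip(s, s[1:]))]

    # two patterns are compatible if they share no row (no horizontal adjacency)
    compatible = {p: [q for q in patterns if not (p & q)] for p in patterns}

    def score(col, p):
        return sum(col[r] for r in p)

    # best[p] = best total over the columns seen so far, ending in pattern p
    best = {p: score(board[0], p) for p in patterns}
    for col in board[1:]:
        best = {p: score(col, p) + max(best[q] for q in compatible[p])
                for p in patterns}
    return max(best.values())

print(max_pebbling([[1, 2, 3, 4], [5, 1, 1, 5], [2, 2, 2, 2]]))   # 16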
I have a long (> 1000 items) list of words, from which I would like to remove words that are "too similar" to other words, until the remaining words are all "significantly different". For example, so that no two words are within an edit distance D.
I do not need a unique solution, and it doesn't have to be exactly optimal, but it should be reasonably quick (in Python) and not discard way too many entries.
How can I achieve this? Thanks.
Edit: to be clear, I can google for a python routine that measures edit distance. The problem is how to do this efficiently, and, perhaps, in some way that finds a "natural" value of D. Maybe by constructing some kind of trie from all words and then pruning?
You can use a BK-tree, and before each item is added check that it is not within distance D of any others (thanks to @DietrichEpp in the comments for this idea).
You can use this recipe for a bk-tree (though any similar recipes are easily modified). Simply make two changes: change the line:
def __init__(self, items, distance, usegc=False):
to
def __init__(self, items, distance, threshold=0, usegc=False):
And change the line
if el not in self.nodes: # do not add duplicates
to
if (el not in self.nodes and
        (threshold == None or len(self.find(el, threshold)) == 0)):
This makes sure there are no duplicates when an item is added. Then, the code to remove duplicates from a list is simply:
from Levenshtein import distance
from bktree import BKtree

def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return tr.nodes.keys()
Note that this relies on the python-Levenshtein package for its distance function, which is much faster than the one provided by bk-tree. python-Levenshtein has C-compiled components, but it's worth the installation.
Finally, I set up a performance test with an increasing number of words (grabbed randomly from /usr/share/dict/words) and graphed the amount of time each took to run:
import random
import time
from Levenshtein import distance
from bktree import BKtree

with open("/usr/share/dict/words") as inf:
    word_list = [l[:-1] for l in inf]

def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return tr.nodes.keys()

def time_remove_duplicates(n, threshold):
    """Test using n words"""
    nwords = random.sample(word_list, n)
    t = time.time()
    newlst = remove_duplicates(nwords, threshold)
    return len(newlst), time.time() - t

ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)

from matplotlib import pyplot as plt
plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")
Without confirming it mathematically, I don't think it's quadratic, and I think it might actually be n log n, which would make sense if inserting into a bk-tree is a log time operation. Most notably, it runs pretty quickly with under 5000 strings, which hopefully is the OP's goal (and it's reasonable with 15000, which a traditional for loop solution would not be).
Tries will not be helpful, nor will hash maps. They are simply not useful for spatial, high-dimensional problems like this one.
But the real problem here is the ill-specified requirement of "efficient". How fast is "efficient"?
import Levenshtein
def simple(corpus, distance):
    words = []
    while corpus:
        center = corpus[0]
        words.append(center)
        corpus = [word for word in corpus
                  if Levenshtein.distance(center, word) >= distance]
    return words
I ran this on 10,000 words selected uniformly from the "American English" dictionary I have on my hard drive, looking for sets with a distance of 5, yielding around 2,000 entries.
real 0m2.558s
user 0m2.404s
sys 0m0.012s
So, the question is, "How efficient is efficient enough"? Since you didn't specify your requirements, it's really hard for me to know if this algorithm works for you or not.
The rabbit hole
If you want something faster, here's how I would do it.
Create a VP tree, BK tree, or other suitable spatial index. For each word in the corpus, insert that word into the tree if it has a suitable minimum distance from every word in the index. Spatial indexes are specifically designed to support this kind of query.
At the end, you will have a tree containing nodes with the desired minimum distance.
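For reference, a minimal self-contained sketch of that approach with a hand-rolled BK-tree (my own toy implementation, not the recipe used in the answer above; python-Levenshtein supplies the metric):

import Levenshtein

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None                 # node = [word, {edge_distance: child}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def find(self, word, threshold):
        # return all stored words within `threshold` of `word`
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= threshold:
                found.append(node[0])
            # triangle inequality: only children whose edge distance lies in
            # [d - threshold, d + threshold] can contain further matches
            stack.extend(child for dist, child in node[1].items()
                         if d - threshold <= dist <= d + threshold)
        return found

def thin_out(words, min_distance):
    # keep a word only if nothing already kept is within min_distance of it
    tree, kept = BKTree(Levenshtein.distance), []
    for w in words:
        if not tree.find(w, min_distance):
            tree.add(w)
            kept.append(w)
    return kept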
Your trie thought is definitely an interesting one. This page has a great setup for fast edit distance calculations in a trie, and it would definitely be efficient if you needed to expand your wordlist to millions rather than a thousand, which is pretty small in the corpus linguistics business.
Good luck, it sounds like a fun representation of the problem!