I have two dataframes.
player_stats:
player minutes total_points assists
1 Erling Haaland 77 13 0
2 Kevin De Bruyne 90 6 1
and, season_gw1:
player position gw team
10449 Erling Håland 4 1 Manchester City
10453 Kevin De Bruyne 3 1 Manchester City
I want to merge these two dataframes by player, but as you can see, for the first player (Haaland), the word is not spelled exactly the same on both dfs.
This is the code I'm using:
season_gw1_stats = season_gw1.merge(player_stats, on = 'player')
And the resulting df (season_gw1_stats) is this one:
player position gw team minutes total_points assists
10453 Kevin de Bruyne 3 1 Manchester City 90 6 1
How do I merge dfs by similar values? (This is not the full dataframe - I also have some other names that are spelled differently in both dfs but are very similar).
In order to use standard pandas to "merge dataframes", you will pretty much need to eliminate "similar" from the problem statement. So we're faced with mapping to "matching" values that are identical across dataframes.
Here's a pair of plausible approaches.
normalize each name in isolation
examine quadratic pairwise distances
1. normalize
Map variant spellings down to a smaller universe of spellings where collisions (matches) are more likely. There are many approaches:
case smash to lower
map accented vowels to [aeiou]
discard all vowels
use simplifying regexes like s/sch/sh/ and s/de /de/
use Soundex or later competitors like Metaphone
manually curate a restricted vocabulary of correct spellings
Cost is O(N), linear in the total length of the dataframes.
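For example, here is a minimal sketch of the normalization route. It assumes the third-party unidecode package for accent folding and one ad-hoc regex for the Haaland / Håland spelling; both are illustrative choices rather than a complete rule set, and the dataframes are the ones from the question.

import re
import pandas as pd
from unidecode import unidecode  # third-party; maps "å" -> "a", "é" -> "e", etc.

def normalize(name: str) -> str:
    name = unidecode(name).lower()       # case smash + strip accents
    name = re.sub(r'aa', 'a', name)      # simplifying regex: "haaland" -> "haland"
    return ' '.join(name.split())        # squeeze whitespace

player_stats['player_key'] = player_stats['player'].map(normalize)
season_gw1['player_key'] = season_gw1['player'].map(normalize)

season_gw1_stats = season_gw1.merge(player_stats, on='player_key')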
2. pairwise distances
We wish to canonicalize, to boil down multiple variant spellings to a distinguished canonical spelling. Begin by optionally normalizing, then sort, at a cost of O(N log N), and finally make a linear pass that outputs only unique names. This trivial pre-processing step reduces N, which helps a lot when dealing with the O(N^2) quadratic cost.
Define a distance metric which accepts two names. When given a pair of identical names it must report a distance of zero; otherwise it deterministically reports a positive real number. You might use Levenshtein, or MRA.

Use nested loops to compare all names against all names. If the distance between two names is less than the threshold, arbitrarily declare the first name the winner and overwrite the second name with the first value. The effect is to cluster multiple variant spellings down to a single winning spelling. Cost is O(N^2), quadratic.
Perhaps you're willing to tinker with the distance function a bit. You might give the initial letter(s) a heavier weight, such that a mismatched prefix guarantees the distance will exceed the threshold. In that case sorted names help, and the nested loop can be confined to just a small window of similarly prefixed names, with early termination once it sees the prefix has changed. Noting the distance between adjacent sorted names can also help with manually choosing a sensible threshold parameter.
Finally, with adjusted names in hand, you're in a position to .merge() using exact equality tests.
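A rough sketch of the quadratic clustering step, assuming the python-Levenshtein package and a hand-tuned threshold (both are placeholders you would adjust to your data); player_stats and season_gw1 are the question's dataframes.

import pandas as pd
import Levenshtein  # pip install python-Levenshtein

def cluster_names(names, threshold=3):
    # Overwrite each name that is within `threshold` edits of an earlier
    # (alphabetically smaller) name with that earlier winning spelling. O(N^2).
    canon = sorted(set(names))             # dedupe + sort first: reduces N
    mapping = {}
    for i, winner in enumerate(canon):
        if winner in mapping:
            continue                       # already absorbed by an earlier winner
        mapping[winner] = winner
        for other in canon[i + 1:]:
            if other not in mapping and Levenshtein.distance(winner, other) < threshold:
                mapping[other] = winner
    return mapping

mapping = cluster_names(pd.concat([player_stats['player'], season_gw1['player']]))
player_stats['player'] = player_stats['player'].map(mapping)
season_gw1['player'] = season_gw1['player'].map(mapping)
season_gw1_stats = season_gw1.merge(player_stats, on='player')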
Given a set of strings (first column) along with counts (second column), e.g.:
aaaa 10
aaab 5
abbb 3
cbbb 2
dbbb 1
cccc 8
Are there any algorithms or even implementations (ideally as a Unix executable, R or Python) which collapse this set into a new set based on a given hamming distance?
Collapsing implies adding the counts. Strings with a lower count are collapsed into strings with higher counts.
For example, for hamming distance 1, the above set would collapse the second string aaab into aaaa since they are 1 hamming distance apart and aaaa has a higher count.
The collapsed entry would have the combined count, here aaaa 15
For this set, we'd, therefore, get the following collapsed set:
aaaa 15
abbb 6
cccc 8
Ideally, an implementation should be efficient, so even heuristics which do not guarantee an optimal solution would be appreciated.
Further background and motivation
Calculating the hamming distance between 2 strings (a pair) has been implemented in most programming languages. A brute force solution would compute the distance between all pairs. Maybe there is no way around it. However, I'd imagine an efficient solution would avoid calculating the distance for all pairs. There are maybe clever ways to save some calculations based on metric theory (since hamming distance is a metric), e.g. if the hamming distance between x and z is 3, and between x and y is 3, I can avoid calculating between y and z. Maybe there is a clever k-mer approach, or maybe some efficient solution for a constant distance (say d=1).
Even if there were only a brute force solution, I'd be curious whether this has been implemented before and how to use it (ideally without me having to implement it myself).
I thought up the following:
This reports the item with the highest score together with the sum of its score and the scores of its nearby neighbors. Once a neighbor is used it is not reported separately.
I suggest using a Vantage-point tree as the metric index.
The algorithm would look like this:
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. for the string with the highest score in the max heap:
4. use the metric index to find the nearby strings
5. print the string, and the sum of its score and its nearby strings' scores
6. remove from the metric index the string and each of the nearby strings
7. remove from the max heap the string and each of the nearby strings
8. repeat 3-7 until the max heap is empty
Perhaps this could be simplified by using a "used" table rather than removing anything. The metric space index would not need to support efficient deletion, nor would the max heap need to support deletion by value. But this would be slower if the neighborhoods are large and overlap frequently, so efficient deletion might be a necessary difficulty.
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. construct the used table as an empty set
4. for the string with the highest score in the max heap:
5. if this string is in the used table: start over with the next string
6. use the metric index to find the nearby strings
7. remove any nearby strings that are in the used table
8. print the string, and the sum of its score and its nearby strings' scores
9. add the nearby strings to the used table
10. repeat 4-9 until the max heap is empty
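A minimal sketch of this "used table" variant in Python, with a brute-force scan standing in for the Vantage-point tree (so the neighbor search is a linear pass per item rather than a real metric index) and a heap built from the counts:

import heapq

def hamming(a, b):
    # assumes equal-length strings
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def collapse(items, max_dist=1):
    # items: list of (string, count). Greedily credit each string's count
    # to the highest-count unused string within max_dist of it.
    heap = [(-count, s) for s, count in items]
    heapq.heapify(heap)                    # max heap via negated counts
    counts = dict(items)
    used = set()
    result = []
    while heap:
        _, s = heapq.heappop(heap)
        if s in used:
            continue                       # already absorbed by a leader
        # brute-force stand-in for the metric index query
        neighbors = [t for t in counts
                     if t != s and t not in used and hamming(s, t) <= max_dist]
        result.append((s, counts[s] + sum(counts[t] for t in neighbors)))
        used.add(s)
        used.update(neighbors)
    return result

print(collapse([("aaaa", 10), ("aaab", 5), ("abbb", 3),
                ("cbbb", 2), ("dbbb", 1), ("cccc", 8)]))
# [('aaaa', 15), ('cccc', 8), ('abbb', 6)]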
I can not provide a complexity analysis.
I was thinking about the second algorithm. The part that I thought was slow was checking the neighborhood against the used table. This is not needed, as deletion from a Vantage-point tree can be done in linear time: when searching for the neighbors, remember where they were found and then remove them later using those locations. If a neighbor is used as a vantage point, mark it as removed so that a search will not return it, but leave it in place otherwise. This, I think, restores it to below quadratic; otherwise it would be something like the number of items times the size of a neighborhood.
In response to the comment: the problem statement was "Strings with a lower count are collapsed into strings with higher counts," and this does compute that. It is not a greedy approximation that could produce a non-optimal result, because there was nothing to maximize or minimize; it is an exact algorithm. It returns the item with the highest score combined with the score of its neighborhood.
This can be viewed as assigning a leader to each neighborhood such that each item has at most one leader and that leader has the largest overall score so far, or equivalently as a directed graph.
The specification wasn't for a dynamic programming or optimization problem. For that you would ask for the item with the highest score in the highest total scoring neighborhood. That can also be solved in a similar way by changing the ranking function for a string from its score to the pair (sum of its score and its neighborhood's scores, its score).
It does mean that it can't be solved with a max heap over the scores, as removing items affects the neighbors of the neighborhood, and one would have to recalculate their neighborhood scores before again finding the item with the highest total scoring neighborhood.
I am trying to look for potential matches in a pandas column full of organization names. I am currently using iterrows() but it is extremely slow on a dataframe with ~70,000 rows. After having looked through StackOverflow I have tried implementing a lambda row (apply) method but that seems to barely speed things up, if at all.
The first four rows of the dataframe look like this:
index org_name
0 cliftonlarsonallen llp minneapolis MN
1 loeb and troper llp newyork NY
2 dauby o'connor and zaleski llc carmel IN
3 wegner cpas llp madison WI
The following code block works but took around five days to process:
from fuzzywuzzy import process

org_list = df['org_name']

for index, row in df.iterrows():
    x = process.extract(row['org_name'], org_list, limit=2)[1]
    if x[1] > 93:
        df.loc[index, 'fuzzy_match'] = x[0]
        df.loc[index, 'fuzzy_match_score'] = x[1]
In effect, for each row I am comparing the organization name against the list of all organization names, taking the top two matches, then selecting the second-best match (because the top match will be the identical name), and then setting a condition that the score must be higher than 93 in order to create the new columns. The reason I'm creating additional columns is that I do not want to simply replace values -- I'd like to double-check the results first.
Is there a way to speed this up? I read several blog posts and StackOverflow questions that talked about 'vectorizing' this code but my attempts at that failed. I also considered simply creating a 70,000 x 70,000 Levenshtein distance matrix and then extracting information from there. Is there a quicker way to generate the best match for each element in a list or pandas column?
Given your task, you're comparing 70k strings with each other using fuzz.WRatio, so you have a total of 4,900,000,000 comparisons, with each of these comparisons using the Levenshtein distance inside fuzzywuzzy, which is an O(N*M) operation. fuzz.WRatio is a combination of multiple different string matching ratios that have different weights; it selects the best ratio among them, so it even has to calculate the Levenshtein distance multiple times. One goal should therefore be to reduce the search space by excluding some possibilities using a much faster matching algorithm. Another issue is that the strings are preprocessed to remove punctuation and to lowercase them. While this is required for the matching (so e.g. an uppercased word becomes equal to a lowercased one), we can do this ahead of time, so we only have to preprocess the 70k strings once. I will use RapidFuzz instead of FuzzyWuzzy here, since it is quite a bit faster (I am the author).
The following version performs more than 10 times as fast as your previous solution in my experiments and applies the following improvements:
it preprocesses the strings ahead of time
it passes a score_cutoff to extractOne so it can skip calculations where it already knows they can not reach this ratio
import pandas as pd
from rapidfuzz import process, utils

org_list = df['org_name']
processed_orgs = [utils.default_process(org) for org in org_list]

for i, processed_query in enumerate(processed_orgs):
    # None is skipped by extractOne, so we set the current element to None and
    # revert this change after the comparison
    processed_orgs[i] = None
    match = process.extractOne(processed_query, processed_orgs, processor=None, score_cutoff=93)
    processed_orgs[i] = processed_query
    if match:
        df.loc[i, 'fuzzy_match'] = org_list[match[2]]
        df.loc[i, 'fuzzy_match_score'] = match[1]
Here is a list of the most relevant improvements in RapidFuzz that make it faster than FuzzyWuzzy in this example:
It is implemented fully in C++, while a big part of FuzzyWuzzy is implemented in Python.
When calculating the Levenshtein distance it takes the score_cutoff into account to choose an optimized implementation. E.g. when the length difference between the strings is too big, it can exit in O(1).
FuzzyWuzzy uses python-Levenshtein to calculate the similarity between two strings, which uses a weighted Levenshtein distance with a weight of 2 for substitutions. This is implemented using Wagner-Fischer. RapidFuzz on the other hand uses a bit-parallel implementation for this based on BitPal, which is faster.
fuzz.WRatio combines the results of multiple other string matching algorithms like fuzz.ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio and takes the maximum result after weighting them. While fuzz.ratio has a weighting of 1, fuzz.token_sort_ratio and fuzz.token_set_ratio have one of 0.95. When the score_cutoff is bigger than 95, fuzz.token_sort_ratio and fuzz.token_set_ratio are not calculated anymore, since the results are guaranteed to be smaller than the score_cutoff.
In process.extractOne, RapidFuzz avoids calls through Python whenever possible and preprocesses the query once ahead of time. E.g. the BitPal algorithm requires one of the two strings being compared to be stored in a bitvector, which takes a big part of the algorithm's runtime. In process.extractOne the query is stored in this bitvector only once and the bitvector is reused afterwards, making the algorithm a lot faster.
Since extractOne only searches for the best match, it uses the ratio of the current best match as the score_cutoff for the next elements. This way it can quickly discard more elements by using the improvements to the Levenshtein distance calculation from 2) in many cases. When it finds an element with a similarity of 100 it exits early, since there can't be a better match afterwards.
This solution leverages apply() and should demonstrate reasonable performance improvements. Feel free to play around with the scorer and change the threshold to meet your needs:
import numpy as np
import pandas as pd
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
                   ['loeb and troper llp newyork NY'],
                   ["dauby o'connor and zaleski llc carmel IN"],
                   ['wegner cpas llp madison WI']],
                  columns=['org_name'])

org_list = df['org_name']
threshold = 40

def find_match(x):
    match = process.extract(x, org_list, limit=2, scorer=fuzz.partial_token_sort_ratio)[1]
    match = match if match[1] > threshold else np.nan
    return match

df['match found'] = df['org_name'].apply(find_match)
Returns:
org_name match found
0 cliftonlarsonallen llp minneapolis MN (wegner cpas llp madison WI, 50, 3)
1 loeb and troper llp newyork NY (wegner cpas llp madison WI, 46, 3)
2 dauby o'connor and zaleski llc carmel IN NaN
3 wegner cpas llp madison WI (cliftonlarsonallen llp minneapolis MN, 50, 0)
If you would just like to return the matching string itself, then you can modify as follows:
match = match[0] if match[1]>threshold else np.nan
I've added #user3483203's comment pertaining to a list comprehension here as an alternative option as well:
df['match found'] = [find_match(row) for row in df['org_name']]
Note that process.extract() is designed to handle a single query string and apply the passed scoring algorithm to that query and the supplied match options. For that reason, you will have to evaluate that query against all 70,000 match options (the way you currently have your code set up), so you will be evaluating len(match_options)**2 (or 4,900,000,000) string comparisons. Therefore, I think the best performance improvements could be achieved by limiting the potential match options via more extensive logic in the find_match() function, e.g. enforcing that the match options start with the same letter as the query, etc.
Using iterrows() is not recommended on dataframes; you could use apply() instead, but that probably wouldn't speed things up by much. What is slow is fuzzywuzzy's extract method, where your input is compared with all 70k rows (string distance methods are computationally expensive). So if you intend to stick with fuzzywuzzy, one solution would be to limit your search, for example to only those names with the same first letter, or to use another column in your data as a hint (State, City, ...).
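A rough sketch of that blocking idea, grouping by first letter before fuzzy matching; the 93 cutoff and column names follow the question, and this is illustrative rather than a drop-in replacement:

from fuzzywuzzy import process

# Group names by their first letter so each name is only compared
# against names in the same (much smaller) block.
df['block'] = df['org_name'].str.lower().str[0]

for _, group in df.groupby('block'):
    names = group['org_name'].tolist()
    for idx, name in zip(group.index, names):
        candidates = [n for n in names if n != name]
        if not candidates:
            continue
        best = process.extractOne(name, candidates)   # (match, score)
        if best and best[1] > 93:
            df.loc[idx, 'fuzzy_match'] = best[0]
            df.loc[idx, 'fuzzy_match_score'] = best[1]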
My initial approach was to solve this with Spark on a cluster, or with multiprocessing in parallel. But I found we can also improve the algorithm implementation itself, as with RapidFuzz.
import multiprocessing
from multiprocessing import Pool

import pandas as pd
from rapidfuzz import process

# `choices` and `checktokens` are assumed to be defined elsewhere as lists of strings.

def checker(wrong_option):
    if wrong_option in choices:
        # exact match, no fuzzy search needed
        return wrong_option, wrong_option, 100
    else:
        x = process.extractOne(wrong_option, choices, processor=None, score_cutoff=0)
        return wrong_option, x[0], x[1]

if __name__ == '__main__':
    # use one worker per CPU core
    pool = Pool(multiprocessing.cpu_count())
    print("cpu counts:" + str(multiprocessing.cpu_count()))
    # map the check over all tokens in parallel
    pool_outputs = pool.map(checker, checktokens)
    # create a DataFrame from the results
    df = pd.DataFrame(pool_outputs, columns=['Name', 'matched', 'Score'])
    # output
    print(df)
I'm looking for a method to create all possible unique starting positions for a N-player (N is a power of 2) single elimination (knockout) bracket tournament.
Let's say we have players 'A', 'B', 'C', and 'D' and want to find out all possible initial positions. The tournament would then look like this:
A vs B, C vs D. Then winner(AB) vs winner(CD).
(I will use the notation (A,B,C,D) for the setup above)
Those would simply be all possible permutations of 4 elements, there are 4!=24 of those, and it's easy to generate them.
But they wouldn't be unique for the Tournament, since
(A,B,C,D), (B,A,C,D), (B,A,D,C), (C,D,A,B), ...
would all lead to the same matches being played.
In this case, the set of unique setups is, I think:
(A,B,C,D), (A,C,B,D), (A,D,C,B)
All other combinations would be "symmetric".
Now my questions would be for the general case of N=2^d players:
how many such unique setups are there?
is this a known problem I could look up? Haven't found it yet.
is there a method to generate them all?
how would this method look in python
(questions ranked by perceived usefulness)
I have stumbled upon this entry, but it does not really deal with the problem I'm discussing here.
how many such unique setups are there?
Let there be n teams. There are n! ways to list them in order. We'll start with that. Then deal with the over-counting.
Say we have 8 teams. One possibility is
ABCDEFGH
Swapping teams 1 and 2 won't make a difference. We can have
BACDEFGH
and the same teams play. Divide by 2 to account for that. Swapping 3 and 4 won't matter either; divide by 2 again. Same with teams 5 and 6, and with 7 and 8. In total there are 4 groups of 2 (4 matches in the first round), so we take n! and divide by 2^(n/2).
But here is the thing. We can have order
CDABEFGH
In this example, we are swapping the first pair (teams 1 and 2) with the second pair (teams 3 and 4). CDABEFGH is indistinguishable from ABCDEFGH for this purpose, so here we can divide by 2^(n/4).
The same can happen over and over again. At the end, the total number of starting positions should be n!/(2^(n-1)). For n = 4 that gives 4!/2^3 = 3, matching the three setups listed in the question.
We can also think of it a bit differently. If we look at https://stackoverflow.com/posts/2269581/revisions, we can also think of it as a tree.
a b (runner up)
a e
a c e h
a b c d e f h g
Here there are 8! ways for us to arrange all the letters at the base, each determining one way for the bracket to work out. If we are only looking at the starting position, it doesn't matter who won. There were a total of 7 games (and each of the games could have turned out differently), so we divide by 2^7 to account for that overcounting.
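A brute-force sketch that checks the n!/2^(n-1) count by canonicalizing every permutation (recursively ordering the two halves of each sub-bracket); it is only practical for small n, but it also yields one representative per unique setup:

from itertools import permutations

def canonical(seeding):
    # Recursively canonicalize a bracket: order the two halves so that
    # symmetric seedings map to the same representative.
    if len(seeding) == 1:
        return seeding
    half = len(seeding) // 2
    left = canonical(seeding[:half])
    right = canonical(seeding[half:])
    return min(left, right) + max(left, right)

players = ('A', 'B', 'C', 'D')
unique = {canonical(p) for p in permutations(players)}
print(len(unique))        # 3, matching n!/2^(n-1) = 24/8
print(sorted(unique))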
I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the first k columns, 1 ≤ k ≤ n. Each subproblem can
be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
Ok, so for part a: There are 8 possible solutions.
For part b, I'm unsure, but this is where I'm headed:
Split into sub-problems, one for each column i in 1..n.
1. Define Cj[i] to be the optimal value by pebbling columns 0,...,i, such that column i has pattern type j.
2. Create 8 separate arrays of n elements for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.
You're on the right track. As you examine each new column, you will end up computing all possible best-scores up to that point.
Let's say you built your compatibility list (a 2D array) and called it Li[y], such that for each pattern i there are one or more compatible patterns Li[y].
Now, you examine column j. First, you compute that column's isolated score for each pattern i; call it Sj[i]. For each pattern i and compatible pattern x = Li[y], you need to maximize the total score Cj such that Cj[x] = Cj-1[i] + Sj[x]. This is a simple array test and update (if bigger).
In addition, you store the pebbling pattern that led to each score. When you update Cj[x] (i.e. you increase its score from its present value), remember the preceding pattern that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".
When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern from each column that led to this result.
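A short sketch of this DP in Python, assuming the board is given as 4 rows of n values; it returns only the optimal value, and the Pj backtracking described above is omitted to keep it brief:

from itertools import product

def legal_column_patterns():
    # All subsets of the 4 rows with no two vertically adjacent pebbles (8 patterns).
    return [p for p in product((0, 1), repeat=4)
            if not any(p[r] and p[r + 1] for r in range(3))]

def compatible(p, q):
    # Two patterns may occupy adjacent columns if no row has pebbles in both.
    return not any(a and b for a, b in zip(p, q))

def max_pebbling(board):
    # board[r][c] = value at row r, column c (4 rows, n columns).
    patterns = legal_column_patterns()
    n = len(board[0])
    score = lambda p, c: sum(board[r][c] for r in range(4) if p[r])
    best = {p: score(p, 0) for p in patterns}               # C_0[p]
    for c in range(1, n):
        best = {p: score(p, c) + max(best[q] for q in patterns if compatible(q, p))
                for p in patterns}                          # C_c[p]
    return max(best.values())

board = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9],
         [10, 11, 12]]
print(max_pebbling(board))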
I have a bunch of people's names that are tied to their respective identifying numbers (e.g. Social Security Number/National ID/Passport Number). Due to duplication though, one identity number can have up to 100 names which could be similar or totally different. E.g. ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M#rrrrryy Richard etc. Some are typos but some are totally different names.
Initially, I want to display only 3 (or a similarly small number) of the names that are as different as possible from the rest, so as to alert the viewer that the multiple names may not just be typos but could even be a case of identity theft, negligent data capture, or anything else!
I've read up on algorithms to detect similarity and am currently looking at this one, which lets you compute a score: a score of 1 means the two strings are the same, while a lower score means they are dissimilar. In my use case, how can I go through, say, the 100 names and display the 3 that are most dissimilar? The algorithm for that just escapes my mind; I feel like I need a starting point and then have to look and compare among all the others and loop again, etc.
Take the function from https://stackoverflow.com/a/14631287/1082673 as you mentioned and iterate over all combinations in your list. This will work if you don't have too many entries; otherwise the computation time can increase pretty fast…
Here is how to generate the pairs for a given list:
import itertools

persons = ['person1', 'person2', 'person3']

for p1, p2 in itertools.combinations(persons, 2):
    print("Compare", p1, "and", p2)