Fuzzy matching a sorted column with itself using Python

I have a dataset of 200k rows with two columns: (1) a unique customer id and address combination and (2) revenue. The table is sorted by revenue, and the goal is to clean up column 1 by fuzzy matching it against itself: if a customer-address combination is close enough to one with higher revenue, the higher-revenue value should replace it, since the differences most likely come from spelling variations.
Example:
In the above example the third row is very similar to the first row, so I want it to take the value of the first row.
I have a working python code but it is too slow:
import pandas as pd
import datetime
import time
import numpy as np
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray
data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

# Create lookup column from the dataframe itself:
lookup_data = data['UNIQUE_ID']
lookup_data = pd.Series.to_frame(lookup_data)

# Iterate row by row over the lookup data to find the first close fuzzy match
# and write it back into the dataframe:
start = time.time()
for index, row in data.iterrows():
    if index % 5000 == 0:
        print(index, time.time() - start)
    for index2, row2 in lookup_data.iterrows():
        ratio_val = normalized_damerau_levenshtein_distance(row['UNIQUE_ID'], row2['UNIQUE_ID'])
        if ratio_val < 0.15:
            data.set_value(index, 'UPDATED_ID', row2['UNIQUE_ID'])
            data.set_value(index, 'Ratio_Val', ratio_val)
            break
Currently this fuzzy matching block of code is taking far too long to run: about 8 hours for the first 15k rows, with total time growing quadratically with the number of rows, as you would expect from the nested loop. Any suggestions on how to write this more efficiently?

One immediate suggestion: since matching is symmetric, you only need to match each row against the rows that have not been visited yet. Rewrite the inner loop to skip over the previously visited rows, e.g. add this at the top of the inner loop:
        if index2 <= index:
            continue
This alone will speed up the matching by a factor of 2.
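For concreteness, here is a minimal sketch of the question's loop with that skip applied, reusing the question's variables and set_value calls. Restricting the inner iterator with .iloc[index + 1:] avoids even stepping through the already-visited rows, assuming the default integer index produced by read_csv:

start = time.time()
for index, row in data.iterrows():
    if index % 5000 == 0:
        print(index, time.time() - start)
    # Only compare against rows that come after the current one; earlier
    # pairs were already handled when the outer loop visited them.
    for index2, row2 in lookup_data.iloc[index + 1:].iterrows():
        ratio_val = normalized_damerau_levenshtein_distance(row['UNIQUE_ID'], row2['UNIQUE_ID'])
        if ratio_val < 0.15:
            data.set_value(index, 'UPDATED_ID', row2['UNIQUE_ID'])
            data.set_value(index, 'Ratio_Val', ratio_val)
            break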

I had the same issue and resolved it with a combination of the levenshtein package (to create a distance matrix) and scikit-learn's DBSCAN to cluster similar strings and assign the same value to every element within each cluster.
You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())
>>> from e_utils.utils import clean_df
>>> from e_utils.utils import homog_lev
>>> series
0 Bad Bunny
1 bad buny
2 bag bunny
3 Ozuna
4 De La Ghetto
5 de la geto
6 Daddy Yankee
7 dade yankee
8 Nicky Jam
9 nicky jam
10 J Balvin
11 jbalvin
12 Maluma
13 maluma
14 Anuel AA
>>> series2 = clean_df(series)
>>> series2 = homog_lev(series2, eps=3)
>>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])
*Original* *Fixed*
0 Bad Bunny Bad Bunny
1 bad buny Bad Bunny
2 bag bunny Bad Bunny
3 Ozuna Ozuna
4 De La Ghetto De La Ghetto
5 de la geto De La Ghetto
6 Daddy Yankee Daddy Yankee
7 dade yankee Daddy Yankee
8 Nicky Jam Nicky Jam
9 nicky jam Nicky Jam
10 J Balvin J Balvin
11 jbalvin J Balvin
12 Maluma Maluma
13 maluma Maluma
14 Anuel AA Anuel Aa
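If you prefer not to pull in that repo, here is a minimal sketch of the same idea, assuming the python-Levenshtein and scikit-learn packages: build a pairwise Levenshtein distance matrix, cluster it with DBSCAN using a precomputed metric, and map every member of a cluster to one representative spelling. The function name homogenize and the choice of the first occurrence as the canonical value are illustrative assumptions, not part of the original answer.

import numpy as np
import pandas as pd
import Levenshtein
from sklearn.cluster import DBSCAN

def homogenize(series, eps=3):
    # Normalize lightly before measuring distances.
    values = series.str.lower().str.strip().to_numpy()
    # Pairwise Levenshtein distance matrix (O(n^2), fine for small series).
    dist = np.array([[Levenshtein.distance(a, b) for b in values] for a in values])
    # min_samples=1 makes every string a core point, so nothing is labelled noise.
    labels = DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(dist)
    # For each cluster, take the first occurrence as the canonical spelling.
    canon = {}
    out = []
    for label, original in zip(labels, series):
        canon.setdefault(label, original)
        out.append(canon[label])
    return pd.Series(out, index=series.index)

series2 = homogenize(series, eps=3)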

Related

How to add rows to a DataFrame within a for loop?

I want to add a row to an existing data frame where I don't have a matching regex value.
For example,
import pandas as pd
import numpy as np
import re
lst = ['Sarah Kim', 'Added on January 21']
df = pd.DataFrame(lst)
df.columns = ['Info']

name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"

for index, row in df.iterrows():
    if re.findall(name_pat, str(row['Info'])):
        print("Name matched")
    elif re.findall(title_pat, str(row['Info'])):
        print("Title matched")
        if not re.findall(title_pat, str(row['Info'])):
            pass  # Add a row here in the dataframe
    elif re.findall(date_pat, str(row['Info'])):
        print("Date matched")
        if not re.findall(date_pat, str(row['Info'])):
            pass  # Add a row here in the dataframe
So here in my dataframe df I do not have a title, just a name and a date. While looping over df, I want to insert an empty row where the title should be.
The output is:
Info
0 Sarah Kim
1 Added on January 21
My expected output is:
Info
0 Sarah Kim
1 None
2 Added on January 21
Is there any way that I can insert such an empty row, or is there a better way?
Edit: The dataset I'm working with is just one column with many rows. The rows have a repeating structure of "name, title, date". For example,
Info
0 Sarah Kim
1 Added on January 21
2 Jesus A. Moore
3 Marketer
4 Added on May 30
5 Bobbie J. Garcia
6 CEO
7 Anita Jobe
8 Designer
9 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
I have sliced the data frame so that I can extract one section at a time, which looks like this:
Info
0 Sarah Kim
1 Added on January 21
And I'm trying to run a loop over each section; if a date or title is missing, I will fill it in with an empty row, so that in the end I will have:
Info
0 Sarah Kim
1 **NULL**
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 **NULL**
9 Anita Jobe
10 Designer
11 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
I see you have a long dataframe with information and each set of information is different. I think your goal is to end up with a dataframe that has three columns: Name, Title and Date.
Here is how I would approach this problem, with some code samples. I would take advantage of the df.shift method to tie related rows together and use your existing dataframe to create a new one.
I am also making some assumptions based on what you have listed above. First, I will assume that only the Title and Date fields can be missing. Second, I will assume that the order is Name, Title and Date, as you have mentioned above.
# first step: create test data
import pandas as pd
import re

test_list = ['Sarah Kim', 'Added on January 21', 'Jesus A. Moore', 'Marketer',
             'Added on May 30', 'Bobbie J. Garcia', 'CEO', 'Anita Jobe',
             'Designer', 'Added on January 3']
test_df = pd.DataFrame(test_list, columns=['Info'])

# second step: use your regex to classify each Info value
name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"
test_df['Col'] = test_df['Info'].apply(
    lambda x: 'Name' if re.findall(name_pat, x)
    else ('Date' if re.findall(date_pat, x) else 'Title'))

# third step: get the next values from our dataframe using df.shift
test_df['Next_col'] = test_df['Col'].shift(-1)
test_df['Next_col2'] = test_df['Col'].shift(-2)
test_df['Next_val1'] = test_df['Info'].shift(-1)
test_df['Next_val2'] = test_df['Info'].shift(-2)

# now filter to only the names and apply a function to pull out name, title and date
new_df = test_df[test_df['Col'] == 'Name']

def apply_func(row):
    name = row['Info']
    title = None
    date = None
    if row['Next_col'] == 'Title':
        title = row['Next_val1']
    elif row['Next_col'] == 'Date':
        date = row['Next_val1']
    if row['Next_col2'] == 'Date':
        date = row['Next_val2']
    row['Name'] = name
    row['Title'] = title
    row['date'] = date
    return row

final_df = new_df.apply(apply_func, axis=1)[['Name', 'Title', 'date']].reset_index(drop=True)
print(final_df)
Name Title date
0 Sarah Kim None Added on January 21
1 Jesus A. Moore Marketer Added on May 30
2 Bobbie J. Garcia CEO None
3 Anita Jobe Designer Added on January 3
There is probably a way to do this in fewer lines of code, and I welcome anyone who can make it more efficient, but I believe this should work. Also, if you wanted to flatten this back into a single column:
flattened_df = pd.DataFrame(final_df.values.flatten(), columns=['Info'])
print(flattened_df)
Info
0 Sarah Kim
1 None
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 None
9 Anita Jobe
10 Designer
11 Added on January 3

How to deal with a copy-pasted table in pandas - reshaping a column vector

I have a table I copied from a webpage which, when pasted into LibreOffice Calc or Excel, occupies a single cell, and when pasted into Notepad becomes a 3507x1 column. If I import this as a pandas dataframe using pd.read_csv I see the same 3507x1 column, and I'd now like to reshape it into the 501x7 array that it started as.
I thought I could recast it as a numpy array, reshape it as I am familiar with in numpy, and then put it back into a df, but the to-numpy methods of pandas seem to want to work with a Series object (not a DataFrame), and attempts to read the file into a Series using e.g.
ser = pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Maybe I should throw in the towel on this direction and read from the html?
A simple copy-paste does not give you any clear column separator, so there is no easy way to do it.
You only have spaces, but spaces may also appear inside the column values (like the name or country), so it is impossible to give pandas.read_csv a column separator.
However, if I copy-paste the table into a file, I notice some regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts capture groups in a regex pattern as columns of a DataFrame. The regex is applied to each element / string of the series.
You can then try to find a regex pattern to capture the various elements of the row to split them into separate columns.
df = pd.read_csv('data.txt', names=["A"])  # no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good.
0 1 2 3 4 5 6
0 1 Jeff Bezos $121B +$231M -$3.94B United States Technology
1 3 Bernard Arnault $104B +$127M +$35.7B France Consumer
2 4 Warren Buffett $84.9B +$66.3M +$1.11B United States Diversified
3 5 Mark Zuckerberg $76.7B -$301M +$24.6B United States Technology
4 6 Amancio Ortega $66.5B +$303M +$7.85B Spain Retail
5 7 Larry Ellison $62.3B +$358M +$13.0B United States Technology
6 8 Carlos Slim $57.0B -$331M +$2.20B Mexico Diversified
7 9 Francoise Bettencourt Meyers $56.7B -$1.12B +$10.5B France Consumer
8 0 Larry Page $55.7B +$393M +$4.47B United States Technology
I used pandas.read_csv to read the file, since Series.from_csv is deprecated.
I found that converting to a numpy array was far easier than I had realized - the numpy asarray method can handle a df (and conveniently enough it works for general objects, not just numbers)
import numpy as np

df = pd.read_csv('billionaires', sep='\n')
print(df.shape)
# -> (3507, 1)
n = np.asarray(df)
m = np.reshape(n, [-1, 7])
df2 = pd.DataFrame(m)
df2.head()
0 1 2 3 4 \
0 0 Name Total net worth $ Last change $ YTD change
1 1 Jeff Bezos $121B +$231M -$3.94B
2 2 Bill Gates $107B -$421M +$16.7B
3 3 Bernard Arnault $104B +$127M +$35.7B
4 4 Warren Buffett $84.9B +$66.3M +$1.11B
5 6
0 Country Industry
1 United States Technology
2 United States Technology
3 France Consumer
4 United States Diversified

Pandas - row to new column by value merging by index

I am trying to manipulate some data in pandas so that it's compatible with an existing piece of software; the operation to perform would be similar to this:
original dataframe:
some_data language spelling
1 12 french un
1 12 english one
1 12 spanish uno
2 52 french deux
2 52 english two
2 52 spanish dos
target dataframe:
some_data lang_en lang_fr lang_sp
1 12 one un uno
2 52 two deux dos
So it will merge the indexes and pivot some of the rows into columns, while keeping any supplementary column data.
All the columns that are not to be 'split' (some_data, in this example) contain duplicate data across a single index; many such columns exist in the real data.
I would definitely be able to do it by looping on the dataframe, but am trying to figure out if it's possible to do this entirely with pandas.
You can use:
df.set_index(['some_data','language'])['spelling']\
  .unstack()\
  .rename(columns=lambda x: 'lang_' + x[:2])\
  .rename_axis([None], axis=1)\
  .reset_index()
Output:
some_data lang_en lang_fr lang_sp
0 12 one un uno
1 52 two deux dos
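If the real data has several of these duplicated-per-index columns, one variation is to put all of them into the index before unstacking. This is just a sketch; col_a and col_b are hypothetical extra columns standing in for whatever supplementary columns exist:

# Sketch: 'col_a' and 'col_b' are hypothetical extra columns that are
# constant within each some_data group.
out = (df.set_index(['some_data', 'col_a', 'col_b', 'language'])['spelling']
         .unstack()
         .rename(columns=lambda x: 'lang_' + x[:2])
         .rename_axis(None, axis=1)
         .reset_index())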

Searching one Python dataframe / dictionary for fuzzy matches in another dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):
df1:
PRODUCT_ID PRODUCT_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce"
1 185965653252 "Chicken Salad with Dressing"
2 165958565556 "Pork and Honey Rissoles"
3 655262522233 "Cheese, Ham and Tomato Sandwich"
4 857485966653 "Coleslaw with Yoghurt Dressing"
5 524156285551 "Lemon and Raspberry Cheesecake"
I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:
df2 (also saved as dict_2)
PROD_ID PROD_DESCRIPTION
0 548576 "Fish Burger"
1 156956 "Chckn Salad w/Ranch Dressing"
2 257848 "Rissoles - Lamb & Rosemary"
3 298770 "Lemn C-cake"
4 651452 "Potato Salad with Bacon"
5 100256 "Cheese Cake - Lemon Raspberry Coulis"
What I am wanting to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy lifting part. I would then need to check the matches manually, but it would be a lot quicker. The ideal outcome would look like this, e.g. with one or more partial matches noted:
PRODUCT_ID PRODUCT_DESCRIPTION PROD_ID PROD_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce" 548576 "Fish Burger"
1 185965653252 "Chicken Salad with Dressing" 156956 "Chckn Salad w/Ranch Dressing"
2 165958565556 "Pork and Honey Rissoles" 257848 "Rissoles - Lamb & Rosemary"
3 655262522233 "Cheese, Ham and Tomato Sandwich" NaN NaN
4 857485966653 "Coleslaw with Yoghurt Dressing" NaN NaN
5 524156285551 "Lemon and Raspberry Cheesecake" 298770 "Lemn C-cake"
6 524156285551 "Lemon and Raspberry Cheesecake" 100256 "Cheese Cake - Lemon Raspberry Coulis"
I have already completed a join which has identified the exact matches. It's not important that the index is retained as the Product ID's in each df are unique. The results can also be saved into a new dataframe as this will then be applied to a third dataframe that has around 14 million rows.
I've used the following questions and answers (amongst others):
Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array
and also various loops/functions/mapping etc. but have had no success, either getting the first "fuzzy match" which has a low score or no matches being detected.
I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.
I'm using Python 2.7, pandas and have fuzzywuzzy installed.
Using fuzz.ratio as my distance metric, calculate my distance matrix like this:
df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(i, j, fuzz.ratio(vi, vj))

print(df3)
0 1 2 3 4 5
0 63 15 24 23 34 27
1 26 84 19 21 52 32
2 18 31 33 12 35 34
3 10 31 35 10 41 42
4 29 52 32 10 42 12
5 15 28 21 49 8 55
Set a threshold for acceptable distance; I set 50. Then find the index value (in df2) that has the maximum value for every row.
threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)
Make assignments
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df
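If you also want the matching-score column the question mentions, a small extension of the above could be the following sketch; the column name Match_Score is just an example, and the numeric coercion mirrors the note further down about object dtype:

# Best similarity per row, NaN where it falls below the threshold.
scores = df3.apply(pd.to_numeric).max(1)
df['Match_Score'] = np.where(threshold, scores, np.nan)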
You should be able to iterate over both dataframes and populate either a dict or a 3rd dataframe with your desired information:
d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...
df3 = pd.DataFrame.from_dict(d)
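The elided loop body might look something like the following. This is a sketch only; it assumes fuzzywuzzy's fuzz.ratio as the similarity score and an arbitrary cut-off of 50, neither of which is prescribed by the answer above:

from fuzzywuzzy import fuzz

for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        score = fuzz.ratio(df1_row['PRODUCT_DESCRIPTION'], df2_row['PROD_DESCRIPTION'])
        if score > 50:  # assumed cut-off; tune to taste
            d['df1_id'].append(df1_row['PRODUCT_ID'])
            d['df1_prod_desc'].append(df1_row['PRODUCT_DESCRIPTION'])
            d['df2_id'].append(df2_row['PROD_ID'])
            d['df2_prod_desc'].append(df2_row['PROD_DESCRIPTION'])
            d['fuzzywuzzy_sim'].append(score)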
I don't have enough reputation to comment on the answer from @piRSquared, hence this answer.
The definitions of 'vi' and 'vj' failed with an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). It worked when I inserted an underscore, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION')
A similar issue occurred with 'set_value', and the same fix worked there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj))
Generating idxmax raised another error (TypeError: reduction operation 'argmax' not allowed for this dtype) because the contents of df3 (the fuzzy ratios) were of type 'object'. I converted them all to numeric just before defining the threshold and it worked, e.g. df3 = df3.apply(pd.to_numeric)
A million thanks to @piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.
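Since _get_value and _set_value are private pandas methods, here is a sketch of the same distance-matrix build using the public .at accessor, which should behave equivalently for scalar reads and writes; declaring the frame as float also sidesteps the object-dtype problem with idxmax:

import pandas as pd
from fuzzywuzzy import fuzz

# Float dtype up front, so idxmax works without a later pd.to_numeric pass.
df3 = pd.DataFrame(index=df.index, columns=df2.index, dtype=float)

for i in df3.index:
    for j in df3.columns:
        vi = df.at[i, 'PRODUCT_DESCRIPTION']
        vj = df2.at[j, 'PROD_DESCRIPTION']
        df3.at[i, j] = fuzz.ratio(vi, vj)

threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)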

Iterate over groups of rows in Pandas

I am new to pandas and python in general - grateful for any direction you can provide!
I have a csv file with 4 columns. I am trying to group together rows where the first three columns are the same on all rows (Column A Row 1 = Column A Row 2, Column B Row 1 = Column B Row 2, and so on).
My data look like this:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
3 9991112222 NJ 2015-05-14 Apartment rental
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
6 4443337777 NJ 2015-05-14 condo
So in this data, rows 1, 2 and 3 would be in one group, and rows 4 and 5 would be in another group. Row 6 would not be in the group with 1, 2, and 3 because it has a different phone_number.
Then, for each row, I want to compare the string in the description column against each other description in that group using Levenshtein distance, and keep the rows where the descriptions are sufficiently similar.
"Condo" from row 1 would be compared to "Condo sales call" from row 2 and to "Apartment rental" in row 3. It would not be compared to "condo" from row 6.
In the end, the goal is to weed out rows where the description is not sufficiently similar to another description in the same group. Phrased differently, to print out all rows where description is at least somewhat similar to another (any other) description in that group. Ideal output:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
Row 6 does not print because it was never in a group. Row 3 doesn't print because "Apartment rental" is insufficiently similar to "Condo" or "Condo sales call"
This is the code I have so far. I can't tell if this is the best way to do it. And if I have done it right so far, I can't figure out how to print the full row of interest:
import Levenshtein
import itertools
import pandas as pd
test_data = pd.DataFrame.from_csv('phone_state_etc_test.csv', index_col=None)
for pn in test_data['phone_number']:
    for dt in test_data['date']:
        for st in test_data['state']:
            for a, b in itertools.combinations(
                    test_data[
                        (test_data['phone_number'] == pn) &
                        (test_data['state'] == st) &
                        (test_data['date'] == dt)
                    ]['description'], 2):
                if Levenshtein.ratio(a, b) > 0.35:
                    print pn, "|", dt, "|", st, "|"  # description
This prints a bunch of duplicates of these lines:
9991112222 | NJ | 2015-05-14 |
6668885555 | CA | 2015-05-06 |
But if I add description to the end of the print line, I get a
SyntaxError: invalid syntax
Any thoughts on how I can print the full row? Whether in pandas dataframe, or some other format, doesn't matter - I just need to output to csv.
Why don't you use the pandas groupby to find the unique groups (based on phone_number, state and date)? Doing this lets you treat all the description values separately and do whatever you want with them.
For example, I'll group by the columns above and get the unique values of the description column within each group:
In [49]: df.groupby(['phone_number','state','date']).apply(lambda v: v['description'].unique())
Out[49]:
phone_number state date
4443337777 NJ 2015-05-14 [condo]
6668885555 CA 2015-05-06 [Apartment, Apartment-rental]
9991112222 NJ 2015-05-14 [Condo, Condo-sales-call, Apartment-rental]
dtype: object
You can use any function within the apply. More examples here - http://pandas.pydata.org/pandas-docs/stable/groupby.html
I'm not entirely sure how best to do a calculation for all pairs of values in pandas - here I've made a matrix with the descriptions as both the rows and columns (so the main diagonal of the matrix compares each description with itself), but it doesn't seem entirely idiomatic:
def find_similar_rows(group, threshold=0.35):
    sim_matrix = pd.DataFrame(index=group['description'],
                              columns=group['description'])
    for d1 in sim_matrix.index:
        for d2 in sim_matrix.columns:
            # Leave diagonal entries as nan
            if d1 != d2:
                sim_matrix.loc[d1, d2] = Levenshtein.ratio(d1, d2)
    keep = sim_matrix.gt(threshold, axis='columns').any()
    # A bit of possibly unnecessary mucking around with the index
    # here, could probably be cleaned up
    rows_to_keep = group.loc[keep[group['description']].tolist(), :]
    return rows_to_keep

grouped = test_data.groupby('phone_number', group_keys=False)
grouped.apply(find_similar_rows)
Out[64]:
phone_number state date description
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
It seems from the data provided that you want to keep rows for which the first word in the description matches the most common first word for that group.
If that's the case, you can do this:
test_data['description_root'] = test_data['description'].str.split().str[0]
# this adds a column with the first word from the description column

grouped = test_data.groupby(['phone_number', 'state', 'date'])

most_frequent_root = grouped.description_root.transform(
    lambda s: s.value_counts().idxmax())
# this is a series with the same index as the original df containing
# the most frequently occurring root for each group

test_data[test_data.description_root == most_frequent_root]
# this will give you the matching rows
You could also call .describe on grouped to give some additional information for each group. Sorry if this is off topic, but I think you might well find the Series string methods (.str) and groupby useful.
