how to 'fuzzy' match strings when merging two dataframes in pandas - python

I have two dataframes, df1 and df2:
df1 = pd.DataFrame({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                    'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame({'Name': ['adam Smith', 'Annie Kim', 'John Weber', 'Ian Ford'],
                    'gender': ['M', 'F', 'M', 'M']})
I need to join these two dataframes with pandas.merge on the column Name. However, as you can see, there are some slight differences between the Name columns of the two dataframes. Let's assume they refer to the same people. If I simply do:
pd.merge(df1, df2, how='inner', on='Name')
I only get a dataframe back with one row, which is 'Ian Ford'.
Does anyone know how to merge these two dataframes? I guess this is a pretty common situation when joining two tables on a string column, but I have no idea how to handle it. Thanks a lot in advance.

I am using fuzzywuzzy here:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# for each name in df2, take the best match from df1.Name as the join key
df2['key'] = df2.Name.apply(lambda x: process.extract(x, df1.Name, limit=1)[0][0])
df2.merge(df1, left_on='key', right_on='Name')
Out[1238]:
       Name_x gender         key  Age      Name_y
0  adam Smith      M  Adam Smith   43  Adam Smith
1   Annie Kim      F    Anne Kim   21    Anne Kim
2  John Weber      M  John Weber   55  John Weber
3    Ian Ford      M    Ian Ford   24    Ian Ford

Not sure if fuzzy matching is what you are looking for; maybe just normalize every name to a proper name?
df1.Name = df1.Name.apply(lambda x: x.title())
df2.Name = df2.Name.apply(lambda x: x.title())
pd.merge(df1, df2, how='inner', on='Name')
Note that this still won't match 'Annie Kim' to 'Anne Kim'.
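If you'd rather not add a dependency, a similar fuzzy join key can be built with difflib from the standard library. This is a sketch of the same idea as the fuzzywuzzy answer; the 0.6 cutoff is an arbitrary assumption you may need to tune for your data:

```python
import pandas as pd
from difflib import get_close_matches

df1 = pd.DataFrame({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                    'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame({'Name': ['adam Smith', 'Annie Kim', 'John Weber', 'Ian Ford'],
                    'gender': ['M', 'F', 'M', 'M']})

def best_match(name, candidates, cutoff=0.6):
    # get_close_matches returns the closest candidates above the cutoff,
    # best first, or an empty list if nothing is close enough
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# build a fuzzy key in df2 pointing at the closest name in df1, then merge on it
df2['key'] = df2['Name'].apply(lambda x: best_match(x, df1['Name'].tolist()))
merged = df2.merge(df1, left_on='key', right_on='Name')
```

With this toy data every row finds a match, so the merge keeps all four rows.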

Related

Search for multiple encounters across rows in pandas

I'm trying to take a dataframe of patient data and create a new df that includes a patient's name and date if they had encounters with three services on the same date.
First, I have a dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['Bob', 'Charlie', 'Bob', 'Sam', 'Bob', 'Sam', 'Chris'],
                   'date': ['06-02-2023', '01-02-2023', '06-02-2023', '20-12-2022', '06-02-2023', '08-06-2015', '26-08-2020'],
                   'department': ['urology', 'urology', 'oncology', 'primary care', 'radiation', 'primary care', 'oncology']})
I tried a groupby on name and date with an agg function to create a list:
df_group = df.groupby(['name', 'date']).agg({'department': pd.Series.unique})
For Bob, this made department contain [urology, oncology, radiation].
Now, when I try to search for the departments in the list, to find only the rows that contain the departments in question, I get an error. For instance,
df_group.loc[df_group['department'].str.contains('primary care')]
results in KeyError: '[nan nan nan nan nan] not in index'
I assume there is a much easier way, but ultimately I just want a dataframe of people with the date on which they had encounters for urology, oncology, and radiation. For the df above, the result would be:
Name Date
Bob  06-02-2023
Easy solution
# define a set of departments to check for
s = {'urology', 'oncology', 'radiation'}
# groupby and aggregate to identify the combination
# of name and date that has all the required departments
out = df.groupby(['name', 'date'], as_index=False)['department'].agg(s.issubset)
Result
# out
name date department
0 Bob 06-02-2023 True
1 Charlie 01-02-2023 False
2 Chris 26-08-2020 False
3 Sam 08-06-2015 False
4 Sam 20-12-2022 False
# out[out['department'] == True]
name date department
0 Bob 06-02-2023 True
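To go from that boolean column to the requested name/date output, one extra filtering step is enough. A sketch using the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Charlie', 'Bob', 'Sam', 'Bob', 'Sam', 'Chris'],
                   'date': ['06-02-2023', '01-02-2023', '06-02-2023', '20-12-2022',
                            '06-02-2023', '08-06-2015', '26-08-2020'],
                   'department': ['urology', 'urology', 'oncology', 'primary care',
                                  'radiation', 'primary care', 'oncology']})

s = {'urology', 'oncology', 'radiation'}
# True where a (name, date) group contains every department in s
out = df.groupby(['name', 'date'], as_index=False)['department'].agg(s.issubset)
# keep only the matching groups and drop the helper column
final = out.loc[out['department'], ['name', 'date']].reset_index(drop=True)
```

Here `final` contains the single row Bob / 06-02-2023.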

Perform Fuzzy Matching in 2 pandas dataframe

I have two dataframes with different numbers of rows containing information about players. The first has all the names that I need:
df1 = pd.DataFrame({'Player': ["John Sepi", 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'],
                    'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})
The other dataframe is missing some players but has a column with ages. The players' names have small differences in some cases:
df2 = pd.DataFrame({'Player': ["John Sepi", 'Mark A. Daniel', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'],
                    'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
                    'Age': [22, 21, 26, 18, 19, 25]})
The equal names belong to different persons, which is why I need to match on both Player and Team at the same time. I want to create a new dataframe with all the names from the first dataframe and the respective ages from the second. For players missing from the second, fill the new dataframe with a constant value (like XX years; it can be any age, just to illustrate). The final dataframe:
print(final_df)
Player Team Age
0 John Sepi A 22
1 Zan Fred C XX
2 Mark Daniel E 21
3 Adam Pop C XX
4 Paul Sepi B XX
5 John Hernandez D 26
6 Price Josiah B 18
7 John Hernandez A 19
8 Adam Pop D 25
You can use the text matching capabilities of the fuzzywuzzy library mixed with pandas functions in Python.
First, import the following libraries:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Then match the players team by team:
#get the list of unique teams existing in df1
lst_teams = list(np.unique(np.array(df1['Team'])))
#define an arbitrary threshold
thres = 70
#for each team, match similar texts
for team in lst_teams:
    #iterate over the dataframe filtered by team
    for index, row in df1.loc[df1['Team']==team].iterrows():
        #get the list of players in this team
        lst_player_per_team = list(np.array(df2.loc[df2['Team']==team]['Player']))
        #use fuzzywuzzy to do the text matching
        output_ratio = process.extract(row['Player'], lst_player_per_team, scorer=fuzz.token_sort_ratio)
        #check if there are players from df2 in this team
        if output_ratio != []:
            #apply the arbitrary threshold to keep only the most similar text
            if output_ratio[0][1] > thres:
                df1.loc[index, 'Age'] = df2.loc[(df2['Team']==team) & (df2['Player']==output_ratio[0][0]), 'Age'].values[0]
df1 = df1.fillna('XX')
With this code and a threshold of 70, you get the following result:
print(df1)
Player Team Age
0 John Sepi A 22
1 Zan Fred C XX
2 Mark Daniel E 21
3 Adam Pop C XX
4 Paul Sepi B XX
5 John Hernandez D 26
6 Price Josiah B 18
7 John Hernandez A 19
8 Adam Pop D 25
You can adjust the threshold to tune the accuracy of the text matching between the two dataframes.
Please note that you should be careful with .iterrows(), as iterating over a dataframe row by row is generally discouraged.
You can check the fuzzywuzzy docs here: https://pypi.org/project/fuzzywuzzy/
Here is one way, using an exact merge:
df1 = df1.merge(df2, how='left', on=['Player', 'Team']).fillna('XX')
Note that this only handles exact matches, so 'Mark Daniel' would not pick up the age of 'Mark A. Daniel'.
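If installing fuzzywuzzy is not an option, the same team-restricted fuzzy lookup can be sketched with SequenceMatcher from the standard library's difflib. The 0.7 cutoff is an arbitrary assumption, like the threshold of 70 above:

```python
import pandas as pd
from difflib import SequenceMatcher

df1 = pd.DataFrame({'Player': ['John Sepi', 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi',
                               'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'],
                    'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})
df2 = pd.DataFrame({'Player': ['John Sepi', 'Mark A. Daniel', 'John Hernandez',
                               'Price Josiah', 'John Hernandez', 'Adam Pop'],
                    'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
                    'Age': [22, 21, 26, 18, 19, 25]})

def lookup_age(player, team, cutoff=0.7):
    # compare only against players registered for the same team
    pool = df2.loc[df2['Team'] == team]
    best_score, best_age = cutoff, 'XX'
    for cand_name, cand_age in zip(pool['Player'], pool['Age']):
        score = SequenceMatcher(None, player, cand_name).ratio()
        if score >= best_score:
            best_score, best_age = score, cand_age
    return best_age

df1['Age'] = [lookup_age(p, t) for p, t in zip(df1['Player'], df1['Team'])]
```

On this data it reproduces the desired final_df, including matching 'Mark Daniel' to 'Mark A. Daniel' and filling the three unmatched players with 'XX'.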

Pandas to lookup and return corresponding values from many dataframes

I have a list of names, and I want to retrieve each name's corresponding information from different dataframes to form a new dataframe.
I converted the list into a one-column dataframe, then tried to look up its corresponding values in the different dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David", "Mike", "Peter", "Lucy"],
          'Hobby': ['Music', 'Sports', 'Cooking', 'Reading'],
          'Member': ['Yes', 'Yes', 'Yes', 'No']}
data_s = {'Name': ["David", "Lancy", "Mike", "Lucy"],
          'Speed': [56, 42, 35, 66],
          'Location': ['East', 'East', 'West', 'West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter each dataframe down to the merge key and the columns you want to append, like:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
.merge(df_speed[['Name','Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use this solution with the same column filtering:
from functools import reduce
dfList = [df,
          df_hobby[['Name', 'Hobby']],
          df_speed[['Name', 'Location']]]
df = reduce(lambda left, right: pd.merge(left, right, on='Name'), dfList)
print (df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
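When there are many lookup frames, Series.map with a Name-indexed Series also works and avoids repeated merges. A sketch, assuming Name is unique within each lookup frame (map raises on a non-unique index):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['David', 'Mike', 'Lucy']})
df_hobby = pd.DataFrame({'Name': ['David', 'Mike', 'Peter', 'Lucy'],
                         'Hobby': ['Music', 'Sports', 'Cooking', 'Reading'],
                         'Member': ['Yes', 'Yes', 'Yes', 'No']})
df_speed = pd.DataFrame({'Name': ['David', 'Lancy', 'Mike', 'Lucy'],
                         'Speed': [56, 42, 35, 66],
                         'Location': ['East', 'East', 'West', 'West']})

# (lookup frame, column to pull) pairs -- extend this list for more frames
lookups = [(df_hobby, 'Hobby'), (df_speed, 'Location')]
for src, col in lookups:
    # a Name-indexed Series turns .map into a vectorized lookup table
    df[col] = df['Name'].map(src.set_index('Name')[col])
```

This yields the same three-row result as the chained merges above.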

pandas join rows/groupby with categorical data and lots of nan values

I'm trying to simplify a dataframe by joining rows based on 2 columns. Now, the rest is a bit messy with lots of nan values and such. I'll show an example:
initial:
Name Sex Shoes Bike Car
0 John Male Specialised
1 John Male Bridgestone
2 Lucy Female BMW
3 John Male Vans
4 Lucy Female Nike
target:
Name Sex Shoes Bike Car
0 John Male Vans Specialised, Bridgestone
1 Lucy Female Nike BMW
What function should I use? I couldn't figure out how to do it with groupby and the .agg(','.join) addition...
(The data above is just an example - the real data has many more rows, with many occurrences of the same name, and about 20 'category' columns... also note that each row has a string in only one of the 'categories' - shoes/bike/car etc.)
Thanks in advance!
Assuming the empty cells are NaN (not empty strings), the following will achieve the result:
(df.set_index(['Name','Sex'])
.groupby(level=[0,1])
.apply(lambda x:x.apply(lambda y: ', '.join(y.dropna())))
.reset_index())
A second approach:
(df.set_index(['Name','Sex'])
 .stack()
 .groupby(level=[0,1,2])
 .apply(', '.join)
 .unstack()
 .reset_index())
You can fillna with an empty string and then clean up the bad data at the end.
u = df.fillna('').groupby(['Name', 'Sex']).agg(', '.join)
u.stack().str.replace('(, ){2,}|^, |, $', '', regex=True).unstack()
Shoes Bike Car
Name Sex
John Male Vans Specialised, Bridgestone
Lucy Female Nike BMW
The order of the alternatives in the regular expression is very important.
You can do this using groupby, like below:
df = pd.DataFrame([['John', 'Male', 'na', 'Specialised', 'na'],
                   ['John', 'Male', 'na', 'Bridgestone', 'na'],
                   ['Lucy', 'Female', 'na', 'na', 'BMW'],
                   ['John', 'Male', 'Vans', 'na', 'na'],
                   ['Lucy', 'Female', 'Nike', 'na', 'na']],
                  columns=('Name', 'Sex', 'Shoes', 'Bike', 'Car'))
df = df.mask(df == "na", '')
df.groupby(["Name", "Sex"]).agg(lambda row: ",".join([val for val in row if val.strip() != ""]))
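For reference, the NaN-based approaches above can be exercised end to end like this. A sketch that rebuilds the example table with real NaNs and joins the surviving strings per group:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['John', 'John', 'Lucy', 'John', 'Lucy'],
    'Sex': ['Male', 'Male', 'Female', 'Male', 'Female'],
    'Shoes': [np.nan, np.nan, np.nan, 'Vans', 'Nike'],
    'Bike': ['Specialised', 'Bridgestone', np.nan, np.nan, np.nan],
    'Car': [np.nan, np.nan, 'BMW', np.nan, np.nan]})

# join the non-NaN strings in every remaining column, per (Name, Sex) group
out = (df.groupby(['Name', 'Sex'], as_index=False)
         .agg(lambda col: ', '.join(col.dropna())))
```

This collapses the five rows to two, matching the target table (empty cells become empty strings).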

