I have two dataframes with different numbers of rows containing information about players. The first has all the names that I need.
df1 = pd.DataFrame({'Player': ["John Sepi", 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'],
'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})
The other dataframe is missing some players but has an Age column. The players' names have small differences in some cases.
df2 = pd.DataFrame({'Player': ["John Sepi", 'Mark A. Daniel', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'],
                    'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
                    'Age': [22, 21, 26, 18, 19, 25]})
Identical names can belong to different people, so I need to match on Player and Team at the same time. I want to create a new dataframe with all the names from the first dataframe and the respective age from the second. For players missing from the second dataframe, fill the new dataframe with a constant value (like XX years; it can be any age, just to illustrate). The final dataframe:
print(final_df)
Player Team Age
0 John Sepi A 22
1 Zan Fred C XX
2 Mark Daniel E 21
3 Adam Pop C XX
4 Paul Sepi B XX
5 John Hernandez D 26
6 Price Josiah B 18
7 John Hernandez A 19
8 Adam Pop D 25
You can use the text matching capabilities of the fuzzywuzzy library combined with pandas functions in python.
First, import the following libraries:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Then match the player names team by team:
# get the list of unique teams in df1
lst_teams = list(np.unique(np.array(df1['Team'])))
# define an arbitrary threshold
thres = 70
# for each team, match similar player names
for team in lst_teams:
    # iterate over df1 filtered by team
    for index, row in df1.loc[df1['Team'] == team].iterrows():
        # get the list of df2 players in this team
        lst_player_per_team = list(np.array(df2.loc[df2['Team'] == team]['Player']))
        # use fuzzywuzzy to do the text matching
        output_ratio = process.extract(row['Player'], lst_player_per_team, scorer=fuzz.token_sort_ratio)
        # check whether df2 has any players in this team
        if output_ratio != []:
            # apply the arbitrary threshold to keep only the most similar text
            if output_ratio[0][1] > thres:
                df1.loc[index, 'Age'] = df2.loc[(df2['Team'] == team) & (df2['Player'] == output_ratio[0][0])]['Age'].values[0]
df1 = df1.fillna('XX')
With this code and a threshold of 70, you get the following result:
print(df1)
Player Team Age
0 John Sepi A 22
1 Zan Fred C XX
2 Mark Daniel E 21
3 Adam Pop C XX
4 Paul Sepi B XX
5 John Hernandez D 26
6 Price Josiah B 18
7 John Hernandez A 19
8 Adam Pop D 25
You can adjust the threshold to tune the accuracy of the text matching between the two dataframes.
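For intuition on where to set the threshold, you can print a few raw scores with the same scorer used above (a quick sketch; the exact numbers depend on the fuzzywuzzy version):
from fuzzywuzzy import fuzz
# a near-match from the example data scores well above 70
print(fuzz.token_sort_ratio('Mark Daniel', 'Mark A. Daniel'))
# two unrelated names score far below 70
print(fuzz.token_sort_ratio('Zan Fred', 'Adam Pop'))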
Please note that you should be careful with .iterrows(), as iterating over a dataframe row by row is generally discouraged for performance reasons.
You can check the fuzzywuzzy doc here https://pypi.org/project/fuzzywuzzy/
Here is one way, using a plain left merge. Note this matches exact names only, so a variant like 'Mark A. Daniel' will not be found:
df1 = df1.merge(df2, how='left', on=['Player', 'Team']).fillna('XX')
I have a large CSV file of sports data, and I need to transform it so that teams with the same game_id are on the same row, creating new columns based on the homeAway column and the existing columns. Is there a way to do this with Pandas?
Existing format:
game_id school conference homeAway points
332410041 Connecticut American Athletic home 18
332410041 Towson CAA away 33
Desired format:
game_id home_school home_conference home_points away_school away_conference away_points
332410041 Connecticut American Athletic 18 Towson CAA 33
One way to solve this is to load the table into a Pandas dataframe, filter it by 'homeAway' to create separate 'home' and 'away' dataframes, relabel the columns of each with the appropriate prefix while preserving the key column, and then merge the two to produce the desired output.
import pandas as pd

data = {'game_id': [332410041, 332410041],
        'school': ['Connecticut', 'Towson'],
        'conference': ['American Athletic', 'CAA'],
        'homeAway': ['home', 'away'],
        'points': [18, 33]}
df = pd.DataFrame(data)

home = df[df['homeAway'] == 'home'].drop(columns='homeAway')
home.columns = ['game_id', 'home_school', 'home_conference', 'home_points']
away = df[df['homeAway'] == 'away'].drop(columns='homeAway')
away.columns = ['game_id', 'away_school', 'away_conference', 'away_points']
home.merge(away)
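If you prefer not to split the frame by hand, a pivot-based sketch achieves the same reshaping (the column order may differ from the desired output):
wide = df.pivot(index='game_id', columns='homeAway', values=['school', 'conference', 'points'])
# flatten the (value, homeAway) column MultiIndex into home_/away_ prefixes
wide.columns = [f'{loc}_{col}' for col, loc in wide.columns]
wide = wide.reset_index()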
Create two dataframes, selected by the unique values in the 'homeAway' column ('home' and 'away'), using Boolean indexing.
Drop the now-obsolete 'homeAway' column.
Rename the appropriate columns with a 'home_' or 'away_' prefix.
This can be done in a for-loop, with each dataframe added to a list, which can be consolidated into a simple list comprehension.
Use pd.merge to combine the two dataframes on the common 'game_id' column.
See Merge, join, concatenate and compare and Pandas Merging 101 for additional details.
import pandas as pd

# test dataframe
data = {'game_id': [332410041, 332410041, 662410041, 662410041, 772410041, 772410041],
        'school': ['Connecticut', 'Towson', 'NY', 'CA', 'FL', 'AL'],
        'conference': ['American Athletic', 'CAA', 'a', 'b', 'c', 'd'],
        'homeAway': ['home', 'away', 'home', 'away', 'home', 'away'],
        'points': [18, 33, 1, 2, 3, 4]}
df = pd.DataFrame(data)

# create a list of dataframes, one per homeAway value, with renamed columns
dfl = [(df[df.homeAway.eq(loc)]
        .drop('homeAway', axis=1)
        .rename({'school': f'{loc}_school',
                 'conference': f'{loc}_conference',
                 'points': f'{loc}_points'}, axis=1))
       for loc in df.homeAway.unique()]
# combine the dataframes
df_new = pd.merge(dfl[0], dfl[1])
# display(df_new)
game_id home_school home_conference home_points away_school away_conference away_points
0 332410041 Connecticut American Athletic 18 Towson CAA 33
1 662410041 NY a 1 CA b 2
2 772410041 FL c 3 AL d 4
How do I turn the headers inside the rows into columns?
For example, I have the dataframe below (see the EDIT for code that reproduces it) and would like to reshape it into the format shown after it.
EDIT:
Code to produce current df example
import pandas as pd
df = pd.DataFrame({'Date':[2020,2021,2022], 'James':'', ' Sales': [3,4,5], ' City':'NY', ' DIV':'a', 'KIM':'', ' Sales ': [3,4,5], ' City ':'SF', ' DIV ':'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a name (e.g. James) has no indent, it is turned into a column, until another un-indented value (KIM) comes up. So I need a way to categorize the headers that are not indented into a new column, stopping whenever a new header appears (KIM).
EDIT 2: There are not only two names (KIM or JAMES); there are around 20, and not only the three second levels (Sales, City, DIV). Different names have more than 3 second levels; some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
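A minimal sketch of that detection idea, using the reproducible df from the EDIT above (this assumes a name row is exactly one whose value cells are all empty):
# rows whose data cells are all empty are the un-indented name rows
is_name = df[[0, 1, 2]].eq('').all(axis=1)
df['Name'] = df['index'].where(is_name).ffill()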
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})

def rows_to_columns(group):
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group

# rows with an empty value column hold the person's name; forward-fill it
df.loc[df['2020'] == '', 'person'] = df.date
df.person = df.person.fillna(method='ffill')

new_df = (df
          .groupby('person')
          .apply(lambda x: rows_to_columns(x))
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to
Copy the name into a separate column and fill that column using .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name. Otherwise it wreaks havoc.
All other values, such as 'div' and 'city', will be converted by rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies the value from the row into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is the intended format once the column 'Sales' is dropped.
Note: This solution probably does not work well on larger datasets.
You gave more details, and I see you are not working with multi-level indexes. In that case, the best way would be to create the DataFrame already in the format you need. The way you are creating the first DataFrame is not well structured: the information is not indexed by name (James/KIM), as the names are columns with empty values, with no link to the other values. The stacking you did relies on blank spaces in the strings. Take a look at multi-indexing and generate a dataframe you can work with, or create the dataframe in the format you need in the end.
-- Answer considering multi-level indexes --
Using the little information provided, I see your dataframe is stacked, meaning you have multiple index levels. The first level is the person (James/KIM) and the second level is Sales/City/DIV. So your dataframe should be created like this:
import pandas
multi_index = pandas.MultiIndex.from_tuples([
    ('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
    ('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = {'2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi-level DataFrame, there are many ways to transform it. This is what we will do to make it one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changes across years, so we can reduce the City and DIV dataframes to Series, taking the first year as reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
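As a side note, pandas can also flatten the second index level in a single step with unstack; a minimal sketch on the same multi-level df:
# move the second index level (Sales/City/DIV) into the columns
wide = df.unstack(level=1)
# wide now has a (year, account) column MultiIndex
print(wide[('2020', 'Sales')])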
Let's say I have a dataframe as follows:
d = {'name': ['spain', 'greece','belgium','germany','italy'], 'davalue': [3, 4, 6, 9, 3]}
df = pd.DataFrame(data=d)
index name davalue
0 spain 3
1 greece 4
2 belgium 6
3 germany 9
4 italy 3
I would like to aggregate and sum based on a list of strings in the name column. So for example, I may have: southern=['spain', 'greece', 'italy'] and northern=['belgium','germany'].
My goal is to aggregate by using sum, and obtain:
index name davalue
0 southern 10
1 northern 15
where 10=3+4+3 and 15=6+9
I imagined something like:
df.groupby(by=[['spain','greece','italy'],['belgium','germany']])
could exist. The docs say
A label or list of labels may be passed to group by the columns in self
but I'm not sure I understand what that means in terms of syntax.
I would build a dictionary and map:
d = {v:'southern' for v in southern}
d.update({v:'northern' for v in northern})
df['davalue'].groupby(df['name'].map(d)).sum()
Output:
name
northern 15
southern 10
Name: davalue, dtype: int64
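Note that names missing from the dictionary map to NaN and are silently dropped by groupby; if you want a catch-all bucket instead, fill the mapped values first:
df['davalue'].groupby(df['name'].map(d).fillna('others')).sum()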
One way could be using np.select and using the result as a grouper:
import numpy as np
southern=['spain', 'greece', 'italy']
northern=['belgium','germany']
g = np.select([df.name.isin(southern),
               df.name.isin(northern)],
              ['southern', 'northern'],
              'others')
df.groupby(g).sum()
davalue
northern 15
southern 10
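On recent pandas versions, summing a frame that still contains the string name column can raise an error, so it is safer to select the numeric column explicitly:
df.groupby(g)['davalue'].sum()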
df["regional_group"]=df.apply(lambda x: "north" if x["home_team_name"] in ['belgium','germany'] else "south",axis=1)
You create a new column by which you later groubpy.
df.groupby("regional_group")["davavalue"].sum()
I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another one. My plan is to sort the second data frame and then merge it to the first one without any key column, so that it looks like the output above.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from #ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
Country comments
0 USA X
1 UK Y
2 Finland Z
3 Spain NaN
4 Australia NaN
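A slightly more idiomatic variant of the same idea uses dropna instead of filter and keeps the original column name:
res = df1.join(df2.dropna().reset_index(drop=True))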
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order in tact, then this will get the job done.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
                    'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
df1['Comments'] = df2[df2.Comments.notnull()].reset_index().drop(columns='index')
Country Name Comments
0 USA Sam X
1 UK Chris Y
2 Finland Jeff Z
3 Spain Kartik NaN
4 Australia Mavenn NaN
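The reset_index().drop(columns='index') pair can be shortened by selecting the column and dropping the old index in one step; a sketch of the same assignment:
df1['Comments'] = df2.loc[df2.Comments.notnull(), 'Comments'].reset_index(drop=True)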
IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Maven
I have two dataframes, df1 and df2.
df1 = pd.DataFrame({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                    'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame({'Name': ['adam Smith', 'Annie Kim', 'John Weber', 'Ian Ford'],
                    'gender': ['M', 'F', 'M', 'M']})
I need to join these two dataframes with pandas.merge on the column Name. However, as you can see, there are slight differences between the Name columns of the two dataframes. Let's assume they are the same people. If I simply do:
pd.merge(df1, df2, how='inner', on='Name')
I only get back a dataframe with the rows whose names match exactly ('John Weber' and 'Ian Ford').
Does anyone know how to merge these two dataframes? I guess this is a pretty common situation when joining two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.
I am using fuzzywuzzy here
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df2['key'] = df2.Name.apply(lambda x: process.extract(x, df1.Name, limit=1)[0][0])
df2.merge(df1, left_on='key', right_on='Name')
Out[1238]:
Name_x gender key Age Name_y
0 adam Smith M Adam Smith 43 Adam Smith
1 Annie Kim F Anne Kim 21 Anne Kim
2 John Weber M John Weber 55 John Weber
3 Ian Ford M Ian Ford 24 Ian Ford
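Note this takes the best match unconditionally, which can pair up genuinely different names. A guarded sketch, where the threshold of 80 is an arbitrary assumption:
def best_match(name, choices, threshold=80):
    # keep the original name when nothing clears the threshold
    match = process.extractOne(name, choices, score_cutoff=threshold)
    return match[0] if match else name

df2['key'] = df2.Name.apply(lambda x: best_match(x, df1.Name))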
Not sure if a fuzzy match is what you are looking for. Maybe just make every name a proper name? This fixes 'adam Smith' vs 'Adam Smith', though not true spelling variants like 'Annie Kim' vs 'Anne Kim'.
df1.Name = df1.Name.apply(lambda x: x.title())
df2.Name = df2.Name.apply(lambda x: x.title())
pd.merge(df1, df2, how='inner', on='Name')