How to compare pandas dataframe using given keys - python

I have two datasets. I want to compare them using id and name as keys and write the differences to a new dataframe: values that differ should be replaced with "Mismatched", and rows that exist in only one dataset should be kept as they are.
df1
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom Support Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Pune
4 5 Lee Dev Delhi
df2
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom QA Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Hyderabad
And I need result like,
Result
Index id name dept addr
0 2 Tom Mismatched Bangalore
1 4 Kaif IT Mismatched
2 5 Lee Dev Delhi

One way to do this (assuming 'id' and 'name' already agree, as in the example you show) is an inner merge on the 'name' column, followed by setting 'dept_x' to 'mismatch' wherever 'dept_x' and 'dept_y' differ in the merged dataframe. Note that an inner merge keeps only the names present in both frames, so Lee is dropped.
import pandas as pd
A = pd.merge(df1,df2, on='name', how='inner')
# It creates new columns
print(A)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom Support Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
B = A.copy()
B['dept_x'] = A.apply(lambda x : 'mismatch' if x.dept_x!=x.dept_y else x.dept_x, axis=1)
print(B)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom mismatch Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
Then you can do the same for the address column, filter the rows with mismatch if you intend to only keep them, and rename or delete the columns that you need/don't need accordingly.
If you have many columns, you can use a function inside the .apply() to make it more general :
# the columns that you intend to check the mismatch for
cols = ['dept','addr']
# or if you want to do it on all columns except the first two because there's too many
cols = [a for a in df1.columns if a not in ['name','id']]
# define a function that compares for all columns
def is_mismatch(x):
    # 'mismatch' where the _x and _y values differ, otherwise the _x value
    L = ['mismatch' if x[c+'_x'] != x[c+'_y'] else x[c+'_x'] for c in cols]
    return pd.Series(L)
C = A.copy()
C[cols] = C.apply(is_mismatch, axis=1) # don't forget that axis=1 here !
print(C)
id_x name dept_x addr_x id_y dept_y addr_y dept \
0 1 Jeff IT Delhi 1 IT Delhi IT
1 2 Tom Support Bangalore 2 QA Bangalore mismatch
2 3 Peter Admin Pune 3 Admin Pune Admin
3 4 Kaif IT Pune 4 IT Hyderabad IT
addr
0 Delhi
1 Bangalore
2 Pune
3 mismatch
# if you want to clean the columns
C = C[['id_x','name']+cols]
print(C)
id_x name dept addr
0 1 Jeff IT Delhi
1 2 Tom mismatch Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT mismatch
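The result requested in the question also keeps rows that exist only in df1 (Lee) and drops rows that match completely. A minimal sketch of one way to get that, assuming a left merge on both keys with indicator=True; the names keys, merged, only_left and any_diff are illustrative and not part of the answer above:

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Jeff", "Tom", "Peter", "Kaif", "Lee"],
    "dept": ["IT", "Support", "Admin", "IT", "Dev"],
    "addr": ["Delhi", "Bangalore", "Pune", "Pune", "Delhi"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Jeff", "Tom", "Peter", "Kaif"],
    "dept": ["IT", "QA", "Admin", "IT"],
    "addr": ["Delhi", "Bangalore", "Pune", "Hyderabad"],
})

keys = ["id", "name"]                                # columns used to pair rows
cols = [c for c in df1.columns if c not in keys]     # columns to compare

# Left merge keeps every df1 row; _merge tells us which rows had no partner.
merged = df1.merge(df2, on=keys, how="left",
                   suffixes=("", "_other"), indicator=True)
only_left = merged["_merge"] == "left_only"

result = merged[keys + cols].copy()
any_diff = pd.Series(False, index=merged.index)
for c in cols:
    # A value counts as mismatched only if the row exists in both frames.
    diff = (merged[c] != merged[c + "_other"]) & ~only_left
    result[c] = merged[c].where(~diff, "Mismatched")
    any_diff |= diff

# Keep rows with at least one mismatch, plus rows missing from df2.
result = result[any_diff | only_left].reset_index(drop=True)
print(result)
```

With the sample data this keeps Tom, Kaif and Lee. One caveat of this sketch: two missing values in the same cell compare as different, so real data with NaN may need extra handling.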


How to add a suffix to the first N columns in pandas?

I want to add a suffix to the first N columns. But I can't.
This is how to add a suffix to all columns:
import pandas as pd
df = pd.DataFrame( {"name" : ["John","Alex","Kate","Martin"], "surname" : ["Smith","Morgan","King","Cole"],
"job": ["Engineer","Dentist","Coach","Teacher"],"Age":[25,20,25,30],
"Id": [1,2,3,4]})
df.add_suffix("_x")
And this is the result:
name_x surname_x job_x Age_x Id_x
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
But I want to add the suffix only to the first N columns, say the first 3. The desired output is:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
Work with the indices and take slices to modify a subset of them:
df.columns = (df.columns[:3]+'_x').union(df.columns[3:], sort=False)
print(df)
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
This should work:
N=3
cols=[i for i in df.columns[:N]]
new_cols=[i+'_x' for i in df.columns[:N]]
dict_cols=dict(zip(cols,new_cols))
# rename returns a new dataframe, so assign the result back
df = df.rename(dict_cols,axis=1)
Set the column labels using a list comprehension:
n = 3
df.columns = [f'{c}_x' if i < n else c for i, c in enumerate(df.columns)]
This results in:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
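The answers above can also be wrapped in a small helper that leaves the original dataframe untouched. This is only a sketch; the function name suffix_first_n and its parameters are made up for illustration:

```python
import pandas as pd

def suffix_first_n(df, n, suffix="_x"):
    """Return a copy of df whose first n column labels end with suffix."""
    # Slicing past the end is safe, so n larger than the column count
    # simply renames every column.
    mapping = {c: f"{c}{suffix}" for c in df.columns[:n]}
    return df.rename(columns=mapping)

df = pd.DataFrame({
    "name": ["John", "Alex", "Kate", "Martin"],
    "surname": ["Smith", "Morgan", "King", "Cole"],
    "job": ["Engineer", "Dentist", "Coach", "Teacher"],
    "Age": [25, 20, 25, 30],
    "Id": [1, 2, 3, 4],
})

out = suffix_first_n(df, 3)
print(list(out.columns))
```

Because the mapping is keyed by label, this assumes the column labels are unique strings.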

Pandas: how to compare two columns in different sheets and return the matched value

I have two dataframes with multiple columns.
I would like to compare df1['id'] with df2['id'] and return a new df with an extra column that holds the matched value.
example:
df1
id Name
1 1 Paul
2 2 Jean
3 3 Alicia
4 4 Jennifer
df2
id Name
1 1 Paul
2 6 Jean
3 3 Alicia
4 7 Jennifer
output
id Name correct_id
1 1 Paul 1
2 2 Jean N/A
3 3 Alicia 3
4 4 Jennifer N/A
Note: the two columns I want to match do not have the same length.
Try:
df1["correct_id"] = (df1["id"].isin(df2["id"]) * df1["id"]).replace(0, "N/A")
print(df1)
Prints:
id Name correct_id
0 1 Paul 1
1 2 Jean N/A
2 3 Alicia 3
3 4 Jennifer N/A
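The multiplication trick above depends on 0 never being a real id. A minimal sketch of an alternative that avoids this, assuming Series.where with the isin mask; the sample frames are rebuilt from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4],
                    "Name": ["Paul", "Jean", "Alicia", "Jennifer"]})
df2 = pd.DataFrame({"id": [1, 6, 3, 7],
                    "Name": ["Paul", "Jean", "Alicia", "Jennifer"]})

# True where the id from df1 also appears anywhere in df2["id"].
found = df1["id"].isin(df2["id"])

# Cast to object first so the column can hold both integers and the
# "N/A" string, then keep the id where found and use "N/A" elsewhere.
df1["correct_id"] = df1["id"].astype(object).where(found, "N/A")
print(df1)
```

Since isin does not align rows, this works even when the two id columns have different lengths.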

Merged column doesn't display properly

This is my first data frame df1
ID  Name
1   Sam
2   Mam
3   Dam
This is my second data frame df2
ID  Parent ID
1   4
2   5
3   6
This is my third dataframe df3
Parent ID  Location
7          New York
8          London
9          San Diego
4          Mumbai
5          Pataya
6          Canberra
df4=pd.merge(df1,df2,on=['ID'], how='left')
ID  Name  Parent ID
1   Sam   4
2   Mam   5
3   Dam   6
Now I want to merge df4 with df3 to get df5, but this gives me an invalid result:
df5=pd.merge(df4,df3,on=['Parent ID'], how='left')
ID  Name  Parent ID  Location
1   Sam   4          New York
2   Mam   5          London
3   Dam   6          San Diego
I am not sure why it picks the first values of df3 instead of the matching ones, as the first merge did.
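A minimal reproduction may help narrow this down. It assumes the three frames hold exactly the values shown above, with integer keys. Under that assumption the second merge pairs rows by 'Parent ID' and returns Mumbai, Pataya and Canberra, which suggests the real data or code differs from what is shown:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Sam", "Mam", "Dam"]})
df2 = pd.DataFrame({"ID": [1, 2, 3], "Parent ID": [4, 5, 6]})
df3 = pd.DataFrame({
    "Parent ID": [7, 8, 9, 4, 5, 6],
    "Location": ["New York", "London", "San Diego",
                 "Mumbai", "Pataya", "Canberra"],
})

df4 = pd.merge(df1, df2, on=["ID"], how="left")
df5 = pd.merge(df4, df3, on=["Parent ID"], how="left")

# Inspect the key columns: they should have the same dtype and
# overlapping values for the merge to pair rows by key.
print(df4["Parent ID"].dtype, df3["Parent ID"].dtype)
print(df5)
```

The output in the question looks like df3 was attached by row position rather than by key, so it may be worth checking whether the real code uses a positional operation such as concatenation or column assignment instead of this merge.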

Pandas group by a specific value in any of given columns

Given the pandas dataframe as follows:
Partner1 Partner2 Interactions
0 Ann Alice 1
1 Alice Kate 8
2 Kate Tony 9
3 Tony Ann 2
How can I group by a specific partner, let's say to find the total number of interactions of Ann?
Something like
gb = df.groupby(['Partner1'] or ['Partner2']).agg({'Interactions': 'sum'})
and getting the answer:
Partner Interactions
Ann 3
Alice 9
Kate 17
Tony 11
You can use melt together with groupby. First melt:
df = pd.melt(df, id_vars='Interactions', value_vars=['Partner1', 'Partner2'], value_name='Partner')
This will give:
Interactions variable Partner
0 1 Partner1 Ann
1 8 Partner1 Alice
2 9 Partner1 Kate
3 2 Partner1 Tony
4 1 Partner2 Alice
5 8 Partner2 Kate
6 9 Partner2 Tony
7 2 Partner2 Ann
Now, group by Partner and sum:
df.groupby('Partner')[['Interactions']].sum()
Result:
Partner Interactions
Alice 9
Ann 3
Kate 17
Tony 11
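For reference, the melt and groupby steps can be run end to end as below. This is a sketch on the sample data from the question; the names long_df and totals are illustrative, and the last line looks up the total for a single partner such as Ann:

```python
import pandas as pd

df = pd.DataFrame({
    "Partner1": ["Ann", "Alice", "Kate", "Tony"],
    "Partner2": ["Alice", "Kate", "Tony", "Ann"],
    "Interactions": [1, 8, 9, 2],
})

# One row per (interaction, partner) pair, without overwriting df.
long_df = df.melt(id_vars="Interactions",
                  value_vars=["Partner1", "Partner2"],
                  value_name="Partner")

# Total interactions per partner, indexed by partner name.
totals = long_df.groupby("Partner")["Interactions"].sum()

print(totals)
print(totals.loc["Ann"])
```

Keeping the melted frame in a separate variable leaves the original df available for other approaches.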
You can also merge the dataframe with itself (this assumes df is the original frame and that each partner appears exactly once in each of the two columns):
# join the df to itself
join_df = df.merge(df, left_on='Partner1', right_on='Partner2', suffixes=('', '_'))
# sum the interactions from both sides of the join
join_df['InteractionsSum'] = join_df[['Interactions', 'Interactions_']].sum(axis=1)
join_df = join_df[['Partner1', 'InteractionsSum']].copy()
print(join_df)
Partner1 InteractionsSum
0 Ann 3
1 Alice 9
2 Kate 17
3 Tony 11

Pandas groupby give any non nan values

I'm trying to perform a groupby on a table where, within each group, every value is either correct or NaN. For example:
id country name
0 1 France None
1 1 France Pierre
2 2 None Marge
3 1 None Pierre
4 3 USA Jim
5 3 None Jim
6 2 UK None
7 4 Spain Alvaro
8 2 None Marge
9 3 None Jim
10 4 Spain None
11 3 None Jim
I just want to get the values for each of the 4 people, which should never clash, eg:
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
I've tried:
groupby().first()
groupby.nth(0,dropna='any'/'all')
and even
groupby().apply(lambda x: x.loc[x.first_valid_index()])
All to no avail. What am I missing?
EDIT: to help you build the example dataframe for testing:
df = pd.DataFrame({'id':[1,1,2,1,3,3,2,4,2,3,4,3],'country':['France','France',None,None,'USA',None,'UK','Spain',None,None,'Spain',None],'name':[None,'Pierre','Marge','Pierre','Jim','Jim',None,'Alvaro','Marge','Jim',None,'Jim']})
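Pandas groupby.first returns the first non-null value in each group, but it may not treat None as null (this appears to depend on the pandas version). Converting None to NaN first avoids the problem:
import numpy as np
df.fillna(np.nan).groupby('id').first()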
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
Alternatively, it may be possible to pass dropna explicitly when the values are None (this keyword may not be accepted by every pandas version):
df.groupby('id').first(dropna=True)
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
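If neither variant behaves as expected in your pandas version, a version-independent sketch is to drop the missing values explicitly inside each group. The helper name first_valid is hypothetical, and the data is the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 1, 3, 3, 2, 4, 2, 3, 4, 3],
    "country": ["France", "France", None, None, "USA", None,
                "UK", "Spain", None, None, "Spain", None],
    "name": [None, "Pierre", "Marge", "Pierre", "Jim", "Jim",
             None, "Alvaro", "Marge", "Jim", None, "Jim"],
})

def first_valid(s):
    """Return the first non-missing value of a Series, or None if all are missing."""
    valid = s.dropna()  # dropna treats both None and NaN as missing
    return valid.iloc[0] if not valid.empty else None

# agg applies first_valid to every non-key column within each id group.
result = df.groupby("id").agg(first_valid)
print(result)
```

This is slower than the built-in first() on large frames because it calls a Python function per group and column, so it is mainly useful as a fallback or a cross-check.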
