This is my first data frame, df1:
ID  Name
1   Sam
2   Mam
3   Dam
This is my second data frame, df2:
ID  Parent ID
1   4
2   5
3   6
This is my third data frame, df3:
Parent ID  Location
7          New York
8          London
9          San Diego
4          Mumbai
5          Pataya
6          Canberra
df4=pd.merge(df1,df2,on=['ID'], how='left')
ID  Name  Parent ID
1   Sam   4
2   Mam   5
3   Dam   6
Now I want to merge df4 with df3 to create df5, but it gives me an invalid result:
df5=pd.merge(df4,df3,on=['Parent ID'], how='left')
ID  Name  Parent ID  Location
1   Sam   4          New York
2   Mam   5          London
3   Dam   6          San Diego
I am not sure why it is picking the first Location values from df3 instead of matching on Parent ID like the first merge did.
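For reference, here is a minimal reproduction of both merges with the data exactly as shown above (a sketch added for illustration; the frames and column names are assumed to match the tables). With clean keys the second merge does return Mumbai, Pataya and Canberra, so a different result usually points to a dtype or whitespace mismatch in the 'Parent ID' column of the real data.
import pandas as pd

# rebuild the three frames from the tables above
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Sam', 'Mam', 'Dam']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Parent ID': [4, 5, 6]})
df3 = pd.DataFrame({'Parent ID': [7, 8, 9, 4, 5, 6],
                    'Location': ['New York', 'London', 'San Diego',
                                 'Mumbai', 'Pataya', 'Canberra']})

df4 = pd.merge(df1, df2, on=['ID'], how='left')
df5 = pd.merge(df4, df3, on=['Parent ID'], how='left')
print(df5)  # Location comes out as Mumbai / Pataya / Canberra with this data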
I have two datasets. I want to compare them using id and name, and I need to write the result to a different data frame with the mismatched values written as "Mismatched" and the mismatched rows otherwise kept as they are.
df1
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom Support Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Pune
4 5 Lee Dev Delhi
df2
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom QA Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Hyderabad
And I need result like,
Result
Index id name dept addr
0 2 Tom Mismatched Bangalore
1 4 Kaif IT Mismatched
2 5 Lee Dev Delhi
One way to do what you intend (given that 'id' and 'name' already match, as in the case you show) is to do an inner merge on the 'name' column and then change the 'dept' value to 'mismatch' wherever the 'dept_x' and 'dept_y' values of the merged dataframe don't match.
A = pd.merge(df1,df2, on='name', how='inner')
# It creates new columns
print(A)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom Support Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
B = A.copy()
B['dept_x'] = A.apply(lambda x : 'mismatch' if x.dept_x!=x.dept_y else x.dept_x, axis=1)
print(B)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom mismatch Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
Then you can do the same for the address column, filter the rows with a mismatch if you only intend to keep those, and rename or drop columns as needed.
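A sketch of that follow-up step, added here for illustration and reusing the A and B frames from the snippets above (note that the inner merge drops rows such as Lee that exist only in df1):
# label mismatches in the address column the same way as for 'dept'
B['addr_x'] = A.apply(lambda x: 'mismatch' if x.addr_x != x.addr_y else x.addr_x, axis=1)

# keep only the rows that contain at least one mismatch and tidy up the columns
result = B.loc[(B['dept_x'] == 'mismatch') | (B['addr_x'] == 'mismatch'),
               ['id_x', 'name', 'dept_x', 'addr_x']]
result = result.rename(columns={'id_x': 'id', 'dept_x': 'dept', 'addr_x': 'addr'})
print(result)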
If you have many columns, you can use a function inside .apply() to make it more general:
# the columns that you intend to check the mismatch for
cols = ['dept','addr']
# or if you want to do it on all columns except the first two because there's too many
cols = [a for a in df1.columns if a not in ['name','id']]
# define a function that compares all of the columns
def is_mismatch(x):
    L = ['mismatch' if x[cols[i]+'_x'] != x[cols[i]+'_y'] else x[cols[i]+'_x'] for i in range(len(cols))]
    return pd.Series(L)

C = A.copy()
C[cols] = C.apply(is_mismatch, axis=1)  # don't forget that axis=1 here!
print(C)
id_x name dept_x addr_x id_y dept_y addr_y dept \
0 1 Jeff IT Delhi 1 IT Delhi IT
1 2 Tom Support Bangalore 2 QA Bangalore mismatch
2 3 Peter Admin Pune 3 Admin Pune Admin
3 4 Kaif IT Pune 4 IT Hyderabad IT
addr
0 Delhi
1 Bangalore
2 Pune
3 mismatch
# if you want to clean the columns
C = C[['id_x','name']+cols]
print(C)
id_x name dept addr
0 1 Jeff IT Delhi
1 2 Tom mismatch Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT mismatch
I have a dataframe df:
Name  Place   Price
Bob   NY      15
Jack  London  27
John  Paris   5
Bill  Sydney  3
Bob   NY      39
Jack  London  9
Bob   NY      2
Dave  NY      7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name  Place   Price  Value
Bob   NY      15     1
Jack  London  27     1
John  Paris   5      1
Bill  Sydney  3      1
Bob   NY      39     2
Jack  London  9      2
Bob   NY      2      3
Dave  NY      7      1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a grouped (on Name, Place) cumulative count and add 1 as it starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
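For a self-contained check, here is a short sketch (added for illustration) that rebuilds the example data from the question and applies the same cumcount:
import pandas as pd

df = pd.DataFrame({
    'Name':  ['Bob', 'Jack', 'John', 'Bill', 'Bob', 'Jack', 'Bob', 'Dave'],
    'Place': ['NY', 'London', 'Paris', 'Sydney', 'NY', 'London', 'NY', 'NY'],
    'Price': [15, 27, 5, 3, 39, 9, 2, 7],
})

# incremental counter within each (Name, Place) group, starting at 1
df['Value'] = df.groupby(['Name', 'Place']).cumcount().add(1)
print(df)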
I have two dataframes with multiple columns.
I would like to compare df1['id'] and df2['id'] and return a new df with an extra column that holds the matched value.
example:
df1
   id  Name
1   1  Paul
2   2  Jean
3   3  Alicia
4   4  Jennifer
df2
   id  Name
1   1  Paul
2   6  Jean
3   3  Alicia
4   7  Jennifer
output
   id  Name      correct_id
1   1  Paul      1
2   2  Jean      N/A
3   3  Alicia    3
4   4  Jennifer  N/A
Note- the length of the two columns I want to match is not the same.
Try:
df1["correct_id"] = (df1["id"].isin(df2["id"]) * df1["id"]).replace(0, "N/A")
print(df1)
Prints:
id Name correct_id
0 1 Paul 1
1 2 Jean N/A
2 3 Alicia 3
3 4 Jennifer N/A
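The trick above works because the boolean mask multiplied by the 'id' column yields 0 wherever there is no match, and the 0s are then replaced by "N/A". An arguably more explicit alternative, sketched here on the same df1/df2 and not part of the original answer, uses numpy.where:
import numpy as np

# keep the id where it also exists in df2, otherwise mark it as "N/A"
df1["correct_id"] = np.where(df1["id"].isin(df2["id"]), df1["id"], "N/A")
print(df1)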
I have a dataframe df1:
id Name City type
1 Anna Paris AB
2 Marc Rome D
3 erika madrid AC
and a dataframe df2
id Name City type
1 Anna Paris B
and a dataframe df3
id Name City type
1 Anna Paris C
I want to append df2 and df3 to df1. This is my expected output:
id Name City type
1 Anna Paris AB
2 Marc Rome D
3 erika madrid AC
1 Anna Paris B
1 Anna Paris C
df1 = df1.append(df2)
df1 = df1.append(df3)
but the resulting dataframe keeps only the last appended row and drops the other rows with the same id:
id Name City type
2 Marc Rome D
3 erika madrid AC
1 Anna Paris C
I am also trying concat:
df1= pd.concat([df1,df2,df3], join='inner')
I think the problem with pd.concat() is that you are passing the parameter join='inner'. I expect this to work:
output = pd.concat([df1,df2,df3])
Using this example code:
df1 = pd.DataFrame({'Name': ['Anna', 'Marc', 'erika'],
                    'City': ['Paris', 'Rome', 'madrid'],
                    'Type': ['AB', 'D', 'AC']})

df2 = pd.DataFrame({'Name': ['Anna'],
                    'City': ['Paris'],
                    'Type': ['B']})

df3 = pd.DataFrame({'Name': ['Anna'],
                    'City': ['Paris'],
                    'Type': ['C']})

pd.concat([df1, df2, df3])
It outputs:
Name City Type
0 Anna Paris AB
1 Marc Rome D
2 erika madrid AC
0 Anna Paris B
0 Anna Paris C
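As a side note (not part of the original answer): the repeated 0 labels appear because each source frame keeps its own index; if you prefer a fresh running index, pass ignore_index=True, e.g.:
pd.concat([df1, df2, df3], ignore_index=True)
# 0   Anna   Paris  AB
# 1   Marc    Rome   D
# 2  erika  madrid  AC
# 3   Anna   Paris   B
# 4   Anna   Paris   C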
I'm trying to perform a groupby on a table where, for a given groupby index, all values are either correct or NaN. E.g.:
id country name
0 1 France None
1 1 France Pierre
2 2 None Marge
3 1 None Pierre
4 3 USA Jim
5 3 None Jim
6 2 UK None
7 4 Spain Alvaro
8 2 None Marge
9 3 None Jim
10 4 Spain None
11 3 None Jim
I just want to get the values for each of the 4 people, which should never clash, e.g.:
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
I've tried:
groupby().first()
groupby.nth(0,dropna='any'/'all')
and even
groupby().apply(lambda x: x.loc[x.first_valid_index()])
All to no avail. What am I missing?
EDIT: to help you make the example dataframe for testing:
df = pd.DataFrame({'id':[1,1,2,1,3,3,2,4,2,3,4,3],'country':['France','France',None,None,'USA',None,'UK','Spain',None,None,'Spain',None],'name':[None,'Pierre','Marge','Pierre','Jim','Jim',None,'Alvaro','Marge','Jim',None,'Jim']})
Pandas groupby.first returns the first non-null value but does not support None, so try:
df.fillna(np.nan).groupby('id').first()
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
It is also possible to specify dropna when the values are None:
df.groupby('id').first(dropna=True)
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro