Let's say I have these two pandas DataFrames.
In [3]: df1 = pd.DataFrame({'id':[None,20,None,40,50],'value':[1,2,3,4,5]})
In [4]: df2 = pd.DataFrame({'index':[None,20,None], 'value':[1,2,3]})
In [7]: df1
Out[7]: id value
0 NaN 1
1 20.0 2
2 NaN 3
3 40.0 4
4 50.0 5
In [8]: df2
Out[8]: index value
0 NaN 1
1 20.0 2
2 NaN 3
When I merge these DataFrames on the id and index columns, the result includes rows where id and index are both missing.
df3 = df1.merge(df2, left_on='id', right_on = 'index', how='inner')
In [9]: df3
Out[9]: id value_x index value_y
0 NaN 1 NaN 1
1 NaN 1 NaN 3
2 NaN 3 NaN 1
3 NaN 3 NaN 3
4 20.0 2 20.0 2
This is what I tried, but I guess it's not the best solution:
I replaced the missing values in one DataFrame's key column with a placeholder value,
and did the same in the second DataFrame with a different placeholder, so that the join condition is False for those rows and they do not appear in the result.
In [14]: df1_fill = df1.fillna({'id':'NONE1'})
In [13]: df2_fill = df2.fillna({'index':'NONE2'})
In [15]: df1_fill
Out[15]: id value
0 NONE1 1
1 20 2
2 NONE1 3
3 40 4
4 50 5
In [16]: df2_fill
Out[16]: index value
0 NONE2 1
1 20 2
2 NONE2 3
What is the best solution for that issue?
Also, in the example the data type of the join columns is numeric, but it could be another type such as text or date.
EDIT:
So, with the solutions here I can use dropna to drop the rows with missing values before the join. That works for an inner join, where I don't want those rows at all.
What about a left join or a full join?
Let's say I have the same two DataFrames as before, df1 and df2.
For inner and left joins I really can use dropna:
In [61]: df_inner = df1.dropna(subset=['id']).merge(df2.dropna(subset=['index']), left_on='id', right_on = 'index', how='inner')
In [62]: df_inner
Out[62]: id value_x index value_y
0 20.0 2 20.0 6
In [63]: df_left = df1.merge(df2.dropna(subset=['index']), left_on='id', right_on = 'index', how='left')
In [64]: df_left
Out[64]: id value_x index value_y
0 NaN 1 NaN NaN
1 20.0 2 20.0 6.0
2 NaN 3 NaN NaN
3 40.0 4 NaN NaN
4 50.0 5 NaN NaN
In [65]: df_full = df1.merge(df2, left_on='id', right_on = 'index', how='outer')
In [66]: df_full
Out[66]: id value_x index value_y
0 NaN 1 NaN 5.0
1 NaN 1 NaN 7.0
2 NaN 3 NaN 5.0
3 NaN 3 NaN 7.0
4 20.0 2 20.0 6.0
5 40.0 4 NaN NaN
6 50.0 5 NaN NaN
For the left join I dropped the rows with missing values from the "right" DataFrame and then used merge.
That was fine because in a left join, if the condition is false you get nulls in the right-hand columns anyway, so it doesn't matter whether the rows really exist or just fail the condition.
But for a full join I need all the rows from both sources.
I can't use dropna because it would drop rows that I need, and if I don't use it I get the wrong result.
Thanks.
Why not do something like this:
pd.merge(df1.dropna(subset=['id']), df2.dropna(subset=['index']),
left_on='id',right_on='index', how='inner')
Output:
id value_x index value_y
0 20.0 2 20.0 2
If you don't want NaN values, you can drop them, i.e.
df3 = df1.merge(df2, left_on='id', right_on = 'index', how='inner').dropna()
or
df3 = df1.dropna().merge(df2.dropna(), left_on='id', right_on = 'index', how='inner')
Output:
id value_x index value_y
0 20.0 2 20.0 2
For an outer merge, drop the rows after merging, i.e.
df_full = df1.merge(df2, left_on='id', right_on = 'index', how='outer').dropna(subset = ['id'])
Output:
id value_x index value_y
4 20.0 2 20.0 2.0
5 40.0 4 NaN NaN
6 50.0 5 NaN NaN
Since you are doing an 'inner' join, what you could do is drop the rows in df1 where the id column is NaN before you merge.
df1_nonan = df1.dropna(subset = ['id'])
df3 = df1_nonan.merge(df2, left_on='id', right_on = 'index', how='inner')
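For the full-join case raised in the question's edit, one possible approach is to set the null-key rows aside, outer-merge only the rows that have a real key, and then append the null-key rows unmatched. The following is a minimal sketch under that assumption. The helper name outer_merge_skip_null_keys is hypothetical (it is not a pandas function), and it assumes a single key column on each side with different names, as in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [None, 20, None, 40, 50], 'value': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'index': [None, 20, None], 'value': [1, 2, 3]})


def outer_merge_skip_null_keys(left, right, left_on, right_on):
    """Hypothetical helper: full outer join in which null keys never match."""
    # Non-key columns present on both sides get the default _x/_y suffixes.
    shared = (set(left.columns) & set(right.columns)) - {left_on, right_on}

    # Outer-merge only the rows that have a real key.
    matched = left.dropna(subset=[left_on]).merge(
        right.dropna(subset=[right_on]),
        left_on=left_on, right_on=right_on, how='outer')

    # Null-key rows are kept, renamed to line up with the merged columns.
    left_null = left[left[left_on].isna()].rename(
        columns={c: c + '_x' for c in shared})
    right_null = right[right[right_on].isna()].rename(
        columns={c: c + '_y' for c in shared})

    return pd.concat([matched, left_null, right_null], ignore_index=True)


df_full = outer_merge_skip_null_keys(df1, df2, 'id', 'index')
print(df_full)
```

With the frames from the question this gives seven rows: one matched pair for key 20, two left-only rows for 40 and 50, and the four null-key rows each on their own.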
How can I pick out the difference between two columns of the same name in two DataFrames?
I mean, I have DataFrame A with a column named X and DataFrame B with a column named X. If I do pd.merge(A, B, on=['X']), I get the common X values of A and B, but how can I get the "non-common" ones?
If you change the merge type to how='outer' and pass indicator=True, this adds a column that tells you whether each value is left-only, right-only, or in both:
In [2]:
A = pd.DataFrame({'x':np.arange(5)})
B = pd.DataFrame({'x':np.arange(3,8)})
print(A)
print(B)
x
0 0
1 1
2 2
3 3
4 4
x
0 3
1 4
2 5
3 6
4 7
In [3]:
pd.merge(A,B, how='outer', indicator=True)
Out[3]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
3 3.0 both
4 4.0 both
5 5.0 right_only
6 6.0 right_only
7 7.0 right_only
You can then filter the resultant merged df on the _merge col:
In [4]:
merged = pd.merge(A,B, how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
Out[4]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
You can also use isin and negate the mask to find values not in B:
In [5]:
A[~A['x'].isin(B['x'])]
Out[5]:
x
0 0
1 1
2 2
The accepted answer gives a so-called LEFT JOIN IF NULL in SQL terms. If you want all the rows except the matching ones from both DataFrames, not only the left one, you have to add another condition to the filter, since you want to exclude all rows that are in both.
In this case we use DataFrame.merge & DataFrame.query:
df1 = pd.DataFrame({'A':list('abcde')})
df2 = pd.DataFrame({'A':list('cdefgh')})
print(df1, '\n')
print(df2)
A
0 a # <- only df1
1 b # <- only df1
2 c # <- both
3 d # <- both
4 e # <- both
A
0 c # both
1 d # both
2 e # both
3 f # <- only df2
4 g # <- only df2
5 h # <- only df2
df = (
df1.merge(df2,
on='A',
how='outer',
indicator=True)
.query('_merge != "both"')
.drop(columns='_merge')
)
print(df)
A
0 a
1 b
5 f
6 g
7 h
I am trying to merge based on two columns being equal to each other for two Dataframes.
Here is the code:
>>> df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'))
Name Age
0 1 2
1 3 4
2 4 5
>>> df
Name Age
0 1 2
1 3 4
0 4 5
>>> df1
Name Age
0 5 6
1 3 4
0 4 7
What I actually expected from the merge was
Name Age Age_#
0 1 2 NaN
1 3 4 4.0
2 4 5 7.0
Why does pandas think that there are three matching rows for this merge?
So you mean merge on Name, right?
df.merge(df1, how='left', on='Name', suffixes=('', '_#'))
Out[120]:
Name Age Age_#
0 1 2 NaN
1 3 4 4.0
2 4 5 7.0
Use indicator to see what your output actually is:
df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'),indicator=True)
Out[121]:
Name Age _merge
0 1 2 left_only
1 3 4 both
2 4 5 left_only
Since df and df1 have the same columns and all of them are used as merge keys, there are no other columns left to show whether a row of df has a match in df1. Because you are using how='left', every row of df appears in the result by default.
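To make the difference concrete, here is a small runnable sketch. The two frames are rebuilt from the printed output in the question, so their construction (and the default integer index) is an assumption.

```python
import pandas as pd

# Frames rebuilt from the printed output in the question (index values assumed).
df = pd.DataFrame({'Name': [1, 3, 4], 'Age': [2, 4, 5]})
df1 = pd.DataFrame({'Name': [5, 3, 4], 'Age': [6, 4, 7]})

# Both columns are keys: only the pair (3, 4) exists in df1, and no non-key
# column is left over to carry a suffix.
on_both = df.merge(df1, how='left', on=['Name', 'Age'], indicator=True)
print(on_both)

# Only Name is a key: Age from df1 comes through as Age_#.
on_name = df.merge(df1, how='left', on='Name', suffixes=('', '_#'))
print(on_name)
```

The three rows in the first result are the three rows of df, not three matches; the _merge column shows that only one of them found a partner.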
In an exercise, I was asked to merge 3 DataFrames with inner join (df1+df2+df3 = mergedDf), then in another question I was asked to tell how many entries I've lost when performing this 3-way merging.
#DataFrame1
df1 = pd.DataFrame(columns=["Goals","Medals"],data=[[5,2],[1,0],[3,1]])
df1.index = ['Argentina','Angola','Bolivia']
print(df1)
Goals Medals
Argentina 5 2
Angola 1 0
Bolivia 3 1
#DataFrame2
df2 = pd.DataFrame(columns=["Dates","Medals"],data=[[1,0],[2,1],[2,2]])
df2.index = ['Venezuela','Africa','Argentina']
print(df2)
Dates Medals
Venezuela 1 0
Africa 2 1
Argentina 2 2
#DataFrame3
df3 = pd.DataFrame(columns=["Players","Goals"],data=[[11,5],[11,1],[10,0]])
df3.index = ['Argentina','Australia','Belgica']
print(df3)
Players Goals
Argentina 11 5
Australia 11 1
Belgica 10 0
#mergedDf
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
print(mergedDf)
Goals_x Medals_x Dates Medals_y Players Goals_y
Argentina 5 2 2 2 11 5
#Calculate number of lost entries by code
I tried merging everything with an outer join and then subtracting mergedDf, but I don't know how to do this. Can anyone help me?
I've found a simple but effective solution:
Merging the 3 DataFrames, inner and outer:
df1 = Df1()
df2 = Df2()
df3 = Df3()
inner = pd.merge(pd.merge(df1,df2,on='<Common column>',how='inner'),df3,on='<Common column>',how='inner')
outer = pd.merge(pd.merge(df1,df2,on='<Common column>',how='outer'),df3,on='<Common column>',how='outer')
Now, the number of missed entries (rows) is:
return (len(outer)-len(inner))
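Applied to the frames from the question, the same idea might look like the sketch below. It assumes the index labels are the join key (in place of the '<Common column>' placeholder) and that labels are unique within each frame; merge_all is a hypothetical helper name.

```python
import pandas as pd

df1 = pd.DataFrame({'Goals': [5, 1, 3], 'Medals': [2, 0, 1]},
                   index=['Argentina', 'Angola', 'Bolivia'])
df2 = pd.DataFrame({'Dates': [1, 2, 2], 'Medals': [0, 1, 2]},
                   index=['Venezuela', 'Africa', 'Argentina'])
df3 = pd.DataFrame({'Players': [11, 11, 10], 'Goals': [5, 1, 0]},
                   index=['Argentina', 'Australia', 'Belgica'])


def merge_all(frames, how):
    """Hypothetical helper: chain index-on-index merges over a list of frames."""
    result = frames[0]
    for frame in frames[1:]:
        result = pd.merge(result, frame, how=how,
                          left_index=True, right_index=True)
    return result


inner = merge_all([df1, df2, df3], 'inner')
outer = merge_all([df1, df2, df3], 'outer')

# Labels that appear somewhere but do not survive the inner join.
missing = len(outer) - len(inner)
print(missing)
```

Note that this counts distinct labels: a country that appears in two of the frames is counted once, not twice.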
A solution using an outer join with the indicator parameter: count the rows where neither indicator column, a nor b, is 'both', by summing the True values (which are treated as 1s):
mergedDf = pd.merge(df1,df2,how='outer',left_index=True, right_index=True, indicator='a')
mergedDf = pd.merge(mergedDf,df3,how='outer',left_index=True, right_index=True, indicator='b')
print(mergedDf)
Goals_x Medals_x Dates Medals_y a Players Goals_y \
Africa NaN NaN 2.0 1.0 right_only NaN NaN
Angola 1.0 0.0 NaN NaN left_only NaN NaN
Argentina 5.0 2.0 2.0 2.0 both 11.0 5.0
Australia NaN NaN NaN NaN NaN 11.0 1.0
Belgica NaN NaN NaN NaN NaN 10.0 0.0
Bolivia 3.0 1.0 NaN NaN left_only NaN NaN
Venezuela NaN NaN 1.0 0.0 right_only NaN NaN
b
Africa left_only
Angola left_only
Argentina both
Australia right_only
Belgica right_only
Bolivia left_only
Venezuela left_only
missing = ((mergedDf['a'] != 'both') & (mergedDf['b'] != 'both')).sum()
print (missing)
6
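One case the frames above do not exercise: a label that appears in exactly two of the three DataFrames is 'both' in one indicator column, yet it is still dropped by the chained inner join. The sketch below uses small hypothetical frames (left, mid and right are made-up names) to compare the & mask with an | mask. Treat it as an illustration to check against your own data.

```python
import pandas as pd

# Hypothetical frames: label 'y' is in left and mid, but not in right.
left = pd.DataFrame({'u': [1, 2]}, index=['x', 'y'])
mid = pd.DataFrame({'v': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'w': [5]}, index=['x'])

merged = pd.merge(left, mid, how='outer',
                  left_index=True, right_index=True, indicator='a')
merged = pd.merge(merged, right, how='outer',
                  left_index=True, right_index=True, indicator='b')

# 'y' has a == 'both' and b == 'left_only'.
lost_and = ((merged['a'] != 'both') & (merged['b'] != 'both')).sum()
lost_or = ((merged['a'] != 'both') | (merged['b'] != 'both')).sum()

# Reference count from the chained inner join.
inner = pd.merge(pd.merge(left, mid, left_index=True, right_index=True),
                 right, left_index=True, right_index=True)
print(lost_and, lost_or, len(merged) - len(inner))
```

Here the | mask agrees with the inner join (one label lost), while the & mask reports none.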
Another solution is to use an inner join and then, for each DataFrame, count the index values that are not in mergedDf.index and sum the counts:
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
vals = mergedDf.index
print (vals)
Index(['Argentina'], dtype='object')
dfs = [df1, df2, df3]
missing = sum((~x.index.isin(vals)).sum() for x in dfs)
print (missing)
6
Another solution, if the values in each index are unique:
dfs = [df1, df2, df3]
L = [set(x.index) for x in dfs]
#https://stackoverflow.com/a/25324329/2901002
missing = len(set.union(*L) - set.intersection(*L))
print (missing)
6
You can pass True to the indicator parameter of merge:
df1=pd.DataFrame({'A':[1,2,3],'B':[1,1,1]})
df2=pd.DataFrame({'A':[2,3],'B':[1,1]})
df1.merge(df2,on='A',how='inner')
Out[257]:
A B_x B_y
0 2 1 1
1 3 1 1
df1.merge(df2,on='A',how='outer',indicator =True)
Out[258]:
A B_x B_y _merge
0 1 1 NaN left_only
1 2 1 1.0 both
2 3 1 1.0 both
mergedf=df1.merge(df2,on='A',how='outer',indicator =True)
Then value_counts tells you how many rows you lose with an inner join, since only the 'both' rows are kept when how='inner':
mergedf['_merge'].value_counts()
Out[260]:
both 2
left_only 1
right_only 0
Name: _merge, dtype: int64
For three DataFrames, chain the merges and then keep the rows where both indicator columns are 'both':
df1.merge(df2, on='A',how='outer',indicator =True).rename(columns={'_merge':'merge'}).merge(df3, on='A',how='outer',indicator =True)
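The answer does not define df3 or show the filter, so the following sketch fills both in as assumptions: df3 is a hypothetical third frame, and the filter keeps rows whose two indicator columns are both 'both'.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
df2 = pd.DataFrame({'A': [2, 3], 'B': [1, 1]})
df3 = pd.DataFrame({'A': [3, 4], 'B': [1, 1]})  # hypothetical third frame

merged = (
    df1.merge(df2, on='A', how='outer', indicator=True)
       .rename(columns={'_merge': 'merge'})   # free the default name for the next merge
       .merge(df3, on='A', how='outer', indicator=True)
)

# Rows a chained inner join would keep: 'both' in the first and second merge.
kept = merged[(merged['merge'] == 'both') & (merged['_merge'] == 'both')]
lost = len(merged) - len(kept)
print(kept['A'].tolist(), lost)
```

Renaming the first indicator column is what allows indicator=True to be used a second time without a column-name clash.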
I have two dataframes, df1 and df2
df1
skuid brand
0 ax12 C
1 zm23 F
2 zm23 NaN
3 zm24 NaN
df2
sid brand
0 ax11 G
1 ax12 C
2 zm23 F
3 zm23 NaN
I need to combine the two dataframes based on the values of skuid and sid.
df1.merge(df2, how='right')
skuid brand sid
0 ax12 C ax12
1 zm23 F zm23
2 zm23 NaN zm23
3 zm24 NaN zm23
4 NaN G ax11
How can I get the output as shown below?
skuid brand sid
0 ax12 C ax12
1 zm23 F zm23
2 zm23 NaN NaN
3 zm24 NaN NaN
4 NaN NaN zm23
5 NaN G ax11
That is, a NaN value for sid in rows 2 and 3,
and one additional row for zm23 from df2.
Why do you want to do this? I don't think you can do it in one operation. If you use right, you lose zm24 from df1. If you use left, you lose ax11 from df2. So you need to use outer, but that alone won't do what you want: you will still get the row zm23 NaN zm23, because you merge on skuid and sid and they are equal in that row. You can merge your DataFrames and then do further manipulation.
You can also try left_on and right_on, but that won't solve your problem on its own.
If you use them as below, you still don't get the sid you want in row 3.
df1.merge(df2, how='outer', left_on=['skuid', 'brand'], right_on=['sid', 'brand'])
Update: I found a solution. You can fill the NaNs with different values in each DataFrame before merging.
df1 = df1.fillna(0)
df2 = df2.fillna(1)
df = df1.merge(df2, how='outer', left_on=['skuid', 'brand'], right_on=['sid', 'brand'])
If you want, you can convert them back to NaN afterwards.
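A runnable sketch of this approach, including the conversion back, could look like the following. The frames are rebuilt from the question. Instead of the 0 and 1 used above, it uses two placeholder strings (an assumption made here so that the brand column keeps a single type); any two values that cannot occur as real brands would do.

```python
import pandas as pd

df1 = pd.DataFrame({'skuid': ['ax12', 'zm23', 'zm23', 'zm24'],
                    'brand': ['C', 'F', None, None]})
df2 = pd.DataFrame({'sid': ['ax11', 'ax12', 'zm23', 'zm23'],
                    'brand': ['G', 'C', 'F', None]})

# Assumed placeholders; they differ per side so missing brands never match.
LEFT_NA, RIGHT_NA = '__left_na__', '__right_na__'

merged = df1.fillna({'brand': LEFT_NA}).merge(
    df2.fillna({'brand': RIGHT_NA}),
    how='outer', left_on=['skuid', 'brand'], right_on=['sid', 'brand'])

# Turn the placeholders back into missing values.
merged['brand'] = merged['brand'].mask(
    merged['brand'].isin([LEFT_NA, RIGHT_NA]))
print(merged)
```

The result has six rows, as in the desired output, although the row order may differ from the one shown in the question.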