"Anti-merge" in pandas (Python) - python

How can I pick out the difference between two columns of the same name in two dataframes?
I mean, I have dataframe A with a column named X and dataframe B with a column named X. If I do pd.merge(A, B, on=['X']), I'll get the common X values of A and B, but how can I get the "non-common" ones?

If you change the merge type to how='outer' and pass indicator=True, this adds a column telling you whether each value is left_only, both, or right_only:
In [1]:
import pandas as pd
import numpy as np
In [2]:
A = pd.DataFrame({'x':np.arange(5)})
B = pd.DataFrame({'x':np.arange(3,8)})
print(A)
print(B)
x
0 0
1 1
2 2
3 3
4 4
x
0 3
1 4
2 5
3 6
4 7
In [3]:
pd.merge(A,B, how='outer', indicator=True)
Out[3]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
3 3.0 both
4 4.0 both
5 5.0 right_only
6 6.0 right_only
7 7.0 right_only
You can then filter the merged DataFrame on the _merge column:
In [4]:
merged = pd.merge(A,B, how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
Out[4]:
x _merge
0 0.0 left_only
1 1.0 left_only
2 2.0 left_only
You can also use isin and negate the mask to find values not in B:
In [5]:
A[~A['x'].isin(B['x'])]
Out[5]:
x
0 0
1 1
2 2
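For reuse, the indicator trick can be wrapped in a small helper; a sketch (the anti_join name and signature are my own, not a pandas API):

```python
import pandas as pd
import numpy as np

def anti_join(left, right, on):
    """Return the rows of `left` whose key values do not appear in `right`."""
    cols = on if isinstance(on, list) else [on]
    # Only the key columns of `right` matter; drop duplicates so no left row
    # is repeated by the merge.
    merged = left.merge(right[cols].drop_duplicates(), on=cols,
                        how='left', indicator=True)
    return (merged.loc[merged['_merge'] == 'left_only']
                  .drop(columns='_merge')
                  .reset_index(drop=True))

A = pd.DataFrame({'x': np.arange(5)})
B = pd.DataFrame({'x': np.arange(3, 8)})
print(anti_join(A, B, on='x'))   # rows with x = 0, 1, 2
```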

The accepted answer gives a so-called LEFT JOIN IF NULL in SQL terms. If you want all the rows except the matching ones from both DataFrames, not only the left one, you have to add another condition to the filter, since you want to exclude all rows which are in both.
In this case we use DataFrame.merge & DataFrame.query:
import pandas as pd

df1 = pd.DataFrame({'A':list('abcde')})
df2 = pd.DataFrame({'A':list('cdefgh')})
print(df1, '\n')
print(df2)
A
0 a # <- only df1
1 b # <- only df1
2 c # <- both
3 d # <- both
4 e # <- both
A
0 c # both
1 d # both
2 e # both
3 f # <- only df2
4 g # <- only df2
5 h # <- only df2
df = (
df1.merge(df2,
on='A',
how='outer',
indicator=True)
.query('_merge != "both"')
.drop(columns='_merge')
)
print(df)
A
0 a
1 b
5 f
6 g
7 h
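The same two-sided result can also be computed without a merge, by combining two negated isin masks; a sketch (variable names are my own):

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('abcde')})
df2 = pd.DataFrame({'A': list('cdefgh')})

only_1 = df1[~df1['A'].isin(df2['A'])]    # rows only in df1: a, b
only_2 = df2[~df2['A'].isin(df1['A'])]    # rows only in df2: f, g, h
sym_diff = pd.concat([only_1, only_2], ignore_index=True)
print(sym_diff)
```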


How to compare values of certain columns of one dataframe with the values of the same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 which match the values of the primary keys of df4.
For example, in this case I need to get the row with index = 4 from df3 and the row with index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
   A  B   C  A(df2) B(df2) C(df2)  A(df3) B(df3) C(df3)
4  5  e  e5                             5      e      Q
1  2  b  b2       2      b      Y
Any ideas on how to work this out will be very helpful.
Use two consecutive DataFrame.merge operations, together with DataFrame.add_suffix on the right dataframe, to left-merge df2 and df3 onto df4; finally, use DataFrame.fillna to replace the missing values with an empty string:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
   A  B   C  A(df2) B(df2) C(df2)  A(df3) B(df3) C(df3)
0  5  e  e5                           5.0      e      Q
1  2  b  b2     2.0      b      Y
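For reference, a self-contained sketch of the approach above (the sample frames are rebuilt here from the question):

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': list('abc'), 'C': list('XYZ')})
df3 = pd.DataFrame({'A': [4, 5, 6], 'B': list('def'), 'C': list('PQR')})
df4 = pd.DataFrame({'A': [5, 2], 'B': ['e', 'b'], 'C': ['e5', 'b2']})

df = (
    df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'],
              right_on=['A(df2)', 'B(df2)'], how='left')
       .merge(df3.add_suffix('(df3)'), left_on=['A', 'B'],
              right_on=['A(df3)', 'B(df3)'], how='left')
       .fillna('')   # unmatched side shows as empty strings
)
print(df)
```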
Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4, or just take a sample of t.
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

pandas dataframe how to merge all rows based on groupby

I have a dataframe with many columns; 2 are categorical and the rest are numeric:
df = [type1 , type2 , type3 , val1, val2, val3
a b q 1 2 3
a c w 3 5 2
b c t 2 9 0
a b p 4 6 7
a c m 2 1 8]
I want to apply a merge based on the operation groupby(["type1","type2"]) that will create the following dataframe:
df = [type1 , type2 ,type3, val1, val2, val3 , val1_a, val2_b, val3_b
a b q 1 2 3 4 6 7
a c w 3 5 2 2 1 8
b c t 2 9 0 2 9 0
Please notice: there could be 1 or 2 rows in each group, but not more. In the case of a single row, just duplicate it.
The idea is to use GroupBy.cumcount as a counter per (type1, type2) group. Setting it as an extra index level creates a MultiIndex, which DataFrame.unstack reshapes into columns; missing values are then forward-filled per row with ffill, converted to integers, and the columns sorted by the counter level. Finally, a list comprehension flattens the MultiIndex:
g = df.groupby(["type1","type2"]).cumcount()
df1 = (df.set_index(["type1","type2", g])
.unstack()
.ffill(axis=1)
.astype(int)
.sort_index(level=1, axis=1))
df1.columns = [f'{a}_{b}' if b != 0 else a for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
type1 type2 val1 val2 val3 val1_1 val2_1 val3_1
0 a b 1 2 3 4 6 7
1 a c 3 5 2 2 1 8
2 b c 2 9 0 2 9 0
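Note that astype(int) assumes all remaining columns are numeric, so the question's string column type3 must be left out (it would make the cast fail). A runnable sketch with only the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({'type1': ['a', 'a', 'b', 'a', 'a'],
                   'type2': ['b', 'c', 'c', 'b', 'c'],
                   'val1': [1, 3, 2, 4, 2],
                   'val2': [2, 5, 9, 6, 1],
                   'val3': [3, 2, 0, 7, 8]})

g = df.groupby(['type1', 'type2']).cumcount()   # 0 for a group's first row, 1 for its second
df1 = (df.set_index(['type1', 'type2', g])
         .unstack()          # counter becomes a second column level
         .ffill(axis=1)      # single-row groups fill their counter-1 cells from counter-0
         .astype(int)
         .sort_index(level=1, axis=1))
df1.columns = [f'{a}_{b}' if b != 0 else a for a, b in df1.columns]
df1 = df1.reset_index()
print(df1)
```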

Pandas dataframe merge not working as expected with multiple column equality checks

I am trying to merge based on two columns being equal to each other for two Dataframes.
Here is the code:
>>> df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'))
Name Age
0 1 2
1 3 4
2 4 5
>>> df
Name Age
0 1 2
1 3 4
0 4 5
>>> df1
Name Age
0 5 6
1 3 4
0 4 7
What I actually expected from the merge was
Name Age Age_#
0 1 2 NaN
1 3 4 4.0
2 4 5 7.0
Why does pandas think that there are three matching rows for this merge?
So you mean merge on Name, right?
df.merge(df1, how='left', on='Name', suffixes=('', '_#'))
Out[120]:
Name Age Age_#
0 1 2 NaN
1 3 4 4.0
2 4 5 7.0
Use indicator=True to see what your output really is:
df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'),indicator=True)
Out[121]:
Name Age _merge
0 1 2 left_only
1 3 4 both
2 4 5 left_only
Since your df and df1 have the same columns and all of them were used as merge keys, there are no extra columns left to show whether a row found a match in df1 or not. And since you are using how='left', the default is to keep every left row in the result.
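To make the difference concrete, a runnable sketch of the Name-only merge (frames rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({'Name': [1, 3, 4], 'Age': [2, 4, 5]})
df1 = pd.DataFrame({'Name': [5, 3, 4], 'Age': [6, 4, 7]})

# Age is no longer a join key, so it survives from both sides;
# unmatched left rows get NaN in Age_#
out = df.merge(df1, how='left', on='Name', suffixes=('', '_#'))
print(out)
```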

Best way to avoid merge nulls

Let's say I have these 2 pandas dataframes:
In [1]: import pandas as pd
In [3]: df1 = pd.DataFrame({'id':[None,20,None,40,50],'value':[1,2,3,4,5]})
In [4]: df2 = pd.DataFrame({'index':[None,20,None], 'value':[1,2,3]})
In [7]: df1
Out[7]: id value
0 NaN 1
1 20.0 2
2 NaN 3
3 40.0 4
4 50.0 5
In [8]: df2
Out[8]: index value
0 NaN 1
1 20.0 2
2 NaN 3
When I merge those dataframes (based on the id and index columns), the result includes rows where id and index have missing values.
df3 = df1.merge(df2, left_on='id', right_on = 'index', how='inner')
In [9]: df3
Out[9]: id value_x index value_y
0 NaN 1 NaN 1
1 NaN 1 NaN 3
2 NaN 3 NaN 1
3 NaN 3 NaN 3
4 20.0 2 20.0 2
Here is what I tried, but I guess it's not the best solution:
I replaced all the missing values in one dataframe's join column with some value, and did the same in the second dataframe with a different value. The purpose is that the comparison will return False and those rows will not appear in the result.
In [14]: df1_fill = df1.fillna({'id':'NONE1'})
In [13]: df2_fill = df2.fillna({'index':'NONE2'})
In [15]: df1_fill
Out[15]: id value
0 NONE1 1
1 20 2
2 NONE1 3
3 40 4
4 50 5
In [16]: df2_fill
Out[16]: index value
0 NONE2 1
1 20 2
2 NONE2 3
What is the best solution for this issue?
Also, in the example the data type of the join columns is numeric, but it could be another type, like text or date...
EDIT:
So, with the solutions here I can use the dropna function to drop the rows with missing values before the join. That works for an inner join, where I don't want those rows at all, but what about a left join or a full join?
Let's say I have those 2 dataframes I've used before - df1, df2.
So for inner and left joins I really can use the dropna function:
In [61]: df_inner = df1.dropna(subset=['id']).merge(df2.dropna(subset=['index']), left_on='id', right_on = 'index', how='inner')
In [62]: df_inner
Out[62]: id value_x index value_y
0 20.0 2 20.0 6
In [63]: df_left = df1.merge(df2.dropna(subset=['index']), left_on='id', right_on = 'index', how='left')
In [64]: df_left
Out[64]: id value_x index value_y
0 NaN 1 NaN NaN
1 20.0 2 20.0 6.0
2 NaN 3 NaN NaN
3 40.0 4 NaN NaN
4 50.0 5 NaN NaN
In [65]: df_full = df1.merge(df2, left_on='id', right_on = 'index', how='outer')
In [66]: df_full
Out[66]: id value_x index value_y
0 NaN 1 NaN 5.0
1 NaN 1 NaN 7.0
2 NaN 3 NaN 5.0
3 NaN 3 NaN 7.0
4 20.0 2 20.0 6.0
5 40.0 4 NaN NaN
6 50.0 5 NaN NaN
In the left join I dropped the missing-value rows from the "right" dataframe and then used merge.
That was OK, because in a left join, when the condition returns False you get nulls in the right-source columns anyway, so it doesn't matter whether those rows really exist or simply fail to match.
But for a full join I need all the rows from both sources.
I can't use dropna, because it would drop rows that I need, and if I don't use it I get a wrong result.
Thanks.
Why not do something like this:
pd.merge(df1.dropna(subset=['id']), df2.dropna(subset=['index']),
left_on='id',right_on='index', how='inner')
Output:
id value_x index value_y
0 20.0 2 20.0 2
If you don't want NaN values, you can drop them, i.e.
df3 = df1.merge(df2, left_on='id', right_on = 'index', how='inner').dropna()
or
df3 = df1.dropna().merge(df2.dropna(), left_on='id', right_on = 'index', how='inner')
Output:
id value_x index value_y
0 20.0 2 20.0 2
For an outer merge, drop after merging, i.e.
df_full = df1.merge(df2, left_on='id', right_on = 'index', how='outer').dropna(subset = ['id'])
Output:
id value_x index value_y
4 20.0 2 20.0 2.0
5 40.0 4 NaN NaN
6 50.0 5 NaN NaN
Since you are doing an 'inner' join, you could drop the rows in df1 where the id column is NaN before you merge.
df1_nonan = df1.dropna(subset = ['id'])
df3 = df1_nonan.merge(df2, left_on='id', right_on = 'index', how='inner')
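For the full-join case raised in the edit, one possible sketch (the column renames are my own) is to merge only the non-null keys and then concatenate the null-key rows from both sides back in, unmatched:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [None, 20, None, 40, 50], 'value': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'index': [None, 20, None], 'value': [1, 2, 3]})

# Outer merge on the non-null keys only, so NaN never matches NaN
merged = df1.dropna(subset=['id']).merge(
    df2.dropna(subset=['index']),
    left_on='id', right_on='index', how='outer')

# Re-attach the null-key rows from each side, aligned to the merged columns
left_nan = df1[df1['id'].isna()].rename(columns={'value': 'value_x'})
right_nan = df2[df2['index'].isna()].rename(columns={'value': 'value_y'})
df_full = pd.concat([merged, left_nan, right_nan], ignore_index=True, sort=False)
print(df_full)
```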
