How to merge dataframes together with matching columns side by side? - python

I have two dataframes with matching keys. I would like to merge them together based on their keys and have the corresponding columns line up side by side. I am not sure how to achieve this as the pd.merge displays all columns for the first dataframe and then all columns for the second data frame:
df1 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'key': ['a', 'b'], 'col1': [5, 6], 'col2': [7, 8]})
print(pd.merge(df1, df2, on=['key']))
key col1_x col2_x col1_y col2_y
0 a 1 3 5 7
1 b 2 4 6 8
I am looking for a way to do the same merge and have the columns displayed side by side, like this:
key col1_x col1_y col2_x col2_y
0 a 1 5 3 7
1 b 2 6 4 8
Any help achieving this would be greatly appreciated!

If you're OK with a bit of a shuffle, you can sort the columns:
df = pd.merge(df1, df2, on=['key'])
df = df.reindex(columns = sorted(df.columns))
Or you could do this to keep the key in front:
df = pd.merge(df1, df2, on=['key'])
# collect every column except the key, then sort them so matching names sit together
cols = list(df.columns)
cols.remove('key')
df = df.reindex(columns=['key'] + sorted(cols))
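For reference, here is a minimal end-to-end sketch of the sorting idea using the data from the question. The names `merged`, `value_cols` and `side_by_side` are only illustrative. It relies on the default `_x`/`_y` suffixes sorting next to each other alphabetically, which may not hold for arbitrary column names.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'key': ['a', 'b'], 'col1': [5, 6], 'col2': [7, 8]})

# Overlapping non-key columns get the default suffixes _x and _y.
merged = pd.merge(df1, df2, on='key')

# Sort everything except the key so col1_x/col1_y and col2_x/col2_y pair up.
value_cols = sorted(c for c in merged.columns if c != 'key')
side_by_side = merged[['key'] + value_cols]

print(side_by_side.columns.tolist())
```

If the column names do not sort into the desired pairs, the column list can be built explicitly instead of sorted.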

Related

Left Join with multiple columns as the Key in Pandas Dataframe [duplicate]

I've two pandas data frames that have some rows in common.
Suppose dataframe2 is a subset of dataframe1.
How can I get the rows of dataframe1 which are not in dataframe2?
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
df2
col1 col2
0 1 10
1 2 11
2 3 12
Expected result:
col1 col2
3 4 13
4 5 14
The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.
First, we need to modify the original DataFrame to add the row with data [3, 10].
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean condition:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
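To finish the job, the boolean condition can be used to filter df_all and drop the helper column. A minimal sketch, reusing the six-row df1 from above; the name `only_in_df1` is just illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Left join against the de-duplicated df2; _merge records where each row was found.
df_all = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                   how='left', indicator=True)

# Keep the rows found only in df1, then drop the helper column.
only_in_df1 = df_all[df_all['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df1)
```

Note that merge returns a fresh index, so the row labels match those of df1 here only because df1 uses the default index.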
Why other solutions are wrong
A few solutions make the same mistake - they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns from df2 exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
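This solution gets the same wrong result (note that the 'l' shorthand for to_dict was removed in pandas 2.0, so 'list' is spelled out here):
df1.isin(df2.to_dict('list')).all(axis=1)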
One method would be to store the result of an inner merge from both dfs; then we can simply select the rows where the column values are not in this common result:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another method as you've found is to use isin which will produce NaN rows which you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
However, if the rows of df2 do not line up with those of df1 (same values at the same index positions), this won't work:
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):
df1[~df1.index.isin(df2.index)]
As already hinted at, isin requires columns and indices to be the same for a match. If match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.
pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
could alternatively be used to create the indices, though I doubt this is more efficient.
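As a sketch of the same row-contents idea, `pandas.MultiIndex.from_frame` can build the row index directly from the columns. This is an alternative spelling rather than the answer's exact code, and the names `rows1`, `rows2` and `in_df2` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 3, 4],
                    'col2': [10, 12, 13]})

cols = ['col1', 'col2']

# Each row becomes one tuple-like entry of a MultiIndex.
rows1 = pd.MultiIndex.from_frame(df1[cols])
rows2 = pd.MultiIndex.from_frame(df2[cols])

# Boolean mask: True where the whole row of df1 also occurs in df2.
in_df2 = rows1.isin(rows2)

result = df1[~in_df2]
print(result)
```

Because the mask is applied to df1 itself, the original index labels of df1 are preserved in the result.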
Suppose you have two dataframes, df_1 and df_2, with multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps:
Step 1. Add a column key1 to df_1 and a column key2 to df_2.
Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.
Step 3. Select only those rows from df_1 where key1 is not equal to key2.
Step 4. Drop key1 and key2.
This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
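The steps above can be sketched on a small made-up dataset. The sample values and the `other` column are assumptions for illustration. Unlike the snippet above, this variant uses `assign` so the input frames are not modified, and it de-duplicates df_2 on the key fields so rows of df_1 are not multiplied by the join.

```python
import pandas as pd

# Hypothetical sample data; only field_x and field_y are compared.
df_1 = pd.DataFrame({'field_x': [1, 2, 3],
                     'field_y': ['a', 'b', 'c'],
                     'other': [0.1, 0.2, 0.3]})
df_2 = pd.DataFrame({'field_x': [2, 9],
                     'field_y': ['b', 'z']})

keys = ['field_x', 'field_y']

# Step 1: add marker columns without touching the originals.
left = df_1.assign(key1=1)
right = df_2[keys].drop_duplicates().assign(key2=1)

# Step 2: left merge; key2 is NaN wherever df_2 has no matching row.
merged = left.merge(right, on=keys, how='left')

# Steps 3 and 4: keep the unmatched rows and drop the markers.
not_in_df_2 = (merged[merged['key2'] != merged['key1']]
               .drop(columns=['key1', 'key2']))
print(not_in_df_2)
```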
A bit late, but it might be worth checking the "indicator" parameter of pd.merge.
See this other question for an example:
Compare PandaS DataFrames and return rows that are missing from the first one
This is the best way to do it:
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']
Note that drop_duplicates is used to minimize the comparisons; it would work without it as well. The best way is to compare the row contents themselves, not the index or one or two columns. The same code can be used with other filters such as 'both' and 'right_only' to achieve similar results. With this syntax the dataframes can have any number of columns and even different indices; the columns only need to occur in both dataframes.
Why is this the best way?
index.difference only works for comparisons based on a unique index.
pandas.concat() coupled with drop_duplicates() is not ideal because it will also get rid of rows that exist only in the dataframe you want to keep and are duplicated for valid reasons.
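The concat pitfall can be seen on a small made-up example in which df1 legitimately contains the same row twice. This is only a sketch: the data is invented, and the merge variant here deliberately does not de-duplicate df1 so that the repeated row survives.

```python
import pandas as pd

# Hypothetical data: the row (4, 13) appears twice in df1 and never in df2.
df1 = pd.DataFrame({'col1': [1, 2, 4, 4], 'col2': [10, 11, 13, 13]})
df2 = pd.DataFrame({'col1': [1, 2], 'col2': [10, 11]})

# concat + drop_duplicates(keep=False) removes every row that occurs more than once,
# including the repeated (4, 13) rows that exist only in df1.
via_concat = pd.concat([df1, df2]).drop_duplicates(keep=False)

# The indicator-based left join keeps both copies of (4, 13).
via_merge = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                      how='left', indicator=True)
via_merge = via_merge[via_merge['_merge'] == 'left_only'].drop(columns='_merge')

print(len(via_concat), len(via_merge))
```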
I think those answers containing merging are extremely slow. Therefore I would suggest another way of getting those rows which are different between the two dataframes:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you need to compare whole rows, where all columns must be equal, do not use this approach.
Let's say col1 is a kind of ID, and you only want the rows of df1 whose ID does not appear in df2:
ids_in_df2 = df2.col1.unique()
not_found_ids = df1[~df1['col1'].isin(ids_in_df2)]
And that's it. You get a dataframe containing only those rows of df1 whose col1 value does not appear in df2.
You can also concat df1, df2:
x = pd.concat([df1, df2])
and then remove all duplicates:
y = x.drop_duplicates(keep=False, inplace=False)
I have an easier way in 2 simple steps:
As the OP mentioned, suppose dataframe2 is a subset of dataframe1 and the columns in the two dataframes are the same:
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
### Step 1: append the 2nd df to the end of the 1st df
### (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_both = pd.concat([df1, df2])
### Step 2: drop every row that has a duplicate, keeping none of the copies
df_dif = df_both.drop_duplicates(keep=False)
## mission accomplished!
df_dif
Out[20]:
col1 col2
3 4 13
4 5 14
5 3 10
You can do it using the isin(dict) method:
In [74]: df1[~df1.isin(df2.to_dict('list')).all(axis=1)]
Out[74]:
col1 col2
3 4 13
4 5 14
Explanation:
In [75]: df2.to_dict('list')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
In [76]: df1.isin(df2.to_dict('list'))
Out[76]:
col1 col2
0 True True
1 True True
2 True True
3 False False
4 False False
In [77]: df1.isin(df2.to_dict('list')).all(axis=1)
Out[77]:
0 True
1 True
2 True
3 False
4 False
dtype: bool
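Here is another way of solving this. Note that merge returns a fresh 0..n-1 index, so with a default index on df1 this only selects the right rows when the matching rows are the first rows of df1, as in the example: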
df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Or:
df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Extract the dissimilar rows using the merge function:
df = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
Save the dissimilar rows to a CSV file:
df[df['_merge'] == 'left_only'].to_csv('output.csv')
My way of doing this involves adding a new column that is unique to one dataframe and using it to choose whether to keep an entry:
df2['Empt'] = 1
nonuni = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
nonuni['Empt'] = nonuni['Empt'].fillna(0)
This gives every entry a code: 0 if it is unique to df1, 1 if it appears in df2. You then use this to restrict to what you want:
answer = nonuni[nonuni['Empt'] == 0]
How about this:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5],
'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
records_df2 = set([tuple(row) for row in df2.values])
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
Easier, simpler and elegant
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices,:]
pd.concat([df1, df2]).drop_duplicates(keep=False) will concatenate the two DataFrames together, and then drop all the duplicates, keeping only the unique rows. By default it will keep the first occurrence of the duplicate, but setting keep=False will drop all the duplicates.
Keep in mind that if you need to compare the DataFrames with columns with different names, you will have to make sure the columns have the same name before concatenating the dataframes.
Also, if the dataframes have a different order of columns, it will also affect the final result.

pandas merge and update efficiently

I am getting df1 from the database.
df2 needs to be merged into df1. df1 contains additional columns not present in df2. df2 contains indexes that are already present in df1, and those rows need to be updated. The dataframes are multi-indexed.
What I want:
- keep rows in df1 that are not in df2
- update df1's values with df2's values for matching indexes
- in the updated rows, keep the values of the columns that are not present in df2
- append rows that are in df2 but not in df1
My Solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
data={'idx1': ['A', 'B', 'C', 'D', 'E'], 'idx2': [1, 2, 3, 4, 5], 'one': ['df1', 'df1', 'df1', 'df1', 'df1'],
'two': ["y", "x", "y", "x", "y"]})
df2 = pd.DataFrame(data={'idx1': ['D', 'E', 'F', 'G'], 'idx2': [4, 5, 6, 7], 'one': ['df2', 'df2', 'df2', 'df2']})
desired_result = pd.DataFrame(data={'idx1': ['A', 'B', 'C', 'D', 'E', 'F', 'G'], 'idx2': [1, 2, 3, 4, 5, 6, 7],
'one': ['df1','df1','df1','df2', 'df2', 'df2', 'df2'], 'two': ["y", "x", "y", "x", "y",np.nan,np.nan]})
updated = pd.merge(df1[['idx1', 'idx2']], df2, on=['idx1', 'idx2'], how='right')
keep = df1[~df1.isin(df2)].dropna()
my_res = pd.concat([updated, keep])
my_res.drop(columns='two', inplace=True)
my_res = pd.merge(my_res,df1[['idx1','idx2','two']], on=['idx1','idx2'])
This is very inefficient, as I:
merge df2 into the index-only columns of df1 with a right outer join
find indexes that are in df2 but not in df1
concat the two dataframes
drop the columns that were not included in df2
merge on index to append those columns that I previously dropped
Is there maybe a more efficient, easier way to do this? I just cannot wrap my head around this.
EDIT:
By multi-indexed I mean that to identify a row I need to look at 4 different columns combined.
And unfortunately my solution does not work properly.
Merge the dataframes, update the column one with the values from one_, then drop this temporary column.
df = df1.merge(df2, on=['idx1', 'idx2'], how='outer', suffixes=('', '_'))
# take one_ where df2 supplied a value, otherwise keep the existing value of one
df['one'] = df['one_'].fillna(df['one'])
>>> df.drop(columns=['one_'])
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
Using pd.concat (DataFrame.append was removed in pandas 2.0), DataFrame.drop_duplicates and Series.fillna:
First we concatenate df1 and df2. Then we drop the duplicates based on columns idx1 and idx2. Finally we fill the NaN values in column two from the existing values in df1.
df3 = (pd.concat([df1, df2], sort=False)
.drop_duplicates(subset=['idx1', 'idx2'], keep='last')
.reset_index(drop=True))
df3['two'] = df3['two'].fillna(df1['two'])
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
One line combine_first
Yourdf=df2.set_index(['idx1','idx2']).combine_first(df1.set_index(['idx1','idx2'])).reset_index()
Yourdf
Out[216]:
idx1 idx2 one two
0 A 1 df1 y
1 B 2 df1 x
2 C 3 df1 y
3 D 4 df2 x
4 E 5 df2 y
5 F 6 df2 NaN
6 G 7 df2 NaN
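The combine_first one-liner can be spelled out step by step and checked against the requirements from the question. A minimal sketch using the question's data; the names `keys` and `result` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'idx1': ['A', 'B', 'C', 'D', 'E'],
                    'idx2': [1, 2, 3, 4, 5],
                    'one': ['df1'] * 5,
                    'two': ['y', 'x', 'y', 'x', 'y']})
df2 = pd.DataFrame({'idx1': ['D', 'E', 'F', 'G'],
                    'idx2': [4, 5, 6, 7],
                    'one': ['df2'] * 4})

keys = ['idx1', 'idx2']

# df2's values take priority; gaps (missing rows or columns) are filled from df1.
result = (df2.set_index(keys)
             .combine_first(df1.set_index(keys))
             .reset_index())

print(result)
```

Rows only in df1 are kept, matching rows take df2's `one` while keeping df1's `two`, and rows only in df2 are appended with NaN in `two`. One caveat: because combine_first only fills missing values, a NaN in df2 would not overwrite an existing value in df1.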

Comparing two different dataframes of different sizes using Pandas [duplicate]

I've two pandas data frames that have some rows in common.
Suppose dataframe2 is a subset of dataframe1.
How can I get the rows of dataframe1 which are not in dataframe2?
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
df2
col1 col2
0 1 10
1 2 11
2 3 12
Expected result:
col1 col2
3 4 13
4 5 14
The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.
First, we need to modify the original DataFrame to add the row with data [3, 10].
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean condition:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
Why other solutions are wrong
A few solutions make the same mistake - they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns from df2 exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution gets the same wrong result:
df1.isin(df2.to_dict('l')).all(1)
One method would be to store the result of an inner merge form both dfs, then we can simply select the rows when one column's values are not in this common:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another method as you've found is to use isin which will produce NaN rows which you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
However if df2 does not start rows in the same manner then this won't work:
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):
df1[~df1.index.isin(df2.index)]
As already hinted at, isin requires columns and indices to be the same for a match. If match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.
pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
could alternatively be used to create the indices, though I doubt this is more efficient.
Suppose you have two dataframes, df_1 and df_2 having multiple fields(column_names) and you want to find the only those entries in df_1 that are not in df_2 on the basis of some fields(e.g. fields_x, fields_y), follow the following steps.
Step1.Add a column key1 and key2 to df_1 and df_2 respectively.
Step2.Merge the dataframes as shown below. field_x and field_y are our desired columns.
Step3.Select only those rows from df_1 where key1 is not equal to key2.
Step4.Drop key1 and key2.
This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
a bit late, but it might be worth checking the "indicator" parameter of pd.merge.
See this other question for an example:
Compare PandaS DataFrames and return rows that are missing from the first one
This is the best way to do it:
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']
Note that drop duplicated is used to minimize the comparisons. It would work without them as well. The best way is to compare the row contents themselves and not the index or one/two columns and same code can be used for other filters like 'both' and 'right_only' as well to achieve similar results. For this syntax dataframes can have any number of columns and even different indices. Only the columns should occur in both the dataframes.
Why this is the best way?
index.difference only works for unique index based comparisons
pandas.concat() coupled with drop_duplicated() is not ideal because it will also get rid of the rows which may be only in the dataframe you want to keep and are duplicated for valid reasons.
I think those answers containing merging are extremely slow. Therefore I would suggest another way of getting those rows which are different between the two dataframes:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you are interested only in those rows, where all columns are equal do not use this approach.
Let's say, col1 is a kind of ID, and you only want to get those rows, which are not contained in both dataframes:
ids_in_df2 = df2.col1.unique()
not_found_ids = df[~df['col1'].isin(ids_in_df2 )]
And that's it. You get a dataframe containing only those rows where col1 isn't appearent in both dataframes.
You can also concat df1, df2:
x = pd.concat([df1, df2])
and then remove all duplicates:
y = x.drop_duplicates(keep=False, inplace=False)
I have an easier way in 2 simple steps:
As the OP mentioned Suppose dataframe2 is a subset of dataframe1, columns in the 2 dataframes are the same,
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
### Step 1: just append the 2nd df at the end of the 1st df
df_both = df1.append(df2)
### Step 2: drop rows which contain duplicates, Drop all duplicates.
df_dif = df_both.drop_duplicates(keep=False)
## mission accompliched!
df_dif
Out[20]:
col1 col2
3 4 13
4 5 14
5 3 10
you can do it using isin(dict) method:
In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
col1 col2
3 4 13
4 5 14
Explanation:
In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
col1 col2
0 True True
1 True True
2 True True
3 False False
4 False False
In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0 True
1 True
2 True
3 False
4 False
dtype: bool
Here is another way of solving this:
df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Or:
df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
extract the dissimilar rows using the merge function
df = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
save the dissimilar rows in CSV
df[df['_merge'] == 'left_only'].to_csv('output.csv')
My way of doing this involves adding a marker column that exists in only one dataframe and using it to decide whether to keep an entry:
df2['Empt'] = 1
nonuni = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
nonuni['Empt'] = nonuni['Empt'].fillna(0)
This gives every entry of the merged frame a code: 0 if it is unique to df1, 1 if it also occurs in df2. You then use this to restrict to what you want:
answer = nonuni[nonuni['Empt'] == 0]
How about this:
import numpy as np
import pandas
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5],
'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
# collect every row of df2 as a tuple so membership tests are cheap
records_df2 = {tuple(row) for row in df2.values}
# True for each row of df1 that also occurs in df2
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
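Easier, simpler and more elegant, as long as matching rows carry the same index labels in both dataframes (only the labels are compared, not the row contents):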
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices,:]
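A quick sketch of why the index labels matter here. The df2 below is a made-up variation of the question's data: it holds the last two rows of df1 but, like any freshly built frame, is labelled 0 and 1.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                    'col2': [10, 11, 12, 13, 14]})
# Same contents as the last two rows of df1, but with default labels 0 and 1.
df2 = pd.DataFrame({'col1': [4, 5],
                    'col2': [13, 14]})

# Labels of df1 that are not labels of df2; the row contents are never looked at.
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
by_label = df1.loc[uncommon_indices, :]
print(by_label)
```

The rows labelled 2, 3 and 4 come back, including the two rows that do exist in df2, so this shortcut is only safe when df2 was sliced out of df1 and kept its labels.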
pd.concat([df1, df2]).drop_duplicates(keep=False) concatenates the two DataFrames and then drops every row that occurs more than once, leaving only the rows that appear exactly once. By default drop_duplicates keeps the first occurrence of each duplicate; setting keep=False drops all occurrences.
Keep in mind that if you need to compare the DataFrames with columns with different names, you will have to make sure the columns have the same name before concatenating the dataframes.
Also, if the dataframes have a different order of columns, it will also affect the final result.

Getting dataframe records that do not exist in second data frame [duplicate]

I've two pandas data frames that have some rows in common.
Suppose dataframe2 is a subset of dataframe1.
How can I get the rows of dataframe1 which are not in dataframe2?
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
df2
col1 col2
0 1 10
1 2 11
2 3 12
Expected result:
col1 col2
3 4 13
4 5 14
The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.
First, we need to modify the original DataFrame to add the row with data [3, 10].
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean condition:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
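To finish the job, the boolean condition can be used to filter the rows. Below is a minimal sketch that wraps the steps above in a helper; the function name rows_not_in is made up for this example, and it assumes every column of the right frame also exists in the left one.

```python
import pandas as pd

def rows_not_in(left, right):
    """Return the rows of `left` that have no identical row in `right`.

    Hypothetical helper: compares on all columns of `right`.
    """
    cols = list(right.columns)
    # De-duplicating `right` keeps the left join one-to-one, so the merged
    # frame has exactly one row per row of `left`, in the same order.
    merged = left.merge(right.drop_duplicates(), on=cols,
                        how='left', indicator=True)
    # Apply the mask by position so the original labels of `left` survive.
    mask = (merged['_merge'] == 'left_only').to_numpy()
    return left[mask]

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

result = rows_not_in(df1, df2)
print(result)
```

Filtering df_all directly with df_all[df_all['_merge'] == 'left_only'] gives the same rows; the helper just avoids carrying the _merge column around.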
Why other solutions are wrong
A few solutions make the same mistake: they only check that each value occurs somewhere in the corresponding column, not that the values occur together in the same row. Adding the last row, which is unique but whose values both occur in the columns of df2, exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution gets the same wrong result:
df1.isin(df2.to_dict('list')).all(axis=1)
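The difference between the two kinds of test can be seen side by side. This is only an illustrative sketch; the row-wise check via a set of tuples is one of several ways to compare whole rows.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Column-wise: is each value found somewhere in the same-named column of df2?
columnwise = df1.isin(df2.to_dict('list')).all(axis=1)

# Row-wise: does the (col1, col2) pair occur as a row of df2?
pairs_in_df2 = set(df2.itertuples(index=False, name=None))
rowwise = pd.Series(
    [row in pairs_in_df2 for row in df1.itertuples(index=False, name=None)],
    index=df1.index,
)

print(columnwise.tolist())
print(rowwise.tolist())
```

The two masks agree on every row except the last one, (3, 10): both values occur in df2, but never together.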
One method would be to store the result of an inner merge from both dfs; then we can simply select the rows whose column values do not appear in this common frame:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another method as you've found is to use isin which will produce NaN rows which you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
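However, isin with a DataFrame argument only matches cells that share the same index label and column name, so if the rows of df2 do not line up with those of df1 this won't work: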
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
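The reason the whole frame comes back is label alignment: when isin is given a DataFrame, a cell only counts as a match if the other frame holds the same value under the same index label and column name. A small sketch, where the re-labelling step is purely for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                    'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [2, 3, 4],
                    'col2': [11, 12, 13]})

# Label 0 of df1 is compared with label 0 of df2, i.e. (1, 10) with (2, 11),
# and so on; labels 3 and 4 have no counterpart in df2 at all.
aligned = df1.isin(df2)
print(aligned)

# Giving df2 the labels of the equal rows in df1 makes the matches visible.
df2_relabelled = df2.copy()
df2_relabelled.index = [1, 2, 3]
matches = df1.isin(df2_relabelled).all(axis=1)
print(matches.tolist())
```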
Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):
df1[~df1.index.isin(df2.index)]
As already hinted at, isin requires columns and indices to be the same for a match. If match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.
pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
could alternatively be used to create the indices, though I doubt this is more efficient.
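Another way to build the same row index, sketched here under the assumption that pandas 0.24 or later is available, is MultiIndex.from_frame, which avoids the explicit set_index calls:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 3, 4],
                    'col2': [10, 12, 13]})

# One index entry per row, built from the row contents only.
rows1 = pd.MultiIndex.from_frame(df1)
rows2 = pd.MultiIndex.from_frame(df2)

# Boolean array: True where a row of df1 also occurs in df2.
in_df2 = rows1.isin(rows2)
result = df1[~in_df2]
print(result)
```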
Suppose you have two dataframes, df_1 and df_2, with multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps:
Step 1. Add a column key1 to df_1 and a column key2 to df_2.
Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.
Step 3. Select only those rows from df_1 where key1 is not equal to key2.
Step 4. Drop key1 and key2.
This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
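The snippet above adds key columns to both inputs and overwrites df_1. If the inputs should stay untouched, a variation along the following lines might work; the sample data and the column named other are invented for the example.

```python
import pandas as pd

# Made-up sample data; only field_x and field_y are used for matching.
df_1 = pd.DataFrame({'field_x': [1, 2, 3],
                     'field_y': ['a', 'b', 'c'],
                     'other': [0.1, 0.2, 0.3]})
df_2 = pd.DataFrame({'field_x': [1, 3],
                     'field_y': ['a', 'c']})

keys = ['field_x', 'field_y']

# Mark every key combination of df_2 on a copy; drop_duplicates keeps the
# join one-to-one so no row of df_1 is repeated.
marked = df_2[keys].drop_duplicates().assign(key2=1)
merged = df_1.merge(marked, on=keys, how='left')

# Rows without a partner in df_2 have NaN in key2.
only_in_df_1 = merged[merged['key2'].isna()].drop(columns='key2')
print(only_in_df_1)
```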
A bit late, but it might be worth checking the "indicator" parameter of pd.merge.
See this other question for an example:
Compare PandaS DataFrames and return rows that are missing from the first one
This is the best way to do it:
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']
Note that drop_duplicates is used to minimize the comparisons; the filter would work without it as well. The best way is to compare the row contents themselves rather than the index or one or two columns. The same code can be used with other filters such as 'both' and 'right_only' to achieve similar results (for 'right_only' the merge needs how='outer' or how='right', since a left join never produces such rows). With this syntax the dataframes can have any number of columns and even different indices; the only requirement is that the columns of df2 also occur in df1.
Why is this the best way?
index.difference only works for comparisons based on a unique index.
pandas.concat() coupled with drop_duplicates() is not ideal because it will also get rid of rows that occur only in the dataframe you want to keep and are duplicated for valid reasons.
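As a rough illustration of the other filters, here is the same merge with how='outer', so that all three indicator values can occur. The data and the helper name pick are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                    'col2': [10, 11, 12, 13, 14]})
# Hypothetical df2 with one row, (6, 15), that df1 does not have.
df2 = pd.DataFrame({'col1': [1, 2, 6],
                    'col2': [10, 11, 15]})

merged = df1.drop_duplicates().merge(df2.drop_duplicates(),
                                     on=df2.columns.to_list(),
                                     how='outer', indicator=True)

def pick(label):
    """Rows whose indicator equals `label`, without the _merge column."""
    return merged.loc[merged['_merge'] == label, merged.columns != '_merge']

only_left = pick('left_only')    # rows found only in df1
in_both = pick('both')           # rows found in both frames
only_right = pick('right_only')  # rows found only in df2
print(only_left, in_both, only_right, sep='\n')
```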
I think those answers containing merging are extremely slow. Therefore I would suggest another way of getting those rows which are different between the two dataframes:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you need whole rows to match, i.e. all columns to be equal, do not use this approach.
Let's say col1 is a kind of ID, and you only want the rows of df1 whose ID does not appear in df2:
ids_in_df2 = df2.col1.unique()
not_found_ids = df1[~df1['col1'].isin(ids_in_df2)]
And that's it. You get a dataframe containing only those rows of df1 whose col1 value does not appear in df2.
You can also concat df1, df2:
x = pd.concat([df1, df2])
and then remove all duplicates:
y = x.drop_duplicates(keep=False, inplace=False)
I have an easier way in 2 simple steps:
As the OP mentioned, suppose dataframe2 is a subset of dataframe1 and the columns in the 2 dataframes are the same:
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
### Step 1: append the 2nd df to the end of the 1st df
### (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_both = pd.concat([df1, df2])
### Step 2: drop every row that occurs more than once, keeping none of the copies
df_dif = df_both.drop_duplicates(keep=False)
## mission accomplished!
df_dif
Out[20]:
col1 col2
3 4 13
4 5 14
5 3 10
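You can do it using the isin(dict) method ('list' is spelled out because the abbreviated orient 'l' is no longer accepted by recent pandas versions):
In [74]: df1[~df1.isin(df2.to_dict('list')).all(axis=1)]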
Out[74]:
col1 col2
3 4 13
4 5 14
Explanation:
In [75]: df2.to_dict('list')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
In [76]: df1.isin(df2.to_dict('list'))
Out[76]:
col1 col2
0 True True
1 True True
2 True True
3 False False
4 False False
In [77]: df1.isin(df2.to_dict('list')).all(axis=1)
Out[77]:
0 True
1 True
2 True
3 False
4 False
dtype: bool
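Here is another way of solving this (merge returns a fresh 0..n-1 index, so this only picks the right rows when the matching rows are the first rows of df1, as they are in the question's data):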
df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Or:
df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Extract the dissimilar rows using the merge function:
df = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
Save the dissimilar rows to a CSV file:
df[df['_merge'] == 'left_only'].to_csv('output.csv')
My way of doing this involves adding a marker column that exists in only one dataframe and using it to decide whether to keep an entry:
df2['Empt'] = 1
nonuni = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
nonuni['Empt'] = nonuni['Empt'].fillna(0)
This gives every entry of the merged frame a code: 0 if it is unique to df1, 1 if it also occurs in df2. You then use this to restrict to what you want:
answer = nonuni[nonuni['Empt'] == 0]
How about this:
import numpy as np
import pandas
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5],
'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
# collect every row of df2 as a tuple so membership tests are cheap
records_df2 = {tuple(row) for row in df2.values}
# True for each row of df1 that also occurs in df2
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
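Easier, simpler and more elegant, as long as matching rows carry the same index labels in both dataframes (only the labels are compared, not the row contents):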
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices,:]
pd.concat([df1, df2]).drop_duplicates(keep=False) concatenates the two DataFrames and then drops every row that occurs more than once, leaving only the rows that appear exactly once. By default drop_duplicates keeps the first occurrence of each duplicate; setting keep=False drops all occurrences.
Keep in mind that if you need to compare the DataFrames with columns with different names, you will have to make sure the columns have the same name before concatenating the dataframes.
Also, if the dataframes have a different order of columns, it will also affect the final result.
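Two things are worth trying out before relying on this one-liner, sketched below with made-up data: rows that are legitimately repeated inside df1 disappear as well, and differently named columns have to be renamed first (the names a and b are hypothetical).

```python
import pandas as pd

# (4, 13) occurs twice in df1 and never in df2.
df1 = pd.DataFrame({'col1': [1, 2, 4, 4],
                    'col2': [10, 11, 13, 13]})
df2 = pd.DataFrame({'col1': [1, 2],
                    'col2': [10, 11]})

# Every row occurs at least twice after concatenation, so nothing is left,
# not even the repeated (4, 13) rows that exist only in df1.
naive = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(len(naive))

# Same data as df2 under different column names.
df3 = pd.DataFrame({'a': [1, 2], 'b': [10, 11]})
renamed = df3.rename(columns={'a': 'col1', 'b': 'col2'})

# De-duplicating df1 first keeps one copy of (4, 13) in the result.
diff = pd.concat([df1.drop_duplicates(), renamed]).drop_duplicates(keep=False)
print(diff)
```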
