I have a pandas dataframe full of tuple (it could be the same with arrays) and I would like to split all the columns into even more columns (each array or tuple has the same length).
Let's take this as an example:
df=pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], df.columns=['column0', 'column1'])
which outputs:
column0 column1
0 (1, 2) (3, 4)
1 (5, 6) (7, 8)
I tried to build over this solution here(https://stackoverflow.com/a/16245109/4218755) using derivates off the expression:
df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})
like
df.column0.apply(lambda s: pd.Series({'feature1':s[0], 'feature2':s[1]})
which outputs:
feature1 feature2
0 1 2
1 5 6
This is the desired behavior. So it works well, but if I happen to try to use
df2=df[df.columns].apply(lambda s: pd.Series({'feature1':s[0], 'feature2':s[1]}))
then df2 is:
colonne0 colonne1
feature1 (1, 2) (3, 4)
feature2 (5, 6) (7, 8)
which is obviously wrong. I can't either apply on df, it output the same result as df2.
How to apply such splitting technique to a whole dataframe, and are there alternatives?
Thanks
You could extract the DataFrame values as a NumPy array, use IT.chain.from_iterable to extract the ints from the tuples, and then reshape and rebuild the array into a new DataFrame:
import itertools as IT
import numpy as np
import pandas as pd
df = pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])
arr = df.values
arr = np.array(list(IT.chain.from_iterable(arr))).reshape(len(df), -1)
result = pd.DataFrame(arr)
yields
0 1 2 3
0 1 2 3 4
1 5 6 7 8
By the way, you might have fallen into an XY-trap -- you're asking for X when
you really should be looking for Y. Instead of trying to transform df into
result, it might be easier to build the desired DataFrame, result, from
the original data source.
For example, if your original data is a list of lists of tuples:
data = [[(1,2),(3,4)],[(5,6),(7,8)]]
Then the desired DataFrame could be built using
df = pd.DataFrame(np.array(data).reshape(2,-1))
# 0 1 2 3
# 0 1 2 3 4
# 1 5 6 7 8
Once you have non-NumPy-native data types in your DataFrame
(such as tuples), you are doomed to using at least one Python loop to extract
the ints from the tuples. (I'm regarding things like df.apply(func) and
list(IT.chain.from_iterable(arr)) as essentially Python loops since they work
at Python-loop speed.)
IIUC you can use:
df=pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])
print (df)
column0 column1
0 (1, 2) (3, 4)
1 (5, 6) (7, 8)
for col in df.columns:
df[col]=df[col].apply(lambda s: pd.Series({'feature1':s[0], 'feature2':s[1]}))
print (df)
column0 column1
0 1 3
1 5 7
You may iterate over each column you want to split and assign the new columns to your DataFrame:
import pandas as pd
df=pd.DataFrame( [ [ (1,2), (3,4)],
[ (5,6), (7,8)] ], columns=['column0', 'column1'])
# empty DataFrame
df2 = pd.DataFrame()
for col in df.columns:
# names of new columns
feature_columns = [ "{col}_feature1".format(col=col), "{col}_feature2".format(col=col) ]
# split current column
df2[ feature_columns ] = df[ col ].apply(lambda s: pd.Series({ feature_columns[0]: s[0],
feature_columns[1]: s[1]} ) )
print df2
which gives
column0_feature1 column0_feature2 column1_feature1 column2_feature2
0 1 2 3 4
1 5 6 7 8
Related
A simple pandas question:
Is there a drop_duplicates() functionality to drop every row involved in the duplication?
An equivalent question is the following: Does pandas have a set difference for dataframes?
For example:
In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
In [7]: df1
Out[7]:
col1 col2
0 1 2
1 2 3
2 3 4
In [8]: df2
Out[8]:
col1 col2
0 4 6
1 2 3
2 5 5
so maybe something like df2.set_diff(df1) will produce this:
col1 col2
0 4 6
2 5 5
However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.
By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.
Thanks!
Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
This step will get rid of any duplicates in the dataframes as well (index ignored)
set([(1, 2), (3, 4), (2, 3)]) # ds1
can then use set methods to find anything. Eg to find differences:
ds1.difference(ds2)
gives:
set([(1, 2), (3, 4)])
can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:
pd.DataFrame(list(ds1.difference(ds2)))
Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)
pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
It is fast and the result is
col1 col2
0 4 6
2 5 5
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))
Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)
In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 4 6
2 5 5
Same thing, but include all values in df1, but still per column in df2
In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]:
col1 col2
0 NaN 6
2 5 5
2nd example
In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})
In [35]: g.columns=df1.columns
In [36]: g
Out[36]:
col1 col2
0 1.2 4
1 1.5 4
2 1.3 4
In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 1.2 NaN
1 1.5 NaN
2 1.3 NaN
Note, in 0.13, there will be an isin operator on the frame level, so something like: df2.isin(df1) should be possible
There are 3 methods which work, but two of them have some flaws.
Method 1 (Hash method):
It worked for all cases I tested.
df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]
Method 2 (Dict method):
It fails if DataFrames contain datetime columns.
df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]
Method 3 (MultiIndex method):
I encountered cases when it failed on columns with None's or NaN's.
df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)
Edit: You can now make MultiIndex objects directly from data frames as of pandas 0.24.0 which greatly simplifies the syntax of this answer
df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Original Answer
Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference() method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.
df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Numpy's setdiff1d would work and perhaps be faster.
For each column:
np.setdiff1(df1.col1.values, df2.col1.values)
So something like:
setdf = pd.DataFrame({
col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values)
for col in df1.columns
})
numpy.setdiff1d docs
Get the indices of the intersection with a merge, then drop them:
>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
A B
0 0 1
1 2 3
2 4 5
3 6 7
>>> df_completed = df_all.iloc[::2]; df_completed
A B
0 0 1
2 4 5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
index A B
0 0 0 1
1 2 4 5
>>> df_pending = df_all.drop(merged['index']); df_pending
A B
1 2 3
3 6 7
Assumption:
df1 and df2 have identical columns
it is a set operation so duplicates are ignored
sets are not extremely large so you do not worry about memory
union = pd.concat([df1,df2])
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]
I'm not sure how pd.concat() implicitly joins overlapping columns but I had to do a little tweak on #radream's answer.
Conceptually, a set difference (symmetric) on multiple columns is a set union (outer join) minus a set intersection (or inner join):
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)
This yields:
col1 col2
0 1 2
2 3 4
3 4 6
4 5 5
In Pandas 1.1.0 you can count unique rows with value_counts and find difference between counts:
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
diff = df2.value_counts().sub(df1.value_counts(), fill_value=0)
Result:
col1 col2
1 2 -1.0
2 3 0.0
3 4 -1.0
4 6 1.0
5 5 1.0
dtype: float64
Get positive counts:
diff[diff > 0].reset_index(name='counts')
col1 col2 counts
0 4 6 1.0
1 5 5 1.0
this should work even if you have multiple columns in both dataframes. But make sure that the column names of both the dataframes are the exact same.
set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
With multiple columns you can also use:
col_names=['col_1','col_2']
set_difference = pd.concat([df2[col_names], df1[col_names],
df1[col_names]]).drop_duplicates(keep=False)
I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns elment-wise and when they are not matching print out both values. I have tried:
if df1[col] != df2[col]:
print(df1[col])
print(df2[col])
where I get the error for 'The truth value of a Series is ambiguous'
I believe this is because the column is treated as a series of boolean values for the comparison which causes the ambiguity. I also tried various forms of for loops which did not resolve the issue.
Can anyone point me to how I should go about doing what I described?
This might work for you:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
if not df2[df2['col1'] != df1['col1']].empty:
print(df1[df1['col1'] != df2['col1']])
print(df2[df2['col1'] != df1['col1']])
Output:
col1
2 3
4 5
col1
2 9
4 7
You need to get hold of the index where the column values are not matching. Once you have that index then you can query the individual DFs to get the values.
Please try the fallowing and is if this helps:
for ind in (df1.loc[df1['col1'] != df2['col1']].index):
x = df1.loc[df1.index == ind, 'col1'].values[0]
y = df2.loc[df2.index == ind, 'col1'].values[0]
print(x, y )
Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row and pass the returned indices to df.loc[indices, [col1, col2]] and that returns the required set of rows where col1 != col2.
Option-2: We get the indices with df[col1] != df[col2] and the rest of the logic is the same as Option-1.
Dummy Data
I made the dummy data such that for indices: 2,6,8 we will find column 'a' and 'c' to be different. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
a b c
0 0 0 0
1 1 1 1
2 2 4 0
3 3 9 3
4 4 16 4
5 5 25 5
6 6 36 20
7 7 49 7
8 8 64 40
9 9 81 9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
a c
2 2 0
6 6 20
8 8 40
having a dataframe, I want to update subset of columns with a series of same length as number of columns being updated:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df
col1 col2
0 1 0
1 2 4
2 4 4
3 4 0
4 0 0
5 3 1
>>> df.loc[:,['col1','col2']] = pd.Series([0,1])
...
ValueError: shape mismatch: value array of shape (6,) could not be broadcast to indexing result of shape (2,6)
it fails, however, I am able to do the same thing using list:
>>> df.loc[:,['col1','col2']] = list(pd.Series([0,1]))
>>> df
col1 col2
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
could you please help me to understand, why updating with series fails? do I have to perform some particular reshaping?
When assigning with a pandas object, pandas treats the assignment more "rigorously". A pandas to pandas assignment must pass stricter protocols. Only when you turn it to a list (or equivalently pd.Series([0, 1]).values) did pandas give in and allow you to assign in the way you'd imagine it should work.
That higher standard of assignment requires that the indices line up as well, so even if you had the right shape, it still wouldn't have worked without the correct indices.
df.loc[:, ['col1', 'col2']] = pd.DataFrame([[0, 1] for _ in range(6)])
df
df.loc[:, ['col1', 'col2']] = pd.DataFrame([[0, 1] for _ in range(6)], columns=['col1', 'col2'])
df
I'm trying to assign one row of a hierarchically indexed Pandas DataFrame to another row of the DataFrame. What follows is a minimal example.
import numpy as np
import pandas as pd
columns = pd.MultiIndex.from_tuples([('a', 0), ('a', 1), ('b', 0), ('b', 1)])
data = pd.DataFrame(np.random.randn(3, 4), columns=columns)
print(data)
data.loc[0, 'a'] = data.loc[1, 'b']
print(data)
This fills row 0 with NaNs instead of the values from row 1. I noticed I can get around it by converting to an ndarray before assignment:
data.loc[0, 'a'] = np.array(data.loc[1, 'b'])
Presumably there's a reason for this behavior, and an idiomatic way to make the assignment?
Edit: modified the question after Jeff's answer made me realize I oversimplified the problem.
In [38]: data = pd.DataFrame(np.random.randn(3, 2), columns=columns)
In [39]: data
Out[39]:
a
0 1
0 1.657540 -1.086500
1 0.700830 1.688279
2 -0.912225 -0.199431
In [40]: data.loc[0,'a']
Out[40]:
0 1.65754
1 -1.08650
Name: 0, dtype: float64
In [41]: data.loc[1,'a']
Out[41]:
0 0.700830
1 1.688279
Name: 1, dtype: float64
In your example notice that the index of the assigned element are [0,1]; These don't match the columns which are ('a',0),('a',1). So you end up effectively reindexing to elements which don't exist and hence you get nan.
In general its better to let pandas 'figure' out the rhs alignment (and like you are doing here, mask the lhs).
In [42]: data.loc[0,'a'] = data.loc[1,:]
In [43]: data
Out[43]:
a
0 1
0 0.700830 1.688279
1 0.700830 1.688279
2 -0.912225 -0.199431
You also could do
data.loc[0] = data.loc[1]
Here's another way:
In [96]: data = pd.DataFrame(np.arange(12).reshape(3,4), columns=pd.MultiIndex.from_product([['a','b'],[0,1]]))
In [97]: data
Out[97]:
a b
0 1 0 1
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [98]: data.loc[0,'a'] = data.loc[1,'b'].values
In [99]: data
Out[99]:
a b
0 1 0 1
0 6 7 2 3
1 4 5 6 7
2 8 9 10 11
Pandas will always align the data, that's why this doesn't work naturally. You are deliberately NOT aligning.
A simple pandas question:
Is there a drop_duplicates() functionality to drop every row involved in the duplication?
An equivalent question is the following: Does pandas have a set difference for dataframes?
For example:
In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
In [7]: df1
Out[7]:
col1 col2
0 1 2
1 2 3
2 3 4
In [8]: df2
Out[8]:
col1 col2
0 4 6
1 2 3
2 5 5
so maybe something like df2.set_diff(df1) will produce this:
col1 col2
0 4 6
2 5 5
However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.
By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.
Thanks!
Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
This step will get rid of any duplicates in the dataframes as well (index ignored)
set([(1, 2), (3, 4), (2, 3)]) # ds1
can then use set methods to find anything. Eg to find differences:
ds1.difference(ds2)
gives:
set([(1, 2), (3, 4)])
can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:
pd.DataFrame(list(ds1.difference(ds2)))
Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)
pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
It is fast and the result is
col1 col2
0 4 6
2 5 5
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))
Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)
In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 4 6
2 5 5
Same thing, but include all values in df1, but still per column in df2
In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]:
col1 col2
0 NaN 6
2 5 5
2nd example
In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})
In [35]: g.columns=df1.columns
In [36]: g
Out[36]:
col1 col2
0 1.2 4
1 1.5 4
2 1.3 4
In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 1.2 NaN
1 1.5 NaN
2 1.3 NaN
Note, in 0.13, there will be an isin operator on the frame level, so something like: df2.isin(df1) should be possible
There are 3 methods which work, but two of them have some flaws.
Method 1 (Hash method):
It worked for all cases I tested.
df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]
Method 2 (Dict method):
It fails if DataFrames contain datetime columns.
df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]
Method 3 (MultiIndex method):
I encountered cases when it failed on columns with None's or NaN's.
df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)
Edit: You can now make MultiIndex objects directly from data frames as of pandas 0.24.0 which greatly simplifies the syntax of this answer
df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Original Answer
Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference() method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.
df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Numpy's setdiff1d would work and perhaps be faster.
For each column:
np.setdiff1(df1.col1.values, df2.col1.values)
So something like:
setdf = pd.DataFrame({
col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values)
for col in df1.columns
})
numpy.setdiff1d docs
Get the indices of the intersection with a merge, then drop them:
>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
A B
0 0 1
1 2 3
2 4 5
3 6 7
>>> df_completed = df_all.iloc[::2]; df_completed
A B
0 0 1
2 4 5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
index A B
0 0 0 1
1 2 4 5
>>> df_pending = df_all.drop(merged['index']); df_pending
A B
1 2 3
3 6 7
Assumption:
df1 and df2 have identical columns
it is a set operation so duplicates are ignored
sets are not extremely large so you do not worry about memory
union = pd.concat([df1,df2])
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]
I'm not sure how pd.concat() implicitly joins overlapping columns but I had to do a little tweak on #radream's answer.
Conceptually, a set difference (symmetric) on multiple columns is a set union (outer join) minus a set intersection (or inner join):
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)
This yields:
col1 col2
0 1 2
2 3 4
3 4 6
4 5 5
In Pandas 1.1.0 you can count unique rows with value_counts and find difference between counts:
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
diff = df2.value_counts().sub(df1.value_counts(), fill_value=0)
Result:
col1 col2
1 2 -1.0
2 3 0.0
3 4 -1.0
4 6 1.0
5 5 1.0
dtype: float64
Get positive counts:
diff[diff > 0].reset_index(name='counts')
col1 col2 counts
0 4 6 1.0
1 5 5 1.0
this should work even if you have multiple columns in both dataframes. But make sure that the column names of both the dataframes are the exact same.
set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
With multiple columns you can also use:
col_names=['col_1','col_2']
set_difference = pd.concat([df2[col_names], df1[col_names],
df1[col_names]]).drop_duplicates(keep=False)