I have a dataframe that has two columns, user_id and item_bought.
Here user_id is the index of the dataframe. I want to group by both user_id and item_bought and get the item-wise count for each user.
How do I do that?
From version 0.20.1 it is simpler:
Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
'B': np.arange(8)}, index=index)
print (df)
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
print (df.groupby(['second', 'A']).sum())
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
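Applied to the question's frame, a minimal sketch (the frame below is hypothetical, built from the column and index names given in the question):
import pandas as pd
df = pd.DataFrame({'item_bought': ['x', 'x', 'y', 'y']},
                  index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'))
# From 0.20.1 on, the index level name can be passed to groupby like a column name.
print (df.groupby(['user_id', 'item_bought']).size())
user_id  item_bought
b        x              2
         y              1
c        y              1
dtype: int64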
This should work:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df
col1 col2
ind1 ind2
A 0 3 2
1 2 0
2 2 3
B 3 2 4
C 4 3 1
5 0 0
>>> df.groupby([df.index.get_level_values(0),'col1']).count()
col2
ind1 col1
A 2 2
3 1
B 2 1
C 0 1
3 1
I had the same problem using one of the columns from a MultiIndex. With a MultiIndex, you cannot use df.index.levels[0], since it contains only the distinct values of that index level and will most likely have a different length than the whole DataFrame.
Check http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html - get_level_values "Return vector of label values for requested level, equal to the length of the index".
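As a sketch of how this applies to the question's layout (the frame itself is hypothetical, with user_id as the index name as in the question; this approach also works on versions older than 0.20.1):
import pandas as pd
df = pd.DataFrame({'item_bought': ['x', 'x', 'y', 'y']},
                  index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'))
# get_level_values returns one label per row (same length as the frame),
# so it can be mixed with ordinary column names inside groupby.
df.groupby([df.index.get_level_values('user_id'), 'item_bought']).size()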
import pandas as pd
import numpy as np
In [11]:
df = pd.DataFrame()
In [12]:
df['user_id'] = ['b','b','b','c']
In [13]:
df['item_bought'] = ['x','x','y','y']
In [14]:
df['ct'] = 1
In [15]:
df
Out[15]:
user_id item_bought ct
0 b x 1
1 b x 1
2 b y 1
3 c y 1
In [16]:
pd.pivot_table(df,values='ct',index=['user_id','item_bought'],aggfunc=np.sum)
Out[16]:
user_id item_bought
b x 2
y 1
c y 1
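For what it's worth, the same counts can be obtained from the frame above without the helper ct column, letting groupby count the rows directly (a small sketch, not part of the original answer):
df.groupby(['user_id', 'item_bought']).size()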
I had the same problem - I imported a bunch of data and wanted to group by a field that was the index. I didn't have a multi-index or any of that jazz, and neither do you.
I figured the problem is that the field I want is the index, so at first I just reset the index - but this gives me a useless index field that I don't want. So now I do the following (two levels of grouping):
grouped = df.reset_index().groupby(by=['Field1','Field2'])
then I can use 'grouped' in a bunch of ways for different reports
grouped[['Field3','Field4']].agg([np.mean, np.std])
(which was what I wanted, giving me Field3 and Field4 averages, grouped by Field1 (the index) and Field2).
For you, if you just want to do the count of items per user, in one simple line using groupby, the code could be
df.reset_index().groupby(by=['user_id']).count()
If you want to do more things then you can (like me) create 'grouped' and then use that. As a beginner, I find it easier to follow that way.
Please note that reset_index is not in place, so it will not mess up your original dataframe.
In my application I am multiplying two Pandas Series which both have multiple index levels. Sometimes, a level contains only a single unique value, in which case I don't get all the index levels from both Series in my result.
To illustrate the problem, let's take two series:
s1 = pd.Series(np.random.randn(4), index=[[1, 1, 1, 1], [1,2,3,4]])
s1.index.names = ['A', 'B']
A B
1 1 -2.155463
2 -0.411068
3 1.041838
4 0.016690
s2 = pd.Series(np.random.randn(4), index=[['a', 'a', 'a', 'a'], [1,2,3,4]])
s2.index.names = ['C', 'B']
C B
a 1 0.043064
2 -1.456251
3 0.024657
4 0.912114
Now, if I multiply them, I get the following:
s1.mul(s2)
A B
1 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
While my desired result would be
A C B
1 a 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
How can I keep index level C in the multiplication?
I have so far been able to get the right result as shown below, but would much prefer a neater solution which keeps my code more simple and readable.
s3 = s2.mul(s1).to_frame()
s3['C'] = 'a'
s3.set_index('C', append=True, inplace=True)
You can use Series.unstack with DataFrame.stack:
s = s2.unstack(level=0).mul(s1, level=1, axis=0).stack().reorder_levels(['A','C','B'])
print (s)
A C B
1 a 1 0.827482
2 -0.476929
3 -0.473209
4 -0.520207
dtype: float64
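Read step by step, the one-liner does roughly the following (a sketch reusing s1 and s2 from the question; the intermediate names are just for illustration):
wide = s2.unstack(level=0)              # pivot level 'C' into columns; the index is now just 'B'
prod = wide.mul(s1, level=1, axis=0)    # multiply, aligning on 'B'; the result picks up s1's ('A', 'B') index
s = prod.stack().reorder_levels(['A', 'C', 'B'])  # fold 'C' back into the index and reorder the levels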
A simple pandas question:
Is there a drop_duplicates() functionality to drop every row involved in the duplication?
An equivalent question is the following: Does pandas have a set difference for dataframes?
For example:
In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
In [7]: df1
Out[7]:
col1 col2
0 1 2
1 2 3
2 3 4
In [8]: df2
Out[8]:
col1 col2
0 4 6
1 2 3
2 5 5
so maybe something like df2.set_diff(df1) will produce this:
col1 col2
0 4 6
2 5 5
However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.
By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.
Thanks!
A bit convoluted, but this works if you want to totally ignore the index data: convert the contents of the DataFrames to sets of tuples containing the column values:
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
This step will get rid of any duplicates within the DataFrames as well (index ignored):
set([(1, 2), (3, 4), (2, 3)]) # ds1
You can then use set methods to find whatever you need, e.g. to find differences:
ds1.difference(ds2)
gives:
set([(1, 2), (3, 4)])
You can take that back to a DataFrame if needed. Note that you have to convert the set to a list first, since a set cannot be used to construct a DataFrame:
pd.DataFrame(list(ds1.difference(ds2)))
Here's another answer that keeps the index and does not require identical indexes in the two data frames. (EDIT: make sure there are no duplicates in df2 beforehand; one way to guard against that is sketched after the output below.) Appending df1 twice guarantees that every df1 row appears more than once, so drop_duplicates(keep=False) removes all of df1 along with any df2 rows that match it.
pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
It is fast and the result is
col1 col2
0 4 6
2 5 5
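If df2 might itself contain duplicate rows, one way to guard against that caveat (a sketch, not from the original answer) is to de-duplicate df2 first:
pd.concat([df2.drop_duplicates(), df1, df1]).drop_duplicates(keep=False)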
from pandas import DataFrame
df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))
Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)
In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 4 6
2 5 5
Same thing, but checking each df2 column against all values in df1 (flattened), still per column of df2:
In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]:
col1 col2
0 NaN 6
2 5 5
A second example:
In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})
In [35]: g.columns=df1.columns
In [36]: g
Out[36]:
col1 col2
0 1.2 4
1 1.5 4
2 1.3 4
In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]:
col1 col2
0 1.2 NaN
1 1.5 NaN
2 1.3 NaN
Note: in 0.13 there will be an isin operator at the frame level, so something like df2.isin(df1) should be possible.
There are 3 methods which work, but two of them have some flaws.
Method 1 (Hash method):
It worked for all cases I tested.
df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]
Method 2 (Dict method):
It fails if DataFrames contain datetime columns.
df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]
Method 3 (MultiIndex method):
I encountered cases where it failed on columns containing None or NaN.
df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Edit: as of pandas 0.24.0 you can make MultiIndex objects directly from data frames, which greatly simplifies the syntax of this answer:
df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Original Answer
Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference() method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.
df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
Numpy's setdiff1d would work and perhaps be faster.
For each column:
np.setdiff1d(df1.col1.values, df2.col1.values)
So something like:
setdf = pd.DataFrame({
col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values)
for col in df1.columns
})
numpy.setdiff1d docs
Get the indices of the intersection with a merge, then drop them:
>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
A B
0 0 1
1 2 3
2 4 5
3 6 7
>>> df_completed = df_all.iloc[::2]; df_completed
A B
0 0 1
2 4 5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
index A B
0 0 0 1
1 2 4 5
>>> df_pending = df_all.drop(merged['index']); df_pending
A B
1 2 3
3 6 7
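Applied to df1 and df2 from the question, the same idea looks like this (a sketch; merge joins on the shared col1/col2 columns by default):
merged = pd.merge(df2.reset_index(), df1)   # rows of df2 that also appear in df1, with their original positions
df2_only = df2.drop(merged['index'])        # drop those rows from df2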
Assumptions:
df1 and df2 have identical columns
it is a set operation, so duplicates are ignored
the sets are not extremely large, so you do not need to worry about memory
union = pd.concat([df1, df2])
# rows that appear in exactly one of df1, df2 (the symmetric difference)
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
# rows present in both df1 and sym_diff, i.e. df1 minus df2 as a set
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]
I'm not sure how pd.concat() implicitly joins overlapping columns, but I had to make a small tweak to @radream's answer.
Conceptually, a symmetric set difference on multiple columns is the set union (outer join) minus the set intersection (inner join):
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)
This yields:
col1 col2
0 1 2
2 3 4
3 4 6
4 5 5
In pandas 1.1.0 you can count unique rows with value_counts and take the difference between the counts:
df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
diff = df2.value_counts().sub(df1.value_counts(), fill_value=0)
Result:
col1 col2
1 2 -1.0
2 3 0.0
3 4 -1.0
4 6 1.0
5 5 1.0
dtype: float64
Get positive counts:
diff[diff > 0].reset_index(name='counts')
col1 col2 counts
0 4 6 1.0
1 5 5 1.0
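If you want the difference back as plain rows rather than counts, one option (a sketch; it ignores multiplicities greater than one) is to turn the surviving MultiIndex entries back into a frame:
diff[diff > 0].index.to_frame(index=False)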
This should work even if you have multiple columns in both DataFrames, but make sure that the column names of the two DataFrames are exactly the same.
set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
With multiple columns you can also use:
col_names=['col_1','col_2']
set_difference = pd.concat([df2[col_names], df1[col_names],
df1[col_names]]).drop_duplicates(keep=False)
I have an empty DataFrame:
import pandas as pd
df = pd.DataFrame()
I want to add a hierarchically-named column. I tried this:
df['foo', 'bar'] = [1,2,3]
But it gives a column whose name is a tuple:
(foo, bar)
0 1
1 2
2 3
I want this:
foo
bar
0 1
1 2
2 3
Which I can get if I construct a brand new DataFrame this way:
pd.DataFrame([1,2,3], columns=pd.MultiIndex.from_tuples([('foo', 'bar')]))
How can I create such a layout when adding new columns to an existing DataFrame? The number of levels is always 2...and I know all the possible values for the first level in advance.
If you are looking to build the multi-index DF one column at a time, you could append the frames and drop the NaNs that are introduced, leaving you with the desired multi-index DF, as shown:
Demo:
df = pd.DataFrame()
df['foo', 'bar'] = [1,2,3]
df['foo', 'baz'] = [3,4,5]
df
Take one column at a time and build the corresponding headers:
pd.concat([df[[0]], df[[1]]]).apply(lambda x: x.dropna())
Due to the NaNs produced, the values are cast to float dtype; they can be cast back to integers with DF.astype(int).
Note:
This assumes that the number of levels are matching during concatenation.
I'm not sure there is a way to get away with this without redefining the column index to be a MultiIndex. If I am not mistaken, the levels of the MultiIndex class are actually made up of Index objects. While you can have DataFrames with hierarchical indices that have no values for one or more of the levels, the index object itself must still be a MultiIndex. For example:
In [2]: df = pd.DataFrame({'foo': [1,2,3], 'bar': [4,5,6]})
In [3]: df
Out[3]:
bar foo
0 4 1
1 5 2
2 6 3
In [4]: df.columns
Out[4]: Index([u'bar', u'foo'], dtype='object')
In [5]: df.columns = pd.MultiIndex.from_tuples([('', 'foo'), ('foo','bar')])
In [6]: df.columns
Out[6]:
MultiIndex(levels=[[u'', u'foo'], [u'bar', u'foo']],
labels=[[0, 1], [1, 0]])
In [7]: df.columns.get_level_values(0)
Out[7]: Index([u'', u'foo'], dtype='object')
In [8]: df
Out[8]:
foo
foo bar
0 4 1
1 5 2
2 6 3
In [9]: df['bar', 'baz'] = [7,8,9]
In [10]: df
Out[10]:
foo bar
foo bar baz
0 4 1 7
1 5 2 8
2 6 3 9
So as you can see, once the MultiIndex is in place you can add columns as you thought, but unfortunately I am not aware of any way of coercing the DataFrame to adaptively adopt a MultiIndex.
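To summarize the pattern in a minimal, self-contained sketch (the column names are just illustrative): build the frame with MultiIndex columns from the start, and tuple-keyed assignment then creates properly nested columns instead of flat tuple names.
import pandas as pd
df = pd.DataFrame([1, 2, 3], columns=pd.MultiIndex.from_tuples([('foo', 'bar')]))
df['foo', 'baz'] = [4, 5, 6]   # nests under the existing 'foo' group
df['qux', 'bar'] = [7, 8, 9]   # adds a new top-level group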
I have two index-related questions on Python Pandas dataframes.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : range(1,9),
'B' : ['one', 'one', 'two', 'three',
'two', 'three', 'one', 'two'],
'amount' : np.random.randn(8)})
df = df.ix[df.B != 'three'] # remove where B = three
df.index
>> Int64Index([0, 1, 2, 4, 6, 7], dtype=int64) # the original index is preserved.
1) I do not understand why the indexing is not automatically updated after I modify the dataframe. Is there a way to automatically update the indexing while modifying a dataframe? If not, what is the most efficient manual way to do this?
2) I want to be able to set the B column of the 5th element of df to 'three'. But df.iloc[5]['B'] = 'three' does not do that. I checked on the manual but it does not cover how to change a specific cell value accessed by location.
If I were accessing by row name, I could do: df.loc[5,'B'] = 'three' but I don't know what the index access equivalent is.
P.S. Link1 and link2 are relevant answers to my second question. However, they do not answer my question.
1) I do not understand why the indexing is not automatically updated after I modify the dataframe.
If you want to reset the index after removing/adding rows you can do this:
df = df[df.B != 'three'] # remove where B = three
df.reset_index(drop=True)
B amount id
0 one -1.176137 1
1 one 0.434470 2
2 two -0.887526 3
3 two 0.126969 5
4 one 0.090442 7
5 two -1.511353 8
Indexes are meant to label/tag/id a row... so you might think about making your 'id' column the index, and then you'll appreciate that Pandas doesn't 'automatically update' the index when deleting rows.
df.set_index('id')
B amount
id
1 one -0.410671
2 one 0.092931
3 two -0.100324
4 three 0.322580
5 two -0.546932
6 three -2.018198
7 one -0.459551
8 two 1.254597
2) I want to be able to set the B column of the 5th element of df to 'three'. But df.iloc[5]['B'] = 'three' does not do that. I checked on the manual but it does not cover how to change a specific cell value accessed by location.
Jeff already answered this...
In [5]: df = pd.DataFrame({'id' : range(1,9),
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'three', 'one', 'two'],
...: 'amount' : np.random.randn(8)})
In [6]: df
Out[6]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
3 three -0.654062 4
4 two 0.587660 5
5 three -0.719589 6
6 one 0.860739 7
7 two -2.041390 8
[8 rows x 3 columns]
Your question 1): your code above is correct (see @Briford Wylie's answer for resetting the index, which is what I think you want).
In [7]: df.ix[df.B!='three']
Out[7]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
4 two 0.587660 5
6 one 0.860739 7
7 two -2.041390 8
[6 rows x 3 columns]
In [8]: df = df.ix[df.B!='three']
In [9]: df.index
Out[9]: Int64Index([0, 1, 2, 4, 6, 7], dtype='int64')
In [10]: df.iloc[5]
Out[10]:
B two
amount -2.04139
id 8
Name: 7, dtype: object
Question 2):
You are trying to set a copy; in 0.13 this will raise/warn. See here.
In [11]: df.iloc[5]['B'] = 5
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
In [24]: df.iloc[5,df.columns.get_indexer(['B'])] = 'foo'
In [25]: df
Out[25]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
4 two 0.587660 5
6 one 0.860739 7
7 foo -2.041390 8
[6 rows x 3 columns]
You can also do this. This is NOT setting a copy; since it selects a Series (which is what df['B'] is), it CAN be set directly:
In [30]: df['B'].iloc[5] = 5
In [31]: df
Out[31]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
4 two 0.587660 5
6 one 0.860739 7
7 5 -2.041390 8
[6 rows x 3 columns]
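In more recent pandas versions a common positional way to set a single cell (not from the answers above; get_loc just resolves the column label to an integer position) is:
df.iloc[5, df.columns.get_loc('B')] = 'three'
# or, equivalently for scalar access:
df.iat[5, df.columns.get_loc('B')] = 'three'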