What is the proper way to go from this df:
>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
a b
0 jeff bob
1 bob jeff
2 jill mike
To this:
>>> df2
a b
0 jeff bob
2 jill mike
where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to their specific column.
I can hack together a solution using a lambda expression to create a mask and then drop duplicates based on the mask column, but I'm thinking there has to be a simpler way than this:
>>> df['c'] = df[['a', 'b']].apply(lambda x: ''.join(sorted((x[0], x[1]), \
key=lambda x: x[0]) + sorted((x[0], x[1]), key=lambda x: x[1] )), axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:,:-1]
I think you can sort each row independently and then use duplicated to see which ones to drop.
dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
df[~dupes]
A faster way to get dupes, thanks to @DSM:
dupes = df.T.apply(sorted).T.duplicated()
I think the simplest way is to use apply with axis=1 to sort each row, and then call DataFrame.duplicated:
df = df[~df.apply(sorted, 1).duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
A bit more complicated, but very fast, is to use numpy.sort with the DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
df = df[~df1.duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
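The numpy.sort approach above can be wrapped in a small helper. This is only a sketch: the function name drop_unordered_duplicates and its cols parameter are made up for illustration, and it assumes the values in the chosen columns can be compared with each other (for example, all strings).

```python
import numpy as np
import pandas as pd

def drop_unordered_duplicates(df, cols):
    """Drop rows whose values in `cols` repeat an earlier row, ignoring column order."""
    # Sort the values within each row so ('bob', 'jeff') and ('jeff', 'bob') look identical.
    sorted_rows = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)
    # duplicated() marks every repeat after the first occurrence.
    return df[~sorted_rows.duplicated()]

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})
result = drop_unordered_duplicates(df, ['a', 'b'])
print(result)
```

The original rows are returned untouched; only the temporary sorted copy is used to decide which rows to keep.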
Timings:
np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100,size=N).astype(str),
'B': np.random.randint(100,size=N).astype(str)})
#print (df)
In [63]: %timeit (df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).duplicated()])
100 loops, best of 3: 3.25 ms per loop
In [64]: %timeit (df[~df.apply(sorted, 1).duplicated()])
1 loop, best of 3: 1.09 s per loop
#Ted Petrou solution1
In [65]: %timeit (df[~df.apply(lambda x: x.sort_values().values, axis=1).duplicated()])
1 loop, best of 3: 2.89 s per loop
#Ted Petrou solution2
In [66]: %timeit (df[~df.T.apply(sorted).T.duplicated()])
1 loop, best of 3: 1.56 s per loop
This question is the same as one posted earlier, except that I want to concatenate three columns instead of two.
Here is how two columns are combined:
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)
df
bar foo new combined
0 1 a apple a_1
1 2 b banana b_2
2 3 c pear c_3
I want to combine three columns with this command but it is not working, any idea?
df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:
cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.
In[17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']
In[17]:df
Out[18]:
bar foo new combined
0 1 a apple 1_a_apple
1 2 b banana 2_b_banana
2 3 c pear 3_c_pear
If you have even more columns you want to combine, using the Series method str.cat might be handy:
df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")
Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).
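As a minimal sketch of the case mentioned in the parenthesis, here is str.cat starting from the numeric column 'bar' of the question's frame. The variable name combined is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

# 'bar' is numeric, so cast it to str before reaching for the .str accessor.
combined = df['bar'].astype(str).str.cat(df[['foo', 'new']].astype(str), sep='_')
print(combined.tolist())
```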
Just wanted to make a time comparison for both solutions (for 30K rows DF):
In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
In [2]: big = pd.concat([df] * 10**4, ignore_index=True)
In [3]: big.shape
Out[3]: (30000, 3)
In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop
In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop
A few more options (.ix was removed in pandas 1.0, so .iloc is used here):
In [6]: %timeit big.iloc[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop
In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop
Possibly the fastest solution is to operate in plain Python:
Series(
    map(
        '_'.join,
        df.values.tolist()
        # when non-string columns are present:
        # df.values.astype(str).tolist()
    ),
    index=df.index
)
Comparison against @MaxU's answer (using the big data frame, which has both numeric and string columns):
%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Comparison against @derchambers' answer (using their df data frame, where all columns are strings):
from functools import reduce
def reduce_join(df, columns):
    slist = [df[x] for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])
def list_map(df, columns):
    return Series(
        map(
            '_'.join,
            df[columns].values.tolist()
        ),
        index=df.index
    )
%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The answer given by @allen is reasonably generic but can be slow on larger dataframes.
reduce does a lot better:
from functools import reduce
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def reduce_join(df, columns):
    assert len(columns) > 1
    slist = [df[x].astype(str) for x in columns]
    return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])
def apply_join(df, columns):
    assert len(columns) > 1
    return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)
# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)
# profile
%timeit df1 = reduce_join(df, list('1234')) # 733 ms
%timeit df2 = apply_join(df, list('1234')) # 8.84 s
I think you are missing one %s
df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
First convert the columns to str. Then use .T.agg('_'.join) to concatenate them.
# Initialize columns
cols_concat = ['first_name', 'second_name']
# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')
# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)
If you concatenate with a string such as '_', first convert the columns you want to str; after that you can concatenate them.
df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']
Here X stands for any delimiter (e.g. a space) used to separate the two merged columns.
If you have a list of columns you want to concatenate and maybe you'd like to use some separator, here's what you can do
def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    df[new_col_name] = df[cols_to_concat[0]]
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)
This should be faster than apply and takes an arbitrary number of columns to concatenate.
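A usage sketch for concat_columns, run against the frame from the question. The function is repeated so the snippet is self-contained; the only change is that the first column is cast to str up front, which is an assumption about the intended behaviour. Note that it modifies df in place and returns nothing.

```python
import pandas as pd

def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
    # Start from the first column as strings, then append the rest one by one.
    df[new_col_name] = df[cols_to_concat[0]].astype(str)
    for col in cols_to_concat[1:]:
        df[new_col_name] = df[new_col_name] + sep + df[col].astype(str)

df = pd.DataFrame({'foo': ['a', 'b', 'c'],
                   'bar': [1, 2, 3],
                   'new': ['apple', 'banana', 'pear']})

# The new column is written directly onto df.
concat_columns(df, ['foo', 'bar', 'new'], 'combined', sep='_')
print(df['combined'].tolist())
```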
@derchambers, I found one more solution:
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def eval_join(df, columns):
    sum_elements = [f"df['{col}']" for col in columns]
    to_eval = "+ '_' + ".join(sum_elements)
    return eval(to_eval)
#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms
You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):
def concat_cols(df, cols_to_concat, new_col_name, separator):
    df[new_col_name] = ''
    for i, col in enumerate(cols_to_concat):
        df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
    return df
Sample usage:
test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')
Following up on @Allen's response:
If you need to chain this operation with other dataframe transformations, use assign:
df.assign(
    combined = lambda x: x[cols].apply(
        lambda row: "_".join(row.values.astype(str)), axis=1
    )
)
Considering that one is combining three columns, one would need three format specifiers, '%s_%s_%s', not just two '%s_%s'. The following will do the work
df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.
columns = ['foo', 'bar', 'new']
df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
This last one is more convenient, as one can simply change or add column names in the list, which requires fewer changes.
I have 2 dataframes as follows:
data1 looks like this:
id address
1 11123451
2 78947591
data2 looks like the following:
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
I want to create another column in data1 called "place" which contains the place the id is from.
For example, in the above case,
for id 1, I want the place column to contain Y and for id 2, I want the place column to contain X.
There will be many ids coming from the same place. And some ids don't have a match.
I am trying to do it using the following piece of code.
places = []
for index, row in data1.iterrows():
    for idx, r in data2.iterrows():
        if r['lowerbound_address'] <= row['address'] <= r['upperbound_address']:
            places.append(r['place'])
The addresses here are float values.
It's taking forever to run this piece of code. It makes me wonder if my code is correct or if there's a faster way of executing the same.
Any help will be much appreciated.
Thank you!
You can first do a cross join with merge and then filter the rows by boolean indexing. Finally, remove the unnecessary columns with drop:
data1['tmp'] = 1
data2['tmp'] = 1
df = pd.merge(data1, data2, on='tmp', how='outer')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
print (df)
id address place
1 1 11123451 Y
2 2 78947591 X
Another solution loops with itertuples and then builds the result with DataFrame.from_records:
places = []
for row1 in data1.itertuples():
    for row2 in data2.itertuples():
        #print (row1.address)
        if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
            places.append((row1.id, row1.address, row2.place))
print (places)
[(1, 11123451, 'Y'), (2, 78947591, 'X')]
df = pd.DataFrame.from_records(places)
df.columns=['id','address','place']
print (df)
id address place
0 1 11123451 Y
1 2 78947591 X
Another solution with apply:
def f(x):
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])
df = data1.set_index('id')['address'].apply(f).reset_index()
print (df)
id address place
0 1 11123451 Y
1 2 78947591 X
EDIT:
Timings:
N = 1000:
If some values are not in range, they are omitted by the solutions that filter on the bounds (a and b). Check the last row of df1.
In [73]: %timeit (data1.set_index('id')['address'].apply(f).reset_index())
1 loop, best of 3: 2.06 s per loop
In [74]: %timeit (a(df1a, df2a))
1 loop, best of 3: 82.2 ms per loop
In [75]: %timeit (b(df1b, df2b))
1 loop, best of 3: 3.17 s per loop
In [76]: %timeit (c(df1c, df2c))
100 loops, best of 3: 2.71 ms per loop
Code for timings:
np.random.seed(123)
N = 1000
data1 = pd.DataFrame({'id':np.arange(1,N+1),
'address': np.random.randint(N*10, size=N)}, columns=['id','address'])
#add last row with value out of range
data1.loc[data1.index[-1]+1, ['id','address']] = [data1.index[-1]+1, -1]
data1 = data1.astype(int)
print (data1.tail())
data2 = pd.DataFrame({'lowerbound_address':np.arange(1, N*10,10),
'upperbound_address':np.arange(10,N*10+10, 10),
'place': np.random.randint(40, size=N)})
print (data2.tail())
df1a, df1b, df1c = data1.copy(),data1.copy(),data1.copy()
df2a, df2b ,df2c = data2.copy(),data2.copy(),data2.copy()
def a(data1, data2):
    data1['tmp'] = 1
    data2['tmp'] = 1
    df = pd.merge(data1, data2, on='tmp', how='outer')
    df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
    df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
    return (df)
def b(data1, data2):
    places = []
    for row1 in data1.itertuples():
        for row2 in data2.itertuples():
            #print (row1.address)
            if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
                places.append((row1.id, row1.address, row2.place))
    df = pd.DataFrame.from_records(places)
    df.columns=['id','address','place']
    return (df)
def f(x):
    #use for ... else to add NaN for values out of range
    #http://stackoverflow.com/q/9979970/2901002
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])
    else:
        return pd.Series([x, np.nan], index=['address','place'])
def c(data1,data2):
    data1 = data1.sort_values('address')
    data2 = data2.sort_values('lowerbound_address')
    df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
    df = df.drop(['lowerbound_address','upperbound_address'], axis=1)
    return df.sort_values('id')
print (data1.set_index('id')['address'].apply(f).reset_index())
print (a(df1a, df2a))
print (b(df1b, df2b))
print (c(df1c, df2c))
Only solution c, with merge_asof, scales well to large DataFrames:
N=1M:
In [84]: %timeit (c(df1c, df2c))
1 loop, best of 3: 525 ms per loop
More about merge_asof can be found in the pandas docs.
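One caveat with solution c: merge_asof matches each address to the nearest lower bound only, so an address that falls in a gap between ranges still gets a place. A possible sketch of an upper-bound check follows. It assumes the ranges in data2 do not overlap; the extra row with id 3 is a made-up address lying between the two ranges.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'id': [1, 2, 3],
                      'address': [11123451, 78947591, 50000000]})
data2 = pd.DataFrame({'lowerbound_address': [78392888, 10000000],
                      'upperbound_address': [89000000, 20000000],
                      'place': ['X', 'Y']})

# merge_asof needs both frames sorted on their join keys.
left = data1.sort_values('address')
right = data2.sort_values('lowerbound_address')

# For each address, pick the last range whose lower bound is <= address.
matched = pd.merge_asof(left, right,
                        left_on='address', right_on='lowerbound_address')

# The backward match ignores the upper bound, so blank out addresses beyond it.
too_high = matched['address'] > matched['upperbound_address']
matched.loc[too_high, 'place'] = np.nan

result = (matched[['id', 'address', 'place']]
          .sort_values('id')
          .reset_index(drop=True))
print(result)
```

Addresses below every lower bound already come back as NaN from merge_asof, so only the upper side needs the explicit check.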
How to remove a pandas dataframe from another dataframe, just like the set subtraction:
a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]
Now suppose we have two pandas dataframes. How do we remove df2 from df1?
In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
a b
0 1 2
1 3 4
2 5 6
In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
a b
0 1 2
1 5 6
We expect the result of df1 - df2 to be:
In [14]: df
Out[14]:
a b
0 3 4
How to do it?
Thank you.
Solution
Use pd.concat followed by drop_duplicates(keep=False)
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
It looks like
a b
1 3 4
Explanation
pd.concat stacks the DataFrames by appending one after the other. If there is any overlap, drop_duplicates will catch it. By default, however, drop_duplicates keeps the first observation and removes the others. Here we want every copy of a duplicated row removed, which is exactly what keep=False does.
A special note on the repeated df2: with only one df2, any row in df2 that is not in df1 would not be considered a duplicate and would remain, so that version works only when df2 is a subset of df1. If we concat df2 twice, every df2 row is guaranteed to be a duplicate and is removed.
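The point about the repeated df2 can be checked with a small sketch. The row [7, 8] is a hypothetical addition to the question's df2 so that df2 is no longer a subset of df1.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# [7, 8] appears in df2 only (an illustrative extra row).
df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=['a', 'b'])

# One copy of df2: the df2-only row [7, 8] is unique, so it leaks into the result.
once = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Two copies of df2: every df2 row is now duplicated and therefore dropped.
twice = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(once.values.tolist())
print(twice.values.tolist())
```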
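You can use .duplicated, which has the benefit of being fairly expressive. Note that this version compares index labels rather than row values, and that DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
%%timeit
combined = pd.concat([df1, df2])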
combined[~combined.index.duplicated(keep=False)]
1000 loops, best of 3: 875 µs per loop
For comparison:
%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
100 loops, best of 3: 4.57 ms per loop
%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
1000 loops, best of 3: 987 µs per loop
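%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]
1000 loops, best of 3: 546 µs per loop
In sum, using the np.array comparison is fastest, and it doesn't need the .tolist() calls. Be aware, though, that the in operator on a NumPy array tests whether any element matches rather than whether a whole row matches, so it can give false positives on other data.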
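To get a dataframe with all records that are in DF1 but not in DF2 (note that isin with a DataFrame argument also requires the index and column labels to match, so this works only when matching rows share the same index):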
DF=DF1[~DF1.isin(DF2)].dropna(how = 'all')
A set logic approach. Turn the rows of df1 and df2 into sets. Then use set subtraction to define new DataFrame
idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx1 - idx2), columns=df1.columns)
a b
0 3 4
My shot with merge df1 and df2 from the question.
Using 'indicator' parameter
In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]:
a b
1 3 4
This solution works when your df_to_drop is a subset of main data frame data.
data_clean = data.drop(df_to_drop.index)
A masking approach
df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]
a b
1 3 4
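I think the first tolist() should stay as well: values is an attribute rather than a method, and testing a bare NumPy array for membership in a list of lists raises a ValueError because the element-wise comparison is ambiguous:
df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]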
An easy option is to use indexes.
Concatenate df1 and df2 and reset the index:
df = pd.concat([df1, df2])
df.reset_index(drop=True, inplace=True)
For example, this gives a boolean mask marking the rows whose values appear in df2:
in_df2 = df["a"].isin(df2["a"]) & df["b"].isin(df2["b"])
result_index = df.index[~in_df2]
result_data = df.loc[result_index, :]
Hope it helps new readers, although the question was posted a while ago :)
Solution if df1 contains duplicates + keeps the index.
A modified version of piRSquared's answer to keep the duplicates in df1 that do not appear in df2, while maintaining the index.
df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]
If your dataframes are big, you may want to store the result of
pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
in a variable before the df1.apply call.
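A sketch of that suggestion, with the intermediate result stored in a variable named keep (an illustrative name). Here df1 is the question's frame with a repeated [3, 4] row added to show that duplicates and the original index both survive.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Compute the distinct rows of df1 that are absent from df2 once, outside apply.
keep = pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)

# Keep every df1 row that equals some row of `keep`; df1's index is preserved.
result = df1[df1.apply(lambda x: (keep == x).all(axis=1).any(), axis=1)]
print(result)
```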