How do I remove one pandas DataFrame from another, just like set subtraction:
a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]
Now we have two pandas DataFrames; how do I remove df2 from df1?
In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
a b
0 1 2
1 3 4
2 5 6
In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
a b
0 1 2
1 5 6
Then we expect the df1 - df2 result to be:
In [14]: df
Out[14]:
a b
0 3 4
How to do it?
Thank you.
Solution
Use pd.concat followed by drop_duplicates(keep=False)
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
It looks like this:
a b
1 3 4
Explanation
pd.concat adds the two DataFrames together by appending one right after the other. If there is any overlap, it will be captured by the drop_duplicates method. However, drop_duplicates by default leaves the first observation and removes every other observation. In this case, we want every duplicate removed, hence the keep=False parameter, which does exactly that.
A special note on the repeated df2. With only one df2, any row in df2 that is not in df1 won't be considered a duplicate and will remain, so the single-df2 version only works when df2 is a subset of df1. By concatenating df2 twice, every df2 row is guaranteed to be a duplicate and will subsequently be removed.
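To see why, here is a small sketch with a hypothetical df3 that contains a row not present in df1:
df3 = pd.DataFrame([[7, 8]], columns=['a', 'b'])         # [7, 8] is not in df1
pd.concat([df1, df3]).drop_duplicates(keep=False)        # [7, 8] survives, which is not what we want
pd.concat([df1, df3, df3]).drop_duplicates(keep=False)   # [7, 8] is duplicated and removed, leaving df1 - df3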
You can use .duplicated, which has the benefit of being fairly expressive:
%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]
1000 loops, best of 3: 875 µs per loop
For comparison:
%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
100 loops, best of 3: 4.57 ms per loop
%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
1000 loops, best of 3: 987 µs per loop
%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]
1000 loops, best of 3: 546 µs per loop
In sum, the np.array comparison is fastest; you don't need the .tolist() there.
To get a DataFrame with all records that are in DF1 but not in DF2:
DF = DF1[~DF1.isin(DF2)].dropna(how='all')
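Note that DataFrame.isin with another DataFrame matches element-wise on aligned index and column labels rather than by row membership, so the result depends on how the two frames are indexed. A usage sketch on the df1/df2 from the question:
df1[~df1.isin(df2)].dropna(how='all')   # with default RangeIndexes, only rows at shared index labels with equal values drop out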
A set logic approach. Turn the rows of df1 and df2 into sets, then use set subtraction to define a new DataFrame:
idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx1 - idx2), columns=df1.columns)
a b
0 3 4
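If you prefer to stay inside pandas, the same idea can be written with Index.difference; a sketch (not part of the original answer) that, like the set version, drops duplicate rows and the original index:
idx = df1.set_index(['a', 'b']).index.difference(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx), columns=['a', 'b'])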
My shot with merge, using df1 and df2 from the question.
Using the indicator parameter:
In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]:
a b
1 3 4
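If you also want the helper column gone, the same merge can be spelled as one chain; a sketch assuming the same df1/df2 as above:
(df1.merge(df2, on=['a', 'b'], how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns='_merge'))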
This solution works when your df_to_drop is a subset of the main data frame data (so the index labels match):
data_clean = data.drop(df_to_drop.index)
A masking approach
df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]
a b
1 3 4
I think the first tolist() needs to be removed, but keep the second one:
df1[df1.apply(lambda x: x.values not in df2.values.tolist(), axis=1)]
The easiest option is to use indexes.
Append df1 and df2 and reset the index:
df = pd.concat([df1, df2])
df.reset_index(drop=True, inplace=True)
e.g. this boolean mask marks the rows whose column values also appear in df2:
mask_df2 = (df["a"].isin(df2["a"])) & (df["b"].isin(df2["b"]))
result_index = df.index[~mask_df2]
result_data = df.loc[result_index, :]
Hope this helps new readers, although the question was posted a while ago :)
Solution if df1 contains duplicates + keeps the index.
A modified version of piRSquared's answer to keep the duplicates in df1 that do not appear in df2, while maintaining the index.
df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]
If your dataframes are big, you may want to store the result of
pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
in a variable before the df1.apply call.
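A sketch of what that caching looks like (the variable name diff is mine):
diff = pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
df1[df1.apply(lambda x: (x == diff).all(1).any(), axis=1)]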
Related
What is the proper way to go from this df:
>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
a b
0 jeff bob
1 bob jeff
2 jill mike
To this:
>>> df2
a b
0 jeff bob
2 jill mike
where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to their specific column.
I can hack together a solution using a lambda expression to create a mask and then drop duplicates based on the mask column, but I'm thinking there has to be a simpler way than this:
>>> df['c'] = df[['a', 'b']].apply(lambda x: ''.join(sorted((x[0], x[1]), \
key=lambda x: x[0]) + sorted((x[0], x[1]), key=lambda x: x[1] )), axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:,:-1]
I think you can sort each row independently and then use duplicated to see which ones to drop.
dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
df[~dupes]
A faster way to get dupes, thanks to @DSM:
dupes = df.T.apply(sorted).T.duplicated()
I think the simplest approach is to use apply with axis=1 to sort each row, and then call DataFrame.duplicated:
df = df[~df.apply(sorted, 1).duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
A bit more complicated, but very fast, is to use numpy.sort with the DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
df = df[~df1.duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
Timings:
np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100, size=N).astype(str),
                   'B': np.random.randint(100, size=N).astype(str)})
#print (df)
In [63]: %timeit (df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).duplicated()])
100 loops, best of 3: 3.25 ms per loop
In [64]: %timeit (df[~df.apply(sorted, 1).duplicated()])
1 loop, best of 3: 1.09 s per loop
#Ted Petrou solution1
In [65]: %timeit (df[~df.apply(lambda x: x.sort_values().values, axis=1).duplicated()])
1 loop, best of 3: 2.89 s per loop
#Ted Petrou solution2
In [66]: %timeit (df[~df.T.apply(sorted).T.duplicated()])
1 loop, best of 3: 1.56 s per loop
Assume we have df and df_drop:
df = pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]})
df_drop = df[df.A==df.B]
I want to delete df_drop from df without using the explicit conditions used when creating df_drop. I.e., I'm not after the solution df[df.A!=df.B], but would like to, basically, take df minus df_drop somehow. Hope this is clear enough; otherwise happy to elaborate!
You can merge both dataframes setting indicator=True and drop those rows where the indicator column is 'both':
out = pd.merge(df,df_drop, how='outer', indicator=True)
out[out._merge.ne('both')].drop('_merge', axis=1)
A B
1 2 1
2 3 1
Or as jon clements points out, if checking by index is enough, you could simply use:
df.drop(df_drop.index)
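This works here because df_drop is a slice of df, so its index labels all exist in df. If some labels might be missing, drop accepts errors='ignore'; a small sketch (df_clean is my name):
df_clean = df.drop(df_drop.index, errors='ignore')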
In this case, drop_duplicates works because the test criterion is the equality of two rows.
More generally, you can use loc to find the rows that meet or do not meet the specified criteria.
a = np.random.randint(1, 50, 100)
b = np.random.randint(1, 50, 100)
df = pd.DataFrame({'a': a, 'b': b})
criteria = df.a > 2 * df.b
df.loc[criteria, :]
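And the complement (the rows that do not meet the criteria) is just the negated mask:
df.loc[~criteria, :]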
Like this maybe:
In [1468]: pd.concat([df, df_drop]).drop_duplicates(keep=False)
Out[1468]:
A B
1 2 1
2 3 1
How do you scan whether a pandas DataFrame row contains a certain substring?
For example, I have a dataframe with 11 columns;
all of the columns contain names:
ID name1 name2 name3 ... name10
-------------------------------------------------------
AA AA_balls AA_cakee1 AA_lavender ... AA_purple
AD AD_cakee AD_cats AD_webss ... AD_ballss
CS CS_cakee CS_cats CS_webss ... CS_purble
.
.
.
I would like to get the rows which contain, say, "ball", and get the ID,
so the result would be ID 'AA' and ID 'AD', since AA_balls and AD_ballss are in those rows.
I have searched on Google but there seems to be no specific result for this;
people usually ask about searching for a substring in a specific column, not across all columns (a single row):
df[df["col_name"].str.contains("ball")]
The methods I have thought of are as follows (you can skip this if you have little time):
(1) Loop through the columns:
results = []
for col_name in col_names:
    results.append(df[df[col_name].str.contains('ball')])
and then drop duplicate rows which have the same ID values,
but this method would be very slow.
(2) Make the data frame into a 2-column dataframe by appending the name2-name10 columns into one column, and use df[df["concat_col"].str.contains("ball")]["ID"] to get the IDs and drop duplicates:
ID concat_col
AA AA_balls
AA AA_cakeee
AA AA_lavender
AA AA_purple
.
.
.
CS CS_purble
(3) Use the dataframe from (2) to make a dictionary,
where
dict[df["concat_col"].value] = df["ID"]
then get the IDs with
[value for key, value in programs.items() if 'ball' in key]
but in this method I need to loop through the dictionary, which is slow.
If there is a faster method without these processes,
I would prefer doing so.
If anyone knows about this,
I would appreciate it a lot if you kindly let me know :)
Thanks!
One idea is to use melt:
df = df.melt('ID')
a = df.loc[df['value'].str.contains('ball'), 'ID']
print (a)
0 AA
10 AD
Name: ID, dtype: object
Another:
df = df.set_index('ID')
a = df.index[df.applymap(lambda x: 'ball' in x).any(axis=1)]
Or:
mask = np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns])
a = df.loc[mask, 'ID']
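The same reduction can also be written with a column-wise apply instead of the list comprehension; a sketch that assumes every non-ID column holds strings (name_cols is my variable):
name_cols = df.columns.drop('ID')
mask = df[name_cols].apply(lambda s: s.str.contains('ball', regex=False)).any(axis=1)
a = df.loc[mask, 'ID']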
Timings:
np.random.seed(145)
L = list('abcdefgh')
df = pd.DataFrame(np.random.choice(L, size=(4000, 10)))
df.insert(0, 'ID', np.arange(4000).astype(str))
a = np.random.randint(4000, size=15)
b = np.random.randint(1, 10, size=15)
for i, j in zip(a, b):
    df.iloc[i, j] = 'AB_ball_DE'
#print (df)
In [85]: %%timeit
...: df1 = df.melt('ID')
...: a = df1.loc[df1['value'].str.contains('ball'), 'ID']
...:
10 loops, best of 3: 24.3 ms per loop
In [86]: %%timeit
...: df.loc[np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns]), 'ID']
...:
100 loops, best of 3: 12.8 ms per loop
In [87]: %%timeit
...: df1 = df.set_index('ID')
...: df1.index[df1.applymap(lambda x: 'ball' in x).any(axis=1)]
...:
100 loops, best of 3: 11.1 ms per loop
Maybe this might work?
mask = df.apply(lambda row: row.map(str).str.contains('word').any(), axis=1)
df.loc[mask]
Disclaimer: I haven't tested this. Perhaps the .map(str) isn't necessary.
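A quick way to try it on the example frame above, substituting 'ball' for 'word' (the ids name is mine):
mask = df.apply(lambda row: row.map(str).str.contains('ball').any(), axis=1)
ids = df.loc[mask, 'ID']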
I have plenty of DataFrames that need to be merged along axis=0, so it's important to find a fast way to do this operation.
So far I have tried merge and append, but these functions all need an assignment after merging, like df = df.append(df2), and become slower and slower. Is there another method which can merge in place and is more efficient?
Assuming your dataframes have the same index, you can use pd.concat:
In [61]: df1
Out[61]:
a
0 1
In [62]: df2
Out[62]:
b
0 2
Create a list of dataframes:
In [63]: df_list = [df1, df2]
Now, call pd.concat:
In [64]: pd.concat(df_list, axis=1)
Out[64]:
a b
0 1 2
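For the row-wise case the question actually asks about (axis=0), the same pattern applies: collect the frames in a list and concatenate once, instead of calling append repeatedly. A sketch, where frames stands in for however many DataFrames you have:
frames = [df1, df2]
big = pd.concat(frames, axis=0, ignore_index=True)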
I have a current df containing entries like this:
date tags ease
0 'date1' 'tag1' 1
1 'date1' 'tag1' 2
2 'date1' 'tag1' 1
3 'date1' 'tag2' 2
4 'date1' 'tag2' 2
5 'date2' 'tag1' 3
6 'date2' 'tag1' 1
7 'date2' 'tag2' 1
8 'date2' 'tag3' 1
I'd like to create a df (or some other type of array, if there is a better way to go about this; I'm green to Python and welcome suggestions) that counts the number of times a specific tag has a specific ease for each date in the df. For example, if I wanted to count the number of times each tag has an ease of 1, it would look something like this:
date1 date2
tag1 2 1
tag2 1 2
tag3 0 1
I can think of ways to do this using a loop, but the final outputs are going to be about 700 x 800 and I need to make one for each "ease". I feel like there must be an efficient way to do this using indexing, hence why I looked first to pandas. As I mentioned, I'm very new to Python, so if there are alternate approaches or packages I should consider using, I'm open to it.
I think you need boolean indexing with crosstab:
df1 = df[df['ease'] == 1]
df = pd.crosstab(df1['tags'], df1['date'])
print (df)
date 'date1' 'date2'
tags
'tag1' 2 1
'tag2' 0 1
'tag3' 0 1
Another solution: instead of crosstab, use groupby with size, and unstack for the reshape:
df = df[df['ease'] == 1].groupby(["date", "tags"]).size().unstack(level=0, fill_value=0)
print (df)
date 'date1' 'date2'
tags
'tag1' 2 1
'tag2' 0 1
'tag3' 0 1
EDIT:
After testing the solution I realised it is necessary to add reindex and sort_index, because filtering out the rows whose ease is not 1 can remove rows and columns from the final DataFrame.
print (df[df['ease'] == 1].groupby(["date", "tags"])
.size()
.unstack(level=0, fill_value=0)
.reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0)
.sort_index()
.sort_index(axis=1))
And also the second solution:
df1 = df[df['ease'] == 1]
df2 = (pd.crosstab(df1['tags'], df1['date'])
         .reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0)
         .sort_index()
         .sort_index(axis=1))
Timings:
(Psidom's second solution is wrong on a general df, so I omit it from the timings.)
np.random.seed(123)
N = 10000
dates = pd.date_range('2017-01-01', periods=100)
tags = ['tag' + str(i) for i in range(100)]
ease = range(10)
df = pd.DataFrame({'date': np.random.choice(dates, N),
                   'tags': np.random.choice(tags, N),
                   'ease': np.random.choice(ease, N)})
df = df.reindex_axis(['date','tags','ease'], axis=1)
#[10000 rows x 3 columns]
#print (df)
print (df.groupby(["date", "tags"]).agg({"ease": lambda x: (x == 1).sum()}).ease.unstack(level=0).fillna(0))
print (df[df['ease'] == 1].groupby(["date", "tags"]).size().unstack(level=0, fill_value=0).reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0).sort_index().sort_index(axis=1))
def jez(df):
df1 = df[df['ease'] == 1]
return pd.crosstab(df1['tags'], df1['date']).reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0).sort_index().sort_index(axis=1)
print (jez(df))
#Psidom solution
In [56]: %timeit (df.groupby(["date", "tags"]).agg({"ease": lambda x: (x == 1).sum()}).ease.unstack(level=0).fillna(0))
1 loop, best of 3: 1.94 s per loop
In [57]: %timeit (df[df['ease'] == 1].groupby(["date", "tags"]).size().unstack(level=0, fill_value=0).reindex(index=df.tags.unique(), columns=df.date.unique(), fill_value=0).sort_index().sort_index(axis=1))
100 loops, best of 3: 5.74 ms per loop
In [58]: %timeit (jez(df))
10 loops, best of 3: 54.5 ms per loop
Here is one option; Use groupby.agg to calculate the count, and then unstack the result to wide format:
(df.groupby(["date", "tags"])
.agg({"ease": lambda x: (x == 1).sum()})
.ease.unstack(level=0).fillna(0))
Or if you like to use crosstab:
pd.crosstab(df.tags, df.date, df.ease == 1, aggfunc="sum").fillna(0)
# date 'date1' 'date2'
# tags
#'tag1' 2.0 1.0
#'tag2' 0.0 1.0
#'tag3' 0.0 1.0
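The crosstab variant returns floats because of the NaN fill; if you want integer counts, casting at the end should do it (a small sketch):
pd.crosstab(df.tags, df.date, df.ease == 1, aggfunc="sum").fillna(0).astype(int)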
You could look at using the pivot_table method on the DataFrame with a function of your own that only counts when the condition you want is true. This should then also fill tags and dates where there is no data with a 0. Something like:
def calc(column):
    total = 0
    for e in column:
        if e == 1:
            total += 1
    return total

check_res = df.pivot_table(index='tags', columns='date', values='ease', aggfunc=calc, fill_value=0)
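For what it's worth, the same custom count can be written with a lambda as the aggfunc instead of a named function; a sketch along the same lines:
check_res = df.pivot_table(index='tags', columns='date', values='ease',
                           aggfunc=lambda s: (s == 1).sum(), fill_value=0)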