Merge multiple DataFrames in an efficient manner - python

I have plenty of DataFrames need to be merged by axis=0, it's important to find some fast way to do this op.
So far, I try merge and append, but these functions all need assignment after merging like df = df.append(df2) and will become slower and slower, is there some another method which can merge in place and more efficient?

Assuming your dataframes have the same index, you can use pd.concat:
In [61]: df1
Out[61]:
a
0 1
In [62]: df2
Out[62]:
b
0 2
Create a list of dataframes:
In [63]: df_list = [df1, df2]
Now, call pd.concat:
In [64]: pd.concat(df_list, axis=1)
Out[64]:
a b
0 1 2

Related

Pandas drop subset of dataframe

Assume we have df and df_drop:
df = pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]})
df_drop = df[df.A==df.B]
I want to delete df_drop from df without using the explicit conditions used when creating df_drop. I.e. I'm not after the solution df[df.A!=df.B], but would like to, basically, take df minus df_drop somehow. Hopes this is clear enough. Otherwise happy to elaborate!
You can merge both dataframes setting indicator=True and drop those columns where the indicator column is both:
out = pd.merge(df,df_drop, how='outer', indicator=True)
out[out._merge.ne('both')].drop('_merge',1)
A B
1 2 1
2 3 1
Or as jon clements points out, if checking by index is enough, you could simply use:
df.drop(df_drop.index)
In this case, drop_duplicates works because the test criteria is the equality of two rows.
More generally, you can use loc to find the rows that meet or do not meet the specified criteria.
a = np.random.randint(1, 50, 100)
b = np.random.randint(1, 50, 100)
df = pd.DataFrame({'a': a, 'b': b})
criteria = df.a > 2 * df.b
df.loc[criteria, :]
Like this maybe:
In [1468]: pd.concat([df, df_drop]).drop_duplicates(keep=False)
Out[1468]:
A B
1 2 1
2 3 1

Apply the same operation to multiple DataFrames efficiently

I have two data frames with the same columns, and similar content.
I'd like apply the same functions on each, without having to brute force them, or concatenate the dfs. I tried to pass the objects into nested dictionaries, but that seems more trouble than it's worth (I don't believe dataframe.to_dict supports passing into an existing list).
However, it appears that the for loop stores the list of dfs in the df object, and I don't know how to get it back to the original dfs... see my example below.
df1 = {'Column1': [1,2,2,4,5],
'Column2': ["A","B","B","D","E"]}
df1 = pd.DataFrame(df1, columns=['Column1','Column2'])
df2 = {'Column1': [2,11,2,2,14],
'Column2': ["B","Y","B","B","V"]}
df2 = pd.DataFrame(df2, columns=['Column1','Column2'])
def filter_fun(df1, df2):
for df in (df1, df2):
df = df[(df['Column1']==2) & (df['Column2'].isin(['B']))]
return df1, df2
filter_fun(df1, df2)
If you write the filter as a function you can apply it in a list comprehension:
def filter(df):
return df[(df['Column1']==2) & (df['Column2'].isin(['B']))]
df1, df2 = [filter(df) for df in (df1, df2)]
I would recommend concatenation with custom specified keys, because 1) it is easy to assign it back, and 2) you can do the same operation once instead of twice.
# Concatenate df1 and df2
df = pd.concat([df1, df2], keys=['a', 'b'])
# Perform your operation
out = df[(df['Column1'] == 2) & df['Column2'].isin(['B'])]
out.loc['a'] # result for `df1`
Column1 Column2
1 2 B
2 2 B
out.loc['b'] # result for `df2`
Column1 Column2
0 2 B
2 2 B
3 2 B
This should work fine for most operations. For groupby, you will want to group on the 0th index level as well.

Pandas concatenate alternating columns

I have two dataframes as follows:
df2 = pd.DataFrame(np.random.randn(5,2),columns=['A','C'])
df3 = pd.DataFrame(np.random.randn(5,2),columns=['B','D'])
I wish to get the columns in an alternating fashion such that I get the result below:
df4 = pd.DataFrame()
for i in range(len(df2.columns)):
df4[df2.columns[i]]=df2[df2.columns[i]]
df4[df3.columns[i]]=df3[df3.columns[i]]
df4
A B C D
0 1.056889 0.494769 0.588765 0.846133
1 1.536102 2.015574 -1.279769 -0.378024
2 -0.097357 -0.886320 0.713624 -1.055808
3 -0.269585 -0.512070 0.755534 0.855884
4 -2.691672 -0.597245 1.023647 0.278428
I think I'm being really inefficient with this solution. What is the more pythonic/ pandic way of doing this?
p.s. In my specific case the column names are not A,B,C,D and aren't alphabetically arranged. Just so know which two dataframes I want to combine.
If you need something more dynamic, first zip both columns names of both DataFrames and then flat it:
df5 = pd.concat([df2, df3], axis=1)
print (df5)
A C B D
0 0.874226 -0.764478 1.022128 -1.209092
1 1.411708 -0.395135 -0.223004 0.124689
2 1.515223 -2.184020 0.316079 -0.137779
3 -0.554961 -0.149091 0.179390 -1.109159
4 0.666985 1.879810 0.406585 0.208084
#http://stackoverflow.com/a/10636583/2901002
print (list(sum(zip(df2.columns, df3.columns), ())))
['A', 'B', 'C', 'D']
print (df5[list(sum(zip(df2.columns, df3.columns), ()))])
A B C D
0 0.874226 1.022128 -0.764478 -1.209092
1 1.411708 -0.223004 -0.395135 0.124689
2 1.515223 0.316079 -2.184020 -0.137779
3 -0.554961 0.179390 -0.149091 -1.109159
4 0.666985 0.406585 1.879810 0.208084
How about this?
df4 = pd.concat([df2, df3], axis=1)
Or do they have to be in a specific order? Anyway, you can always reorder them:
df4 = df4[['A','B','C','D']]
And without writing out the columns:
df4 = df4[[item for items in zip(df2.columns, df3.columns) for item in items]]
You could concat and then reindex_axis.
df = pd.concat([df2, df3], axis=1)
df.reindex_axis(df.columns[::2].tolist() + df.columns[1::2].tolist(), axis=1)
Append even indices to df2 columns and odd indices to df3 columns. Use these new levels to sort.
df2_ = df2.T.set_index(np.arange(len(df2.columns)) * 2, append=True).T
df3_ = df3.T.set_index(np.arange(len(df3.columns)) * 2 + 1, append=True).T
df = pd.concat([df2_, df3_], axis=1).sort_index(1, 1)
df.columns = df.columns.droplevel(1)
df

How to remove a pandas dataframe from another dataframe

How to remove a pandas dataframe from another dataframe, just like the set subtraction:
a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]
And now we have two pandas dataframe, how to remove df2 from df1:
In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
a b
0 1 2
1 3 4
2 5 6
In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
a b
0 1 2
1 5 6
Then we expect df1-df2 result will be:
In [14]: df
Out[14]:
a b
0 3 4
How to do it?
Thank you.
Solution
Use pd.concat followed by drop_duplicates(keep=False)
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
It looks like
a b
1 3 4
Explanation
pd.concat adds the two DataFrames together by appending one right after the other. if there is any overlap, it will be captured by the drop_duplicates method. However, drop_duplicates by default leaves the first observation and removes every other observation. In this case, we want every duplicate removed. Hence, the keep=False parameter which does exactly that.
A special note to the repeated df2. With only one df2 any row in df2 not in df1 won't be considered a duplicate and will remain. This solution with only one df2 only works when df2 is a subset of df1. However, if we concat df2 twice, it is guaranteed to be a duplicate and will subsequently be removed.
You can use .duplicated, which has the benefit of being fairly expressive:
%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]
1000 loops, best of 3: 875 µs per loop
For comparison:
%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
100 loops, best of 3: 4.57 ms per loop
%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
1000 loops, best of 3: 987 µs per loop
%timeit df2[df2.apply(lambda x: x.value not in df2.values, axis=1)]
1000 loops, best of 3: 546 µs per loop
In sum, using the np.array comparison is fastest. Don't need the .tolist() there.
To get dataframe with all records which are in DF1 but not in DF2
DF=DF1[~DF1.isin(DF2)].dropna(how = 'all')
A set logic approach. Turn the rows of df1 and df2 into sets. Then use set subtraction to define new DataFrame
idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx1 - idx2), columns=df1.columns)
a b
0 3 4
My shot with merge df1 and df2 from the question.
Using 'indicator' parameter
In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]:
a b
1 3 4
This solution works when your df_to_drop is a subset of main data frame data.
data_clean = data.drop(df_to_drop.index)
A masking approach
df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]
a b
1 3 4
I think the first tolist() needs to be removed, but keep the second one:
df1[df1.apply(lambda x: x.values() not in df2.values.tolist(), axis=1)]
An easiest option is to use indexes.
Append df1 and df2 and reset their indexes.
df = df1.concat(df2)
df.reset_index(inplace=True)
e.g:
This will give df2 indexes
indexes_df2 = df.index[ (df["a"].isin(df2["a"]) ) & (df["b"].isin(df2["b"]) )
result_index = df.index[~index_df2]
result_data = df.iloc[ result_index,:]
Hope it will help to new readers, although the question posted a little time ago :)
Solution if df1 contains duplicates + keeps the index.
A modified version of piRSquared's answer to keep the duplicates in df1 that do not appear in df2, while maintaining the index.
df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]
If your dataframes are big, you may want to store the result of
pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
in a variable before the df1.apply call.

python pandas: computing argmax of column in matrix subset

Consider toy dataframes df1 and df2, where df2 is a subset of df1 (excludes the first row).
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'colA':[3.0,9,45,7],'colB':['A','B','C','D']})
df2 = df1[1:]
Now lets find argmax of colA for each frame
np.argmax(df1.colA) ## result is "2", which is what I expected
np.argmax(df2.colA) ## result is still "2", which is not what I expected. I expected "1"
If my matrix of insterest is df2, how do I get around this indexing issue? Is this quirk related to pandas, numpy, or just python memory?
I think it's due to index. You could use reset_index when you assign df2:
df1 = pd.DataFrame({'colA':[3.0,9,45,7],'colB':['A','B','C','D']})
df2 = df1[1:].reset_index(drop=True)
In [464]: np.argmax(df1.colA)
Out[464]: 2
In [465]: np.argmax(df2.colA)
Out[465]: 1
I think it's better to use method argmax instead of np.argmax:
In [467]: df2.colA.argmax()
Out[467]: 1
You need to reset the index of df2:
df2.reset_index(inplace=True, drop=True)
np.argmax(df2.colA)
>> 1

Categories