Add multiple frames together (more than two) - python

How would I add multiple dataframes together? I have about 10 dataframes to sum up. I tried
df_add = df1.add(df2, df3, fill_value=0)
and it doesn't work.
This is the code to create the dfs:
df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['a','b'])
a b
0 1 2
1 3 4
2 5 6
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
df3 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
Now how would I be able to add these so that the result is:
a b
0 201 402
1 603 804
2 1005 1206

Looks like the simple
sum((df1,df2,df3))
works and gives:
a b
0 201 402
1 603 804
2 1005 1206

Just as an addendum,
df1 + df2 + df3
works just fine. If performance is important and index alignment is not needed, consider the vectorized np.sum to avoid pandas overhead:
%timeit np.sum([df1.values,df2.values,df3.values],axis=0)
27.2 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df1+df2+df3
1.21 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sum((df1,df2,df3))
2.04 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
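For the original use case of roughly ten frames whose indexes may not all line up, one further option (a minimal sketch, not taken from the answers above) is to fold the list with DataFrame.add and fill_value=0, so labels missing from one frame count as zero instead of producing NaN:
from functools import reduce
import pandas as pd
# hypothetical list holding all the frames to sum (df1..df3 as defined above); extend as needed
dfs = [df1, df2, df3]
# pairwise add with fill_value=0 so non-overlapping labels keep their own value
df_total = reduce(lambda left, right: left.add(right, fill_value=0), dfs)
print(df_total)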

Related

How do I check if a value already appeared in pandas df column?

I have a DataFrame of stock prices...
I wish to have a boolean column that indicates whether the price has reached a certain threshold in the current or any previous row.
My output should be something like this (let's say my threshold is 100):
index  price  bool
0      98     False
1      99     False
2      100.5  True
3      101    True
4      99     True
5      98     True
I've managed to do this with the following code but it's not efficient and takes a lot of time:
(df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
Please, any suggestions?
Use a comparison and cummax:
threshold = 100
df['bool'] = df['price'].ge(threshold).cummax()
Note that it would work the other way around (although maybe less efficiently*):
threshold = 100
df['bool'] = df['price'].cummax().ge(threshold)
Output:
index price bool
0 0 98.0 False
1 1 99.0 False
2 2 100.5 True
3 3 101.0 True
4 4 99.0 True
5 5 98.0 True
* indeed on a large array:
%%timeit
df['price'].ge(threshold).cummax()
# 193 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['price'].cummax().ge(threshold)
# 309 µs ± 4.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Timing:
# setting up a dummy example with 10M rows
np.random.seed(0)
df = pd.DataFrame({'price': np.random.choice([0,1], p=[0.999,0.001], size=10_000_000)})
threshold = 0.5
## comparison
%%timeit
df['bool'] = (df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
# 271 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['bool'] = df['price'].ge(threshold).cummax()
# 109 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df['bool'] = np.maximum.accumulate(df['price'].to_numpy()>threshold)
# 75.8 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
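For completeness, a minimal runnable sketch of the approach above on the question's example data:
import pandas as pd
# example prices taken from the question's table
df = pd.DataFrame({'price': [98, 99, 100.5, 101, 99, 98]})
threshold = 100
# True once the price has reached the threshold in this or any earlier row
df['bool'] = df['price'].ge(threshold).cummax()
print(df)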

Fastest ways to filter for values in pandas dataframe

A similar dataframe can be created:
import pandas as pd
df = pd.DataFrame()
df["nodes"] = list(range(1, 11))
df["x"] = [1,4,9,12,27,87,99,121,156,234]
df["y"] = [3,5,6,1,8,9,2,1,0,-1]
df["z"] = [2,3,4,2,1,5,9,99,78,1]
df.set_index("nodes", inplace=True)
So the dataframe looks like this:
x y z
nodes
1 1 3 2
2 4 5 3
3 9 6 4
4 12 1 2
5 27 8 1
6 87 9 5
7 99 2 9
8 121 1 99
9 156 0 78
10 234 -1 1
My first try for searching e.g. all nodes containing number 1 is:
>>> df[(df == 1).any(axis=1)].index.values
[1 4 5 8 10]
As I have to do this for many numbers and my real dataframe is much bigger than this one, I'm searching for a very fast way to do this.
I just tried something that may be enlightening.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df.set_index("A", inplace=True)
df_no_index = df.reset_index()
So this sets up a dataframe with ints all the way through. It is not the same as yours, but it will suffice.
Then I ran four tests:
%timeit df[(df == 1).any(axis=1)].index.values
%timeit df[(df['B'] == 1) | (df['C']==1)| (df['D']==1)].index.values
%timeit df_no_index[(df_no_index == 1).any(axis=1)].A.values
%timeit df_no_index[(df_no_index['B'] == 1) | (df_no_index['C']==1)| (df_no_index['D']==1)].A.values
The results I got were:
940 µs ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.08 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.55 ms ± 51.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This showed that the initial method you took, using the index, seems to be the fastest of these approaches. Removing the index does not improve the speed on a moderately sized dataframe.
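Since the question mentions repeating the lookup for many numbers, one further option (a sketch only, not benchmarked above) is np.isin on the underlying array, which tests all target values in a single pass:
import numpy as np
import pandas as pd
# same example frame as in the question
df = pd.DataFrame({
    "nodes": list(range(1, 11)),
    "x": [1, 4, 9, 12, 27, 87, 99, 121, 156, 234],
    "y": [3, 5, 6, 1, 8, 9, 2, 1, 0, -1],
    "z": [2, 3, 4, 2, 1, 5, 9, 99, 78, 1],
}).set_index("nodes")
targets = [1, 99]  # hypothetical list of numbers to look for
# one boolean membership test over the raw values, reduced across columns
mask = np.isin(df.to_numpy(), targets).any(axis=1)
print(df.index.values[mask])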

How do I count specific values across multiple columns in pandas

I have the DataFrame
df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5]
})
I would like to count the number of '?' in each column and return the following output:
colA - 2
colB - 1
colC - 1
Is there a way to return this output at once? Right now the only way I know how to do it is to write a for loop over the columns.
It looks like the simple way is
df[df == '?'].count()
The result is:
colA 2
colB 1
colC 1
dtype: int64
where df[df == '?'] gives us a DataFrame with '?' and NaN:
colA colB colC
0 ? NaN ?
1 NaN NaN NaN
2 NaN ? NaN
3 NaN NaN NaN
4 ? NaN NaN
and count() then counts the non-NA cells in each column.
Please also look at the other solutions: they are readable and among the fastest.
You can use numpy.count_nonzero here.
pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
# pd.Series((df.values == '?').sum(0), index=df.columns)
colA 2
colB 1
colC 1
dtype: int64
Timeit results:
Benchmarking with df of shape (1_000_000, 3)
big_df = pd.DataFrame(df.to_numpy().repeat(200_000,axis=0))
big_df.shape
(1000000, 3)
In [186]: %timeit pd.Series(np.count_nonzero(big_df.to_numpy()=='?', axis=0), index=big_df.columns)
53.1 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [187]: %timeit big_df.eq('?').sum()
171 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [188]: %timeit big_df[big_df == '?'].count()
314 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [189]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, big_df.values), index=big_df.columns)
174 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can do a sum:
df.eq('?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64
@Bear Brown's answer is probably the most elegant; a faster option is to use numpy:
from collections import Counter
%%timeit
df[df == '?'].count()
5.2 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.values), index=df.columns)
218 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A variation on BENY's answer:
(df=='?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64
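If the same count is ever needed for several placeholder tokens at once, DataFrame.isin followed by sum is a possible variation (a sketch; the extra 'NA' token is only a hypothetical example):
import pandas as pd
df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5]
})
# count every cell equal to any of the listed tokens, per column
print(df.isin(['?', 'NA']).sum())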

avoid repeating the dataframe name when operating on pandas columns

Very much a beginner question, sorry: is there a way to avoid repeating the dataframe name when operating on pandas columns?
In R, data.table allows you to operate on a column without repeating the dataframe name, like this:
very_long_dt_name = data.table::data.table(col1=c(1,2,3),col2=c(3,3,1))
# operate on the columns without repeating the dt name:
very_long_dt_name[,ratio:=round(col1/col2,2)]
I couldn't figure out how to do it with pandas in Python so I keep repeating the df name:
data = {'col1': [1,2,3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)
# operating on the columns requires repeating the df name
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
I'm sure there's a way to avoid it but I can't find anything on Google. Any hint please? Thank you.
Try assign:
very_long_df_name.assign(ratio=lambda x: np.round(x.col1/x.col2,2))
Output:
col1 col2 ratio
0 1 3 0.33
1 2 3 0.67
2 3 1 3.00
Edit: to reflect comments, tests on 1 million rows:
%%timeit
very_long_df_name.assign(ratio = lambda x:x.col1/x.col2)
# 18.6 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and
%%timeit
very_long_df_name['ratio'] = very_long_df_name['col1']/very_long_df_name['col2']
# 13.3 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And with np.round, assign
%%timeit
very_long_df_name.assign(ratio = lambda x: np.round(x.col1/x.col2,2))
# 64.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and not-assign:
%%timeit
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
# 55.8 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it appears that assign is vectorized, just not as well tuned.
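Another way to avoid repeating the frame name, offered here as a sketch rather than as part of the answer above, is DataFrame.eval, which resolves column names inside an expression string (the rounding is done in a separate step here):
import pandas as pd
data = {'col1': [1, 2, 3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)
# column names are resolved inside the expression string,
# so the frame name appears only once per statement
very_long_df_name = very_long_df_name.eval('ratio = col1 / col2')
very_long_df_name['ratio'] = very_long_df_name['ratio'].round(2)
print(very_long_df_name)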

Is it possible to use apply function or vectorization on this code logic?

I am trying to calculate a closing balance.
Input dataframe:
open inOut close
0 3 100 0
1 0 300 0
2 0 200 0
3 0 230 0
4 0 150 0
Output DataFrame
open inOut close
0 3 100 103
1 103 300 403
2 403 200 603
3 603 230 833
4 833 150 983
I am able to achieve this using a crude for-loop, and to optimize it I have used iterrows().
For-Loop
%%timeit
for i in range(len(df.index)):
    if i > 0:
        df.iloc[i]['open'] = df.iloc[i-1]['close']
        df.iloc[i]['close'] = df.iloc[i]['open'] + df.iloc[i]['inOut']
    else:
        df.iloc[i]['close'] = df.iloc[i]['open'] + df.iloc[i]['inOut']
1.64 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
iterrows
%%timeit
for index, row in dfOg.iterrows():
    if index > 0:
        row['open'] = dfOg.iloc[index-1]['close']
        row['close'] = row['open'] + row['inOut']
    else:
        row['close'] = row['open'] + row['inOut']
627 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Performance optimized from 1.64 ms -> 627 µs.
As per this blog, I am struggling to figure out how to write the above logic using apply() and vectorization.
For vectorization, I tried shifting the columns but was not able to achieve the desired output.
Edit: I changed things around to match the edits OP made to the question
You can do what you want in a vectorized way without any loops like this:
import pandas as pd
d = {'open': [3] + [0]*4, 'inOut': [100, 300, 200, 230, 150], 'close': [0]*5}
df = pd.DataFrame(d)
df['close'].values[:] = df['open'].values[0] + df['inOut'].values.cumsum()
df['open'].values[1:] = df['close'].values[:-1]
Timing with %%timeit:
529 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Output:
close inOut open
0 103 100 3
1 403 300 103
2 603 200 403
3 833 230 603
4 983 150 833
So vectorizing your code this way is indeed somewhat faster. In fact, it's probably about as fast as possible. You can see this by timing just the dataframe creation code:
%%timeit
d = {'open': [3] + [0]*4, 'inOut': [100, 300, 200, 230, 150], 'close': [0]*5}
df = pd.DataFrame(d)
Result:
367 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Subtracting out the time it takes to create the dataframe, the vectorized version of filling in your dataframe only takes roughly 160 µs.
You can use np.where:
%%timeit
df['open'] = np.where(df.index==0, df['open'], df['inOut'].shift())
df['close'] = df['open'] + df['inOut']
# 1.07 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Output:
open inOut close
0 3.0 100 103.0
1 100.0 300 300.0
2 300.0 200 200.0
3 200.0 230 230.0
4 230.0 150 150.0
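For reference, a pandas-only sketch of the same cumulative idea (not taken from either answer above; it assumes pandas >= 0.24 for shift's fill_value):
import pandas as pd
df = pd.DataFrame({
    'open': [3, 0, 0, 0, 0],
    'inOut': [100, 300, 200, 230, 150],
    'close': [0, 0, 0, 0, 0],
})
# each close is the initial open plus all inOut so far,
# and each open is simply the previous row's close
df['close'] = df['open'].iloc[0] + df['inOut'].cumsum()
df['open'] = df['close'].shift(fill_value=df['open'].iloc[0])
print(df)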
