Usecase and conclusions
Create a 2x2 table filled with integers, change the values of one specific row, and access several rows. For this I was planning to use a pandas DataFrame, but I was very disappointed with the performance. The conclusion shows an incompressible cost for pandas DataFrame access when the data is a small table:
- pandas.loc = pandas.iloc = 115 µs
- pandas.iat = 5 µs (20 times faster, but only access one cell)
- numpy access = 0.5 µs (200 times faster, acceptable performance)
Am I using pandas DataFrame incorrectly? Is it supposed to be used only for massive tables of data? Given that my goal is to use a very simple MultiIndex (type 1, type 2 and date), is there an existing data structure that would offer performance similar to numpy arrays while being as easy to use as a pandas DataFrame?
The other option I am considering is to create my own class of MultiIndexed numpy arrays.
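For reference, a minimal sketch of what such a wrapper might look like; the class name and interface are hypothetical, and a flat Index is used for brevity (a MultiIndex would work the same way through get_loc):
import numpy as np
import pandas as pd

class IndexedArray:
    """Hypothetical sketch: numpy storage plus a pandas Index for label lookups."""
    def __init__(self, values, index):
        self.values = np.asarray(values)   # raw storage, written to at numpy speed
        self.index = pd.Index(index)       # used only to translate labels into positions

    def set_row(self, label, row):
        self.values[self.index.get_loc(label), :] = row

    def get_row(self, label):
        return self.values[self.index.get_loc(label), :]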
Code of the usecase
import pandas as pd
import numpy as np
from datetime import datetime
import time
data = [[1,2],[3,4]]
array = np.array(data)
index = pd.DatetimeIndex([datetime(2000, 2, 2), datetime(2000, 2, 4)])
df = pd.DataFrame(data=data, index=index)
With pandas, assigning a row using the loc function takes around 0.1 ms:
%timeit df.loc[datetime(2000,2,2)] = [12, 42]
115 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
iloc has the same poor performance, although I would intuitively have thought it would be faster:
%timeit df.iloc[0] = [12, 42]
114 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Testing iat, we are down to 5 µs, which is more acceptable
%timeit df.iat[0,0] = 42
5.03 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally, testing the same behavior with a numpy array, we get excellent performance:
%timeit array[0,:] = [12, 42]
705 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
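For comparison, one possible workaround (an assumption on my part, not part of the benchmarks above) is to keep the numpy array as the working storage and use the DatetimeIndex only to translate labels into positions:
row = index.get_loc(datetime(2000, 2, 2))   # label -> integer position, computed once
array[row, :] = [12, 42]                    # write at numpy speed
df_view = pd.DataFrame(array, index=index)  # rebuild a DataFrame only when the pandas API is needed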
Related
Given a large DataFrame df, which is faster in general?
# combining the masks first
sub_df = df[(df["column1"] < 5) & (df["column2"] > 10)]
# applying the masks sequentially
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]
The first approach only selects from the DataFrame once, which may be faster; however, the second selection in the second approach only has to consider a smaller DataFrame.
It depends on your dataset.
First, let's generate a DataFrame in which the first condition drops almost all rows:
n = 1_000_000
p = 0.0001
np.random.seed(0)
df = pd.DataFrame({'column1': np.random.choice([0, 6], size=n, p=[p, 1-p]),
                   'column2': np.random.choice([0, 20], size=n)})
And as expected:
# simultaneous conditions
5.69 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# successive slicing
2.99 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is faster to first generate a small intermediate DataFrame.
Now, let's change the probability to p = 0.9999. This means that the first condition will remove very few rows.
We could expect both solutions to run with a similar speed, but:
# simultaneous conditions
27.5 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# successive slicing
55.7 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now the overhead of creating the intermediate DataFrame is not negligible.
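If the dataset's characteristics are not known up front, one hedged option (my own suggestion, not from the answer above) is to measure the selectivity of the first condition on the fly and pick a strategy from it; the mean of a boolean mask gives the fraction of rows it keeps:
mask1 = df["column1"] < 5
if mask1.mean() < 0.5:                          # the 0.5 threshold is arbitrary, for illustration only
    sub_df = df[mask1]                          # selective first condition: slice first
    sub_df = sub_df[sub_df["column2"] > 10]
else:
    sub_df = df[mask1 & (df["column2"] > 10)]   # unselective first condition: combine the masks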
I need to group a DataFrame and apply several chained functions on each group.
My problem is basically the same as in pandas - Groupby two functions: apply cumsum then shift on each group.
There are answers there on how to obtain a correct result; however, they seem to have suboptimal performance. My specific question is thus: is there a more efficient way than the ones I describe below?
First here is some large testing data:
from string import ascii_lowercase
import numpy as np
import pandas as pd
n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
Below is the performance of each function:
%timeit df.groupby("x")["y"].cumsum()
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby("x")["y"].shift()
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A basic solution is to group twice. It seems suboptimal since grouping is a large part of the total runtime and should only be done once.
%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The accepted answer to the aforementioned question suggests using apply with a custom function to avoid this issue. However, for some reason it actually performs much worse than the previous solution.
def cumsum_shift(s):
    return s.cumsum().shift()
%timeit df.groupby("x")["y"].apply(cumsum_shift)
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any idea how to optimize this code? Especially in a case where I'd like to chain more than two functions, performance gains can become quite significant.
Let me know if this helps; a few weeks back I was having the same issue.
I solved it by just splitting the code and creating a separate groupby object, which contains the information about the groups.
# creating groupby object
g = df.groupby('x')['y']
%timeit g.cumsum()
592 ms ± 8.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit g.shift()
1.7 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would suggest giving transform a try instead of apply; for example:
%timeit df.groupby("x")["y"].transform(np.cumsum).transform(lambda x: x.shift())
or also try pandas' built-in pipe on the groupby object:
%timeit df.groupby("x").pipe(lambda g: g["y"].cumsum().shift())
I am pretty sure that pipe can be more efficient than apply or transform
Let us know if it works well
I am surprised to find that adding a column to a DataFrame is so time consuming, and that doing a reset_index first improves the performance to such a great degree. Below is a code snippet showing the performance difference.
import pandas as pd
import numpy as np
import time
np.random.seed(0)
df = pd.DataFrame(np.random.uniform(size=(1000, 3)))
sub_df = df.loc[30:32, :]
sub_df_1 = sub_df.reset_index(drop=True)
Now I am going to add a new column to sub_df and to sub_df_1 (which is the reset_index version of sub_df):
%timeit sub_df[3] = 1
201 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit sub_df_1[3] = 1
139 µs ± 8.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The performance improvement is larger than 1000 times! It is also worth noting that without reset_index, it takes 0.2 seconds to add a column to a three-row DataFrame...
Is there any reason for this difference in performance? Thanks in advance!
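For what it is worth, a hedged workaround (an assumption on my part based on how pandas tracks slices of a parent DataFrame, not a confirmed explanation): taking an explicit copy of the slice appears to avoid the slow path in the same way reset_index does:
sub_df_2 = df.loc[30:32, :].copy()   # explicit copy of the slice instead of reset_index
%timeit sub_df_2[3] = 1              # assumption: runs in the microsecond range, comparable to sub_df_1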
I often find myself looping over some long INPUT list (or dataframe, or dictionary). Per iteration I do some calculations on the input data, I then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list, and aggregate some dummy results into a dataframe. Approach 1 is very slow (~3 seconds per run), whereas Approach 2 is very fast (~18 ms per run). Approach 1 is not good, because it is slow. Approach 2 is faster, but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd
def run1(long_list):
    my_df = pd.DataFrame(columns=['A', 'B', 'C'])
    for el in long_list:
        my_df.loc[len(my_df)] = [el, el+1, 1/el]  # dummy calculations, appending one row at a time
    return my_df
def run2(long_list):
    with open('my_file.csv', 'w') as f:
        f.write('A,B,C\n')
        for el in long_list:
            f.write(f'{el},{el+1},{1/el}\n')  # dummy calculations
    return pd.read_csv('my_file.csv')
long_list = range(1, 2000)
%timeit df1 = run1(long_list) # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list) # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
    my_df = pd.DataFrame(long_list, columns=['input'])
    my_df = my_df.assign(
        A=my_df.input,
        B=my_df.input + 1,
        C=1 / my_df.input)
    return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
All in memory
Really fast
Easy to read
Cons:
Probably not as fast as vectorized Numpy operations (see the sketch right after this list)
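To illustrate that last point, here is a hedged sketch of a NumPy-vectorized variant (run4 is a name introduced here for illustration; it assumes the same dummy calculations as above):
import numpy as np

def run4(long_list):
    arr = np.asarray(long_list, dtype=float)                      # one bulk conversion instead of a Python loop
    return pd.DataFrame({'A': arr, 'B': arr + 1, 'C': 1 / arr})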
You can directly build a DataFrame from a list of lists:
def run3(long_list):
    return pd.DataFrame([[el, el+1, 1/el] for el in long_list],
                        columns=['A', 'B', 'C'])
It should be much faster than the first one, and still faster than the second one, because it does not use disk I/O.
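As another hedged variant (my own suggestion): if the string-building approach of run2 is appealing, the on-disk file can be replaced with an in-memory io.StringIO buffer, which keeps everything in memory while still letting pandas' fast CSV parser build the final DataFrame:
import io

def run2_in_memory(long_list):
    buf = io.StringIO()                     # in-memory text buffer instead of a file on disk
    buf.write('A,B,C\n')
    for el in long_list:
        buf.write(f'{el},{el+1},{1/el}\n')  # same dummy calculations as run2
    buf.seek(0)                             # rewind so read_csv starts at the beginning
    return pd.read_csv(buf)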
I'm applying the harmonic mean from scipy.stats as the aggfunc parameter in a pandas pivot_table, but it is slower than a simple mean by orders of magnitude.
I would like to know whether this is expected behavior, or whether there is a way to make this calculation more efficient, as I need to perform it thousands of times.
I need to use the harmonic mean, but it is taking a huge amount of processing time.
I've tried harmonic_mean from the statistics module of Python 3.6, but the overhead is still the same.
Thanks
import numpy as np
import pandas as pd
import statistics
from scipy.stats import hmean
data = pd.DataFrame({'value1': np.random.randint(1000, size=200000),
                     'value2': np.random.randint(24, size=200000),
                     'value3': np.random.rand(200000) + 1,
                     'value4': np.random.randint(100000, size=200000)})
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=hmean)
1.74 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
1.9 s ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=np.mean)
37.4 ms ± 938 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Single run for both functions
%timeit hmean(data.value3[:100])
155 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.mean(data.value3[:100])
138 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I would recommend using multiprocessing.Pool. The code below has been tested on 20 million records and is about 3 times faster than the original, so please give it a try; the code surely still needs more improvements to fully answer your specific question about the slow performance of statistics.harmonic_mean.
Note: you can get even better results with more than 100 million records.
import time
import numpy as np
import pandas as pd
import statistics
import multiprocessing
data = pd.DataFrame({'value1': np.random.randint(1000, size=20000000),
                     'value2': np.random.randint(24, size=20000000),
                     'value3': np.random.rand(20000000) + 1,
                     'value4': np.random.randint(100000, size=20000000)})
def chunk_pivot(data):
    result = pd.pivot_table(data, index='value1', columns='value2', values='value3',
                            aggfunc=lambda x: statistics.harmonic_mean(list(x)))
    return result
DataFrameDict = []
for i in range(4):
    print(i*250, i*250+250)
    DataFrameDict.append(data[data.value1.between(i*250, i*250+249)])
def parallel_pivot(prcsr):
    # prcsr is the number of worker processes; 6 is the number I've tested
    p = multiprocessing.Pool(prcsr)
    out_df = []
    for result in p.imap(chunk_pivot, DataFrameDict):
        out_df.append(result)
    return out_df
start = time.time()
dict_pivot = parallel_pivot(6)
multiprocessing_result = pd.concat(dict_pivot, axis=0)
# singleprocessing_result = pd.pivot_table(data, index='value1', columns='value2', values='value3', aggfunc=lambda x: statistics.harmonic_mean(list(x)))
end = time.time()
print(end - start)
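As a possible further improvement (my own suggestion, relying only on the identity HM(x) = 1/mean(1/x), not on anything in the answer above): the harmonic mean can be computed with pandas' fast built-in mean by pivoting the reciprocals and inverting the result, which avoids calling a Python-level aggfunc per cell:
# vectorized harmonic mean: pivot the reciprocals with the built-in 'mean', then invert
recip = data.assign(inv_value3=1.0 / data['value3'])                      # value3 lies in (1, 2), so no zeros
result_hm = 1.0 / pd.pivot_table(recip, index='value1', columns='value2',
                                 values='inv_value3', aggfunc='mean')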