Pandas groupby: efficiently chain several functions - python

I need to group a DataFrame and apply several chained functions on each group.
My problem is basically the same as in pandas - Groupby two functions: apply cumsum then shift on each group.
There are answers there on how to obtain a correct result, however they seem to have a suboptimal performance. My specific question is thus: is there a more efficient way than the ones I describe below?
First here is some large testing data:
from string import ascii_lowercase
import numpy as np
import pandas as pd
n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
Below is the performance of each function:
%timeit df.groupby("x")["y"].cumsum()
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby("x")["y"].shift()
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A basic solution is to group twice. It seems suboptimal since grouping is a large part of the total runtime and should only be done once.
%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The accepted answer to the aforementioned question suggests using apply with a custom function to avoid this issue. However, for some reason it actually performs much worse than the previous solution.
def cumsum_shift(s):
    return s.cumsum().shift()
%timeit df.groupby("x")["y"].apply(cumsum_shift)
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any idea how to optimize this code? Especially in cases where I'd like to chain more than two functions, the performance gains could become quite significant.

Let me know if this helps; a few weeks back I was having the same issue.
I solved it by just splitting the code and creating a separate groupby object, which holds the information about the groups.
# creating groupby object
g = df.groupby('x')['y']
%timeit g.cumsum()
592 ms ± 8.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit g.shift()
1.7 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
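A related, hedged sketch of the same idea, in case you want to make the reuse explicit: factorize the keys once with pd.factorize and group by the resulting integer codes at each chained step, so the string keys are only hashed once. This is a sketch, not what was timed above:
# Sketch reusing df from the question: compute group codes once,
# then group by the cheap integer codes for each chained operation.
codes, _ = pd.factorize(df["x"])
result = df["y"].groupby(codes).cumsum().groupby(codes).shift()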

I would suggest giving transform a try instead of apply.
Try this:
%timeit df.groupby("x")["y"].transform(np.cumsum).transform(lambda x: x.shift())
or also try GroupBy.pipe, which is built into pandas (no extra import needed):
%timeit df.groupby("x").pipe(lambda g: g["y"].cumsum().shift())
I am pretty sure that pipe can be more efficient than apply or transform. Note, though, that in both snippets the trailing shift runs on the whole Series rather than within each group, so check that the output matches the double-groupby version from the question.
Let us know if it works well.

Related

Vectorization of array creation with variable indices in python - How to remove the for loop?

I am trying to vectorize the creation of an array whose entries depend on the loop variable. In the code snippet below, I want to remove the for loop and vectorize the array creation. Can someone kindly help?
# Vectorize 1
def abc(x):
    return str(x) + '_variable'

ar = []
for i in range(0, 100):
    ar += [str('vectorize_') + abc(i)]
You're not going to get much improvement from "vectorization" here since you're working with strings, unfortunately. A pure Python comprehension is about as good as you'll be able to get, because of this constraint. "Vectorized" operations are only able to take advantage of optimized numerical C code when the data are numeric.
Here's an example of one way you might do what you want here:
In [4]: %timeit np.char.add(np.repeat("vectorize_variable_", 100), np.arange(100).astype(str))
108 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
versus a pure Python comprehension:
In [5]: %timeit [f"vectorize_variable_{i}" for i in range(100)]
11.1 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As far as I know, using numpy really doesn't net you any performance benefits when working with strings. Of course, I may be mistaken, and would love it if I am.
If you're still not convinced, here's the same test with n=10000:
In [6]: %timeit [f"vectorize_variable_{i}" for i in range(n)]
1.21 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit np.char.add(np.repeat("vectorize_variable_", n), np.arange(n).astype(str))
9.97 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pure Python is about 10x faster than the "vectorized" version.
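For reference, a comprehension that reproduces the exact string layout from the question (rather than the illustrative layout used above) would be:
# Same pure-Python approach, producing 'vectorize_0_variable', 'vectorize_1_variable', ...
ar = [f"vectorize_{i}_variable" for i in range(100)]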

Efficiently aggregate results into a Python Data Structure

I often find myself looping over some long INPUT list (or dataframe, or dictionary). Per iteration I do some calculations on the input data, I then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list, and aggregate some dummy results into a dataframe. Approach 1 is very slow (~3 seconds per run), whereas Approach 2 is very fast (~18 ms per run). Approach 1 is not good, because it is slow. Approach 2 is faster, but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd

def run1(long_list):
    my_df = pd.DataFrame(columns=['A', 'B', 'C'])
    for el in long_list:
        my_df.loc[len(my_df)] = [el, el+1, 1/el]  # Dummy calculations
    return my_df

def run2(long_list):
    with open('my_file.csv', 'w') as f:
        f.write('A,B,C\n')
        for el in long_list:
            f.write(f'{el},{el+1},{1/el}\n')  # Dummy calculations
    return pd.read_csv('my_file.csv')

long_list = range(1, 2000)
%timeit df1 = run1(long_list)  # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list)  # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
    my_df = pd.DataFrame(long_list, columns=['input'])
    my_df = my_df.assign(
        A=my_df.input,
        B=my_df.input + 1,
        C=1 / my_df.input)
    return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
All in memory
Really fast
Easy to read
Cons:
Probably not as fast as fully vectorized NumPy operations (a sketch of that route follows below)
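A minimal sketch of that NumPy route, assuming the input can be converted to a float array up front (run_np is just an illustrative name, not something from the answers above):
import numpy as np
import pandas as pd

def run_np(long_list):
    # Build each column as one vectorized NumPy expression, then wrap once.
    a = np.asarray(long_list, dtype=float)
    return pd.DataFrame({'A': a, 'B': a + 1, 'C': 1 / a})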
You can directly build a DataFrame from a list of lists:
def run3(long_list):
    return pd.DataFrame([[el, el+1, 1/el] for el in long_list],
                        columns=['A', 'B', 'C'])
It should be much faster than the first one, and still faster than the second one, because it does not use disk I/O.
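For the bonus question about an unknown output size, a hedged sketch (run4 is just an illustrative name): append rows to a plain Python list, which grows as needed, and build the DataFrame once at the end.
import pandas as pd

def run4(items):
    rows = []
    for el in items:
        rows.append((el, el + 1, 1 / el))  # dummy calculations; no size estimate needed
    return pd.DataFrame(rows, columns=['A', 'B', 'C'])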

Improve Harmonic Mean efficiency in Pandas pivot_table

I'm applying the harmonic mean from scipy.stats as the aggfunc parameter in a Pandas pivot_table, but it is slower than a simple mean by orders of magnitude.
I would like to know if this is expected behavior, or if there is a way to make this calculation more efficient, as I need to do it thousands of times.
I need to use the harmonic mean, but it is taking a huge amount of processing time.
I've tried using harmonic_mean from the statistics module in Python 3.6, but the overhead is still the same.
Thanks
import numpy as np
import pandas as pd
import statistics
from scipy.stats import hmean

data = pd.DataFrame({'value1': np.random.randint(1000, size=200000),
                     'value2': np.random.randint(24, size=200000),
                     'value3': np.random.rand(200000) + 1,
                     'value4': np.random.randint(100000, size=200000)})
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=hmean)
1.74 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
1.9 s ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=np.mean)
37.4 ms ± 938 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Single run for both functions
%timeit hmean(data.value3[:100])
155 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.mean(data.value3[:100])
138 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I would recommend using multiprocessing.Pool. The code below has been tested on 20 million records and is about 3 times faster than the original. Please give it a try; the code surely still needs more improvements to fully answer your specific question about the slow performance of statistics.harmonic_mean.
Note: you can get even better results for more than 100 million records.
import time
import numpy as np
import pandas as pd
import statistics
import multiprocessing

data = pd.DataFrame({'value1': np.random.randint(1000, size=20000000),
                     'value2': np.random.randint(24, size=20000000),
                     'value3': np.random.rand(20000000) + 1,
                     'value4': np.random.randint(100000, size=20000000)})

def chunk_pivot(data):
    result = pd.pivot_table(data, index='value1', columns='value2', values='value3',
                            aggfunc=lambda x: statistics.harmonic_mean(list(x)))
    return result

DataFrameDict = []
for i in range(4):
    print(i*250, i*250+250)
    DataFrameDict.append(data[:][data.value1.between(i*250, i*250+249)])

def parallel_pivot(prcsr):
    # 6 is the number of processes I've tested
    p = multiprocessing.Pool(prcsr)
    out_df = []
    for result in p.imap(chunk_pivot, DataFrameDict):
        # print(result)
        out_df.append(result)
    return out_df

start = time.time()
dict_pivot = parallel_pivot(6)
multiprocessing_result = pd.concat(dict_pivot, axis=0)
# singleprocessing_result = pd.pivot_table(data, index='value1', columns='value2', values='value3', aggfunc=lambda x: statistics.harmonic_mean(list(x)))
end = time.time()
print(end - start)
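A further hedged alternative, assuming all values are strictly positive (as value3 is here): the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, so the aggregation can stay in pandas' fast built-in mean.
# Sketch reusing data from above: hmean(x) == 1 / mean(1 / x) for positive x,
# so aggregate the reciprocals with the built-in 'mean' and invert the result.
recip = data.assign(recip=1.0 / data['value3'])
result = 1.0 / pd.pivot_table(recip, index='value1', columns='value2',
                              values='recip', aggfunc='mean')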

Comparing zip operation with pandas slice operation

An interesting observation I felt I should clarify.
I expect that pandas slice operation should be faster than zipping columns of a dataframe, but on running %timeit on both operations, the zip operation is faster...
import pandas as pd, numpy as np
s = pd.DataFrame({'Column1':range(50), 'Column2':np.random.randn(50), 'Column3':np.random.randn(50)})
And on running
%timeit s[['Column1','Column3']].loc[30].values
1.06 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dict(zip(s['Column1'],s['Column3']))[30]
53.7 µs ± 6.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This suggests that pandas is significantly slower than using the zip function, right? And that it is probably only preferable for its ease of use, I believe.
Would an apply-map operation be faster?
Zip is optimized to run in the processor cache. It's very fast, as is itertools in general.
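A hedged aside on the comparison itself: much of the cost of the .loc line above is fixed pandas overhead from building the intermediate two-column frame and Series, not the lookup itself. If only a single value is needed, scalar accessors avoid most of that overhead, for example:
# Sketch reusing s from above: scalar lookups that skip the intermediate frame.
s.at[30, 'Column3']          # label-based scalar access
s['Column3'].to_numpy()[30]  # positional access on the underlying NumPy array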

How to speed up pandas string function?

I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I have also tried using df.apply() with a lambda and str.split() to produce equivalent results. When using %timeit, I'm finding that df.apply() performs faster than the vectorized version.
Everything that I have read about vectorization seems to indicate that the first version should have better performance. Can someone please explain why I am getting these results? Example:
     id                  facility
0  3466                 abc~24353
1  4853         facility1~3.4.5.6
2  4582  53434_Facility~34432~cde
3  9972   facility2~FACILITY2~343
4  2356             Test~23 ~FAC1
The above dataframe has about 500,000 rows, and I have also tested with around 1 million rows with similar results. Here are the timings:
Vectorization
In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Lambda Apply
In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Does anyone know why I am getting this behavior?
Thanks!
Pandas string methods are only "vectorized" in the sense that you don't have to write the loop yourself. There isn't actually any parallelization going on, because string problems (especially regex problems) are inherently difficult (impossible?) to parallelize. If you really want speed, you should actually fall back to Python here.
%timeit df['facility'].str.split('~', n=1).str[0]
%timeit [x.split('~', 1)[0] for x in df['facility'].tolist()]
411 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
132 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For more information on when loops are faster than pandas functions, take a look at For loops with pandas - When should I care?.
As for why apply is faster, I'm of the belief that the function apply is applying here (i.e., str.split on each element) is a lot more lightweight than the string-splitting machinery in the bowels of Series.str.split.
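As a usage sketch of the faster pure-Python route, written back into the DataFrame (this assumes every value in the column is already a string, which is why the question's apply version called .astype(str) first):
# List comprehension from the answer above, assigned back to the column.
df['facility'] = [x.split('~', 1)[0] for x in df['facility'].tolist()]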
