Summarize the Problem
I am trying to optimize some code I have written. In its current form it works as intended; however, because of the sheer number of loops required, the script takes a very long time to run.
I'm looking for a method of speeding up the below-described code.
Detail the problem
Within this data frame, called master, there are 3,936,192 rows. The Position column represents a genomic window, and each unique Position value is present in the data frame 76 times, such that master[master['Position'] == 300] returns a dataframe of 76 rows, and similarly for each unique Position. I do some operations on each of these subsets of the data frame.
The data can be found here
My current code takes the form:
import pandas as pd
import numpy as np

master = pd.read_csv(data_location)
windows = sorted(set(master['Position']))
window_factor = []

# loop through all the windows, look at the cohort of samples, ignore anything not CNV == 2
# if that means ignore all, then drop the window entirely
# else record the 1/2 mean of that window's normalised coverage across all samples.
for window in windows:
    current_window = master[master['Position'] == window]
    t = current_window[current_window['CNV'] == 2]
    if t.shape[0] == 0:
        window_factor.append('drop')
    else:
        window_factor.append(
            np.mean(current_window[current_window['CNV'] == 2]['Normalised_coverage'])/2)
However, this takes an exceptionally long time to run and I can't figure out a way to speed this up, though I know there must be one.
Your df is not that big, and there are a few problems in your code:
If you use np.mean and one value is np.nan, it returns np.nan (see the short illustration below).
You can divide by 2 after calculating the mean.
It seems to me a perfect case for groupby.
You return a string while the other results are floats; you might consider using np.nan instead.
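Short illustration of the NaN behaviour (not part of the original answer):

import numpy as np
import pandas as pd

np.mean([1.0, np.nan])             # nan: np.mean propagates NaN
pd.Series([1.0, np.nan]).mean()    # 1.0: pandas .mean() skips NaN by default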
import pandas as pd

df = pd.read_csv("master.csv")

def fun(x):
    t = x[x["CNV"] == 2]
    return t["Normalised_coverage"].mean()/2  # returns np.nan when len(t) == 0

out = df.groupby('Position').apply(fun)
CPU times: user 34.7 s, sys: 72.5 ms, total: 34.8 s
Wall time: 34.7 s
Or, even faster, filter before the groupby:
%%time
out = df[df["CNV"]==2].groupby("Position")["Normalised_coverage"].mean()/2
CPU times: user 82.5 ms, sys: 8.03 ms, total: 90.5 ms
Wall time: 87.8 ms
UPDATE: In the last case, if you really need to keep track of the groups where df["CNV"] != 2, you can use this code:
import numpy as np

bad = df[df["CNV"]!=2]["Position"].unique()
bad = list(set(bad)-set(out.index))
out = out.reset_index(name="value")
out1 = pd.DataFrame({"Position": bad,
                     "value": [np.nan]*len(bad)})
out = pd.concat([out, out1],
                ignore_index=True)\
        .sort_values("Position")\
        .reset_index(drop=True)
This adds about 160 ms to the computation.
I think the .groupby() function is what you need here:
fac = []
for name, group in master.groupby('Position'):
    if all(group['CNV'] != 2):
        fac.append('drop')
    else:
        fac.append(np.mean(group[group['CNV'] == 2]['Normalised_coverage'])/2)
I downloaded your data master.csv; the generated data is exactly the same, and the running time decreased from 6 minutes to 30 seconds on my laptop.
Hope it helps.
You can do several things:
Instead of using a Python list for window_factor, consider using an np.array, since you know the length of the array.
t is already current_window[current_window['CNV'] == 2]; use t when calculating np.mean.
You can also use a profiler to see if there are operations that are expensive, or just consider using C++ and reimplementing the code (it's very simple). A sketch of the first two points is below.
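Here is that sketch, keeping the same master, windows and column names as in the question (illustrative only):

import numpy as np

window_factor = np.empty(len(windows))      # preallocate: the length is known up front
for i, window in enumerate(windows):
    current_window = master[master['Position'] == window]
    t = current_window[current_window['CNV'] == 2]
    if t.shape[0] == 0:
        window_factor[i] = np.nan           # np.nan instead of the string 'drop'
    else:
        window_factor[i] = t['Normalised_coverage'].mean() / 2   # reuse t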
Using groupby and query was the solution I went with.
import pandas as pd
import numpy as np
master = pd.read_csv("/home/sean/Desktop/master.csv", index_col=0)
windows = sorted(set(master['Position']))
g = master.groupby("Position")
master.query("Position == 24386700").shape
g = master.query("CNV == 2").groupby("Position")
p = g.Normalised_coverage.mean() / 2
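If you also want an explicit entry for every window (NaN where no CNV == 2 rows exist, in place of the old 'drop' marker), one option is to reindex the result against the full list of windows; this is a small addition, not part of the accepted code:

p_all = p.reindex(windows)   # NaN for windows that had no CNV == 2 rows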
Related
I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
                   data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
                   data = [['Alfred','Chicago',
                            np.random.randint(1,5,(100,7))],
                           ['Nicola','New York',
                            np.random.randint(1,5,(100,7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c/len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is Cython, but you can also change the engine to numba or use the njit decorator to speed things up. Look up the pandas "enhancing performance" docs.
Numba converts Python code to optimized machine code. Pandas is highly integrated with numpy, and hence with numba as well. You can experiment with the parallel, nogil, cache and fastmath options for speedup. This method shines for huge inputs where speed is needed.
With Numba you can do eager compilation, or else the first execution takes a little time for compilation and subsequent calls are fast.
import pandas as pd
import numpy as np
import numba as nb

df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
                   data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
                   data = [['Alfred','Chicago',
                            np.random.randint(1,5,(100,7))],
                           ['Nicola','New York',
                            np.random.randint(1,5,(100,7))]])

a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())

# This is just illustration, not correct logic. Change the logic according to needs
# @nb.njit((nb.int64,))
# def f(x):
#     sum = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             sum += (x[i] == a[j]).sum()
#     return sum

# Experiment with engine
print(df2['Sequence'].apply(f))
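On the eager-compilation point mentioned above, here is a minimal sketch. The explicit signature is compiled when the decorator runs, so even the first call pays no compilation cost; the int64 dtype in the signature is an assumption and must match the arrays you actually pass in.

import numba as nb
import numpy as np

a = np.random.randint(1, 5, (100, 7)).astype(np.int64)   # global array is baked in at compile time

@nb.njit(nb.float64(nb.int64[:, :]), cache=True)
def overlap_fraction(x):
    # fraction of positions where x matches the reference array a
    count = 0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] == a[i, j]:
                count += 1
    return count / x.size

print(overlap_fraction(np.random.randint(1, 5, (100, 7)).astype(np.int64)))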
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
There are 101,878 unique cluster_name values and 1,508,931 maxpaths.
Dataframe looks like this
| cluster_name | maxpath | chrom_pos |
|--------------+---------------------+--------------------|
| cluster_1007 | 5_G,6_C,7_A,8_A,9_T | chr11:611117-799999|
| cluster_1007 | 5_G,6_C,7_A,8_A,9_T | chr11:611117-799999|
| cluster_1007 | 3_G,4_C,5_A,6_A,7_T | chr12:823492-102341|
What I'm trying to do is this: for a given cluster, I want to compare each maxpath to the other maxpaths in that cluster. I'm interested in the clusters where the maxpaths do not overlap and their annotations on the chromosome are also disjoint. For example, "5_G,6_C,7_A,8_A,9_T" and "3_G,4_C,5_A,6_A,7_T" do not overlap and have different annotations. One of the biggest pain points in my code is that I compare each maxpath to all the other maxpaths in that cluster. Some clusters have over 1000 maxpaths, so it slows down a lot. I tried to reduce the number of pairwise comparisons by returning the name of the cluster as soon as two maxpaths fulfil the criteria. I've tried moving certain parts of the code into numpy arrays, but it's still extremely slow.
My code looks like this. Does anyone have any ideas that could help?
import pandas as pd
import numpy as np
import os

def find_cluster(cluster, maxpaths):
    """
    returns clusters with disjoint maxpaths annotated
    to different genomes or annotated on the same
    genome min of 10kb apart
    """
    # note: relies on the module-level subset_df assigned in the loop below
    for idx, item in enumerate(maxpaths):
        unique = set(item.split(','))
        for j in range(idx+1, len(maxpaths)):
            compare = set(maxpaths[j].split(','))
            chrom_string1 = subset_df[
                subset_df['maxpath'] == item]['chrom_pos'].values[0]
            chrom_string2 = subset_df[
                subset_df['maxpath'] == maxpaths[j]]['chrom_pos'].values[0]
            chrom1 = chrom_string1.split(':')[0]
            chrom_end1 = int(chrom_string1.split('-')[1])
            chrom2 = chrom_string2.split(':')[0]
            chrom_end2 = int(chrom_string2.split('-')[1])
            if len(unique.intersection(compare)) == 0:
                if chrom1 != chrom2:
                    return(cluster)
                elif chrom1 == chrom2:
                    if abs(chrom_end1 - chrom_end2) > 10000:
                        return(cluster)

file_number = input.df_subset.split('_')[-1].split('.')[0]
df = pd.read_csv(input.df_subset)
cluster_names = df['cluster_name'].unique()
disjoint_clusters = []
for i in cluster_names:
    subset_df = df[df['cluster_name'] == i]
    maxpaths_array = subset_df['maxpath'].as_matrix()  # .as_matrix() is deprecated; newer pandas uses .to_numpy()
    cluster = find_cluster(i, maxpaths_array)
    disjoint_clusters.append(cluster)
disjoint_maxpaths = pd.DataFrame({"clusters_with_disjoint_maxpaths": disjoint_clusters})
disjoint_maxpaths.to_csv(os.path.abspath('results/disjoint_maxpath_clusters_{}.csv'.format(file_number)), index=False)
After incorporating some of the suggested ideas this is what I came up with. I lost readability but what I needed was performance. Whole run took about 2 hours to finish on my main computer.
def find_cluster(cluster_name, dataframe):
    """
    returns clusters with disjoint maxpaths annotated
    to different genomes or annotated on the same
    genome min of 10kb apart
    """
    cluster = cluster_name
    df = dataframe
    maxpaths = [set(item) for item in [_.split(',') for _ in df['maxpath'].as_matrix()]]
    chrom_string = df['chrom_pos'].as_matrix()
    chrom = [_.split(':')[0] for _ in df['chrom_pos'].as_matrix()]
    chrom_end = [int(_.split('-')[1]) for _ in df['chrom_pos'].as_matrix()]
    for idx, item in enumerate(maxpaths):
        for j in range(idx+1, len(maxpaths)):
            if item.isdisjoint(maxpaths[j]):
                if chrom[idx] != chrom[j]:
                    return(cluster)
                elif chrom[idx] == chrom[j]:
                    if abs(chrom_end[idx] - chrom_end[j]) > 10000:
                        return(cluster)

def df_to_dict(dataframe):
    """
    Create dict with cluster as key and
    subset of dataframe at cluster as value
    """
    df = dataframe
    unique_clusters = df['cluster_name'].unique()
    sub_dfs = []
    for i in unique_clusters:
        sub_dfs.append(df[df['cluster_name'] == i])
    return(dict(zip(unique_clusters, sub_dfs)))

def get_clusters(dataframe):
    """
    return disjoint clusters
    """
    df = pd.read_csv(dataframe, index_col=False)
    df_dict = df_to_dict(df)
    disjoint = [find_cluster(k, v) for k, v in df_dict.items() if find_cluster(k, v)]
    return(disjoint)

def test_new():
    cluster = ["cluster_689"]
    disjoint_cluster = []
    for i in cluster:
        found = find_cluster(i, subset_df)
        disjoint_cluster.append(found)
    return(disjoint_cluster)

def test_old():
    cluster = ["cluster_689"]
    disjoint_cluster = []
    for i in cluster:
        maxpaths_array = subset_df['maxpath'].as_matrix()
        found = old_find_cluster(i, maxpaths_array)
        disjoint_cluster.append(found)
    return(disjoint_cluster)
new = %timeit for x in range(3): test_new()
old = %timeit for x in range(3): test_old()
yields
New find_cluster
247 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Old find_cluster
138 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The speed-up when looking for disjoint maxpaths is immense. I wasn't able to calculate the time for the whole script because the last few large clusters never finished running after over 24 hours. This was after splitting up the data frame into 100 smaller dataframes as well. But I'm sure the script overall became faster outside of the find_cluster function. Thanks for all the help, everyone.
There is some scope for improvement. Your code is basically pythonic, but if the performance loss (in exchange for the readability gain) is unacceptable, you should look at least somewhat deeper into things such as the repeated iterations you do, and convenience functions which take iterables as inputs and repeatedly iterate over (possibly the same) data.
Avoid repeating iterations.
You are repeating iterations: for example, you iterate df['cluster_name'].unique(), then iterate over all rows again in the for loop, and then subset the array multiple times with df[df['cluster_name'] == i], a conditional selection that gets repeated many times for the same values. Instead, you can iterate over the whole data once, keep a dict-like mapping for each cluster (e.g. cluster name mapped to a list of sets of maxpaths), check whether the maxpaths for the current cluster-name entry are already present in that dictionary's entry, and accumulate the clusters passing your conditions. This also avoids the repeated creation of exactly the same sets that happens in your current code. A rough sketch of this idea is shown below.
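Here is that sketch (the function name and structure are illustrative, not taken from the original code):

from collections import defaultdict

def find_disjoint_clusters(df):
    per_cluster = defaultdict(list)
    # one pass over the rows: precompute the path set, chromosome and end coordinate
    for cluster, maxpath, chrom_pos in zip(df['cluster_name'], df['maxpath'], df['chrom_pos']):
        chrom, coords = chrom_pos.split(':')
        end = int(coords.split('-')[1])
        per_cluster[cluster].append((set(maxpath.split(',')), chrom, end))

    hits = []
    for cluster, entries in per_cluster.items():
        for i, (path_i, chrom_i, end_i) in enumerate(entries):
            found = False
            for path_j, chrom_j, end_j in entries[i + 1:]:
                if path_i.isdisjoint(path_j) and (chrom_i != chrom_j or abs(end_i - end_j) > 10000):
                    hits.append(cluster)
                    found = True
                    break
            if found:
                break
    return hits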
Another example of repeated heavy computation:
cluster_names = df['cluster_name'].unique()
disjoint_clusters = []
for i in cluster_names:
    subset_df = df[df['cluster_name'] == i]
You can at least keep (cache) and reuse subsets instead of re-computing.
Do not compute unnecessary expressions, and look for items breaking your conditions early. E.g. len(unique.intersection(compare)) == 0 is slower than unique.isdisjoint(compare) because the latter short-circuits, and your data seem to have that case more frequently. Basically you don't need the intersection itself; you just need to check whether the sets are disjoint (a tiny illustration follows).
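For instance, with the two example maxpaths from the question:

a = set("5_G,6_C,7_A,8_A,9_T".split(','))
b = set("3_G,4_C,5_A,6_A,7_T".split(','))
a.isdisjoint(b)                  # True here; returns False as soon as one shared element is found
len(a.intersection(b)) == 0      # same answer, but always builds the whole intersection first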
Here, again the same sets are being constructed (which themselves involve iterations over input iterables) repeatedly, with the dict approach above, those repetitions are avoided.
There is one shortcoming with the suggested approach, though: it may consume more memory, but I guess that shouldn't be a troublesome difference.
Also I assume you need all clusters passing your conditions, and not stop at just 2. In that case it can be made even faster.
I have a Pandas DataFrame with the following structure (about 100 million rows):
Date Value ID
'1/1/17' 500 1
'1/2/17' 550 1
'1/4/17' 600 2
If I do:
def get_coeff_var(group):
    group['coeff_var'] = group['Value'].std()/group['Value'].mean()
    return group

df = df.groupby(['ID']).apply(lambda x: get_coeff_var(x))
It completes extremely quickly.
But if I first set the index to the date and get the last month of data, then the same apply function takes an enormous (i.e. I can't even wait for it to complete) amount of time:
df = df.set_index('Date')
df = df.last('1M')
df = df.groupby(['ID']).apply(lambda x: get_coeff_var(x))
What's going on?
Almost always, mutating inside a groupby-apply is a bad idea - in general it takes a slow path, although I'm not sure what the exact issue is here.
In your case, the idiomatic, and much faster way to do this transformation is as follows, and should be fast regardless of your index.
gb = df.groupby('ID')['Value']
df['coeff_var'] = gb.transform('std') / gb.transform('mean')
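To combine this with the last-month filter from the question, a hedged sketch (assuming the Date column has already been parsed with pd.to_datetime, which .last('1M') requires):

recent = df.set_index('Date').last('1M').copy()
gb = recent.groupby('ID')['Value']
recent['coeff_var'] = gb.transform('std') / gb.transform('mean')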
I have a dataframe with 40 million records and I need to create 2 new columns (net_amt and share_amt) from the existing amt and sharing_pct columns. I created two functions which calculate these amounts and then used apply to populate them back into the dataframe. As my dataframe is large, it is taking a long time to complete. Can we calculate both amounts in one shot, or is there a completely better way of doing it?
def fn_net(row):
    if (row['sharing'] == 1):
        return row['amt'] * row['sharing_pct']
    else:
        return row['amt']

def fn_share(row):
    if (row['sharing'] == 1):
        return (row['amt']) * (1 - row['sharing_pct'])
    else:
        return 0

df_load['net_amt'] = df_load.apply(lambda row: fn_net(row), axis=1)
df_load['share_amt'] = df_load.apply(lambda row: fn_share(row), axis=1)
I think numpy where() will be the best choice here (after import numpy as np):
df['net_amount'] = np.where( df['sharing']==1,               # test/condition
                             df['amt']*df['sharing_pct'],    # value if True
                             df['amt'] )                     # value if False
You can, of course, use this same method for 'share_amt' also. I don't think there is any faster way to do this, and I don't think you can do it in "one shot", depending on how you define it. Bottom line: doing it with np.where is way faster than applying a function.
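For completeness, the same pattern applied to share_amt, mirroring fn_share above:

df['share_amt'] = np.where(df['sharing'] == 1,
                           df['amt'] * (1 - df['sharing_pct']),
                           0)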
More specifically, I tested on the sample dataset below (10,000 rows) and it's about 700x faster than the function/apply method in that case.
df = pd.DataFrame({ 'sharing':[0,1]*5000,
                    'sharing_pct':np.linspace(.01,1.,10000),
                    'amt':np.random.randn(10000) })
I have used rosetta.parallel.pandas_easy to parallelize apply after groupby, for example:
from rosetta.parallel.pandas_easy import groupby_to_series_to_frame
df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]},index= ['g1', 'g1', 'g2'])
groupby_to_series_to_frame(df, np.mean, n_jobs=8, use_apply=True, by=df.index)
However, has anyone figured out how to parallelize a function that returns a DataFrame? This code fails for rosetta, as expected.
def tmpFunc(df):
    df['c'] = df.a + df.b
    return df

df.groupby(df.index).apply(tmpFunc)
groupby_to_series_to_frame(df, tmpFunc, n_jobs=1, use_apply=True, by=df.index)
This seems to work, although it really should be built in to pandas
import pandas as pd
from joblib import Parallel, delayed
import multiprocessing

def tmpFunc(df):
    df['c'] = df.a + df.b
    return df

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print('parallel version: ')
    print(applyParallel(df.groupby(df.index), tmpFunc))
    print('regular version: ')
    print(df.groupby(df.index).apply(tmpFunc))
    print('ideal version (does not work): ')
    print(df.groupby(df.index).applyParallel(tmpFunc))
Ivan's answer is great, but it looks like it can be slightly simplified, also removing the need to depend on joblib:
import pandas
from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)
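A usage sketch (with tmpFunc as defined earlier in the thread); the __main__ guard matters on platforms that spawn rather than fork worker processes:

if __name__ == '__main__':
    df = pandas.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print(applyParallel(df.groupby(df.index), tmpFunc))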
By the way: this cannot replace every groupby.apply(), but it will cover the typical cases: e.g. it should cover cases 2 and 3 in the documentation, while you should obtain the behaviour of case 1 by giving the argument axis=1 to the final pandas.concat() call.
EDIT: the docs changed; the old version can be found here, in any case I'm copypasting the three examples below.
case 1: group DataFrame apply aggregation function (f(chunk) -> Series) yield DataFrame, with group axis having group labels
case 2: group DataFrame apply transform function ((f(chunk) -> DataFrame with same indexes) yield DataFrame with resulting chunks glued together
case 3: group Series apply function with f(chunk) -> DataFrame yield DataFrame with result of chunks glued together
I have a hack I use for getting parallelization in Pandas. I break my dataframe into chunks, put each chunk into the element of a list, and then use ipython's parallel bits to do a parallel apply on the list of dataframes. Then I put the list back together using pandas concat function.
This is not generally applicable, however. It works for me because the function I want to apply to each chunk of the dataframe takes about a minute. And the pulling apart and putting together of my data does not take all that long. So this is clearly a kludge. With that said, here's an example. I'm using Ipython notebook so you'll see %%time magic in my code:
## make some example data
import pandas as pd
import numpy as np

np.random.seed(1)
n = 10000
df = pd.DataFrame({'mygroup' : np.random.randint(1000, size=n),
                   'data' : np.random.rand(n)})
grouped = df.groupby('mygroup')
For this example I'm going to make 'chunks' based on the above groupby, but this does not have to be how the data is chunked. Although it's a pretty common pattern.
dflist = []
for name, group in grouped:
    dflist.append(group)
set up the parallel bits
from IPython.parallel import Client
rc = Client()
lview = rc.load_balanced_view()
lview.block = True
write a silly function to apply to our data
def myFunc(inDf):
    inDf['newCol'] = inDf.data ** 10
    return inDf
now let's run the code in serial then in parallel.
serial first:
%%time
serial_list = map(myFunc, dflist)
CPU times: user 14 s, sys: 19.9 ms, total: 14 s
Wall time: 14 s
now parallel
%%time
parallel_list = lview.map(myFunc, dflist)
CPU times: user 1.46 s, sys: 86.9 ms, total: 1.54 s
Wall time: 1.56 s
then it only takes a few ms to merge them back into one dataframe
%%time
combinedDf = pd.concat(parallel_list)
CPU times: user 296 ms, sys: 5.27 ms, total: 301 ms
Wall time: 300 ms
I'm running 6 IPython engines on my MacBook, but you can see it drops the execution time from 14 s down to about 2 s.
For really long running stochastic simulations I can use AWS backend by firing up a cluster with StarCluster. Much of the time, however, I parallelize just across 8 CPUs on my MBP.
A short comment to accompany JD Long's answer. I've found that if the number of groups is very large (say hundreds of thousands), and your apply function is doing something fairly simple and quick, then breaking up your dataframe into chunks and assigning each chunk to a worker to carry out a groupby-apply (in serial) can be much faster than doing a parallel groupby-apply and having the workers read off a queue containing a multitude of groups. Example:
import pandas as pd
import numpy as np
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
nrows = 15000
np.random.seed(1980)
df = pd.DataFrame({'a': np.random.permutation(np.arange(nrows))})
So our dataframe looks like:
a
0 3425
1 1016
2 8141
3 9263
4 8018
Note that column 'a' has many groups (think customer ids):
len(df.a.unique())
15000
A function to operate on our groups:
def f1(group):
    time.sleep(0.0001)
    return group
Start a pool:
ppe = ProcessPoolExecutor(12)
futures = []
results = []
Do a parallel groupby-apply:
%%time
for name, group in df.groupby('a'):
    p = ppe.submit(f1, group)
    futures.append(p)

for future in as_completed(futures):
    r = future.result()
    results.append(r)

df_output = pd.concat(results)
del ppe
CPU times: user 18.8 s, sys: 2.15 s, total: 21 s
Wall time: 17.9 s
Let's now add a column which partitions the df into many fewer groups:
df['b'] = np.random.randint(0, 12, nrows)
Now instead of 15000 groups there are only 12:
len(df.b.unique())
12
We'll partition our df and do a groupby-apply on each chunk.
ppe = ProcessPoolExecutor(12)
Wrapper fun:
def f2(df):
    df.groupby('a').apply(f1)
    return df
Send out each chunk to be operated on in serial:
%%time
for i in df.b.unique():
    p = ppe.submit(f2, df[df.b==i])
    futures.append(p)

for future in as_completed(futures):
    r = future.result()
    results.append(r)

df_output = pd.concat(results)
CPU times: user 11.4 s, sys: 176 ms, total: 11.5 s
Wall time: 12.4 s
Note that the amount of time spent per group has not changed. Rather, what has changed is the length of the queue from which the workers read. I suspect that what is happening is that the workers cannot access the shared memory simultaneously, and are returning constantly to read off the queue, and are thus stepping on each other's toes. With larger chunks to operate on, the workers return less frequently, so this problem is ameliorated and the overall execution is faster.
People are moving to use bodo for parallelism. It's the fastest engine available to parallelize Python, as it compiles your code with MPI. Its new compiler makes it much faster than Dask, Ray, multiprocessing, pandarallel, etc. Read bodo vs Dask in this blog post, and see what Travis has to say about bodo on his LinkedIn! He is the founder of Anaconda: quote, "bodo is the real deal".
https://bodo.ai/blog/performance-and-cost-of-bodo-vs-spark-dask-ray
https://www.linkedin.com/posts/teoliphant_performance-and-cost-evaluation-of-bodo-vs-activity-6873290539773632512-y5iZ/
As per how to use groupby with bodo, here I write a sample code:
#install bodo through your terminal
conda create -n Bodo python=3.9 -c conda-forge
conda activate Bodo
conda install bodo -c bodo.ai -c conda-forge
Here is a code sample for groupby:
import time
import pandas as pd
import bodo

@bodo.jit
def read_data():
    """ a dataframe with 2 columns, headers: 'A', 'B'
    or you can just create a data frame instead of reading it from a flat file
    """
    return pd.read_parquet("your_input_data.pq")

@bodo.jit
def data_groupby(input_df):
    t_1 = time.time()
    df2 = input_df.groupby("A", as_index=False).sum()
    t_2 = time.time()
    print("Compute time: {:.2f}".format(t_2-t_1))
    return df2, t_2-t_1

if __name__ == "__main__":
    df = read_data()
    t0 = time.time()
    output, compute_time = data_groupby(df)
    t2 = time.time()
    total_time = t2 - t0
    if bodo.get_rank() == 0:
        print("Compilation time: {:.2f}".format(total_time - compute_time))
        print("Total time second call: {:.2f}".format(total_time))
and finally run it with mpiexec through your terminal. -n determines the number of cores (CPUs) you want to run it.
mpiexec -n 4 python filename.py
Personally I would recommend using dask, per this thread.
As @chrisb pointed out, multiprocessing with pandas in Python might create unnecessary overhead. It might also not perform as well as multithreading or even a single thread.
Dask is created specifically for multiprocessing.
EDIT: To achieve better calculation performance on a pandas groupby, you can use numba to compile your code into machine code at runtime and run it at C-like speed. If the function you apply after the groupby is pure numpy calculation, it will be super fast (much faster than this parallelization).
You can use either multiprocessing or joblib to achieve parallelization. However, if the number of groups is large and each group DataFrame is large, the running time can be worse as you need to transfer those groups into CPUs for many times. To reduce the overhead, we can first divide the data into large chunks, and then parallelize computation on these chunks.
For example, suppose you are processing the stock data, where you need to group the stocks by their code and then calculate some statistics. You can first group by the first character of the code (large chunks), then do the things within this dummy group:
import pandas as pd
from joblib import Parallel, delayed

def group_func(dummy_group):
    # Do something to the group just like doing to the original dataframe.
    # Example: calculate daily return.
    res = []
    for _, g in dummy_group.groupby('code'):
        g['daily_return'] = g.close / g.close.shift(1)
        res.append(g)
    return pd.concat(res)

stock_data = stock_data.assign(dummy=stock_data['code'].str[0])
Parallel(n_jobs=-1)(delayed(group_func)(group) for _, group in stock_data.groupby('dummy'))
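The Parallel call returns a list of per-chunk DataFrames; to get a single frame back you would still concatenate them, e.g. (a small addition, not in the original snippet):

chunks = Parallel(n_jobs=-1)(delayed(group_func)(group)
                             for _, group in stock_data.groupby('dummy'))
stock_data = pd.concat(chunks).sort_index()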
DISCLAIMER: I am the owner and primary contributor/maintainer of swifter
swifter is a Python package that I created over 4 years ago to efficiently apply any function to a pandas dataframe or series in the fastest available manner. As of today, swifter has over 2k GitHub stars, 250k downloads/month, and 95% code coverage.
As of v1.3.2, swifter offers a simple interface to a performant parallelized groupby apply:
df.swifter.groupby(df.index).apply(tmpFunc)
I have also created performance benchmarks showcasing swifter's performance improvement, with a key visual replicated here:
Swifter Groupby Apply Performance Benchmark
You can easily install swifter (with groupby apply functionality) either via pip:
pip install "swifter[groupby]>=1.3.2"
or via conda:
conda install -c conda-forge "swifter>=1.3.2" "ray>=1.0.0"
Please check out the README and documentation for further information