I have used rosetta.parallel.pandas_easy to parallelize apply after groupby, for example:
from rosetta.parallel.pandas_easy import groupby_to_series_to_frame
df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
groupby_to_series_to_frame(df, np.mean, n_jobs=8, use_apply=True, by=df.index)
However, has anyone figured out how to parallelize a function that returns a DataFrame? This code fails for rosetta, as expected.
def tmpFunc(df):
    df['c'] = df.a + df.b
    return df

df.groupby(df.index).apply(tmpFunc)
groupby_to_series_to_frame(df, tmpFunc, n_jobs=1, use_apply=True, by=df.index)
This seems to work, although it really should be built into pandas:
import pandas as pd
from joblib import Parallel, delayed
import multiprocessing

def tmpFunc(df):
    df['c'] = df.a + df.b
    return df

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(
        delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print('parallel version: ')
    print(applyParallel(df.groupby(df.index), tmpFunc))
    print('regular version: ')
    print(df.groupby(df.index).apply(tmpFunc))
    print('ideal version (does not work): ')
    print(df.groupby(df.index).applyParallel(tmpFunc))
Ivan's answer is great, but it looks like it can be slightly simplified, also removing the need to depend on joblib:
import pandas as pd
from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)
By the way, this cannot replace every groupby.apply(), but it covers the typical cases: it handles cases 2 and 3 in the documentation, while you can obtain the behaviour of case 1 by passing axis=1 to the final pd.concat() call (see the sketch after the three cases below).
EDIT: the docs changed; the old version can be found here. In any case, I'm copying the three cases below:
case 1: group DataFrame, apply aggregation function (f(chunk) -> Series), yield DataFrame with the group axis having group labels
case 2: group DataFrame, apply transform function (f(chunk) -> DataFrame with the same indexes), yield DataFrame with the resulting chunks glued together
case 3: group Series, apply function with f(chunk) -> DataFrame, yield DataFrame with the result of the chunks glued together
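As a rough illustration of the case-1 behaviour, here is a minimal sketch (not from the original answer; the applyParallelAgg name and group_mean helper are made up) of how the same Pool-based pattern can handle an aggregation function by concatenating along axis=1 and transposing:

import pandas as pd
from multiprocessing import Pool, cpu_count

def applyParallelAgg(dfGrouped, func):
    # func is an aggregation: f(chunk) -> Series (case 1 above)
    names, groups = zip(*[(name, group) for name, group in dfGrouped])
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, groups)
    # One Series per group; axis=1 puts one group per column, keys attach the
    # group labels, and .T moves those labels back onto the row index.
    return pd.concat(ret_list, axis=1, keys=names).T

def group_mean(g):
    return g.mean()

if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print(applyParallelAgg(df.groupby(df.index), group_mean))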
I have a hack I use for getting parallelization in Pandas. I break my dataframe into chunks, put each chunk into the element of a list, and then use IPython's parallel bits to do a parallel apply on the list of dataframes. Then I put the list back together using pandas' concat function.
This is not generally applicable, however. It works for me because the function I want to apply to each chunk of the dataframe takes about a minute, and the pulling apart and putting together of my data does not take all that long. So this is clearly a kludge. With that said, here's an example. I'm using an IPython Notebook, so you'll see %%time magic in my code:
## make some example data
import numpy as np
import pandas as pd

np.random.seed(1)
n = 10000
df = pd.DataFrame({'mygroup': np.random.randint(1000, size=n),
                   'data': np.random.rand(n)})
grouped = df.groupby('mygroup')
For this example I'm going to make 'chunks' based on the above groupby, but this does not have to be how the data is chunked, although it's a pretty common pattern.
dflist = []
for name, group in grouped:
    dflist.append(group)
Set up the parallel bits:
from IPython.parallel import Client
rc = Client()
lview = rc.load_balanced_view()
lview.block = True
Write a silly function to apply to our data:
def myFunc(inDf):
    inDf['newCol'] = inDf.data ** 10
    return inDf
Now let's run the code in serial and then in parallel.
Serial first:
%%time
serial_list = map(myFunc, dflist)
CPU times: user 14 s, sys: 19.9 ms, total: 14 s
Wall time: 14 s
Now in parallel:
%%time
parallel_list = lview.map(myFunc, dflist)
CPU times: user 1.46 s, sys: 86.9 ms, total: 1.54 s
Wall time: 1.56 s
Then it takes only a few hundred milliseconds to merge them back into one dataframe:
%%time
combinedDf = pd.concat(parallel_list)
CPU times: user 296 ms, sys: 5.27 ms, total: 301 ms
Wall time: 300 ms
I'm running 6 IPython engines on my MacBook, but you can see it drops the execution time from 14 s down to about 2 s.
For really long-running stochastic simulations I can use an AWS backend by firing up a cluster with StarCluster. Much of the time, however, I parallelize just across the 8 CPUs on my MBP.
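A quick note for anyone reading this today: the IPython.parallel machinery has since been split out into the separate ipyparallel package. Assuming ipyparallel is installed and engines are running (for example via ipcluster start -n 6), the setup is essentially unchanged, reusing myFunc and dflist from the example above:

import pandas as pd
import ipyparallel as ipp

rc = ipp.Client()                # connect to the running engines
lview = rc.load_balanced_view()  # same load-balanced view as above
lview.block = True

# the rest of the example is unchanged:
parallel_list = lview.map(myFunc, dflist)
combinedDf = pd.concat(parallel_list)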
A short comment to accompany JD Long's answer. I've found that if the number of groups is very large (say hundreds of thousands), and your apply function is doing something fairly simple and quick, then breaking up your dataframe into chunks and assigning each chunk to a worker to carry out a groupby-apply (in serial) can be much faster than doing a parallel groupby-apply and having the workers read off a queue containing a multitude of groups. Example:
import pandas as pd
import numpy as np
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
nrows = 15000
np.random.seed(1980)
df = pd.DataFrame({'a': np.random.permutation(np.arange(nrows))})
So our dataframe looks like:
a
0 3425
1 1016
2 8141
3 9263
4 8018
Note that column 'a' has many groups (think customer ids):
len(df.a.unique())
15000
A function to operate on our groups:
def f1(group):
    time.sleep(0.0001)
    return group
Start a pool:
ppe = ProcessPoolExecutor(12)
futures = []
results = []
Do a parallel groupby-apply:
%%time
for name, group in df.groupby('a'):
    p = ppe.submit(f1, group)
    futures.append(p)
for future in as_completed(futures):
    r = future.result()
    results.append(r)
df_output = pd.concat(results)
del ppe
CPU times: user 18.8 s, sys: 2.15 s, total: 21 s
Wall time: 17.9 s
Let's now add a column which partitions the df into many fewer groups:
df['b'] = np.random.randint(0, 12, nrows)
Now instead of 15000 groups there are only 12:
len(df.b.unique())
12
We'll partition our df and do a groupby-apply on each chunk.
ppe = ProcessPoolExecutor(12)
Wrapper function:
def f2(df):
    df.groupby('a').apply(f1)
    return df
Send out each chunk to be operated on in serial:
%%time
futures = []   # reset the lists from the previous run
results = []
for i in df.b.unique():
    p = ppe.submit(f2, df[df.b == i])
    futures.append(p)
for future in as_completed(futures):
    r = future.result()
    results.append(r)
df_output = pd.concat(results)
CPU times: user 11.4 s, sys: 176 ms, total: 11.5 s
Wall time: 12.4 s
Note that the amount of time spent per group has not changed; rather, what has changed is the length of the queue from which the workers read. I suspect that the workers cannot access the shared memory simultaneously and keep returning to read off the queue, stepping on each other's toes. With larger chunks to operate on, the workers return less frequently, so this problem is ameliorated and the overall execution is faster.
People are moving to Bodo for parallelism. It's the fastest engine available for parallelizing Python, as it compiles your code with MPI. Its new compiler makes it much faster than Dask, Ray, multiprocessing, pandarallel, etc. Read Bodo vs. Dask in this blog post, and see what Travis Oliphant, the founder of Anaconda, says about Bodo on LinkedIn: "Bodo is the real deal".
https://bodo.ai/blog/performance-and-cost-of-bodo-vs-spark-dask-ray
https://www.linkedin.com/posts/teoliphant_performance-and-cost-evaluation-of-bodo-vs-activity-6873290539773632512-y5iZ/
As for how to use groupby with Bodo, here is some sample code:
#install bodo through your terminal
conda create -n Bodo python=3.9 -c conda-forge
conda activate Bodo
conda install bodo -c bodo.ai -c conda-forge
Here is a code sample for groupby:
import time
import pandas as pd
import bodo

@bodo.jit
def read_data():
    """a dataframe with 2 columns, headers: 'A', 'B'
    or you can just create a data frame instead of reading it from a flat file
    """
    return pd.read_parquet("your_input_data.pq")

@bodo.jit
def data_groupby(input_df):
    t_1 = time.time()
    df2 = input_df.groupby("A", as_index=False).sum()
    t_2 = time.time()
    print("Compute time: {:.2f}".format(t_2 - t_1))
    return df2, t_2 - t_1

if __name__ == "__main__":
    df = read_data()
    t0 = time.time()
    output, compute_time = data_groupby(df)
    t2 = time.time()
    total_time = t2 - t0
    if bodo.get_rank() == 0:
        print("Compilation time: {:.2f}".format(total_time - compute_time))
        print("Total time second call: {:.2f}".format(total_time))
Finally, run it with mpiexec from your terminal. The -n flag determines the number of cores (CPUs) you want to run on:
mpiexec -n 4 python filename.py
Personally I would recommend using dask, per this thread.
As @chrisb pointed out, multiprocessing with pandas in Python may create unnecessary overhead. It might also not perform as well as multithreading or even a single thread.
Dask is created specifically for multiprocessing.
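For reference, a minimal sketch of what the dask version of the groupby-apply from the question might look like (assuming dask is installed; tmpFunc and the toy frame mirror the ones at the top of the thread, with the group key moved into a column):

import pandas as pd
import dask.dataframe as dd

def tmpFunc(pdf):
    pdf = pdf.copy()
    pdf['c'] = pdf.a + pdf.b
    return pdf

df = pd.DataFrame({'g': ['g1', 'g1', 'g2'], 'a': [6, 2, 2], 'b': [4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)

# meta tells dask the output schema so it can skip expensive inference
result = (ddf.groupby('g')
             .apply(tmpFunc, meta={'g': 'object', 'a': 'i8', 'b': 'i8', 'c': 'i8'})
             .compute(scheduler='processes'))
print(result)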
EDIT: To achieve better performance on a pandas groupby, you can use numba to compile your code to machine code at runtime and run it at native speed. If the function you apply after the groupby is pure numpy calculation, it will be super fast (much faster than this parallelization).
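To make that concrete, here is a small sketch (my own illustration, not a benchmarked claim) of compiling a per-group numpy computation with numba's @njit and calling it on each group's underlying array:

import numba
import numpy as np
import pandas as pd

@numba.njit
def normalised_mean(values):
    # plain numpy-style loop: numba compiles this to machine code
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

df = pd.DataFrame({'key': np.random.randint(0, 100, 10_000),
                   'val': np.random.rand(10_000)})

# hand the compiled function the raw numpy array of each group
out = df.groupby('key')['val'].apply(lambda s: normalised_mean(s.to_numpy()))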
You can use either multiprocessing or joblib to achieve parallelization. However, if the number of groups is large and each group DataFrame is large, the running time can end up worse, because you need to transfer those groups to the worker processes many times. To reduce the overhead, we can first divide the data into large chunks, and then parallelize computation on these chunks.
For example, suppose you are processing stock data, where you need to group the stocks by their code and then calculate some statistics. You can first group by the first character of the code (the large chunks), then do the work within each dummy group:
import pandas as pd
from joblib import Parallel, delayed

def group_func(dummy_group):
    # Do something to the group just like you would to the original dataframe.
    # Example: calculate daily return.
    res = []
    for _, g in dummy_group.groupby('code'):
        g['daily_return'] = g.close / g.close.shift(1)
        res.append(g)
    return pd.concat(res)

stock_data = stock_data.assign(dummy=stock_data['code'].str[0])
result = pd.concat(
    Parallel(n_jobs=-1)(delayed(group_func)(group)
                        for _, group in stock_data.groupby('dummy')))
DISCLAIMER: I am the owner and primary contributor/maintainer of swifter
swifter is a Python package that I created over 4 years ago to efficiently apply any function to a pandas dataframe or series in the fastest available manner. As of today, swifter has over 2k GitHub stars, 250k downloads/month, and 95% code coverage.
As of v1.3.2, swifter offers a simple interface to a performant parallelized groupby apply:
df.swifter.groupby(df.index).apply(tmpFunc)
I have also created performance benchmarks showcasing swifter's performance improvement, with a key visual replicated here:
Swifter Groupby Apply Performance Benchmark
You can easily install swifter (with groupby apply functionality) either via pip:
pip install "swifter[groupby]>=1.3.2"
or via conda:
conda install -c conda-forge "swifter>=1.3.2" "ray>=1.0.0"
Please check out the README and documentation for further information
Related
I have a very large dataframe that I am resampling a large number of times, so I'd like to use dask to speed up the process. However, I'm running into challenges with the groupby apply. An example data frame would be
import numpy as np
import pandas as pd
import random
test_df = pd.DataFrame({'sample_id': np.array(['a', 'b', 'c', 'd']).repeat(100),
                        'param1': random.sample(range(1, 1000), 400)})
test_df.set_index('sample_id', inplace=True)
which I can normally groupby and resample using
N = 5; i = 1
test = test_df\
    .groupby(['sample_id'])\
    .apply(pd.DataFrame.sample, n=N, replace=False)\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
Which I wrap into a method that iterates over an N gradient i times. The actual dataframe is very large with a number of columns, and before anyone suggests, this method is a little bit faster than an np.random.choice approach on the index; it's all in the groupby. I've run the overall procedure through a multiprocessing method, but I wanted to see if I could get a bit more speed out of a dask version of the same. The problem is that the documentation suggests that if you index and partition then you get complete groups per partition, which is not proving true.
import dask.dataframe as dd
df1 = dd.from_pandas(test_df, npartitions=8)
df1=df1.persist()
df1.divisions
creates
('a', 'b', 'c', 'd', 'd')
which unsurprisingly results in a failure
N = 5; i = 1
test = df1\
    .groupby(['sample_id'])\
    .apply(pd.DataFrame.sample, n=N, replace=False)\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
ValueError: Metadata inference failed in groupby.apply(sample).
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
ValueError("Cannot take a larger sample than population when 'replace=False'")
I have dug all around the documentation on keywords, dask dataframes & partitions, and groupby aggregations, and am simply missing the solution if it's there in the docs. Any advice on how to create a smarter set of partitions and/or get the groupby with sample to play nicely with dask would be deeply appreciated.
It's not quite clear to me what you are trying to achieve or why you need to add replace=False (which is the default), but the following code works for me. I just needed to add meta.
import dask.dataframe as dd

df1 = dd.from_pandas(test_df.reset_index(), npartitions=8)

N = 5
i = 1
test = df1\
    .groupby(['sample_id'])\
    .apply(lambda x: x.sample(n=N),
           meta={"sample_id": "object",
                 "param1": "f8"})\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
If you then want to drop sample_id you just need to add
df = df.drop("sample_id", axis=1)
Summarize the Problem
I am trying to optimize some code I have written. In its current form it works as intended; however, because of the sheer number of loops required, the script takes a very long time to run.
I'm looking for a method of speeding up the below-described code.
Detail the problem
Within this data frame, called master, there are 3,936,192 rows. The Position column represents a genomic window, and each Position value is present in this data frame 76 times, such that master[master['Position'] == 300] returns a dataframe of 76 rows, and similarly for each unique value of Position. I do some operations on each of these subsets of the data frame.
The data can be found here
My current code takes the form:
import pandas as pd
import numpy as np

master = pd.read_csv(data_location)
windows = sorted(set(master['Position']))
window_factor = []

# loop through all the windows, look at the cohort of samples, ignore anything not CNV == 2
# if that means ignore all, then drop the window entirely
# else record the 1/2 mean of that window's normalised coverage across all samples.
for window in windows:
    current_window = master[master['Position'] == window]
    t = current_window[current_window['CNV'] == 2]
    if t.shape[0] == 0:
        window_factor.append('drop')
    else:
        window_factor.append(
            np.mean(current_window[current_window['CNV'] == 2]['Normalised_coverage']) / 2)
However, this takes an exceptionally long time to run and I can't figure out a way to speed this up, though I know there must be one.
Your df is not that big, and there are a few problems in your code:
- If you use np.mean and one value is np.nan, it returns np.nan.
- You can divide by 2 after calculating the mean.
- It seems to me a perfect case for groupby.
- You return a string ('drop') while the other results are floats; you might consider using np.nan instead.
import pandas as pd

df = pd.read_csv("master.csv")

def fun(x):
    t = x[x["CNV"] == 2]
    # returns np.nan when len(t) == 0
    return t["Normalised_coverage"].mean() / 2

out = df.groupby('Position').apply(fun)
CPU times: user 34.7 s, sys: 72.5 ms, total: 34.8 s
Wall time: 34.7 s
Or, even faster, filter before the groupby:
%%time
out = df[df["CNV"]==2].groupby("Position")["Normalised_coverage"].mean()/2
CPU times: user 82.5 ms, sys: 8.03 ms, total: 90.5 ms
Wall time: 87.8 ms
UPDATE: In the last case, if you really need to keep track of the groups where df["CNV"] != 2, you can use this code:
import numpy as np

bad = df[df["CNV"] != 2]["Position"].unique()
bad = list(set(bad) - set(out.index))
out = out.reset_index(name="value")
out1 = pd.DataFrame({"Position": bad,
                     "value": [np.nan] * len(bad)})
out = pd.concat([out, out1],
                ignore_index=True)\
        .sort_values("Position")\
        .reset_index(drop=True)
This is going to add about 160 ms to your computation.
I think the .groupby() function is what you need here:
fac = []
for name, group in master.groupby('Position'):
    if all(group['CNV'] != 2):
        fac.append('drop')
    else:
        fac.append(np.mean(group[group['CNV'] == 2]['Normalised_coverage']) / 2)
I downloaded your data master.csv; the data generated is exactly the same, and the running time decreased from 6 min to 30 sec on my laptop.
Hope it helps.
You can do several things:
- Instead of using a Python list for window_factor, consider using an np.array, since you know the length of the array in advance.
- t is already current_window[current_window['CNV'] == 2]; use t when calculating the mean (see the sketch below).
- You can also use a profiler to see if there are operations that are expensive, or just consider reimplementing the code in C++ (it's very simple).
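A minimal sketch of the first two points applied to the original loop (using np.nan for dropped windows rather than the string 'drop', as the other answer also suggests):

import numpy as np
import pandas as pd

master = pd.read_csv("master.csv")
windows = sorted(set(master['Position']))

# preallocate instead of appending to a Python list; NaN marks dropped windows
window_factor = np.full(len(windows), np.nan)

for idx, window in enumerate(windows):
    current_window = master[master['Position'] == window]
    t = current_window[current_window['CNV'] == 2]
    if t.shape[0] > 0:
        # reuse t instead of filtering a second time
        window_factor[idx] = t['Normalised_coverage'].mean() / 2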
Using groupby and query was the solution I went with.
import pandas as pd
import numpy as np
master = pd.read_csv("/home/sean/Desktop/master.csv", index_col=0)
windows = sorted(set(master['Position']))
g = master.groupby("Position")
master.query("Position == 24386700").shape
g = master.query("CNV == 2").groupby("Position")
p = g.Normalised_coverage.mean() / 2
I split a pandas.DataFrame into 6 parts with a similar number of rows (118317, 118315, ...) in order to balance the work and to respect the integrity of the data for a groupby on a field.
These 6 parts are pandas.DataFrames stored in a list.
The function to apply to each one, in parallel, is the following:
def compute_recency(df):
    recency = df.groupby('name').apply(lambda x: x['date'] - x['date'].shift()).fillna(0).reset_index()
    df = df.join(recency.set_index('level_1'), rsuffix='_f')
    return df
Then I parallelized the processing as:
import multiprocessing as mp
cores=mp.cpu_count()
pool = mp.Pool(cores)
df_out = pool.map(compute_recency, list_of_6_dataframes)
pool.close()
pool.join()
The issue is that the JupyterLab notebook keeps calculating (the cell shows [*]), while I can see in my resource monitor that the CPUs are now "free", i.e. they are no longer at 100% as they were at first.
Note that if I use the following function:
def func(df):
    return df.shape
it works well and quickly, with no [*] hanging forever.
So I guess the issue comes from the function compute_recency, but I don't see why.
Can you help me?
Pandas version: 0.23.4 Python version: 3.7.4
It's a little difficult to see what might be causing the issue here. Since you are already using multiprocessing, perhaps break your data up into the groups created by groupby and then process each group using multiprocessing?
import pandas as pd
from multiprocessing import Pool

groups = [x for _, x in df.groupby("name")]

def add_new_col(x):
    x['new'] = x['date'] - x['date'].shift().fillna(0)
    return x

p = Pool()
groups = p.map(add_new_col, groups)
df = pd.concat(groups, ignore_index=True)
p.close()
p.join()
By the way, regarding your original code: p.map will return a list of dataframes, not a dataframe, which is why I've used pd.concat to combine the results at the end.
I have the following functions to apply a bunch of regexes to each element in a data frame. The dataframe that I am applying the regexes to is a 5MB chunk.
import re
from functools import partial

def apply_all_regexes(data, regexes):
    # apply the regex matching to every cell of the dataframe
    new_df = data.applymap(
        partial(apply_re_to_cell, regexes))
    return new_df

def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches
Due to the serial execution of applymap, the time taken to process is roughly (number of elements) * (time to run the regexes serially on one element). Is there any way to invoke parallelism? I tried ProcessPoolExecutor, but that appeared to take longer than executing serially.
Have you tried splitting your one big dataframe into as many small dataframes as you have workers, applying the regex map in parallel, and sticking the small dfs back together?
I was able to do something similar with a dataframe about gene expression.
I would run it at a small scale first and check that you get the expected output.
Unfortunately I don't have enough reputation to comment.
import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 8  # number of chunks to split the dataframe into
num_cores = 4       # number of worker processes

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
This is the general function I used.
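For completeness, a hedged sketch of how that helper might be wired to the apply_all_regexes function from the question (big_df and the pattern list are placeholders):

from functools import partial

regexes = [r'\d+', r'[A-Z]{2,}']   # placeholder patterns

# each worker gets one chunk of big_df and applies the regexes to every cell
result = parallelize_dataframe(big_df, partial(apply_all_regexes, regexes=regexes))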
I have to perform lots of operations on a dataframe, and it takes a long time using a single core, so I am trying to implement multiprocessing.
Right now, while I am figuring out how it works, I am using a simpler version where I just want to add the values from data:
import multiprocessing
import pandas as pd

def add_values(a):
    df = pd.DataFrame([{'n': a}])
    return df

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]

with multiprocessing.Pool(processes=4) as pool:
    df = df.add(pool.map(add_values, data))

df
I would like df to end up as a dataframe with n=18, but I get this error message: ValueError: Unable to coerce to Series, length must be 1: given 3
The issue here is how you treat the return value from your multiprocessing calls. pool.map() returns a list. In this particular case, it will be a list of dataframes, i.e., what your call expands to is equivalent to df = df.add([dfn9, dfn4, dfn5]), where the dfnXs are different dataframes.
This input is neither expected nor handled by df.add(), which expects something that can be turned into a pd.Series object and added to the original frame. Instead, you need to take this list and "manually" reduce it, e.g. as:
import multiprocessing
import pandas as pd

def add_values(a):
    df = pd.DataFrame([{'n': a}])
    return df

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]

with multiprocessing.Pool(processes=4) as pool:
    # df = df.add(pool.map(add_values, data)) does not work
    dfs = pool.map(add_values, data)
    print(type(dfs))

# Reducing return values
for d in dfs:
    df = df.add(d)

print(df)
The reduction must happen in a single process, as the different processes do not share the same df (instead they all have identical copies).
As a side note, I think you should also consider using multithreading rather than multiprocessing. It may be simpler, as threads can share the same memory, reducing the need for copying. Also, since pandas releases the GIL for many operations, there is not the problem of only one thread being able to execute at a time.
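For example, a minimal threaded sketch of the same toy problem using the standard-library concurrent.futures (the reduction still happens in the main thread):

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def add_values(a):
    return pd.DataFrame([{'n': a}])

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]

with ThreadPoolExecutor(max_workers=4) as ex:
    dfs = list(ex.map(add_values, data))

# reduce in the main thread, exactly as in the multiprocessing version
for d in dfs:
    df = df.add(d)

print(df)   # n == 18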