I have a massive dataset that could use multicore processing.
I have a dataframe that has sequences and blocksize for each row.
I wrote a loop that extracts the sequence and block size for each row and calculates a score from a function from a package called localcider.
I can't figure out how to run it in parallel.
Can somebody help?
omega = []
for i, row in df.iterrows():
    seq = df['IDRseq'][i]
    b = df['bsize'][i]
    bsize = [b-1,b]
    SeqOb = SequenceParameters(seq,blobsize=bsize)
s1 = pd.Series(omega, name='omega')
df = df.assign(omega=s1.values)
After a lot of googling, I came across pandarallel.
I think this is the most intuitive way of doing what I want.
I am posting the code for future reference.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True, nb_workers=n)
# nb_workers = n ; I set nb_workers to the number of CPU cores minus 1 so the system is more stable
def something(x):
    # do stuff
    return result
df['result'] = df.parallel_apply(something, axis=1)
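The "CPU cores minus one" rule of thumb for nb_workers mentioned above can be computed instead of hard-coded. Here is a minimal sketch using only the standard library. The helper name default_worker_count and its reserve parameter are made up for illustration and are not part of pandarallel:

```python
import os


def default_worker_count(reserve=1):
    """Return the logical CPU count minus `reserve`, but never less than 1."""
    total = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, total - reserve)
```

The result could then be passed along as nb_workers=default_worker_count() when calling pandarallel.initialize.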
I have a function that has 4 nested for loops in it. The function takes in a dataframe and returns a new dataframe. Currently the function takes about 2 hours to run, and I need it to run in around 30 minutes.
I've tried multiprocessing using 4 cores but I can't seem to get it to work. I start by creating a list of my input dataframe split into smaller chunks (list_of_df):
all_trips = uncov_df.TRIP_NO.unique()
list_of_df = []
for trip in all_trips:
I then tried mapping this list of chunks onto my function (transform_df) using a pool of 4 worker processes.
from multiprocessing import Pool
if __name__ == "__main__":
    with Pool(4) as p:
        df_uncov = list(p.map(transform_df, list_of_df))
    df = pd.concat(df_uncov)
When I run the above my code cell freezes and nothing happens. Does anyone know what's going on?
This is how I set mine up using starmap. It returns a list of dataframes to be concatenated later.
# put this above if __name__ == "__main__":
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as p:
        # starmap unpacks each item of keys_list into the arguments of transform_df,
        # so each item must be a tuple of arguments; use p.map for single-argument items
        df_list = p.starmap(transform_df, keys_list)
    return df_list
# then below if __name__ == "__main__":
df_list = get_dflist_multiprocess(list_of_df, num_proc=4)  # collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
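To make the difference between map and starmap concrete, here is a small sketch. It uses multiprocessing.pool.ThreadPool only so that it runs anywhere without an if __name__ == "__main__": guard; ThreadPool offers the same map and starmap methods as Pool. The functions scale, scale_pair and demo are made-up names for illustration:

```python
from multiprocessing.pool import ThreadPool


def scale(value, factor):
    """Takes two positional arguments, so it suits starmap."""
    return value * factor


def scale_pair(pair):
    """Takes one tuple argument, so it suits map."""
    value, factor = pair
    return value * factor


def demo():
    items = [(1, 10), (2, 10), (3, 10)]
    with ThreadPool(2) as pool:
        # starmap unpacks each tuple into scale(value, factor)
        via_starmap = pool.starmap(scale, items)
        # map passes each tuple as a single argument to scale_pair
        via_map = pool.map(scale_pair, items)
    return via_starmap, via_map
```

If each chunk is a single dataframe rather than a tuple of arguments, map is the method that matches.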
I want to append to a list. Each element to be appended is a large dataframe.
I tried to use the multiprocessing module to speed up appending to the list. My code is as follows:
import pandas as pd
import numpy as np
import time
import multiprocessing
from multiprocessing import Manager
def generate_df(size):
    df = pd.DataFrame()
    for x in list('abcdefghi'):
        df[x] = np.random.normal(size=size)
    return df
def do_something(df_list,size,k):
    df = generate_df(size)
    df_list[k] = df
if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()
    with Manager() as manager:
        df_list = manager.list(range(num_df))
        processes = []
        for k in range(num_df):
            p = multiprocessing.Process(target=do_something, args=(df_list,size,k,))
            p.start()
            processes.append(p)
        for process in processes:
            process.join()
        final_df = pd.concat(df_list)
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
The elapsed time is 7 secs.
I then tried appending to the list without multiprocessing:
df_list = []
for _ in range(num_df):
    df_list.append(generate_df(size))
final_df = pd.concat(df_list)
But this time the elapsed time is 2 secs! Why is appending to the list with multiprocessing slower than without it?
When you use manager.list, you're not using a normal Python list. You're using a special list proxy object that has a whole lot of other stuff going on. Every operation on that list will involve locking and interprocess communication so that every process with access to the list will see the same data in it at all times. It's slow because it's a non-trivial problem to keep everything consistent in that way.
You probably don't need all of that synchronization; it's just slowing you down. A much more natural way to do what you're attempting is to use a process pool and its map method. The pool will handle creating and shutting down the processes, and map will call a target function with an argument from an iterable.
Try something like this, which will use a number of worker processes equal to the number of CPUs your system has:
if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()
    with multiprocessing.Pool() as pool:
        df_list = pool.map(generate_df, [size]*num_df)
    final_df = pd.concat(df_list)
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
This will still have some overhead, since the interprocess communication used to pass the dataframes back to the main process is not free. It may still be slower than running everything in a single process.
Two points:
Starting a subprocess and retrieving data from it has a cost, because the data must be transported between processes. If the transport time is longer than the time it takes to compute the data, you won't see any benefit. This article explains the issue in more detail.
In your implementation the bottleneck is the use of df_list. The Manager uses a lock, which means the processes are not free to write their results into df_list at the same time.
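To get a feel for the transport cost described above, one can time a pickle round trip, since multiprocessing pickles objects to send them between processes. This is only a rough sketch: it ignores the actual transfer between processes, so it gives a lower bound. It uses a dict of float lists as a stand-in for a dataframe, and the names make_columns and transport_cost are made up for illustration:

```python
import pickle
import random
import time


def make_columns(size, names="abcdefghi"):
    """Build a dict of float columns as a stand-in for a dataframe."""
    return {name: [random.gauss(0.0, 1.0) for _ in range(size)] for name in names}


def transport_cost(obj):
    """Pickle and unpickle obj; return (payload bytes, seconds, restored copy)."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    return len(payload), elapsed, restored
```

Comparing the elapsed time from transport_cost(make_columns(200000)) with the time make_columns(200000) itself takes gives an idea of whether shipping the result could outweigh computing it.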
I'm using the multiprocessing python library to run in parallel feature selection for a machine learning problem. This function accepts as input a pandas dataframe and returns some figures.
When I execute this function using mp.pool.map() everything runs smoothly. However, if I substitute it with mp.pool.ThreadPool.map() it fails with this error:
AssertionError: Number of manager items must equal union of block items
# manager items: 15, # tot_items: 20.
Strangely, I was running the ThreadPool code normally till yesterday. Then, I tried to re-run it and started getting these errors. I need ThreadPool since this is an IO bound job and it was running much faster compared to pool.
The code goes like this (Python 2.7):
import multiprocessing as mp
import pandas as pd  # version 0.22.0
def main_functionality(df, params):
    df = df[params['feature']]
    # Run 5-fold cross-validation
    data_df = pd.DataFrame(....)
    pred_df = pred_df.append(data_df)
    return statistics from pred_df
def a_function(df_init, feature, params_init):
    params = dict(params_init)
    df = df_init.copy()
    params['feature'] = feature
    results = main_functionality(df, params)
    results = (0,0,0)
    return results
def b_function(df, features):
    pool = mp.pool.ThreadPool(4)
    params = {...}
    # pool.map passes a single item to the target, so unpack the tuple explicitly
    results = pool.map(lambda args: a_function(*args), [(df, f, params) for f in features])
    results_df = pd.DataFrame(results)
if __name__ == '__main__':
    df = pd.read_csv(...)  # A big CSV file (i.e. a few GBs)
    features = [i for i in df.columns if i ....]
    b_function(df, features)
I'm trying to use multiprocessing with a pandas dataframe: split the dataframe into 8 parts and apply some function to each part using apply, with each part processed in a different process.
Here's the solution I finally found:
import multiprocessing as mp
import numpy as np
import pandas as pd
import pandas.testing as pdt
def process_apply(x):
    # do some stuff to data here
    return x  # placeholder: return the processed row
def process(df):
    res = df.apply(process_apply, axis=1)
    return res
if __name__ == '__main__':
    p = mp.Pool(processes=8)
    split_dfs = np.array_split(big_df,8)
    pool_results = p.map(process, split_dfs)
    p.close()
    p.join()
    # merging parts processed by different processes
    parts = pd.concat(pool_results, axis=0)
    # merging newly calculated parts to big_df
    big_df = pd.concat([big_df, parts], axis=1)
    # checking if the dfs were merged correctly
    pdt.assert_series_equal(parts['id'], big_df['id'])
You can use https://github.com/nalepae/pandarallel, as in the following example:
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
def func(x):
    return sin(x**2)
df.parallel_apply(func, axis=1)
A more generic version based on the author's solution, which lets you run it with any function and dataframe:
from multiprocessing import Pool
from functools import partial
import numpy as np
import pandas as pd
def parallelize(data, func, num_of_processes=8):
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data
def run_on_subset(func, data_subset):
    return data_subset.apply(func, axis=1)
def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)
So the following line:
df.apply(some_func, axis=1)
Will become:
parallelize_on_rows(df, some_func)
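The role of functools.partial in the version above is to bind the row function up front, so that pool.map only has to supply the data chunk. Here is a minimal sketch of just that binding step, using plain lists instead of dataframes. The function square is a made-up stand-in for some_func:

```python
from functools import partial


def run_on_subset(func, data_subset):
    """Apply func to every item of one chunk (list-based stand-in for df.apply)."""
    return [func(item) for item in data_subset]


def square(x):
    return x * x


# bound now takes a single argument (the chunk), which is the shape pool.map expects
bound = partial(run_on_subset, square)
```

A partial of module-level functions can be pickled and sent to worker processes, whereas a lambda doing the same binding cannot be pickled by the standard pickle module.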
This is some code that I found useful. Automatically splits the dataframe into however many cpu cores you have.
import pandas as pd
import numpy as np
import multiprocessing as mp
def parallelize_dataframe(df, func):
    num_processes = mp.cpu_count()
    df_split = np.array_split(df, num_processes)
    with mp.Pool(num_processes) as p:
        df = pd.concat(p.map(func, df_split))
    return df
def parallelize_function(df):
    df[column_output] = df[column_input].apply(example_function)
    return df
def example_function(x):
    x = x*2
    return x
To run:
df_output = parallelize_dataframe(df, parallelize_function)
This worked well for me:
rows_iter = (row for _, row in df.iterrows())
with multiprocessing.Pool() as pool:
    df['new_column'] = pool.map(process_apply, rows_iter)
Since I don't have much of your data script, this is a guess, but I'd suggest using p.map instead of apply_async with the callback.
p = mp.Pool(8)
pool_results = p.map(process, np.array_split(big_df,8))
results = []
for result in pool_results:
    results.extend(result)
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the number of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
def process_apply(x):
    # do some stuff to data here
    return x  # placeholder: return the processed row
def process(df):
    # spawns a pathos.multiprocessing.ProcessPool if sensible
    res = df.mapply(process_apply, axis=1)
    return res
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (in which case usually twice the number of physical CPUs shows up), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
You could also use all logical cores instead (beware that like this the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
I also ran into the same problem when using multiprocessing.Pool.map() to apply a function to different chunks of a large dataframe.
I want to add a couple of points in case other people run into the same problem:
Remember to add if __name__ == '__main__':
Run the code from a .py file. If you use an IPython/Jupyter notebook, multiprocessing may not work (this was true in my case, though I don't know why).
Install Pyxtension, which simplifies using a parallel map, and use it like this:
from pyxtension.streams import stream
big_df = pd.concat(stream(np.array_split(df, multiprocessing.cpu_count())).mpmap(process))
I ended up using concurrent.futures.ProcessPoolExecutor.map in place of multiprocessing.Pool.map; it took 316 microseconds for some code that took 12 seconds in serial.
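For reference, here is a minimal sketch of the Executor.map pattern mentioned above. It accepts any concurrent.futures executor and works on plain lists rather than dataframes. The demo uses ThreadPoolExecutor only so that it runs without an if __name__ == "__main__": guard. The names double_chunk, map_chunks and demo are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def double_chunk(chunk):
    """Stand-in for the per-chunk work."""
    return [value * 2 for value in chunk]


def map_chunks(executor, func, chunks):
    """Run func on every chunk via executor.map and flatten the results."""
    results = []
    # Executor.map yields results in the order of the inputs
    for part in executor.map(func, chunks):
        results.extend(part)
    return results


def demo():
    with ThreadPoolExecutor(max_workers=2) as executor:
        return map_chunks(executor, double_chunk, [[1, 2], [3], [4, 5]])
```

For CPU-bound work, a ProcessPoolExecutor could be passed to map_chunks instead, created under an if __name__ == "__main__": guard and with func defined at module level so it can be pickled.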
Python's pool.starmap() method can also be used to succinctly parallelize apply use cases where column values are passed as arguments, i.e. cases like:
df.apply(lambda row: my_func(row["col_1"], row["col_2"], ...), axis=1)
Full example and benchmarking:
import time
from multiprocessing import Pool
import numpy as np
import pandas as pd
def mul(a, b, c):
    # For illustration, could obviously be vectorized
    return a * b * c
df = pd.DataFrame(np.random.randint(0, 100, size=(10_000_000, 3)), columns=list('ABC'))
# Standard apply
start = time.time()
df["mul"] = df.apply(lambda row: mul(row["A"], row["B"], row["C"]), axis=1)
print(f"Standard apply took {time.time() - start:.0f} seconds.")
# Starmap apply
start = time.time()
with Pool(10) as pool:
    df["mul_pool"] = pool.starmap(mul, zip(df["A"], df["B"], df["C"]))
print(f"Starmap apply took {time.time() - start:.0f} seconds.")
pd.testing.assert_series_equal(df["mul"], df["mul_pool"], check_names=False)
>>> Standard apply took 72 seconds.
>>> Starmap apply took 5 seconds.
This has the benefit of not relying on external libraries, plus being very readable.
Tom Raz's answer https://stackoverflow.com/a/53135031/11847090 misses an edge case where there are fewer rows in the dataframe than processes.
Use this parallelize method instead:
def parallelize(data, func, num_of_processes=8):
    # check if the number of rows is less than the number of processes
    # to avoid the following error
    # ValueError: Expected a 1D array, got an array with shape
    num_rows = len(data)
    if num_rows == 0:
        return None
    elif num_rows < num_of_processes:
        num_of_processes = num_rows
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data
I also ended up using dask bag to parallelize this instead of the custom code above.
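The edge case above (fewer rows than processes) can also be handled when computing the chunk boundaries. Here is a minimal standard-library sketch that caps the number of chunks at the number of rows and spreads the remainder over the first chunks, which is the same distribution np.array_split uses. The name chunk_bounds is made up for illustration:

```python
def chunk_bounds(num_rows, num_chunks):
    """Return (start, stop) row positions for nearly equal, non-empty chunks."""
    if num_rows == 0:
        return []
    # never create more chunks than there are rows
    num_chunks = max(1, min(num_chunks, num_rows))
    base, extra = divmod(num_rows, num_chunks)
    bounds = []
    start = 0
    for i in range(num_chunks):
        # the first `extra` chunks get one additional row
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds
```

Each pair could then be used as data.iloc[start:stop] to build the chunks, and len(chunk_bounds(...)) gives a safe number of worker processes.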