Say I have a program that contains a simulation-based function which takes some time to compute.
def foo_func(args):
    # some calculations
    return foo  # df

res = {}  # will be a dictionary of dfs
for i in range(n):
    res[i] = foo_func(args)
Problem: Calculating foo using foo_func n times takes too long
Question: How do I implement multiprocessing/multithreading within the program and store the results in res?
Note that:
foo_func takes in args
order does not matter within the res - the order in which the jobs finish does not matter, as long as all of the jobs are correctly stored in res
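For illustration, here is a minimal sketch using the standard library's concurrent.futures (foo_func, args, and n are the names from the snippet above): submit every call to a process pool and store each result under its job index as the jobs complete, since completion order does not matter.

from concurrent.futures import ProcessPoolExecutor, as_completed

if __name__ == '__main__':
    res = {}
    with ProcessPoolExecutor() as executor:
        # one job per index; keeping the index lets each result land under
        # the right key regardless of which job finishes first
        futures = {executor.submit(foo_func, args): i for i in range(n)}
        for future in as_completed(futures):
            res[futures[future]] = future.result()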
Related
I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)

new_dataset = parallel_function(parallel_request)

for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: one to build the parallel request, and another to consolidate the parallel results with my old data. Large chunks of these loops end up repeating steps, and this pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request and continue on to the next item, then, once parallel_request is filled, execute the parallel function and resume the loop for each item, restoring the previously saved context (local variables)?
EDIT: I think one solution would be to use a function instead of a loop and call it recursively. The downside is that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
index = 0

def process_data(dataset, last=False):
    global index, new_data
    my_index = index                      # remember this frame's position
    data = dataset[index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    index += 1
    parallel_requests.append(subdata)
    # If not last, recurse.
    # Otherwise, call the processing function.
    if not last:
        process_data(dataset, index == len(dataset) - 1)
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[my_index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset)
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will trigger the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Another solution would be to just recompute them, which would also require code duplication.
So I suspect that to achieve this you'd need some specific Python facility that enables it.
I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(index, datum, new_data, parallel_requests, final_output):
    data = datum
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

final_output = []
parallel_requests = []
new_data = []
funcs = [process_data(i, datum, new_data, parallel_requests, final_output)
         for i, datum in enumerate(dataset)]
[next(f) for f in funcs]
new_data.extend(process_requests(parallel_requests))
[next(f, None) for f in funcs]  # the default suppresses StopIteration as each generator finishes
The output list and generator calls are general enough that you can abstract these lines away into a helper function that sets everything up and calls the generators for you, leading to a very clean result: the code overhead is one line for the function definition and one line to call the helper.
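For instance, a driver along these lines (loop_parallel_loop is a hypothetical name, written against the generator signature above) reduces each use of the pattern to the generator definition plus a single call:

def loop_parallel_loop(make_gen, dataset, process_requests):
    """Run each generator up to its yield, execute the parallel step once,
    then resume every generator so it can merge its own results."""
    parallel_requests, final_output, new_data = [], [], []
    gens = [make_gen(i, datum, new_data, parallel_requests, final_output)
            for i, datum in enumerate(dataset)]
    for g in gens:
        next(g)                      # serial prefix: fills parallel_requests
    new_data.extend(process_requests(parallel_requests))
    for g in gens:
        next(g, None)                # resume; the default suppresses StopIteration
    return final_output

final_output = loop_parallel_loop(process_data, dataset, process_requests)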
I am trying to accomplish a task in parallel using a multiprocessing pool in Python. Basically, there are some static parameters for a function and a bunch of variable parameters for different hyperparameters. For example:
def simulate(static1, static2, iter1, iter2):
    # do some math in a for loop
    return output
Now the thing is that the nth component of iter2 goes only with the nth component of iter1, like say:
iter1 = [1,2,3,4]
iter2 = [x,y,z,w]
So during iteration, (1, x), (2, y), etc. should be the parameters, and in the end I expect to get 4 different outputs. So I am trying to implement:
partial_function = partial(simulate, static1=s1, static2=s2)
output = pool.map(partial_function, (iter1, iter2))
I am stuck on how to use multiple iterables, given that Python raises a TypeError saying simulate() is missing 1 positional argument. Any suggestions?
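A common fix (a sketch, not necessarily the only option) is to bind the static parameters positionally with partial, pair the iterables with zip, and use Pool.starmap, which unpacks each (iter1[n], iter2[n]) tuple into positional arguments; binding static1 and static2 positionally keeps them from clashing with the positional values that starmap supplies.

from functools import partial
from multiprocessing import Pool

# s1, s2, iter1, iter2 are the objects from the question
partial_function = partial(simulate, s1, s2)  # binds static1 and static2

if __name__ == '__main__':
    with Pool() as pool:
        # zip pairs the nth elements: (1, x), (2, y), ...
        # starmap unpacks each pair, so the call becomes simulate(s1, s2, a, b)
        outputs = pool.starmap(partial_function, zip(iter1, iter2))
    # outputs holds the 4 results, one per (iter1[n], iter2[n]) pair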
Context
I have a function that produces a large 2D numpy array (with fixed shape) as output. I am calling this function 1000 times using joblib (Parallel with a multiprocessing backend) on 8 CPUs. At the end of the job, I add up all the arrays element-wise (using np.sum) to produce a single 2D array that I am interested in. However, when I attempt this, I run out of RAM. I assume that this is because the 1000 arrays would need to be stored in RAM until they are summed at the end.
Question
Is there a way to get each worker to add up its arrays as it goes? For example, worker 1 would add array 2 to array 1, and then discard array 2 before computing array 3, and so on. This way, there would only be a maximum of 8 arrays (for 8 CPUs) stored in RAM at any point in time, and these could be summed up at the end to get the same answer.
The fact that you know your arguments in advance and that the calculation time does not vary much with the actual argument(s) simplifies the task. It allows you to assign a complete batch of jobs to every worker process at the start and just sum up the results at the end, just as you proposed.
In the code below, every spawned process gets an "equal" (as far as possible) share of all arguments (its args_batch) and sums the intermediate results from calling the target function into its own result array. These arrays are finally summed up by the parent process.
The "delayed" function in this example is not the target function that calculates an array, but a processing function (worker) to which the target function (calc_array) is passed as part of the job, along with the batch of arguments.
import numpy as np
from itertools import repeat
from time import sleep
from joblib import Parallel, delayed

def calc_array(v):
    """Create an array with the specified shape and
    fill it up with value v, then kill some time.
    Dummy target function.
    """
    new_array = np.full(shape=SHAPE, fill_value=v)
    # delay the result:
    cnt = 10_000_000
    for _ in range(cnt):
        cnt -= 1
    return new_array


def worker(func, args_batch):
    """Call func with every packet of arguments received and update the
    result array on the run.
    Worker function which runs the job in each spawned process.
    """
    results = np.zeros(SHAPE)
    for args_ in args_batch:
        new_array = func(*args_)
        np.sum([results, new_array], axis=0, out=results)
    return results


def main(func, arguments, n_jobs, verbose):
    with Parallel(n_jobs=n_jobs, verbose=verbose) as parallel:
        # bundle up jobs:
        funcs = repeat(func, n_jobs)  # functools.partial seems not pickle-able
        args_batches = np.array_split(arguments, n_jobs, axis=0)
        jobs = zip(funcs, args_batches)

        result = sum(parallel(delayed(worker)(*job) for job in jobs))
        assert np.all(result == sum(range(CALLS_TOTAL)))

        sleep(1)  # just to keep stdout ordered
        print(result)


if __name__ == '__main__':

    SHAPE = (4, 4)  # shape of the array calculated by calc_array
    N_JOBS = 8
    CALLS_TOTAL = 100
    VERBOSE = 10

    ARGUMENTS = np.asarray([*zip(range(CALLS_TOTAL))])
    # array([[0], [1], [2], ...])
    # zip to bundle arguments in a container so we have less code to
    # adapt when feeding a function with multiple parameters

    main(func=calc_array, arguments=ARGUMENTS, n_jobs=N_JOBS, verbose=VERBOSE)
I have a long-running iterative process where I run a calculation, and every x iterations I store the results to a DB.
For example, iterate the fun() function over range(20) and save every 5 results with save_results:
import time

def fun(x):
    time.sleep(0.1 * x)
    return 0.1 * x

def save_results(result):
    # originally stores the new data to the DB
    print(result)  # print as an example

result = []
for i in range(20):
    result.append(fun(i))
    if i % 5 == 4:
        save_results(result[-5:])
I want to parallelize the process with dask's delayed and compute methods. But if I run it as in the following example, save_results occurs before the compute:
import dask as da

result = []
for i in range(20):
    result.append(da.delayed(fun)(i))
    if i % 5 == 4:
        save_results(result[-5:])
result = da.compute(result)[0]
and therefore, instead of storing the results every 5 iterations, I store a list of delayed objects:
[Delayed('fun-202f7e28-e594-4926-a5cd-5931dbc99d6b'),
Delayed('fun-d2bf2bc9-a4f3-46d7-adb7-84114a68b482'),
Delayed('fun-c34f2c04-3e25-47fa-8165-1ee7c786aaf6'),
Delayed('fun-a4edd3fc-442d-4ec1-8a0e-320bd9315a61'),
Delayed('fun-c7b48e2c-cb66-472e-85c5-fe6c595fa1ec')]
How can I overcome the issue and store every 5 new results to the DB?
You should delay any function calls that operate on delayed objects.
Sequential code
result = []
for i in range(20):
    result.append(fun(i))
    if i % 5 == 4:
        save_results(result[-5:])
Parallel code
import dask

def fun(x):
    ...

results = []
side_effects = []
for i in range(20):
    result = dask.delayed(fun)(i)
    results.append(result)
    if i % 5 == 4:
        value = dask.delayed(save_results)(results[-5:])
        side_effects.append(value)

dask.compute(results + side_effects)
I have a dataset df of trader transactions.
I have 2 levels of for loops as follows:
smartTrader = []

for asset in range(len(Assets)):
    df = df[df['Assets'] == asset]
    # I have some more calculations here
    for trader in range(len(df['TraderID'])):
        # I have some calculations here; if the trader is successful, I add his ID
        # to the list as follows
        smartTrader.append(df['TraderID'][trader])
    # some more calculations here which are related to the first for loop
I would like to parallelise the calculations for each asset in Assets, and I also want to parallelise the calculations for each trader for every asset. After ALL these calculations are done, I want to do additional analysis based on the list of smartTrader.
This is my first attempt at parallel processing, so please be patient with me, and I appreciate your help.
If you use pathos, which provides a fork of multiprocessing, you can easily nest parallel maps. pathos is built for easily testing combinations of nested parallel maps -- which are direct translations of nested for loops.
It provides a selection of maps that are blocking, non-blocking, iterative, asynchronous, serial, parallel, and distributed.
>>> from pathos.pools import ProcessPool, ThreadPool
>>> amap = ProcessPool().amap
>>> tmap = ThreadPool().map
>>> from math import sin, cos
>>> print(amap(tmap, [sin, cos], [range(10), range(10)]).get())
[[0.0, 0.8414709848078965, 0.9092974268256817, 0.1411200080598672, -0.7568024953079282, -0.9589242746631385, -0.27941549819892586, 0.6569865987187891, 0.9893582466233818, 0.4121184852417566], [1.0, 0.5403023058681398, -0.4161468365471424, -0.9899924966004454, -0.6536436208636119, 0.2836621854632263, 0.9601702866503661, 0.7539022543433046, -0.14550003380861354, -0.9111302618846769]]
This example uses a processing pool and a thread pool: the thread map call is blocking, while the processing map call is asynchronous (note the get at the end of the last line).
Get pathos here: https://github.com/uqfoundation
or with:
$ pip install git+https://github.com/uqfoundation/pathos.git#master
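As a sketch of how the asker's two loops could map onto nested pathos pools (per_trader, per_asset, and traders_by_asset are stand-ins for the real calculations and data, not part of the original question): an outer process map over assets and an inner thread map over each asset's traders.

from pathos.pools import ProcessPool, ThreadPool

def per_trader(trader_id):
    # stand-in for the trader-level calculation; keep even IDs as "smart"
    return trader_id if trader_id % 2 == 0 else None

def per_asset(trader_ids):
    # stand-in for the asset-level calculation: map over that asset's traders
    tmap = ThreadPool().map
    return [t for t in tmap(per_trader, trader_ids) if t is not None]

if __name__ == '__main__':
    pmap = ProcessPool().map
    traders_by_asset = [range(5), range(5, 10)]   # stand-in for the per-asset TraderIDs
    smartTrader = [t for group in pmap(per_asset, traders_by_asset) for t in group]
    print(smartTrader)   # [0, 2, 4, 6, 8]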
Nested parallelism can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
Assume you want to parallelize the following nested program
def inner_calculation(asset, trader):
    return trader

def outer_calculation(asset):
    return asset, [inner_calculation(asset, trader) for trader in range(5)]

inner_results = []
outer_results = []
for asset in range(10):
    outer_result, inner_result = outer_calculation(asset)
    outer_results.append(outer_result)
    inner_results.append(inner_result)

# Then you can filter inner_results to get the final output.
Below is the Ray code parallelizing the above code:
Use the @ray.remote decorator for each function that we want to execute concurrently in its own process. A remote function returns a future (i.e., an identifier of the result) rather than the result itself.
When invoking a remote function f(), use the remote modifier, i.e., f.remote().
Use the ids_to_vals() helper function to convert a nested list of ids to values.
Note the program structure is identical. You only need to add remote and then convert the futures (ids) returned by the remote functions to values using the ids_to_vals() helper function.
import ray

ray.init()

# Define the inner calculation as a remote function.
@ray.remote
def inner_calculation(asset, trader):
    return trader

# Define the outer calculation to be executed as a remote function.
@ray.remote(num_return_vals=2)
def outer_calculation(asset):
    return asset, [inner_calculation.remote(asset, trader) for trader in range(5)]

# Helper to convert a nested list of object ids to a nested list of corresponding objects.
def ids_to_vals(ids):
    if isinstance(ids, ray.ObjectID):
        ids = ray.get(ids)
        if isinstance(ids, ray.ObjectID):
            return ids_to_vals(ids)
    if isinstance(ids, list):
        results = []
        for id in ids:
            results.append(ids_to_vals(id))
        return results
    return ids

outer_result_ids = []
inner_result_ids = []
for asset in range(10):
    outer_result_id, inner_result_id = outer_calculation.remote(asset)
    outer_result_ids.append(outer_result_id)
    inner_result_ids.append(inner_result_id)

outer_results = ids_to_vals(outer_result_ids)
inner_results = ids_to_vals(inner_result_ids)
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Probably threading, from the standard Python library, is the most convenient approach:
import threading

def worker(id):
    # Do your calculations here
    return

threads = []
for asset in range(len(Assets)):
    df = df[df['Assets'] == asset]
    for trader in range(len(df['TraderID'])):
        t = threading.Thread(target=worker, args=(trader,))
        threads.append(t)
        t.start()

# add a semaphore here if you need to synchronize results for all traders
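For the synchronization hinted at in the last comment, a minimal sketch (the worker body here is a stand-in for the real trader calculation) is to guard the shared list with a lock and join all threads before the follow-up analysis:

import threading

lock = threading.Lock()
smartTrader = []

def worker(trader_id):
    # stand-in for the real trader-level calculation
    with lock:                    # protect the shared list across threads
        smartTrader.append(trader_id)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # wait for all traders before further analysis
print(smartTrader)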
Instead of using for, use map:
smartTrader = []
m = map(calculations_as_a_function,
        [df[df['Assets'] == asset]
         for asset in range(len(Assets))])
# append each per-asset result to the list
smartTrader.extend(m)
From then on, you can try different parallel map implementations, such as multiprocessing's or stackless'.
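For example, a minimal sketch of the multiprocessing variant alluded to above (it assumes calculations_as_a_function accepts one per-asset sub-DataFrame and returns that asset's result, as in the snippet above):

from multiprocessing import Pool

if __name__ == '__main__':
    asset_frames = [df[df['Assets'] == asset] for asset in range(len(Assets))]
    with Pool() as pool:
        # same shape as the serial map above, but each per-asset call runs
        # in its own process
        smartTrader = []
        smartTrader.extend(pool.map(calculations_as_a_function, asset_frames))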