Python multiprocessing: reduce during map?

Is there a way to reduce memory consumption when working with Python's pool.map?
To give a short example: worker() does some heavy lifting and returns a larger array...
def worker():
    # cpu time intensive tasks
    return large_array
...and a Pool maps over some large sequence:
with mp.Pool(mp.cpu_count()) as p:
    result = p.map(worker, large_sequence)
Given this setup, result will obviously consume a large portion of the system's memory. However, the final operation on the result is:
final_result = np.sum(result, axis=0)
Thus, NumPy effectively does nothing more than reduce the iterable with a sum operation:
final_result = reduce(lambda x, y: x + y, result)
This, of course, would make it possible to consume the results of pool.map as they come in, garbage-collecting each one after it has been folded into the running sum, and eliminate the need to store all the values first.
I could write an mp.Queue that results go into and then a queue-consuming worker that sums them up, but this would (1) require significantly more lines of code and (2) feel like a (potentially slower) workaround rather than clean code.
Is there a way to reduce results returned by a mp.Pool operation directly as they come in?

The iterator mappers imap and imap_unordered seem to do the trick:
#!/usr/bin/env python3
import multiprocessing
import numpy as np

def worker(a):
    # cpu time intensive tasks
    large_array = np.ones((20, 30)) + a
    return large_array

if __name__ == '__main__':
    arraysum = np.zeros((20, 30))
    large_sequence = range(20)
    num_cpus = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=num_cpus) as p:
        for large_array in p.imap_unordered(worker, large_sequence):
            arraysum += large_array
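One extra knob worth noting (my addition, not part of the original answer): both imap and imap_unordered accept a chunksize argument, which hands several items to a worker per round-trip and usually lowers the per-task overhead when the work per item is small. A minimal sketch reusing the setup above, with a guessed chunksize:
# Same setup as above; chunksize=4 is a guess and should be tuned.
with multiprocessing.Pool(processes=num_cpus) as p:
    for large_array in p.imap_unordered(worker, large_sequence, chunksize=4):
        arraysum += large_array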

Related

Programmatically setting number of processes with ray

I want to use Ray to parallelize some computations in python. As part of this, I want a method which takes the desired number of worker processes as an argument.
The introductory articles on Ray that I can find say to specify the number of processes at the top level, which is different from what I want. Is it possible to specify it similarly to the way one would when instantiating, e.g., a multiprocessing Pool object, as illustrated below?
Example using multiprocessing:
import multiprocessing as mp

def f(x):
    return 2 * x

def compute_results(x, n_jobs=4):
    with mp.Pool(n_jobs) as pool:
        res = pool.map(f, x)
    return res

data = [1, 2, 3]
results = compute_results(data, n_jobs=4)
Example using Ray:
import ray

# Tutorials say to designate the number of cores already here
ray.init(num_cpus=4)

@ray.remote
def f(x):
    return 2 * x

def compute_results(x):
    result_ids = [f.remote(val) for val in x]
    res = ray.get(result_ids)
    return res
If you run f.remote() four times then Ray will create four worker processes to run it.
Btw, you can use multiprocessing.Pool with Ray: https://docs.ray.io/en/latest/ray-more-libs/multiprocessing.html
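To make that concrete (my own sketch, not from the original answer): Ray ships a drop-in Pool in ray.util.multiprocessing, and, as far as I know, its constructor accepts a processes argument, so the worker count can be passed in as a function parameter just like in the multiprocessing example:
from ray.util.multiprocessing import Pool  # Ray's drop-in replacement for multiprocessing.Pool

def f(x):
    return 2 * x

def compute_results(x, n_jobs=4):
    # processes sets the number of Ray workers backing the pool,
    # mirroring the multiprocessing example above.
    pool = Pool(processes=n_jobs)
    try:
        return pool.map(f, x)
    finally:
        pool.terminate()

if __name__ == "__main__":
    print(compute_results([1, 2, 3], n_jobs=4))  # expected: [2, 4, 6]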

Large memory consumption by iPython Parallel module

I am using the ipyparallel module to speed up an all-by-all list comparison, but I am having issues with huge memory consumption.
Here is a simplified version of the script that I am running:
From a SLURM script, start the cluster and run the Python script:
ipcluster start -n 20 --cluster-id="cluster-id-dummy" &
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy"
In Python, make two lists of lists for the simplified example:
import ipyparallel as ipp
from itertools import compress
list1 = [ [i, i, i] for i in range(4000000)]
list2 = [ [i, i, i] for i in range(2000000, 6000000)]
Then define my list comparison function:
def loop(item):
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
Then connect to my ipython engines, push list2 to each of them and map my function:
rc = ipp.Client(profile='default', cluster_id = "cluster-id-dummy")
dview = rc[:]
dview.block = True
lview = rc.load_balanced_view()
lview.block = True
mydict = dict(list2 = list2)
dview.push(mydict)
trueorfalse = list(lview.map(loop, list1))
As mentioned, I am running this on a cluster using SLURM and getting the memory usage from the sacct command. Here is the memory usage that I am getting for each of the steps:
Just creating the two lists: 1.4 Gb
Creating two lists and pushing them to 20 engines: 22.5 Gb
Everything: 62.5 Gb++ (this is where I get an OUT_OF_MEMORY failure)
From running htop on the node while running the job, it seems that the memory usage is going up slowly over time until it reaches the maximum memory and fails.
I combed through this previous thread and implemented a few of the suggested solutions, without success:
Memory leak in IPython.parallel module?
I tried clearing the view with each loop:
def loop(item):
    lview.results.clear()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
I tried purging the client with each loop:
def loop(item):
    rc.purge_everything()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False
And I tried using the --nodb and --sqlitedb flags with ipcontroller and started my cluster like this:
ipcontroller --profile=pierrj --nodb --cluster-id='cluster-id-dummy' &
sleep 60
for (( i = 0 ; i < 20; i++)); do ipengine --profile=pierrj --cluster-id='cluster-id-dummy' & done
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy" --profile=pierrj
Unfortunately, none of this has helped; it resulted in the exact same out-of-memory error.
Any advice or help would be greatly appreciated!
Looking around, there seem to be lots of people complaining about LoadBalancedViews being very memory inefficient, and I have not been able to find any useful suggestions on how to fix this.
However, I suspect given your example that's not the place to start. I assume that your example is a reasonable approximation of your code. If your code is doing list comparisons with several million data points, I would advise you to use something like numpy to perform the calculations rather than iterating in python.
If you restructure your algorithm to use numpy vector operations it will be much, much faster than indexing into a list and performing the calculation in python. numpy is a C library and calculation done within the library will benefit from compile time optimisations. Furthermore, performing operations on arrays also benefits from processor predictive caching (your CPU expects you to use adjacent memory looking forward and preloads it; you potentially lose this benefit if you access the data piecemeal).
I have done a very quick hack of your example to demonstrate this. It compares your loop calculation with a very naïve numpy implementation of the same question. The Python loop method is competitive for small numbers of entries, but the numpy version quickly heads towards 100x faster at the number of entries you are dealing with. I suspect that how you structure your data will outweigh the performance gain you are getting through parallelisation.
Note that I have chosen a matching value in the middle of the distribution; performance differences will obviously depend on the distribution.
import numpy as np
import time

def loop(item, list2):
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False

def run_comparison(scale):
    list2 = [[i, i, i] for i in range(4 * scale)]
    arr2 = np.array([i for i in range(4 * scale)])
    test_value = (2 * scale)

    np_start = time.perf_counter()
    res1 = test_value in arr2
    np_end = time.perf_counter()
    np_time = np_end - np_start

    loop_start = time.perf_counter()
    res2 = loop((test_value, 0, 0), list2)
    loop_end = time.perf_counter()
    loop_time = loop_end - loop_start

    assert res1 == res2
    return (scale, loop_time / np_time)

print([run_comparison(v) for v in [100, 1000, 10000, 100000, 1000000, 10000000]])
returns:
[
(100, 1.0315526939407524),
(1000, 19.066806587378263),
(10000, 91.16463510672537),
(100000, 83.63064249916434),
(1000000, 114.37531283123414),
(10000000, 121.09979997458508)
]
Assuming that a single task on the two lists is being divided up between the workers, you will want to ensure that the individual workers are using the same copy of the lists. In most cases it looks like ipyparallel will pickle objects sent to workers (relevant doc). If you are able to use one of the types that are not copied (as stated in the doc):
buffers/memoryviews, bytes objects, and numpy arrays.
the memory issue might be resolved since a reference is distributed. This answer also assumes that the individual tasks do not need to operate on the lists while working (thread-safe).
TL;DR It looks like moving the objects passed to the parallel workers into a numpy array may resolve the explosion in memory.
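As a rough illustration of that TL;DR (my own sketch, reusing the variable names from the question; arr2 and loop_np are names I am introducing here): push only the comparison column as a numpy array and let each engine do a vectorised membership test instead of a Python loop.
import numpy as np

# First column of list2 as a numpy array; numpy arrays are among the types
# ipyparallel can send without an element-by-element pickle.
arr2 = np.array([row[0] for row in list2])
dview.push(dict(arr2=arr2))

def loop_np(item):
    # Vectorised membership test, evaluated on the engine.
    return bool((arr2 == item[0]).any())

trueorfalse = list(lview.map(loop_np, list1))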

How to correctly implement apply_async for data processing?

I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial

def fit_model(data, q):
    # data is a 1-D array holding precipitation values
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]  # output slope of quantile q
    return pointEstimate

# precipAll is an array of shape (1405*621, 123, 12) (longitudes*latitudes, years, months)
# find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:, 0, 0]))[0]  # 481631 indices
month = 4

# holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan

def saveResult(result, pos):
    asyncResults[pos] = result

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20)  # my server has 24 CPUs
    for i in nonNaN:
        # use partial so I can also pass the index i so the result is
        # stored in the expected position
        new_callback_function = partial(saveResult, pos=i)
        pool.apply_async(fit_model, args=(precipAll[i, :, month], 0.9), callback=new_callback_function)
    pool.close()
    pool.join()
When I ran this, I stopped it after it took longer than if I had not used multiprocessing at all. The function fit_model is on the order of 0.02 seconds, so could the overhead associated with apply_async be causing the slowdown? I need to maintain the order of the results because I am plotting this data onto a map after the processing is done. Any thoughts on where I need improvement would be greatly appreciated!
If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool. However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray

@ray.remote
def fit_model(precip_all, i, month, q):
    data = precip_all[i, :, month]
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]
    return pointEstimate

if __name__ == '__main__':
    ray.init()

    # Create an array and place it in shared memory so that the workers can
    # access it (in a read-only fashion) without creating copies.
    precip_all = np.zeros((100, 123, 12))
    precip_all_id = ray.put(precip_all)

    result_ids = []
    for i in range(precip_all.shape[0]):
        result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))

    results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead (see the sketch after these notes).
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
For background on the differences between Ray and Python multiprocessing, see the Ray documentation. Note that I'm helping develop Ray.
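A minimal sketch of the batching idea from the notes above (my own addition; fit_model_batch and the chunking are hypothetical and reuse the names from the example): each remote call handles a slice of row indices, so the per-task overhead is paid once per chunk instead of once per row.
@ray.remote
def fit_model_batch(precip_all, indices, month, q):
    # Fit one quantile regression per row index and return the slopes together.
    years = np.arange(1895, 2018, 1)
    slopes = []
    for i in indices:
        res = QuantReg(exog=sm.add_constant(years), endog=precip_all[i, :, month]).fit(q=q)
        slopes.append(res.params[1])
    return slopes

# Split the row indices into chunks (1000 is a guess) and submit one task per chunk.
chunk_size = 1000
all_indices = list(range(precip_all.shape[0]))
chunks = [all_indices[i:i + chunk_size] for i in range(0, len(all_indices), chunk_size)]
result_ids = [fit_model_batch.remote(precip_all_id, chunk, 4, 0.9) for chunk in chunks]
results = np.concatenate([np.array(r) for r in ray.get(result_ids)])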

Create a set with multiprocessing

I have a big list of items, and some auxiliary data. For each item in the list and element in data, I compute some thing, and add all the things into an output set (there may be many duplicates). In code:
def process_list(myList, data):
    ret = set()
    for item in myList:
        for foo in data:
            thing = compute(item, foo)
            ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    what_I_Want = process_list(myList, data)
Because myList is big and compute(item, foo) is costly, I need to use multiprocessing. For now this is what I have:
from multiprocessing import Pool

def initialize_worker(bar):
    global data
    data = bar

def process_item(item):
    ret = set()
    for foo in data:
        thing = compute(item, foo)
        ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    p = Pool(nb_proc, initializer=initialize_worker, initargs=(data))
    ret = p.map(process_item, myList)
    what_I_Want = set().union(*ret)
What I do not like about that is that ret can be big. I am thinking about 3 options:
1) Chop myList into chunks and pass them to the workers, who will use process_list on each chunk (hence some duplicates will be removed at that step), and then union all the sets obtained to remove the last duplicates.
question: Is there an elegant way of doing that? Can we specify to Pool.map that it should pass the chunks to the workers instead of each item in the chunks? I know I could chop the list by myself, but this is damn ugly.
2) Have a shared set between all processes.
question: Why doesn't multiprocessing.Manager feature set()? (I know it has dict(), but still...) If I use a manager.dict(), won't the communication between the processes and the manager slow things down considerably?
3) Have a shared multiprocessing.Queue(). Each worker puts the things it computes into the queue. Another worker does the unioning until some stopItem is found (which we put in the queue after the p.map)
question: Is this a stupid idea? Are communications between processes and a multiprocessing.Queue faster than those with a, say, manager.dict()? Also, how could I get back the set computed by the worker doing the unioning?
A minor thing: initargs takes a tuple.
If you want to avoid creating all the results before reducing them into a set, you can use Pool.imap_unordered() with some chunk size. That will produce chunksize results at a time from each worker as they become available (see the sketch at the end of this answer).
If you want to change process_item to accept chunks directly, you have to do it manually. toolz.partition_all could be used to partition the initial dataset.
Finally, the managed data structures are bound to have much higher synchronization overhead. I'd avoid them as much as possible.
Go with imap_unordered and see if that's good enough; if not, then partition; if you cannot help having more than a couple duplicates total, use a managed dict.
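To make the imap_unordered suggestion concrete (a sketch only, reusing create_data, create_list, nb_proc, initialize_worker and process_item from the question; chunksize is a guess):
from multiprocessing import Pool

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    what_I_Want = set()
    with Pool(nb_proc, initializer=initialize_worker, initargs=(data,)) as p:
        # Each per-item set is folded into the result as it arrives, so the
        # full list of intermediate sets never has to exist at once.
        for partial_set in p.imap_unordered(process_item, myList, chunksize=100):
            what_I_Want |= partial_set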

Multiprocessing on pd.DataFrame did not speed up?

I am trying to apply a function to a large pd.DataFrame in pyspark. My code is posted below; it uses multiprocessing.Pool but is not as fast as expected. It takes the same time as df.apply(f, axis=1).
There must be some mistake I didn't notice. I spent my whole day on it but found nothing out. That's why I finally came here for help.
def f(series):
    # do something
    return series

if __name__ == '__main__':
    # load(df)
    output = pd.DataFrame()
    pool = multiprocessing.Pool(8)
    for name in df.columns:
        res = pool.apply_async(f, (df[name],), callback=logging.info("f with " + name))
        output[name] = res.get()
    pool.close()
    pool.join()
After @Andriy Ivaneyko's answer, I also tried this:
if __name__ == '__main__':
    # load(df)
    output = pd.DataFrame()
    res = {}
    pool = multiprocessing.Pool(8)
    for name in df.columns:
        res[name] = pool.apply_async(f, (df[name],), callback=logging.info("f with " + name))
    for name, val in res.items():
        output[name] = val.get()
    pool.close()
    pool.join()
I changed the number of cores from 4 to 8 to 16; however, the function consumes almost the same time.
The get() method blocks until the function is completed; that's the reason you're not getting a performance benefit. Create a collection of the ApplyResult objects (returned by apply_async) and perform get only after you finish iterating over df.columns:
# ... Code before
apply_results = {}
for name in df.columns:
    res = pool.apply_async(f, (df[name],), callback=logging.info("f with " + name))
    apply_results[name] = res
for name, res in apply_results.items():
    output[name] = res.get()
# ... Code after
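One more detail worth flagging (my observation, not part of the original answer): callback=logging.info("f with " + name) calls logging.info immediately and passes its return value (None) as the callback, so nothing is logged when a task actually finishes. A hedged sketch of one way to log on completion, reusing the names above (log_done is a helper I am introducing):
from functools import partial
import logging

def log_done(result, name):
    # Runs in the main process once the task for this column has finished.
    logging.info("f with %s", name)

for name in df.columns:
    apply_results[name] = pool.apply_async(
        f, (df[name],), callback=partial(log_done, name=name)
    )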
