Using python multiprocessing.Pool without returning result object - python

I have a large number of CPU-bound tasks that I want to run in parallel. Most of these tasks will return similar results, and I only need to store unique results and count the non-unique ones.
Here's how it is currently designed: I use two managed dictionaries - one for results and another for result counters. My tasks check those dictionaries, using unique result keys, for the results they found and either write into both dictionaries or only increase the counters for non-unique results (if I have to write, I acquire the lock and check again to avoid inconsistency).
What I am concerned about: since Pool.map actually returns a result object, even though I do not save a reference to it, results will still pile up in memory until they are garbage collected. Even though I will have millions of plain Nones there (since I am processing results in a different manner and all my tasks just return None), I can not rely on specific garbage collector behavior, so the program might eventually run out of memory. I still want to keep the nice features of the pool but leave out this built-in result handling. Is my understanding correct and is my concern valid? If so, are there any alternatives?
Also, now that I laid it out on paper it looks really clumsy :) Do you see a better way to design such a thing?
Thanks!

Question: I still want to keep nice features of the pool
Remove the returned result from multiprocessing.Pool.
Copy the class MapResult and inherit from mp.pool.ApplyResult.
Add, replace, or comment out the following:
import multiprocessing as mp
from multiprocessing.pool import Pool

class MapResult(mp.pool.ApplyResult):
    def __init__(self, cache, chunksize, length, callback, error_callback):
        super().__init__(cache, callback, error_callback=error_callback)
        ...
        #self._value = [None] * length
        self._value = None
        ...

    def _set(self, i, success_result):
        ...
        if success:
            #self._value[i*self._chunksize:(i+1)*self._chunksize] = result
Create your own class myPool(Pool) that inherits from multiprocessing.Pool.
Copy def _map_async(... from multiprocessing.Pool.
Add, replace, or comment out the following:
class myPool(Pool):
    def __init__(self, processes=1):
        super().__init__(processes=processes)

    def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
                   error_callback=None):
        ...
        #if self._state != RUN:
        if self._state != mp.pool.RUN:
        ...
        #return result
Tested with Python: 3.4.2
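If patching Pool internals feels too fragile, a lighter-weight alternative (a sketch, not the approach above; task and the input range are placeholders for your own work items) is to produce the results lazily with imap_unordered and consume and discard them as they arrive, so they never accumulate in a list:

import multiprocessing as mp

def task(args):
    # ... do the real work here, updating the managed dicts as before ...
    return None

if __name__ == '__main__':
    with mp.Pool() as pool:
        # imap_unordered yields one result at a time; discarding each None
        # keeps memory flat instead of building a million-element result list.
        for _ in pool.imap_unordered(task, range(1000000), chunksize=100):
            pass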

Share Python dict across many processes

I am developing a heuristic algorithm to find "good" solutions for an NP (hence CPU-intensive) problem.
I am implementing my solution in Python (I agree it is not the best choice when speed is a concern, but that is what I am using) and I am splitting the workload across many subprocesses, each one in charge of exploring a branch of the space of possible solutions.
To improve performance I would like to share some information gathered during the execution of each subprocess among all subprocesses.
The "obvious" way to gather such information is to collect it in a dictionary whose keys are (frozen)sets of integers and whose values are lists (or sets) of integers.
Hence the shared dictionary must be both readable and writable from each subprocess, but I can safely expect that reads will be far more frequent than writes, because a subprocess will write to the shared dict only when it finds something "interesting" and will read it far more frequently to know whether a certain solution has already been evaluated by other processes (to avoid exploring the same branch twice or more).
I do not expect the size of such a dictionary to exceed 10 MB.
At the moment I have implemented the shared dict using an instance of multiprocessing.Manager(), which takes care of handling concurrent access to the shared dictionary out of the box.
However (according to what I have found), this way of sharing data is implemented using pipes between processes, which are a lot slower than plain shared memory (moreover, the dictionary must be pickled before being sent through the pipe and unpickled when it is received).
So far my code looks like this:
# main.py
import multiprocessing as mp
import os

def worker(a, b, c, shared_dict):
    while condition:
        # do things
        # sometimes read from shared_dict to check if a candidate solution has
        # already been evaluated by another process; if not, evaluate it and
        # store it inside shared_dict together with some related info
        ...
    return worker_result

def main():
    with mp.Manager() as manager:
        # setup params a, b, c, ...
        # ...
        shared_dict = manager.dict()
        n_processes = os.cpu_count()
        with mp.Pool(processes=n_processes) as pool:
            async_results = [pool.apply_async(worker, (a, b, c, shared_dict))
                             for _ in range(n_processes)]
            results = [res.get() for res in async_results]
    # gather the overall result from the 'results' list

if __name__ == '__main__':
    main()
To avoid the overhead due to pipes I would like to use shared memory, but the Python standard library doesn't seem to offer a straightforward way to handle a dictionary in shared memory.
As far as I know, the Python standard library offers helpers to store data in shared memory only for standard ctypes (with multiprocessing.Value and multiprocessing.Array), or gives you access to raw areas of shared memory.
I do not want to implement my own hash table in a raw area of shared memory, since I am not an expert in either hash tables or concurrent programming; instead I am wondering whether there are other, faster solutions to my needs that don't require writing everything from scratch.
For example, I have seen that the ray library allows reading data written in shared memory much faster than using pipes; however, it seems that you cannot modify a dictionary once it has been serialized and written to a shared memory area.
Any help?
Unfortunately, shared memory in Ray must be immutable. Typically, it is recommended that you use actors for mutable state (see here).
You can do a couple of tricks with actors. For example, you can store object references in your dict if the values are immutable. Then the dict itself won't be in shared memory, but all of its objects will be.
import numpy as np
import ray

@ray.remote
class DictActor:
    def __init__(self):
        self._dict = {}

    def put(self, key, value):
        self._dict[key] = ray.put(value)

    def get(self, key):
        return self._dict[key]

d = DictActor.remote()
ray.get(d.put.remote("a", np.zeros(100)))
ray.get(d.get.remote("a"))  # This result is in shared memory.
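If you would rather stay with the standard library's Manager dict, one way to cut down on the proxy round-trips (a sketch only, not from the answer above; lookup and publish are made-up helper names, and it assumes an occasionally stale read is acceptable - for this use case a stale miss only means a branch may occasionally be evaluated twice) is to keep a per-process local cache in front of the managed dict:

import multiprocessing as mp

def worker(shared_dict):
    local_cache = {}  # plain dict, lives in this process only

    def lookup(key):
        # Fast path: answer from the local copy, no IPC involved.
        if key in local_cache:
            return local_cache[key]
        # Slow path: one round-trip to the manager process.
        value = shared_dict.get(key)
        if value is not None:
            local_cache[key] = value
        return value

    def publish(key, value):
        # Writes go to both the shared dict and the local cache.
        shared_dict[key] = value
        local_cache[key] = value

    # ... use lookup()/publish() inside the search loop ...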

Simple way to parallelize embarrassingly parallelizable generator

I have a generator (or, a list of generators). Let's call them gens.
Each generator in gens is a complicated function that returns the next value of a complicated procedure. Fortunately, they are all independent of one another.
I want to call gen.__next__() for each element gen in gens and return the resulting values in a list. However, multiprocessing is unhappy with pickling generators.
Is there a fast, simple way to do this in Python? I would like gens of length m to be mapped to n cores locally on my machine, where n could be larger or smaller than m. Each generator should run on a separate core.
If this is possible, can someone provide a minimal example?
You can't pickle generators. Read more about it here.
There is a blog post which explains it in much more detail. Quoting from it:
Let's ignore that problem for a moment and look at what we would need to do to pickle a generator. Since a generator is essentially a souped-up function, we would need to save its bytecode, which is not guaranteed to be backward-compatible between Python versions, and its frame, which holds the state of the generator such as local variables, closures and the instruction pointer. And the latter is rather cumbersome to accomplish, since it basically requires making the whole interpreter picklable. So, any support for pickling generators would require a large number of changes to CPython's core.
Now if an object unsupported by pickle (e.g., a file handle, a socket, a database connection, etc.) occurs in the local variables of a generator, then that generator could not be pickled automatically, regardless of any pickle support for generators we might implement. So in that case, you would still need to provide custom __getstate__ and __setstate__ methods. This problem renders any pickling support for generators rather limited.
He also suggests a solution: use simple iterators.
the best solution to this problem is to rewrite the generators as simple iterators (i.e., ones with a __next__ method). Iterators are easy and space-efficient to pickle because their state is explicit. You would still need to handle objects representing some external state explicitly, however; you cannot get around this.
Another offered solution (which I haven't tried) suggests this:
Convert the generator to a class in which the generator code is the __iter__ method
Add __getstate__ and __setstate__ methods to the class to handle pickling. Remember that you can't pickle file objects, so __setstate__ will have to re-open files as necessary.
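As a rough illustration of that suggestion (a sketch with a made-up Countdown class, not taken from the quoted post), a simple counting generator could be rewritten as a picklable iterator whose state lives in ordinary attributes:

class Countdown:
    """Iterator equivalent of: def countdown(n): while n > 0: yield n; n -= 1"""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

# The state is just self.n, so pickle handles it out of the box;
# __getstate__/__setstate__ are only needed for things like open files.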
If your subtasks are truly parallel (do not rely on any shared state), you can do this with multiprocessing.Pool().
Take a look at
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
This requires you to make the arguments of pool.map() serializable. You can't pass a generator to your worker, but you can achieve something similar by defining the generator inside the target function and passing initialization arguments to the multiprocessing library:
import multiprocessing as mp
import time

def worker(value):
    # The generator is defined inside the multiprocessed function
    def gen():
        for k in range(value):
            time.sleep(1)  # Simulate a long-running task
            yield k
    # Execute the generator
    for x in gen():
        print(x)
        # Do something with x?

pool = mp.Pool()
pool.map(worker, [2, 5, 2])
pool.close()  # No more work will be submitted
pool.join()   # Wait for all the work to be finished
The output will be:
0
0
0
1
1
1
2
3
4
Note that this solution only really works if you build your generators and then use them only once, as their final state is lost at the end of the worker function.
Keep in mind that any time you want to use multiprocessing, you have to use serializable objects due to the limitations of inter-process communication; this can often prove limiting.
If your process is not CPU bound but instead I/O bound (disk access, network access, etc), you'll have a much easier time using threads.
You don't need to pickle the generator, just send an index of the generator to the processing pool.
import multiprocessing

M = len(gens)
N = multiprocessing.cpu_count()

def proc(gen_idx):
    return [r for r in gens[gen_idx]()]

if __name__ == "__main__":
    with multiprocessing.Pool(N) as p:
        for r in p.imap_unordered(proc, range(M)):
            print(r)
Note that I don't call/initialize the generator until within the processing function.
Using imap_unordered will allow you to process the results as each generator completes.
It's quite easy to implement with threads: just don't block on the threads synchronously; instead, keep looping over their states and join them on completion. This template should be good enough to give the idea; self.done always needs to be set last when a thread completes and reset before a thread is reused.
import threading as th
import random
import time

class Gen_thread(th.Thread):
    def is_done(self):
        return self.done

    def get_result(self):
        return self.work_result

    def __init__(self, *args, **kwargs):
        self.g_id = kwargs['id']
        self.kwargs = kwargs
        self.args = args
        self.work_result = None
        self.done = False
        th.Thread.__init__(self)

    def run(self):
        # time.sleep(*self.args) to pass variables
        time.sleep(random.randint(1, 4))
        self.work_result = 'Thread {0} done'.format(self.g_id + 1)
        self.done = True

class Gens(object):
    def __init__(self, n):
        self.n_needed = 0
        self.n_done = 0
        self.n_loop = n
        self.workers_tmp = None
        self.workers = []

    def __iter__(self):
        return self

    def __next__(self):
        if self.n_needed == 0:
            for w in range(self.n_loop):
                self.workers.append(Gen_thread(id=w))
                self.workers[w].start()
                self.n_needed += 1
        while self.n_done != self.n_needed:
            for w in range(self.n_loop):
                if self.workers[w].is_done():
                    self.workers[w].join()
                    self.workers_tmp = self.workers[w].get_result()
                    self.workers.pop(w)
                    self.n_loop -= 1
                    self.n_done += 1
                    return self.workers_tmp
        raise StopIteration()

if __name__ == '__main__':
    for gen in Gens(4):
        print(gen)

Python Multiprocessing Slower and not really working for object methods

Edit: Running an Apple MBP 2017 Model 14,3 with a 2.8 GHz 4-core i7:
>>> multiprocessing.cpu_count()
8
I have a list of objects, and I'm calling an object method on each of them in Python, once per object. The process is part of a genetic algorithm, so I'm interested in speeding it up. Basically, each time I update the environment with data from the data list, each object (genome) performs a little bit of math, including taking values from the environment and referencing its own internal values.
I'm doing:
from multiprocessing import Pool

class Individual(object):
    def __init__(self):
        self.parameter1 = None
        self.parameter2 = None

    def update_values(self):
        # reads the environment variables, does math specific to each instance,
        # updates internal parameters
        a, b, c, d = environment_variables
        self.parameter1 = do_math(a, b, c, d,
                                  self.parameter1, self.parameter2)
        self.parameter2 = do_math(a, b, c, d,
                                  self.parameter1, self.parameter2)

data_list = [data1, data2, data3, ..., data1000]
object_list = [object1, object2, object3, ..., object20000]
If I run this:
for newdataset in data_list:
    update_parameters(newdataset)
    for object in object_list:
        object.update_values()
It is much faster than if I try to split this up using multiprocessing/map:
def process_object(object):
    object.update_values()

for newdataset in data_list:
    update_parameters(newdataset)
    with Pool(4) as p:
        p.map(process_object, object_list)
If I run with an object_list length of 200 (instead of 20000), the total time is 14.8 seconds in single-threaded mode.
If I run the same in multiprocessing mode, the total time is... still waiting... ok... 211 seconds.
Also, it doesn't appear to do what the function says it should at all. What am I missing here? When I check the values of each object, they do not appear to have been updated at all.
When you use multiprocessing, you're serializing and transferring the data both ways. In this case, that includes each object you intend to call update_values on. I'm guessing that you're also iterating on your models, meaning they'll be sent back and forth quite a lot. Furthermore, map() returns a list of results, but process_object just returns None. So you've serialized a model, sent it to another process, had that process run and update the model, then sent a None back and tossed away the updated model, before tossing away the list of None results. If you were to return the models:
def process_object(object):
    object.update_values()
    return object

...

object_list = p.map(process_object, object_list)
Your program might actually produce some results, but almost certainly still slower than you wish. In particular, your process pool will not have the data_list or similar things (the "environment"?) - it only receives what you pass through Pool.map().
You may want to consider using other tools such as tensorflow or MPI. At least read up on sharing state between processes. Also, you probably shouldn't be recreating your process pool for every iteration; that's very expensive on some platforms, such as Windows.
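A rough way to combine both suggestions (a sketch, not a drop-in fix: it assumes update_values is changed to take the environment as an argument, and make_environment is a hypothetical stand-in for whatever update_parameters currently computes) is to create the pool once, ship the environment along with each object, and keep the updated copies that the workers send back:

import multiprocessing as mp

def process_object(env, obj):
    # The worker gets everything it needs as arguments and returns the
    # updated copy, which the parent keeps in place of the old object.
    obj.update_values(env)
    return obj

if __name__ == '__main__':
    with mp.Pool(mp.cpu_count()) as pool:          # create the pool once
        for newdataset in data_list:
            env = make_environment(newdataset)     # hypothetical helper
            object_list = pool.starmap(process_object,
                                       [(env, obj) for obj in object_list],
                                       chunksize=100)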
I would split up the parallelization a little bit differently. It's hard to tell what's happening with update_parameters, but I would parallelize the call to that too. Why leave it out? You could wrap the whole operation you're interested in in some function, right?
Also, this is important: you need to make sure that you only open the pool in the main process. So add the lines:
if __name__ == '__main__':
    with Pool(multiprocessing.cpu_count()) as p:
        ...  # run the parallel work here

Is filter thread-safe

I have a thread which is updating a list called l. Am I right in saying that it is thread-safe to do the following from another thread?
filter(lambda x: x[0] == "in", l)
If it's not thread-safe, is this then the correct approach:
import threading
import time
import Queue

class Logger(threading.Thread):
    def __init__(self, log):
        super(Logger, self).__init__()
        self.log = log
        self.data = []
        self.finished = False
        self.data_lock = threading.Lock()

    def run(self):
        while not self.finished:
            try:
                with self.data_lock:
                    self.data.append(self.log.get(block=True, timeout=0.1))
            except Queue.Empty:
                pass

    def get_data(self, cond):
        with self.data_lock:
            d = filter(cond, self.data)
        return d

    def stop(self):
        self.finished = True
        self.join()
        print("Logger stopped")
where the get_data(self, cond) method is used to retrieve a small subset of the data in the self.data in a thread safe manner.
First, to answer the question in your title: filter is just a function. Hence, its thread-safety will depend on the data structure you use it with.
As pointed out in the comments already, list operations themselves are thread-safe in CPython and protected by the GIL, but that is arguably only an implementation detail of CPython that you shouldn't really rely on. Even if you could rely on it, thread-safety of some of their operations probably does not mean the kind of thread-safety you mean:
The problem is that iterating over a sequence with filter is in general not an atomic operation. The sequence could be changed during iteration. Depending on the data structure underlying your iterator, this might cause more or less weird effects. One way to overcome this problem is to iterate over a copy of the sequence that is created with one atomic action. The easiest way to do this for standard sequences like tuple, list and string is with the slice operator, like this:
filter(lambda x: x[0] == "in", l[:])
Apart from this not necessarily being thread-safe for other data types, there's one problem with it: it's only a shallow copy. As your list's elements seem to be list-like as well, another thread could in parallel do del l[1000][:] to empty one of the inner lists (which are pointed to in your shallow copy as well). This would make your filter expression fail with an IndexError.
All that said, it's not a shame to use a lock to protect access to your list, and I'd definitely recommend it. Depending on how your data changes and how you work with the returned data, it might even be wise to deep-copy the elements while holding the lock and to return those copies. That way you can guarantee that once returned, the filter condition won't suddenly change for the returned elements.
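That deep-copying variant of the get_data method could look roughly like this (a sketch, not part of the original code):

import copy

def get_data(self, cond):
    with self.data_lock:
        # Deep-copy the matching elements while the lock is held, so the
        # returned items can no longer be mutated by the logging thread.
        return [copy.deepcopy(x) for x in self.data if cond(x)]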
Wrt. your Logger code: I'm not 100% sure how you plan to use this, or whether it's critical for you to run several threads on one queue and join them. What looks weird to me is that you never use Queue.task_done() (assuming that self.log is a Queue). Also, your polling of the queue is potentially wasteful. If you don't need the join of the thread, I'd suggest at least turning the lock acquisition around:
class Logger(threading.Thread):
    def __init__(self, log):
        super(Logger, self).__init__()
        self.daemon = True
        self.log = log
        self.data = []
        self.data_lock = threading.Lock()

    def run(self):
        while True:
            l = self.log.get()  # thread will sleep here indefinitely
            with self.data_lock:
                self.data.append(l)
            self.log.task_done()

    def get_data(self, cond):
        with self.data_lock:
            d = filter(cond, self.data)
            # maybe deepcopy d here
        return d
Externally you could still do log.join() to make sure that all of the elements of the log queue are processed.
If one thread writes to a list and another thread reads that list, the two must be synchronized. It doesn't matter for that aspect whether the reader uses filter(), an index or iteration, or whether the writer uses append() or any other method.
In your code, you achieve the necessary synchronization using a threading.Lock. Since you only access the list within the context of with self.data_lock, the accesses are mutually exclusive.
In summary, your code is formally correct concerning the list handling between threads. But:
You do access self.finished without the lock, which is problematic. Assigning to that member changes self, i.e. the mapping of the object to its members, so this should be synchronized. In practice this won't hurt, because True and False are global constants; at worst you will have a short delay between setting the state in one thread and seeing it in the other. It remains bad, because it is habit-forming.
As a rule, when you use a lock, always document which objects that lock protects. Also, document which object is accessed by which thread. Then the fact that self.finished is shared and requires synchronization would have been obvious. Also, making a visual distinction between public functions and data and private ones (beginning with an _underscore, see PEP 8) helps keep track of this. It also helps other readers.
A similar issue is your base class. In general, inheriting from threading.Thread is a bad idea. Rather, include an instance of the thread class and give it a function like self._main_loop to run. The reason is that you are saying that your Logger is a Thread and that all of its base class's public members are also public members of your class, which is probably a much wider interface than what you intended.
You should never block while holding a lock. In your code, you block in self.log.get(block=True, timeout=0.1) while holding the lock. In that time, even if nothing actually happens, no other thread will be able to complete a call to get_data(). There is actually just a tiny window between unlocking the mutex and locking it again where a caller of get_data() does not have to wait, which is very bad for performance. I could even imagine that your question is motivated by the really bad performance this causes. Instead, call log.get(..) without the lock; it shouldn't need one. Then, with the lock held, append the data to self.data and check self.finished.
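A minimal sketch of that restructured run() loop (it keeps the original timeout-based polling from the question, since this advice still wants self.finished checked):

def run(self):
    while True:
        try:
            item = self.log.get(block=True, timeout=0.1)  # no lock held here
        except Queue.Empty:
            item = None
        with self.data_lock:
            if self.finished:
                break
            if item is not None:
                self.data.append(item)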

How to share a cache between multiple processes?

I'm using an LRU cache to speed up some rather heavy-duty processing. It works well and speeds things up considerably. However...
When I multiprocess, each process creates its own separate cache and there are 8 copies of the same thing. That doesn't appear to be a problem, until the box runs out of memory and bad things happen as a result...
Ideally I only need a cache size of around 300 items for the application, and 1*300 will fit in the 7 GB I have to work with, but 8*300 just doesn't fit.
How do I get all the processes to share the same cache?
I believe you can use a Manager to share a dict between processes. That should in theory let you use the same cache for all functions.
However, I think a saner design would be to have one process that responds to queries by looking them up in the cache and, if they are not present, delegating the work to a subprocess and caching the result before returning it. You could easily do that with:
import concurrent.futures
import functools

with concurrent.futures.ProcessPoolExecutor() as e:
    @functools.lru_cache()
    def work(*args, **kwargs):
        return e.submit(slow_work, *args, **kwargs)
Note that work will return Future objects, which the consumer will have to wait on. The lru_cache will cache the Future objects so they will be returned automatically; I believe you can access their data more than once, but can't test it right now.
If you're not using Python 3, you'll have to install backported versions of concurrent.futures and functools.lru_cache.
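For completeness, the Manager-based idea mentioned at the top could look roughly like this (a sketch only; compute and the key range are placeholders, eviction is not handled, so this is a shared dict rather than a true LRU cache, and each lookup pays a round-trip to the manager process):

import multiprocessing as mp

def compute(key):
    return key * key  # stand-in for the expensive routine being cached

def cached_compute(key, shared_cache, lock):
    # Check the managed dict first; only compute (and store) on a miss.
    if key in shared_cache:
        return shared_cache[key]
    value = compute(key)
    with lock:
        shared_cache[key] = value
    return value

def worker(args):
    key, shared_cache, lock = args
    return cached_compute(key, shared_cache, lock)

if __name__ == '__main__':
    with mp.Manager() as manager:
        shared_cache = manager.dict()
        lock = manager.Lock()
        with mp.Pool() as pool:
            keys = list(range(20))  # placeholder work items
            results = pool.map(worker, [(k, shared_cache, lock) for k in keys])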
Pass the shared cache to each process. The parent process can instantiate a single cache and pass it to each process as an argument...
@utils.lru_cache(maxsize=300)
def get_stuff(key):
    """This is the routine that does the stuff which can be cached.
    """
    return Stuff(key)

def process(stuff_obj):
    """This is the routine which multiple processes call to do work with that Stuff.
    """
    # get_stuff(key)  <-- Wrong; I was calling the cache from here
    stuff_obj.execute()

def iterate_stuff(keys):
    """This generates work for the processes.
    """
    for key in keys:
        yield get_stuff(key)  # <-- I can call the cache from the parent

def main():
    ...
    keys = get_list_of_keys()
    for result in pool.imap(process, iterate_stuff(keys)):
        evaluate(result)
    ...
This example is simple because I can look up the cache before calling the process. Some scenarios might prefer to pass a pointer to the cache rather than the value, e.g.:
    yield (key, get_stuff)
Katriel's answer put me on the right track and I would implement it, but, silly me, my mistake was even simpler to solve than what he suggested.
