Python Multiprocessing Slower and not really working for object methods - python

Edit: Running an Apple MBP 2017 (Model 14,3) with a 2.8 GHz 4-core i7:
multiprocessing.cpu_count()
8
I have a list of objects, and I'm calling a method on each of them in Python, once per object. The process is for a genetic algorithm, so I'm interested in speeding it up. Basically, each time I update the environment with data from the data list, each object (genome) performs a little bit of math, including taking values from the environment and referencing its own internal values.
I'm doing:
from multiprocessing import Pool

class Individual(object):
    def __init__(self):
        self.parameter1 = None
        self.parameter2 = None

    def update_values(self):
        # reads the environment variables, does math specific to each instance,
        # and updates the internal parameters
        a, b, c, d = environment_variables
        self.parameter1 = do_math(a, b, c, d,
                                  self.parameter1, self.parameter2)
        self.parameter2 = do_math(a, b, c, d,
                                  self.parameter1, self.parameter2)

data_list = [data1, data2, data3, ..., data1000]
object_list = [object1, object2, object3, ..., object20000]
If I run this:
for newdataset in data_list:
    update_parameters(newdataset)
    for obj in object_list:
        obj.update_values()
It is much faster than if I try to split this up using multiprocessing/ map:
def process_object(obj):
    obj.update_values()

for newdataset in data_list:
    update_parameters(newdataset)
    with Pool(4) as p:
        p.map(process_object, object_list)
If I run with an object_list of length 200 (instead of 20000), the total time is 14.8 seconds in single-threaded mode.
If I run the same thing in multiprocessing mode, the total time is... still waiting... ok... 211 seconds.
It also doesn't appear to do what the function says it should at all: when I check the values of each object afterwards, they don't appear to have been updated. What am I missing here?

When you use multiprocessing, you're serializing and transferring the data both ways. In this case, that includes each object you intend to call update_values on. I'm guessing that you're also iterating over your models, meaning they'll be sent back and forth quite a lot. Furthermore, map() returns a list of results, but process_object just returns None. So you've serialized a model, sent it to another process, had that process run and update the model, then sent a None back and thrown away the updated model, before tossing away the list of None results. If you were to return the models:
def process_object(obj):
    obj.update_values()
    return obj
...
object_list = p.map(process_object, object_list)
Your program might actually produce some results, but it would almost certainly still be slower than you wish. In particular, your pool's worker processes will not have data_list or similar things (the "environment"?); they only receive what you pass through Pool.map().
You may want to consider using other tools such as tensorflow or MPI. At least read up on sharing state between processes. Also, you probably shouldn't be recreating your process pool for every iteration; that's very expensive on some platforms, such as Windows.
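As a rough illustration of those last two points, here is a minimal sketch (not your exact code) that creates the pool once, reuses it for every dataset, and explicitly ships the environment along with each object. It assumes update_parameters returns the new environment and that update_values can accept it as an argument; the objects still travel both ways through the pipe, so it may well remain slower than the serial loop:
from multiprocessing import Pool

def process_object(args):
    env, obj = args
    obj.update_values(env)  # assumes update_values can take the environment explicitly
    return obj              # the updated copy must be sent back to the parent

if __name__ == '__main__':
    with Pool(4) as p:  # created once, reused for every dataset
        for newdataset in data_list:
            env = update_parameters(newdataset)  # assumed to return the new environment
            object_list = p.map(process_object,
                                [(env, obj) for obj in object_list])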

I would split up the parallelization a little bit differently. It's hard to tell what's happening with update_parameters, but I would parallelize that call too. Why leave it out? You could wrap the whole operation you're interested in in a single function, right?
Also, this is important: you need to make sure that you only open the pool in the main process. So add these lines:
import multiprocessing
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(multiprocessing.cpu_count()) as p:
        ...  # submit the work to the pool here
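For instance, the wrapper could look something like this (just a sketch; run_dataset is a made-up name, it parallelizes over datasets rather than over objects, and it assumes the datasets can be processed independently and that the workers can see object_list, e.g. because it was inherited via fork):
def run_dataset(newdataset):
    update_parameters(newdataset)
    for obj in object_list:
        obj.update_values()
    # return whatever you actually need from this dataset's run
    return [(obj.parameter1, obj.parameter2) for obj in object_list]

if __name__ == '__main__':
    with Pool(multiprocessing.cpu_count()) as p:
        results = p.map(run_dataset, data_list)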


Share Python dict across many processes

I am developing a heuristic algorithm to find "good" solutions for an NP (hence CPU-intensive) problem.
I am implementing my solution in Python (I agree it is not the best choice when speed is a concern, but so it is) and I am splitting the workload across many subprocesses, each one in charge of exploring a branch of the space of possible solutions.
To improve performance I would like to share some information gathered during the execution of each subprocess among all subprocesses.
The "obvious" way to gather such information is to collect it in a dictionary whose keys are (frozen)sets of integers and whose values are lists (or sets) of integers.
Hence the shared dictionary must be both readable and writable from each subprocess, but I can safely expect that reads will be far more frequent than writes, because a subprocess will write to the shared dict only when it finds something "interesting" and will read it far more often to know whether a certain solution has already been evaluated by another process (to avoid exploring the same branch twice or more).
I do not expect the size of this dictionary to exceed 10 MB.
At the moment I implemented the shared dict using an instance of multiprocessing.Manager(), which takes care of handling concurrent accesses to the shared dictionary out of the box.
However (according to what I have found), this way of sharing data is implemented using pipes between processes, which are a lot slower than plain and simple shared memory (moreover, the dictionary must be pickled before being sent through the pipe and unpickled when it is received).
So far my code looks like this:
# main.py
import multiprocessing as mp
import os

def worker(a, b, c, shared_dict):
    while condition:
        # do things
        # sometimes read from shared_dict to check if a candidate solution has
        # already been evaluated by another process; if not, evaluate it and
        # store it in shared_dict together with some related info
        ...
    return worker_result

def main():
    with mp.Manager() as manager:
        # setup params a, b, c, ...
        # ...
        shared_dict = manager.dict()
        n_processes = os.cpu_count()
        with mp.Pool(processes=n_processes) as pool:
            async_results = [pool.apply_async(worker, (a, b, c, shared_dict))
                             for _ in range(n_processes)]
            results = [res.get() for res in async_results]
        # gather the overall result from the 'results' list

if __name__ == '__main__':
    main()
To avoid the overhead due to pipes I would like to use shared memory, but it doesn't seem that the Python standard library offers a straightforward way to handle a dictionary in shared memory.
As far as I know, the Python standard library offers helpers to store data in shared memory only for standard ctypes (with multiprocessing.Value and multiprocessing.Array), or gives you access to raw areas of shared memory.
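For example, those helpers only cover things like:
from multiprocessing import Value, Array

counter = Value('i', 0)  # a single shared C int
buf = Array('d', 1000)   # a fixed-size shared array of C doubles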
I do not want to implement my own hash table in a raw area of shared memory, since I am an expert in neither hash tables nor concurrent programming; instead, I am wondering whether there are other, faster solutions for my needs that don't require writing everything from scratch.
For example, I have seen that the ray library allows reading data written to shared memory much faster than using pipes; however, it seems that you cannot modify a dictionary once it has been serialized and written to a shared memory area.
Any help?
Unfortunately, shared memory in Ray must be immutable. Typically, it is recommended that you use actors for mutable state.
You can do a couple of tricks with actors. For example, you can store object references in your dict if the values are immutable. Then the dict itself won't be in shared memory, but all of its values will be:
import ray
import numpy as np

ray.init()

@ray.remote
class DictActor:
    def __init__(self):
        self._dict = {}

    def put(self, key, value):
        self._dict[key] = ray.put(value)

    def get(self, key):
        return self._dict[key]

d = DictActor.remote()
ray.get(d.put.remote("a", np.zeros(100)))
ray.get(d.get.remote("a"))  # This result is in shared memory.

Python memory issues - memory doesn't get released after finishing a method

I have quite a complex Python (2.7 on Ubuntu) code base which is leaking memory unexpectedly. To break it down: a method is called repeatedly (and itself calls other methods) and returns a very small object. After the method finishes, the used memory is not released. As far as I know it is not unusual to reserve some memory for later use, but if I use big enough input my machine eventually consumes all of its memory and freezes. This is not the case if I run the work in a subprocess with concurrent.futures' ProcessPoolExecutor, so I have to assume it is not my code but some underlying problem?!
Is this a known issue? Might it be a problem in one of the 3rd-party libraries I am using (e.g. PyQGIS)? Where should I start searching for the problem?
Some more background to eliminate silly reasons (because I am still somewhat of a beginner):
The method uses some global variables, but in my understanding these should only be visible in the file where they are declared, and in any case should be overwritten on the next call of the method?!
To clarify in pseudocode:
def main():
    load input from file
    for x in input:
        result = extra_file.initialization(x)
        # here is the point where memory should get released, in my opinion

# extra file
def initialization(x):
    global input
    input = x
    result_container = []
    while not result:
        part_of_result = method1()
        result_container.append(part_of_result)
        if result_container fulfills condition to be the final result:
            result = result_container
    del input
    return result

def method1():
    # do stuff
    method2()
    # do stuff
    return part_of_result

def method2():
    # do stuff with input, not altering it
Numerous methods and global variables are involved, and the global declaration is used to avoid passing five or so input variables through multiple methods that don't even use them.
Should I try forcing garbage collection? All references should be gone after the method finishes, and Python itself should take care of them?
Definitely try using garbage collection. I don't believe it's a known problem.
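For example, something like this in your main loop might help (gc is the standard-library module; whether it does anything useful depends on whether the leak comes from reference cycles in your code or from memory held by a C extension such as PyQGIS):
import gc

for x in input:  # the loop from your pseudocode
    result = extra_file.initialization(x)
    gc.collect()  # force a collection of any reference cycles before the next iteration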

Using python multiprocessing.Pool without returning the result object

I have a large number of CPU-bound tasks that I want to run in parallel. Most of those tasks will return similar results, and I only need to store the unique results and count the non-unique ones.
Here's how it is currently designed: I use two managed dictionaries, one for results and another one for result counters. My tasks check those dictionaries using the unique result key for the result they found, and either write into both dictionaries or only increase the counter for a non-unique result (if I have to write, I acquire the lock and check again to avoid inconsistency).
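In code, that check looks roughly like this (simplified; compute() stands in for the real work, and the two dicts and the lock come from a Manager):
def task(args, results, counters, lock):
    key, value = compute(args)  # compute() stands in for the real work
    if key in results:
        counters[key] += 1      # non-unique result: just bump its counter
        return
    with lock:                  # re-check under the lock to avoid inconsistency
        if key not in results:
            results[key] = value
            counters[key] = 1
        else:
            counters[key] += 1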
What I am concerned about: since Pool.map actually returns a result object, even though I do not save a reference to it, the results will still pile up in memory until they are garbage collected. Even though it will just be millions of Nones (since I process results in a different manner and all my tasks return None), I cannot rely on specific garbage-collector behaviour, so the program might eventually run out of memory. I still want to keep the nice features of the pool but leave out this built-in result handling. Is my understanding correct and is my concern valid? If so, are there any alternatives?
Also, now that I've laid it out on paper it looks really clumsy :) Do you see a better way to design such a thing?
Thanks!
Question: I still want to keep nice features of the pool
Remove return result from multiprocessing.Pool.
Copy class MapResult and inherit from mp.pool.ApplyResult.
Add, replace, comment the following:
import multiprocessing as mp
from multiprocessing.pool import Pool

class MapResult(mp.pool.ApplyResult):
    def __init__(self, cache, chunksize, length, callback, error_callback):
        super().__init__(cache, callback, error_callback=error_callback)
        ...
        #self._value = [None] * length
        self._value = None
        ...

    def _set(self, i, success_result):
        ...
        if success:
            #self._value[i*self._chunksize:(i+1)*self._chunksize] = result
Create your own class myPool(Pool) that inherits from multiprocessing.Pool.
Copy def _map_async(... from multiprocessing.Pool.
Add, replace, comment the following:
class myPool(Pool):
    def __init__(self, processes=1):
        super().__init__(processes=processes)

    def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
                   error_callback=None):
        ...
        #if self._state != RUN:
        if self._state != mp.pool.RUN:
        ...
        #return result
Tested with Python: 3.4.2
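Hypothetical usage once the elided bodies above are filled in (a sketch, not tested code; worker stands in for a task that records its results elsewhere and returns None):
def worker(n):
    # CPU-bound work goes here; results are stored elsewhere, nothing is returned
    pass

if __name__ == '__main__':
    pool = myPool(processes=4)
    pool.map(worker, range(1000000))  # no per-task result list is accumulated
    pool.close()
    pool.join()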

Python multiprocess share memory vs using arguments

I'm trying to get my head around the most efficient and least memory-consuming way to share the same data source between different processes.
Imagine the following code, which simplifies my problem.
import pandas as pd
import numpy as np
from multiprocessing import Pool

# method #1
def foo(i): return data[i]

if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2)
    print pool.map(foo, [10, 134, 8, 1])

# method #2
def foo((data, i)): return data[i]

if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2)
    print pool.map(foo, [(data, 10), (data, 134), (data, 8), (data, 1)])
The first method uses a global variable (this won't work on Windows, only on Linux/OSX), which is then accessed by the function. In the second method I'm passing "data" as part of the arguments.
In terms of memory used during the process, will there be a difference between the two methods?
# method #3
def foo((data, i)): return data[i]

if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2)
    # reduce the size of the argument passed
    data1 = data[:1000]
    print pool.map(foo, [(data1, 10), (data1, 134), (data1, 8), (data1, 1)])
In a third method, rather than passing all of "data", since we know we'll only be using the first records, I pass only the first 1000 records. Will this make any difference?
Background
The problem I'm facing: I have a big dataset of about 2 million rows (4 GB in memory) which will then be processed by four subprocesses to do some elaboration. Each elaboration affects only a small portion of the data (20000 rows), and I'd like to minimize the memory used by each concurrent process.
I'm going to start with the second and third methods, because they're easier to explain.
When you pass the arguments to pool.map or pool.apply, the arguments will be pickled, sent to the child process using a pipe, and then unpickled in the child. This of course requires two completely distinct copies of the data structures you're passing. It also can lead to slow performance with large data structures, since pickling/unpickling large objects can take quite a while.
With the third method, you're just passing smaller data structures than method two. This should perform better, since you don't need to pickle/unpickle as much data.
One other note - passing data multiple times is definitely a bad idea, because each copy will be getting pickled/unpickled repeatedly. You want to pass it to each child once. Method 1 is a good way to do that, or you can use the initializer keyword argument to explicitly pass data to the child. This will use fork on Linux and pickling on Windows to pass data to the child process:
import pandas as pd
import numpy as np
from multiprocessing import Pool

data = None

def init(_data):
    global data
    data = _data  # data is now accessible in all children, even on Windows

# method #1
def foo(i): return data[i]

if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2, initializer=init, initargs=(data,))
    print pool.map(foo, [10, 134, 8, 1])
Using the first method, you're leveraging the behavior of fork to allow the child process to inherit the data object. fork has copy-on-write semantics, which means that the memory is actually shared between the parent and its children, until you try to write to it in the child. When you try to write, the memory page that the data you're trying to write is contained in must be copied, to keep it separate from the parent version.
Now, this sounds like a slam dunk - no need to copy anything as long as we don't write to it, which is surely faster than the pickle/unpickle method. And that's usually the case. However, in practice, Python is internally writing to its objects, even when you wouldn't really expect it to. Because Python uses reference counting for memory management, it needs to increment the internal reference counter on each object every time it's passed to a method, assigned to a variable, etc. So the memory page containing the reference count of each object passed to your child process will end up getting copied. This will definitely be faster and use less memory than pickling the data multiple times, but it isn't quite completely shared, either.

Python multiprocessing and a list of variables to be passed to a function

Okay, so I've never used the Python multiprocessing library, and I don't really know how to word my search. I read the docs for the library, and I tried searching for examples of my problem, but I couldn't find anything.
I have a list of file names (~2400), a dictionary (called cond, which is a global), and a function. I want to run my function on each processor, and each time the function runs it uses one of the file names as its argument. So I want 4 processes running, one per processor, working through the list: when one call ends, it carries on to the next item in the list, and each of those calls updates a single shared dictionary.
Pseudo-function code:
def PSC(fnom):
    f = open(fnom, "r")
    r = xml.dom.minidom.parse(f)
    cond[fnom] = otherfunc(r)
    f.close()
So, a) is it possible to use multiprocessing on this function, b) if it is, which method from the multiprocessing library can handle it, and c) if you're extra nice, how do I iterate through a list, passing each item as an arg each time?
Musings about the way it would work (pseudo bulls*** code):
if __name__ == "__main__":
    name_list = name_list_func()
    method = multiprocessing.[method]()           # no idea what method
    method.something(target=PSC, iter=name_list)  # no idea either
This is easy, except for the "single shared dictionary" part. Processes don't share memory. That's a lie, but it's one you should believe at first ;-) I'm going to keep the dict in the main program here, because that's far more efficient than any actual way of sharing the dict across processes:
NUM_CPUS = None  # defaults to all available cores

def PSC(fnom):
    return fnom, len(fnom)

if __name__ == "__main__":
    import multiprocessing as mp
    pool = mp.Pool(NUM_CPUS)
    list_of_strings = list("abcdefghijklm")
    cond = {}
    for fnom, result in pool.imap_unordered(PSC, list_of_strings):
        cond[fnom] = result
    pool.close()
    pool.join()
    print cond
That's code you can actually run. Plugging in file-opening cruft, XML parsing, etc, doesn't change any of what you need to get the multiprocessing part working right.
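For example, the real PSC from the question would just become something like this (a sketch; otherfunc is the question's own processing function):
import xml.dom.minidom

def PSC(fnom):
    with open(fnom, "r") as f:
        r = xml.dom.minidom.parse(f)
    return fnom, otherfunc(r)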
In current Python 3, this can be made a little simpler. The code here is for Python 2.
Note that instead of imap_unordered(), you could also use imap() or map(). imap_unordered() gives the implementation the most freedom to arrange things as efficiently as possible, although so far the implementation isn't really smart enough to take much advantage of that. Looking ahead ;-)
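For what it's worth, a rough Python 3 version of the same loop might look like this (just a sketch):
import multiprocessing as mp

NUM_CPUS = None  # defaults to all available cores

def PSC(fnom):
    return fnom, len(fnom)

if __name__ == "__main__":
    with mp.Pool(NUM_CPUS) as pool:
        cond = dict(pool.imap_unordered(PSC, list("abcdefghijklm")))
    print(cond)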
