I want to process a large for loop in parallel, and from what I have read the best way to do this is to use the multiprocessing library that comes standard with Python.
I have a list of around 40,000 objects, and I want to process them in parallel in a separate class. The reason for doing this in a separate class is mainly because of what I read here.
In one class I have all the objects in a list, and via the multiprocessing.Pool and Pool.map functions I want to carry out a parallel computation for each object by passing it through another class and returning a value.
# ... some class that generates the list_objects
pool = multiprocessing.Pool(4)
results = pool.map(Parallel, self.list_objects)
And then I have a class which I want to process each object passed by the pool.map function:
class Parallel(object):
    def __init__(self, args):
        self.some_variable = args[0]
        self.some_other_variable = args[1]
        self.yet_another_variable = args[2]
        self.result = None

    def __call__(self):
        self.result = self.calculate(self.some_variable)
The reason I have a __call__ method is the post I linked above, yet I'm not sure I'm using it correctly, as it seems to have no effect: the self.result value is never generated.
Any suggestions?
Thanks!
Use a plain function, not a class, when possible. Use a class only when there is a clear advantage to doing so.
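For example, a minimal sketch of the plain-function version (process and calculate are illustrative names, not from your code):

def process(args):
    # Unpack the per-object arguments, mirroring Parallel.__init__
    some_variable, some_other_variable, yet_another_variable = args
    return calculate(some_variable)  # calculate is assumed to exist at module level

pool = multiprocessing.Pool(4)
results = pool.map(process, list_objects)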
If you really need to use a class, then given your setup, pass an instance of Parallel:
results = pool.map(Parallel(args), self.list_objects)
Since the instance has a __call__ method, the instance itself is callable, like a function.
By the way, the __call__ needs to accept an additional argument:
def __call__(self, val):
since pool.map is essentially going to call in parallel
p = Parallel(args)
result = []
for val in self.list_objects:
    result.append(p(val))
Pool.map simply applies a function (actually, any callable) in parallel. It has no notion of objects or classes. Since you pass it a class, it simply calls __init__ - __call__ is never executed. You need to either call it explicitly from __init__ or use pool.map(Parallel.__call__, preinitialized_objects).
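To make the callable-instance pattern concrete, here is a minimal, self-contained sketch (the arithmetic is just a stand-in for your real calculation):

import multiprocessing

class Parallel(object):
    def __init__(self, offset):
        # Configuration shared by all work items is captured at construction time
        self.offset = offset

    def __call__(self, val):
        # pool.map calls this once per item in the iterable
        return self.offset + val * val

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    results = pool.map(Parallel(10), [1, 2, 3, 4])
    print(results)  # [11, 14, 19, 26]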
In my code, I use multiprocessing.Pool to run some code concurrently. Simplified code looks somewhat like this:
class Wrapper():
    session: Session

    def __init__(self):
        self.session = requests.Session()
        # Session initialization

    def upload_documents(self, documents):
        with Pool(4) as pool:
            upload_file = partial(self.upload_document)
            pool.starmap(upload_file, documents)
        summary = create_summary(documents)
        self.upload_document(summary)

    def upload_document(self, doc):
        self.post(doc)

    def post(self, data):
        self.session.post(self.url, data, other_params)
So basically sending documents via HTTP is parallelized. Now I want to test this code, and can't do it. This is my test:
@patch.object(Session, 'post')
def test_study_upload(self, post_mock):
    response_mock = Mock()
    post_mock.return_value = response_mock
    response_mock.ok = True
    with Wrapper() as wrapper:
        wrapper.upload_documents(documents)
    mc = post_mock.mock_calls
And in debug I can check the mock calls. There is one that looks valid, and it's the one uploading the summary, and a bunch of calls like call.json(), call.__len__(), call.__str__() etc.
There are no calls uploading the documents. When I set a breakpoint in the upload_document method, I can see it is called once for each document, and it works as expected. However, I can't verify this behavior with the mock. I assume it's because there are multiple processes calling the same mock, but still - how can I solve this?
I use Python 3.6
The approach I would take here is to keep your test as granular as possible and mock out other calls. In this case you'd want to mock your Pool object and verify that it's calling what you're expecting, not actually rely on it to spin up child processes during your test. Here's what I'm thinking:
@patch('yourmodule.Pool')
def test_study_upload(self, mock_pool_init):
    mock_pool_instance = mock_pool_init.return_value.__enter__.return_value
    with Wrapper() as wrapper:
        wrapper.upload_documents(documents)
    # To get the upload_file arg here, you'll need to either mock the partial call,
    # or actually call it and get the return value
    mock_pool_instance.starmap.assert_called_once_with(upload_file, documents)
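For instance, one way to pin down the upload_file argument is to also patch partial in the module under test (an untested sketch; it assumes yourmodule imports both Pool and partial):

@patch('yourmodule.partial')
@patch('yourmodule.Pool')
def test_study_upload(self, mock_pool_init, mock_partial):
    mock_pool_instance = mock_pool_init.return_value.__enter__.return_value
    with Wrapper() as wrapper:
        wrapper.upload_documents(documents)
    # starmap should receive whatever partial() produced, plus the documents
    mock_pool_instance.starmap.assert_called_once_with(mock_partial.return_value, documents)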
Then you'd want to take your existing logic and test your upload_document function separately:
@patch.object(Session, 'post')
def test_upload_file(self, post_mock):
    response_mock = Mock()
    post_mock.return_value = response_mock
    response_mock.ok = True
    with Wrapper() as wrapper:
        wrapper.upload_document(document)
    mc = post_mock.mock_calls
This gives you coverage both on the function that's creating and controlling your pool, and on the function being called by the pool instance. One caveat: I didn't test this, and I'm leaving some of it for you to fill in, since it looks like an abbreviated version of the actual module in your original question.
EDIT:
Try this:
def test_study_upload(self):
    def call_direct(pool, func_var, documents):
        # The patched function is bound as a method, so 'pool' receives the Pool instance.
        # Mimic starmap in-process: apply the function to each args tuple.
        return [func_var(*args) for args in documents]

    with patch('yourmodule.Pool.starmap', new=call_direct):
        with Wrapper() as wrapper:
            wrapper.upload_documents(documents)
This patches out the starmap call so that it invokes the function you pass in directly, in the parent process. It circumvents the Pool entirely; the bottom line is that you can't really reach into the subprocesses created by multiprocessing - each worker gets its own copy of the mock, so the call records never make it back to the parent process.
I have a method like this in Python:
def test(a, b):
    return a + b, a - b
How can I run this in a background thread and wait until the function returns?
The problem is that the method is pretty big and the project involves a GUI, so I can't just block until it returns.
In my opinion, you should run another thread alongside this one that checks whether there is a result yet. Or implement a callback that is called at the end of the worker thread. However, since you have a GUI - which as far as I know is simply a class - you can store the result in an object/class variable and check whether it has arrived.
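A minimal sketch of the callback option (run_in_background is a hypothetical helper, not a library function):

import threading

def test(a, b):
    return a + b, a - b

def run_in_background(func, args, callback):
    # Run func in a daemon thread and hand its result to callback when done
    def worker():
        callback(func(*args))
    th = threading.Thread(target=worker)
    th.daemon = True
    th.start()
    return th

# In a real GUI, the callback should re-post the result to the GUI event loop
run_in_background(test, (1, 2), lambda result: print(result))  # prints (3, -1)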
Personally, I would use a mutable variable, a technique that is sometimes used for this. Let's create a special class that will be used for storing results from thread functions.
import threading
import time

class ResultContainer:
    results = []  # Mutable - anything inside this list will be accessible anywhere in your program

# Let's use a decorator with an argument
# This way it won't break your function
def save_result(cls):
    def decorator(func):
        def wrapper(*args, **kwargs):
            # Get the result from the function
            func_result = func(*args, **kwargs)
            # Pass the result into the mutable list in our ResultContainer class
            cls.results.append(func_result)
            # Return the result from the function
            return func_result
        return wrapper
    return decorator

# As argument to the decorator, add the class with the mutable list
@save_result(ResultContainer)
def func(a, b):
    time.sleep(3)
    return a, b

th = threading.Thread(target=func, args=(1, 2))
th.daemon = True
th.start()

while not ResultContainer.results:
    time.sleep(1)

print(ResultContainer.results)
So, in this code, we have the class ResultContainer with a list. Whatever you put into it, you can easily access from anywhere in the code (between threads and so on; the exception is between processes, since each process gets its own copy of the memory). I made a decorator so you can store the result from any function without modifying the function itself. This is just an example of how you can run threads and leave it to them to store their results, without you having to take care of it. All you have to do is check whether the result has arrived.
You can use global variables to do the same thing, but I don't advise it: they are ugly, and you have to be very careful when using them.
For even more simplicity, if you don't mind modifying your function, you can skip the decorator and push the result into the class's list directly in the function, like this:
def func(a, b):
    time.sleep(3)
    ResultContainer.results.append((a, b))  # tuple(a, b) would raise a TypeError; use a tuple literal
    return a, b
This question is related to these other posts on SO, yet the solutions suggested therein do not seem to work for my case. In short, my problem can be illustrated by the following example. I have an Algebra class whose method triPower aims at computing the power of a trinomial, i.e. (a+b+c)**n, for many n values with fixed a, b, c. To do so, I created a method _triPower(a,b,c,n) and pass it to my pool.map() function via functools.partial(_triPower,...), where I fix a, b, c and leave n as the only parameter, since I am working in Python 2.7 and map from the multiprocessing module wants a one-argument function (see otherwise this post). The code is the following:
from __future__ import division
import numpy as np
import functools as fntls
import multiprocessing as mp
import multiprocessing.pool as mppl

# A couple of classes introduced to allow multiple processes to have their own daemons (parallelization)
class NoDaemonProcess(mp.Process):
    # make 'daemon' attribute always return False
    def _get_daemon(self):
        return False

    def _set_daemon(self, value):
        pass

    daemon = property(_get_daemon, _set_daemon)

# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class MyPool(mppl.Pool):
    Process = NoDaemonProcess

# Sample class where I want a method to run a parallel loop
class Algebra(object):
    def __init__(self, offset):
        self.offset = offset

    def trinomial(self, a, b, c):
        return a + b + c

    def _triPower(self, a, b, c, n):
        """This is the method that I want to run in parallel from the next method"""
        return self.offset + self.trinomial(a, b, c)**n

    def triPower(self, n):
        pls = MyPool(4)
        vals = pls.map(fntls.partial(self._triPower, a=1., b=0., c=1.), n)
        print vals

# Testing
if __name__ == "__main__":
    A = Algebra(0.)
    A.triPower(np.arange(0., 10., 0.01))
The above does not work and produces (as expected from this post) the error:
cPickle.PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
Hence, following the same post, I tried to define _triPower as a global function i.e.
def _triPower(alg, a, b, c, n):
    """This is the function that I want to run in parallel from the triPower method"""
    return alg.trinomial(a, b, c)**n
and then editing Algebra.triPower(...) according to:
def triPower(self, n):
    pls = MyPool(4)
    vals = pls.map(fntls.partial(_triPower, alg=self, a=1., b=0., c=1.), n)
    print vals
and this latter instead gives some weird TypeError like:
TypeError: _triPower() got multiple values for keyword argument 'alg'
On the other hand, the suggestion to make the methods serializable via VeryPicklableObject, as in this other post, does not seem to work either, and that package appears dead by now (as of 05/2019). So what am I doing wrong, and how can I make my computation run in parallel?
util.py
import concurrent.futures

def exec_multiprocessing(self, method, args):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(method, args)  # was pool.map, but the executor is named 'executor'
        return results
clone.py
def clone_vm(self, name, first_run, host, ip):
    # clone stuff
invoke.py
exec_args = [(name, first_run, host, ip) for host, ip in zip(hosts, ips)]
results = self.util.exec_multiprocessing(self.clone.clone_vm, exec_args)
The above code gives the pickling error. I found that it is because we are passing an instance method, so we should unwrap it, but I am not able to make that work.
Note: I cannot create a top-level function to avoid this. I have to use instance methods.
Let's start with an overview - why the error came up in the first place:
multiprocessing must pickle (serialize) data to pass it between processes or threads. To be specific, the pool methods rely on a queue at the lower level to hand tasks to the worker threads/processes, and everything that goes through the queue must be picklable.
The problem is that not all objects are picklable (see the list of picklable types in the docs), and when one tries to pickle an unpicklable object, a PicklingError exception is raised - exactly what happened in your case: you passed an instance method, which is not picklable.
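You can reproduce the failure in isolation (Python 2; this raises the same error as in your traceback):

import pickle

class A(object):
    def m(self):
        return 1

pickle.dumps(A().m)
# PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed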
There can be various workarounds (as is the case with every problem). The solution that worked for me, by Dano, is to teach pickle how to handle methods and register that handler with copy_reg.
Add the following lines at the start of your module clone.py to make clone_vm picklable (note the copy_reg and types imports):
import copy_reg
import types

def _pickle_method(m):
    if m.im_self is None:
        return getattr, (m.im_class, m.im_func.func_name)
    else:
        return getattr, (m.im_self, m.im_func.func_name)

copy_reg.pickle(types.MethodType, _pickle_method)
Other useful answers: by Alex Martelli, mrule, and unutbu.
You need to add support for pickling functions and methods for this to work, as pointed out by Nabeel Ahmed. But his solution won't work with name-mangled methods:
import copy_reg
import types

def _pickle_method(method):
    attached_object = method.im_self or method.im_class
    func_name = method.im_func.func_name

    if func_name.startswith('__'):
        func_name = filter(lambda method_name: method_name.startswith('_') and method_name.endswith(func_name), dir(attached_object))[0]

    return (getattr, (attached_object, func_name))

copy_reg.pickle(types.MethodType, _pickle_method)
This works for name-mangled methods as well. For it to work, you need to ensure this code always runs before any pickling happens. The ideal place is a settings file (if you are using Django) or some package that is always imported before the rest of the code executes.
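As a quick sanity check (illustrative, Python 2): once the handler above is registered, a name-mangled bound method survives a pickling round trip.

import pickle

class Foo(object):
    def __bar(self):
        return 42

f = Foo()
m = pickle.loads(pickle.dumps(getattr(f, '_Foo__bar')))
print m()  # 42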
Credits:- Steven Bethard (https://bethard.cis.uab.edu/)
I've got a question about defining functions and the self parameter in Python.
Here is the code:
class Dictionaries(object):
    __CSVDescription = ["ID", "States", "FilterTime", "Reaction", "DTC", "ActiveDischarge"]

    def __makeDict(Lst):
        return dict(zip(Lst, range(len(Lst))))

    def getDict(self):
        return self.__makeDict(self.__CSVDescription)

    CSVDescription = __makeDict(__CSVDescription)

x = Dictionaries()
print x.CSVDescription
print x.getDict()
x.CSVDescription works fine. But print x.getDict() returns an error.
TypeError: __makeDict() takes exactly 1 argument (2 given)
I can add the self parameter to the __makeDict() method, but then print x.CSVDescription wouldn't work.
How do I use the self-parameter correctly?
In Python, the self parameter is implicitly passed to instance methods, unless the method is decorated with @staticmethod.
In this case, __makeDict doesn't need a reference to the object itself, so it can be made a static method so you can omit the self:
@staticmethod
def __makeDict(Lst): # ...

def getDict(self):
    return self.__makeDict(self.__CSVDescription)
A solution using #staticmethod won't work here because calling the method from the class body itself doesn't invoke the descriptor protocol (this would also be a problem for normal methods if they were descriptors - but that isn't the case until after the class definition has been compiled). There are four major options here - but most of them could be seen as some level of code obfuscation, and would really need a comment to answer the question "why not just use a staticmethod?".
The first is, as #Marcus suggests, to always call the method from the class, not from an instance. That is, every time you would do self.__makeDict, do self.__class__.__makeDict instead. This will look strange, because it is a strange thing to do - in Python, you almost never need to call a method as Class.method, and the only time you do (in code written before super became available), using self.__class__ would be wrong.
In similar vein, but the other way around, you could make it a staticmethod and invoke the descriptor protocol manually in the class body - do: __makeDict.__get__(None, Dictionaries)(__lst).
Or, you could detect for yourself what context it's being called from by getting fancy with optional arguments:
def __makeDict(self, Lst=None):
    if Lst is None:
        Lst = self
    ...
But, by far the best way is to realise you're working in Python and not Java - put it outside the class.
def _makeDict(Lst):
    ...

class Dictionaries(object):
    __CSVDescription = ["ID", "States", "FilterTime", "Reaction", "DTC", "ActiveDischarge"]  # restored from the question

    def getDict(self):
        return _makeDict(self.__CSVDescription)

    CSVDescription = _makeDict(__CSVDescription)