I have several multiprocessing.Processes and would like them to consume (queue get()) callable non-picklable objects and call them. These were created before the fork(), so they shouldn't need pickling.
Using multiprocessing.Queue doesn't work as it tries to pickle everything:
import multiprocessing as mp

# create non-global callable to make it unpicklable
def make_callable():
    def foo():
        print("running foo")
    return foo

def bar():
    print("running bar")

def runall(q):
    while True:
        c = q.get()
        if c is None:
            break
        c()

if __name__ == '__main__':
    q = mp.Queue()
    call = make_callable()
    p = mp.Process(target=runall, args=(q,))
    p.start()
    q.put(bar)
    q.put(call)
    q.put(None)
    p.join()
running bar
Traceback (most recent call last):
File "/usr/lib64/python3.7/multiprocessing/queues.py", line 236, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib64/python3.7/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'make_callable.<locals>.foo'
An implementation equivalent would be putting all objects into a global (or passed) list and passing just indexes, which works:
import multiprocessing as mp

# create non-global callable to make it unpicklable
def make_callable():
    def foo():
        print("running foo")
    return foo

def bar():
    print("running bar")

def runall(q, everything):
    while True:
        c = q.get()
        if c is None:
            break
        everything[c]()

if __name__ == '__main__':
    q = mp.Queue()
    call = make_callable()
    everything = [bar, call]
    p = mp.Process(target=runall, args=(q, everything))
    p.start()
    q.put(0)
    q.put(1)
    q.put(None)
    p.join()
running bar
running foo
The problem is that while I know that none of the callables passed will be garbage collected (and thus their addresses will stay valid), I do not have the full list beforehand.
I also know I could probably use multiprocessing.Manager and its Queue implementation using a Proxy object, but this seems like a lot of overhead, especially as in the real implementation I would be passing other picklable data as well.
Is there a way to pickle and pass only the address reference to an object, shared across multiple processes?
Thanks!
True: objects sent to a Process through a multiprocessing.Queue must be picklable.
Note that functions (built-in and user-defined) are pickled by “fully
qualified” name reference, not by value. This means that only the
function name is pickled, along with the name of the module the
function is defined in. Neither the function’s code, nor any of its
function attributes are pickled. Thus the defining module must be
importable in the unpickling environment, and the module must contain
the named object, otherwise an exception will be raised.
Picklable functions and classes must be defined in the top level of a module.
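The "by name reference" behaviour quoted above can be checked directly with pickle, without starting any process. This is only a small sketch: json.dumps is used as an arbitrary top-level function, and try_pickle is a helper name made up for the illustration.

```python
import json
import pickle

def make_callable():
    def foo():  # local function: has no importable name
        return "running foo"
    return foo

def try_pickle(obj):
    """Return 'ok' if obj can be pickled, else the exception class name."""
    try:
        pickle.dumps(obj)
        return "ok"
    except (pickle.PicklingError, AttributeError) as exc:
        return type(exc).__name__

# A top-level function is stored as module name + function name only.
payload = pickle.dumps(json.dumps)
restored = pickle.loads(payload)  # looked up again by name on unpickling

print(b"json" in payload, restored is json.dumps)
print(try_pickle(make_callable()))  # fails: the nested function cannot be found by name
```

Because only the name travels, the receiving process has to be able to import the same module and find the same attribute, which is exactly what a function created inside make_callable() cannot offer.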
So in your case you need to proceed with passing top-level callables but applying additional checks/workarounds in the crucial runall function:
import multiprocessing as mp

# create non-global callable to make it unpicklable
def make_callable():
    def foo():
        print("running foo")
    return foo

def bar():
    print("running bar")

def runall(q):
    while True:
        c = q.get()
        if c is None:
            break
        res = c()
        if callable(res): res()

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=runall, args=(q,))
    p.start()
    q.put(bar)
    q.put(make_callable)
    q.put(None)
    p.join()
    q.close()
The output:
running bar
running foo
After a bit of thinking and searching, I believe I have the answer I was looking for, mostly from: Get object by id()?.
I could pass an id() of the callable and then translate it back in the spawned process:
import ctypes
a = "hello world"
print(ctypes.cast(id(a), ctypes.py_object).value)
Or use the gc module and, as long as I keep a reference to the object alive, that should work too:
import gc

def objects_by_id(id_):
    for obj in gc.get_objects():
        if id(obj) == id_:
            return obj
    raise Exception("Not found")
However, neither of these is very clean and, in the end, it may be worth imposing the limitation of creating all the callables first and just passing indexes.
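A slightly more flexible variant of the "pass indexes" idea is a module-level registry that hands out small picklable keys. The sketch below is only an illustration: REGISTRY, register and run_demo are made-up names, the fork start method is assumed (POSIX only), and every callable still has to be registered before the worker is started, because the child only sees what existed at fork time.

```python
import multiprocessing as mp

# Hypothetical registry, filled in the parent *before* the worker is forked.
REGISTRY = {}

def register(fn):
    """Store fn under a small picklable key and return that key."""
    key = len(REGISTRY)
    REGISTRY[key] = fn
    return key

def make_callable():
    def foo():  # local function: cannot be pickled
        return "running foo"
    return foo

def worker(task_q, result_q):
    # The forked child inherited REGISTRY, so a key is enough to find the callable.
    while True:
        key = task_q.get()
        if key is None:
            break
        result_q.put(REGISTRY[key]())

def run_demo():
    ctx = mp.get_context("fork")  # assumption: a platform where fork is available
    key = register(make_callable())  # must happen before p.start()
    task_q, result_q = ctx.Queue(), ctx.Queue()
    p = ctx.Process(target=worker, args=(task_q, result_q))
    p.start()
    task_q.put(key)  # only the integer key is pickled
    task_q.put(None)
    result = result_q.get(timeout=10)
    p.join()
    return result
```

Calling run_demo() returns the string produced by foo in the child. Callables registered after the fork would not be visible to the worker, so this does not lift the "know everything beforehand" restriction; it only makes the bookkeeping less fragile than raw list positions or id() values.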
In a custom class I have the following code:
class CustomClass():
    triggerQueue: multiprocessing.Queue

    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    def poolFunc(queueString):
        print(queueString)

    def listenerFunc(self):
        pool = multiprocessing.Pool(5)
        while True:
            try:
                queueString = self.triggerQueue.get_nowait()
                pool.apply_async(func=self.poolFunc, args=(queueString,))
            except queue.Empty:
                break
What I intend to do is:
add a trigger to the queue (not implemented in this snippet) -> works as intended
run an endless loop within the listenerFunc that reads all triggers from the queue (if any are found) -> works as intended
pass trigger to poolFunc which is to be executed asynchronously -> not working
It works as soon as I move poolFunc() outside of the class, like this:
def poolFunc(queueString):
    print(queueString)

class CustomClass():
    [...]
But why is that so? Do I have to pass the self argument somehow? Is it impossible to perform it this way in general?
Thank you for any hint!
There are several problems going on here.
Your instance method, poolFunc, is missing a self parameter.
You are never properly terminating the Pool. You should take advantage of the fact that a multiprocessing.Pool object is a context manager.
You're calling apply_async, but you're never waiting for the results. Read the documentation: you need to call the get method on the AsyncResult object to receive the result; if you don't do this before your program exits your poolFunc function may never run.
By making the Queue object part of your class, you won't be able to pass instance methods to workers.
We can fix all of the above like this:
import multiprocessing
import queue

triggerQueue = multiprocessing.Queue()

class CustomClass:
    def poolFunc(self, queueString):
        print(queueString)

    def listenerFunc(self):
        results = []
        with multiprocessing.Pool(5) as pool:
            while True:
                try:
                    queueString = triggerQueue.get_nowait()
                    results.append(pool.apply_async(self.poolFunc, (queueString,)))
                except queue.Empty:
                    break
            for res in results:
                print(res.get())

c = CustomClass()
for i in range(10):
    triggerQueue.put(f"testval{i}")
c.listenerFunc()
You can, as you mention, also replace your instance method with a static method, in which case we can keep triggerQueue as part of the class:
import multiprocessing
import queue

class CustomClass:
    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    @staticmethod
    def poolFunc(queueString):
        print(queueString)

    def listenerFunc(self):
        results = []
        with multiprocessing.Pool(5) as pool:
            while True:
                try:
                    queueString = self.triggerQueue.get_nowait()
                    results.append(pool.apply_async(self.poolFunc, (queueString,)))
                except queue.Empty:
                    break
            for r in results:
                print(r.get())

c = CustomClass()
for i in range(10):
    c.triggerQueue.put(f"testval{i}")
c.listenerFunc()
But we still need to collect the apply_async results.
Okay, I found an answer and a workaround:
The answer is based on noxdafox's answer to this question.
Instance methods cannot be serialized that easily. What the Pickle protocol does when serialising a function is simply turning it into a string.
For a child process it would be quite hard to find the object your instance method refers to, because the processes have separate address spaces.
A functioning workaround is to declare the poolFunc() as static function like
@staticmethod
def poolFunc(queueString):
    print(queueString)
I need to run the same function 10 times that for reasons of data linked to a login, it needs to be inside another function:
from multiprocessing import Pool

def main():
    def inside(a):
        print(a)
    Pool.map(inside, 'Ok' * 10)

if __name__ == '__main__':
    main()

from multiprocessing import Pool

def main():
    def inside(a):
        print(a)
    Pool.map(main.inside, 'Ok' * 10)

if __name__ == '__main__':
    main()
In both attempts the result is this:
AttributeError: 'function' object has no attribute 'map'
How can I do this by keeping the function inside the other function?
Is there a way to do this?
AttributeError: 'function' object has no attribute 'map'
We need to instantiate Pool from multiprocessing and call map method of that pool object.
You have to move the inside function into a class (or to module level), because Pool uses pickle to serialize and deserialize callables, and a function nested inside another function cannot be imported by pickle.
Pool needs to pickle (serialize) everything it sends to its
worker-processes (IPC). Pickling actually only saves the name of a
function and unpickling requires re-importing the function by name.
For that to work, the function needs to be defined at the top-level,
nested functions won't be importable by the child and already trying
to pickle them raises an exception (more).
Please visit this link of SO.
from multiprocessing import Pool

class Wrap:
    def inside(self, a):
        print(a)

def main():
    pool = Pool()
    pool.map(Wrap().inside, 'Ok' * 10)

if __name__ == '__main__':
    main()
If you don't want to wrap the inside method in a class, move it to global scope so it can be pickled:
from multiprocessing import Pool

def inside(a):
    print(a)

def main():
    with Pool() as pool:
        pool.map(inside, 'Ok'*10)

if __name__ == '__main__':
    main()
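If the reason for nesting was that inside needs data that only exists within main (the login data mentioned in the question), one option is to keep the function at module level and bind that data with functools.partial, which pickles fine as long as the bound values do. A rough sketch, where token stands for whatever login data you have and the fork context is an assumption made so the snippet can be called from anywhere on a POSIX system:

```python
import multiprocessing as mp
from functools import partial

def inside(token, item):
    # Top-level function, so pickle can find it by name.
    # token is a hypothetical piece of login data bound by the caller.
    return "%s:%s" % (token, item)

def main(token, items):
    ctx = mp.get_context("fork")  # assumption: POSIX; a plain Pool() under a __main__ guard also works
    with ctx.Pool(2) as pool:
        # partial(inside, token) is pickled as "inside" plus the bound argument.
        return pool.map(partial(inside, token), items)
```

Calling main("session-42", ["Ok", "Ok"]) returns one string per item, in input order. The bound value itself must be picklable; a live connection or session object generally is not, so pass credentials or identifiers rather than the open object.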
To start with, here is some code that works
from multiprocessing import Pool, Manager
import random

manager = Manager()
dct = manager.dict()

def do_thing(n):
    for i in range(10_000_000):
        i += 1
    dct[n] = random.randint(0, 9)

with Pool(2) as pool:
    pool.map(do_thing, range(10))
Now if I try to make a class out of this:
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        self.manager = Manager()
        self.dct = self.manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
I run into: TypeError: Pickling an AuthenticationString object is disallowed for security reasons. Now from here, I get the hint that Python is trying to pickle the Manager which as I understand has its own dedicated process, and processes can't be pickled because they contain an AuthenticationString.
I don't know enough about how forking works (I'm on Linux, so I understand this is the default method for starting new processes) to understand exactly why the Manager instance needs to be pickled.
So here are my questions:
Why is this happening?
How can I use a Manager when doing multiprocessing within a class? PS: I want to be able to import SomeClass from this module.
Is what I'm asking for unreasonable or unconventional?
PS: I know I can do this exact snippet without the Manager by exploiting the fact that pool.map will return things in order, so something like this: res = pool.map(self.do_thing, range(10)) then dct = {k: v for k, v in zip(range(10), res)}. But that's besides the point of the question.
To answer your questions:
Q1 - Why is this happening?
Each worker process created by the Pool.map() needs to execute the instance method self.do_thing(). In order to do that Python pickles the instance and passes it to the subprocess (which unpickles it). If each instance has a Manager it will be a problem because they're not pickleable. Part of the unpickling process involves importing the module that defines the class and restoring the instance's attributes (which were also pickled).
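You can see the "pickling a bound method pickles the whole instance" effect without a Pool. The sketch below uses a threading.Lock as a lightweight stand-in for the Manager (both refuse to be pickled); Holder, SlimHolder and can_pickle are names invented for the illustration. It also shows one possible alternative to a class-level manager: a __getstate__ that leaves the offending attribute out of the pickled state.

```python
import pickle
import threading

class Holder:
    def __init__(self):
        self.lock = threading.Lock()  # stand-in for an unpicklable attribute such as a Manager
        self.values = {"a": 1}

    def do_thing(self, n):
        return n + self.values["a"]

class SlimHolder(Holder):
    def __getstate__(self):
        # Copy the instance state and drop what cannot cross the process boundary.
        state = self.__dict__.copy()
        del state["lock"]
        return state

def can_pickle(obj):
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

print(can_pickle(Holder().do_thing))      # the bound method drags self.lock along
print(can_pickle(SlimHolder().do_thing))  # the lock is excluded from the state
```

The copy that arrives in the worker then simply has no lock attribute, so this only helps when the worker-side method does not need the excluded object.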
Q2 - How to fix it
You can avoid the problem by having the class create its own class-level Manager (shared by all instances of the class). Here the __init__() method creates the manager class attribute the first time an instance is created and from that point on, further instances will reuse this — it's sometimes called "lazy initialization"
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        # Lazy creation of class attribute.
        try:
            manager = getattr(type(self), 'manager')
        except AttributeError:
            manager = type(self).manager = Manager()
        self.dct = manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))
        print('done')

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
Q3 - Is this a reasonable thing to do?
In my opinion, yes.
I am trying to return values from subprocesses but these values are unfortunately unpicklable. So I used global variables in threads module with success but have not been able to retrieve updates done in subprocesses when using multiprocessing module. I hope I'm missing something.
The results printed at the end are always the same as initial values given the vars dataDV03 and dataDV04. The subprocesses are updating these global variables but these global variables remain unchanged in the parent.
import multiprocessing

# NOT ABLE to get python to return values in passed variables.

ants = ['DV03', 'DV04']
dataDV03 = ['', '']
dataDV04 = {'driver': '', 'status': ''}

def getDV03CclDrivers(lib):  # call global variable
    global dataDV03
    dataDV03[1] = 1
    dataDV03[0] = 0
    # eval( 'CCL.' + lib + '.' + lib + '( "DV03" )' ) these are unpicklable instantiations

def getDV04CclDrivers(lib, dataDV04):  # pass global variable
    dataDV04['driver'] = 0  # eval( 'CCL.' + lib + '.' + lib + '( "DV04" )' )

if __name__ == "__main__":
    jobs = []
    if 'DV03' in ants:
        j = multiprocessing.Process(target=getDV03CclDrivers, args=('LORR',))
        jobs.append(j)

    if 'DV04' in ants:
        j = multiprocessing.Process(target=getDV04CclDrivers, args=('LORR', dataDV04))
        jobs.append(j)

    for j in jobs:
        j.start()

    for j in jobs:
        j.join()

    print 'Results:\n'
    print 'DV03', dataDV03
    print 'DV04', dataDV04
I cannot post to my question so will try to edit the original.
Here is the object that is not picklable:
In [1]: from CCL import LORR
In [2]: lorr=LORR.LORR('DV20', None)
In [3]: lorr
Out[3]: <CCL.LORR.LORR instance at 0x94b188c>
This is the error returned when I use a multiprocessing.Pool to return the instance back to the parent:
Thread getCcl (('DV20', 'LORR'),)
Process PoolWorker-1:
Traceback (most recent call last):
File "/alma/ACS-10.1/casa/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "/alma/ACS-10.1/casa/lib/python2.6/multiprocessing/process.py", line 88, in run
self._target(*self._args, **self._kwargs)
File "/alma/ACS-10.1/casa/lib/python2.6/multiprocessing/pool.py", line 71, in worker
put((job, i, result))
File "/alma/ACS-10.1/casa/lib/python2.6/multiprocessing/queues.py", line 366, in put
return send(obj)
UnpickleableError: Cannot pickle <type 'thread.lock'> objects
In [5]: dir(lorr)
Out[5]:
['GET_AMBIENT_TEMPERATURE',
'GET_CAN_ERROR',
'GET_CAN_ERROR_COUNT',
'GET_CHANNEL_NUMBER',
'GET_COUNT_PER_C_OP',
'GET_COUNT_REMAINING_OP',
'GET_DCM_LOCKED',
'GET_EFC_125_MHZ',
'GET_EFC_COMB_LINE_PLL',
'GET_ERROR_CODE_LAST_CAN_ERROR',
'GET_INTERNAL_SLAVE_ERROR_CODE',
'GET_MAGNITUDE_CELSIUS_OP',
'GET_MAJOR_REV_LEVEL',
'GET_MINOR_REV_LEVEL',
'GET_MODULE_CODES_CDAY',
'GET_MODULE_CODES_CMONTH',
'GET_MODULE_CODES_DIG1',
'GET_MODULE_CODES_DIG2',
'GET_MODULE_CODES_DIG4',
'GET_MODULE_CODES_DIG6',
'GET_MODULE_CODES_SERIAL',
'GET_MODULE_CODES_VERSION_MAJOR',
'GET_MODULE_CODES_VERSION_MINOR',
'GET_MODULE_CODES_YEAR',
'GET_NODE_ADDRESS',
'GET_OPTICAL_POWER_OFF',
'GET_OUTPUT_125MHZ_LOCKED',
'GET_OUTPUT_2GHZ_LOCKED',
'GET_PATCH_LEVEL',
'GET_POWER_SUPPLY_12V_NOT_OK',
'GET_POWER_SUPPLY_15V_NOT_OK',
'GET_PROTOCOL_MAJOR_REV_LEVEL',
'GET_PROTOCOL_MINOR_REV_LEVEL',
'GET_PROTOCOL_PATCH_LEVEL',
'GET_PROTOCOL_REV_LEVEL',
'GET_PWR_125_MHZ',
'GET_PWR_25_MHZ',
'GET_PWR_2_GHZ',
'GET_READ_MODULE_CODES',
'GET_RX_OPT_PWR',
'GET_SERIAL_NUMBER',
'GET_SIGN_OP',
'GET_STATUS',
'GET_SW_REV_LEVEL',
'GET_TE_LENGTH',
'GET_TE_LONG_FLAG_SET',
'GET_TE_OFFSET_COUNTER',
'GET_TE_SHORT_FLAG_SET',
'GET_TRANS_NUM',
'GET_VDC_12',
'GET_VDC_15',
'GET_VDC_7',
'GET_VDC_MINUS_7',
'SET_CLEAR_FLAGS',
'SET_FPGA_LOGIC_RESET',
'SET_RESET_AMBSI',
'SET_RESET_DEVICE',
'SET_RESYNC_TE',
'STATUS',
'_HardwareDevice__componentName',
'_HardwareDevice__hw',
'_HardwareDevice__stickyFlag',
'_LORRBase__logger',
'__del__',
'__doc__',
'__init__',
'__module__',
'_devices',
'clearDeviceCommunicationErrorAlarm',
'getControlList',
'getDeviceCommunicationErrorCounter',
'getErrorMessage',
'getHwState',
'getInternalSlaveCanErrorMsg',
'getLastCanErrorMsg',
'getMonitorList',
'hwConfigure',
'hwDiagnostic',
'hwInitialize',
'hwOperational',
'hwSimulation',
'hwStart',
'hwStop',
'inErrorState',
'isMonitoring',
'isSimulated']
In [6]:
When you use multiprocessing to open a second process, an entirely new instance of Python, with its own global state, is created. That global state is not shared, so changes made by child processes to global variables will be invisible to the parent process.
Additionally, most of the abstractions that multiprocessing provides use pickle to transfer data. All data transferred using proxies must be pickleable; that includes all the objects that a Manager provides. Relevant quotations (my emphasis):
Ensure that the arguments to the methods of proxies are picklable.
And (in the Manager section):
Other processes can access the shared objects by using proxies.
Queues also require pickleable data; the docs don't say so, but a quick test confirms it:
import multiprocessing
import pickle

class Thing(object):
    def __getstate__(self):
        print 'got pickled'
        return self.__dict__
    def __setstate__(self, state):
        print 'got unpickled'
        self.__dict__.update(state)

q = multiprocessing.Queue()
p = multiprocessing.Process(target=q.put, args=(Thing(),))
p.start()
print q.get()
p.join()
Output:
$ python mp.py
got pickled
got unpickled
<__main__.Thing object at 0x10056b350>
The one approach that might work for you, if you really can't pickle the data, is to find a way to store it as a ctype object; a reference to the memory can then be passed to a child process. This seems pretty dodgy to me; I've never done it. But it might be a possible solution for you.
Given your update, it seems like you need to know a lot more about the internals of a LORR. Is LORR a class? Can you subclass from it? Is it a subclass of something else? What's its MRO? (Try LORR.__mro__ and post the output if it works.) If it's a pure python object, it might be possible to subclass it, creating a __setstate__ and a __getstate__ to enable pickling.
Another approach might be to figure out how to get the relevant data out of a LORR instance and pass it via a simple string. Since you say that you really just want to call the methods of the object, why not just do so using Queues to send messages back and forth? In other words, something like this (schematically):
Main Process              Child 1                    Child 2
                          LORR 1                     LORR 2
child1_in_queue     ->    get message 'foo'
                          call 'foo' method
child1_out_queue    <-    return foo data string
child2_in_queue     ->                               get message 'bar'
                                                     call 'bar' method
child2_out_queue    <-                               return bar data string
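The scheme above could look roughly like the following for a single child. This is a sketch under assumptions: FakeDevice is a hypothetical stand-in for a LORR instance (made unpicklable by holding a lock), GET_STATUS just returns a string, and the fork start method is used so the helper can be called from anywhere on a POSIX system. The important part is that the device is created inside the child, so only method names and result strings ever cross the queues.

```python
import multiprocessing as mp
import threading

class FakeDevice:
    """Hypothetical stand-in for an unpicklable object such as a LORR instance."""
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # this attribute makes instances unpicklable

    def GET_STATUS(self):
        return "%s: ok" % self.name

def device_server(name, requests, replies):
    device = FakeDevice(name)  # built inside the child, never pickled
    while True:
        method_name = requests.get()
        if method_name is None:  # sentinel: shut down
            break
        # Call the requested method and send back a plain string.
        replies.put(str(getattr(device, method_name)()))

def query(name, method_names):
    ctx = mp.get_context("fork")  # assumption: a platform where fork is available
    requests, replies = ctx.Queue(), ctx.Queue()
    child = ctx.Process(target=device_server, args=(name, requests, replies))
    child.start()
    answers = []
    for method_name in method_names:
        requests.put(method_name)
        answers.append(replies.get(timeout=10))
    requests.put(None)
    child.join()
    return answers
```

For example, query("DV03", ["GET_STATUS"]) returns a one-element list with the status string. A real version would need error handling for unknown method names and for methods whose results are not simple values.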
@DBlas gives you a quick url and reference to the Manager class in an answer, but I think it's still a bit vague, so I thought it might be helpful for you to just see it applied...
import multiprocessing
from multiprocessing import Manager

ants = ['DV03', 'DV04']

def getDV03CclDrivers(lib, data_list):
    data_list[1] = 1
    data_list[0] = 0

def getDV04CclDrivers(lib, data_dict):
    data_dict['driver'] = 0

if __name__ == "__main__":
    manager = Manager()
    dataDV03 = manager.list(['', ''])
    dataDV04 = manager.dict({'driver': '', 'status': ''})

    jobs = []
    if 'DV03' in ants:
        j = multiprocessing.Process(
            target=getDV03CclDrivers,
            args=('LORR', dataDV03))
        jobs.append(j)

    if 'DV04' in ants:
        j = multiprocessing.Process(
            target=getDV04CclDrivers,
            args=('LORR', dataDV04))
        jobs.append(j)

    for j in jobs:
        j.start()

    for j in jobs:
        j.join()

    print 'Results:\n'
    print 'DV03', dataDV03
    print 'DV04', dataDV04
Because multiprocessing actually uses separate processes, you cannot simply share global variables: they live in completely different "spaces" in memory. What you do to a global in one process will not be reflected in another. I admit it seems confusing, since the way you see it, it's all living right there in the same piece of code, so "why shouldn't those methods have access to the global"? It's harder to wrap your head around the idea that they will be running in different processes.
The Manager class is given to act as a proxy for data structures that can shuttle info back and forth for you between processes. What you will do is create a special dict and list from a manager, pass them into your methods, and operate on them locally.
Un-pickle-able data
For your specialized LORR object, you might need to create something like a proxy that can represent the picklable state of the instance.
Not super robust or tested much, but gives you the idea.
class LORRProxy(object):
    def __init__(self, lorrObject=None):
        self.instance = lorrObject

    def __getstate__(self):
        # how to get the state data out of a lorr instance
        inst = self.instance
        state = dict(
            foo = inst.a,
            bar = inst.b,
        )
        return state

    def __setstate__(self, state):
        # rebuild a lorr instance from state
        lorr = LORR.LORR()
        lorr.a = state['foo']
        lorr.b = state['bar']
        self.instance = lorr
When using multiprocessing, the usual way to pass objects between processes is a Queue or a Pipe; globals are not shared. Objects must be picklable, so multiprocessing won't help you here.
You could also use a multiprocessing Array. This allows you to have a shared state between processes and is probably the closest thing to a global variable.
At the top of main, declare an Array. The first argument 'i' says it will be integers. The second argument gives the initial values:
shared_dataDV03 = multiprocessing.Array ('i', (0, 0)) #a shared array
Then pass this array to the process as an argument:
j = multiprocessing.Process(target=getDV03CclDrivers, args=('LORR',shared_dataDV03))
You have to receive the array argument in the function being called, and then you can modify it within the function:
def getDV03CclDrivers(lib,arr):  # call global variable
    arr[1]=1
    arr[0]=0
The array is shared with the parent, so you can print out the values at the end in the parent:
print 'DV03', shared_dataDV03[:]
And it will show the changes:
DV03 [0, 1]
I use p.map() to spin off a number of processes to remote servers and print the results when they come back at unpredictable times:
Servers=[...]
from multiprocessing import Pool
p=Pool(len(Servers))
p.map(DoIndividualSummary, Servers)
This worked fine if DoIndividualSummary used print for the results, but the overall result was in unpredictable order, which made interpretation difficult. I tried a number of approaches to use global variables but ran into problems. Finally, I succeeded with sqlite3.
Before p.map(), open a sqlite connection and create a table:
import sqlite3
conn=sqlite3.connect('servers.db') # need conn for commit and close
db=conn.cursor()
try: db.execute('''drop table servers''')
except: pass
db.execute('''CREATE TABLE servers (server text, serverdetail text, readings text)''')
conn.commit()
Then, when returning from DoIndividualSummary(), save the results into the table:
db.execute('''INSERT INTO servers VALUES (?,?,?)''', (server,serverdetail,readings))
conn.commit()
return
After the map() statement, print the results:
db.execute('''select * from servers order by server''')
rows=db.fetchall()
for server,serverdetail,readings in rows: print serverdetail,readings
May seem like overkill but it was simpler for me than the recommended solutions.
I am sorry that I can't reproduce the error with a simpler example, and my code is too complicated to post. If I run the program in IPython shell instead of the regular Python, things work out well.
I looked up some previous notes on this problem. They were all caused by using pool to call function defined within a class function. But this is not the case for me.
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 313, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I would appreciate any help.
Update: The function I pickle is defined at the top level of the module. Though it calls a function that contains a nested function. i.e, f() calls g() calls h() which has a nested function i(), and I am calling pool.apply_async(f). f(), g(), h() are all defined at the top level. I tried simpler example with this pattern and it works though.
Here is a list of what can be pickled. In particular, functions are only picklable if they are defined at the top-level of a module.
This piece of code:
import multiprocessing as mp

class Foo():
    @staticmethod
    def work(self):
        pass

if __name__ == '__main__':
    pool = mp.Pool()
    foo = Foo()
    pool.apply_async(foo.work)
    pool.close()
    pool.join()
yields an error almost identical to the one you posted:
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 315, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
The problem is that the pool methods all use a mp.SimpleQueue to pass tasks to the worker processes. Everything that goes through the mp.SimpleQueue must be picklable, and foo.work is not picklable since it is not defined at the top level of the module.
It can be fixed by defining a function at the top level, which calls foo.work():
def work(foo):
    foo.work()

pool.apply_async(work,args=(foo,))
Notice that foo is picklable, since Foo is defined at the top level and foo.__dict__ is picklable.
I'd use pathos.multiprocessing instead of multiprocessing. pathos.multiprocessing is a fork of multiprocessing that uses dill. dill can serialize almost anything in Python, so you are able to send a lot more around in parallel. The pathos fork also has the ability to work directly with multiple-argument functions, as you need for class methods.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(4)
>>> class Test(object):
...     def plus(self, x, y):
...         return x+y
...
>>> t = Test()
>>> p.map(t.plus, x, y)
[4, 6, 8, 10]
>>>
>>> class Foo(object):
...     @staticmethod
...     def work(self, x):
...         return x+1
...
>>> f = Foo()
>>> p.apipe(f.work, f, 100)
<processing.pool.ApplyResult object at 0x10504f8d0>
>>> res = _
>>> res.get()
101
Get pathos (and if you like, dill) here:
https://github.com/uqfoundation
When this problem comes up with multiprocessing a simple solution is to switch from Pool to ThreadPool. This can be done with no change of code other than the import-
from multiprocessing.pool import ThreadPool as Pool
This works because ThreadPool shares memory with the main thread, rather than creating a new process- this means that pickling is not required.
The downside to this method is that python isn't the greatest language with handling threads- it uses something called the Global Interpreter Lock to stay thread safe, which can slow down some use cases here. However, if you're primarily interacting with other systems (running HTTP commands, talking with a database, writing to filesystems) then your code is likely not bound by CPU and won't take much of a hit. In fact I've found when writing HTTP/HTTPS benchmarks that the threaded model used here has less overhead and delays, as the overhead from creating new processes is much higher than the overhead for creating new threads and the program was otherwise just waiting for HTTP responses.
So if you're processing a ton of stuff in python userspace this might not be the best method.
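As a small illustration of the point about pickling: with ThreadPool even a closure, which a process-based Pool would refuse, can be mapped as-is. The names make_adder and run are made up for this sketch.

```python
from multiprocessing.pool import ThreadPool

def make_adder(offset):
    def add(x):  # closure: pickle could not serialize this for a process Pool
        return x + offset
    return add

def run(values, offset):
    # Worker threads share the parent's memory, so the callable is passed by reference.
    with ThreadPool(4) as pool:
        return pool.map(make_adder(offset), values)

print(run([1, 2, 3], 10))
```

The call returns the results in input order, just as Pool.map does. The GIL caveat above still applies: this helps for I/O-bound work, not for CPU-heavy functions.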
As others have said multiprocessing can only transfer Python objects to worker processes which can be pickled. If you cannot reorganize your code as described by unutbu, you can use dills extended pickling/unpickling capabilities for transferring data (especially code data) as I show below.
This solution requires only the installation of dill and no other libraries as pathos:
import os
from multiprocessing import Pool

import dill

def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    return fun(*args)

def apply_async(pool, fun, args):
    payload = dill.dumps((fun, args))
    return pool.apply_async(run_dill_encoded, (payload,))

if __name__ == "__main__":
    pool = Pool(processes=5)

    # async execution of lambda
    jobs = []
    for i in range(10):
        job = apply_async(pool, lambda a, b: (a, b, a * b), (i, i + 1))
        jobs.append(job)

    for job in jobs:
        print job.get()
    print

    # async execution of static method
    class O(object):
        @staticmethod
        def calc():
            return os.getpid()

    jobs = []
    for i in range(10):
        job = apply_async(pool, O.calc, ())
        jobs.append(job)

    for job in jobs:
        print job.get()
I have found that I can also generate exactly that error output on a perfectly working piece of code by attempting to use the profiler on it.
Note that this was on Windows (where the forking is a bit less elegant).
I was running:
python -m profile -o output.pstats <script>
And found that removing the profiling removed the error and placing the profiling restored it. Was driving me batty too because I knew the code used to work. I was checking to see if something had updated pool.py... then had a sinking feeling and eliminated the profiling and that was it.
Posting here for the archives in case anybody else runs into it.
Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
This error will also come if you have any inbuilt function inside the model object that was passed to the async job.
So make sure to check the model objects that are passed doesn't have inbuilt functions. (In our case we were using FieldTracker() function of django-model-utils inside the model to track a certain field). Here is the link to relevant GitHub issue.
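To find out which attribute of an object is the culprit, you can try pickling the attributes one by one. A minimal sketch: unpicklable_attributes is a helper invented for this example, and sample is a made-up object standing in for a model instance.

```python
import pickle
import threading
from types import SimpleNamespace

def unpicklable_attributes(obj):
    """Return the sorted names of instance attributes that pickle refuses to serialize."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:  # pickle raises several different exception types
            bad.append(name)
    return sorted(bad)

# Hypothetical object standing in for a model instance.
sample = SimpleNamespace(
    title="report",
    tracker=lambda: None,   # a function attribute, like a field tracker
    lock=threading.Lock(),
)

print(unpicklable_attributes(sample))
```

It only looks at the first level of vars(obj); for nested containers you would have to recurse, and objects without a __dict__ need a different approach.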
This solution requires only the installation of dill and no other libraries as pathos
def apply_packed_function_for_map((dumped_function, item, args, kwargs),):
    """
    Unpack dumped function as target function and call it with arguments.

    :param (dumped_function, item, args, kwargs):
        a tuple of dumped function and its arguments
    :return:
        result of target function
    """
    target_function = dill.loads(dumped_function)
    res = target_function(item, *args, **kwargs)
    return res


def pack_function_for_map(target_function, items, *args, **kwargs):
    """
    Pack function and arguments to object that can be sent from one
    multiprocessing.Process to another. The main problem is:
        «multiprocessing.Pool.map*» or «apply*»
        cannot use class methods or closures.
    It solves this problem with «dill».
    It works with target function as argument, dumps it («with dill»)
    and returns dumped function with arguments of target function.
    For more performance we dump only target function itself
    and don't dump its arguments.
    How to use (pseudo-code):

        ~>>> import multiprocessing
        ~>>> images = [...]
        ~>>> pool = multiprocessing.Pool(100500)
        ~>>> features = pool.map(
        ~...     *pack_function_for_map(
        ~...         super(Extractor, self).extract_features,
        ~...         images,
        ~...         type='png',
        ~...         **options,
        ~...     )
        ~... )
        ~>>>

    :param target_function:
        function, that you want to execute like target_function(item, *args, **kwargs).
    :param items:
        list of items for map
    :param args:
        positional arguments for target_function(item, *args, **kwargs)
    :param kwargs:
        named arguments for target_function(item, *args, **kwargs)
    :return: tuple(function_wrapper, dumped_items)
        It returns a tuple with
            * function wrapper, that unpacks and calls the target function;
            * list of packed target function and its arguments.
    """
    dumped_function = dill.dumps(target_function)
    dumped_items = [(dumped_function, item, args, kwargs) for item in items]
    return apply_packed_function_for_map, dumped_items
It also works for numpy arrays.
A quick fix is to make the function global
from multiprocessing import Pool

class Test:
    def __init__(self, x):
        self.x = x

    @staticmethod
    def test(x):
        return x**2

    def test_apply(self, list_):
        global r
        def r(x):
            return Test.test(x + self.x)

        with Pool() as p:
            l = p.map(r, list_)
        return l

if __name__ == '__main__':
    o = Test(2)
    print(o.test_apply(range(10)))
Building on @rocksportrocker's solution,
it would make sense to use dill both when sending the tasks and when receiving the results.
import dill
import itertools

def run_dill_encoded(payload):
    fun, args = dill.loads(payload)
    res = fun(*args)
    res = dill.dumps(res)
    return res

def dill_map_async(pool, fun, args_list,
                   as_tuple=True,
                   **kw):
    if as_tuple:
        args_list = ((x,) for x in args_list)

    it = itertools.izip(
        itertools.cycle([fun]),
        args_list)
    it = itertools.imap(dill.dumps, it)
    return pool.map_async(run_dill_encoded, it, **kw)

if __name__ == '__main__':
    import multiprocessing as mp
    import sys,os
    p = mp.Pool(4)
    res = dill_map_async(p, lambda x:[sys.stdout.write('%s\n'%os.getpid()),x][-1],
                         [lambda x:x+1]*10,)
    res = res.get(timeout=100)
    res = map(dill.loads,res)
    print(res)
As @penky Suresh has suggested in this answer, don't use built-in keywords.
Apparently args is a built-in keyword when dealing with multiprocessing
from concurrent.futures import ProcessPoolExecutor, as_completed

class TTS:
    def __init__(self):
        pass

    def process_and_render_items(self):
        multiprocessing_args = [{"a": "b", "c": "d"}, {"e": "f", "g": "h"}]

        with ProcessPoolExecutor(max_workers=10) as executor:
            # Using args here is fine.
            future_processes = {
                executor.submit(TTS.process_and_render_item, args)
                for args in multiprocessing_args
            }

            for future in as_completed(future_processes):
                try:
                    data = future.result()
                except Exception as exc:
                    print(f"Generated an exception: {exc}")
                else:
                    print(f"Generated data for comment process: {future}")

    # Don't use 'args' here. It seems to be a built-in keyword.
    # Changing 'args' to 'arg' worked for me.
    def process_and_render_item(arg):
        print(arg)
        # This will print {"a": "b", "c": "d"} for the first process
        # and {"e": "f", "g": "h"} for the second process.
PS: The tabs/spaces maybe a bit off.