How to make use of a multiprocessing manager within a class - python

To start with, here is some code that works:
from multiprocessing import Pool, Manager
import random

manager = Manager()
dct = manager.dict()

def do_thing(n):
    for i in range(10_000_000):
        i += 1
    dct[n] = random.randint(0, 9)

with Pool(2) as pool:
    pool.map(do_thing, range(10))
Now if I try to make a class out of this:
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        self.manager = Manager()
        self.dct = self.manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
I run into: TypeError: Pickling an AuthenticationString object is disallowed for security reasons. Now from here, I get the hint that Python is trying to pickle the Manager which as I understand has its own dedicated process, and processes can't be pickled because they contain an AuthenticationString.
I don't know enough about how forking works (I'm on Linux, so I understand this is the default method for starting new processes) to understand exactly why the Manager instance needs to be pickled.
So here are my questions:
Why is this happening?
How can I use a Manager when doing multiprocessing within a class? PS: I want to be able to import SomeClass from this module.
Is what I'm asking for unreasonable or unconventional?
PS: I know I can do this exact snippet without the Manager by exploiting the fact that pool.map returns results in order, so something like this: res = pool.map(self.do_thing, range(10)) then dct = {k: v for k, v in zip(range(10), res)}. But that's beside the point of the question.

To answer your questions:
Q1 - Why is this happening?
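Each worker process in the Pool has to execute the bound method self.do_thing. To get it there, Python pickles the method, which means pickling the instance it is bound to along with all of the instance's attributes, and sends it to the worker process, which unpickles it. Unpickling involves importing the module that defines the class and restoring the instance's attributes. An instance that holds a Manager as an attribute cannot go through this, because a Manager is not picklable (it carries an AuthenticationString). This happens with fork as well: the workers are forked, but each task is still sent to them through a queue, so the callable is pickled regardless of the start method.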
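A minimal sketch of the failure, using only the standard library. Holder is a hypothetical stand-in for SomeClass: instead of starting a real Manager it stores the current process's authkey, which is the same AuthenticationString type a Manager keeps internally. Pickling the bound method pulls in the instance and trips over that attribute.

```python
import pickle
from multiprocessing import current_process


class Holder:
    """Hypothetical stand-in for a class that keeps a Manager on self."""

    def __init__(self):
        # A Manager stores an AuthenticationString like this one internally.
        self.authkey = current_process().authkey

    def do_thing(self, n):
        return n


def pickle_error(obj):
    """Return the error message raised while pickling obj, or None."""
    try:
        pickle.dumps(obj)
    except Exception as exc:  # a TypeError in the AuthenticationString case
        return str(exc)
    return None


if __name__ == "__main__":
    # Pickling the bound method means pickling the Holder instance too.
    print(pickle_error(Holder().do_thing))
```

The same thing happens inside pool.map(self.do_thing, ...), only there the error is raised while the task is being sent to a worker.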
Q2 - How to fix it
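You can avoid the problem by giving the class a single class-level Manager that is shared by all instances. Class attributes are not part of an instance's pickled state, so the Manager is no longer dragged along when the bound method is pickled; only the dict proxy in self.dct is, and proxies can be pickled. Here the __init__() method creates the manager class attribute the first time an instance is created, and from that point on further instances reuse it. This is sometimes called "lazy initialization".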
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        # Lazy creation of class attribute.
        try:
            manager = getattr(type(self), 'manager')
        except AttributeError:
            manager = type(self).manager = Manager()
        self.dct = manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))
        print('done')

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
Q3 - Is this a reasonable thing to do?
In my opinion, yes.
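Why the class-level attribute helps can be checked without starting any processes. In this sketch, Shared and the object() placeholder are hypothetical stand-ins for SomeClass and its Manager; the point is only that a lazily created class attribute never shows up in the per-instance state, which is what pickle saves by default.

```python
class Shared:
    """Hypothetical stand-in: `resource` plays the role of the Manager."""

    def __init__(self):
        cls = type(self)
        # Lazy creation: only the first instance builds the shared resource.
        if not hasattr(cls, "resource"):
            cls.resource = object()  # placeholder for Manager()
        self.value = 42  # per-instance state, like self.dct


def instance_state(obj):
    """Names stored on the instance itself (what pickle saves by default)."""
    return sorted(vars(obj))


if __name__ == "__main__":
    a, b = Shared(), Shared()
    print(instance_state(a))         # instance attributes only
    print(a.resource is b.resource)  # both instances share one resource
```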

Related

Python multiprocessing.Pool.apply_async() not executing class function

In a custom class I have the following code:
class CustomClass():
    triggerQueue: multiprocessing.Queue

    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    def poolFunc(queueString):
        print(queueString)

    def listenerFunc(self):
        pool = multiprocessing.Pool(5)
        while True:
            try:
                queueString = self.triggerQueue.get_nowait()
                pool.apply_async(func=self.poolFunc, args=(queueString,))
            except queue.Empty:
                break
What I intend to do is:
add a trigger to the queue (not implemented in this snippet) -> works as intended
run an endless loop within the listenerFunc that reads all triggers from the queue (if any are found) -> works as intended
pass trigger to poolFunc which is to be executed asynchronously -> not working
It works as soon as I move my poolFunc() outside of the class like
def poolFunc(queueString):
    print(queueString)

class CustomClass():
    [...]
But why is that so? Do I have to pass the self argument somehow? Is it impossible to perform it this way in general?
Thank you for any hint!
There are several problems going on here.
Your instance method, poolFunc, is missing a self parameter.
You are never properly terminating the Pool. You should take advantage of the fact that a multiprocessing.Pool object is a context manager.
You're calling apply_async, but you're never waiting for the results. Read the documentation: you need to call the get method on the AsyncResult object to receive the result; if you don't do this before your program exits your poolFunc function may never run.
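By making the Queue object part of your class, you won't be able to pass instance methods to workers: pickling the bound method pickles the instance together with its Queue attribute, and a multiprocessing.Queue can only be shared with child processes through inheritance, not by pickling it as part of a task.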
We can fix all of the above like this:
import multiprocessing
import queue

triggerQueue = multiprocessing.Queue()

class CustomClass:
    def poolFunc(self, queueString):
        print(queueString)

    def listenerFunc(self):
        results = []
        with multiprocessing.Pool(5) as pool:
            while True:
                try:
                    queueString = triggerQueue.get_nowait()
                    results.append(pool.apply_async(self.poolFunc, (queueString,)))
                except queue.Empty:
                    break
            for res in results:
                print(res.get())

c = CustomClass()
for i in range(10):
    triggerQueue.put(f"testval{i}")
c.listenerFunc()
You can, as you mention, also replace your instance method with a static method, in which case we can keep triggerQueue as part of the class:
import multiprocessing
import queue

class CustomClass:
    def __init__(self):
        self.triggerQueue = multiprocessing.Queue()

    @staticmethod
    def poolFunc(queueString):
        print(queueString)

    def listenerFunc(self):
        results = []
        with multiprocessing.Pool(5) as pool:
            while True:
                try:
                    queueString = self.triggerQueue.get_nowait()
                    results.append(pool.apply_async(self.poolFunc, (queueString,)))
                except queue.Empty:
                    break
            for r in results:
                print(r.get())

c = CustomClass()
for i in range(10):
    c.triggerQueue.put(f"testval{i}")
c.listenerFunc()
But we still need to collect the apply_async results.
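The reason for collecting the results can be seen in a small sketch. It swaps in multiprocessing.pool.ThreadPool, which offers the same apply_async and AsyncResult interface as Pool but runs tasks in threads, purely to keep the example light. Broken is a hypothetical class that repeats the missing-self mistake from the question. The TypeError raised inside the task is stored on the AsyncResult and only surfaces when get() is called.

```python
from multiprocessing.pool import ThreadPool


class Broken:
    def pool_func(queue_string):  # bug on purpose: no `self` parameter
        return queue_string


def submit_and_collect(func, arg):
    """Run func(arg) via apply_async and report what get() delivers."""
    with ThreadPool(1) as pool:
        result = pool.apply_async(func, (arg,))
        # Without this get(), the failure below would pass unnoticed.
        try:
            return ("ok", result.get(timeout=10))
        except TypeError as exc:
            return ("error", type(exc).__name__)


if __name__ == "__main__":
    print(submit_and_collect(Broken().pool_func, "testval0"))
    print(submit_and_collect(str.upper, "testval0"))
```

With a process-based Pool the stored exception may be a different one (for example a pickling failure), but it is reported the same way, through get().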
Okay, I found an answer and a workaround:
The answer is based on the answer of noxdafox to this question.
Instance methods cannot be serialized that easily. When pickle serializes a function, it essentially stores just a reference to it (its module and name) as a string.
For a bound method the instance has to travel along as well, and for a child process it would be quite hard to find the right object your instance method is referring to, because the processes have separate address spaces.
A functioning workaround is to declare poolFunc() as a static method like
@staticmethod
def poolFunc(queueString):
    print(queueString)
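A sketch of what storing a function "by reference" means, using only the standard library. The pickle of a module-level function contains little more than its module and name, so the receiving process has to be able to import it; a lambda has no importable name and is rejected. json.dumps is just a convenient example function here.

```python
import json
import pickle


def can_pickle(obj):
    """True if obj survives pickle.dumps, False if pickle rejects it."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


if __name__ == "__main__":
    payload = pickle.dumps(json.dumps)
    # The payload names the function; it does not contain its code.
    print(b"json" in payload, b"dumps" in payload)
    print(can_pickle(json.dumps))   # importable by name
    print(can_pickle(lambda s: s))  # no importable name
```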

class variable become empty in multiprocessing pool.map

I have a class variable in a Utils class.
class Utils:
    _raw_data = defaultdict(list)

    @classmethod
    def raw_data(cls):
        return cls._raw_data.copy()

    @classmethod
    def set_raw_data(cls, key, data):
        cls._raw_data[key] = data
The _raw_data was filled with key and value pairs before it was being read.
...
data = [ipaddress.IPv4Network(address) for address in ip_addresses]
Utils.set_raw_data(device_name, data)
But when I try to execute a function in multiprocessing Pool.map that reads the raw_data from the Utils class, it returns an empty list.
This is the method from the parent class
class Parent:
    ...
    def evaluate_without_prefix(self, devices):
        results = []
        print(Utils.raw_data())  # <------ this print shows that the Utils.raw_data() is empty
        for network1, network2 in itertools.product(Utils.raw_data()[devices[0]], Utils.raw_data()[devices[1]]):
            if network1.subnet_of(network2):
                results.append((devices[0], network1, devices[1], network2))
            if network2.subnet_of(network1):
                results.append((devices[1], network2, devices[0], network1))
        return results
and in the child class, I execute the method from the parent class, with multiprocessing pool.
class Child(Parent):
    ...
    def execute(self):
        pool = Pool(os.cpu_count() - 1)
        devices = list(itertools.combinations(list(Utils.raw_data().keys()), 2))
        results = pool.map(super().evaluate_without_prefix, devices)
        return results
The print() in the Parent class shows that raw_data() is empty, although the variable actually has data: the devices variable in the Child class is built from raw_data(). Once execution enters the multiprocessing pool, however, raw_data() is empty. Any reason for this?
The problem seems to be as follows:
The class data created in your main process must be serialized/de-serialized using pickle so that it can be passed from the main process's address space to the address spaces of the processes in the multiprocessing pool that need to work with these objects. But the data in question here is an instance of class Parent, since you are calling one of its methods, i.e. evaluate_without_prefix. Nowhere in that instance is there a reference to class Utils or anything else that would cause the multiprocessing pool to serialize the Utils class along with the Parent instance. Consequently, when that method references class Utils in any of the pool processes, a new Utils class will be created and, of course, it will not have its dictionary initialized.
I think the simplest change is to:
Make attribute _raw_data an instance attribute rather than a class attribute (by the way, according to your current usage, there is no need for this to be a defaultdict).
Create an instance of class Utils named utils and initialize the dictionary via this reference.
Use the initializer and initargs arguments of the multiprocessing.Pool constructor to initialize each process in the multiprocessing pool to have a global variable named utils that will be a copy of the utils instance created by the main process.
So I would organize the code along the following lines:
class Utils:
    def __init__(self):
        self._raw_data = {}

    def raw_data(self):
        # No need to make a copy ???
        return self._raw_data.copy()

    def set_raw_data(self, key, data):
        self._raw_data[key] = data

def init_processes(utils_instance):
    """
    Initialize each process in the process pool with global variable utils.
    """
    global utils
    utils = utils_instance

class Parent:
    ...
    def evaluate_without_prefix(self, devices):
        results = []
        print(utils.raw_data())
        for network1, network2 in itertools.product(utils.raw_data()[devices[0]], utils.raw_data()[devices[1]]):
            results.append([network1, network2])
        return results

class Child(Parent):
    ...
    def execute(self, utils):
        pool = Pool(os.cpu_count() - 1, initializer=init_processes, initargs=(utils,))
        # No need to make an explicit list (map will do that for you) ???
        devices = list(itertools.combinations(list(utils.raw_data().keys()), 2))
        results = pool.map(super().evaluate_without_prefix, devices)
        return results

def main():
    utils = Utils()
    # Initialize utils:
    ...
    data = [ipaddress.IPv4Network(address) for address in ip_addresses]
    utils.set_raw_data(device_name, data)
    child = Child()
    results = child.execute(utils)

if __name__ == '__main__':
    main()
Further Explanation
The following program's main process calls class method Foo.set_x to update class attribute x to the value of 10 before creating a multiprocessing pool and invoking worker function worker, which prints out the value of Foo.x.
On Windows, which uses OS spawn to create new processes, the process in the pool is initialized prior to calling the worker function essentially by launching a new Python interpreter and re-executing the source program executing every statement at global scope. Hence the class definition of Foo is created by the Python interpreter compiling it; there is no pickling involved. But Foo.x will be 0.
The same program run on Linux, which uses OS fork to create new processes, inherits a copy-on-write address space from the main process. Therefore, it will have a copy of the Foo class as it existed at the time the multiprocessing pool was created and Foo.x will be 10.
My solution above, which uses a pool initializer to set a global variable in the address space of each pool process to the value of the Utils instance, is what is required for Windows platforms and will also work on Linux. An alternative, of course, is to pass the Utils instance as an additional argument to your worker function instead of using a pool initializer. This is generally not as efficient, because the number of processes in the pool is usually smaller than the number of times the worker function is invoked, so less pickling is required with the pool initializer method.
from multiprocessing import Pool

class Foo:
    x = 0

    @classmethod
    def set_x(cls, x):
        cls.x = x

def worker():
    print(Foo.x)

if __name__ == '__main__':
    Foo.set_x(10)
    pool = Pool(1)
    pool.apply(worker)
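The alternative mentioned above, passing the data to the worker as an extra argument, can be sketched with functools.partial. The function find_subnets and the sample networks are hypothetical; the subnet comparison follows the question's evaluate_without_prefix. With this approach the dictionary travels with the tasks rather than once per pool process, which is the efficiency trade-off described above.

```python
import ipaddress
import itertools
from functools import partial
from multiprocessing import Pool


def find_subnets(raw_data, devices):
    """Module-level worker: all data arrives through its arguments."""
    first, second = devices
    results = []
    for net1, net2 in itertools.product(raw_data[first], raw_data[second]):
        if net1.subnet_of(net2):
            results.append((first, net1, second, net2))
        if net2.subnet_of(net1):
            results.append((second, net2, first, net1))
    return results


def sample_data():
    """Hypothetical input standing in for the question's device data."""
    return {
        "router_a": [ipaddress.IPv4Network("10.0.0.0/24")],
        "router_b": [ipaddress.IPv4Network("10.0.0.0/16")],
    }


if __name__ == "__main__":
    raw_data = sample_data()
    pairs = list(itertools.combinations(raw_data, 2))
    with Pool(2) as pool:
        # partial binds raw_data, so it is pickled along with the tasks.
        print(pool.map(partial(find_subnets, raw_data), pairs))
```

Because the worker no longer depends on class-level or global state, it behaves the same under fork and spawn.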

Multiprocessing Manager can't be passed to another process

I need to pass a Manager instance to other processes, as I need instances of proxy objects to be created in parallel and later re-used in separate processes. However, it appears that I can't pass a Manager as an argument to a function that is supposed to be run by another process. See an example:
from multiprocessing.managers import BaseManager
from multiprocessing import Pool
from functools import partial

class MyManager(BaseManager):
    pass

class MyClass():
    def __init__(self, i):
        self.i = i

def my_fun(i, manager):
    return manager.MyClass(i)

MyManager.register('MyClass', MyClass)
manager = MyManager()
manager.start()
f = partial(my_fun, manager=manager)
with Pool(4) as p:
    res = [r.i for r in p.map(f, list(range(10)))]
print(res)
The following exception will arise if I run the code above:
TypeError: Pickling an AuthenticationString object is disallowed for security reasons
Interestingly, passing the Manager inside the args argument of Pool.Process works, but I still need map functionality.
First of all, the proxy that is automatically generated for your class does not support the access of attributes. So if you want to access the i attribute of your managed class, you will need to explicitly define your own proxy class. It will be easier to just define, for example, a method get_i to return that attribute. I would typically define the get_i method in a subclass of the original class created just for the purpose of being used as the managed class implementation. In the code below I have defined such a method (although I have not bothered to create a special subclass) and a custom proxy class to show you how you would do this.
I just see no way of passing the manager instance to another process. The solution I came up with (there may be better ones) is to create a thread that will accept requests via the connections exposed by a multiprocessing.Pipe instance. You will need to enforce single threading of these requests not only because you cannot have multiple processes sending to the same connection concurrently but also because it is the only way to ensure that the response a requestor gets back matches up with its request.
The idea is that the my_fun function sends via its connection the argument i for which it wants to create a MyClass instance. A daemon thread running in the main process, function create_MyClass for which manager is defined, receives this argument, creates the desired class instance and sends the result back. Essentially create_MyClass behaves like a factory "method". The manner in which this "method" is "called", i.e. sending a message via a Pipe-created connection to a thread running in a different process, is actually similar to what happens when you make a method call on a managed class's proxy reference.
from multiprocessing.managers import BaseManager, NamespaceProxy
from multiprocessing import Pool, Pipe, Lock
from threading import Thread

class MyManager(BaseManager):
    pass

class MyClass():
    def __init__(self, i):
        self.i = i

    def get_i(self):
        return self.i

class MyClassProxy(NamespaceProxy):
    _exposed_ = ('__getattribute__', '__setattr__', '__delattr__', 'get_i')

    def get_i(self):
        return self._callmethod('get_i')

def init_pool(the_connection, the_lock):
    global connection, lock
    connection = the_connection
    lock = the_lock

def my_fun(i):
    with lock:
        connection.send(i) # send argument
        my_class = connection.recv() # get result
    return my_class

def create_MyClass(connection):
    while True:
        i = connection.recv()
        my_class = manager.MyClass(i)
        connection.send(my_class)

if __name__ == '__main__':
    MyManager.register('MyClass', MyClass, MyClassProxy)
    manager = MyManager()
    manager.start()
    lock = Lock()
    connection1, connection2 = Pipe(duplex=True)
    # Give one of the bi-directional connections to the daemon thread:
    Thread(target=create_MyClass, args=(connection1,), daemon=True).start()
    # Initialize each process in the pool with the other bi-directional connection
    # and a lock to ensure single-threading of the requests:
    with Pool(4, initializer=init_pool, initargs=(connection2, lock)) as p:
        res = [r.i for r in p.map(my_fun, list(range(10)))]
    print(res)
Prints:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Wrote script in OSX, with multiprocessing. Now windows won't play ball

The program/script I've made works on OSX and Linux. It uses selenium to scrape data from some pages, manipulates the data and saves it. In order to be more efficient, I included the multiprocessing pool and manager. I create a pool, and for each item in a list it calls the scrape class, starts a phantomjs instance and scrapes. Since I'm using multiprocessing.pool and I want a way to pass data between the threads, I read that multiprocessing.manager was the way forward. If I wrote
manager = Manager()
info = manager.dict([])
it would create a dict that could be accessed by all threads. It all worked perfectly.
My issue is that the client wants to run this on a Windows machine (I wrote the entire thing on OSX). I assumed it would be as simple as installing python and selenium and launching it. I had errors which later led me to writing if __name__ == '__main__': at the top of my main.py file and indenting everything to be inside. The issue is, when I have class scrape(): outside of the if statement, it cannot see the global info, since it is declared outside of the scope. If I insert the class scrape(): inside the if __name__ == '__main__': then I get an attribute error saying
AttributeError: 'module' object has no attribute 'scrape'
And if I go back to declaring manager = Manager() and info = manager.dict([]) outside of the if __name__ == '__main__' then I get the error in Windows about making sure I use if __name__ == '__main__'. It doesn't seem like I can win with this project at the moment.
Code Layout...
Imports...
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__():
    def...

def scrape_items():  # This contains code which creates a pool and then pool.map(do_scrape, s); s = a list of items
def save_scrape():
def update_price():
def main():

main()
Basically, the scrape_items is called by main, then scrape_items uses pool.map(do_scrape, s) so it calls the do_scrape class and passes the list of items to it one by one. The do_scrape then scrapes a web page based on the item url in "s" then saves that info in the global info which is the multiprocessing.manager dict. The above code does not show any if __name__ == '__main__': statements, it is an outline of how it works on my OSX setup. It runs and completes the task as is. If someone could issue a few pointers, I would appreciate it. Thanks
It would be helpful to see your code, but it sounds like you just need to explicitly pass your shared dict to scrape, like this:
import multiprocessing
from functools import partial

def scrape(info, item):
    # Use info in here
    pass

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    info = manager.dict()
    pool = multiprocessing.Pool()
    func = partial(scrape, info) # use a partial to make it easy to pass the dict to pool.map
    items = [1,2,3,4,5] # This would be your actual data
    results = pool.map(func, items)
    #pool.apply_async(scrape, [shared_dict, "abc"]) # In case you're not using map...
Note that you shouldn't put all your code inside the if __name__ == "__main__": guard, just the code that actually creates processes via multiprocessing. This includes creating the Manager and the Pool.
Any method you want to run in a child process must be declared at the top level of the module, because it has to be importable from __main__ in the child process. When you declared scrape inside the if __name__ ... guard, it could no longer be imported from the __main__ module, so you saw the AttributeError: 'module' object has no attribute 'scrape' error.
Edit:
Taking your example:
import datetime
import multiprocessing
from functools import partial

date = str(datetime.date.today())

#class do_scrape():
#    def __init__():
#    def...

def do_scrape(info, s):
    # do stuff
    # Also note that do_scrape should probably be a function, not a class
    pass

def scrape_items():
    # scrape_items is called by main(), which is protected by a `if __name__ ...` guard
    # so this is ok.
    manager = multiprocessing.Manager()
    info = manager.dict([])
    pool = multiprocessing.Pool()
    func = partial(do_scrape, info)
    s = [1,2,3,4,5] # Substitute with the real s
    results = pool.map(func, s)

def save_scrape(): ...   # body not shown in the question
def update_price(): ...  # body not shown in the question

def main():
    scrape_items()

if __name__ == "__main__":
    # Note that you can declare manager and info here, instead of in scrape_items, if you wanted
    #manager = multiprocessing.Manager()
    #info = manager.dict([])
    main()
One other important note here is that the first argument to map should be a function, not a class. This follows from the docs: Pool.map is meant to be the parallel equivalent of the built-in map, which applies a function to every item.
Find the starting point of your program, and make sure you wrap only that with your if statement. For example:
Imports...
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__():
    def...

def scrape_items():  # This contains code which creates a pool and then pool.map(do_scrape, s); s = a list of items
def save_scrape():
def update_price():
def main():

if __name__ == "__main__":
    main()
Essentially the contents of the if are only executed if you called this file directly when running your python code. If this file/module is included as an import from another file, all attributes will be defined, so you can access various attributes without actually beginning execution of the module.
Read more here:
What does if __name__ == "__main__": do?

python multiprocessing manager & composite pattern sharing

I'm trying to share a composite structure through a multiprocessing manager, but I ran into trouble with a "RuntimeError: maximum recursion depth exceeded" when trying to use just one of the Composite class methods.
The class is taken from code.activestate and was tested by me before inclusion into the manager.
When retrieving the class in a process and invoking its addChild() method I keep getting the RuntimeError, while outside the process it works.
The composite class inherits from a SpecialDict class, which implements a __getattr__() method.
Could it be that while calling addChild() the Python interpreter looks for a different __getattr__() because the right one is not proxied by the manager?
If so, it's not clear to me what the right way is to make a proxy for that class/method.
The following code reproduces exactly this condition:
1) this is the manager.py:
from multiprocessing.managers import BaseManager
from CompositeDict import *

class PlantPurchaser():
    def __init__(self):
        self.comp = CompositeDict('Comp')

    def get_cp(self):
        return self.comp

class Manager():
    def __init__(self):
        self.comp = QueuePurchaser().get_cp()
        BaseManager.register('get_comp', callable=lambda:self.comp)
        self.m = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        self.s = self.m.get_server()
        self.s.serve_forever()
2) I want to use the composite in this consumer.py:
from multiprocessing.managers import BaseManager

class Consumer():
    def __init__(self):
        BaseManager.register('get_comp')
        self.m = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        self.m.connect()
        self.comp = self.m.get_comp()
        ret = self.comp.addChild('consumer')
3) everything is launched by a controller.py:
from multiprocessing import Process

class Controller():
    def __init__(self):
        for child in _run_children():
            child.join()

def _run_children():
    from manager import Manager
    from consumer import Consumer as Consumer
    procs = (
        Process(target=Manager, name='Manager' ),
        Process(target=Consumer, name='Consumer'),
    )
    for proc in procs:
        proc.daemon = 1
        proc.start()
    return procs

c = Controller()
Take a look at these related questions on how to write a proxy for the CompositeDict() class, as suggested by AlberT.
The solution given by tgray works but cannot avoid race conditions.
Is it possible there is a circular reference between the classes? For example, the outer class has a reference to the composite class, and the composite class has a reference back to the outer class.
The multiprocessing manager works well, but when you have large, complicated class structures, then you are likely to run into an error where a type/reference can not be serialized correctly. The other problem is that errors from multiprocessing manager are very cryptic. This makes debugging failure conditions even more difficult.
I think the problem is that you have to instruct the Manager on how to manage your object, which is not a standard Python type.
In other words, you have to create a proxy for your CompositeDict.
You could look at this doc for an example: http://ruffus.googlecode.com/svn/trunk/doc/html/sharing_data_across_jobs_example.html
Python has a default maximum recursion depth of 1000. But you can change the default behavior thusly:
import sys
sys.setrecursionlimit(n)
Where n is the number of recursions you wish to allow.
Edit:
The above answer does nothing to solve the root cause of this problem (as pointed out in the comments). It only needs to be used if you are intentionally recursing more than 1000 times. If you are in an infinite loop (like in this problem), you will eventually hit whatever limit you set.
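The point of this edit can be checked with a short sketch: raising the limit only moves the point at which runaway recursion fails. The helper names are hypothetical, and the limits are kept modest on purpose, since very large values can crash the interpreter instead of raising RecursionError.

```python
import sys


def runaway():
    """Recursion with no base case."""
    return runaway()


def fails_with_limit(limit):
    """Run runaway() under a temporary recursion limit; True if it raised."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        runaway()
    except RecursionError:
        return True
    finally:
        # Always restore the limit that was in effect before the call.
        sys.setrecursionlimit(previous)
    return False


if __name__ == "__main__":
    base = sys.getrecursionlimit()
    print(fails_with_limit(base))        # fails at the current limit
    print(fails_with_limit(base + 500))  # still fails with a higher one
```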
To address your actual problem, I re-wrote your code from scratch starting as simply as I could make it and built it up to what I believe is what you want:
import sys
from multiprocessing import Process
from multiprocessing.managers import BaseManager
from CompositDict import *

class Shared():
    def __init__(self):
        self.comp = CompositeDict('Comp')

    def get_comp(self):
        return self.comp

    def set_comp(self, c):
        self.comp = c

class Manager():
    def __init__(self):
        shared = Shared()
        BaseManager.register('get_shared', callable=lambda:shared)
        mgr = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        srv = mgr.get_server()
        srv.serve_forever()

class Consumer():
    def __init__(self, child_name):
        BaseManager.register('get_shared')
        mgr = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        mgr.connect()
        shared = mgr.get_shared()
        comp = shared.get_comp()
        child = comp.addChild(child_name)
        shared.set_comp(comp)
        print comp

class Controller():
    def __init__(self):
        pass

    def main(self):
        m = Process(target=Manager, name='Manager')
        m.daemon = True
        m.start()
        consumers = []
        for i in xrange(3):
            p = Process(target=Consumer, name='Consumer', args=('Consumer_' + str(i),))
            p.daemon = True
            consumers.append(p)
        for c in consumers:
            c.start()
        for c in consumers:
            c.join()
        return 0

if __name__ == '__main__':
    con = Controller()
    sys.exit(con.main())
I did this all in one file, but you shouldn't have any trouble breaking it up.
I added a child_name argument to your consumer so that I could check that the CompositDict was getting updated.
Note that there is both a getter and a setter for your CompositDict object. When I only had a getter, each Consumer was overwriting the CompositDict when it added a child.
This is why I also changed your registered method to get_shared instead of get_comp, as you will want access to the setter as well as the getter within your Consumer class.
Also, I don't think you want to try joining your manager process, as it will "serve forever". If you look at the source for the BaseManager (./Lib/multiprocessing/managers.py:Line 144) you'll notice that the serve_forever() function puts you into an infinite loop that is only broken by KeyboardInterrupt or SystemExit.
Bottom line is that this code works without any recursive looping (as far as I can tell), but let me know if you still experience your error.
