Class variable becomes empty in multiprocessing Pool.map - Python

I have a class variable in a Utils class.
from collections import defaultdict

class Utils:
    _raw_data = defaultdict(list)

    @classmethod
    def raw_data(cls):
        return cls._raw_data.copy()

    @classmethod
    def set_raw_data(cls, key, data):
        cls._raw_data[key] = data
The _raw_data dictionary is filled with key/value pairs before it is read:
...
data = [ipaddress.IPv4Network(address) for address in ip_addresses]
Utils.set_raw_data(device_name, data)
But when I execute a function via multiprocessing Pool.map that reads raw_data from the Utils class, it returns an empty list.
This is the method in the parent class:
class Parent:
    ...
    def evaluate_without_prefix(self, devices):
        results = []
        print(Utils.raw_data())  # <-- this print shows that Utils.raw_data() is empty
        for network1, network2 in itertools.product(Utils.raw_data()[devices[0]], Utils.raw_data()[devices[1]]):
            if network1.subnet_of(network2):
                results.append((devices[0], network1, devices[1], network2))
            if network2.subnet_of(network1):
                results.append((devices[1], network2, devices[0], network1))
        return results
In the child class, I execute the method from the parent class with a multiprocessing pool:
class Child(Parent):
    ...
    def execute(self):
        pool = Pool(os.cpu_count() - 1)
        devices = list(itertools.combinations(list(Utils.raw_data().keys()), 2))
        results = pool.map(super().evaluate_without_prefix, devices)
        return results
The print() in the Parent class shows that raw_data() is empty, even though the variable actually has data. The devices variable in the Child class does get data from raw_data(), but once execution enters the multiprocessing pool, raw_data() becomes empty. Any reason for this?

The problem seems to be as follows:
The class data created in your main process must be serialized/deserialized using pickle so that it can be passed from the main process's address space to the address spaces of the pool processes that need to work with these objects. But the object being passed is an instance of class Parent, since you are calling one of its methods, i.e. evaluate_without_prefix. Nowhere in that instance is there a reference to class Utils or anything that would cause the multiprocessing pool to serialize the Utils class along with the Parent instance. Consequently, when that method references class Utils in any of the pool processes, the module is re-imported and a new, empty Utils class is created; of course, its dictionary will not have been initialized.
I think the simplest change is to:
Make attribute _raw_data an instance attribute rather than a class attribute (by the way, according to your current usage, there is no need for this to be a defaultdict).
Create an instance of class Utils named utils and initialize the dictionary via this reference.
Use the initializer and initargs arguments of the multiprocessing.Pool constructor so that each process in the pool has a global variable named utils that is a copy of the utils instance created by the main process.
So I would organize the code along the following lines:
import ipaddress
import itertools
import os
from multiprocessing import Pool

class Utils:
    def __init__(self):
        self._raw_data = {}

    def raw_data(self):
        # No need to make a copy ???
        return self._raw_data.copy()

    def set_raw_data(self, key, data):
        self._raw_data[key] = data

def init_processes(utils_instance):
    """
    Initialize each process in the process pool with global variable utils.
    """
    global utils
    utils = utils_instance

class Parent:
    ...
    def evaluate_without_prefix(self, devices):
        results = []
        print(utils.raw_data())
        for network1, network2 in itertools.product(utils.raw_data()[devices[0]], utils.raw_data()[devices[1]]):
            results.append([network1, network2])
        return results

class Child(Parent):
    ...
    def execute(self, utils):
        pool = Pool(os.cpu_count() - 1, initializer=init_processes, initargs=(utils,))
        # No need to make an explicit list (map will do that for you) ???
        devices = list(itertools.combinations(list(utils.raw_data().keys()), 2))
        results = pool.map(super().evaluate_without_prefix, devices)
        return results

def main():
    utils = Utils()
    # Initialize utils:
    ...
    data = [ipaddress.IPv4Network(address) for address in ip_addresses]
    utils.set_raw_data(device_name, data)
    child = Child()
    results = child.execute(utils)

if __name__ == '__main__':
    main()
Further Explanation
The following program's main process calls class method Foo.set_x to update class attribute x to the value of 10 before creating a multiprocessing pool and invoking worker function worker, which prints out the value of Foo.x.
On Windows, which uses OS spawn to create new processes, each process in the pool is initialized prior to calling the worker function essentially by launching a new Python interpreter and re-executing the source program, which runs every statement at global scope. Hence the class definition of Foo is re-created by the Python interpreter compiling it; there is no pickling involved. But since Foo.set_x(10) is only executed inside the if __name__ == '__main__': block of the main process, Foo.x will be 0 in the pool process.
The same program run on Linux, which uses OS fork to create new processes, gives each pool process a copy-on-write view of the main process's address space. Therefore, it will have a copy of the Foo class as it existed at the time the multiprocessing pool was created, and Foo.x will be 10.
My solution above, which uses a pool initializer to set a global variable in each pool process's address space to a copy of the Utils instance, is what is required on Windows and will also work on Linux. An alternative, of course, is to pass the Utils instance as an additional argument to your worker function instead of using a pool initializer, but this is generally less efficient: the number of processes in the pool is usually smaller than the number of times the worker function is invoked, so the pool-initializer method requires less pickling.
from multiprocessing import Pool

class Foo:
    x = 0

    @classmethod
    def set_x(cls, x):
        cls.x = x

def worker():
    print(Foo.x)

if __name__ == '__main__':
    Foo.set_x(10)
    pool = Pool(1)
    pool.apply(worker)
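For completeness, the same pool-initializer technique from the solution above can be applied to make this Foo example print 10 on both platforms. This is a minimal sketch under that assumption, reusing the Foo class just shown:

from multiprocessing import Pool

class Foo:
    x = 0

    @classmethod
    def set_x(cls, x):
        cls.x = x

def init_pool(x):
    # Runs once in each pool process before any task is executed.
    Foo.set_x(x)

def worker():
    print(Foo.x)

if __name__ == '__main__':
    Foo.set_x(10)
    # Re-apply the value in every pool process via the initializer.
    pool = Pool(1, initializer=init_pool, initargs=(Foo.x,))
    pool.apply(worker)  # prints 10 on both Windows and Linux

Passing the value to worker as an extra argument (for example with functools.partial) would also work, but as noted above it gets pickled once per task rather than once per pool process.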

Related

Is there a way to sync a serializable structure with python multiprocessing?

If you create a new Process in Python, it will serialize and copy the entire available scope, as far as I understand it. multiprocessing.Pipe() also allows sending various things, not just raw bytes.
However, instead of sending, I simply want to update a variable that contains a simple POD object like this:
class MyStats:
    def __init__(self):
        self.bytes_read = 0
        self.bytes_written = 0
So say that in a process, when I update these stats, I want to tell Python to serialize them and send them to the parent process somehow. I don't want to have to create a multiprocessing.Value for each and every one of these things; that sounds super tedious.
Is there a way to tell Python to pass and overwrite a specific object property somehow?
A manager is what you need here: it will be slower, but all data stored inside will be automatically synced with other processes. Here is a simple example:
from multiprocessing.managers import BaseManager, public_methods, NamespaceProxy
from multiprocessing import Process

def make_proxy(name, cls, base=None):
    """
    Args:
        name : A string that should match the variable name the proxy will be assigned to
        cls  : The class for which you want to create a proxy
        base : If you are subclassing NamespaceProxy (or any other implementation) and want to use that subclass as the
               base for this new proxy, then pass the subclass as the base using this argument
    """
    exposed = public_methods(cls) + ['__getattribute__', '__setattr__', '__delattr__']
    return _MakeProxyType(name, exposed, base)

def _MakeProxyType(name, exposed, base=None):
    """
    Attempts to replicate multiprocessing.managers.MakeProxyType properly
    """
    if base is None:
        base = NamespaceProxy
    exposed = tuple(exposed)
    dic = {}
    for meth in exposed:
        if hasattr(base, meth):
            continue
        exec('''def %s(self, *args, **kwds):
        return self._callmethod(%r, args, kwds)''' % (meth, meth), dic)
    ProxyType = type(name, (base,), dic)
    ProxyType._exposed_ = exposed
    return ProxyType

class MyStats:
    def __init__(self):
        self.bytes_read = 0
        self.bytes_written = 0

def worker(my_stats):
    my_stats.bytes_read = 100
    print("Worker process read 100 bytes!")

# Remember to set the name of the variable and the "name" argument to the same value, otherwise you will have trouble
# pickling this. If for some reason you cannot do this, then you must change the variable's __qualname__ property to
# reflect where the object actually resides so pickle can find it.
MyStatsProxy = make_proxy('MyStatsProxy', MyStats)

if __name__ == "__main__":
    # Register our proxy and start the manager process
    BaseManager.register("MyStats", MyStats, MyStatsProxy)
    manager = BaseManager()
    manager.start()

    # Create our shared instance and modify it from another process
    my_stats = manager.MyStats()
    p = Process(target=worker, args=(my_stats,))
    p.start()
    p.join()

    # Check value from main process
    print(f"In main process, bytes read are {my_stats.bytes_read}!")
Output
Worker process read 100 bytes!
In main process, bytes read are 100!
Check this question and its answers for more useful information about managers, registering classes, and alternate methods to achieve the same result.
Note: Keep in mind that managers return pickled values for all objects you access through them. So any modifications to mutable objects should be done from within an instance method rather than by requesting the mutable object through the proxy and modifying it from outside. For example, the following will not modify the attribute some_list in the manager at all; only the local (per-process) copy of this attribute will be modified:
my_stats.some_list[0] = "some value"
Instead, you should create an instance method that performs the modification and call that:
my_stats.modify_list(0, "some value")
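As an illustration, a mutator method on MyStats might look like the sketch below; modify_list and some_list are hypothetical names used only for this example and are not part of the class defined above. With the make_proxy helper shown earlier, any new public method is picked up by public_methods() and exposed automatically.

class MyStats:
    def __init__(self):
        self.bytes_read = 0
        self.bytes_written = 0
        self.some_list = [0, 0, 0]  # hypothetical mutable attribute

    def modify_list(self, index, value):
        # Executes inside the manager process, so it mutates the authoritative
        # copy of the object rather than a pickled local copy.
        self.some_list[index] = value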
Alternatively, you can also force the manager to update the mutable object by re-assigning the new value for the object:
local_copy = my_stats.some_list
local_copy[0] = "some value"
my_stats.some_list = local_copy

Multiprocessing Manager can't be passed to another process

I need to pass a Manager instance to other processes, because I need proxy objects to be created in parallel and later re-used in separate processes. However, it appears that I can't pass a Manager as an argument to a function that is to be run by another process. See this example:
from multiprocessing.managers import BaseManager
from multiprocessing import Pool
from functools import partial

class MyManager(BaseManager):
    pass

class MyClass():
    def __init__(self, i):
        self.i = i

def my_fun(i, manager):
    return manager.MyClass(i)

MyManager.register('MyClass', MyClass)
manager = MyManager()
manager.start()

f = partial(my_fun, manager=manager)
with Pool(4) as p:
    res = [r.i for r in p.map(f, list(range(10)))]
print(res)
The following exception will arise if I run the code above:
TypeError: Pickling an AuthenticationString object is disallowed for security reasons
Interestingly, passing the Manager via the args argument of a Process works, but I still need the map functionality.
First of all, the proxy that is automatically generated for your class does not support access to attributes. So if you want to access the i attribute of your managed class, you will need to explicitly define your own proxy class. It is easier to just define, for example, a method get_i that returns that attribute. I would typically define the get_i method in a subclass of the original class, created just for the purpose of being used as the managed class implementation. In the code below I have defined such a method (although I have not bothered to create a special subclass) and a custom proxy class to show you how you would do this.
I just see no way of passing the manager instance to another process. The solution I came up with (there may be better ones) is to create a thread that will accept requests via the connections exposed by a multiprocessing.Pipe instance. You will need to enforce single threading of these requests not only because you cannot have multiple processes sending to the same connection concurrently but also because it is the only way to ensure that the response a requestor gets back matches up with its request.
The idea is that the my_fun function sends via its connection the argument i for which it wants to create a MyClass instance. A daemon thread running in the main process, function create_MyClass for which manager is defined, receives this argument, creates the desired class instance and sends the result back. Essentially create_MyClass behaves like a factory "method". The manner in which this "method" is "called", i.e. sending a message via a Pipe-created connection to a thread running in a different process, is actually similar to what happens when you make a method call on a managed class's proxy reference.
from multiprocessing.managers import BaseManager, NamespaceProxy
from multiprocessing import Pool, Pipe, Lock
from threading import Thread

class MyManager(BaseManager):
    pass

class MyClass():
    def __init__(self, i):
        self.i = i

    def get_i(self):
        return self.i

class MyClassProxy(NamespaceProxy):
    _exposed_ = ('__getattribute__', '__setattr__', '__delattr__', 'get_i')

    def get_i(self):
        return self._callmethod('get_i')

def init_pool(the_connection, the_lock):
    global connection, lock
    connection = the_connection
    lock = the_lock

def my_fun(i):
    with lock:
        connection.send(i)            # send argument
        my_class = connection.recv()  # get result
    return my_class

def create_MyClass(connection):
    while True:
        i = connection.recv()
        my_class = manager.MyClass(i)
        connection.send(my_class)

if __name__ == '__main__':
    MyManager.register('MyClass', MyClass, MyClassProxy)
    manager = MyManager()
    manager.start()
    lock = Lock()
    connection1, connection2 = Pipe(duplex=True)
    # Give one of the bi-directional connections to the daemon thread:
    Thread(target=create_MyClass, args=(connection1,), daemon=True).start()
    # Initialize each process in the pool with the other bi-directional connection
    # and a lock to ensure single-threading of the requests:
    with Pool(4, initializer=init_pool, initargs=(connection2, lock)) as p:
        res = [r.i for r in p.map(my_fun, list(range(10)))]
    print(res)
Prints:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

How to make use of a multiprocessing manager within a class

To start with, here is some code that works
from multiprocessing import Pool, Manager
import random

manager = Manager()
dct = manager.dict()

def do_thing(n):
    for i in range(10_000_000):
        i += 1
    dct[n] = random.randint(0, 9)

with Pool(2) as pool:
    pool.map(do_thing, range(10))
Now if I try to make a class out of this:
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        self.manager = Manager()
        self.dct = self.manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
I run into: TypeError: Pickling an AuthenticationString object is disallowed for security reasons. Now from here, I get the hint that Python is trying to pickle the Manager which as I understand has its own dedicated process, and processes can't be pickled because they contain an AuthenticationString.
I don't know enough about how forking works (I'm on Linux, so I understand this is the default method for starting new processes) to understand exactly why the Manager instance needs to be pickled.
So here are my questions:
Why is this happening?
How can I use a Manager when doing multiprocessing within a class? PS: I want to be able to import SomeClass from this module.
Is what I'm asking for unreasonable or unconventional?
PS: I know I can do this exact snippet without the Manager by exploiting the fact that pool.map will return results in order, so something like this: res = pool.map(self.do_thing, range(10)) then dct = {k: v for k, v in zip(range(10), res)}. But that's beside the point of the question.
To answer your questions:
Q1 - Why is this happening?
Each worker process created by Pool.map() needs to execute the instance method self.do_thing(). In order to do that, Python pickles the instance and passes it to the subprocess (which unpickles it). If the instance holds a Manager, that is a problem, because Managers are not picklable. Part of the unpickling process involves importing the module that defines the class and restoring the instance's attributes (which were also pickled).
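To see the point in isolation, here is a minimal sketch (Holder is just an illustrative name): pickling any object that holds a Manager raises the same TypeError, even with no pool involved.

import pickle
from multiprocessing import Manager

class Holder:
    def __init__(self):
        self.manager = Manager()  # the manager owns a separate server process

if __name__ == '__main__':
    try:
        pickle.dumps(Holder())
    except TypeError as e:
        # Pickling an AuthenticationString object is disallowed for security reasons
        print(e)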
Q2 - How to fix it
You can avoid the problem by having the class create its own class-level Manager, shared by all instances of the class. Here the __init__() method creates the manager class attribute the first time an instance is created; from that point on, further instances reuse it. This is sometimes called "lazy initialization":
from multiprocessing import Pool, Manager
import random

class SomeClass:
    def __init__(self):
        # Lazy creation of class attribute.
        try:
            manager = getattr(type(self), 'manager')
        except AttributeError:
            manager = type(self).manager = Manager()
        self.dct = manager.dict()

    def __call__(self):
        with Pool(2) as pool:
            pool.map(self.do_thing, range(10))
        print('done')

    def do_thing(self, n):
        for i in range(10_000_000):
            i += 1
        self.dct[n] = random.randint(0, 9)

if __name__ == '__main__':
    inst = SomeClass()
    inst()
Q3 - Is this a reasonable thing to do?
In my opinion, yes.

python mock get calling object

I have a UUT class which instantiates Worker objects, and calls their do_stuff() method.
The Worker objects use a Provider object for two things:
Calls methods on the provider object to do some stuff
Gets notifications from the provider by subscribing a method to the provider's events
When a worker gets a notification, it processes it and notifies the UUT object, which in response can create more Worker objects.
I've already tested each class on its own, and I want to test UUT+Worker together. For that, I intend to mock out Provider.
import mock
import unittest
import provider

class Worker():
    def __init__(self, *args):
        resource.default_resource.subscribe('on_spam', self._on_spam)  # I'm going to patch 'resource.default_resource'

    def do_stuff(self):
        self.resource.do_stuff()

    def _on_spam(self, message):
        self._tell_uut_to_create_more_workers(message['num_of_new_workers_to_create'])

class UUT():
    def __init__(self, *args):
        self._workers = []

    def gen_worker_and_do_stuff(self, *args):
        worker = Worker(*args)
        self._workers.append(worker)
        worker.do_stuff()

class TestCase1(unittest.TestCase):
    @mock.patch('resource.default_resource', spec_set=resource.Resource)
    def test_1(self, mock_resource):
        uut = UUT()
        uut.gen_worker_and_do_stuff('Egg')  # <-- say I automagically grabbed the resulting Worker into self.workers
        self.workers[0]._on_spam({'num_of_new_workers_to_create': 5})  # <-- I also want to get hold of the newly-created workers
Is there a way to grab the worker objects generated by uut, without directly accessing the _workers list in uut (which is an implementation detail)?
I guess I can do it in Worker.__init__, where the worker subscribes to provider events, so I guess the question reduces to:
How do I extract the self of the callee when calling resource.default_resource.subscribe('on_spam', self._on_spam)?
As an application of the Dependency Inversion principle, I'd pass the Worker class as a dependency to UUT:
class UUT():
    def __init__(self, make_worker=Worker):
        self._workers = []
        self._make_worker = make_worker

    def gen_worker_and_connect(self, *args):
        worker = self._make_worker(*args)
        self._workers.append(worker)
        worker.connect()
Then, from the test, provide anything you want instead of Worker. This factory function can share the created objects with the test scope. Besides solving this particular problem, it also makes the dependency explicit and independent of the UUT implementation. You also would not need to mock the resource at all, which otherwise makes the test depend on things unrelated to the class under test.
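As an illustration of how a test might capture the created workers, here is a minimal sketch; it assumes the UUT class from the snippet above, and FakeWorker is a hypothetical stand-in defined only for the test:

import unittest

class FakeWorker:
    def __init__(self, *args):
        self.args = args
        self.connected = False

    def connect(self):
        self.connected = True

class TestUUT(unittest.TestCase):
    def test_workers_are_created_and_connected(self):
        created = []

        def make_worker(*args):
            worker = FakeWorker(*args)
            created.append(worker)  # expose the new worker to the test scope
            return worker

        uut = UUT(make_worker=make_worker)
        uut.gen_worker_and_connect('Egg')

        self.assertEqual(len(created), 1)
        self.assertTrue(created[0].connected)

if __name__ == '__main__':
    unittest.main()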

python multiprocessing manager & composite pattern sharing

I'm trying to share a composite structure through a multiprocessing manager, but I ran into trouble with a "RuntimeError: maximum recursion depth exceeded" when trying to use just one of the Composite class methods.
The class is taken from code.activestate and was tested by me before inclusion into the manager.
When retrieving the class into a process and invoking its addChild() method, I kept getting the RuntimeError, while outside the process it works.
The composite class inherits from a SpecialDict class that implements a __getattr__() method.
Could it be that while calling addChild() the Python interpreter looks up a different __getattr__() because the right one is not proxied by the manager?
If so, it's not clear to me what the right way is to make a proxy to that class/method.
The following code reproduces exactly this condition:
1) This is manager.py:
from multiprocessing.managers import BaseManager
from CompositeDict import *

class PlantPurchaser():
    def __init__(self):
        self.comp = CompositeDict('Comp')

    def get_cp(self):
        return self.comp

class Manager():
    def __init__(self):
        self.comp = PlantPurchaser().get_cp()
        BaseManager.register('get_comp', callable=lambda: self.comp)
        self.m = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        self.s = self.m.get_server()
        self.s.serve_forever()
2) I want to use the composite in this consumer.py:
from multiprocessing.managers import BaseManager

class Consumer():
    def __init__(self):
        BaseManager.register('get_comp')
        self.m = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        self.m.connect()
        self.comp = self.m.get_comp()
        ret = self.comp.addChild('consumer')
3) Everything is launched by controller.py:
from multiprocessing import Process

class Controller():
    def __init__(self):
        for child in _run_children():
            child.join()

def _run_children():
    from manager import Manager
    from consumer import Consumer as Consumer
    procs = (
        Process(target=Manager, name='Manager'),
        Process(target=Consumer, name='Consumer'),
    )
    for proc in procs:
        proc.daemon = 1
        proc.start()
    return procs

c = Controller()
Take a look at this related question on how to make a proxy for the CompositeDict() class, as suggested by AlberT.
The solution given by tgray works but cannot avoid race conditions.
Is it possible there is a circular reference between the classes? For example, the outer class has a reference to the composite class, and the composite class has a reference back to the outer class.
The multiprocessing manager works well, but when you have large, complicated class structures, you are likely to run into an error where a type/reference cannot be serialized correctly. The other problem is that errors from the multiprocessing manager are very cryptic. This makes debugging failure conditions even more difficult.
I think the problem is that you have to instruct the Manager on how to manage your object, which is not a standard Python type.
In other words, you have to create a proxy for your CompositeDict.
You could look at this doc for an example: http://ruffus.googlecode.com/svn/trunk/doc/html/sharing_data_across_jobs_example.html
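A minimal sketch of what such a registration might look like, assuming CompositeDict comes from the CompositeDict module used above and that addChild is the method you need to call through the proxy:

from multiprocessing.managers import BaseManager, NamespaceProxy
from CompositeDict import CompositeDict

class CompositeDictProxy(NamespaceProxy):
    # Expose attribute access plus the method(s) actually called remotely.
    _exposed_ = ('__getattribute__', '__setattr__', '__delattr__', 'addChild')

    def addChild(self, name):
        return self._callmethod('addChild', (name,))

BaseManager.register('CompositeDict', CompositeDict, proxytype=CompositeDictProxy)

Registered this way, a consumer would call manager.CompositeDict() to obtain a proxy whose addChild calls are forwarded to the manager process instead of resolving through the inherited __getattr__ locally.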
Python has a default maximum recursion depth of 1000 (or 999, I forget...). But you can change the default behavior thusly:
import sys
sys.setrecursionlimit(n)
Where n is the number of recursions you wish to allow.
Edit:
The above answer does nothing to solve the root cause of this problem (as pointed out in the comments). It only needs to be used if you are intentionally recursing more than 1000 times. If you are in an infinite loop (like in this problem), you will eventually hit whatever limit you set.
To address your actual problem, I re-wrote your code from scratch starting as simply as I could make it and built it up to what I believe is what you want:
import sys
from multiprocessing import Process
from multiprocessing.managers import BaseManager
from CompositeDict import *

class Shared():
    def __init__(self):
        self.comp = CompositeDict('Comp')

    def get_comp(self):
        return self.comp

    def set_comp(self, c):
        self.comp = c

class Manager():
    def __init__(self):
        shared = Shared()
        BaseManager.register('get_shared', callable=lambda: shared)
        mgr = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        srv = mgr.get_server()
        srv.serve_forever()

class Consumer():
    def __init__(self, child_name):
        BaseManager.register('get_shared')
        mgr = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
        mgr.connect()
        shared = mgr.get_shared()
        comp = shared.get_comp()
        child = comp.addChild(child_name)
        shared.set_comp(comp)
        print(comp)

class Controller():
    def __init__(self):
        pass

    def main(self):
        m = Process(target=Manager, name='Manager')
        m.daemon = True
        m.start()
        consumers = []
        for i in range(3):
            p = Process(target=Consumer, name='Consumer', args=('Consumer_' + str(i),))
            p.daemon = True
            consumers.append(p)
        for c in consumers:
            c.start()
        for c in consumers:
            c.join()
        return 0

if __name__ == '__main__':
    con = Controller()
    sys.exit(con.main())
I did this all in one file, but you shouldn't have any trouble breaking it up.
I added a child_name argument to your consumer so that I could check that the CompositeDict was getting updated.
Note that there is both a getter and a setter for your CompositeDict object. When I only had a getter, each Consumer was overwriting the CompositeDict when it added a child.
This is why I also changed your registered method to get_shared instead of get_comp, as you will want access to the setter as well as the getter within your Consumer class.
Also, I don't think you want to try joining your manager process, as it will "serve forever". If you look at the source for the BaseManager (./Lib/multiprocessing/managers.py:Line 144) you'll notice that the serve_forever() function puts you into an infinite loop that is only broken by KeyboardInterrupt or SystemExit.
Bottom line is that this code works without any recursive looping (as far as I can tell), but let me know if you still experience your error.
