I am trying to subclass multiprocessing.JoinableQueue so I can keep track of jobs that were skipped instead of completed. I am using a JoinableQueue to pass jobs to a set of multiprocessing.Process workers, and I have a threading.Thread populating the queue. Here is my implementation attempt:
import multiprocessing

class InputJobQueue(multiprocessing.JoinableQueue):

    def __init__(self, max_size):
        super(InputJobQueue, self).__init__(0)
        self._max_size = max_size
        self._skipped_job_count = 0

    def isFull(self):
        return self.qsize() >= self._max_size

    def taskSkipped(self):
        self._skipped_job_count += 1
        self.task_done()
However, I run into this issue documented here:
    class InputJobQueue(multiprocessing.JoinableQueue):
TypeError: Error when calling the metaclass bases
    function() argument 1 must be code, not str
Looking at the code in multiprocessing I see that the actual class is in multiprocessing.queues. So I try to extend that class:
import multiprocessing.queues

class InputJobQueue(multiprocessing.queues.JoinableQueue):

    def __init__(self, max_size):
        super(InputJobQueue, self).__init__(0)
        self._max_size = max_size
        self._skipped_job_count = 0

    def isFull(self):
        return self.qsize() >= self._max_size

    def taskSkipped(self):
        self._skipped_job_count += 1
        self.task_done()
But I get inconsistent results: sometimes my custom attributes exist, other times they don't. E.g. the following error is reported in one of my worker Processes:
AttributeError: 'InputJobQueue' object has no attribute '_max_size'
What am I missing to subclass multiprocessing.JoinableQueue?
With multiprocessing, the way objects like JoinableQueue are magically shared between processes is by explicitly sharing the core sync objects, and pickling the "wrapper" stuff to pass over a pipe.
If you understand how pickling works, you can look at the source to JoinableQueue and see that it's using __getstate__/__setstate__. So, you just need to override those to add your own attributes. Something like this:
    def __getstate__(self):
        return super(InputJobQueue, self).__getstate__() + (self._max_size,)

    def __setstate__(self, state):
        super(InputJobQueue, self).__setstate__(state[:-1])
        self._max_size = state[-1]
I'm not promising this will actually work, since clearly these classes were not designed to be subclassed (the proposed fix for the bug you referenced is to document that the classes can't be subclassed and find a way to make the error messages nicer…). But it should get you past the particular problem you're having here.
You're trying to subclass a type that isn't meant to be subclassed. This requires you to depend on the internals of its implementation in two different ways (one of which is arguably a bug in the stdlib, but the other isn't). And this isn't necessary.
If the actual type is hidden under the covers, no code can actually expect you to be a formal subtype; as long as you duck-type as a queue, you're fine. Which you can do by delegating to a member:
class InputJobQueue(object):

    def __init__(self, max_size):
        self._jq = multiprocessing.JoinableQueue(0)
        self._max_size = max_size
        self._skipped_job_count = 0

    def __getattr__(self, name):
        return getattr(self._jq, name)

    # your overrides/new methods
(It would probably be cleaner to explicitly delegate only the documented methods of JoinableQueue than to __getattr__-delegate everything, but in the interests of brevity, I did the shorter version.)
It doesn't matter whether that constructor is a function or a class, because the only thing you're doing is calling it. It doesn't matter how the actual type is pickled, because a class is only responsible for identifying its members, not knowing how to pickle them. All of your problems go away.
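For reference, here is a minimal sketch of the explicit-delegation variant mentioned above. It forwards only the documented JoinableQueue methods a worker is likely to need (method names and defaults follow the standard queue API; everything else is illustrative):

import multiprocessing

class InputJobQueue(object):

    def __init__(self, max_size):
        self._jq = multiprocessing.JoinableQueue(0)
        self._max_size = max_size
        self._skipped_job_count = 0

    # explicit delegation of the documented queue API
    def put(self, item, block=True, timeout=None):
        self._jq.put(item, block, timeout)

    def get(self, block=True, timeout=None):
        return self._jq.get(block, timeout)

    def task_done(self):
        self._jq.task_done()

    def join(self):
        self._jq.join()

    def qsize(self):
        return self._jq.qsize()

    # new behaviour
    def isFull(self):
        return self.qsize() >= self._max_size

    def taskSkipped(self):
        self._skipped_job_count += 1
        self.task_done()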
Related
I have a couple of classes that have the same methods but do things slightly differently (WorkerOne, WorkerTwo). Those classes inherit from an abstract base class using the abc module and the @abstractmethod decorator for the methods that should be implemented in WorkerOne and WorkerTwo.
Note: The actual question comes at the end.
Here's the shortened code:
from abc import ABCMeta, abstractmethod

class AbstractWorker(metaclass=ABCMeta):

    @abstractmethod
    def log_value(self, value):
        pass

class WorkerOne(AbstractWorker):
    def log_value(self, value):
        ...  # do something differently

class WorkerTwo(AbstractWorker):
    def log_value(self, value):
        ...  # do something differently
This works fine and I can create objects for both worker classes and execute the functions accordingly.
E.g.
worker_one = WorkerOne()
worker_two = WorkerTwo()
worker_one.log_value(1)
worker_two.log_value('text')
Please note that this is simplified. Each worker uses a different package to track experiments in the ML field and doesn't just differentiate between int and str.
I've been trying to find a way to avoid calling both objects every single time I want to log something. I want to unify these methods in some sort of wrapper class that takes those two objects and executes the method called on the wrapper on each object. I call that wrapper a hive, as it contains its workers.
Currently, I see two solutions to this but both are lacking quality. The first would be the easier one but results in duplication of code. It is simple and it works but it doesn't follow the DRY principle.
Solution #1:
from typing import List

class HiveSimple(AbstractWorker):

    def __init__(self, workers: List[AbstractWorker]):
        self.workers = workers

    def log_value(self, value):
        for worker in self.workers:
            worker.log_value(value)

    ...
The idea is to have the wrapper/hive class inherit from the abstract class as well, so we are forced to implement the functions. The workers are passed as a list when creating the object. In log_value we iterate through the list of workers and execute each worker's own implementation of that method. The problem, as briefly mentioned, is 1) duplicated code and 2) the hive class also grows, or needs to be altered, whenever a new method is added to the abstract base class.
The second solution is a bit more advanced and avoids duplicated code, but it also has a disadvantage.
Solution #2:
class Hive:

    def __init__(self, workers: List[AbstractWorker]):
        self.workers = workers
        self._ls_functions = []

    def __getattr__(self, name):
        for worker in self.workers:
            self._ls_functions.append(getattr(worker, name))
        return self.fn_executor

    def fn_executor(self, *args, **kwargs):
        for fn in self._ls_functions:
            fn(*args, **kwargs)
        self._ls_functions = []
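For illustration, a hive built from the two workers above would be used roughly like this (hypothetical usage, assuming the WorkerOne and WorkerTwo classes from the beginning):

hive = Hive([WorkerOne(), WorkerTwo()])
# __getattr__ collects each worker's log_value method,
# then fn_executor calls every collected method with the same arguments
hive.log_value(42)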
In this solution I make use of the __getattr__ function. If I call the log_value() function on the hive object (hive.log_value()), Python first looks for a log_value attribute/function on the hive itself. If the attribute does not exist, it enters the __getattr__ function and executes that code. There, I iterate through the list of workers and collect the functions with the same name. I then return the function fn_executor, because otherwise I wouldn't be able to hand over the parameters with which the log_value() function was called on the hive object. Although this works fine, the issue is that you need to know the parameters and the types beforehand. Since we don't use inheritance, we don't have the advantage of IntelliSense, because the functions are not members of the hive class. Makes sense.
So I wanted to mitigate that by adding functions as attributes during the __init__, which works.
Solution #2.1:
class Hive:

    def __init__(self, workers: List[AbstractWorker]):
        self.workers = workers
        self._ls_functions = []
        for fn_name in dir(AbstractWorker):
            if not (fn_name.startswith('__') or fn_name.startswith('_')):
                setattr(self, fn_name, self.fn_wrapper(str(fn_name)))

    def fn_wrapper(self, name):
        def fn(*params, **kwargs):
            return self.__getattr__(name)(*params, **kwargs)
        return fn

    def __getattr__(self, name):
        for worker in self.workers:
            self._ls_functions.append(getattr(worker, name))
        return self.fn_executor

    def fn_executor(self, *args, **kwargs):
        for fn in self._ls_functions:
            fn(*args, **kwargs)
        self._ls_functions = []
In solution #2.1 I try to fetch all functions from the abstract base class with dir(AbstractWorker), filter out dunder and "private" functions with the if, and set each function name as an attribute. Additionally, I assign a wrapper function (similar to partial or a decorator) that routes through the __getattr__ function. At runtime the members are correctly set, but since IntelliSense relies on static code analysis, it cannot handle this dynamic attribute assignment and as a result refuses to bring them up.
Now to the question:
What would be the best approach to create a wrapper/hive class that knows about the signature of the functions from the abstract base class but gets rid of the duplication of code shown in solution #1?
Let's assume I am using a library which gives me instances of classes defined in that library when calling its functions:
>>> from library import find_objects
>>> find_objects("name = any")
[SomeObject(name="foo"), SomeObject(name="bar")]
Let's further assume that I want to attach new attributes to these instances. For example a classifier to avoid running this code every time I want to classify the instance:
>>> from library import find_objects
>>> result = find_objects("name = any")
>>> for row in result:
...     row.item_class = my_classifier(row)
Note that this is contrived but illustrates the problem: I now have instances of the class SomeObject but the attribute item_class is not defined in that class and trips up the type-checker.
So when I now write:
print(result[0].item_class)
I get a typing error. It also trips up auto-completion in editors as the editor does not know that this attribute exists.
Not to mention that implementing it this way is quite ugly and hacky.
One thing I could do is create a subclass of SomeObject:
class ExtendedObject(SomeObject):
    item_class = None

    def classify(self):
        cls = do_something_with(self)
        self.item_class = cls
This now makes everything explicit, I get a chance to properly document the new attributes and give it proper type-hints. Everything is clean. However, as mentioned before, the actual instances are created inside library and I don't have control over the instantiation.
Side note: I ran into this issue in flask for the Response class. I noticed that flask actually offers a way to customise the instantiation using Flask.response_class. But I am still interested how this could be achieved in libraries that don't offer this injection seam.
One thing I could do is write a wrapper that does something like this:
class WrappedObject(SomeObject):
    item_class = None
    wrapped = None

    @classmethod
    def from_original(cls, wrapped):
        obj = cls()
        obj.wrapped = wrapped
        obj.item_class = do_something_with(wrapped)
        return obj

    def __getattribute__(self, key):
        # keep the new attributes local, forward everything else to the wrapped instance
        if key in ('item_class', 'wrapped'):
            return object.__getattribute__(self, key)
        return getattr(object.__getattribute__(self, 'wrapped'), key)
But this seems rather hacky and will not work in other programming languages.
Or try to copy the data:
from copy import deepcopy

class CopiedObject(SomeObject):
    item_class = None

    @classmethod
    def from_original(cls, wrapped):
        obj = cls()
        for key, value in vars(wrapped).items():
            setattr(obj, key, deepcopy(value))
        obj.item_class = do_something_with(wrapped)
        return obj

but this feels equally hacky, and is risky when the objects use properties and/or descriptors.
Are there any known "clean" patterns for something like this?
I would go with a variant of your WrappedObject approach, with the following adjustments:
I would not extend SomeObject: this is a case where composition feels more appropriate than inheritance.
With that in mind, from_original is unnecessary: you can have a proper __init__ method.
item_class should be an instance variable and not a class variable, and it should be initialized in your WrappedObject constructor.
Think twice before implementing __getattribute__ and forwarding everything to the wrapped object. If you need only a few methods and attributes of the original SomeObject class, it might be better to implement them explicitly as methods and properties.
class WrappedObject:

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.item_class = do_something_with(wrapped)

    def a_method(self):
        return self.wrapped.a_method()

    @property
    def a_property(self):
        return self.wrapped.a_property
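A hypothetical usage example, assuming the find_objects and my_classifier helpers from the question (do_something_with would call the classifier here):

from library import find_objects

result = [WrappedObject(obj) for obj in find_objects("name = any")]
print(result[0].item_class)   # a real, documented attribute the type checker can see
print(result[0].a_property)   # explicitly forwarded to the wrapped SomeObject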
I am using the functionality in the multiprocessing package to create synchronized shared objects. My objects have property attributes and are also context managers (i.e. they have __enter__ and __exit__ methods).
I've come across a curiosity where I can't make both work at the same time, at least with the recipes I found online, in both Python 2 and 3.
Suppose this simple class being registered into a manager:
class Obj(object):
    @property
    def a(self): return 1

    def __enter__(self): return self
    def __exit__(self, *args, **kw): pass
Normally both won't work because what we need is not exposed:
from multiprocessing.managers import BaseManager, NamespaceProxy
BaseManager.register('Obj', Obj)
m = BaseManager(); m.start();
o = m.Obj()
o.a # AttributeError: 'AutoProxy[Obj]' object has no attribute 'a'
with o: pass # AttributeError: __exit__
A solution I have found on SO that uses a custom proxy instead of AutoProxy works for the property but not the context manager (no matter if __enter__ and __exit__ is exposed this way or not):
class MyProxy(NamespaceProxy):
    _exposed_ = ['__getattribute__', '__setattr__', '__delattr__', 'a', '__enter__', '__exit__']
BaseManager.register('Obj', Obj, MyProxy)
m = BaseManager(); m.start();
o = m.Obj()
o.a # outputs 1
with o: pass # AttributeError: __exit__
I can make the context manager alone work by using the exposed keyword while registering:
BaseManager.register('Obj', Obj, exposed=['__enter__', '__exit__'])
m = BaseManager(); m.start();
o = m.Obj()
with o: pass # works
But if I also add the stuff for the property I get a max recursion error:
BaseManager.register('Obj', Obj, exposed=['__enter__', '__exit__', '__getattribute__', '__setattr__', '__delattr__', 'a'])
m = BaseManager(); m.start();
o = m.Obj() # RuntimeError: maximum recursion depth exceeded
If I leave out __getattribute__ and friends I see a as a bound method which tries to call the property value instead of the method itself, so that doesn't work either.
I have tried to mix and match in every way I could think of and couldn't find a solution. Is there a way of doing this or maybe this is a bug in the lib?
The fact is that the way these managers are implemented is focused on controlling access to the shared data in them, in the form of attributes. They don't do well with other Python features such as properties, or "dunder" methods that depend on the object's state, like __enter__ and __exit__.
It would certainly be possible to find specific workarounds for each needed feature by subclassing the Proxy object until each one works - but the result would never be bullet-proof for all corner cases, much less for all Python class features.
So, I think that in this case the best you can do is to create a simple, data-only class: one that just uses plain attributes - no properties, no descriptors, no attribute-access customization - just a plain data class whose instances hold the data you need to share. Actually, you may not even need such a class, since the managers module provides a synchronized dictionary type - just use that.
And then you create a second class where you build the intelligence you need. This second class can have getters, setters and properties, can implement the context-manager protocol and any dunder method you like, and gets hold of an associated instance of the data class. All the intelligence in the methods and properties can make use of the data in that instance. Actually, you might just use a multiprocessing.managers.SyncManager.dict synchronized dictionary to hold your data.
Then, if you make this associated data class managed, it will work in a straightforward and simple way, and in each process you build the "smart class" wrapping it.
Your code snippets don't show how you pass your objects from one process to the other - I hope you are aware that by calling BaseManager.Obj() you get new, local instances of your classes anyway - you have to use a Queue to share your objects across processes, regardless of the managers.
The proof of concept below shows an example of what I mean.
import time
from multiprocessing import Process, Pool
from multiprocessing.managers import SyncManager


class MySpecialClass:
    def __init__(self, data):
        self.data = data

    @property
    def a(self):
        return self.data["a"]

    def __enter__(self):
        return self

    def __exit__(self, ext_type, exc_value, traceback):
        pass


def worker(data):
    obj = MySpecialClass(data)
    for i in range(10):
        time.sleep(1)
        obj.data[i] = i ** 2


def main():
    m = SyncManager()
    m.start()
    data = m.dict()
    server_obj = MySpecialClass(data)
    p = Process(target=worker, args=(data,))
    p.start()
    for i in range(22):
        print(server_obj.data)
        time.sleep(.5)
    p.join()


main()
Keep in mind that if you need to coordinate your context-blocks across processes, due to some resources, you can pass manager.Lock() objects around as easily as the data dictionary above - even as a value in the dictionary - and it would then be ready to use inside the object's __enter__ method.
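A rough sketch of that lock idea, assuming a started SyncManager instance m like the one in the proof of concept above (the subclass and attribute names are just illustrative):

lock = m.Lock()  # a manager-backed lock proxy that can be passed to other processes

class LockedSpecialClass(MySpecialClass):
    def __init__(self, data, lock):
        super().__init__(data)
        self.lock = lock

    def __enter__(self):
        self.lock.acquire()   # coordinate the with-block across processes
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.lock.release()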
I'm trying to figure out what the following module is doing.
import Queue
import multiprocessing
import threading


class BufferedReadQueue(Queue.Queue):

    def __init__(self, lim=None):
        self.raw = multiprocessing.Queue(lim)
        self.__listener = threading.Thread(target=self.listen)
        self.__listener.setDaemon(True)
        self.__listener.start()
        Queue.Queue.__init__(self, lim)

    def listen(self):
        try:
            while True:
                self.put(self.raw.get())
        except:
            pass

    @property
    def buffered(self):
        return self.qsize()
It is only instantiated once in the calling code, and the .raw attribute, multiprocessing.Queue, gets sent to another class, which appears to inherit from multiprocessing.Process.
So as I'm seeing it, an attribute of BufferedReadQueue is being used as a Queue, but not the class (nor an instance of it) itself.
What would be a reason that BufferedReadQueue inherits from Queue.Queue and not just object, if it's not actually being used as a queue?
It looks like BufferedReadQueue is meant to be used as a way to convert the read end of a multiprocessing.Queue into a normal Queue.Queue. Note this in __init__:
self.__listener = threading.Thread(target=self.listen)
self.__listener.setDaemon(True)
self.__listener.start()
This starts up a listener thread, which just constantly tries to get items from the internal multiprocessing.Queue, and then puts all those items to self. It looks like the use-case is something like this:
def func(queue):
    queue.put('stuff')

...

buf_queue = BufferedReadQueue()
proc = multiprocessing.Process(target=func, args=(buf_queue.raw,))
proc.start()

out = buf_queue.get()  # Only get calls in the parent
Now, why would you do this instead of just using the multiprocessing.Queue directly? Probably because multiprocessing.Queue has some shortcomings that Queue.Queue doesn't. For example qsize(), which this BufferedReadQueue uses, is not reliable with multiprocessing.Queue:
Return the approximate size of the queue. Because of multithreading/multiprocessing semantics, this number is not reliable.
Note that this may raise NotImplementedError on Unix platforms like Mac OS X where sem_getvalue() is not implemented.
It's also possible to introspect a Queue.Queue, and peek at its contents without popping them. This isn't possible with a multiprocessing.Queue.
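For example, relying on the CPython implementation detail that Queue.Queue stores its items in a deque exposed as the queue attribute:

import Queue

q = Queue.Queue()
q.put('a')
q.put('b')
print(list(q.queue))  # ['a', 'b'] - inspected without removing anything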
I'm trying to write a class for a read-only object which will not really be copied by the copy module, and which, when pickled and transferred between processes, will have each process maintain no more than one copy of it, no matter how many times it is passed around as a "new" object. Is there already something like that?
I made an attempt to implement this. @Alex Martelli and anyone else, please give me comments/improvements. I think this will eventually end up on GitHub.
"""
todo: need to lock library to avoid thread trouble?
todo: need to raise an exception if we're getting pickled with
an old protocol?
todo: make it polite to other classes that use __new__. Therefore, should
probably work not only when there is only one item in the *args passed to new.
"""
import uuid
import weakref
library = weakref.WeakValueDictionary()
class UuidToken(object):
def __init__(self, uuid):
self.uuid = uuid
class PersistentReadOnlyObject(object):
def __new__(cls, *args, **kwargs):
if len(args)==1 and len(kwargs)==0 and isinstance(args[0], UuidToken):
received_uuid = args[0].uuid
else:
received_uuid = None
if received_uuid:
# This section is for when we are called at unpickling time
thing = library.pop(received_uuid, None)
if thing:
thing._PersistentReadOnlyObject__skip_setstate = True
return thing
else: # This object does not exist in our library yet; Let's add it
new_args = args[1:]
thing = super(PersistentReadOnlyObject, cls).__new__(cls,
*new_args,
**kwargs)
thing._PersistentReadOnlyObject__uuid = received_uuid
library[received_uuid] = thing
return thing
else:
# This section is for when we are called at normal creation time
thing = super(PersistentReadOnlyObject, cls).__new__(cls, *args,
**kwargs)
new_uuid = uuid.uuid4()
thing._PersistentReadOnlyObject__uuid = new_uuid
library[new_uuid] = thing
return thing
def __getstate__(self):
my_dict = dict(self.__dict__)
del my_dict["_PersistentReadOnlyObject__uuid"]
return my_dict
def __getnewargs__(self):
return (UuidToken(self._PersistentReadOnlyObject__uuid),)
def __setstate__(self, state):
if self.__dict__.pop("_PersistentReadOnlyObject__skip_setstate", None):
return
else:
self.__dict__.update(state)
def __deepcopy__(self, memo):
return self
def __copy__(self):
return self
# --------------------------------------------------------------
"""
From here on it's just testing stuff; will be moved to another file.
"""
def play_around(queue, thing):
import copy
queue.put((thing, copy.deepcopy(thing),))
class Booboo(PersistentReadOnlyObject):
def __init__(self):
self.number = random.random()
if __name__ == "__main__":
import multiprocessing
import random
import copy
def same(a, b):
return (a is b) and (a == b) and (id(a) == id(b)) and \
(a.number == b.number)
a = Booboo()
b = copy.copy(a)
c = copy.deepcopy(a)
assert same(a, b) and same(b, c)
my_queue = multiprocessing.Queue()
process = multiprocessing.Process(target = play_around,
args=(my_queue, a,))
process.start()
process.join()
things = my_queue.get()
for thing in things:
assert same(thing, a) and same(thing, b) and same(thing, c)
print("all cool!")
I don't know of any such functionality already implemented. The interesting problem is as follows, and needs precise specs as to what's to happen in this case...:
process A makes the obj and sends it to B, which unpickles it: so far so good. A makes change X to the obj, meanwhile B makes change Y to ITS copy of the obj. Now either process sends its obj to the other, which unpickles it: what changes to the object need to be visible at this time in each process? Does it matter whether A is sending to B or vice versa, i.e. does A "own" the object? Or what?
If you don't care, say because only A OWNS the obj -- only A is ever allowed to make changes and send the obj to others, others can't and won't change -- then the problems boil down to identifying obj uniquely -- a GUID will do. The class can maintain a class attribute dict mapping GUIDs to existing instances (probably as a weak-value dict to avoid keeping instances needlessly alive, but that's a side issue) and ensure the existing instance is returned when appropriate.
But if changes need to be synchronized to any finer granularity, then suddenly it's a REALLY difficult problem of distributed computing and the specs of what happens in what cases really need to be nailed down with the utmost care (and more paranoia than is present in most of us -- distributed programming is VERY tricky unless a few simple and provably correct patterns and idioms are followed fanatically!-).
If you can nail down the specs for us, I can offer a sketch of how I would go about trying to meet them. But I won't presume to guess the specs on your behalf;-).
Edit: the OP has clarified, and it seems all he needs is a better understanding of how to control __new__. That's easy: see __getnewargs__ -- you'll need a new-style class and pickling with protocol 2 or better (but those are advisable anyway for other reasons!-), then __getnewargs__ in an existing object can simply return the object's GUID (which __new__ must receive as an optional parameter). So __new__ can check if the GUID is present in the class's memo [[weakvalue;-)]]dict (and if so return the corresponding object value) -- if not (or if the GUID is not passed, implying it's not an unpickling, so a fresh GUID must be generated), then make a truly-new object (setting its GUID;-) and also record it in the class-level memo.
BTW, to make GUIDs, consider using the uuid module in the standard library.
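A minimal sketch of that __getnewargs__/memo idea (illustrative names; assumes new-style classes and pickling with protocol 2 or better):

import uuid
import weakref


class Memoized(object):
    _memo = weakref.WeakValueDictionary()  # class-level GUID -> instance map

    def __new__(cls, guid=None):
        if guid is not None and guid in cls._memo:
            # unpickling an object we already have: return the existing instance
            return cls._memo[guid]
        obj = super(Memoized, cls).__new__(cls)
        obj._guid = guid or uuid.uuid4()   # fresh GUID when not unpickling
        cls._memo[obj._guid] = obj
        return obj

    def __getnewargs__(self):
        # handed to __new__ when the pickle is loaded
        return (self._guid,)

Since the object is meant to be read-only, the default state update that pickle performs on the reused instance is harmless; the code in the question additionally skips it via __setstate__.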
You could simply use a dictionary where the key and the value are the same object in the receiver. And to avoid a memory leak, use a WeakKeyDictionary.