Is there a synchronized set class in Python, like Queue.Queue? I'm sending messages to a JMS queue and need to handle receipts:
Keep track of sent messages in a set
When a receipt is received, remove it from the set
When the set is empty, set an Event
Something with the same interface as Queue would be perfect, but I need to be able to remove things in any order.
Look into object locking:
http://docs.python.org/library/threading.html
Basically, you lock the thread's execution on an object. When you're done with the object, you release the lock and the waiting threads continue.
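For example, a minimal use of threading.Lock (a sketch; shared_counter is just a stand-in for whatever object you're protecting):

    import threading

    lock = threading.Lock()
    shared_counter = [0]            # stand-in for any shared object

    def worker():
        lock.acquire()              # block until no other thread holds the lock
        try:
            shared_counter[0] += 1  # only one thread runs this at a time
        finally:
            lock.release()          # let the next waiting thread proceed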
Queue.Queue uses a deque under the hood (and deque by itself is not thread-safe).
So you can extend the Queue.Queue class and add a new method that calls deque.remove() (if that is what you mean by removing elements in any order). To make your new method thread-safe, look at the Queue.Queue.put() method for an example of what you should do.
It's a bit risky, and race conditions are very hard to debug if you miss something, but I hope this gives a clearer view.
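A rough sketch of that idea (the class name is mine, and it relies on Queue.Queue internals such as mutex, queue and not_full, which are CPython implementation details):

    import Queue

    class RemovableQueue(Queue.Queue):
        def remove(self, item):
            self.mutex.acquire()          # the same lock put()/get() use
            try:
                self.queue.remove(item)   # self.queue is the underlying deque;
                                          # raises ValueError if item is absent
                self.not_full.notify()    # a slot opened up for blocked putters
            finally:
                self.mutex.release()

Note that this sketch ignores the task_done()/join() bookkeeping, so don't mix it with Queue.join().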
Trying to wrap my wits around how threading works. The high-level language in the docs and source code is helpful up to a degree but still leaves me scratching my head. What exactly, in terms of data structures, is the relationship between Thread and Condition objects? What does it mean when a thread "releases" a lock? That the Condition object dequeues its reference to the thread? Is there a lower-level description of these interactions, preferably in Python terms, to be found on the Internet?
A Condition maintains a list (actually a collections.deque) of what are notionally threads, waiting on the condition. It actually stores locks that the waiting threads are blocked on, but thinking of it storing the threads is a conceptual shortcut if you don't care too much about the implementation. The list is initially empty, but any time a thread calls the Condition's wait method, it will create a new lock and add it to the list before blocking on the lock (conceptually, this adds the thread to the list, and suspends it). Locks are removed from the list after another thread calls notify or notify_all, which unlocks one or more of the lock objects in the list, waking up the corresponding threads.
Releasing a lock means unlocking it. It's a basic operation on a Lock object (the reverse of acquire, which locks the Lock). A lock is "held" in between an acquire and a release, and only one thread can hold a Lock at a given time (other threads will either block in acquire, or the operation will fail, perhaps after a timeout). You can use the context manager protocol to call acquire and release for you in simple cases:
with some_lock:  # this acquires some_lock, blocking until it's available
    do_stuff()   # some_lock is held while this runs
# some_lock will be released automatically when the with block ends
Each Condition object is associated with a Lock, either a pre-existing one that you pass to its constructor, or one it creates internally for you (if you don't pass anything). The main Condition operations (wait and notify, and their variants) require that you already hold the associated lock before you call them. You can do the lock operations directly on the Condition object itself, since it proxies the Lock's acquire and release methods (and the equivalent context manager methods).
The Condition class is written in pure Python, so if you want to know how it works on a low level, there's probably no better source of information than the source code itself!
It might also be useful to see how a Condition is used to synchronize multithreaded access to an object. A good example of that is the queue module in the standard library, where each Queue uses three Conditions (not_full, not_empty and all_tasks_done) to efficiently manage threads that are trying to access or modify its data.
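For instance, a minimal version of that pattern (a sketch; items, producer and consumer are placeholders for your own code):

    import threading

    items = []
    cond = threading.Condition()   # creates its own internal Lock

    def consumer():
        with cond:                 # acquire the associated lock
            while not items:       # always re-check the predicate after waking
                cond.wait()        # releases the lock, blocks, re-acquires on wake
            item = items.pop()
        print item

    def producer():
        with cond:
            items.append("work")
            cond.notify()          # wake one waiting consumer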
I have a question about Python queues.
I have written a threaded class whose run() method processes commands from a queue.
import threading
import Queue

class AThread(threading.Thread):
    def __init__(self, arg1):
        threading.Thread.__init__(self)
        self.file_resource = arg1
        self.queue = Queue.Queue()

    def __myTask(self):
        '''Method that will access a common resource.
        Needs to be synchronized.
        Returns a Boolean based on the outcome.
        '''
        self.file_resource.write()

    def run(self):
        while True:
            cmd = self.queue.get()
            # cmd is actually a call to a method; exec'd code does not get
            # name mangling, so the mangled name is spelled out here
            exec("self._AThread__" + cmd)
            self.queue.task_done()

# The problem I have is here, when invoking the thread:
a = AThread(file_resource)  # file_resource is the shared resource
a.queue.put("myTask()")
print "Hai"
The same instance of AThread (a) will have tasks added to its queue from different places in the code.
Hence the print statement at the bottom should wait (for some definite period) for the task added just above it to finish, and should also receive the value returned by the task.
Is there a simple way to achieve this? I have searched a lot regarding this; kindly review this code and provide suggestions.
Also, why aren't Python's acquire and release locks tied to instances of a class? In the scenario above, instances a and b of AThread need not be synchronized with each other, yet when acquire and release are applied, myTask runs synchronized across both a and b.
Kindly provide suggestions.
There are lots of approaches you could take, depending on the particular contours of your problem.
If your print "Hai" just needs to happen after myTask completes, you could put it into a task and have myTask put that task on the queue when it finishes. (if you're a CS theory sort of person, you can think of this as being analogous to continuation-passing style).
If your print "Hai" has a more elaborate dependency on multiple tasks, you might look into futures or promises.
You could take a step into the world of Actor-based concurrency, in which case there would probably be a synchronous message send method that does more or less what you want.
If you don't want to use futures or promises, you can achieve a similar thing manually by introducing a condition variable. Set the condition variable before myTask starts and pass it to myTask, then wait for it to be cleared. You'll have to be very careful as your program grows, and constantly rethink your locking strategy to make sure it stays simple and comprehensible; this is the stuff of which difficult concurrency bugs are made.
The smallest sensible step to get what you want is probably to provide a blocking version of Queue.put() which does the condition-variable thing. Make sure you think about whether you want to block until the queue is empty, until the thing you put on the queue is removed from the queue, or until the thing you put on the queue has finished processing. And then make sure the implementation matches that decision.
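A minimal sketch of that blocking-put idea using an Event per task (put_and_wait and the (cmd, done, result) tuple protocol are my own invention, not part of Queue):

    import threading

    def put_and_wait(queue, cmd, timeout=None):
        done = threading.Event()
        result = []                    # mutable slot for the return value
        queue.put((cmd, done, result))
        done.wait(timeout)             # blocks the caller, not the worker
        return result[0] if result else None

    # The worker's run() loop would then do something like:
    #   cmd, done, result = self.queue.get()
    #   result.append(self.dispatch(cmd))  # run the task, store its result
    #   done.set()                         # wake the waiting caller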
I'm trying to write a Python 2.6 (OSX) program using multiprocessing, and I want to populate a Queue with more than the default of 32767 items.
from multiprocessing import Queue
Queue(2**15) # raises OSError
Queue(32767) works fine, but any higher number (e.g. Queue(32768)) fails with OSError: [Errno 22] Invalid argument
Is there a workaround for this issue?
One approach would be to wrap your multiprocessing.Queue with a custom class (just on the producer side, or transparently from the consumer's perspective). Using that, you would queue up items to be dispatched to the Queue object that you're wrapping, and only feed things from the local buffer (a Python list) into the multiprocessing.Queue as space becomes available, with exception handling to throttle when the Queue is full.
That's probably the easiest approach since it should have the minimum impact on the rest of your code. The custom class should behave just like a Queue while hiding the underlying multiprocessing.Queue behind your abstraction.
(One approach might be to have your producer use threads: one thread manages the dispatch from a threading Queue to your multiprocessing.Queue, and any other threads just feed the threading Queue.)
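Here's a minimal sketch of such a wrapper (BufferedQueue and its interface are mine; note that _flush only runs when the producer calls put, so a dispatcher thread as suggested above would drain the buffer more promptly):

    import Queue as queue_module       # for the Full exception
    import multiprocessing

    class BufferedQueue(object):
        def __init__(self, maxsize=32767):
            self._mpq = multiprocessing.Queue(maxsize)  # stays at the OS limit
            self._buffer = []                           # local overflow storage

        def put(self, item):
            self._buffer.append(item)
            self._flush()                # opportunistically drain the buffer

        def _flush(self):
            while self._buffer:
                try:
                    self._mpq.put_nowait(self._buffer[0])
                except queue_module.Full:
                    break                # real queue is full; keep items locally
                self._buffer.pop(0)

        def get(self, *args, **kwargs):  # consumers read the real queue directly
            return self._mpq.get(*args, **kwargs)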
I've already answered the original question, but I do feel like adding that Redis lists are quite reliable, and the Python module's support for them is extremely easy to use for implementing a Queue-like object. These have the advantage of allowing one to scale out over multiple nodes (across a network) as well as just over multiple processes.
Basically, to use those you'd just pick a key (string) for your queue name, have your producers push into it, and have your workers (task consumers) loop on blocking pops from that key.
The Redis BLPOP and BRPOP commands both take a list of keys (lists/queues) and an optional timeout value. They return a tuple (key, value) or None (on timeout). So you can easily write up an event-driven system that's very similar to the familiar structure of select() (but at a much higher level). The only things you have to watch for are missing keys and invalid key types (just wrap your queue operations with exception handlers, of course). (If some other application on your shared Redis server removes the keys you were using as queues, or replaces them with string/integer or other types of values ... well, you have a different problem at that point). :)
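A minimal sketch with the redis-py client ('myqueue' and the payload are placeholders; this assumes a Redis server on localhost):

    import redis

    r = redis.Redis()                           # connects to localhost:6379

    # producer
    r.rpush('myqueue', 'task payload')          # push onto the right end

    # consumer loop
    while True:
        item = r.blpop(['myqueue'], timeout=5)  # blocking pop from the left end
        if item is None:
            continue                            # timed out; try again
        key, value = item
        print value                             # process the task here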
Another advantage of this model is that Redis does persist its data to disk, so your work queue could survive system restarts if you choose to allow it.
(Of course you could implement a simple Queue as a table in SQLite or any other SQL system if you really wanted to; just use some sort of auto-incrementing index for the sequencing and a column to mark each item as having been "done" (consumed); but that involves somewhat more complexity than using what Redis gives you "out of the box".)
Working for me on MacOSX
>>> import Queue
>>> Queue.Queue(30000000)
<Queue.Queue instance at 0x1006035f0>
I'm learning to use the Queue module, and am a bit confused about how a queue consumer thread can be made to know that the queue is complete. Ideally I'd like to use get() from within the consumer thread and have it throw an exception if the queue has been marked "done". Is there a better way to communicate this than by appending a sentinel value to mark the last item in the queue?
original (most of this has changed; see updates below)
Based on some of the suggestions (thanks!) of Glenn Maynard and others, I decided to roll up a descendant of Queue.Queue that implements a close method. It's available in the form of a primitive (unpackaged) module. I'll clean this up a bit and package it properly when I have a bit more time. For now the module only contains the CloseableQueue class and the Closed exception class. I'm planning to expand it to also include subclasses of Queue.LifoQueue and Queue.PriorityQueue.
It's in a pretty preliminary state currently, which is to say that although it passes its test suite, I haven't actually used it for anything yet. Your mileage may vary. I'll keep this answer updated with exciting news.
The CloseableQueue class differs a bit from Glenn's suggestion in that closing the queue will prevent future puts, but not prevent future gets until the queue is emptied. This made the most sense to me; it seemed like functionality to clear the queue could be added as a separate mixin* that would be orthogonal to the closeability functionality. So basically with CloseableQueue, by closing the queue you indicate that the last element has been put. There's also an option to do this atomically by passing last=True to the final put call. Subsequent calls to put, and subsequent calls to get once the queue is emptied, as well as outstanding blocked calls matching those descriptions, will raise the Closed exception.
This is mostly useful for situations where a single producer is generating data for one or more consumers, but it could also be useful for a multi-multi arrangement where consumers are waiting for a particular item or set of items. In particular it doesn't provide a way to determine that all of a number of producers have finished production. Getting that working would entail the provision of some way to register producers (.open()?), as well as a way to indicate that producer registration is itself closed.
Suggestions and/or code reviews are quite welcome. I haven't written a whole lot of concurrency code, but hopefully the test suite is thorough enough that the fact that the code passes it is an indication of the code's quality, rather than the suite's lack thereof. I was able to reuse a bunch of the code from the Queue module's test suite: the file itself is included in this module and used as a basis for various subclasses and routines, including regression testing. This probably (hopefully) helped to avoid complete ineptitude in the testing department. The code itself just overrides Queue.get and Queue.put with fairly minimal changes, and adds the close and closed methods.
I've sort of intentionally avoided using any new-fangled fanciness like context managers in both the code itself and in the test suite, in an effort to keep the code as backwards-compatible as the Queue module itself, which is considerably backwards indeed. I'll probably add __enter__ and __exit__ methods at some point; in the meantime, contextlib's closing function should be applicable to a CloseableQueue instance.
*: Here I use the term "mixin" loosely. As the Queue module's classes are old-style, mixins would need to be mixed using class factory functions; some restrictions apply; offer void where prohibited by Guido.
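Based on the interface described above, usage might look like this (the import path and exact signatures are my guesses from the description, not verified against the package):

    from CloseableQueue import CloseableQueue, Closed

    q = CloseableQueue()

    # producer
    q.put('first')
    q.put('last item', last=True)   # put and close atomically

    # consumer
    try:
        while True:
            print q.get()           # keeps working until the queue drains
    except Closed:
        pass                        # queue was closed and is now empty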
update
The CloseableQueue module now provides CloseableLifoQueue and CloseablePriorityQueue classes. I've also added some convenience functions to support iteration. Still need to rework it as a proper package. There's a class factory function to allow for convenient subclassing of other Queue.Queue-derived classes.
update 2
CloseableQueue is now available via PyPI, e.g. with
$ easy_install CloseableQueue
Comments and criticism are welcome, especially from this answer's anonymous downvoter.
Queues don't inherently have the idea of being complete or done; they can be used indefinitely. To close one up when you are done, you will indeed need to put None or some other sentinel value at the end and write the logic to check for it, as you described. The ideal way would probably be to subclass the Queue object.
See http://en.wikipedia.org/wiki/Queue_(data_structure) to learn more about queues in general.
A sentinel is a natural way to shut down a queue, but there are a couple things to watch out for.
First, remember that you may have more than one consumer, so you need to send one sentinel for each running consumer, and guarantee that each consumer consumes exactly one sentinel, to ensure that every consumer receives its shutdown signal.
Second, remember that Queue defines an interface, and that when possible, code should behave regardless of the underlying Queue. You might have a PriorityQueue, or you might have some other class that exposes the same interface and returns values in some other order.
Unfortunately, it's hard to deal with both of these. To deal with the general case of different queues, a consumer that's shutting down must continue to consume values after receiving its shutdown sentinel until the queue is empty. That means that it may consume another thread's sentinel. This is a weakness of the Queue interface: it should have a Queue.shutdown call to cause an exception to be thrown by all consumers, but that's missing.
So, in practice:
if you're sure you're only ever using a regular Queue, simply send one sentinel per thread.
if you may be using a PriorityQueue, ensure that the sentinel has the lowest priority.
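For the plain-Queue case, a minimal sketch of one-sentinel-per-consumer shutdown (the work items here are just integers for illustration):

    import threading
    import Queue

    SENTINEL = object()        # unique marker; never equal to a real work item

    def consumer(q):
        while True:
            item = q.get()
            if item is SENTINEL:
                break          # consume exactly one sentinel, then exit
            print "processing", item

    q = Queue.Queue()
    workers = [threading.Thread(target=consumer, args=(q,)) for _ in range(4)]
    for w in workers:
        w.start()

    for item in range(20):     # stand-in for real work items
        q.put(item)
    for _ in workers:
        q.put(SENTINEL)        # one sentinel per consumer
    for w in workers:
        w.join()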
Queue is a FIFO (first in, first out) structure, so remember that the consumer can be faster than the producer. When a consumer thread detects that the queue is empty, it normally takes one of the following actions:
Yield to the scheduler (switch to the next thread).
Sleep for a few milliseconds and then check the queue again.
Wait on an event (such as a new message arriving in the queue).
If you want the consumer thread to terminate once the job is complete, put a sentinel value in the queue to end the task.
The best-practice way of doing this would be to have the queue itself notify a client that it has reached the 'done' state. The client can then take whatever action is appropriate.
What you have suggested (checking the queue periodically to see if it is done) would be highly undesirable. Polling is an antipattern in multithreaded programming; you should always use notifications.
EDIT:
So you're saying that the queue itself knows it's 'done' based on some criteria and needs to notify the clients of that fact. I think you are correct, and the best way to do this is to throw an exception when a client calls get() and the queue is in the done state. Throwing would negate the need for a sentinel value on the client side. Internally the queue can detect that it is 'done' however it pleases, e.g. the queue is empty, its state was set to done, etc. I don't see any need for a sentinel value.
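A minimal sketch of that throw-on-get idea (QueueDone and NotifyingQueue are my names; this leans on Queue.Queue internals such as mutex, not_empty, _qsize and _get, which are CPython implementation details):

    import Queue

    class QueueDone(Exception):
        pass                               # raised once the queue is done and drained

    class NotifyingQueue(Queue.Queue):
        def __init__(self, maxsize=0):
            Queue.Queue.__init__(self, maxsize)
            self._done = False

        def set_done(self):
            with self.mutex:
                self._done = True
                self.not_empty.notify_all()  # wake blocked consumers so they can raise

        def get(self):                       # blocking get only, for brevity
            with self.not_empty:
                while not self._qsize():
                    if self._done:
                        raise QueueDone()    # done and drained: tell the caller
                    self.not_empty.wait()
                item = self._get()
                self.not_full.notify()
                return item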
I'm writing a Python class that makes asynchronous method calls using D-Bus. When my reply_handler is called, it stores data in a list. This list can be used by other class methods at the same time. Is this safe, or can I only use synchronized data structures like the Queue class?
If you do not modify the list outside of the callback context, then you do not necessarily need synchronization - you will just need to be aware that the list object's state is volatile.
If the list must be modified both in the callback handler as well as, say, the main execution context (or other threads, etc.), then yes you will need synchronization.
The Python synchronized Queue works naturally for message pumps, allowing you to perform actions sequentially, in the order the events come in, in one of your own contexts. This benefits code simplicity and readability as well, since major state changes are easier to track. Callbacks generally shouldn't be too complicated anyway, as the outside context in which the callbacks are called shouldn't (and probably doesn't) have to deal with exceptions raised from your code. There are also timing considerations: the callback will block the async emitter's context, so keeping the handler short and sweet is good too.
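A minimal sketch of that message-pump pattern (the reply_handler and pump names are mine):

    import Queue

    events = Queue.Queue()            # hand-off point between the two contexts

    def reply_handler(*args):
        # Called from the D-Bus async context: just enqueue and return quickly.
        events.put(args)              # Queue.put is thread-safe

    def pump():
        # Main execution context: process replies sequentially, in arrival order.
        while True:
            args = events.get()       # blocks until the callback delivers something
            print "got reply:", args  # stand-in for your real processing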