I'm learning to use the Queue module, and am a bit confused about how a queue consumer thread can be made to know that the queue is complete. Ideally I'd like to use get() from within the consumer thread and have it throw an exception if the queue has been marked "done". Is there a better way to communicate this than by appending a sentinel value to mark the last item in the queue?
original (most of this has changed; see updates below)
Based on some of the suggestions (thanks!) of Glenn Maynard and others, I decided to roll my own descendant of Queue.Queue that implements a close method. It's available in the form of a primitive (unpackaged) module. I'll clean this up a bit and package it properly when I have a bit more time. For now the module only contains the CloseableQueue class and the Closed exception class. I'm planning to expand it to also include subclasses of Queue.LifoQueue and Queue.PriorityQueue.
It's in a pretty preliminary state currently, which is to say that although it passes its test suite, I haven't actually used it for anything yet. Your mileage may vary. I'll keep this answer updated with exciting news.
The CloseableQueue class differs a bit from Glenn's suggestion in that closing the queue will prevent future puts, but not prevent future gets until the queue is emptied. This made the most sense to me; it seemed like functionality to clear the queue could be added as a separate mixin* that would be orthogonal to the closeability functionality. So basically with CloseableQueue, by closing the queue you indicate that the last element has been put. There's also an option to do this atomically by passing last=True to the final put call. Subsequent calls to put, and subsequent calls to get once the queue is emptied, as well as outstanding blocked calls matching those descriptions, will raise the Closed exception.
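To make those semantics concrete, here's a sketch of the intended usage; the names (CloseableQueue, Closed, the last=True option) are as described above, but since the module is preliminary, treat this as illustrative rather than authoritative:

import threading
from CloseableQueue import CloseableQueue, Closed

q = CloseableQueue()

def consume():
    try:
        while True:
            item = q.get()     # raises Closed once the queue is closed *and* empty
            print 'got', item
    except Closed:
        pass                   # normal shutdown

t = threading.Thread(target=consume)
t.start()

for i in range(9):
    q.put(i)
q.put(9, last=True)            # atomically put the final item and close the queue
t.join()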
This is mostly useful for situations where a single producer is generating data for one or more consumers, but it could also be useful for a multi-multi arrangement where consumers are waiting for a particular item or set of items. In particular it doesn't provide a way to determine that all of a number of producers have finished production. Getting that working would entail the provision of some way to register producers (.open()?), as well as a way to indicate that producer registration is itself closed.
Suggestions and/or code reviews are quite welcome. I haven't written a whole lot of concurrency code, but hopefully the test suite is thorough enough that the fact that the code passes it is an indication of the code's quality, rather than the suite's lack thereof. I was able to reuse a bunch of the code from the Queue module's test suite: the file itself is included in this module and used as a basis for various subclasses and routines, including regression testing. This probably (hopefully) helped to avoid complete ineptitude in the testing department. The code itself just overrides Queue.get and Queue.put with fairly minimal changes, and adds the close and closed methods.
I've sort of intentionally avoided using any new-fangled fanciness like context managers in both the code itself and in the test suite in an effort to keep the code as backwards-compatible as is the Queue module itself, which is considerably backwards indeed. I'll probably add __enter__ and __exit__ methods at some point; otherwise, the contextlib's closing function should be applicable to a CloseableQueue instance.
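For example, a short sketch with contextlib.closing, which relies only on the close method described above:

from contextlib import closing
from CloseableQueue import CloseableQueue

with closing(CloseableQueue()) as q:
    q.put('item')
# close() has been called here; subsequent puts raise Closed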
*: Here I use the term "mixin" loosely. As the Queue module's classes are old-style, mixins would need to be mixed using class factory functions; some restrictions apply; offer void where prohibited by Guido.
update
The CloseableQueue module now provides CloseableLifoQueue and CloseablePriorityQueue classes. I've also added some convenience functions to support iteration. Still need to rework it as a proper package. There's a class factory function to allow for convenient subclassing of other Queue.Queue-derived classes.
update 2
CloseableQueue is now available via PyPI, e.g. with
$ easy_install CloseableQueue
Comments and criticism are welcome, especially from this answer's anonymous downvoter.
Queues don't inherently have the idea of being complete or done. They can be used indefinitely. To close one up when you are done, you will indeed need to put None or some other sentinel value at the end and write the logic to check for it, as you described. The ideal way would probably be to subclass the Queue object.
See http://en.wikipedia.org/wiki/Queue_(data_structure) to learn more about queues in general.
A sentinel is a natural way to shut down a queue, but there are a couple things to watch out for.
First, remember that you may have more than one consumer. You need to send one sentinel per running consumer, and guarantee that each consumer consumes at most one sentinel, so that every consumer receives its shutdown signal.
Second, remember that Queue defines an interface, and that when possible, code should behave regardless of the underlying Queue. You might have a PriorityQueue, or you might have some other class that exposes the same interface and returns values in some other order.
Unfortunately, it's hard to deal with both of these. To deal with the general case of different queues, a consumer that's shutting down must continue to consume values after receiving its shutdown sentinel until the queue is empty. That means that it may consume another thread's sentinel. This is a weakness of the Queue interface: it should have a Queue.shutdown call to cause an exception to be thrown by all consumers, but that's missing.
So, in practice:
if you're sure you're only ever using a regular Queue, simply send one sentinel per thread (see the sketch below).
if you may be using a PriorityQueue, ensure that the sentinel has the lowest priority.
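Here's a minimal sketch of the one-sentinel-per-consumer pattern for a regular Queue; the names (SENTINEL, worker, NUM_CONSUMERS) are illustrative, not from any library:

import threading
import Queue

SENTINEL = object()  # unique object; no producer can accidentally put an equal item
NUM_CONSUMERS = 4
q = Queue.Queue()

def worker():
    while True:
        item = q.get()
        if item is SENTINEL:  # consume exactly one sentinel, then stop
            break
        # ... process item ...

threads = [threading.Thread(target=worker) for _ in range(NUM_CONSUMERS)]
for t in threads:
    t.start()
for item in range(100):  # produce some work
    q.put(item)
for _ in range(NUM_CONSUMERS):  # one sentinel per consumer
    q.put(SENTINEL)
for t in threads:
    t.join()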
A Queue is a FIFO (first in, first out) structure, and remember that the consumer can be faster than the producer. When a consumer thread detects that the queue is empty, it normally does one of the following:
Yields to the next thread.
Sleeps for some milliseconds and then checks the queue again.
Waits on an event (such as a new message arriving in the queue).
If you want the consumer thread to terminate once the job is complete, put a sentinel value in the queue to end the task.
The best practice way of doing this would be to have the queue itself notify a client that it has reached the 'done' state. The client can then take any action that is appropriate.
What you have suggested, checking the queue periodically to see whether it is done, would be highly undesirable. Polling is an antipattern in multithreaded programming; you should always use notifications.
EDIT:
So you're saying that the queue itself knows it's 'done' based on some criteria and needs to notify the clients of that fact. I think you are correct, and the best way to do this is by throwing an exception when a client calls get() and the queue is in the done state. Throwing would negate the need for a sentinel value on the client side. Internally, the queue can detect that it is 'done' however it pleases, e.g. the queue is empty, its state was set to done, etc. I don't see any need for a sentinel value.
In Python 3.7 there are two options for an off-the-shelf FIFO queue object:
queue.Queue
queue.SimpleQueue
However, all that the docs say about a SimpleQueue is that:
Constructor for an unbounded FIFO queue. Simple queues lack advanced functionality such as task tracking. New in version 3.7.
What would be the functional difference between the two?
Literally what the docs you quoted tell you (the details can be gleaned by comparing the available methods):
SimpleQueue is always unbounded (Queue is optionally bounded); they removed the full method (because it can never be full)
SimpleQueue doesn't do task tracking, so it doesn't provide task_done (for consumers to indicate a task has been completed) or join (to allow a thread to block until all items put have had a matching task_done call)
This simplifies a lot of the code by dropping support for comparatively rare scenarios. (I've almost never seen anyone use task_done or join, and when people do, the code is often buggy: an exception can bypass a task_done call, making join block forever.) As a side effect, performance improves a bit.
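For illustration, here's the task-tracking pattern that queue.Queue supports and queue.SimpleQueue drops; note the try/finally guarding task_done against the exception-bypass bug mentioned above:

import queue
import threading

q = queue.Queue()

def process(item):
    print('processing', item)   # stand-in for real work

def worker():
    while True:
        item = q.get()
        try:
            process(item)
        finally:
            q.task_done()       # always runs, even if process() raises

threading.Thread(target=worker, daemon=True).start()
for item in range(10):
    q.put(item)
q.join()                        # returns once every item has had a matching task_done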
Side-note: If you don't actually need the blocking features at all, you can skip the queue module and just use collections.deque. It's still thread-safe and atomic for appends and pops from either end; it just doesn't provide the ability to do a blocking get when the deque is empty. In exchange, it has much lower overhead (it's implemented entirely in C, with no supplementary locking like the queue classes).
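A quick sketch of the deque alternative; since there is no blocking get, the consumer handles emptiness itself:

from collections import deque

d = deque()
d.append('job-1')        # appends are thread-safe and atomic
d.append('job-2')

try:
    item = d.popleft()   # atomic pop from the left end gives FIFO order
except IndexError:
    item = None          # empty deque: there is no blocking get() to wait on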
I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside, and multiplexes them to several workers.
Unlike almost every example I have found, my setup has several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different requests, which would make one queue per request difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request queue and one answer queue. I've discarded this idea since it would either block interface for the answer-time of the current worker (making it scale badly), or require me to send around extra routing information.
Use one request queue, where each message contains a pipe over which to return the answer to that request. After fixing the issue of being unable to send pipes via pipes, I ran into problems with the pipe closing unless both ends were sent over the connection.
Use one request queue, where each message contains a queue over which to return the answer to that request. This fails since queues cannot be sent via queues, and the reduction trick doesn't work either.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have two or more separate processes running. There is no way for one process to directly access another's memory (as you can with multithreading).
Your best bet is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But be aware that this will only work if Celery/RQ are able to "pack" (pickle) the needed functions/classes and send them to the other process. Therefore you have to use module-level functions (defined at the top of the file, not belonging to any class).
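As a hedged sketch of what that looks like with Celery (assuming a Redis broker on localhost; get_worker_calls and its body here are illustrative, not a real design):

# tasks.py -- tasks must be module-level functions, as noted above
from celery import Celery

app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task
def get_worker_calls(worker_id):
    # illustrative only: real per-worker state would live in the worker
    # process or an external store, not inside a stateless task
    return worker_id

# in the interface process:
#   result = get_worker_calls.delay(0)  # dispatch the request
#   print(result.get(timeout=5))        # block for the answer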
You can always implement it yourself; Redis is very simple to start with, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
I have a question about Python queues.
I have written a threaded class whose run() method executes tasks from its queue.
import threading
import Queue

class AThread(threading.Thread):
    def __init__(self, arg1):
        self.file_resource = arg1
        threading.Thread.__init__(self)
        self.queue = Queue.Queue()

    def _myTask(self):
        '''Method that will access a common resource.
        Needs to be synchronized.
        Returns a Boolean based on the outcome.
        '''
        self.file_resource.write()

    def run(self):
        while True:
            cmd = self.queue.get()
            # cmd is actually a call to a method, e.g. "myTask()"
            # (single leading underscore: a double underscore would be
            # name-mangled at class compile time and break the exec lookup)
            exec("self._" + cmd)
            self.queue.task_done()

# The problem I have is here, while invoking the thread:
a = AThread(file_resource)  # file_resource created elsewhere
a.start()
a.queue.put("myTask()")
print "Hai"
The same instance of AThread (a = AThread(...)) will have tasks loaded into its queue from different locations.
Hence the print statement at the bottom should wait for the task added to the queue by the statement above it, wait at most for a definite period, and also receive the value returned by the task after it executes.
Is there a simple way to achieve this? I have searched a lot regarding this; kindly review the code and provide suggestions.
Also, why aren't Python's acquire and release locks on instances of the class? In the scenario above, instances a and b of AThread need not be synchronized with each other, yet myTask runs synchronized across both a and b when acquire and release are applied.
There's lots of approaches you could take, depending on the particular contours of your problem.
If your print "Hai" just needs to happen after myTask completes, you could put it into a task and have myTask put that task on the queue when it finishes. (If you're a CS theory sort of person, you can think of this as being analogous to continuation-passing style.)
If your print "Hai" has a more elaborate dependency on multiple tasks, you might look into futures or promises.
You could take a step into the world of Actor-based concurrency, in which case there would probably be a synchronous message send method that does more or less what you want.
If you don't want to use futures or promises, you can achieve a similar thing manually by introducing a condition variable. Set the condition variable before myTask starts and pass it to myTask, then wait for it to be cleared. You'll have to be very careful as your program grows and constantly rethink your locking strategy to make sure it stays simple and comprehensible - this is the stuff of which difficult concurrency bugs are made.
The smallest sensible step to get what you want is probably to provide a blocking version of Queue.put() which does the condition variable thing. Make sure you think about whether you want to block until the queue is empty, or until the thing you put on the queue is removed from the queue, or until the thing you put on the queue has finished processing. And then make sure you implement the thing you decided to implement when you were thinking about it.
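For instance, here's a minimal sketch of the "block until the thing you put has finished processing" choice, using one threading.Event per item (blocking_put is a made-up name, not part of Queue):

import threading
import Queue

def blocking_put(q, item):
    # made-up helper: block until a consumer reports the item processed
    done = threading.Event()
    q.put((item, done))
    done.wait()

def consumer(q):
    while True:
        item, done = q.get()
        print 'processing', item   # ... do the actual work here ...
        done.set()                 # wake the producer blocked in wait()

q = Queue.Queue()
t = threading.Thread(target=consumer, args=(q,))
t.setDaemon(True)                  # let the program exit while the consumer runs
t.start()
blocking_put(q, 'myTask')          # returns only once 'myTask' was processed
print "Hai"                        # guaranteed to run after the task completes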
I have different threads, and after processing they put data in a common list. Is there anything built into Python to ensure that a list or a numpy array is only accessed by a single thread at a time? Secondly, if there is not, what is an elegant way of doing it?
According to Thread synchronisation mechanisms in Python, reading a single item from a list and modifying a list in place are guaranteed to be atomic. If this is right (although it seems to be partially contradicted by the very existence of the Queue module), then if your code is all of the form:
try:
    val = mylist.pop()
except IndexError:
    pass  # wait for a while or exit
else:
    pass  # process val
And everything put into mylist is done by .append(), then your code is already threadsafe. If you don't trust that one document on that score, use a Queue.Queue, which does all the synchronisation for you, and has a better API than list for concurrent programs - in particular, it gives you the option of blocking indefinitely, or with a timeout, waiting for .get() to succeed, if the thread has nothing else to get on with in the meantime.
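In sketch form, the polling loop above then becomes:

import Queue

q = Queue.Queue()
q.put('job')                       # something for the consumer to find

while True:
    try:
        val = q.get(timeout=1.0)   # block up to a second instead of spinning
    except Queue.Empty:
        break                      # timed out: wait longer, or exit
    print 'processing', val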
For numpy arrays, and in general any case where you need more than a producer/consumer queue, use a Lock or RLock from threading - these implement the context manager protocol, so using them is quite simple:
with mylock:
    pass  # process as necessary

And Python will guarantee that the lock is released once you fall off the end of the with block - including in tricky cases, like when something you do inside it raises an exception.
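For the numpy case specifically, a minimal sketch (the shared array and the update are illustrative):

import threading
import numpy as np

shared = np.zeros(10)              # illustrative shared array
mylock = threading.Lock()

def worker(i):
    with mylock:                   # released even if the body raises
        shared[i % 10] += 1        # read-modify-write is not atomic on its own

threads = [threading.Thread(target=worker, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()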
Finally, consider whether multiprocessing is a better fit for your application than threading - threads in Python aren't guaranteed to actually run concurrently, and in CPython they only can when they drop into C-level code. multiprocessing gets around that issue, but may have some extra overhead - if you haven't already, you should read the docs to determine which one suits your needs better.
threading provides Lock objects if you need to protect an entire critical section, or the Queue module provides a queue that is threadsafe.
How about the standard library Queue?
I've seen a few threaded downloaders online, and even a few multi-part downloaders (HTTP).
I haven't seen them together as a class/function.
If any of you have a class/function lying around, that I can just drop into any of my applications where I need to grab multiple files, I'd be much obliged.
If there is a library/framework (or a program's back-end) that does this, please direct me towards it.
Threadpool by Christopher Arndt may be what you're looking for. I've used this "easy to use object-oriented thread pool framework" for the exact purpose you describe and it works great. See the usage examples at the bottom of the linked page. And it really is easy to use: just define three functions (one of which is an optional exception handler in place of the default handler) and you are on your way; a minimal usage sketch also follows the feature list below.
from http://www.chrisarndt.de/projects/threadpool/:
Object-oriented, reusable design
Provides callback mechanism to process results as they are returned from the worker threads.
WorkRequest objects wrap the tasks assigned to the worker threads and allow for easy passing of arbitrary data to the callbacks.
The use of the Queue class solves most locking issues.
All worker threads are daemonic, so they exit when the main program exits, no need for joining.
Threads start running as soon as you create them. No need to start or stop them. You can increase or decrease the pool size at any time, superfluous threads will just exit when they finish their current task.
You don't need to keep a reference to a thread after you have assigned the last task to it. You just tell it: "don't come back looking for work, when you're done!"
Threads don't eat up cycles while waiting to be assigned a task, they just block when the task queue is empty (though they wake up every few seconds to check whether they are dismissed).
Also available at http://pypi.python.org/pypi/threadpool, via easy_install, or as a Subversion checkout (see the project homepage).
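And here is the promised hedged sketch of threadpool usage; ThreadPool, makeRequests, putRequest, and wait come from the library's documented API, while download_url and on_result are made-up illustrations:

import threadpool
import urllib2

def download_url(url):
    # made-up worker function: fetch one URL and return its body
    return urllib2.urlopen(url).read()

def on_result(request, result):
    # callback invoked while the main thread sits in pool.wait()
    print request.args[0], len(result)

urls = ['http://example.com/a', 'http://example.com/b']
pool = threadpool.ThreadPool(4)  # four worker threads
for req in threadpool.makeRequests(download_url, urls, on_result):
    pool.putRequest(req)
pool.wait()  # block until all queued requests have been processed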