Multiprocessing Queue maxsize limit is 32767 - python

I'm trying to write a Python 2.6 (OSX) program using multiprocessing, and I want to populate a Queue with more than the default of 32767 items.
from multiprocessing import Queue
Queue(2**15) # raises OSError
Queue(32767) works fine, but any higher number (e.g. Queue(32768)) fails with OSError: [Errno 22] Invalid argument
Is there a workaround for this issue?

One approach would be to wrap your multiprocessing.Queue with a custom class (just on the producer side, or transparently from the consumer perspective). Using that, you would queue up items to be dispatched to the Queue object that you're wrapping, and only feed things from the local queue (a Python list() object) into the multiprocessing.Queue as space becomes available, with exception handling to throttle when the Queue is full.
That's probably the easiest approach since it should have the minimum impact on the rest of your code. The custom class should behave just like a Queue while hiding the underlying multiprocessing.Queue behind your abstraction.
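A minimal sketch of that wrapper, assuming a single producer and that buffered items only get drained on later put calls (the class name BigQueue and its details are illustrative, not a drop-in implementation):
import Queue             # only for the Full exception raised by put_nowait
import multiprocessing

class BigQueue(object):
    """Producer-side wrapper: buffers overflow locally and feeds the real,
    size-limited queue as space becomes available (here, on each later put)."""
    def __init__(self, inner_maxsize=32767):
        self._queue = multiprocessing.Queue(inner_maxsize)
        self._overflow = []                  # local buffer for the excess

    def put(self, item):
        self._drain()                        # move buffered items along first, keeping order
        if self._overflow:
            self._overflow.append(item)
            return
        try:
            self._queue.put_nowait(item)
        except Queue.Full:
            self._overflow.append(item)

    def _drain(self):
        while self._overflow:
            try:
                self._queue.put_nowait(self._overflow[0])
            except Queue.Full:
                break
            self._overflow.pop(0)

    def get(self, *args, **kwargs):          # consumers see an ordinary queue
        return self._queue.get(*args, **kwargs)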
(One approach might be to have your producer use threads, one thread to manage the dispatch from a threading Queue to your multiprocessing.Queue and any other threads actually just feeding the threading Queue).

I've already answered the original question, but I do feel like adding that Redis lists are quite reliable, and the Python module's support for them is extremely easy to use for implementing a Queue-like object. These have the advantage of allowing one to scale out over multiple nodes (across a network) as well as just over multiple processes.
Basically, to use them you'd just pick a key (a string) for your queue name, have your producers push into it, and have your workers (task consumers) loop on blocking pops from that key.
The Redis BLPOP and BRPOP commands both take a list of keys (lists/queues) and an optional timeout value. They return a (key, value) tuple, or None on timeout. So you can easily write up an event-driven system that's very similar to the familiar structure of select() (but at a much higher level). The only things you have to watch for are missing keys and invalid key types (just wrap your queue operations with exception handlers, of course). (If some other application stomps on your shared Redis server, removing keys or replacing keys that you were using as queues with string/integer or other types of values ... well, you have a different problem at that point.) :)
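For example, a rough sketch with the redis-py client (the key name work_queue and the function names are just placeholders):
import redis

r = redis.Redis(host='localhost', port=6379)

def produce(item):
    r.rpush('work_queue', item)                    # push onto the tail of the list

def consume():
    while True:
        result = r.blpop(['work_queue'], timeout=5)  # blocking pop from the head
        if result is None:
            continue                               # timed out; loop again (or break)
        key, value = result                        # (key, value) tuple, as described above
        # ... process value ...
        print value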
Another advantage of this model is that Redis persists its data to disk, so your work queue could survive system restarts if you choose to allow it.
(Of course you could implement a simple queue as a table in SQLite or any other SQL system if you really wanted to; just use some sort of auto-incrementing index for the sequencing and a column to mark each item as having been "done" (consumed); but that does involve somewhat more complexity than using what Redis gives you "out of the box".)
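If you did go the SQL route, a rough sqlite3 sketch of that table-as-queue idea might look like this (table and column names are illustrative, and cross-process locking details are ignored):
import sqlite3

conn = sqlite3.connect('queue.db')
conn.execute("""CREATE TABLE IF NOT EXISTS queue (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,   -- sequencing
                    payload TEXT,
                    done INTEGER DEFAULT 0                  -- "consumed" marker
                )""")

def put(payload):
    with conn:                               # commits the transaction on success
        conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))

def get():
    with conn:
        row = conn.execute("SELECT id, payload FROM queue "
                           "WHERE done = 0 ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None                      # queue is empty
        conn.execute("UPDATE queue SET done = 1 WHERE id = ?", (row[0],))
        return row[1]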

Working for me on Mac OS X:
>>> import Queue
>>> Queue.Queue(30000000)
<Queue.Queue instance at 0x1006035f0>

Related

Handling endless data stream with multiprocessing and Queues

I want to use the Python 2.7 multiprocessing package to operate on an endless stream of data. A subprocess will constantly receive data via TCP/IP or UDP packets and immediately place the data in a multiprocessing.Queue. However, at certain intervals, say, every 500ms, I only want to operate on a user specified slice of this data. Let's say, the last 200 data packets.
I know I can put() and get() on the Queue, but how can I create that slice of data without a) Backing up the queue and b) Keeping things threadsafe?
I'm thinking I have to constantly get() from the Queue with another subprocess to prevent the Queue from getting full. Then I have to store the data in another data structure (such as a list) to build the user specified slice. But the data structure would probably not be thread safe, so it does not sound like a good solution.
Is there some programming paradigm that achieves what I am trying to do easily? I looked at the multiprocessing.Manager class, but wasn't sure it would work.
You can do this as follows:
Use an instance of the threading.Lock class. Call its acquire method to claim exclusive access to your queue from a certain thread, and call release to grant other threads access.
Since you want to keep gathering your input, copying the whole queue would probably be too expensive. Probably the fastest way is to first collect data in one queue, then swap it for another, and have a different thread read the data from the old one into your application. Protect the swap with a Lock instance, so you can be sure that whenever the writer acquires the lock, the current 'listener' queue is ready to receive data.
If only recent data is important, use two circular buffers instead of queues, allowing old data to be overwritten.
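A small sketch of that swap-under-a-lock idea, using a bounded deque as the circular buffer (receive and take_slice are made-up names):
import threading
from collections import deque

swap_lock = threading.Lock()
active = deque(maxlen=200)          # circular buffer: only the most recent packets survive

def receive(packet):
    # called by the receiving thread for every incoming packet
    with swap_lock:
        active.append(packet)

def take_slice():
    # called every ~500ms by the processing thread
    global active
    with swap_lock:
        snapshot, active = active, deque(maxlen=200)   # swap buffers under the lock
    return snapshot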

Python multiprocessing - function-like communication between two processes

I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside, and multiplexes them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different requests, which would make one queue per request difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request and answer queue. I've discarded this idea since it would either block interface for the answer-time of the current worker (making it badly scalable), or would require me to send around extra information.
Use of one request queue. Each message contains a pipe to return the answer to that request. After fixing the issue with being unable to send pipes via pipes, I ran into problems with the pipe closing unless both ends were sent over the connection.
Use of one request queue. Each message contains a queue to return the answer to that request. This fails since I cannot send queues via queues, and the reduction trick doesn't work.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have two or more separate processes running. There is no way for one process to directly access the memory of another (as you can with multithreading).
Your best shot is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But you have to know that this will work only if Celery/RQ are able to "pack" the needed functions/classes and send them to the other process. Therefore you have to use module-level functions (defined at the top of the file, not belonging to any class).
You can always implement it yourself: Redis is very simple, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
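If you'd rather stay inside the standard library than reach for Celery/RQ, one hedged sketch of function-like calls to persistent, per-id workers uses a multiprocessing.Pipe per worker (names here are illustrative; note the worker function is defined at module level, as required):
import multiprocessing

def worker_loop(conn):
    calls = 0
    while True:
        request = conn.recv()         # block until the interface sends a request
        if request == 'shutdown':
            break
        calls += 1
        conn.send(calls)              # reply, making the exchange call-like

class Interface(object):
    def __init__(self, num_workers):
        self._conns = []
        self._procs = []
        for _ in range(num_workers):
            parent_end, child_end = multiprocessing.Pipe()
            p = multiprocessing.Process(target=worker_loop, args=(child_end,))
            p.start()
            self._conns.append(parent_end)
            self._procs.append(p)

    def get_worker_calls(self, worker_id):
        conn = self._conns[worker_id]
        conn.send('get_calls')        # request routed to the chosen worker
        return conn.recv()            # block for that worker's answer

    def shutdown(self):
        for conn in self._conns:
            conn.send('shutdown')
        for p in self._procs:
            p.join()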

Clean way to get near-LIFO behavior from multiprocessing.Queue? (or even just *not* near-FIFO)

Does anyone know a clean way to get near-LIFO or even not near-FIFO (e.g. random) behavior from multiprocessing.Queue?
Alternative Question: Could someone point me to the code for the thread that manages the actual storage structure behind multiprocessing.Queue? It seems like it would be trivial within that to provide approximately LIFO access, but I got lost in the rabbit hole trying to find it.
Notes:
I believe multiprocessing.Queue does not guarantee order. Fine. But it is near-FIFO so near-LIFO would be great.
I could pull all the current items off the queue and reverse the order before working with them, but I prefer to avoid a kludge if possible.
(edit) To clarify: I am doing a CPU bound simulation with multiprocessing and so can't use the specialized queues from Queue. Since I haven't seen any answers for a few days, I've added the alternative question above.
In case it is an issue, below is slight evidence that multiprocessing.Queue is near-FIFO. It just shows that in a simple case (a single thread), it is perfectly FIFO on my system:
import multiprocessing as mp
import Queue

q = mp.Queue()
for i in xrange(1000):
    q.put(i)

deltas = []
while True:
    try:
        value1 = q.get(timeout=0.1)
        value2 = q.get(timeout=0.1)
        deltas.append(value2 - value1)
    except Queue.Empty:
        break

# positive deltas would indicate the numbers are coming out in increasing order
min_delta, max_delta = min(deltas), max(deltas)
avg_delta = sum(deltas) / len(deltas)
print "min", min_delta
print "max", max_delta
print "avg", avg_delta
prints: min, max, and average are exactly 1 (perfect FIFO)
I've looked over the Queue class that lives in Lib/multiprocessing/queues.py in my Python installation (Python 2.7, but nothing obvious is different in the version from Python 3.2 that I briefly checked). Here's how I understand it works:
There are two sets of objects that are maintained by the Queue object. One set consists of multiprocess-safe primitives that are shared by all processes. The other set is created and used separately by each process.
The cross-process objects are set up in the __init__ method:
A Pipe object, whose two ends are saved as self._reader and self._writer.
A BoundedSemaphore object, which counts (and optionally limits) how many objects are in the queue.
A Lock object for reading the Pipe, and on non-Windows platforms another for writing. (I assume that this is because writing to a pipe is inherently multiprocess-safe on Windows.)
The per-process objects are set up in the _after_fork and _start_thread methods:
A collections.deque object used to buffer writes to the Pipe.
A threading.Condition object used to signal when the buffer is not empty.
A threading.Thread object that does the actual writing. It is created lazily, so it won't exist until at least one write to the Queue has been requested in a given process.
Various Finalize objects that clean stuff up when the process ends.
A get from the queue is pretty simple. You acquire the read lock, decrement the semaphore, and grab an object from the read end of the Pipe.
A put is more complicated. It uses multiple threads. The caller to put grabs the condition's lock, then adds its object to the buffer and signals the condition before unlocking it. It also increments the semaphore and starts up the writer thread if it isn't running yet.
The writer thread loops forever (until canceled) in the _feed method. If the buffer is empty, it waits on the notempty condition. Then it takes an item from the buffer, acquires the write lock (if it exists) and writes the item to the Pipe.
So, given all of that, can you modify it to get a LIFO queue? It doesn't seem easy. Pipes are inherently FIFO objects, and while the Queue can't guarantee FIFO behavior overall (due to the asynchronous nature of the writes from multiple processes) it is always going to be mostly FIFO.
If you have only a single consumer, you could get objects from the queue and add them to your own process-local stack. It would be harder to do a multi-consumer stack, though with shared memory a bounded-size stack wouldn't be too hard. You'd need a lock, a pair of conditions (for blocking/signaling on full and empty states), a shared integer value (for the number of values held) and a shared array of an appropriate type (for the values themselves).
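A hedged sketch of that shared-memory bounded stack (the SharedStack name and details are mine; the object must be created in the parent and handed to child processes at creation time):
import multiprocessing as mp

class SharedStack(object):
    def __init__(self, capacity):
        self._lock = mp.Lock()
        self._not_empty = mp.Condition(self._lock)
        self._not_full = mp.Condition(self._lock)
        self._count = mp.Value('i', 0, lock=False)          # number of values held
        self._items = mp.Array('i', capacity, lock=False)   # fixed-size value store (ints here)

    def put(self, value):
        with self._lock:
            while self._count.value == len(self._items):
                self._not_full.wait()                        # block while full
            self._items[self._count.value] = value
            self._count.value += 1
            self._not_empty.notify()

    def get(self):
        with self._lock:
            while self._count.value == 0:
                self._not_empty.wait()                       # block while empty
            self._count.value -= 1
            value = self._items[self._count.value]           # pop from the top: LIFO
            self._not_full.notify()
            return value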
There is a LIFO queue in the Queue package (queue in Python 3). This isn't exposed in the multiprocessing or multiprocessing.queues modules.
Replacing your line q = mp.Queue() with q = Queue.LifoQueue() and running it prints min, max, and average as exactly -1.
(Also I think you should always get exactly FIFO/LIFO order when getting items from only one thread.)

How to synchronize python lists?

I have different threads, and after processing they put data in a common list. Is there anything built into Python so that a list or a numpy array can be accessed by only a single thread at a time? Secondly, if there is not, what is an elegant way of doing it?
According to Thread synchronisation mechanisms in Python, reading a single item from a list and modifying a list in place are guaranteed to be atomic. If this is right (although it seems to be partially contradicted by the very existence of the Queue module), then if your code is all of the form:
try:
    val = mylist.pop()
except IndexError:
    pass  # wait for a while or exit
else:
    pass  # process val
And if everything put into mylist is done by .append(), then your code is already threadsafe. If you don't trust that one document on that score, use a Queue.Queue, which does all the synchronisation for you, and has a better API than list for concurrent programs - particularly, it gives you the option of blocking indefinitely, or for a timeout, waiting for .get() to work if you don't have anything else the thread could be getting on with in the meantime.
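A minimal sketch of that Queue-based producer/consumer pattern (Python 2 module names; the producer/consumer functions are illustrative):
import threading
import Queue

q = Queue.Queue()

def producer():
    for i in range(10):
        q.put(i)                       # thread-safe "append"

def consumer():
    while True:
        try:
            val = q.get(timeout=1.0)   # block up to 1s waiting for an item
        except Queue.Empty:
            break                      # nothing arrived in time; give up
        print val                      # process val here

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()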
For numpy arrays, and in general any case where you need more than a producer/consumer queue, use a Lock or RLock from threading - these implement the context manager protocol, so using them is quite simple:
with mylock:
    pass  # process as necessary
And Python will guarantee that the lock gets released once you fall off the end of the with block - including in tricky cases like when something you do raises an exception.
Finally, consider whether multiprocessing is a better fit for your application than threading - threads in Python aren't guaranteed to actually run concurrently, and in CPython they can only do so if they drop into C-level code. multiprocessing gets around that issue, but may have some extra overhead - if you haven't already, you should read the docs to determine which one suits your needs better.
threading provides Lock objects if you need to protect an entire critical section, or the Queue module provides a queue that is threadsafe.
How about the standard library Queue?

Communicating end of Queue

I'm learning to use the Queue module, and am a bit confused about how a queue consumer thread can be made to know that the queue is complete. Ideally I'd like to use get() from within the consumer thread and have it throw an exception if the queue has been marked "done". Is there a better way to communicate this than by appending a sentinel value to mark the last item in the queue?
original (most of this has changed; see updates below)
Based on some of the suggestions (thanks!) of Glenn Maynard and others, I decided to roll up a descendant of Queue.Queue that implements a close method. It's available in the form of a primitive (unpackaged) module. I'll clean this up a bit and package it properly when I have a bit more time. For now the module only contains the CloseableQueue class and the Closed exception class. I'm planning to expand it to also include subclasses of Queue.LifoQueue and Queue.PriorityQueue.
It's in a pretty preliminary state currently, which is to say that although it passes its test suite, I haven't actually used it for anything yet. Your mileage may vary. I'll keep this answer updated with exciting news.
The CloseableQueue class differs a bit from Glenn's suggestion in that closing the queue will prevent future puts, but not prevent future gets until the queue is emptied. This made the most sense to me; it seemed like functionality to clear the queue could be added as a separate mixin* that would be orthogonal to the closeability functionality. So basically with CloseableQueue, by closing the queue you indicate that the last element has been put. There's also an option to do this atomically by passing last=True to the final put call. Subsequent calls to put, and subsequent calls to get once the queue is emptied, as well as outstanding blocked calls matching those descriptions, will raise the Closed exception.
This is mostly useful for situations where a single producer is generating data for one or more consumers, but it could also be useful for a multi-multi arrangement where consumers are waiting for a particular item or set of items. In particular it doesn't provide a way to determine that all of a number of producers have finished production. Getting that working would entail the provision of some way to register producers (.open()?), as well as a way to indicate that producer registration is itself closed.
Suggestions and/or code reviews are quite welcome. I haven't written a whole lot of concurrency code, but hopefully the test suite is thorough enough that the fact that the code passes it is an indication of the code's quality, rather than the suite's lack thereof. I was able to reuse a bunch of the code from the Queue module's test suite: the file itself is included in this module and used as a basis for various subclasses and routines, including regression testing. This probably (hopefully) helped to avoid complete ineptitude in the testing department. The code itself just overrides Queue.get and Queue.put with fairly minimal changes, and adds the close and closed methods.
I've sort of intentionally avoided using any new-fangled fanciness like context managers in both the code itself and in the test suite in an effort to keep the code as backwards-compatible as is the Queue module itself, which is considerably backwards indeed. I'll probably add __enter__ and __exit__ methods at some point; otherwise, the contextlib's closing function should be applicable to a CloseableQueue instance.
*: Here I use the term "mixin" loosely. As the Queue module's classes are old-style, mixins would need to be mixed using class factory functions; some restrictions apply; offer void where prohibited by Guido.
update
The CloseableQueue module now provides CloseableLifoQueue and CloseablePriorityQueue classes. I've also added some convenience functions to support iteration. Still need to rework it as a proper package. There's a class factory function to allow for convenient subclassing of other Queue.Queue-derived classes.
update 2
CloseableQueue is now available via PyPI, e.g. with
$ easy_install CloseableQueue
Comments and criticism are welcome, especially from this answer's anonymous downvoter.
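For readers who just want the flavor of the close()/Closed behavior described above, here is a rough, simplified sketch; it is not the actual CloseableQueue code and skips the trickier cases:
import Queue

class Closed(Exception):
    pass

class SimpleCloseableQueue(Queue.Queue):       # hypothetical, simplified name
    def __init__(self, maxsize=0):
        Queue.Queue.__init__(self, maxsize)
        self._closed = False

    def close(self):
        with self.mutex:
            self._closed = True

    def put(self, item, block=True, timeout=None):
        if self._closed:
            raise Closed("queue is closed")
        Queue.Queue.put(self, item, block, timeout)

    def get(self, block=True, timeout=None):
        with self.mutex:
            if self._closed and not self._qsize():
                raise Closed("queue is closed and empty")
        return Queue.Queue.get(self, block, timeout)

# Limitations of this sketch: a consumer already blocked in get() on an empty
# queue is not woken up to raise Closed, and there is no last=True put; the
# real CloseableQueue handles those cases.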
Queues don't inherently have the idea of being complete or done. They can be used indefinitely. To close things up when you are done, you will indeed need to put None or some other magic value at the end and write the logic to check for it, as you described. The ideal way would probably be to subclass the Queue object.
See http://en.wikipedia.org/wiki/Queue_(data_structure) to learn more about queue in general.
A sentinel is a natural way to shut down a queue, but there are a couple of things to watch out for.
First, remember that you may have more than one consumer, so you need to send one sentinel for each running consumer, and guarantee that each consumer consumes only one sentinel, so that every consumer receives its shutdown signal.
Second, remember that Queue defines an interface, and that when possible, code should behave regardless of the underlying Queue. You might have a PriorityQueue, or you might have some other class that exposes the same interface and returns values in some other order.
Unfortunately, it's hard to deal with both of these. To deal with the general case of different queues, a consumer that's shutting down must continue to consume values after receiving its shutdown sentinel until the queue is empty. That means that it may consume another thread's sentinel. This is a weakness of the Queue interface: it should have a Queue.shutdown call to cause an exception to be thrown by all consumers, but that's missing.
So, in practice:
If you're sure you're only ever using a regular Queue, simply send one sentinel per thread, as in the sketch below.
If you may be using a PriorityQueue, ensure that the sentinel has the lowest priority.
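A sketch of the one-sentinel-per-consumer pattern for a plain Queue (names are illustrative):
import threading
import Queue

SENTINEL = object()                 # unique shutdown marker
q = Queue.Queue()
NUM_CONSUMERS = 4

def consumer():
    while True:
        item = q.get()
        if item is SENTINEL:
            break                   # this consumer's shutdown marker; stop
        # ... process item ...

threads = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for t in threads:
    t.start()

for item in range(100):             # produce the actual work
    q.put(item)

for _ in range(NUM_CONSUMERS):      # one sentinel per consumer
    q.put(SENTINEL)

for t in threads:
    t.join()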
A Queue is a FIFO (first in, first out) structure, so remember that the consumer can be faster than the producer. When a consumer thread detects that the queue is empty, it normally takes one of the following actions:
Switch to another thread.
Sleep for a few milliseconds and then check the queue again.
Wait on an event (such as a new message arriving in the queue).
If you want the consumer threads to terminate once the job is complete, then put a sentinel value in the queue to end the task.
The best practice way of doing this would be to have the queue itself notify a client that it has reached the 'done' state. The client can then take any action that is appropriate.
What you have suggested, periodically checking the queue to see if it is done, would be highly undesirable. Polling is an antipattern in multithreaded programming; you should always use notifications.
EDIT:
So you're saying that the queue itself knows that it's 'done' based on some criteria and needs to notify the clients of that fact. I think you are correct, and the best way to do this is by throwing when a client calls get() and the queue is in the done state. If you throw, that negates the need for a sentinel value on the client side. Internally the queue can detect that it is 'done' in any way it pleases, e.g. the queue is empty, its state was set to done, etc. I don't see any need for a sentinel value.
