I'm using Python's multiprocessing to analyse some large texts. After some days of trying to figure out why my code was hanging (i.e. the processes didn't end), I was able to reproduce the problem with the following simple code:
import multiprocessing as mp

def func(output, y):
    output.put("a" * y)

if __name__ == "__main__":
    for y in range(65500, 65600):
        print(y)
        output = mp.Queue()
        process = mp.Process(target=func, args=(output, y))
        process.start()
        process.join()
As you can see, if the item to put in the queue gets too large, the process just hangs.
It doesn't freeze: if I write more code after output.put(), that code runs, but the process still never ends.
This starts happening when the string reaches about 65500 characters; the exact threshold may vary depending on your interpreter and platform.
I was aware that mp.Queue has a maxsize argument, but from some searching I found out that it limits the number of items in the Queue, not the size of the items themselves.
Is there a way around this?
The data I need to put inside the Queue in my original code is very very large...
Your queue fills up with no consumer to empty it.
From the definition of Queue.put:
If the optional argument block is True (the default) and timeout is None (the default), block if necessary until a free slot is available.
Assuming there is no deadlock possible between producer and consumer (and assuming your original code does have a consumer, since your sample doesn't), eventually the producers should be unlocked and terminate. Check the code of your consumer (or add it to the question, so we can have a look).
Update
This is not the problem, because queue has not been given a maxsize so put should succeed until you run out of memory.
This is not the behavior of Queue. As elaborated in this ticket, the part blocking here is not the queue itself, but the underlying pipe. From the linked resource (inserts between "[]" are mine):
A queue works like this:
- when you call queue.put(data), the data is added to a deque, which can grow and shrink forever
- then a thread pops elements from the deque, and sends them so that the other process can receive them through a pipe or a Unix socket (created via socketpair). But, and that's the important point, both pipes and unix sockets have a limited capacity (used to be 4k - pagesize - on older Linux kernels for pipes, now it's 64k, and between 64k-120k for unix sockets, depending on tunable sysctls).
- when you do queue.get(), you just do a read on the pipe/socket
[..] when size [becomes too big] the writing thread blocks on the write syscall.
And since a join is performed before dequeuing the item [note: that's your process.join], you just deadlock, since the join waits for the sending thread to complete, and the write can't complete since the pipe/socket is full!
If you dequeue the item before waiting for the submitter process, everything works fine.
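To illustrate that last point with the question's own example, here is a minimal sketch in which the parent dequeues the item before joining (the 100000-character string is just an arbitrary size larger than the pipe buffer):

import multiprocessing as mp

def func(output):
    output.put("a" * 100000)  # an item larger than the pipe/socket buffer

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=func, args=(output,))
    process.start()
    result = output.get()   # dequeue first, so the feeder thread can finish writing
    process.join()          # now the join returns instead of deadlocking
    print(len(result))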
Update 2
I understand. But I don't actually have a consumer (if it is what I think it is); I will only get the results from the queue when the process has finished putting them into it.
Yeah, this is the problem. The multiprocessing.Queue is not a storage container. You should use it exclusively for passing data between "producers" (the processes that generate the data that enters the queue) and "consumers" (the processes that "use" that data). As you now know, leaving the data there is a bad idea.
How can I get an item from the queue if I cannot even put it there first?
put and get hide away the details of chunking the data through the pipe, so you only need to set up a loop in your "main" process that gets items out of the queue and, for example, appends them to a list. The list lives in the memory space of the main process and does not clog the pipe.
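A minimal sketch of that loop, assuming a hypothetical producer that sends several large items followed by a None sentinel:

import multiprocessing as mp

def producer(output, n_items):
    for _ in range(n_items):
        output.put("a" * 100000)  # many large items
    output.put(None)              # sentinel: nothing more to send

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=producer, args=(output, 10))
    process.start()
    results = []
    while True:
        item = output.get()       # consume while the producer is still running
        if item is None:
            break
        results.append(item)      # stored in the main process's memory, not the pipe
    process.join()                # safe: the pipe has been drained
    print(len(results))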
Related
I am trying to run a function func which takes a list of indices as its argument and processes the data.
def func(rng):
    # ... some processing ...
    write_csv_to_disk(processed_data[rng], mode="a")

import multiprocessing

pool = multiprocessing.Pool(4)
pool.map(func, list_of_lists_of_indices)
pool.close()
The function saves the partial DataFrame (the rows for those indices), processed in parallel, to a file in append mode. The code runs well for all the sub-lists of list_of_lists_of_indices except the last one: the data for the indices in the last list is not saved to my file, and then the pool is closed.
list_of_lists_of_indices = [[0,1,2,3,4,.....,99999],[100000,100001,100002,100003,100004,......,199999],.....,[10000000,10000001,...,100000895]]
import multiprocessing

pool = multiprocessing.Pool(4)
pool.map(func, iterable=list_of_lists_of_indices)
pool.close()
Well you're not saying what write_csv_to_disk does, but there seem to be a few possible issues here:
1. you have multiple processes writing to the same file at the same time, and that really can't go well unless you're taking specific steps (e.g. a lockfile) to avoid them overwriting one another
2. the symptoms you're describing look a lot like you're not properly closing your file objects, relying on the garbage collector to do that and flush your buffers; on the last iteration it's possible that e.g. the worker dies before the GC runs, so the file is not closed and its buffer is not flushed to disk
3. while the results of a Pool.map are returned in order (at great expense), there's no guarantee as to what order the tasks execute in. Since it's the workers doing the writing to disk, there is no reason for the writes to be ordered. I don't even see why you're using map; the entire point of map is to return computation results, which you're not doing here
You should not be using Pool.map, and you should not be "saving to a file in append mode".
Also note that Pool.close means you're not going to give new work to the pool; it doesn't wait for the workers to be done. In theory that shouldn't matter if you're only using synchronous methods, but in this case, and given point 2 above, it might be a problem: when the parent process exits, the Pool probably gets garbage-collected, which means it hard-shuts down the pool workers.
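One alternative, consistent with the points above, is to have the workers return their processed chunks and let the parent do all of the file writing. A hedged sketch; process_indices, the sample index lists, and the output path are placeholders for illustration, not the asker's actual code:

import multiprocessing

def process_indices(rng):
    # ... some processing ...
    return "\n".join(str(i) for i in rng)   # stand-in for the real CSV rows

if __name__ == "__main__":
    list_of_lists_of_indices = [list(range(0, 100)), list(range(100, 200))]
    pool = multiprocessing.Pool(4)
    with open("out.csv", "w") as fh:
        # imap yields results in submission order as they become ready,
        # so only the parent process ever touches the file
        for chunk in pool.imap(process_indices, list_of_lists_of_indices):
            fh.write(chunk + "\n")
    pool.close()
    pool.join()   # wait for the workers to exit before the parent does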
I am creating a multi-threaded program in which I want only one thread at a time to enter the critical section, where it creates a socket and sends some data, while all the other threads wait for that variable to clear.
I tried threading.Event, but later realized that set() notifies all the waiting threads, while I only wanted to notify one.
I tried locks (acquire and release). They suited my scenario well, but I learned that holding a lock for a long time is expensive. After acquiring the lock, my thread performs many functions and therefore ends up holding the lock for a long time.
Now I am trying threading.Condition. I just want to know whether acquiring and holding the condition for a long time is bad practice, since it also uses locks.
Can anyone suggest a better approach to my problem?
I would use an additional thread dedicated to sending. Use a Queue where the other threads put their Send-Data. The socket-thread gets items from the queue in a loop and sends them one after the other.
As long as the queue is empty, .get blocks and the send-thread sleeps.
The "producer" threads have no waiting time at all, they just put their data in the queue and continue.
There is no concern about possible deadlock conditions.
To stop the send-thread, put some special item (e.g. None) in the queue.
To enable returning of values, put a tuple (send_data, return_queue) in the send-queue. When a result is ready, return it by putting it in the return_queue.
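A minimal sketch of that sender-thread pattern, using the standard queue module; the actual socket work is replaced by a placeholder:

import threading
import queue

send_q = queue.Queue()

def sender():
    # dedicated send thread: the only thread entering the critical section
    while True:
        item = send_q.get()
        if item is None:                      # sentinel: stop the thread
            break
        data, return_q = item
        result = "sent %d bytes" % len(data)  # placeholder for the real socket send
        return_q.put(result)

send_thread = threading.Thread(target=sender)
send_thread.start()

# any producer thread just enqueues its data and, optionally, waits for the reply
reply_q = queue.Queue()
send_q.put((b"hello", reply_q))
print(reply_q.get())

send_q.put(None)                              # shut the sender down
send_thread.join()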
Suppose we have a multiprocessing.Pool where worker threads share a multiprocessing.JoinableQueue, dequeuing work items and potentially enqueuing more work:
def worker_main(queue):
    while True:
        work = queue.get()
        for new_work in process(work):
            queue.put(new_work)
        queue.task_done()
When the queue fills up, queue.put() will block. As long as there is at least one process reading from the queue with queue.get(), it will free up space in the queue to unblock the writers. But all of the processes could potentially block at queue.put() at the same time.
Is there a way to avoid getting jammed up like this?
Depending on how often process(work) creates more items, there may be no solution besides a queue with an infinite maximum size.
In short, your queue must be large enough to accommodate the entire backlog of work items that you can have at any time.
Since the queue is implemented with semaphores, there may indeed be a hard size limit of SEM_VALUE_MAX, which on macOS is 32767. If that's not enough, you'll need to subclass that implementation or use put(..., block=False) and handle queue.Full (e.g. put excess items somewhere else).
Alternatively, look at one of the 3rd-party implementations of distributed work item queue for Python.
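For the non-blocking workaround mentioned above, a rough sketch might look like this; process is the asker's own function from the question, and the process-local overflow list is an illustrative assumption:

import queue as stdlib_queue  # multiprocessing queues raise queue.Full

def worker_main(work_queue):
    overflow = []  # process-local backlog used when the shared queue is full
    while True:
        work = work_queue.get()
        for new_work in process(work):            # process() as in the question
            try:
                work_queue.put(new_work, block=False)
            except stdlib_queue.Full:
                overflow.append(new_work)          # park it locally, don't block
        # opportunistically push the local backlog back into the shared queue
        while overflow:
            try:
                work_queue.put(overflow[-1], block=False)
            except stdlib_queue.Full:
                break
            overflow.pop()
        work_queue.task_done()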
Does anyone know a clean way to get near-LIFO or even not near-FIFO (e.g. random) behavior from multiprocessing.Queue?
Alternative Question: Could someone point me to the code for the thread that manages the actual storage structure behind multiprocessing.Queue? It seems like it would be trivial within that to provide approximately LIFO access, but I got lost in the rabbit hole trying to find it.
Notes:
I believe multiprocessing.Queue does not guarantee order. Fine. But it is near-FIFO so near-LIFO would be great.
I could pull all the current items off the queue and reverse the order before working with them, but I prefer to avoid a kludge if possible.
(edit) To clarify: I am doing a CPU bound simulation with multiprocessing and so can't use the specialized queues from Queue. Since I haven't seen any answers for a few days, I've added the alternative question above.
In case it is an issue, below is slight evidence that multiprocessing.Queue is near-FIFO. It just shows that in a simple case (a single thread), it is perfectly FIFO on my system:
import multiprocessing as mp
import Queue

q = mp.Queue()
for i in xrange(1000):
    q.put(i)

deltas = []
while True:
    try:
        value1 = q.get(timeout=0.1)
        value2 = q.get(timeout=0.1)
        deltas.append(value2 - value1)
    except Queue.Empty:
        break

# positive deltas would indicate the numbers are coming out in increasing order
min_delta, max_delta = min(deltas), max(deltas)
avg_delta = sum(deltas) / len(deltas)

print "min", min_delta
print "max", max_delta
print "avg", avg_delta
prints: min, max, and average are exactly 1 (perfect FIFO)
I've looked over the Queue class that lives in Lib/multiprocessing/queues.py in my Python installation (Python 2.7, but nothing obvious is different in the version from Python 3.2 that I briefly checked). Here's how I understand it works:
There are two sets of objects maintained by the Queue object. One set consists of multiprocess-safe primitives that are shared by all processes. The other is created and used separately by each process.
The cross-process objects are set up in the __init__ method:
A Pipe object, whose two ends are saved as self._reader and self._writer.
A BoundedSemaphore object, which counts (and optionally limits) how many objects are in the queue.
A Lock object for reading the Pipe, and on non-Windows platforms another for writing. (I assume that this is because writing to a pipe is inherently multiprocess-safe on Windows.)
The per-process objects are set up in the _after_fork and _start_thread methods:
A collections.deque object used to buffer writes to the Pipe.
A threading.Condition object used to signal when the buffer is not empty.
A threading.Thread object that does the actual writing. It is created lazily, so it won't exist until at least one write to the Queue has been requested in a given process.
Various Finalize objects that clean stuff up when the process ends.
A get from the queue is pretty simple. You acquire the read lock, decrement the semaphore, and grab an object from the read end of the Pipe.
A put is more complicated. It uses multiple threads. The caller to put grabs the condition's lock, then adds its object to the buffer and signals the condition before unlocking it. It also increments the semaphore and starts up the writer thread if it isn't running yet.
The writer thread loops forever (until canceled) in the _feed method. If the buffer is empty, it waits on the notempty condition. Then it takes an item from the buffer, acquires the write lock (if it exists) and writes the item to the Pipe.
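Putting those pieces together, here is a heavily simplified, hypothetical sketch of that mechanism (not the actual stdlib code; it omits the semaphore, the read/write locks, pickling, and cleanup):

import collections
import threading

class SimplifiedQueue(object):
    def __init__(self, reader, writer):
        # reader/writer are the two ends of a multiprocessing.Pipe()
        self._reader, self._writer = reader, writer
        self._buffer = collections.deque()
        self._notempty = threading.Condition()
        self._feeder = threading.Thread(target=self._feed)
        self._feeder.daemon = True
        self._feeder.start()

    def put(self, obj):
        # never blocks: only appends to the local buffer and wakes the feeder
        with self._notempty:
            self._buffer.append(obj)
            self._notempty.notify()

    def get(self):
        # may block until the feeder has pushed something through the pipe
        return self._reader.recv()

    def _feed(self):
        # the background writer thread described above
        while True:
            with self._notempty:
                while not self._buffer:
                    self._notempty.wait()
                obj = self._buffer.popleft()
            self._writer.send(obj)  # this send() is the call that can block

Wiring its reader and writer to the two ends of a Pipe gives roughly the put/feeder/get flow described above, minus all the safety machinery the real implementation adds.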
So, given all of that, can you modify it to get a LIFO queue? It doesn't seem easy. Pipes are inherently FIFO objects, and while the Queue can't guarantee FIFO behavior overall (due to the asynchronous nature of the writes from multiple processes) it is always going to be mostly FIFO.
If you have only a single consumer, you could get objects from the queue and add them to your own process-local stack. It would be harder to do a multi-consumer stack, though with shared memory a bounded-size stack wouldn't be too hard. You'd need a lock, a pair of conditions (for blocking/signaling on full and empty states), a shared integer value (for the number of values held) and a shared array of an appropriate type (for the values themselves).
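For the single-consumer case, a minimal sketch of that process-local stack might look like this (written in the Python 2 style of this question; on Python 3 the Empty exception lives in the queue module):

import Queue  # only needed for the Empty exception

def drain_to_stack(q, stack):
    # move everything currently sitting in the multiprocessing queue
    # onto a process-local list used as a stack
    while True:
        try:
            stack.append(q.get(block=False))
        except Queue.Empty:
            break

def get_lifo(q, stack):
    # return the most recently received item (LIFO), refilling first
    drain_to_stack(q, stack)
    return stack.pop()  # raises IndexError if nothing has arrived yet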
There is a LIFO queue in the Queue module (queue in Python 3). This isn't exposed in the multiprocessing or multiprocessing.queues modules.
Replacing your line q = mp.Queue() with q = Queue.LifoQueue() and running prints: min, max and average as exactly -1.
(Also I think you should always get exactly FIFO/LIFO order when getting items from only one thread.)
I'm creating a Python script which accepts a path to a remote file and a number of threads n. The file's size will be divided by the number of threads, and when each thread completes I want it to append the data it fetched to a local file.
How do I manage it so that the threads append to the local file in the order in which they were generated, so the bytes don't get scrambled?
Also, what if I want to download several files simultaneously?
You could coordinate the work with locks &c, but I recommend instead using Queue -- usually the best way to coordinate multi-threading (and multi-processing) in Python.
I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate between performance, and load on the remote server, by experimenting); every worker thread waits at the same global Queue.Queue instance, call it workQ for example, for "work requests" (wr = workQ.get() will do it properly -- each work request is obtained by a single worker thread, no fuss, no muss).
A "work request" can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), offset from which it is requested to get data from it, number of bytes to get from it (note that this works just as well for one or multiple files ot fetch).
The main thread pushes all work requests to the workQ (just workQ.put((url, offset, numbytes)) for each request) and waits for results to come to another Queue instance, call it resultQ (each result will also be a triple: identifier of the file, starting offset, string of bytes that are the results from that file at that offset).
As each working thread satisfies the request it's doing, it puts the results into resultQ and goes back to fetch another work request (or wait for one). Meanwhile the main thread (or a separate dedicated "writing thread" if needed -- i.e. if the main thread has other work to do, for example on the GUI) gets results from resultQ and performs the needed open, seek, and write operations to place the data at the right spot.
There are several ways to terminate the operation: for example, a special work request may be asking the thread receiving it to terminate -- the main thread puts on workQ just as many of those as there are working threads, after all the actual work requests, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, having the worker threads daemonic so they just go away when the main thread terminates, and so forth).
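A hedged sketch of that arrangement; the byte-fetching function, URL, and file name are placeholder assumptions, and only the workQ/resultQ wiring is the point:

import threading
import queue

workQ = queue.Queue()
resultQ = queue.Queue()

def fetch_range(url, offset, numbytes):
    # placeholder: a real implementation would issue an HTTP Range request
    return b"\x00" * numbytes

def worker():
    while True:
        wr = workQ.get()
        if wr is None:                 # termination request
            break
        url, offset, numbytes = wr
        resultQ.put((url, offset, fetch_range(url, offset, numbytes)))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

requests = [("http://example.com/file", i * 1024, 1024) for i in range(8)]
for wr in requests:
    workQ.put(wr)
for _ in threads:
    workQ.put(None)                    # one termination request per worker

with open("local_copy", "wb") as out:  # placeholder output file
    for _ in requests:
        url, offset, data = resultQ.get()
        out.seek(offset)               # place the chunk at its right spot
        out.write(data)

for t in threads:
    t.join()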
You need to fetch completely separate parts of the file on each thread. Calculate the chunk start and end positions based on the number of threads. Each chunk must have no overlap obviously.
For example, if the target file were 3000 bytes long and you want to fetch it using three threads:
Thread 1: fetches bytes 1 to 1000
Thread 2: fetches bytes 1001 to 2000
Thread 3: fetches bytes 2001 to 3000
You would pre-allocate an empty file of the original size, and write back to the respective positions within the file.
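A small sketch of that chunking and pre-allocation, under the assumption that the helper names here are illustrative rather than the asker's:

def chunk_ranges(total_size, n_threads):
    # compute non-overlapping [start, end) byte ranges, one per thread
    chunk = total_size // n_threads
    ranges = []
    for i in range(n_threads):
        start = i * chunk
        end = total_size if i == n_threads - 1 else (i + 1) * chunk
        ranges.append((start, end))
    return ranges

def preallocate(path, total_size):
    # create an empty file of the final size so threads can write into place
    with open(path, "wb") as fh:
        fh.truncate(total_size)

print(chunk_ranges(3000, 3))   # [(0, 1000), (1000, 2000), (2000, 3000)]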
You can use a thread safe "semaphore", like this:
class Counter:
    counter = 0

    @classmethod
    def inc(cls):
        n = cls.counter = cls.counter + 1  # atomic increment and assignment
        return n
Using Counter.inc() returns an incremented number across threads, which you can use to keep track of the current block of bytes.
That being said, there's no need to split a single file download across several threads, because the download is far slower than writing to disk, so one write will always finish before the next chunk has been downloaded.
The best and least resource-hungry way is simply to have the download file descriptor linked directly to a file object on disk.
for "download several files simultaneously", I recommond this article: Practical threaded programming with Python . It provides a simultaneously download related example by combining threads with Queues, I thought it's worth a reading.