multiprocessing.Pool not running on last element of iterable - python

I am trying to run a function func which takes in a list of indices as argument and process the data.
def func(rng):
    # **some processing**
    write_csv_to_disk(processed_data[rng], mode="a")

import multiprocessing
pool = multiprocessing.Pool(4)
pool.map(func, list_of_lists_of_indices)
pool.close()
The function appends the partial DataFrame (the rows for the given indices), processed in parallel, to a file. The code runs fine for every sub-list of list_of_lists_of_indices except the last one: the data for the indices in the last list is never saved to my file, and the pool is closed.
list_of_lists_of_indices = [[0,1,2,3,4,.....,99999],[100000,100001,100002,100003,100004,......,199999],.....,[10000000,10000001,...,100000895]]
import multiprocessing
pool = multiprocessing.Pool(4)
pool.map(func, iterable=list_of_lists_of_indices)
pool.close()

Well, you're not saying what write_csv_to_disk does, but there seem to be a few possible issues here:
- you have multiple processes writing to the same file at the same time, and that really can't go well unless you take specific steps (e.g. a lockfile) to avoid them overwriting one another
- the symptoms you're describing look a lot like you're not properly closing your file objects, relying on the garbage collector to close them and flush their buffers; on the last iteration it's possible that e.g. the worker dies before the GC runs, so the file is never closed and its buffer is never flushed to disk
- also, while the results of a Pool.map are in order (at great expense), there's no guarantee as to what order they'll execute in. Since it's the workers doing the writing to disk, there is no reason for those writes to be ordered. I don't even see why you're using map: the entire point of map is to return computation results, which you're not doing here
You should not be using Pool.map, and you should not be "saving to a file in append mode".
Also note that Pool.close only means you're not going to give new work to the pool; it doesn't wait for the workers to be done. Now in theory that should not matter if you're only using sync methods, however in this case, and given (2), that might be a problem: when the parent process exits, the Pool probably gets garbage-collected, which means it hard-shuts down the pool workers.
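One way to address the single-writer and buffering points above is sketched below; this is not necessarily what the answerer had in mind, and process_chunk and results.csv are made-up names. Each worker returns its processed rows, and the parent, as the only writer, appends everything to the file once map has gathered the results in order.

import csv
import multiprocessing

def process_chunk(indices):
    # Hypothetical processing: build and return the rows for this chunk
    # instead of writing them from inside the worker.
    return [(i, i * 2) for i in indices]

if __name__ == "__main__":
    list_of_lists_of_indices = [range(0, 5), range(5, 10)]  # toy data
    with multiprocessing.Pool(4) as pool:
        chunks = pool.map(process_chunk, list_of_lists_of_indices)
    # Single writer: the parent appends everything after the pool is done,
    # so nothing interleaves and nothing is lost in an unflushed buffer.
    with open("results.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for rows in chunks:
            writer.writerows(rows)

Note that the Pool context manager calls terminate() on exit, which is fine here because map has already returned by then.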

Related

How to run a python multiprocessing pool without closing

I am trying to run multiple copies of a Bert model simultaneously.
I have a python object which holds a pool:
self.tokenizer = BertTokenizer.from_pretrained(BERT_LARGE)
self.model = BertForQuestionAnswering.from_pretrained(BERT_LARGE)
self.pool = Pool(processes=max_processes,
                 initializer=pool_init,
                 initargs=(self.model, self.tokenizer))
Each process in the pool copies across a Bert tokenizer and model:
process_model = None
process_tokenizer = None
def pool_init(m: BertForQuestionAnswering, t: BertTokenizer):
    global process_model, process_tokenizer
    process_model, process_tokenizer = m, t
To use the pool, I then run
while condition:
    answers = self.pool.map(answer_func, questions)
    condition = check_condition(answers)
This design is in order to avoid the large overhead of reloading the Bert model into each process each time the pool is initialized (which takes about 1.5-2 seconds per process).
Question 1. Is this the best way of doing this?
Question 2. If so, when am I supposed to call self.pool.close() and self.pool.join()? I want to join() before the check_condition() function, but I don't really ever want to close() the pool (except perhaps in the __del__() of the object). However, calling join() before calling close() gives me errors, and calling close() makes the pool unusable afterwards. Is Pool just not meant for this kind of job, and should I manage an array of processes instead? Help...?
Thanks!!
You said, "This design is in order to avoid the large overhead of reloading the Bert model into each process each time the pool is initialized (which takes about 1.5-2 seconds per process)." Your statement and the small amount of code you showed does not quite make perfect sense to me. I think it's a question of terminology.
First, I don't see where the pool is being initialized multiple times; I only see one instance of creating the pool:
self.pool = Pool(processes=max_processes,
                 initializer=pool_init,
                 initargs=(self.model, self.tokenizer))
But if you are creating the pool multiple times, then with your current design you are in fact using the pool_init function to reload the Bert model into each process of the pool every time the pool is created, and you are not avoiding what you say you are avoiding. But this can be a good thing. I suspect we are talking about two different things, so I can only explain what is actually going on:
You are invoking the pool.map function potentially multiple times because of your while condition: loop. But, in general, you do want to avoid creating a pool multiple times if you can. Now there are two reasons I can think of for using the initializer and initargs arguments to the Pool constructor as you are doing:
1. If you have read-only data items that your worker function (answer_func in your case) needs to access, then rather than passing these items on each call to that function, it is generally cheaper to initialize global variables of each process in the pool with these data items and have your worker function just access the global variables.
2. Certain data types, for example a multiprocessing.Lock instance, cannot be passed as an argument using any of the multiprocessing.Pool methods and need to be "passed" by using a pool initialization function.
Case 2 does not seem to apply. So if you have 100 questions and a pool size of 8, it is better to pass the model and tokenizer 8 times, once for each process in the pool, rather than 100 times, once for each question.
If you are using method Pool.map, which blocks until all submitted tasks are complete, you can be sure that there are no processes in the pool running any tasks when that method returns. If you will be re-executing the pool creation code, then when you terminate the while condition: loop you should free resources either by calling pool.close() followed by pool.join(), which will wait for the processes in the pool to terminate, or by just calling pool.terminate(), which terminates all the pool processes immediately (and we know they are idle at this point). If you are only creating the pool once, you really do not have to call anything; the processes in the pool are daemon processes, which will terminate when your main process terminates. But if you will be doing further processing after you have no further need for the pool, then to free up resources sooner rather than later, you should do the previously described "cleanup."
Does this make sense?
Additional Note
Since pool.map blocks until all submitted tasks complete, there is no need to call pool.join() just to be sure that the tasks are completed; pool.map will return with a list of all the values returned by your worker function, answer_func.
Where pool.join() can be useful, aside from the freeing of resources I have already mentioned, is when you are issuing one or more pool.apply_async method calls. This method is non-blocking and returns an AsyncResult instance on which you can later issue a get call to block for the completion of the task and retrieve the return value. But if you are not interested in the return value(s) and just need to wait for the completion of the task(s), then as long as you will not need to submit any more tasks to the pool you can simply issue a pool.close() followed by a pool.join(), and at the completion of those two calls you can be sure that all of the submitted tasks have completed (possibly with exceptions).
So putting a call to pool.terminate() in the class's __del__ method is a good idea for general usage.
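Tying that together, here is a rough sketch of the lifecycle described above, not the questioner's actual code; the class name, the answer_func body, and the model/tokenizer arguments are placeholders. The pool is created once with an initializer, reused across map calls, and terminated when the owning object is deleted.

from multiprocessing import Pool

_worker_state = {}

def pool_init(model, tokenizer):
    # Runs once per worker process; later tasks read these globals
    # instead of receiving the model with every question.
    _worker_state["model"] = model
    _worker_state["tokenizer"] = tokenizer

def answer_func(question):
    # Placeholder for the real inference call using _worker_state.
    return len(question)

class AnswerService:
    def __init__(self, model, tokenizer, max_processes=4):
        # The pool (and its workers) is created exactly once.
        self.pool = Pool(processes=max_processes,
                         initializer=pool_init,
                         initargs=(model, tokenizer))

    def run(self, questions):
        # map blocks until every task is done, so no join() is needed here.
        return self.pool.map(answer_func, questions)

    def __del__(self):
        # The pool is idle by now, so terminate() is safe and frees the
        # worker processes promptly.
        self.pool.terminate()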

Python multiprocessing queue makes code hang with large data

I'm using python's multiprocessing to analyse some large texts. After some days trying to figure out why my code was hanging (i.e. the processes didn't end), I was able to recreate the problem with the following simple code:
import multiprocessing as mp

for y in range(65500, 65600):
    print(y)

    def func(output):
        output.put("a" * y)

    if __name__ == "__main__":
        output = mp.Queue()
        process = mp.Process(target=func, args=(output,))
        process.start()
        process.join()
As you can see, if the item to put in the queue gets too large, the process just hangs.
It doesn't freeze: if I write more code after output.put() it will run, but the process still never stops.
This starts happening when the string gets to about 65500 chars; depending on your interpreter the exact number may vary.
I was aware that mp.Queue has a maxsize argument, but doing some search I found out it is about the Queue's size in number of items, not the size of the items themselves.
Is there a way around this?
The data I need to put inside the Queue in my original code is very very large...
Your queue fills up with no consumer to empty it.
From the definition of Queue.put:
If the optional argument block is True (the default) and timeout is None (the default), block if necessary until a free slot is available.
Assuming there is no deadlock possible between producer and consumer (and assuming your original code does have a consumer, since your sample doesn't), eventually the producers should be unblocked and terminate. Check the code of your consumer (or add it to the question, so we can have a look).
Update
"This is not the problem, because queue has not been given a maxsize so put should succeed until you run out of memory."
This is not the behavior of Queue. As elaborated in this ticket, the part blocking here is not the queue itself, but the underlying pipe. From the linked resource (inserts between "[]" are mine):
A queue works like this:
- when you call queue.put(data), the data is added to a deque, which can grow and shrink forever
- then a thread pops elements from the deque, and sends them so that the other process can receive them through a pipe or a Unix socket (created via socketpair). But, and that's the important point, both pipes and Unix sockets have a limited capacity (used to be 4k - pagesize - on older Linux kernels for pipes, now it's 64k, and between 64k-120k for Unix sockets, depending on tunable sysctls).
- when you do queue.get(), you just do a read on the pipe/socket
[..] when size [becomes too big] the writing thread blocks on the write syscall.
And since a join is performed before dequeing the item [note: that's your process.join], you just deadlock, since the join waits for the sending thread to complete, and the write can't complete since the pipe/socket is full!
If you dequeue the item before waiting for the submitter process, everything works fine.
Update 2
"I understand. But I don't actually have a consumer (if it is what I'm thinking it is), I will only get the results from the queue when process has finished putting it into the queue."
Yeah, this is the problem. The multiprocessing.Queue is not a storage container. You should use it exclusively for passing data between "producers" (the processes that generate the data that enters the queue) and "consumers" (the processes that "use" that data). As you now know, leaving the data there is a bad idea.
"How can I get an item from the queue if I cannot even put it there first?"
put and get hide away the problem of putting together the data if it fills up the pipe, so you only need to set up a loop in your "main" process to get items out of the queue and, for example, append them to a list. The list is in the memory space of the main process and does not clog the pipe.
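A minimal sketch of that loop, adapted from the question's example (the chunk size and the None sentinel are illustrative): the parent drains the queue before calling join(), so the child's feeder thread can flush the pipe and exit.

import multiprocessing as mp

def func(output, n):
    # Put several large strings; the feeder thread blocks once the pipe
    # fills up, until the parent starts reading.
    for _ in range(n):
        output.put("a" * 100000)
    output.put(None)  # sentinel: nothing more is coming

if __name__ == "__main__":
    output = mp.Queue()
    process = mp.Process(target=func, args=(output, 10))
    process.start()

    results = []
    while True:
        item = output.get()      # drain BEFORE join()
        if item is None:
            break
        results.append(item)

    process.join()               # safe now: the pipe has been emptied
    print(len(results), "items received")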

Is it possible to avoid locking overhead when sharing dicts between threads in Python?

I have a multi-threaded application in Python in which threads read very large dicts (loaded from disk and never modified) that are too big to copy into thread-local storage. The threads then process huge amounts of data using the dicts as read-only input:
# single threaded
d1,d2,d3 = read_dictionaries()
while line in stdin:
stdout.write(compute(line,d1,d2,d3)+line)
I am trying to speed this up by using threads, which would then each read its own input and write its own output, but since the dicts are huge, I want the threads to share the storage.
IIUC, every time a thread reads from the dict, it has to lock it, and that imposes a performance cost on the application. This data locking is not necessary because the dicts are read-only.
Does CPython actually lock the data individually or does it just use the GIL?
If, indeed, there is per-dict locking, is there a way to avoid it?
Multithreaded processing in Python is of little use for this; it's better to use the multiprocessing module, because multithreading gives a positive effect in only a small number of (mostly I/O-bound) cases.
Python implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Official documentation.
Without any code examples from your side, I can only recommend splitting your big dictionary into several parts, processing every part using Pool.map, and merging the results in the main process.
Unfortunately, it's impossible to share a lot of memory between different Python processes effectively (we are not talking about the shared-memory pattern based on mmap). But you can read different parts of your dictionary in different processes, or just read the entire dictionary in the main process and hand small chunks to the child processes.
Also, I should warn you to be very careful with multiprocessing algorithms, because every extra megabyte will be multiplied by the number of processes.
So, based on your pseudocode example, I can assume two possible algorithms depending on what your compute function does:
# "Stateless"
for line in stdin:
res = compute_1(line) + compute_2(line) + compute_3(line)
print res, line
# "Shared" state
for line in stdin:
res = compute_1(line)
res = compute_2(line, res)
res = compute_3(line, res)
print res, line
In the first case, you can create several workers, read each dictionary in a separate worker based on the Process class (which is a good way to decrease memory usage per process), and push lines through them like a production line.
In the second case, you have shared state: each worker needs the result of the previous one. That's the worst case for multithreading/multiprocessing programming. But you can write the algorithm so that several workers use the same Queue and push results onto it without waiting for the whole cycle to finish; you just share the Queue instance between your processes.
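A hedged sketch of that production-line arrangement, assuming three illustrative compute_* stages (their bodies are placeholders, not the questioner's functions): each stage runs in its own Process, stages are connected by shared Queue instances, and a None sentinel shuts the line down.

from multiprocessing import Process, Queue

def stage(compute, in_q, out_q):
    # Pull an item, apply this stage's work, and push the partial result
    # to the next stage without waiting for the whole input to finish.
    while True:
        item = in_q.get()
        if item is None:          # sentinel: propagate shutdown
            out_q.put(None)
            break
        line, res = item
        out_q.put((line, compute(line, res)))

def compute_1(line, res): return res + len(line)   # placeholder stages
def compute_2(line, res): return res * 2
def compute_3(line, res): return res - 1

if __name__ == "__main__":
    q0, q1, q2, q3 = Queue(), Queue(), Queue(), Queue()
    stages = [Process(target=stage, args=(f, i, o))
              for f, i, o in [(compute_1, q0, q1),
                              (compute_2, q1, q2),
                              (compute_3, q2, q3)]]
    for p in stages:
        p.start()
    for line in ["foo\n", "bar\n"]:
        q0.put((line, 0))
    q0.put(None)
    # Drain the final queue before joining, for the reasons discussed
    # in the previous question.
    while True:
        item = q3.get()
        if item is None:
            break
        line, res = item
        print(res, line, end="")
    for p in stages:
        p.join()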

Where to write parallelized program output to?

I have a program that is using pool.map() to get the values using ten parallel workers. I'm having trouble wrapping my head around how I am supposed to stitch the values back together to make use of them at the end.
What I have is structured like this:
initial_input = get_initial_values()
pool.map(function, initial_input)
pool.close()
pool.join()
# now how would I get the output?
send_ftp_of_output(output_data)
Would I write the function to a log file? If so, if there are (as a hypothetical) a million processes trying to write to the same file, would things overwrite each other?
pool.map(function,input)
returns a list.
You can get the output by doing:
output_data = pool.map(function,input)
pool.map simply runs the map operation in parallel, but it still only returns a single list. If you're not outputting anything from inside the function you are mapping (and you shouldn't be), then it simply returns a list. This is the same as map() would do, except it is executed in parallel.
In regards to the log file, yes, having multiple threads write to the same place would interleave entries within the log file. You could have the thread lock the file before the write, which would ensure that an entry wouldn't get interrupted mid-way, but entries would still interleave chronologically amongst all the threads. Locking the log file on each write would also significantly slow down logging due to the overhead involved.
You can also have, say, the thread number -- %(thread)d -- or some other identifying mark in the logging Formatter output that would help to differentiate, but it could still be hard to follow, especially for a bunch of threads.
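For example, a minimal sketch of such a Formatter pattern (the filename and message are illustrative):

import logging

# Include the worker's identity in each record so interleaved entries
# can at least be told apart when reading the shared log.
logging.basicConfig(
    filename="workers.log",
    format="%(asctime)s [pid=%(process)d thread=%(thread)d] %(message)s",
    level=logging.INFO,
)
logging.info("processed one chunk")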
Not sure if this would work in your specific application, as the specifics in your app may preclude it, however, I would strongly recommend considering GNU Parallel (http://www.gnu.org/software/parallel/) to do the parallelized work. (You can use, say, subprocess.check_output to call into it).
The benefit of this is several fold, chiefly that you can easily vary the number of parallel workers -- up to having parallel use one worker per core on the machine -- and it will pipeline the items accordingly. The other main benefit, and the one more specifically related to your question -- is that it will stitch the output of all of these parallel workers together as if they had been invoked serially.
If your program wouldn't work so well with, say, a single command line piped from a file within the app and parallelized, you could perhaps make your Python code single-worker, and then, as the commands piped to parallel, generate a number of permutations of your Python command line (varying the target each time) and have parallel output the results.
I use GNU Parallel quite often in conjunction with Python, often to do things, like, say, 6 simultaneous Postgres queries using psql from a list of 50 items.
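As a rough illustration of calling into GNU Parallel from Python (worker.py and the job count are hypothetical):

import subprocess

items = ["alpha", "beta", "gamma", "delta"]

# Run "python worker.py <item>" for each item, at most 6 jobs at a time.
# -k keeps the combined stdout in input order, as if the jobs ran serially.
output = subprocess.check_output(
    ["parallel", "-k", "-j", "6", "python", "worker.py", ":::", *items],
    text=True,
)
print(output)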
Using Tritlo's suggestion, here is what worked for me:
def run_updates(input_data):
    # do something
    return {data}

if __name__ == '__main__':
    item = iTunes()
    item.fetch_itunes_pulldowns_to_do()
    initial_input_data = item.fetched_update_info
    pool = Pool(NUM_IN_PARALLEL)
    result = pool.map(run_updates, initial_input_data)
    pool.close()
    pool.join()
    print result
And this gives me a list of results

Clean way to get near-LIFO behavior from multiprocessing.Queue? (or even just *not* near-FIFO)

Does anyone know a clean way to get near-LIFO or even not near-FIFO (e.g. random) behavior from multiprocessing.Queue?
Alternative Question: Could someone point me to the code for the thread that manages the actual storage structure behind multiprocessing.Queue? It seems like it would be trivial within that to provide approximately LIFO access, but I got lost in the rabbit hole trying to find it.
Notes:
I believe multiprocessing.Queue does not guarantee order. Fine. But it is near-FIFO so near-LIFO would be great.
I could pull all the current items off the queue and reverse the order before working with them, but I prefer to avoid a kludge if possible.
(edit) To clarify: I am doing a CPU bound simulation with multiprocessing and so can't use the specialized queues from Queue. Since I haven't seen any answers for a few days, I've added the alternative question above.
In case it is an issue, below is slight evidence that multiprocessing.Queue is near-FIFO. It just shows that in a simple case (a single thread), it is perfectly FIFO on my system:
import multiprocessing as mp
import Queue

q = mp.Queue()
for i in xrange(1000):
    q.put(i)

deltas = []
while True:
    try:
        value1 = q.get(timeout=0.1)
        value2 = q.get(timeout=0.1)
        deltas.append(value2 - value1)
    except Queue.Empty:
        break

# positive deltas would indicate the numbers are coming out in increasing order
min_delta, max_delta = min(deltas), max(deltas)
avg_delta = sum(deltas) / len(deltas)
print "min", min_delta
print "max", max_delta
print "avg", avg_delta
prints: min, max, and average are exactly 1 (perfect FIFO)
I've looked over the Queue class that lives in Lib/multiprocessing/queues.py in my Python installation (Python 2.7, but nothing obvious is different in the version from Python 3.2 that I briefly checked). Here's how I understand it works:
There are two sets of objects that are maintained by the Queue object. One set are multiprocess-safe primitives that are shared by all processes. The others are created and used separately by each process.
The cross-process objects are set up in the __init__ method:
A Pipe object, whose two ends are saved as self._reader and self._writer.
A BoundedSemaphore object, which counts (and optionally limits) how many objects are in the queue.
A Lock object for reading the Pipe, and on non-Windows platforms another for writing. (I assume that this is because writing to a pipe is inherently multiprocess-safe on Windows.)
The per-process objects are set up in the _after_fork and _start_thread methods:
A collections.deque object used to buffer writes to the Pipe.
A threading.Condition object used to signal when the buffer is not empty.
A threading.Thread object that does the actual writing. It is created lazily, so it won't exist until at least one write to the Queue has been requested in a given process.
Various Finalize objects that clean stuff up when the process ends.
A get from the queue is pretty simple. You acquire the read lock, decrement the semaphore, and grab an object from the read end of the Pipe.
A put is more complicated. It uses multiple threads. The caller to put grabs the condition's lock, then adds its object to the buffer and signals the condition before unlocking it. It also increments the semaphore and starts up the writer thread if it isn't running yet.
The writer thread loops forever (until canceled) in the _feed method. If the buffer is empty, it waits on the notempty condition. Then it takes an item from the buffer, acquires the write lock (if it exists) and writes the item to the Pipe.
So, given all of that, can you modify it to get a LIFO queue? It doesn't seem easy. Pipes are inherently FIFO objects, and while the Queue can't guarantee FIFO behavior overall (due to the asynchronous nature of the writes from multiple processes) it is always going to be mostly FIFO.
If you have only a single consumer, you could get objects from the queue and add them to your own process-local stack. It would be harder to do a multi-consumer stack, though with shared memory a bounded-size stack wouldn't be too hard. You'd need a lock, a pair of conditions (for blocking/signaling on full and empty states), a shared integer value (for the number of values held) and a shared array of an appropriate type (for the values themselves).
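A rough sketch of the single-consumer workaround (Python 3 module names; consume_near_lifo is a made-up helper): drain whatever is currently sitting in the multiprocessing.Queue into a process-local list and pop from its end, which gives approximately LIFO handling without touching the Queue internals.

import multiprocessing as mp
import queue  # named Queue in Python 2

def consume_near_lifo(q, handle):
    stack = []
    while True:
        # Move whatever is currently queued onto a local stack.
        try:
            stack.append(q.get(timeout=0.1))   # brief wait for a first item
            while True:
                stack.append(q.get_nowait())   # then grab the rest
        except queue.Empty:
            pass
        if not stack:
            break
        # Service the most recently arrived item first, then re-check the
        # queue so fresher items keep jumping ahead of older ones.
        handle(stack.pop())

if __name__ == "__main__":
    q = mp.Queue()
    for i in range(10):
        q.put(i)
    consume_near_lifo(q, print)   # prints 9, 8, ..., 0 here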
There is a LIFO queue in the Queue package (queue in Python 3). This isn't exposed in the multiprocessing or multiprocessing.queues modules.
Replacing your line q = mp.Queue() with q = Queue.LifoQueue() and running prints: min, max and average as exactly -1.
(Also I think you should always get exactly FIFO/LIFO order when getting items from only one thread.)
