I want to do multiple transformations on some data. I figured I can use multiple Pool.imap's because each of the transformations is just a simple map. And Pool.imap is lazy, so it only does computation when needed.
But strangely, it looks like multiple consecutive Pool.imap's are blocking. And not lazy. Look at the following code as an example.
import time
from multiprocessing import Pool
def slow(n):
time.sleep(0.01)
return n*n
for i in [10, 100, 1000]:
with Pool() as p:
numbers = range(i)
iter1 = p.imap(slow, numbers)
iter2 = p.imap(slow, iter1)
start = time.perf_counter()
next(iter2)
print(i, time.perf_counter() - start)
# Prints
# 10 0.0327413540071575
# 100 0.27094774100987706
# 1000 2.6275791430089157
As you can see the time to the first element is increasing. I have 4 cores on my machine, so it roughly takes 2.5 seconds to process 1000 items with a 0.01 second delay. Hence, I think two consecutive Pool.imap's are blocking. And that the first Pool.imap finishes the entire workload before the second one starts. That is not lazy.
I've did some additional research. It does not matter if I use a process pool or a thread pool. It happens with Pool.imap and Pool.imap_unordered. The blocking takes longer when I do a third Pool.imap. A single Pool.imap is not blocking. This bug report seems related but different.
TL;DR imap is not a real generator, meaning it does not generate items on-demand (lazy computation aka similar to coroutine), and pools initiate "jobs" in serial.
longer answer: Every type of submission to a Pool be it imap, apply, apply_async etc.. gets written to a queue of "jobs". This queue is read by a thread in the main process (pool._handle_tasks) in order to allow jobs to continue to be initiated while the main process goes off and does other things. This thread contains a very simple double for loop (with a lot of error handling) that basically iterates over each job, then over each task within each job. The inner loop blocks until a worker is available to get each task, meaning tasks (and jobs) are always started in serial in the exact order they were submitted. This does not mean they will finish in perfect serial, which is why map, and imap collect results, and re-order them back to their original order (handled by pool._handle_resluts thread) before passing back to the main thread.
Rough pseudocode of what's going on:
#task_queue buffers task inputs first in - first out
pool.imap(foo, ("bar", "baz", "bat"), chunksize=1)
#put an iterator on the task queue which will yield "chunks" (a chunk is given to a single worker process to compute)
pool.imap(fun, ("one", "two", "three"), chunksize=1)
#put a second iterator to the task queue
#inside the pool._task_handler thread within the main proces
for task in task_queue: #[imap_1, imap_2]
#this is actually a while loop in reality that tries to get new tasks until the pool is close()'d
for chunk in task:
_worker_input_queue.put(chunk) # give the chunk to the next available worker
# This blocks until a worker actually takes the chunk, meaning the loop won't
# continue until all chunks are taken by workers.
def worker_function(_worker_input_queue, _worker_output_queue):
while True:
task = _worker_input_queue.get() #get the next chunk of tasks
#if task == StopSignal: break
result = task.func(task.args)
_worker_output_queue.put(result) #results are collected, and re-ordered
# by another thread in the main process
# as they are completed.
Related
I am trying to run inference with tensorflow using multiprocessing. Each process uses 1 GPU. I have a list of files input_files[]. Every process gets one file, runs the model.predict on it and writes the results to file. To move on to next file, I need to close the process and restart it. This is because tensorflow doesn't let go of memory. So if I use the same process, I get memory leak.
I have written a code below which works. I start 5 processes, close them and start another 5. The issue is that all processes need to wait for the slowest one before they can move on. How can I start and close each process independent of the others?
Note that Pool.map is over input_files_small not input_files.
file1 --> start new process --> run prediction --> close process --> file2 --> start new process --> etc.
for i in range(0, len(input_files), num_process):
input_files_small = input_files[i:i+num_process]
try:
process_pool = multiprocessing.Pool(processes=num_process, initializer=init_worker, initargs=(gpu_ids))
pool_output = process_pool.map(worker_fn, input_files_small)
finally:
process_pool.close()
process_pool.join()
There is no need to re-create over and over the processing pool. First, specify maxtasksperchild=1 when creating the pool. This should result in creating a new process for each new task submitted. And instead of using method map, use method map_async, which will not block. You can use pool.close followed by pool.join() to wait for these submissions to complete implicitly if your worker function does not return results you need, as follows or use the second code variation:
process_pool = multiprocessing.Pool(processes=num_process, initializer=init_worker, initargs=(gpu_ids), maxtasksperchild=1)
for i in range(0, len(input_files), num_process):
input_files_small = input_files[i:i+num_process]
process_pool.map_async(worker_fn, input_files_small))
# wait for all outstanding tasks to complete
process_pool.close()
process_pool.join()
If you need return values from worker_fn:
process_pool = multiprocessing.Pool(processes=num_process, initializer=init_worker, initargs=(gpu_ids), maxtasksperchild=1)
results = []
for i in range(0, len(input_files), num_process):
input_files_small = input_files[i:i+num_process]
results.append(process_pool.map_async(worker_fn, input_files_small))
# get return values from map_async
pool_outputs = [result.get() for result in results]
# you do not need process_pool.close() and process_pool.join()
But, since there may be some "slow" tasks still running from an earlier invocation of map_async when tasks from a later invocation of map_async start up, some of these tasks may still have to wait to run. But at least all of your processes in the pool should stay fairly busy.
If you are expecting exceptions from your worker function and need to handle them in your main process, it gets more complicated.
I am breaking a very large text file up into smaller chunks, and performing further processing on the chunks. For this example, let text_chunks be a list of lists, each list containing a section of text. The elements of text_chunks range in length from ~50 to ~15000. The class ProcessedText exists elsewhere in the code and does a large amount of subsequent processing and data classification based on the text fed to it. The different text chunks are processed into ProcessedText instances in parallel using code like the following:
def do_things_to_text(a, b):
#pull out necessary things for ProcessedText initialization and return an instance
print('Processing {0}'.format(a))
return ProcessedText(a, b)
import multiprocessing as mp
#prepare inputs for starmap, pairing with list index so order can be reimposed later
pool_inputs = list(enumerate(text_chunks))
#parallel processing
pool = mp.Pool(processes=8)
results = pool.starmap_async(do_things_to_text, pool_inputs)
output = results.get()
The code executes successfully, but it seems that some of the worker processes created as part of the Pool randomly sit idle while the code runs. I track the memory usage, CPU usage, and status in top while the code executes.
At the beginning all 8 worker processes are engaged (status "R" in top and nonzero CPU usage), after ~20 entries from text_chunks are completed, the worker processes start to vary wildly. At times, as few as 1 worker process is running, and the others are in status "S" with zero CPU usage. I can also see from my printed output statements that do_things_to_text() is being called less frequently. So far I haven't been able to identify why the processes start to idle. There are plenty of entries left to process, so them sitting idle leads to time-inefficiency.
My questions are:
Why are these worker processes sitting idle?
Is there a better way to implement multiprocessing that will prevent this?
EDITED to ADD:
I have further characterized the problem. It is clear from the indexes I print out in do_things_to_text() that multiprocessing is dividing the total number of jobs into threads at every tenth index. So my console output shows Job 0, 10, 20, 30, 40, 50, 60, 70 being submitted at the same time (8 processes). And some of the Jobs complete faster than others, so you might see Job 22 completed before you see Job 1 completed.
Up until this first batch of threads is completed, all processes are active with nothing idle. However, when that batch is complete, and Job 80 starts, only one process is active, and the other 7 are idle. I have not confirmed, but I believe it stays like this until the 80-series is complete.
Here are some recommendations for better memory utilization:
I don't know how text_chunks is created but ultimately you end up with 8GB worth of strings in pool_inputs. Ideally, you would have a generator function, for example make_text_chunks, that yields the individual "text chunks" that formerly comprised the text_chunks iterable (if text_chunks is already such a generator expression, then you are all set). The idea is to not create all 8GB worth of data at once but only as the data is needed. With this strategy you can no longer use Pool method starmap_asynch; we will be using Pool.imap. This method, unlike startmap_asynch, will iteratively submit jobs in chunksize chunks and you can process the results as they become available (although that doesn't seem to be an issue).
def make_text_chunks():
# logic goes here to generate the next chunk
yield text_chunk
def do_things_to_text(t):
# t is now a tuple:
a, b = t
#pull out necessary things for ProcessedText initialization and return an instance
print('Processing {0}'.format(a))
return ProcessedText(a, b)
import multiprocessing as mp
# do not turn into a list!
pool_inputs = enumerate(make_text_chunks())
def compute_chunksize(n_jobs, poolsize):
"""
function to compute chunksize as is done by Pool module
"""
if n_jobs == 0:
return 0
chunksize, remainder = divmod(n_jobs, poolsize * 4)
if remainder:
chunksize += 1
return chunksize
#parallel processing
# number of jobs approximately
# don't know exactly without turning pool_inputs into a list, which would be self-defeating
N_JOBS = 300
POOLSIZE = 8
CHUNKSIZE = compute_chunksize(N_JOBS, POOLSIZE)
with mp.Pool(processes=POOLSIZE) as pool:
output = [result for result in pool.imap(do_things_to_text, pool_inputs, CHUNKSIZE)]
I am trying pass each element from my list to a function that is being started on its own thread doing its own work. The problem is if the list has 100+ elements it will start 100 functions() on 100 threads.
For the sake of my computer I want to process the list in batches of 10's with he following steps:
Batch 1 gets queued.
Pass each element from batch1 to the function getting started on its own thread (This way I will only have 10 function threads running at a time)
Once all 10 threads have finished, they get popped off from their queue
Repeat until all batches are done
I was trying to use two lists where first 10 elements gets popped into list2. Process the list2, once the threads are done, pop 10 more elements until list1 reaches length of 0.
I have gotten this far not sure how to proceed.
carsdotcomOptionVal, carsdotcomOptionMakes = getMakes()
second_list = []
threads = []
while len(carsdotcomOptionVal) != 0:
second_list.append(carsdotcomOptionVal.pop(10))
for makesOptions in second_list:
th = threading.Thread(target=getModels, args=[makesOptions])
th.start()
threads.append(th)
for thread in threads:
thread.join()
Lastly the elements from the main list dont have to be even as they can be odd.
You should use a queue.Queue object, which can create a thread-safe list of tasks for other "worker-threads". You can choose how many worker-threads are active, and they would each feed from the list until it's done.
Here's what a sample code looks like with a queue:
import queue
import threading
threads_to_start = 10 # or choose how many you want
my_queue = queue.Queue()
def worker():
while not my_queue.empty():
data = my_queue.get()
do_something_with_data(data)
my_queue.task_done()
for i in range(100):
my_queue.put(i) # replace "i" with whatever data you want for the threads to process
for i in range(threads_to_start):
t = threading.Thread(target=worker, daemon=True) # daemon means that all threads will exit when the main thread exits
t.start()
my_queue.join() # this will block the main thread from exiting until the queue is empty and all data has been processed
Keep in mind this is just a pseudo-code rough start to introduce you to threading and queues, there is more to it than just that, but that example should work in most simple use-cases
This is scalable too - all you have to change if you can support more or less threads is change the number you initially set in threads_to_start
please be warned that this demonstration code generates a few GB data.
I have been using versions of the code below for multiprocessing for some time. It works well when the run time of each process in the pool is similar but if one process takes much longer I end up with many blocked processes waiting on the one, so I'm trying to make it run asynchronously - just for one function at a time.
For example, if I have 70 cores and need to run a function 2000 times I want that to run asynchronously then wait for the last process before calling the next function. Currently it just submits processes in batches of how ever many cores I give it and each batch has to wait for the longest process.
As you can see I've tried using map_async but this is clearly the wrong syntax. Can anyone help me out?
import os
p='PATH/test/'
def f1(tup):
x,y=tup
to_write = x*(y**5)
with open(p+x+str(y)+'.txt','w') as fout:
fout.write(to_write)
def f2(tup):
x,y=tup
print (os.path.exists(p+x+str(y)+'.txt'))
def call_func(f,nos,threads,call):
print (call)
for i in range(0, len(nos), threads):
print (i)
chunk = nos[i:i + threads]
tmp = [('args', no) for no in chunk]
pool.map(f, tmp)
#pool.map_async(f, tmp)
nos=[i for i in range(55)]
threads=8
if __name__ == '__main__':
with Pool(processes=threads) as pool:
call_func(f1,nos,threads,'f1')
call_func(f2,nos,threads,'f2')
map will only return and map_async will only call the callback after all tasks of the current chunk are done.
So you can only either give all tasks to map/map_async at once or use apply_async (initially called threads times) where the callback calls apply_asyncfor the next task.
If the actual return values of the call don't matter (or at least their order doesn't), imap_unordered may be another efficient solution when giving it all tasks at once (or an iterator/generator producing the tasks on demand)
I am trying to implement an online recursive parallel algorithm, which is highly parallelizable. My problem is that my python implementation does not work as I want. I have two 2D matrices where I want to update recursively every column every time a new observation is observed at time-step t.
My parallel code is like this
def apply_async(t):
worker = mp.Pool(processes = 4)
for i in range(4):
X[:,i,np.newaxis], b[:,i,np.newaxis] = worker.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis])).get()
worker.close()
worker.join()
for t in range(p,T):
count = 0
for l in range(p):
for k in range(4):
gn[count]=train[t-l-1,k]
count+=1
G = G*v + gn # gn.T
Gt = (1/(t-p+1))*G
if __name__ == '__main__':
apply_async(t)
The two matrices are X and b. I want to replace directly on master's memory as each process updates recursively only one specific column of the matrices.
Why this implementation is slower than the sequential?
Is there any way to resume the process every time-step rather than killing them and create them again? Could this be the reason it is slower?
The reason is, your program is in practice sequential. This is an example code snippet that is from parallelism standpoint identical to yours:
from multiprocessing import Pool
from time import sleep
def gwork( qq):
print (qq)
sleep(1)
return 42
p = Pool(processes=4)
for q in range(1, 10):
p.apply_async(gwork, args=(q,)).get()
p.close()
p.join()
Run this and you shall notice numbers 1-9 appearing exactly once in a second. Why is this? The reason is your .get(). This means every call to apply_async will in practice block in get() until a result is available. It will submit one task, wait a second emulating processing delay, then return the result, after which another task is submitted to your pool. This means there is no parallel execution ongoing at all.
Try replacing the pool management part with this:
results = []
for q in range(1, 10):
res = p.apply_async(gwork, args=(q,))
results.append(res)
p.close()
p.join()
for r in results:
print (r.get())
You can now see parallelism at work, as four of your tasks are now processed simultaneously. Your loop does not block in get, as get is moved out of the loop and results are received only when they are ready.
NB: If your arguments to your worker or the return values from them are large data structures, you will lose some performance. In practice Python implements these as queues, and transmitting a lot of data via a queue is slow on relative terms compared to getting an in-memory copy of a data structure when a subprocess is forked.