Why do I have idle workers when using Python multiprocessing pools? - python

I am breaking a very large text file up into smaller chunks, and performing further processing on the chunks. For this example, let text_chunks be a list of lists, each list containing a section of text. The elements of text_chunks range in length from ~50 to ~15000. The class ProcessedText exists elsewhere in the code and does a large amount of subsequent processing and data classification based on the text fed to it. The different text chunks are processed into ProcessedText instances in parallel using code like the following:
def do_things_to_text(a, b):
#pull out necessary things for ProcessedText initialization and return an instance
print('Processing {0}'.format(a))
return ProcessedText(a, b)
import multiprocessing as mp
#prepare inputs for starmap, pairing with list index so order can be reimposed later
pool_inputs = list(enumerate(text_chunks))
#parallel processing
pool = mp.Pool(processes=8)
results = pool.starmap_async(do_things_to_text, pool_inputs)
output = results.get()
The code executes successfully, but it seems that some of the worker processes created as part of the Pool randomly sit idle while the code runs. I track the memory usage, CPU usage, and status in top while the code executes.
At the beginning all 8 worker processes are engaged (status "R" in top and nonzero CPU usage), after ~20 entries from text_chunks are completed, the worker processes start to vary wildly. At times, as few as 1 worker process is running, and the others are in status "S" with zero CPU usage. I can also see from my printed output statements that do_things_to_text() is being called less frequently. So far I haven't been able to identify why the processes start to idle. There are plenty of entries left to process, so them sitting idle leads to time-inefficiency.
My questions are:
Why are these worker processes sitting idle?
Is there a better way to implement multiprocessing that will prevent this?
EDITED to ADD:
I have further characterized the problem. It is clear from the indexes I print out in do_things_to_text() that multiprocessing is dividing the total number of jobs into threads at every tenth index. So my console output shows Job 0, 10, 20, 30, 40, 50, 60, 70 being submitted at the same time (8 processes). And some of the Jobs complete faster than others, so you might see Job 22 completed before you see Job 1 completed.
Up until this first batch of threads is completed, all processes are active with nothing idle. However, when that batch is complete, and Job 80 starts, only one process is active, and the other 7 are idle. I have not confirmed, but I believe it stays like this until the 80-series is complete.

Here are some recommendations for better memory utilization:
I don't know how text_chunks is created but ultimately you end up with 8GB worth of strings in pool_inputs. Ideally, you would have a generator function, for example make_text_chunks, that yields the individual "text chunks" that formerly comprised the text_chunks iterable (if text_chunks is already such a generator expression, then you are all set). The idea is to not create all 8GB worth of data at once but only as the data is needed. With this strategy you can no longer use Pool method starmap_asynch; we will be using Pool.imap. This method, unlike startmap_asynch, will iteratively submit jobs in chunksize chunks and you can process the results as they become available (although that doesn't seem to be an issue).
def make_text_chunks():
# logic goes here to generate the next chunk
yield text_chunk
def do_things_to_text(t):
# t is now a tuple:
a, b = t
#pull out necessary things for ProcessedText initialization and return an instance
print('Processing {0}'.format(a))
return ProcessedText(a, b)
import multiprocessing as mp
# do not turn into a list!
pool_inputs = enumerate(make_text_chunks())
def compute_chunksize(n_jobs, poolsize):
"""
function to compute chunksize as is done by Pool module
"""
if n_jobs == 0:
return 0
chunksize, remainder = divmod(n_jobs, poolsize * 4)
if remainder:
chunksize += 1
return chunksize
#parallel processing
# number of jobs approximately
# don't know exactly without turning pool_inputs into a list, which would be self-defeating
N_JOBS = 300
POOLSIZE = 8
CHUNKSIZE = compute_chunksize(N_JOBS, POOLSIZE)
with mp.Pool(processes=POOLSIZE) as pool:
output = [result for result in pool.imap(do_things_to_text, pool_inputs, CHUNKSIZE)]

Related

Why are multiple multiprocessing imap's blocking?

I want to do multiple transformations on some data. I figured I can use multiple Pool.imap's because each of the transformations is just a simple map. And Pool.imap is lazy, so it only does computation when needed.
But strangely, it looks like multiple consecutive Pool.imap's are blocking. And not lazy. Look at the following code as an example.
import time
from multiprocessing import Pool
def slow(n):
time.sleep(0.01)
return n*n
for i in [10, 100, 1000]:
with Pool() as p:
numbers = range(i)
iter1 = p.imap(slow, numbers)
iter2 = p.imap(slow, iter1)
start = time.perf_counter()
next(iter2)
print(i, time.perf_counter() - start)
# Prints
# 10 0.0327413540071575
# 100 0.27094774100987706
# 1000 2.6275791430089157
As you can see the time to the first element is increasing. I have 4 cores on my machine, so it roughly takes 2.5 seconds to process 1000 items with a 0.01 second delay. Hence, I think two consecutive Pool.imap's are blocking. And that the first Pool.imap finishes the entire workload before the second one starts. That is not lazy.
I've did some additional research. It does not matter if I use a process pool or a thread pool. It happens with Pool.imap and Pool.imap_unordered. The blocking takes longer when I do a third Pool.imap. A single Pool.imap is not blocking. This bug report seems related but different.
TL;DR imap is not a real generator, meaning it does not generate items on-demand (lazy computation aka similar to coroutine), and pools initiate "jobs" in serial.
longer answer: Every type of submission to a Pool be it imap, apply, apply_async etc.. gets written to a queue of "jobs". This queue is read by a thread in the main process (pool._handle_tasks) in order to allow jobs to continue to be initiated while the main process goes off and does other things. This thread contains a very simple double for loop (with a lot of error handling) that basically iterates over each job, then over each task within each job. The inner loop blocks until a worker is available to get each task, meaning tasks (and jobs) are always started in serial in the exact order they were submitted. This does not mean they will finish in perfect serial, which is why map, and imap collect results, and re-order them back to their original order (handled by pool._handle_resluts thread) before passing back to the main thread.
Rough pseudocode of what's going on:
#task_queue buffers task inputs first in - first out
pool.imap(foo, ("bar", "baz", "bat"), chunksize=1)
#put an iterator on the task queue which will yield "chunks" (a chunk is given to a single worker process to compute)
pool.imap(fun, ("one", "two", "three"), chunksize=1)
#put a second iterator to the task queue
#inside the pool._task_handler thread within the main proces
for task in task_queue: #[imap_1, imap_2]
#this is actually a while loop in reality that tries to get new tasks until the pool is close()'d
for chunk in task:
_worker_input_queue.put(chunk) # give the chunk to the next available worker
# This blocks until a worker actually takes the chunk, meaning the loop won't
# continue until all chunks are taken by workers.
def worker_function(_worker_input_queue, _worker_output_queue):
while True:
task = _worker_input_queue.get() #get the next chunk of tasks
#if task == StopSignal: break
result = task.func(task.args)
_worker_output_queue.put(result) #results are collected, and re-ordered
# by another thread in the main process
# as they are completed.

Why does not multithreading speed up my program?

I have a big text file that needs to be processed. I first read all text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel and get_relations().
I am on Mac and my observations show that it doesn't really speed up the processing (cpu with 8 cores, only 15% cpu is used). If there is a performance bottleneck in either the function is_channel or get_relations, then the multithreading won't help much. Is that the reason for no performance gain? Should I try to use multiprocessing to speed up instead of multithreading?
def process_file(file_name):
all_lines = []
with open(file_name, 'r', encoding='utf8') as f:
for index, line in enumerate(f):
line = line.strip()
all_lines.append(line)
# Classify text
all_results = []
with ThreadPoolExecutor(max_workers=10) as executor:
for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
all_results.append(result)
for index, entities_relations_list in enumerate(all_results):
# print out results
def process_text(text, channel):
global channel_text
global non_channel_text
is_right_channel = is_channel(text, channel)
entities = ()
relations = None
entities_relations_list = set()
entities_relations_list.add((entities, relations))
if is_right_channel:
channel_text += 1
entities_relations_list = get_relations(text, channel)
return (text, entities_relations_list, is_right_channel)
non_channel_text += 1
return (text, entities_relations_list, is_right_channel)
The first thing that should be done is finding out how much time it takes to:
Read the file in memory (T1)
Do all processing (T2)
Printing result (T3)
The third point (printing), if you are really doing it, can slow down things. It's fine as long as you are not printing it to terminal and just piping the output to a file or something else.
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
by x >> y I mean x is significantly greater than y.
Based on above and the file size, you can try a few approaches:
Threading based
Even this can be done 2 ways, which one would work faster can be found out by again benchmarking/looking at the timings.
Approach-1 (T1 >> T2 or even when T1 and T2 are similar)
Run the code to read the file itself in a thread and let it push the lines to a queue instead of the list.
This thread inserts a None at end when it is done reading from file. This will be important to tell the worker that they can stop
Now run the processing workers and pass them the queue
The workers keep reading from the queue in a loop and processing the results. Similar to the reader thread, these workers put results in a queue.
Once a thread encounters a None, it stops the loop and re-inserts the None into the queue (so that other threads can stop themselves).
The printing part can again be done in a thread.
The above is example of single Producer and multiple consumer threads.
Approach-2 (This is just another way of doing what is being already done by the code snippet in the question)
Read the entire file into a list.
Divide the list into index ranges based on no. of threads.
Example: if the file has 100 lines in total and we use 10 threads
then 0-9, 10-19, .... 90-99 are the index ranges
Pass the complete list and these index ranges to the threads to process each set. Since you are not modifying original list, hence this works.
This approach can give results better than running the worker for each individual line.
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process
This requires additional step of combining all results/files at end
The process creation part can be done from within python using multiprocessing module
or from a driver script to spawn a python process for each file, like a shell script
Just by looking at the code, it seems to be CPU bound. Hence, I would prefer multiprocessing for doing that. I have used both approaches in practice.
Multiprocessing: when processing huge text files(GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases. As that is more IO bound than CPU (I used multiple producer and multiple consumer threads).

Multiprocessing pool map_async for one function then block before the next (python 3)

please be warned that this demonstration code generates a few GB data.
I have been using versions of the code below for multiprocessing for some time. It works well when the run time of each process in the pool is similar but if one process takes much longer I end up with many blocked processes waiting on the one, so I'm trying to make it run asynchronously - just for one function at a time.
For example, if I have 70 cores and need to run a function 2000 times I want that to run asynchronously then wait for the last process before calling the next function. Currently it just submits processes in batches of how ever many cores I give it and each batch has to wait for the longest process.
As you can see I've tried using map_async but this is clearly the wrong syntax. Can anyone help me out?
import os
p='PATH/test/'
def f1(tup):
x,y=tup
to_write = x*(y**5)
with open(p+x+str(y)+'.txt','w') as fout:
fout.write(to_write)
def f2(tup):
x,y=tup
print (os.path.exists(p+x+str(y)+'.txt'))
def call_func(f,nos,threads,call):
print (call)
for i in range(0, len(nos), threads):
print (i)
chunk = nos[i:i + threads]
tmp = [('args', no) for no in chunk]
pool.map(f, tmp)
#pool.map_async(f, tmp)
nos=[i for i in range(55)]
threads=8
if __name__ == '__main__':
with Pool(processes=threads) as pool:
call_func(f1,nos,threads,'f1')
call_func(f2,nos,threads,'f2')
map will only return and map_async will only call the callback after all tasks of the current chunk are done.
So you can only either give all tasks to map/map_async at once or use apply_async (initially called threads times) where the callback calls apply_asyncfor the next task.
If the actual return values of the call don't matter (or at least their order doesn't), imap_unordered may be another efficient solution when giving it all tasks at once (or an iterator/generator producing the tasks on demand)

Multiprocessing time increases linearly with more cores

I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100) I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, when I ran across a problem, namely, that running the multi-process part essentially takes the same amount of time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound, rather than CPU bound? (And if so - what do I do about it?) I would have thought it was the latter, given that the large bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking it up and running 2 - 10 at once would have been 2 - 10 times faster. Or, potentially I implementing multi-process incorrectly?
## Worker function just included to give some context
def worker(sub_code):
layer = 'in_memory/lyr_{}'.format(sub_code)
arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
arcpy.env.extent = layer
union_name = 'in_memory/union_' + sub_code
arcpy.Union_analysis([fields],
union_name,
"NO_FID", "1 FEET")
#.......Some calculations using cursors
# Templating using Jinjah
context = {}
context['DATE'] = now.strftime("%B %d, %Y")
context['SUB_CD'] = sub_code
context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
# Etc
# Then write the report out using custom function
write_html('template.html', 'output_folder', context)
if __name__ == '__main__':
subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
NUM_CORES = 7
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
for chunk in chunk_list:
jobs = []
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()
There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.
You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)
I'm not sure you are using the Process pool correctly to track your jobs. This:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
jobs.append(p)
p.start()
for process in jobs:
process.join()
Should instead be:
for subbasin in chunk:
p = multiprocessing.Process(target=worker, args=(subbasin,))
p.start()
p.join()
Is there a specific reason you are going against the spec of using the multiprocessing library? You are not waiting until the thread terminates before spinning another process up, which is just going to create a whole bunch of processes that are not handled by the parent calling process.

Why is a ThreadPoolExecutor with one worker still faster than normal execution?

I'm using this library, Tomorrow, that in turn uses the ThreadPoolExecutor from the standard library, in order to allow for Async function calls.
Calling the decorator #tomorrow.threads(1) spins up a ThreadPoolExecutor with 1 worker.
Question
Why is it faster to execute a function using 1 thread worker over just calling it as is (e.g. normally)?
Why is it slower to execute the same code with 10 thread workers in place of just 1, or even None?
Demo code
imports excluded
def openSync(path: str):
for row in open(path):
for _ in row:
pass
#tomorrow.threads(1)
def openAsync1(path: str):
openSync(path)
#tomorrow.threads(10)
def openAsync10(path: str):
openSync(path)
def openAll(paths: list):
def do(func: callable)->float:
t = time.time()
[func(p) for p in paths]
t = time.time() - t
return t
print(do(openSync))
print(do(openAsync1))
print(do(openAsync10))
openAll(glob.glob("data/*"))
Note: The data folder contains 18 files, each 700 lines of random text.
Output
0 workers: 0.0120 seconds
1 worker: 0.0009 seconds
10 workers: 0.0535 seconds
What I've tested
I've ran the code more than a couple dusin times, with different programs running in the background (ran a bunch yesterday, and a couple today). The numbers change, ofc, but the order is always the same. (I.e. 1 is fastest, then 0 then 10).
I've also tried changing the order of execution (e.g. moving the do calls around) in order to eliminate caching as a factor, but still the same.
Turns out that executing in the order 10, 1, None results in a different order (1 is fastest, then 10, then 0) compared to every other permutation. The result shows that whatever do call is executed last, is considerably slower than it would have been had it been executed first or in the middle instead.
Results (After receiving solution from #Dunes)
0 workers: 0.0122 seconds
1 worker: 0.0214 seconds
10 workers: 0.0296 seconds
When you call one of your async functions it returns a "futures" object (instance of tomorrow.Tomorrow in this case). This allows you to submit all your jobs without having to wait for them to finish. However, never actually wait for the jobs to finish. So all do(openAsync1) does is time how long it takes to setup all the jobs (should be very fast). For a more accurate test you need to do something like:
def openAll(paths: list):
def do(func: callable)->float:
t = time.time()
# do all jobs if openSync, else start all jobs if openAsync
results = [func(p) for p in paths]
# if openAsync, the following waits until all jobs are finished
if func is not openSync:
for r in results:
r._wait()
t = time.time() - t
return t
print(do(openSync))
print(do(openAsync1))
print(do(openAsync10))
openAll(glob.glob("data/*"))
Using additional threads in python generally slows things down. This is because of the global interpreter lock which means only 1 thread can ever be active, regardless of the number of cores the CPU has.
However, things are complicated by the fact that your job is IO bound. More worker threads might speed things up. This is because a single thread might spend more time waiting for the hard drive to respond than is lost between context switching between the various threads in the multi-threaded variant.
Side note, even though neither openAsync1 and openAsync10 wait for jobs to complete, do(openAsync10) is probably slower because it requires more synchronisation between threads when submitting a new job.

Categories