`multiprocessing.Pool.map()` seems to schedule wrongly - python

I have a function which requests a server, retrieves some data, processes it and saves a csv file. This function should be launched 20k times. Each execution lasts a different amount of time: sometimes it lasts more than 20 minutes and other times less than a second. I decided to go with multiprocessing.Pool.map to parallelize the execution. My code looks like:
from multiprocessing import Pool

def get_data_and_process_it(filename):
    print('getting', filename)
    ...
    print(filename, 'has been processed')

with Pool(8) as p:
    p.map(get_data_and_process_it, long_list_of_filenames)
Looking at how the prints are generated, it seems that long_list_of_filenames is split into 8 parts and assigned to each CPU, because sometimes it just gets blocked in one 20-minute execution with no other element of long_list_of_filenames being processed in those 20 minutes. What I was expecting is map to schedule each element on a CPU core in a FIFO style.
Is there a better approach for my case?

The map method only returns when all operations have finished.
And printing from a pool worker is not ideal. For one thing, files like stdout use buffering, so there might be a variable amount of time between printing a message and it actually appearing. Furthermore, since all workers inherit the same stdout, their output would become intermeshed and possibly even garbled.
So I would suggest using imap_unordered instead. That returns an iterator that will begin yielding results as soon as they are available. The only catch is that this returns results in the order they finish, not in the order they started.
Your worker function (get_data_and_process_it) should return some kind of status indicator. For example a tuple of the filename and the result.
def get_data_and_process_it(filename):
    ...
    if error:
        return (filename, f'has *failed* because of {reason}')
    return (filename, 'has been processed')
You could then do:
with Pool(8) as p:
    for fn, res in p.imap_unordered(get_data_and_process_it, long_list_of_filenames):
        print(fn, res)
That gives accurate information about when a job finishes, and since only the parent process writes to stdout, there is no chance of the output becoming garbled.
Additionally, I would suggest calling sys.stdout.reconfigure(line_buffering=True) somewhere near the beginning of your program. That ensures that the stdout stream is flushed after every line of output.

map is blocking; instead of p.map you can use p.map_async. map waits for all those function calls to finish, so we see all the results in a row. map_async submits the work without blocking and does not wait for a preceding task to finish before starting a new task, so results can complete in any order. This is the fastest approach. (For more) There is also a SO thread which discusses map and map_async in detail.
The multiprocessing Pool class handles the queuing logic for us. It's perfect for running web scraping jobs in parallel (example), or really any job that can be broken up and distributed independently. If you need more control over the queue or need to share data between multiple processes, you may want to look at the Queue class. (For more)
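This is not from the original answer, but a minimal sketch of how map_async could be wired up for the case in the question (get_data_and_process_it and long_list_of_filenames are the names used above):

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(8) as p:
        # map_async returns an AsyncResult immediately instead of blocking
        async_result = p.map_async(get_data_and_process_it, long_list_of_filenames)
        # ... the parent process is free to do other work here ...
        results = async_result.get()  # block only when the results are actually needed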

Related

Why doesn't multithreading speed up my program?

I have a big text file that needs to be processed. I first read all text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel and get_relations().
I am on a Mac and my observations show that it doesn't really speed up the processing (CPU with 8 cores, only 15% CPU is used). If there is a performance bottleneck in either the function is_channel or get_relations, then the multithreading won't help much. Is that the reason for no performance gain? Should I try to use multiprocessing to speed up instead of multithreading?
import itertools
from concurrent.futures import ThreadPoolExecutor

def process_file(file_name):
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)

    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)

    for index, entities_relations_list in enumerate(all_results):
        pass  # print out results

def process_text(text, channel):
    global channel_text
    global non_channel_text

    is_right_channel = is_channel(text, channel)
    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
The first thing that should be done is finding out how much time it takes to:
Read the file in memory (T1)
Do all processing (T2)
Printing result (T3)
The third point (printing), if you are really doing it, can slow things down. It's fine as long as you are not printing to a terminal but piping the output to a file or something else.
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
by x >> y I mean x is significantly greater than y.
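This sketch is not from the original answer; it is just one way the three timings could be collected with time.perf_counter (process_text, channel and file_name come from the question; read_lines and print_results are hypothetical stand-ins for the read and print steps):

import time

t0 = time.perf_counter()
all_lines = read_lines(file_name)        # hypothetical helper for the read step
t1 = time.perf_counter()
all_results = [process_text(line, channel) for line in all_lines]
t2 = time.perf_counter()
print_results(all_results)               # hypothetical helper for the print step
t3 = time.perf_counter()

print('T1 (read)    =', t1 - t0)
print('T2 (process) =', t2 - t1)
print('T3 (print)   =', t3 - t2)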
Based on above and the file size, you can try a few approaches:
Threading based
Even this can be done in 2 ways; which one works faster can be found out by benchmarking/looking at the timings again.
Approach-1 (T1 >> T2 or even when T1 and T2 are similar)
Run the code to read the file itself in a thread and let it push the lines to a queue instead of the list.
This thread inserts a None at the end when it is done reading from the file. This will be important to tell the workers that they can stop.
Now run the processing workers and pass them the queue
The workers keep reading from the queue in a loop and processing the results. Similar to the reader thread, these workers put results in a queue.
Once a thread encounters a None, it stops the loop and re-inserts the None into the queue (so that other threads can stop themselves).
The printing part can again be done in a thread.
The above is an example of a single producer and multiple consumer threads.
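This is not code from the answer, just a rough sketch of the single-producer/multiple-consumer pattern described above, using queue.Queue and a None sentinel (process_text, channel and file_name are the names from the question; the queue size is arbitrary):

import threading
import queue

line_q = queue.Queue(maxsize=1000)   # bounded so the reader cannot run far ahead
result_q = queue.Queue()

def reader(file_name):
    with open(file_name, 'r', encoding='utf8') as f:
        for line in f:
            line_q.put(line.strip())
    line_q.put(None)                 # sentinel: no more lines

def worker():
    while True:
        line = line_q.get()
        if line is None:             # done: re-insert the sentinel so other workers stop too
            line_q.put(None)
            break
        result_q.put(process_text(line, channel))

reader_thread = threading.Thread(target=reader, args=(file_name,))
workers = [threading.Thread(target=worker) for _ in range(10)]
reader_thread.start()
for w in workers:
    w.start()
reader_thread.join()
for w in workers:
    w.join()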
Approach-2 (This is just another way of doing what the code snippet in the question already does)
Read the entire file into a list.
Divide the list into index ranges based on no. of threads.
Example: if the file has 100 lines in total and we use 10 threads
then 0-9, 10-19, .... 90-99 are the index ranges
Pass the complete list and these index ranges to the threads to process each set. Since you are not modifying the original list, this works.
This approach can give results better than running the worker for each individual line.
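Again not from the answer, just a minimal sketch of the index-range variant (all_lines, process_text and channel come from the question; each thread writes only into its own slice of a pre-sized results list):

import threading

n_threads = 10
chunk = (len(all_lines) + n_threads - 1) // n_threads
all_results = [None] * len(all_lines)

def worker(start, end):
    for i in range(start, min(end, len(all_lines))):
        all_results[i] = process_text(all_lines[i], channel)

threads = [threading.Thread(target=worker, args=(i * chunk, (i + 1) * chunk))
           for i in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()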
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process.
This requires an additional step of combining all results/files at the end.
The process creation part can be done from within Python using the multiprocessing module,
or from a driver script that spawns a Python process for each file, like a shell script.
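Not part of the answer, just a sketch of the multiprocessing variant, assuming the big file has already been split into smaller files (process_file is the function from the question; the part_*.txt paths are hypothetical):

from multiprocessing import Pool

file_paths = ['part_0.txt', 'part_1.txt', 'part_2.txt', 'part_3.txt']  # hypothetical split files

if __name__ == '__main__':
    with Pool(processes=len(file_paths)) as pool:
        pool.map(process_file, file_paths)  # one task (roughly one process) per file
    # combining the per-file results at the end is the additional step mentioned above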
Just by looking at the code, it seems to be CPU bound. Hence, I would prefer multiprocessing for this. I have used both approaches in practice.
Multiprocessing: when processing huge text files (GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases, as that is more IO bound than CPU bound (I used multiple producer and multiple consumer threads).

Multiprocessing pool map_async for one function then block before the next (python 3)

Please be warned that this demonstration code generates a few GB of data.
I have been using versions of the code below for multiprocessing for some time. It works well when the run time of each process in the pool is similar, but if one process takes much longer I end up with many blocked processes waiting on the one, so I'm trying to make it run asynchronously - just for one function at a time.
For example, if I have 70 cores and need to run a function 2000 times, I want that to run asynchronously and then wait for the last process before calling the next function. Currently it just submits processes in batches of however many cores I give it, and each batch has to wait for the longest process.
As you can see I've tried using map_async but this is clearly the wrong syntax. Can anyone help me out?
import os
from multiprocessing import Pool

p = 'PATH/test/'

def f1(tup):
    x, y = tup
    to_write = x * (y ** 5)
    with open(p + x + str(y) + '.txt', 'w') as fout:
        fout.write(to_write)

def f2(tup):
    x, y = tup
    print(os.path.exists(p + x + str(y) + '.txt'))

def call_func(f, nos, threads, call):
    print(call)
    for i in range(0, len(nos), threads):
        print(i)
        chunk = nos[i:i + threads]
        tmp = [('args', no) for no in chunk]
        pool.map(f, tmp)
        #pool.map_async(f, tmp)

nos = [i for i in range(55)]
threads = 8

if __name__ == '__main__':
    with Pool(processes=threads) as pool:
        call_func(f1, nos, threads, 'f1')
        call_func(f2, nos, threads, 'f2')
map will only return and map_async will only call the callback after all tasks of the current chunk are done.
So you can only either give all tasks to map/map_async at once, or use apply_async (initially called threads times) where the callback calls apply_async for the next task.
If the actual return values of the calls don't matter (or at least their order doesn't), imap_unordered may be another efficient solution when giving it all tasks at once (or an iterator/generator producing the tasks on demand).
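This is not from the original answer, but a minimal sketch of the imap_unordered route for the code in the question (f1, f2, nos and threads are the names used above); each finished task frees a worker immediately, and f2 only starts once every f1 task is done:

from multiprocessing import Pool

if __name__ == '__main__':
    tmp = [('args', no) for no in nos]
    with Pool(processes=threads) as pool:
        for _ in pool.imap_unordered(f1, tmp):
            pass   # consume results as they finish; no batching by core count
        for _ in pool.imap_unordered(f2, tmp):
            pass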

Multiprocessing not parallelizing

I have a function that can be run in parallel; however, when I try running it, it appears that the function is being called serially.
import multiprocessing as mp

def function_to_be_parallelized(x, y, z):
    # compute_array takes 1-5 minutes of computation, depending on x, y, z
    computed_array = compute_array(x, y, z)
    print("running with parameters" + str(x * y * z))
    return computed_array

def run(xs, ys, zs):
    pool = mp.Pool(processes=4)
    all_outputs = [pool.apply(function_to_be_parallelized, args=(x, y, z))
                   for x in xs for y in ys for z in zs]
What I find is that the print statements are printed one at a time, and each is only printed once the previous process is finished. I'm running this on a machine with 4 cores.
Is this because the processes in the inner function each occupy more than 2 cores (so that it cannot be parallelized)? Or is there another reason?
pool.apply waits for the result to be ready, so you're not submitting a new job until the previous job finishes. You'd have to use something like apply_async or map, but even then, there's no guarantee you'll see interleaved or out-of-order execution, and the benefits of parallelization will probably be swamped by overhead for a function like this.
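Not part of the original answer, just a minimal sketch of the apply_async variant (function_to_be_parallelized, xs, ys and zs are the names from the question):

import multiprocessing as mp

def run(xs, ys, zs):
    with mp.Pool(processes=4) as pool:
        # submission returns immediately; the pool runs up to 4 calls in parallel
        async_results = [pool.apply_async(function_to_be_parallelized, args=(x, y, z))
                         for x in xs for y in ys for z in zs]
        # collect the results afterwards (this is where we block)
        return [r.get() for r in async_results]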
This looks okay to me. It is likely an issue with waiting for the print buffer to fill. Look into apply_async: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.apply_async
Also, even though print is being called, Python will not send your output to stdout unless there is enough data in the buffer. Try adding a sys.stdout.flush() to your function_to_be_parallelized to force printing as soon as possible.

Python - weird behavior with multiprocessing - join does not execute

I am using the multiprocessing Python module. I have about 20-25 tasks to run simultaneously. Each task will create a pandas.DataFrame object of ~20k rows. The problem is that all tasks execute well, but when it comes to "joining" the processes, it just stops. I've tried with "small" DataFrames and it works very well. To illustrate my point, I created the code below.
import pandas
import multiprocessing as mp

def task(arg, queue):
    DF = pandas.DataFrame({"hello": range(10)})  # try range(1000) or range(10000)
    queue.put(DF)
    print("DF %d stored" % arg)

listArgs = range(20)
queue = mp.Queue()
processes = [mp.Process(target=task, args=(arg, queue)) for arg in listArgs]

for p in processes:
    p.start()

for i, p in enumerate(processes):
    print("joining %d" % i)
    p.join()

results = [queue.get() for p in processes]
EDIT:
With DF = pandas.DataFrame({"hello":range(10)}) I have everything correct: "DF 0 stored" up to "DF 19 stored", same with "joining 0" to "joining 19".
However with DF = pandas.DataFrame({"hello":range(1000)}) the issue arises: while it is storing the DF, the joining step stops after "joining 3".
Thanks for the useful tips :)
This problem is explained in the docs, under Pipes and Queues:
Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
Using a manager would work, but there are a lot of easier ways to solve this:
Read the data off the queue first, then join the processes, instead of the other way around.
Manage the Queue manually (e.g., using a JoinableQueue and task_done).
Just use Pool.map instead of reinventing the wheel. (Yes, much of what Pool does isn't necessary for your use case—but it also isn't going to get in the way, and the nice thing is, you already know it works.)
I won't show the implementation for #1 because it's so trivial, or for #2 because it's such a pain, but for #3:
def task(arg):
    DF = pandas.DataFrame({"hello": range(1000)})  # try range(1000) or range(10000)
    return DF

with mp.Pool(processes=20) as p:
    results = p.map(task, range(20), chunksize=1)
(In 2.7, Pool may not work in a with statement; you can install the backport of the later multiprocessing version from PyPI, or you can just create the pool manually and then close it in a try/finally, just as you would handle a file if it didn't work in a with statement...)
You may ask yourself, why exactly does it fail at this point, but work with smaller numbers—even just a little bit smaller?
A pickle of that DataFrame is just over 16K. (The list by itself is a little smaller, but if you try it with 10000 instead of 1000 you should see the same thing without Pandas.)
So, the first child writes 16K, then blocks until there's room to write the last few hundred bytes. But you're not pulling anything off the pipe (by calling queue.get) until after the join, and you can't join until they exit, which they can't do until you unblock the pipe, so it's a classic deadlock. There's enough room for the first 4 to get through, but no room for 5. Because you have 4 cores, most of the time, the first 4 that get through will be the first 4. But occasionally #4 will beat #3 or something, and then you'll fail to join #3. That would happen more often with an 8-core machine.

Does pool.map() from multiprocessing lock process to CPU core automatically?

I've submitted several questions over the last few days trying to understand how to use the multiprocessing python library properly.
Current method I'm using is to split a task over a number of processes that is equal to the number of available CPU cores on the machine, as follows:
import multiprocessing
from multiprocessing import Pool
from contextlib import closing

def myFunction(row):
    # row function
    ...

with closing(Pool(processes=multiprocessing.cpu_count())) as pool:
    pool.map(myFunction, rowList)
However, when the map part is reached in the program, it seems to actually slow down, not speed up. One of my functions, for example, moves through only 60 records (the first function) and it prints a result at the end of each record. The record printing seems to slow down to an eventual stop and then not do much! I am wondering if the program is loading the next function into memory asynchronously or whether there's something wrong with my methodology.
So I am wondering - are the child processes automatically 'LOCKED' to each CPU core with the pool.map() or do I need to do something extra?
EDIT:
So the program does not actually stop, it just begins to print the values very slowly.
here is an example of myFunction in very simplified terms (row is from a list object):
def myFunction(row):
    d = string
    j = 0
    for item in object:
        d += row[j]
        j = j + 1
    d += row[x] + string
    d += row[y] + string
    print row[z]
    return
As I said, the above function is for a very small list; however, the function following it deals with a much, much larger list.
The problem is that you don't appear to be doing enough work in each call to the worker function. All you seem to be doing is pasting together the list of strings passed as the argument. However, that is pretty much exactly what the multiprocessing module needs to do in the parent process to pass the list of strings to the worker process: it pickles them and writes them to a pipe, which the child process then reads, unpickles, and passes as the argument to myFunction.
Since, in order to pass the argument to the worker process, the parent process has to do at least as much work as the worker process itself, you gain no benefit from using the multiprocessing module in this case.
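This is not from the answer, but if the per-row work were heavier, one hypothetical way to reduce the per-task dispatch overhead (not the serialization cost itself) is to pass a larger chunksize to map so that many rows travel to each worker in one batch (myFunction and rowList are the names from the question; chunksize=100 is an arbitrary illustrative value):

import multiprocessing
from multiprocessing import Pool
from contextlib import closing

with closing(Pool(processes=multiprocessing.cpu_count())) as pool:
    # each worker receives 100 rows per dispatch instead of one
    results = pool.map(myFunction, rowList, chunksize=100)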
