Python multiprocessing pool number of jobs not correct - python

I wrote a Python program that uses a Pool to launch 16 parallel processes to process some files. At the beginning of the run, 16 processes stay busy until almost all of the files have been processed. Then, for some reason I don't understand, when there are only a few files left, only one process runs at a time, which makes the processing take much longer than necessary. Could you help with this?

Force map() to use a chunksize of 1 instead of letting it guess the best value by itself, e.g.:
from multiprocessing import Pool

pool = Pool(16)
pool.map(func, iterable, chunksize=1)
This should (in theory) guarantee the best distribution of load among workers until the end of the input data.

Before Pool starts executing the function you pass to apply_async/map_async (or map), it assigns each worker a piece of the work.
For example, let's say that you have 8 files to process and you start a Pool with 4 workers.
Before the file processing starts, two specific files are assigned to each worker. This means that if some worker finishes its share earlier than the others, it will simply "have a break" and will not start helping the others.
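Here is a minimal, self-contained sketch of the difference (process_file, the file names and the sleep times are made up for illustration): with the default chunksize the pool pre-groups the inputs, while with chunksize=1 every file is handed out individually, so an idle worker can always pick up the next remaining file.

import os
import time
from multiprocessing import Pool

def process_file(path):
    # Stand-in for the real work; some files take much longer than others.
    time.sleep(1.0 if path.endswith('.big') else 0.05)
    return os.getpid(), path

if __name__ == '__main__':
    files = [f'file_{i}.big' if i % 20 == 0 else f'file_{i}.small'
             for i in range(200)]
    with Pool(16) as pool:
        # Default: the pool picks a chunksize and pre-groups the inputs.
        results_default = pool.map(process_file, files)
        # Explicit chunksize=1: one file per task, finer-grained load
        # balancing towards the end of the input data.
        results_fine = pool.map(process_file, files, chunksize=1)
    print(len(results_default), len(results_fine))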

Related

Understanding python multiprocessing pool map thread safety

This question had conflicting answers: Are Python multiprocessing Pool thread safe?
I am new to concurrency patterns and I am trying to run a project that takes in an array and distributes the work of the array onto multiple processes. The array is large.
inputs = range(100000)
with Pool(2) as pool:
    res = pool.map(some_func, inputs)
My understanding is that pool will distribute tasks to the processes. My questions are:
Is this map operation thread safe? Will two processes ever accidentally try to process the same value?
I superficially understand that tasks will be divided up into chunks and sent to processes. However, if different inputs take more time than others, will the work always be evenly distributed across my processes? Will I ever be in a scenario where one process is hanging with a long queue of tasks to do while other processes are idle?
My understanding is that since I am just reading inputs in, I don't need to use any interprocess communication patterns like a manager / shared memory. Is that right?
If I set up more processes than cores, will it basically operate like threads where the CPU is switching between tasks?
Thank you!
With the code provided, it is impossible for the same item of inputs to be processed by more than one process (an exception would be if the same instance of an object appears more than once in the iterable passed as argument). Nevertheless, this way of using multiprocessing has a lot of overhead, since the input items are sent one by one to the processes. A better approach is to use the chunksize parameter:
from multiprocessing import Pool

inputs = range(100000)
n_proc = 2
chunksize = len(inputs) // n_proc
if len(inputs) % n_proc:
    chunksize += 1
with Pool(n_proc) as pool:
    res = pool.map(some_func, inputs, chunksize=chunksize)
This way, chunks of inputs are passed to each process at once, leading to better performance.
The work is not divided into chunks unless you ask for it; if no chunksize is passed, Pool.map picks a small default based on the length of the iterable and the number of workers (imap defaults to a chunksize of 1). The chunks are 'sent' one by one to the available processes in the pool, i.e. each process takes a new chunk as it finishes the previous one and becomes available. There is no need for every process to take the same number of chunks. In your example, if some_func takes longer for larger values and chunksize = len(inputs)//2, the process that gets the chunk with the first half of inputs (the smaller values) will finish first while the other takes much longer. In that case, a smaller chunksize is a better option so the work is evenly distributed.
This depends on what some_func does. If you do not need the result of some_func(n) to process some_func(m), you do not need to communicate between processes. If you are using map and need to communicate between processes, it is very likely that you are taking a bad approach to solving your problem.
If max_workers > os.cpu_count(), the CPU will switch between processes more often than with a lower number of processes. Don't forget that there are many more processes running on a (not amazingly old) computer than just your program. On Windows, max_workers must be equal to or less than 61 (see the concurrent.futures docs).
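As a middle ground between the two extremes above (one item per chunk versus one huge chunk per process), a modest chunksize keeps the per-task overhead low while still letting the pool balance uneven workloads. A rough sketch, with some_func and the chunk sizing as placeholder choices:

from multiprocessing import Pool

def some_func(n):
    return n * n  # placeholder for the real per-item work

if __name__ == '__main__':
    inputs = range(100000)
    n_proc = 2
    # Aim for many more chunks than workers, so one slow chunk cannot
    # leave the other workers idle for long.
    chunksize = max(1, len(inputs) // (n_proc * 50))
    with Pool(n_proc) as pool:
        res = pool.map(some_func, inputs, chunksize=chunksize)
    print(res[:5])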

Can one terminate a python process which is a worker in a pool?

Each worker runs a long CPU-bound computation. The computation depends on parameters that can change anytime, even while the computation is in progress. Should that happen, the eventual result of the computation will become useless. We do not control the computation code, so we cannot signal it to stop. What can we do?
Nothing: let the worker complete its task and somehow recognize afterwards that the result is incorrect and must be recomputed. That would mean continuing to use a processor for a useless result, possibly for a long time.
Don't use Pool: Create and join the processes as needed. We can then terminate the useless process and create another one. We can even keep bounds on the number of processes existing simultaneously. Unfortunately, we will not be reusing processes.
Find a way to terminate and replace a Pool worker: Is terminating a Pool worker even possible? Will the Pool create a replacement for the terminated worker? If not, is there an external way of creating a new worker in the pool?
Given the strict "can't change computation code" limitation (which prevents checking for invalidation intermittently), your best option is probably #2.
In this case, the downside you mention for #2 ("Unfortunately, we will not be reusing processes.") isn't a huge deal. Reusing processes is an issue when the work done by a process is small relative to the overhead of launching the process. But it sounds like you're talking about processes that run over the course of seconds or longer; the cost of forking a new process (default on most UNIX-likes) is a trivial fraction of that, and spawning a process (default behavior on MacOS and Windows) is typically still measured in small fractions of a second.
For comparison:
Option #1 is wasteful; if you're anywhere close to using up your cores, and invalidation occurs with any frequency at all, you don't want to leave a core chugging on garbage indefinitely.
Option #3, even if it worked, would work only by coincidence, and might break in a new release of Python, since the behavior of killing workers explicitly is not a documented feature.
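For what it's worth, here is a minimal sketch of option #2 (compute, its parameters and the invalidation check are placeholders): the long computation runs in a plain multiprocessing.Process, so it can be terminated and replaced when its inputs become stale, without touching the computation code.

import time
from multiprocessing import Process, Queue

def compute(x, result_q):
    # Stand-in for the long CPU-bound computation we cannot modify.
    time.sleep(10)
    result_q.put(x * x)

if __name__ == '__main__':
    result_q = Queue()
    p = Process(target=compute, args=(42, result_q))
    p.start()

    time.sleep(1)                  # ...meanwhile the parameters change...
    parameters_changed = True      # placeholder for the real invalidation check
    if parameters_changed:
        p.terminate()              # discard the now-useless computation
        p.join()
        p = Process(target=compute, args=(43, result_q))  # restart with fresh parameters
        p.start()

    p.join()
    print(result_q.get())

Note that, per the multiprocessing docs, terminate() can leave shared resources such as locks, queues, or pipes in a broken state if the process is using them at that moment, so this is safest when the worker only computes and reports a result at the very end.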

Python processing items from list/queue and saving progress

If I have about 10+ million little tasks to process in Python (converting images or so), how can I create a queue and save progress in case the processing crashes? To be clear, how can I save progress or stop the processing whenever I want, and then continue processing from the last point?
Also, how do I deal with multiple threads in that case?
In general, the question is how to save progress on processed data to a file. The issue is that this is a huge number of very small files, so saving state after each iteration would take longer than the processing itself...
Thanks!
(sorry for my English if it's not clear)
First of all, I would suggest not going for multithreading. Use multiprocessing instead: because of the GIL, multiple threads do not give you real parallelism in Python for computation-intensive tasks.
To solve the problem of saving results, use the following sequence:
Get the names of all the files in a list and divide the list into chunks.
Now assign each process one chunk.
Append the names of processed files to some file (say monitor.txt) every 1000 steps (assuming that, in case of failure, you can afford to re-process up to 1000 files).
In case of failure, skip all the files that are already saved in monitor.txt.
You can have monitor_1.txt, monitor_2.txt, ... one per process, so no process has to read a single big shared file.
The following gist might help you; you just need to add code for the 4th point. A rough sketch of the monitor-file idea follows the link as well.
https://gist.github.com/rishibarve/ccab04b9d53c0106c6c3f690089d0229
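A minimal sketch of the approach above (the file names, the monitor_*.txt naming and the process_file body are placeholders): each worker appends the names it has finished to its own monitor file in batches, and on a restart it skips anything already listed there.

import os
from multiprocessing import Pool

BATCH = 1000  # flush progress every 1000 files

def process_file(name):
    pass  # stand-in for the real conversion work

def worker(args):
    worker_id, names = args
    monitor = f'monitor_{worker_id}.txt'
    done = set()
    if os.path.exists(monitor):
        with open(monitor) as f:
            done = set(line.strip() for line in f)
    pending = []
    for name in names:
        if name in done:
            continue                       # already processed before the crash
        process_file(name)
        pending.append(name)
        if len(pending) >= BATCH:          # checkpoint in batches, not per file
            with open(monitor, 'a') as f:
                f.write('\n'.join(pending) + '\n')
            pending = []
    if pending:                            # flush the last partial batch
        with open(monitor, 'a') as f:
            f.write('\n'.join(pending) + '\n')

if __name__ == '__main__':
    all_files = [f'img_{i}.png' for i in range(10000)]   # placeholder names
    n_proc = 4
    chunks = [(i, all_files[i::n_proc]) for i in range(n_proc)]
    with Pool(n_proc) as pool:
        pool.map(worker, chunks)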
I/O operations like saving files are always relatively slow. If you have to process a large batch of files, you will be stuck with a long I/O time regardless of the number of threads you use.
The easiest approach is to use multithreading rather than multiprocessing, and let the OS's scheduler figure it all out. The docs have a good explanation of how to set up threads. A simple example would be:
from threading import Thread

def process_data(file_name):
    # does the processing
    print(f'processed {file_name}')

if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    threads = [Thread(target=process_data, args=(file_name,)) for file_name in file_names]
    # here you start all the threads
    for t in threads:
        t.start()
    # here you wait for all threads to finish
    for t in threads:
        t.join()
One solution that might be faster is to create a separate process that does the I/O. You then use a multiprocessing.Queue to queue the files coming from the data-processing workers, and let the I/O process pick them up and save them one after the other.
This way the I/O never has to rest, which will be close to optimal. I don't know whether this yields a big advantage over the threading-based solution, but as is generally the case with concurrency, the best way to find out is to benchmark your own application.
One issue to watch out for is that if the data processing is much faster than the I/O, the queue can grow very big. This might have a performance impact, depending on your system among other things. A quick workaround is to pause the data processing if the queue gets too large.
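A sketch of that producer/writer split (the file names and the processing and saving bodies are placeholders): the producers push results onto a bounded multiprocessing.Queue and a single writer process drains it, so the producers pause automatically whenever the writer falls behind.

from multiprocessing import Process, Queue

def producer(file_names, q):
    for name in file_names:
        result = f'processed {name}'   # stand-in for the real processing
        q.put(result)                  # blocks if the queue is full
    q.put(None)                        # tell the writer this producer is done

def writer(q, n_producers):
    finished = 0
    while finished < n_producers:
        item = q.get()
        if item is None:
            finished += 1
            continue
        print(item)                    # stand-in for the real save-to-disk step

if __name__ == '__main__':
    q = Queue(maxsize=1000)            # bounded: producers pause if it fills up
    file_names = [f'file_{i}' for i in range(20)]
    producers = [Process(target=producer, args=(file_names[i::2], q)) for i in range(2)]
    w = Process(target=writer, args=(q, len(producers)))
    w.start()
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    w.join()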
Remember to put all multiprocessing code in a script behind the
if __name__ == '__main__':
    # mp code
guard, and be aware that some IDEs don't play nicely with concurrent Python code. The safe bet is to test your code by executing it from a terminal.

From multiprocessing to distributed processing in python standard library

I am studying this code from GitHub about distributed processing. I would like to thank eliben for this nice post. I have read his explanations, but there are some dark spots. As far as I understand, the code is for distributing tasks across multiple machines/clients. My questions are:
The most basic of my questions is where the distribution of the work to different machines is happening?
Why is there an if/else statement in the main function?
Let me start this question in a more general way. I thought that we usually start a Process on a specific chunk (an independent piece of memory) rather than passing all the chunks at once, like this:
chunksize = int(math.ceil(len(HugeList) / float(nprocs)))
for i in range(nprocs):
    p = Process(
        target=myWorker,  # this is my worker
        args=(HugeList[chunksize * i:chunksize * (i + 1)], HUGEQ),
    )
    processes.append(p)
    p.start()
In this simple case we have nprocs processes. Each process initiates an instance of the function myWorker, which works on the specified chunk.
My question here is:
How many threads do we have for each process working on its chunk?
Looking now into the GitHub code, I am trying to understand mp_factorizer. More specifically, in this function we do not have chunks but a huge queue (shared_job_q). This queue consists of sub-lists of at most 43 items. The queue is passed into factorizer_worker. There, via get, we obtain those sub-lists and pass them into the serial worker. I understand that we need this queue to share data between clients.
My questions here are:
Do we call an instance of the factorizer_worker function for each of the nprocs (=8) processes?
Which part of the data does each process work on? (Generally, we have 8 processes and 43 chunks.)
How many threads exist for each process?
Is the get function called from each process thread?
Thanks for your time.
The distribution to multiple machines only happens if you actually run the script on multiple machines. The first time you run the script (without the --client option), it starts the Manager server on a specific IP/port, which hosts the shared job/result queues. In addition to starting the Manager server, runserver will also act as a worker, by calling mp_factorizer. It is additionally responsible for collecting the results from the result queue and processing them. You could run this script by itself and get a complete result.
However, you can also distribute the factorization work to other machines, by running the script on other machines using the --client option. That will call runclient, which will connect to the existing Manager server you started with the initial run of the script. That means that the clients are accessing the same shared queues runserver is using, so they can all pull work from and put results to the same queues.
The above should cover questions 1 and 2.
I'm not exactly sure what you're asking in question 3. I think you're wondering why we don't pass a chunk of the list to each worker explicitly (like in the example you included), rather than putting all the chunks into a queue. The answer is that the runserver method doesn't know how many workers there are actually going to be. It knows that it's going to start 8 workers itself. However, it doesn't want to split the HugeList into eight chunks and send them to the 8 processes it's creating, because it wants to support remote clients connecting to the Manager and doing work, too. So instead, it picks an arbitrary size for each chunk (43), divides the list into as many chunks of that size as it takes to consume the entire HugeList, and sticks them in a Queue. Here's the code in runserver that does that:
chunksize = 43
for i in range(0, len(nums), chunksize):
    # print 'putting chunk %s:%s in job Q' % (i, i + chunksize)
    shared_job_q.put(nums[i:i + chunksize])  # adds a chunk of up to 43 items to the shared queue
That way, as many workers as you want can connect to the Manager server, grab a chunk from shared_job_q, process it, and return a result.
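To make the worker side concrete, here is a hedged sketch (not the exact code from the linked post; factorize_naive is a simple stand-in) of what each worker process does: repeatedly grab a chunk from the shared job queue, factor every number in it, push a dict of results to the shared result queue, and exit once the queue is empty.

import queue

def factorize_naive(n):
    # Simple trial-division placeholder for the real factorization function.
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def worker_loop(shared_job_q, shared_result_q):
    while True:
        try:
            chunk = shared_job_q.get_nowait()   # a sub-list of up to 43 numbers
        except queue.Empty:
            return                              # queue drained: this worker exits
        shared_result_q.put({n: factorize_naive(n) for n in chunk})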
Do we call an instance of the factorizer_worker function for each of the nprocs (=8) processes?
Yes
Which part of the data does each process work on? (Generally, we have 8 processes and 43 chunks.)
We don't have 43 chunks; we have X chunks, each of size 43. Each worker process just grabs chunks off the queue and processes them. Which part it gets is arbitrary and depends on how many workers there are and how fast each one is going.
How many threads exist for each process?
One. If you mean how many worker processes exist for each instance of the script, there are 8 in the server process and 4 in each client process.
Is the get function called from each process thread?
Not sure what you mean by this.

python: spawn threads as per requirements

I am creating a small application which will perform, say, 4 different time-consuming tasks, such that the output of the first is the input of the second and so on.
At every task level, the output is appended to a list, and the next task pops from that list, operates on the item, and appends its output to its own output list, and so on...
The way I thought I would get this done is by having multiple threads at each of those 4 task levels.
Coming to the question: is there any way I can have my application spawn threads at each task level depending on the number of tasks in its input queue?
Say the input list of the second task is empty in the beginning, so the number of threads is zero; if there is one task, a single thread is spawned, two for two, etc. And of course there should be an upper limit on the number of threads, say 10, so that if the length of the input list goes as high as 100, the number of running threads still stays at 10.
Please suggest the pythonic way to go about achieving this.
You have successfully invented the thread pool. There is built-in support, and there are many libraries and examples that provide this for you, so use one or learn from their code.
from multiprocessing.pool import ThreadPool
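For example, a short sketch (the stage functions and sizes are placeholders) of running each stage through a ThreadPool capped at 10 workers, so the number of live threads never exceeds the limit no matter how long the input list grows:

from multiprocessing.pool import ThreadPool

def stage_one(item):
    return item + 1        # placeholder work for the first task

def stage_two(item):
    return item * 2        # placeholder work for the second task

if __name__ == '__main__':
    inputs = list(range(100))
    with ThreadPool(10) as pool:             # never more than 10 threads per stage
        stage_one_out = pool.map(stage_one, inputs)
        stage_two_out = pool.map(stage_two, stage_one_out)
    print(stage_two_out[:5])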
