Understanding Python multiprocessing Pool map thread safety

This question had conflicting answers: Is Python's multiprocessing Pool thread safe?
I am new to concurrency patterns and I am trying to run a project that takes in an array and distributes the work of the array onto multiple processes. The array is large.
from multiprocessing import Pool

inputs = range(100000)
with Pool(2) as pool:
    res = pool.map(some_func, inputs)
My understanding is that pool will distribute tasks to the processes. My questions are:
Is this map operation thread safe? Will two processes ever accidentally try to process the same value?
I superficially understand that tasks will be divided up into chunks and sent to processes. However, if different inputs take more time than others, will the work always be evenly distributed across my processes? Will I ever be in a scenario where one process is hanging but has a long queue of tasks to do while other processes are idle?
My understanding is that since I am just reading inputs in, I don't need to use any interprocess communication patterns like a Manager / shared memory. Is that right?
If I set up more processes than cores, will it basically operate like threads where the CPU is switching between tasks?
Thank you!

With the code provided, it is impossible that the same item of inputs will be processed by more than one process (an exception would be if the same instance of an object appears more than once in the iterable passed as argument). Nevertheless, this way of using multiprocessing carries some overhead, since inputs is split into chunks that are sent to the processes piece by piece. A better approach is to use the chunksize parameter:
inputs = range(100000)
n_proc = 2
chunksize = len(inputs) // n_proc
if len(inputs) % n_proc:
    chunksize += 1
with Pool(n_proc) as pool:
    res = pool.map(some_func, inputs, chunksize=chunksize)
This way, one large chunk of inputs is passed to each process at once, reducing the communication overhead and improving performance.
Chunks are handed out to the processes as they become available: a process gets its next chunk only when it finishes the previous one, so there is no need for every process to take the same number of chunks. If no chunksize is provided, pool.map computes a default of roughly len(inputs) / (4 * number of processes), while pool.imap defaults to one item per chunk. In your example, if some_func takes longer for larger values and chunksize = len(inputs) // 2, the process that gets the chunk with the first half of inputs (the smaller values) will finish first while the other takes much longer. In that case a smaller chunksize is a better option, so the work is distributed more evenly.
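To make the trade-off concrete, here is a minimal sketch (assuming a stand-in some_func whose cost grows with its input, and a smaller input range so it runs quickly) showing how a small chunksize lets a fast worker keep pulling work instead of being stuck with a fixed half of the data:
import time
from multiprocessing import Pool

def some_func(n):
    # Stand-in workload whose cost grows with n, so the second half of the
    # inputs is much more expensive than the first half.
    time.sleep(n * 1e-6)
    return n * n

if __name__ == '__main__':
    inputs = range(2000)   # smaller than the question's range(100000), just for the sketch
    with Pool(2) as pool:
        # Many small chunks: whichever process finishes early keeps pulling
        # more work, instead of owning a fixed half of the inputs.
        res = pool.map(some_func, inputs, chunksize=50)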
This depends on what some_func does. If you do not need the result of some_func(n) to process some_func(m), you do not need to communicate between processes. If you are using map and need to communicate between processes, it is very likely that you are taking a bad approach to solving your problem.
If the number of workers is greater than os.cpu_count(), the CPU will switch between processes more often than with a lower number of processes. Don't forget that there are many more processes running on a (not amazingly old) computer than just your program. On Windows, max_workers must be less than or equal to 61 (see the docs here).
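As a rule of thumb (a sketch, not taken from the answer above), sizing the pool from os.cpu_count() avoids oversubscribing the CPU; some_func and inputs are the placeholders from the question:
import os
from multiprocessing import Pool

def some_func(n):
    return n * n   # placeholder for the real work

if __name__ == '__main__':
    inputs = range(100000)
    n_proc = os.cpu_count() or 2   # one worker per core is a common default
    with Pool(n_proc) as pool:
        res = pool.map(some_func, inputs)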

Related

multiprocessing not using all cores

I wrote a sample script, and am having issues after reinstalling Ubuntu 20.04. It appears that multiprocessing is only using a single core. Here is my sample script:
import random
from multiprocessing import Pool, cpu_count

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(32) as p:
        print(p.imap(f, random.sample(range(10, 99999999), 50000000)))
An image of my processor usage is below. Any idea what might cause this?
The Pool of workers is an effective design pattern when your job can be split into separate units of work which can be distributed among multiple workers.
To do so, you need to divide your input into chunks and distribute these chunks by some means to all the workers. The multiprocessing.Pool uses OS processes for workers and a single OS pipe as the transport layer.
This introduces significant overhead, often referred to as the Inter-Process Communication (IPC) cost.
In your specific example, you generate a large dataset in the main process using the random.sample function. This alone takes quite a lot of resources. Then, you send each and every sample to a separate process, which performs a very trivial computation.
Needless to say, most of the time is spent in the main process, which has to generate a large set of data, divide it into chunks of size 1 (the default chunksize for pool.imap), send each and every chunk to the workers, and collect the returned values. All the worker processes are basically idle, waiting for the main one to feed them work.
If you try to simulate some computation on your function f, you will notice how all cores become busy.
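For example, a hedged sketch of that experiment (with a sleep standing in for real work, and a much smaller sample than the original 50 million so it finishes quickly) might look like this:
import random
import time
from multiprocessing import Pool

def f(x):
    time.sleep(0.001)   # stand-in for a non-trivial computation
    return x * x

if __name__ == '__main__':
    # Generating the data still happens in the main process, as described above.
    data = random.sample(range(10, 99999999), 10000)
    with Pool() as p:
        # A larger chunksize also cuts the per-item IPC cost of imap.
        results = list(p.imap(f, data, chunksize=100))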

Python multiprocessing - reassigning jobs dynamically from pool - without async?

So I have a batch of 1000 tasks that I assign using parmap/the Python multiprocessing module to 8 cores (dual Xeon machine, 16 physical cores). Currently this runs synchronously.
The issue is that usually 1 of the cores lags well behind the other cores and still has several jobs/tasks to complete after all the other cores finished their work. This may be related to core speed (older computer) but more likely due to some of the tasks being more difficult than others - so the 1 core that gets the slightly more difficult jobs gets laggy...
I'm a little confused here - but is this what async parallelization does? I've tried using it before, but because this step is part of a much larger processing pipeline, it wasn't clear how to create a barrier to force the program to wait until all async processes are done.
Any advice/links to similar questions/answers are appreciated.
[EDIT] To clarify, the processes are ok to run independently, they all save data to disk and do not share variables.
parmap author here
By default, both in multiprocessing and in parmap, tasks are divided into chunks and the chunks are sent to each worker process (see the multiprocessing documentation). The reason behind this is that sending tasks individually to a process would introduce significant computational overhead in many situations. The overhead is reduced if several tasks are sent at once, in chunks.
The number of tasks in each chunk is controlled with chunksize in multiprocessing (and pm_chunksize in parmap). By default, chunksize is computed as "number of tasks" / (4 * "pool size"), rounded up (see the multiprocessing source code). So for your case, with a pool of 8, that is 1000/(4*8) = 31.25 -> 32 tasks per chunk.
If, as in your case, many computationally expensive tasks fall into the same chunk, that chunk will take a long time to finish.
One "cheap and easy" way to workaround this is to pass a smaller chunksize value. Note that using the extreme chunksize=1 may introduce undesired larger cpu overhead.
A proper queuing system, as suggested in other answers, is a better solution in the long term, but may be overkill for a one-time problem.
You really need to look at creating microservices and using a queue pool. For instance, you could put a list of jobs in celery or redis, and then have the microservices pull from the queue one at a time and process the job. Once done they pull the next item and so forth. That way your load is distributed based on readiness, and not based on a preset list.
http://www.celeryproject.org/
https://www.fullstackpython.com/task-queues.html
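A minimal Celery sketch of that idea, assuming a Redis broker on localhost and a placeholder process_job body:
# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def process_job(job):
    # Placeholder per-job work; workers pull jobs as they become free,
    # so the load balances on readiness rather than on a preset split.
    return job * job

# Enqueue from anywhere:  process_job.delay(some_job)
# Run the workers with:   celery -A tasks worker --concurrency=8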

Execute Python threads in small groups

I am trying to insert some number (100) of data sets into SQL Server using Python. I am using multi-threading to create 100 threads in a loop. All of them start at the same time and this is bogging down the database. I want to group my threads into sets of 5, and once one group is done, I would like to start the next group of threads, and so on. As I am new to Python and multi-threading, any help would be highly appreciated. Please find my code below.
from threading import Thread

for row in datasets:
    argument1 = row[0]
    argument2 = row[1]
    jobs = []
    t = Thread(target=insertDataIntoSQLServer, args=(argument1, argument2))
    jobs.append(t)
    t.start()
for t in jobs:
    t.join()
On Python 2 and 3 you could use a multiprocessing.pool.ThreadPool. This is like a multiprocessing.Pool, but it uses threads instead of processes.
from multiprocessing.pool import ThreadPool

datasets = [(1, 2, 3), (4, 5, 6)]  # Iterable of datasets.

def insertfn(data):
    pass  # shove data to SQL server

pool = ThreadPool()
pool.map(insertfn, datasets)
By default, a Pool will create as many worker threads as your CPU has cores. Using more threads will probably not help, because they will be fighting for CPU time.
Note that I've grouped data into tuples. That is one way to get around the one argument restriction for pool workers.
On Python 3 you can also use a ThreadPoolExecutor.
Note however that on Python implementations (like the "standard" CPython) that have a Global Interpreter Lock, only one thread at a time can be executing Python bytecode. So using large numbers of threads will not automatically increase performance. Threads might help with operations that are I/O bound. If one thread is waiting for I/O, another thread can run.
First note that your code doesn't work as you intended: it sets jobs to an empty list every time through the loop, so after the loop is over you only join() the last thread created.
So repair that, by moving jobs=[] out of the loop. After that, you can get exactly what you asked for by adding this after t.start():
if len(jobs) == 5:
    for t in jobs:
        t.join()
    jobs = []
I'd personally use some kind of pool (as other answers suggest), but it's easy to directly get what you had in mind.
You can create a ThreadPoolExecutor and specify max_workers=5.
See here.
And you can use functools.partial to turn your functions into the required 0-argument functions.
EDIT: You can pass the args in with the function name when you submit to the executor. Thanks, Roland Smith, for reminding me that partial is a bad idea. There was a better way.
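A sketch of that approach, assuming the datasets iterable and insertDataIntoSQLServer function from the question:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [
        executor.submit(insertDataIntoSQLServer, row[0], row[1])
        for row in datasets
    ]
    for future in futures:
        future.result()   # blocks until done; re-raises any worker exception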

Python multiprocessing pool number of jobs not correct

I wrote a Python program that uses a Pool to launch 16 parallel processes to process some files. At the beginning of the run, the number of running processes stays at 16 until almost all files have been processed. Then, for some reason I don't understand, when there are only a few files left, only one process runs at a time, which makes processing take much longer than necessary. Could you help with this?
Force map() to use a chunksize of 1 instead of guessing the best value by itself, e.g.:
pool = Pool(16)
pool.map(func, iterable, chunksize=1)
This should (in theory) guarantee the best distribution of load among workers until the end of the input data.
See here
Before it starts executing the work you submit via apply_async / map_async (or map) on a Pool, Python assigns each worker a piece of the work.
For example, let's say that you have 8 files to process and you start a Pool with 4 workers.
Before the file processing starts, two specific files will be assigned to each worker. This means that if some worker finishes its job earlier than the others, it will simply "have a break" and will not start helping the others.
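A small sketch of the scenario described above (the file names and timings are made up), showing how chunksize=1 keeps idle workers pulling tasks instead of waiting on a pre-assigned block:
import time
from multiprocessing import Pool

def process_file(name):
    # Pretend one file is much slower than the others.
    time.sleep(2 if name.endswith('7') else 0.1)
    return name

if __name__ == '__main__':
    files = ['file_%d' % i for i in range(8)]   # hypothetical file names
    with Pool(4) as pool:
        # chunksize=1: each worker takes one file at a time, so no worker is
        # left holding a pre-assigned block while the others sit idle.
        done = pool.map(process_file, files, chunksize=1)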

From multiprocessing to distributed processing in python standard library

I am studying this code from GitHub about distributed processing. I would like to thank eliben for this nice post. I have read his explanations, but some points are unclear to me. As far as I understand, the code is for distributing tasks across multiple machines/clients. My questions are:
The most basic of my questions is where the distribution of the work to different machines is happening?
Why there is an if else statement in the main function?
Let me start this question in a more general way. I thought that we usually start each Process on a specific chunk (an independent part of memory), rather than passing all the chunks at once, like this:
chunksize = int(math.ceil(len(HugeList) / float(nprocs)))
for i in range(nprocs):
    p = Process(
        target=myWorker,  # This is my worker
        args=(HugeList[chunksize * i:chunksize * (i + 1)], HUGEQ)
    )
    processes.append(p)
    p.start()
In this simple case we have nprocs processes. Each process initiates an instance of the function myWorker that works on the specified chunk.
My question here is:
How many threads do we have for each process that work in each chunk?
Looking now into the GitHub code, I am trying to understand mp_factorizer. More specifically, in this function we do not have chunks but a huge queue (shared_job_q). This huge queue consists of sub-lists of at most 43 items. This queue is passed into factorizer_worker. There, via get, we obtain those sub-lists and pass them to the serial worker. I understand that we need this queue to share data between clients.
My questions here is:
Do we call an instance of the factorizer_worker function for each of the nprocs(=8) processes?
Which part of the data each process work? (Generally, we have 8 processes and 43 chunks.)
How many threads exist for each process?
Does get function called from each process thread?
Thanks for your time.
The distribution to multiple machines only happens if you actually run the script on multiple machines. The first time you run the script (without the --client option), it starts the Manager server on a specific IP/port, which hosts the shared job/result queues. In addition to starting the Manager server, runserver will also act as a worker, by calling mp_factorizer. It is additionally responsible for collecting the results from the result queue and processing them. You could run this script by itself and get a complete result.
However, you can also distribute the factorization work to other machines, by running the script on other machines using the --client option. That will call runclient, which will connect to the existing Manager server you started with the initial run of the script. That means that the clients are accessing the same shared queues runserver is using, so they can all pull work from and put results to the same queues.
The above should cover questions 1 and 2.
I'm not exactly sure what you're asking in question 3. I think you're wondering why we don't pass a chunk of the list to each worker explicitly (like in the example you included), rather than putting all the chunks into a queue. The answer is that the runserver method doesn't know how many workers there are actually going to be. It knows that it's going to start 8 workers. However, it doesn't want to split the HugeList into eight chunks and send them to the 8 processes it's creating, because it wants to support remote clients connecting to the Manager and doing work, too. So instead, it picks an arbitrary size for each chunk (43), divides the list into as many chunks of that size as it takes to consume the entire HugeList, and sticks them in a queue. Here's the code in runserver that does that:
chunksize = 43
for i in range(0, len(nums), chunksize):
    # print 'putting chunk %s:%s in job Q' % (i, i + chunksize)
    shared_job_q.put(nums[i:i + chunksize])  # Adds a chunk of up to 43 items to the shared queue.
That way, as many workers as you want can connect to the Manager server, grab a chunk from shared_job_q, process it, and return a result.
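For illustration, here is a simplified sketch of such a worker loop (not eliben's exact code; serial_factorize is a stand-in for the serial routine used in the post):
import queue

def factorizer_worker(job_q, result_q):
    # Each worker process repeatedly pulls a chunk (a sub-list of up to 43
    # numbers) off the shared job queue, processes it, and pushes the results.
    while True:
        try:
            chunk = job_q.get_nowait()
        except queue.Empty:
            return
        result_q.put({n: serial_factorize(n) for n in chunk})

def serial_factorize(n):
    # Stand-in trial-division factorization.
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors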
Do we call an instance of the factorizer_worker function for each of the nprocs(=8) processes?
Yes
Which part of the data each process work? (Generally, we have 8 processes and 43 chunks.)
We don't have 43 chunks. We have X number of chunks, each of size 43. Each worker process just grabs chunks off the queue and processes them. Which part it gets is arbitrary and depends on how many workers there are and how fast each is going.
How many threads exist for each process?
One. If you mean how many worker processes exist for each instance of the script, there are 8 in the server process, and 4 in each client process.
Does get function called from each process thread?
Not sure what you mean by this.
