multiprocessing not using all cores - python

I wrote a sample script, and am having issues after reinstalling Ubuntu 20.04. It appears that multiprocessing is only using a single core. Here is my sample script:
import random
from multiprocessing import Pool, cpu_count

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(32) as p:
        print(p.imap(f, random.sample(range(10, 99999999), 50000000)))
An image of my CPU usage is below. Any idea what might cause this?

The Pool of workers is an effective design pattern when your job can be split into separate units of work which can be distributed among multiple workers.
To do so, you need to divide your input into chunks and distribute those chunks to all the workers by some means. The multiprocessing.Pool uses OS processes for its workers and a single OS pipe as the transport layer.
This introduces a significant overhead, often referred to as Inter Process Communication (IPC) cost.
In your specific example, you generate a large dataset in the main process using the random.sample function. This alone takes quite a lot of resources. Then, you send each and every sample to a separate process, which performs a very trivial computation.
Needless to say, most of the time is spent in the main process, which has to generate a large set of data, divide it into chunks of size 1 (the default chunksize for pool.imap), send each and every chunk to the workers and collect the returned values. All the worker processes are basically idling, waiting for the main one to feed them work.
If you simulate some real computation in your function f, you will notice that all cores become busy.
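For example, a rough sketch along those lines (the heavier body of f and the chunksize value are illustrative, not the asker's actual workload); on a multi-core machine all cores should show activity:

import random
from multiprocessing import Pool, cpu_count

def f(x):
    # Simulate a computation that is actually worth shipping to a worker.
    return sum(i * i for i in range(x % 5_000))

if __name__ == '__main__':
    data = random.sample(range(10, 99_999_999), 100_000)
    with Pool(cpu_count()) as p:
        # A larger chunksize amortises the per-item IPC cost.
        results = list(p.imap(f, data, chunksize=1_000))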

Related

How to launch 100 workers in multiprocessing?

I am trying to use python to call my function, my_function() 100 times. Since my_function takes a while to run, I want to parallelize this process.
I tried reading the docs for https://docs.python.org/3/library/multiprocessing.html but could not find an easy example to get started with launching 100 workers. Order does not matter; I just need the function to run 100 times.
Any suggestions/code tips?
Literally the first example on the page you link to works. So I'm just going to copy and paste it here and change two values.
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(100) as p:
        print(p.map(f, range(100)))
EDIT: you just said that you're using Google Colab. I think Google Colab gives you two CPU cores, not more (you can check by running !cat /proc/cpuinfo). With 2 CPU cores, you can only execute two pieces of computation at once.
So, if your function is not primarily something that waits for external IO (e.g. the network), then this makes little sense: you've got 50 executions competing for each core. The magic of a modern multitasking operating system is that one function gets interrupted, its state is saved to RAM, another function runs for a while, gets interrupted, and so on.
All this exchanging of state is, of course, overhead. You'd be faster just running as many instances of your function in parallel as you have cores. Read the documentation on Pool as used above for more information.
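For example, a sketch of that advice (my_function is the name from the question; its body here is just a stand-in for the real work):

from multiprocessing import Pool, cpu_count

def my_function(i):
    # Stand-in body; the real function is whatever takes a while to run.
    return i * i

if __name__ == '__main__':
    with Pool(cpu_count()) as p:                    # as many workers as cores
        results = p.map(my_function, range(100))    # still 100 calls in total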

Understanding python multiprocessing pool map thread safety

This question had conflicting answers: Are Python multiprocessing Pool thread safe?
I am new to concurrency patterns and I am trying to run a project that takes in an array and distributes the work of the array onto multiple processes. The array is large.
inputs = range(100000)

with Pool(2) as pool:
    res = pool.map(some_func, inputs)
My understanding is that pool will distribute tasks to the processes. My questions are:
Is this map operation thread safe? Will two processes ever accidentally try to process the same value?
I superficially understand that tasks will be divided up into chunks and sent to processes. However, if different inputs take more time than others, will the work always be evenly distributed across my processes? Will I ever be in a scenario where one process is hanging but has a long queue of tasks to do while other processes are idle?
My understanding is that since I am just reading inputs in, I don't need to use any interprocess communication patterns like a server manager / shared memory. Is that right?
If I set up more processes than cores, will it basically operate like threads where the CPU is switching between tasks?
Thank you!
With the code provided, it is impossible for the same item of inputs to be processed by more than one process (an exception would be if the same instance of an object appears more than once in the iterable passed as argument). Nevertheless, this way of using multiprocessing has a lot of overhead, since the input items are sent one by one to the processes. A better approach is to use the chunksize parameter:
inputs = range(100000)
n_proc = 2
chunksize = len(inputs) // n_proc
if len(inputs) % n_proc:
    chunksize += 1

with Pool(n_proc) as pool:
    res = pool.map(some_func, inputs, chunksize=chunksize)
This way, chunks of inputs are passed to each process at once, leading to better performance.
The work is not divided into chunks unless you ask for it. If no chunksize is provided, each chunk is one item from the iterable (the equivalent of chunksize=1). Each chunk is 'sent' one by one to the available processes in the pool: a chunk is sent to a process as soon as it finishes working on the previous one and becomes available, so there is no need for every process to take the same number of chunks. In your example, if some_func takes longer for larger values and chunksize = len(inputs) // 2, the process that gets the chunk with the first half of inputs (the smaller values) will finish first while the other takes much longer. In that case, a smaller chunksize is a better option so the work is evenly distributed.
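A small sketch of that scenario, with a hypothetical some_func that is slower for larger inputs: two huge chunks leave one worker mostly idle towards the end, while a smaller chunksize keeps both busy:

import time
from multiprocessing import Pool

def some_func(n):
    # Hypothetical stand-in: the work grows with the value of the input.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    inputs = list(range(4000))
    with Pool(2) as pool:
        # Two huge chunks: one worker gets all the cheap items, the other
        # gets all the expensive ones and finishes much later.
        t0 = time.perf_counter()
        pool.map(some_func, inputs, chunksize=len(inputs) // 2)
        print('two big chunks:', time.perf_counter() - t0)

        # Smaller chunks are handed out as workers become free, so both
        # stay busy until the end.
        t0 = time.perf_counter()
        pool.map(some_func, inputs, chunksize=100)
        print('chunksize=100: ', time.perf_counter() - t0)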
This depends on what some_func does. If you do not need the result of some_func(n) to process some_func(m), you do not need to communicate between processes. If you are using map and need to communicate between processes, it is very likely that you are taking a bad approach to solving your problem.
If max_workers > os.cpu_count(), the CPU will switch between processes more often than with a lower number of processes. Don't forget that there are many more processes running on a (not amazingly old) computer than just your program. On Windows, max_workers must be less than or equal to 61 (see the docs here).

Python multiprocessing: dealing with 2000 processes

Following is my multiprocessing code. regressTuple has around 2000 items, so the following code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable to set the parallel thread count (hard coded) and loop through threading several times till completion? In this way, after a few experiments, I will be able to get the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Here is my code:
regressTuple = [(x,) for x in regressList]
processes = []

for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))

for process in processes:
    process.start()

for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
Spinning up processes is not limited by the number of cores on your system, but by the ulimit for your user id, which controls the total number of processes that can be launched under your user id (a quick way to check both limits is sketched after these points).
The number of cores determines how many of those launched processes can actually be running in parallel at any one time.
Crashing of your system can be due to the fact that the target function these processes are running is doing something heavy and resource intensive, which the system is not able to handle when multiple processes run simultaneously, or because the nprocs limit on the system has been exhausted and the kernel is not able to spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even if you have a 16-core Intel Skylake machine, because creating a new process on the system is not a lightweight task: there are a number of things, like generating the pid, allocating memory, creating the address space, scheduling the process, context switching and managing its entire life cycle, that happen in the background. So it is a heavy operation for the kernel to generate a new process.
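A quick way to see both of those limits from Python (a sketch; the resource module is Unix-only, so this will not run on Windows):

import os
import resource

# Cores that can actually run processes in parallel.
print("cores:", os.cpu_count())

# Per-user soft/hard limit on the number of processes (the ulimit mentioned above).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("max user processes (soft, hard):", soft, hard)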
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware you have on the machine. Spinning up more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to the pool. Something like this:
def target_func(data):
    # process the input data
    ...

if __name__ == '__main__':
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
        res = po.map(target_func, regressTuple)
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency; adding system performance tests to the run queue would be an overhead.
Is there an API call in Python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
How can I refactor the code to use an input variable to set the parallel thread count (hard coded) and loop through threading several times till completion? In this way, after a few experiments, I will be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Use a pool, or take the available system memory, divide it by ~3 MB and see how many tasks you can run at once.
This is probably more of a sysadmin task, balancing the bottlenecks against the queue length, but generally, if your tasks are IO bound, there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn into the road. The tasks will then fight with each other for the next block of IO.
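For example, a back-of-the-envelope version of that memory estimate (assumes the third-party psutil package and the rough ~3 MB-per-task figure above; a real task's footprint will differ):

import os
import psutil  # third-party: pip install psutil

PER_TASK_BYTES = 3 * 1024 * 1024              # rough guess, tune for your workload
available = psutil.virtual_memory().available

max_by_memory = available // PER_TASK_BYTES
max_by_cores = os.cpu_count()
print("workers to try:", min(max_by_memory, max_by_cores))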

Python multiprocessing - reassigning jobs dynamically from pool - without async?

So I have a batch of 1000 tasks that I assign using parmap/the Python multiprocessing module to 8 cores (dual Xeon machine, 16 physical cores). Currently this runs using the synchronous map.
The issue is that usually one of the cores lags well behind the others and still has several jobs/tasks to complete after all the other cores have finished their work. This may be related to core speed (an older computer), but is more likely due to some of the tasks being more difficult than others, so the one core that gets the slightly more difficult jobs ends up lagging...
I'm a little confused here, but is this what async parallelization does? I've tried using it before, but because this step is part of a very large processing step, it wasn't clear how to create a barrier to force the program to wait until all async processes are done.
Any advice/links to similar questions/answers are appreciated.
[EDIT] To clarify, the processes are fine to run independently; they all save data to disk and do not share variables.
parmap author here
By default, both in multiprocessing and in parmap, tasks are divided into chunks and the chunks are sent to each worker process (see the multiprocessing documentation). The reason behind this is that sending tasks individually to a process would introduce significant computational overhead in many situations. The overhead is reduced if several tasks are sent at once, in chunks.
The number of tasks in each chunk is controlled with chunksize in multiprocessing (and pm_chunksize in parmap). By default, chunksize is computed as "number of tasks" / (4 * "pool size"), rounded up (see the multiprocessing source code). So for your case of 1000 tasks on a pool of 8: 1000 / (4 * 8) = 31.25 -> 32 tasks per chunk.
If, as in your case, many computationally expensive tasks fall into the same chunk, that chunk will take a long time to finish.
One "cheap and easy" way to work around this is to pass a smaller chunksize value. Note that the extreme chunksize=1 may introduce an undesirably large CPU overhead.
A proper queuing system, as suggested in other answers, is a better solution in the long term, but may be overkill for a one-time problem.
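A small sketch of that cheap workaround, assuming parmap's pm_processes and pm_chunksize keyword arguments and a hypothetical extract_features task:

import parmap  # third-party: pip install parmap

def extract_features(task_id):
    # Hypothetical stand-in for the real task; a few ids are much slower than the rest.
    n = 2_000_000 if task_id % 100 == 0 else 10_000
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # The default chunking groups tens of tasks per chunk, so several slow
    # tasks can land in the same chunk; a smaller pm_chunksize spreads them
    # across workers at the cost of a little more IPC.
    results = parmap.map(extract_features, range(1000),
                         pm_processes=8, pm_chunksize=4)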
You really need to look at creating microservices and using a queue pool. For instance, you could put a list of jobs in Celery or Redis, and then have the microservices pull jobs from the queue one at a time and process them. Once done, they pull the next item, and so forth. That way your load is distributed based on readiness, and not based on a preset list.
http://www.celeryproject.org/
https://www.fullstackpython.com/task-queues.html
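A minimal sketch of that idea using Celery with a Redis broker (the module name, broker/backend URLs and the body of process_item are placeholders; start a worker with "celery -A tasks worker" before enqueuing):

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task
def process_item(item):
    # placeholder for the real per-job work
    return item * item

# In another process, enqueue the jobs and collect results as workers become free:
#   from tasks import process_item
#   handles = [process_item.delay(i) for i in range(1000)]
#   results = [h.get() for h in handles]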

multiprocessing of video frames in python

I am new to multiprocessing in Python. I want to extract features from each frame of hour-long video files. Processing each frame takes on the order of 30 ms. I thought multiprocessing was a good idea because each frame is processed independently of all other frames.
I want to store the results of the feature extraction in a custom class.
I read a few examples and ended up using multiprocessing and Queues as suggested here. The result was disappointing though: each frame now takes about 1000 ms to process. I am guessing I produced a ton of overhead.
Is there a more efficient way to process the frames in parallel and collect the results?
To illustrate, I put together a dummy example.
import multiprocessing as mp
from multiprocessing import Process, Queue
import numpy as np
import cv2

def main():
    #path='path\to\some\video.avi'
    coordinates = np.random.random((1000, 2))
    #video = cv2.VideoCapture(path)
    listOf_FuncAndArgLists = []
    for i in range(50):
        #video.set(cv2.CAP_PROP_POS_FRAMES, i)
        #img_frame_original = video.read()[1]
        #img_frame_original = cv2.cvtColor(img_frame_original, cv2.COLOR_BGR2GRAY)
        img_frame_dummy = np.random.random((300, 300))  # using a dummy image for this example
        frame_coordinates = coordinates[i, :]
        listOf_FuncAndArgLists.append([parallel_function, frame_coordinates, i, img_frame_dummy])
    queues = [Queue() for fff in listOf_FuncAndArgLists]  # create a queue object for each function
    jobs = [Process(target=storeOutputFFF, args=[funcArgs[0], funcArgs[1:], queues[iii]])
            for iii, funcArgs in enumerate(listOf_FuncAndArgLists)]
    for job in jobs: job.start()  # Launch them all
    for job in jobs: job.join()   # Wait for them all to finish
    # And now, collect all the outputs:
    return [queue.get() for queue in queues]

def storeOutputFFF(fff, theArgs, que):  # add an argument to the function for assigning a queue
    print('MULTIPROCESSING: Launching %s in parallel' % fff.__name__)
    que.put(fff(*theArgs))  # we're putting the return value into the queue

def parallel_function(frame_coordinates, i, img_frame_original):
    # do some image processing that takes about 20-30 ms
    dummyResult = np.argmax(img_frame_original)
    return resultClass(dummyResult, i)

class resultClass(object):
    def __init__(self, maxIntensity, i):
        self.maxIntensity = maxIntensity
        self.i = i

if __name__ == '__main__':
    mp.freeze_support()
    a = main()
    [x.maxIntensity for x in a]
Parallel processing in (regular) Python is a bit of a pain: in other languages we'd just use threads, but the GIL makes that problematic, and using multiprocessing has a big overhead in moving data around. I've found that fine-grained parallelism is (relatively) hard to do, whilst processing 'chunks' of work that take tens of seconds (or more) in a single process can be much more straightforward.
An easier path to parallel processing your problem, if you're on a UNIXy system, would be to make a Python program which processes a segment of video specified on the command line (i.e. a frame number to start at and a number of frames to process), and then use the GNU parallel tool to process multiple segments at once. A second Python program can consolidate the results from a collection of files, or read from stdin, piped from parallel. This way the processing code doesn't need to do its own parallelism, but it does require that the input file be accessed multiple times and that frames be extracted starting from mid-points. (This might also be extendable to work across multiple machines without changing the Python...)
multiprocessing.Pool.map could be used in a similar way if you need a pure-Python solution: map over a list of tuples (say, (file, startframe, endframe)) and then open the file in the function and process that segment.
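A sketch of that pure-Python variant (the path, the frame count and the per-frame feature are placeholders; real feature extraction replaces the gray.max() stand-in):

import multiprocessing as mp
import cv2

def process_segment(segment):
    # Process frames [start, end) of one video file.
    path, start, end = segment
    video = cv2.VideoCapture(path)
    video.set(cv2.CAP_PROP_POS_FRAMES, start)
    results = []
    for i in range(start, end):
        ok, frame = video.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        results.append((i, float(gray.max())))  # stand-in for the real feature extraction
    video.release()
    return results

if __name__ == '__main__':
    path = 'video.avi'                     # placeholder path
    n_frames = 100000                      # e.g. from cv2.CAP_PROP_FRAME_COUNT
    n_workers = mp.cpu_count()
    step = -(-n_frames // n_workers)       # ceiling division: frames per segment
    segments = [(path, s, min(s + step, n_frames)) for s in range(0, n_frames, step)]
    with mp.Pool(n_workers) as pool:
        per_segment = pool.map(process_segment, segments)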
Multiprocessing creates some overhead for starting several processes and bringing them all back together.
Your code does that for every frame.
Try splitting your video into N evenly-sized pieces and processing them in parallel.
Set N equal to the number of cores on your machine or something like that (your mileage may vary, but it's a good number to start experimenting with). There's no point in creating 50 processes if, say, 4 of them are being executed and the rest are simply waiting for their turn.
