How to launch 100 workers in multiprocessing? - python

I am trying to use python to call my function, my_function() 100 times. Since my_function takes a while to run, I want to parallelize this process.
I tried reading the docs for https://docs.python.org/3/library/multiprocessing.html but could not find an easy example to get started with launching 100 workers. Order does not matter; I just need the function to run 100 times.
Any suggestions/code tips?

The literally first example on the page you link to works. So I'm just going to copy and paste it here and change two values.
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
with Pool(100) as p:
print(p.map(f, range(100)))
EDIT: you just said that you're using Google colab. I think google colab offers you two cpu cores, not more. (you can check by running !cat /proc/cpuinfo). With 2 cpu cores, you can only execute two pieces of computation at once.
So, if your function is not primarily something that waits for external IO (e.g. from network), then this makes no sense: you've got 50 executions competing for one core. The magic of modern multiprocessing is that this means that suddenly, one function will be interrupted, its state saved to RAM, the other function then may run for a while, gets interrupted, and so on.
This whole exchanging of state of course is overhead. You'd be faster just running as many instances your function in parallel as you have cores. Read the documentation on Pool as used above for more information.

Related

Multithreading CPU load

I'm trying to run a program external to Python with multithreading using this code:
def handle_multiprocessing_pool(num_threads: int, partial: Callable, variable: list) -> list:
progress_bar = TqdmBar(len(variable))
with multiprocessing.pool.ThreadPool(num_threads) as pool:
jobs = [
pool.apply_async(partial, (value,), callback=progress_bar.update_progress_bar)
for value in variable
]
pool.close()
processing_results = []
for job in jobs:
processing_results.append(job.get())
pool.join()
return processing_results
The Callable being called here loads an external program (with a C++ back-end), runs it and then extracts some data. Inside its GUI, the external program has an option to run cases in parallel, each case is assigned to a thread, from which I assumed it would be best to work with multithreading (instead of multiprocessing).
The script is running without issues, but I cannot quite manage to utilize the CPU power of our machine efficiently. The machine has 64 cores with 2 threads each. I will list some of my findings about the CPU utilisation.
When I run the cases from the GUI, it manages to utilize 100% CPU power.
When I run the script on 120 threads, it seems like only half of the threads are properly engaged:
The external program allows me to run on two threads, however if I run 60 parallel processes on 2 threads each, the utilisation looks similar.
When I run two similar scripts on 60 threads each, the full CPU power is properly used:
I have read about the Global Interpreter Lock in Python, but the multiprocessing package should circumvent this, right? Before test #4, I was assuming that for some reason the processes were still running on cores and the two threads on each were not able to run concurrently (this seems suggested here: multiprocessing.Pool vs multiprocessing.pool.ThreadPool), but especially the behaviour from #4 above is puzzling me.
I have tried the suggestions here Why does multiprocessing use only a single core after I import numpy? which unfortunately did not solve the problem.

Python: How many cores are used by my python program with five processes?

I have a python program consisting of 5 processes outside of the main process. Now I'm looking to get an AWS server or something similar on which I can run the script. But how can I find out how many vCPU cores are used by the script/how many are needed? I have looked at:
import multiprocessing
multiprocessing.cpu_count()
But it seems that it just returns the CPU count that's on the system. I just need to know how many vCPU cores the script uses.
Thanks for your time.
EDIT:
Just for some more information. The Processes are running indefinitely.
On Linux you can use the "top" command at the command line to monitor the real-time activity of all threads of a process id:
top -H -p <process id>
Answer to this post probably lies in the following question:
Multiprocessing : More processes than cpu.count
In short, you have probably hundreds of processes running, but that doesn't mean you will use hundreds of cores. It all depends on utilization, and the workload of the processes.
You can also get some additional info from the psutil module
import psutil
print(psutil.cpu_percent())
print(psutil.cpu_stats())
print(psutil.cpu_freq())
or using OS to receive current cpu usage in python:
import os
import psutil
l1, l2, l3 = psutil.getloadavg()
CPU_use = (l3/os.cpu_count()) * 100
print(CPU_use)
Credit: DelftStack
Edit
There might be some information for you in the following medium article. Maybe there are some tools for CPU usage too.
https://medium.com/survata-engineering-blog/monitoring-memory-usage-of-a-running-python-program-49f027e3d1ba
Edit 2
A good guideline for how many processes to start depends on the amount of threads available. It's basically just Thread_Count + 1, this ensures your processor doesn't just 'sit around and wait', this however is best used when you are IO bound, think of waiting for files from disk. Once it waits, that process is locked, thus you have 8 others to take over. The one extra is redundancy, in case all 8 are locked, the one that's left can take over right away. You can however in- or decrease this if you see fit.
Your question uses some general terms and leaves much unspecified so answers must be general.
It is assumed you are managing the processes using either Process directly or ProcessPoolExecutor.
In some cases, vCPU is a logical processor but per the following link there are services offering configurations of fractional vCPUs such as those in shared environments...
What is vCPU in AWS
You mention/ask...
... Now I'm looking to get an AWS server or something similar on which I can run the script. ...
... But how can I find out how many vCPU cores are used by the script/how many are needed? ...
You state AWS or something like it. The answer would depend on what your subprocess do, and how much of a vCPU or factional vCPU each subprocess needs. Generally, a vCPU is analogous to a logical processor upon which a thread can execute. A fractional portion of a vCPU will be some limited usage (than some otherwise "full" or complete "usage") of a vCPU.
The meaning of one or more vCPUs (or fractional vCPUs thereto) to your subprocesses really depends on those subprocesses, what they do. If one subprocess is sitting waiting on I/O most of the time, you hardly need a dedicated vCPU for it.
I recommend starting with some minimal least expensive configuration and see how it works with your app's expected workload. If you are not happy, increase the configuration as needed.
If it helps...
I usually use subprocesses if I need simultaneous execution that avoids Python's GIL limitations by breaking things into subprocesses. I generally use a single active thread per subprocess, where any other threads in the same subprocess are usually at a wait, waiting for I/O or do not otherwise compete with the primary active thread of the subprocess. Of course, a subprocess could be dedicated to I/O if you want to separate such from other threads you place in other subprocesses.
Since we do not know your app's purpose, architecture and many other factors, it's hard to say more than the generalities above.
Your computer has hundreds if not thousands of processes running at any given point. How does it handle all of those if it only has 5 cores? The thing is, each core takes a process for a certain amount of time or until it has nothing left to do inside that process.
For example, if I create a script that calculates the square root of all numbers from 1 to say a billion, you will see that a single core will hit max usage, then a split second later another core hits max while the first drops to normal and so on until the calculation is done.
Or if the process waits for an I/O process, then the core has nothing to do, so it drops the process, and goes to another process, when the I/O operation is done, the core can pick the process back, and get back to work.
You can run your multiprocessing python code on a single core, or on 100 cores, you can't really do much about it. However, on windows, you can set affinity of a process, which gives the process access to certain cores only. So, when the processes start, you can go to each one and set the affinity to say core 1 or each one to a separate core. Not sure how you do that on Linux though.
In conclusion, if you want a short and direct answer, I think we can say as many cores as it has access to. If you give them one core or 200 cores, they will still work. However, performance may degrade if the processes are CPU intensive, so I recommend starting with one core on AWS, check performance, and upgrade if needed.
I'll try to do my own summary about "I just need to know how many vCPU cores the script uses".
There is no way to answer that properly other than running your app and monitoring its resource usage. Assuming your Python processes do not spawn subprocesses (which could even be multithreaded applications), all we can say is that your app won't utilize more than 6 cores (as per total number of processes). There's a ton of ways for program to under-utilize CPU cores, like waiting for I/O (disk or network) or interprocess synchronization (shared resources). So to get any kind of understanding of CPU utilization, you really need to measure the actual performance (e.g., with htop utility on Linux or macOS) and investigating the causes of underperforming (if any).
Hope it helps.

How to make sure subprocess Popen is spreading the created instances across many CPU cores in an optimize way in Windows?

I'm programming for windows, and in my python 3 app i am using subprocess Popen to run around 32 instances of an application.
now lets say i have a 32 core CPU, will the Windows operating system automatically spread the instances across the cores in an optimize way? or do i need to specify any flag in Popen or do something else for it to do so?
Basically i want to give these 32 instances the highest possible priority and want to make sure these 32 instances of the application are using the cores the best way, and for example if i have 32 core, neither 2 instances should go in one core, I'm just not sure how can i achieve this?
I believe the solution for your thinking would be Pool() in multiprocessing in Python. Please read the docs here, https://docs.python.org/3/library/multiprocessing.html , for detailed information.
Basically, the Pool object offers a convenient means of parallelizing the execution of a function(in your case we shall say an application) across multiple input values, distributing the input data across processes (data parallelism).
Below is a sample snippet that can used to create N number of parallel processes, which will be evenly distributed to all the cores of the CPU.
import datetime
from multiprocessing import Pool
import time
def init_application(x):
time.sleep(2)
# Initiate your application here, which will be added to the pool for processing.
print(datetime.datetime.now())
return x*x
if __name__ == '__main__':
p = Pool() # Distributes to all available CPU cores
print(p.map(init_application, range(10)))
Hope this clarifies your doubt. Also, the above snippet is a sample implementation of Pool() in python. Please modify according your requirement.
Thanks,
AG

Why does threading sorting functions work slower than sequentially running them in python? [duplicate]

I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I cant figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum
from random import random
import threading
def ox():
print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
r = threading.Thread(target=ox)
r.start()
ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of eachother. Why should this be slower?
I suspect ox() is being parallelized automatically, because I if look at the Windows task manager performance tab, and call ox() in my python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions or in development branches - which at the very least should make multi-threaded CPU bound code as fast as single threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)
The problem is in function random()
If you remove random from you code.
Both cores try to access to shared state of the random function.
Cores work consequentially and spent a lot of time on caches synchronization.
Such behavior is known as false sharing.
Read this article False Sharing
As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can either use the python multiprocessing module to fix that or if you are willing to use other open source libraries, Ray is also a great option to get around the GIL problem and is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray
ray.init()
#ray.remote
def ox():
print(max([random() for x in range(20000000)]))
%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.

Python multiprocessing.Pool() doesn't use 100% of each CPU

I am working on multiprocessing in Python.
For example, consider the example given in the Python multiprocessing documentation (I have changed 100 to 1000000 in the example, just to consume more time). When I run this, I do see that Pool() is using all the 4 processes but I don't see each CPU moving upto 100%. How to achieve the usage of each CPU by 100%?
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
pool = Pool(processes=4)
result = pool.map(f, range(10000000))
It is because multiprocessing requires interprocess communication between the main process and the worker processes behind the scene, and the communication overhead took more (wall-clock) time than the "actual" computation (x * x) in your case.
Try "heavier" computation kernel instead, like
def f(x):
return reduce(lambda a, b: math.log(a+b), xrange(10**5), x)
Update (clarification)
I pointed out that the low CPU usage observed by the OP was due to the IPC overhead inherent in multiprocessing but the OP didn't need to worry about it too much because the original computation kernel was way too "light" to be used as a benchmark. In other words, multiprocessing works the worst with such a way too "light" kernel. If the OP implements a real-world logic (which, I'm sure, will be somewhat "heavier" than x * x) on top of multiprocessing, the OP will achieve a decent efficiency, I assure. My argument is backed up by an experiment with the "heavy" kernel I presented.
#FilipMalczak, I hope my clarification makes sense to you.
By the way there are some ways to improve the efficiency of x * x while using multiprocessing. For example, we can combine 1,000 jobs into one before we submit it to Pool unless we are required to solve each job in real time (ie. if you implement a REST API server, we shouldn't do in this way).
You're asking wrong kind of question. multiprocessing.Process represents process as understood in your operating system. multiprocessing.Pool is just a simple way to run several processes to do your work. Python environment has nothing to do with balancing load on cores/processors.
If you want to control how will processor time be given to processes, you should try tweaking your OS, not python interpreter.
Of course, "heavier" computations will be recognised by system, and may look like they do just what you want to do, but in fact, you have almost no control on process handling.
"Heavier" functions will just look heavier to your OS, and his usual reaction will be assigning more processor time to your processes, but that doesn't mean you did what you wanted to - and that's good, because that the whole point of languages with VM - you specify logic, and VM takes care of mapping this logic onto operating system.

Categories