Python multiprocess pool processes count

I am using a Linux server with 128 cores, but I'm not the only one using it, so I want to make sure that I use a maximum of 60 cores for my program. The program I'm writing is a series of simulations, where each simulation is parallelized itself. The number of cores used by a single simulation can be chosen, and I typically use 12.
So in theory, I can run 5 of these simulations at the same time, which would use 60 cores in total (5 × 12). I want to start the simulations from Python (since that's where all the preprocessing happens), and the multiprocessing library has caught my eye, especially the Pool class.
Now my question is: should I use
pool = Pool(processes=5)
or
pool = Pool(processes=60)
The idea being: does the processes argument signify the number of workers used (where each worker is assigned 12 cores), or the total number of processes available?

The 'processes' argument in Pool is the total number of worker subprocesses you want the pool to create in this program. So if you want to use all 60 cores, it should be 60.
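If it helps, here is a minimal sketch (the report function and the task count are made up) showing that processes only controls how many worker processes the pool spawns; it says nothing about cores:

import os
from multiprocessing import Pool

def report(task_id):
    # Each task simply reports which worker process handled it.
    return task_id, os.getpid()

if __name__ == "__main__":
    with Pool(processes=5) as pool:              # exactly 5 worker processes
        results = pool.map(report, range(20))
    print(sorted({pid for _, pid in results}))   # at most 5 distinct PIDs

Running it prints at most 5 distinct worker PIDs no matter how many tasks are mapped.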

Related

Is splitting my program into 8 separate processes the best approach (performance-wise) when I have 8 logical cores in Python?

Intro
I have rewritten one of my previously sequential algorithms to work in a parallel fashion (we are talking about real parallelism, not concurrency or threads). I run a batch script that starts my "worker" Python processes, and each performs the same task on a different offset (no data sharing between processes). If it helps to visualize, imagine a dummy example with an API endpoint that, on a [GET] request, sends me a number and I have to guess whether it is even or odd, so I run my workers. This example gets the point across, as I can't share the algorithm, but assume the routine for a single process is already optimized to the maximum.
Important: the processes are executed on Windows 10 with admin privileges and real_time priority.
Diagnostics
Is the optimal number of worker node processes equal to the number of logical cores (i.e. 8)? In Task Manager I see my CPU hit the 100% limit on all cores, but when I look at the processes they each take about 6%. With 6% * 8 = 48%, how does this make sense? At idle (without the processes) my CPU sits at about 0-5% total.
I've tried to diagnose it with Performance Monitor but the results were even more confusing:
Reasoning
I didn't know how to configure Performance Monitor to track my processes across separate cores, so I used total CPU time as the Y-axis. How can I have a minimum of 20% usage on 8 processes, which means 120% utilization?
Question 1
This doesn't make much sense, and the numbers are different from what Task Manager shows. Worst of all is the bolded blue line, which shows total (average) CPU usage across all cores and doesn't seem to exceed 70% when Task Manager says all my cores run at 100%. What am I confusing here?
Question 2
Is running X processes, where X is the number of logical cores on the system, under real_time priority the best I can do (letting the OS handle the scheduling logic)? In the second picture, the bar chart shows that it is doing a decent job: ideally all those bars would be of equal height, which is roughly true.
I have found the answer to this question and have decided to post it rather than delete the question. I used the psutil library to set the affinity of each worker process manually and distribute the workers myself instead of leaving it to the OS (a sketch of that follows below). I also had MANY I/O operations on the network and from debug prints, which kept my processor cores from maxing out at 100% (after all, Windows is not a real-time operating system).
In addition, since I tested the code on my laptop, I ran into thermal throttling, which disturbed the %usage measurements.
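For anyone curious, a rough sketch of that approach (the worker.py script, the --offset flag, and the core numbering are placeholders, not the actual project code):

import subprocess
import psutil

# Hypothetical setup: launch 8 worker scripts and pin worker i to logical core i.
workers = []
for core in range(8):
    proc = subprocess.Popen(["python", "worker.py", "--offset", str(core)])
    psutil.Process(proc.pid).cpu_affinity([core])   # pin this worker to a single core
    workers.append(proc)

for proc in workers:
    proc.wait()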

How to do efficient multiprocessing?

I am using multiprocessing.Pool.map to run my code in parallel on my workstation, which has 10 physical cores (20 if I include the logical ones).
To summarize my code, I have to do some calculations with 2080 matrices. So I divide the 2080 matrices into 130 groups, each containing 16 matrices.
The calculation for these 16 matrices is then distributed over 16 cores (should I use only 10, since I have only 10 physical cores?) using multiprocessing.Pool.map.
My questions are:
(1) When I monitor the CPU usage in the 'system monitor' in Ubuntu, I often find only 1 CPU showing 100% usage instead of 16 CPUs at 100%. The 16 CPUs show 100% usage only for short durations. Why does this happen? How can I improve it?
(2) Will I be able to improve the calculation time by dividing the 2080 matrices into 104 groups of 20 matrices each and then distributing the calculation of these 20 matrices over 10 or 16 cores?
My code snippet is as below:
import numpy as np
from multiprocessing import Pool

def f(k):
    adj = np.zeros((9045, 9045), dtype='bool')
    # Calculate the elements of the matrices
    return adj

n_CPU = 16
n_networks_window = 16
window = int(2080 / n_networks_window)  # Dividing 2080 matrices into 130 segments having 16 matrices each
for i in range(window):
    range_window = range(int(i * 2080 / window), int((i + 1) * 2080 / window))
    p = Pool(processes=n_CPU)
    adj = p.map(f, range_window)
    p.close()
    p.join()
    for k in range_window:
        # Some calculations using adj
        np.savetxt(...)  # saving the output as a txt file
Any help would be really useful, as this is my first time parallelizing Python code.
Thank you.
EDIT:
I tried the following change in the code and it is working fine now:
pool.imap_unordered(f,range(2080),chunksize=260)
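For context, that call presumably sits inside a single pool, roughly like this (a sketch only; f and n_CPU are the names defined above, and the per-matrix post-processing is elided):

if __name__ == "__main__":
    with Pool(processes=n_CPU) as pool:
        # chunksize=260 hands each worker a large batch of tasks at once,
        # cutting down the scheduling overhead for 2080 small jobs.
        for adj in pool.imap_unordered(f, range(2080), chunksize=260):
            pass   # per-matrix calculations and saving go here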
Your problem is here:
for i in range(window):
    # [snip]
    p = Pool(processes=n_CPU)
    adj = p.map(f, range_window)
    p.close()
    p.join()
    # [snip]
You're creating a new Pool on every iteration of the loop and submitting only a few jobs to it. For the loop to continue, those few jobs have to complete before more jobs can be executed. In other words, you're not using parallelism to its full potential.
What you should do is create a single Pool, submit all the jobs, and then, out of the loop, join:
p = Pool(processes=n_CPU)
for i in range(window):
    # [snip]
    p.map_async(f, range_window)
    # [snip]
p.close()
p.join()
Note the use of map_async instead of map: this is, again, to avoid waiting for a small portion of jobs to complete before submitting new jobs.
An even better approach is to call map/map_async only once, constructing a single range object and avoiding the for-loop:
with Pool(processes=n_CPU) as p:
    p.map(f, range(2080))  # This will block, but it's okay: we are
                           # submitting all the jobs at once
As for your question about the number of CPUs to use, first of all note that Pool will use all the CPUs available (as returned by os.cpu_count()) by default if you don't specify the processes argument -- give it a try.
It's not clear to me what you mean by having 10 physical cores and 20 logical ones. If you're talking about hyperthreading, then it's fine: use them all. If instead you're saying that you're using a virtual machine with more virtual CPUs than the host CPUs, then using 20 instead of 10 won't make much difference.
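If you want to check those numbers on your own machine, a quick sketch (psutil is optional here and only used to count physical cores):

import os

print("logical CPUs:", os.cpu_count())   # multiprocessing.Pool() starts this many workers by default

try:
    import psutil                        # optional; only used to count physical cores
    print("physical cores:", psutil.cpu_count(logical=False))
except ImportError:
    pass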

Python multiprocessing ratio of processors and iteration

This is probably a stupid question. But, if I have a simple function and I want to run it say 100 times and I have 12 processors available, is it better to use 10 processors to run the multiprocessing code or 12?
Basically, by using 12 cores will I save one iteration's worth of time? Or will it run 10 iterations the first time, then 2, then 10 again, and so on?
It's almost always better to use the number of processors available. However, some algorithms need processes to communicate partial results to achieve an end result (many image processing algorithms have this constraint). Those algorithms have a limit on the number of processes that should run in parallel, as beyond this limit the cost of communication impairs performance.
That said, it depends on a lot of things. Many algorithms are easily parallelizable, yet the cost of parallelism impairs their speedup. Basically, for parallelism to be worth anything, the actual work to be done must be an order of magnitude greater than the cost of parallelism.
In typical multi-threaded languages, you can easily reduce the cost of parallelism by re-using the same threads (thread pooling). However, Python being Python, you must use multiprocessing to achieve true parallelism, which has a significant cost. Fortunately there is a process pool (multiprocessing.Pool) if you wish to re-use processes.
You need to check how much time it takes to run your algorithm sequentially, how much time one iteration takes, and how many iterations you will have. Only then will you know if parallelization is worth it. If it is worth it, then run tests with the number of processes going from 1 to 100. This will let you find the sweet spot for your algorithm.
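A rough way to run such tests, with a made-up work function standing in for one iteration of the real algorithm:

import time
from multiprocessing import Pool

def work(item):
    # Stand-in for one iteration of the real function.
    return sum(i * i for i in range(100_000))

if __name__ == "__main__":
    tasks = range(100)
    for n in (1, 2, 4, 8, 10, 12):
        start = time.perf_counter()
        with Pool(processes=n) as pool:
            pool.map(work, tasks)
        print(n, "processes:", round(time.perf_counter() - start, 2), "s")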

multiprocessing timing inconsistency

I have about 100 processes. Each process contains 10 inputs (logic expressions), and the task of each process is to find the fastest heuristic algorithm for solving each of the logic inputs (I have about 5 heuristic algorithms).
When I run each process separately, the results are different from when I run all of the processes in parallel (using python p1.py & python p2.py & ...). For example, when the processes run separately, input 1 (in p1) finds the first heuristic algorithm to be the fastest method, but in parallel the same input finds the 5th heuristic algorithm to be faster!
Could the reason be that the CPU switches between the parallel processes and messes up the timing, so that it cannot report the right time each heuristic algorithm spends solving the input?
What is the solution? Would decreasing the number of processes by half reduce the false results? (I run my program on a server.)
The operating system has to schedule all your processes onto a much smaller number of CPUs. To do so, it runs one process on each CPU for a small amount of time. After that, the operating system schedules the process out to let the other processes run, in order to give each process its fair share of running time. Thus each process has to wait for a running slot on a CPU. Those waiting times depend on the number of other processes waiting to run and are almost unpredictable.
If you use clock time for your measurements, the waiting times will pollute your measurements. For a more precise measurement, you could ask the operating system how much CPU time the process used. The function time.process_time() does that.
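A small sketch of the difference, with a placeholder solve function standing in for one heuristic run:

import time

def solve():
    # Placeholder for one heuristic solving one logic input.
    return sum(i * i for i in range(2_000_000))

wall_start = time.perf_counter()   # wall-clock time: includes time spent waiting for a CPU
cpu_start = time.process_time()    # CPU time actually consumed by this process

solve()

print("wall clock:", time.perf_counter() - wall_start)
print("CPU time:  ", time.process_time() - cpu_start)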
Switching between processes costs time. Multiple processes accessing the same resources (files, hard disk, CPU caches, memory, ...) costs time. For CPU-bound processes, having orders of magnitude more running processes than CPUs will slow down the execution. You'll get better results by starting slightly fewer processes than the number of CPUs. The spare CPUs remain available for work needed by the operating system or other unrelated programs.

Python multiprocessing: restrict number of cores used

I want to know how to distribute N independent tasks to exactly M processors on a machine that has L cores, where L>M. I don't want to use all the processors because I still want to have I/O available. The solutions I've tried seem to get distributed to all processors, bogging down the system.
I assume the multiprocessing module is the way to go.
I do numerical simulations. My background is in physics, not computer science, so unfortunately, I often don't fully understand discussions involving standard tasking models like server/client, producer/consumer, etc.
Here are some simplified models that I've tried:
Suppose I have a function run_sim(**kwargs) (see further below) that runs a simulation, and a long list of kwargs for the simulations, and I have an 8-core machine.
from multiprocessing import Pool, Process

# using pool
p = Pool(4)
p.map(run_sim, kwargs)

# using process
number_of_live_jobs = 0
all_jobs = []
sim_index = 0
while sim_index < len(kwargs)+1:
    number_of_live_jobs = len([1 for job in all_jobs if job.is_alive()])
    if number_of_live_jobs <= 4:
        p = Process(target=run_sim, args=[], kwargs=kwargs[sim_index])
        print "starting job", kwargs[sim_index]["data_file_name"]
        print "number of live jobs: ", number_of_live_jobs
        p.start()
        p.join()
        all_jobs.append(p)
        sim_index += 1
When I look at the processor usage with "top" and then "1", all processors seem to get used anyway in either case. It is not out of the question that I am misinterpreting the output of "top", but if run_sim() is processor-intensive, the machine bogs down heavily.
Hypothetical simulation and data:
# simulation kwargs
numbers_of_steps = range(0, 10000000, 1000000)
sigmas = [x for x in range(11)]
kwargs = []
for number_of_steps in numbers_of_steps:
    for sigma in sigmas:
        kwargs.append(
            dict(
                number_of_steps=number_of_steps,
                sigma=sigma,
                # why do I need to cast to int?
                data_file_name="walk_steps=%i_sigma=%i" % (number_of_steps, sigma),
            )
        )
import random, time
random.seed(time.time())

# simulation of random walk
def run_sim(kwargs):
    number_of_steps = kwargs["number_of_steps"]
    sigma = kwargs["sigma"]
    data_file_name = kwargs["data_file_name"]
    data_file = open(data_file_name + ".dat", "w")
    current_position = 0
    print "running simulation", data_file_name
    for n in range(int(number_of_steps) + 1):
        data_file.write("step number %i position=%f\n" % (n, current_position))
        random_step = random.gauss(0, sigma)
        current_position += random_step
    data_file.close()
If you are on Linux, use taskset when you launch the program.
A child created via fork(2) inherits its parent’s CPU affinity mask. The affinity mask is preserved across an execve(2).
TASKSET(1)                    Linux User's Manual                    TASKSET(1)

NAME
    taskset - retrieve or set a process's CPU affinity

SYNOPSIS
    taskset [options] mask command [arg]...
    taskset [options] -p [mask] pid

DESCRIPTION
    taskset is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new COMMAND with a given CPU affinity. CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs. Note that the Linux scheduler also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications.

    The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU. Not all CPUs may exist on a given system, but a mask may specify more CPUs than are present. A retrieved mask will reflect only the bits that correspond to CPUs physically on the system. If an invalid mask is given (i.e., one that corresponds to no valid CPUs on the current system) an error is returned. The masks are typically given in hexadecimal.
You might want to look into the following package:
http://pypi.python.org/pypi/affinity
It is a package that uses sched_setaffinity and sched_getaffinity.
The drawback is that it is highly Linux-specific.
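For what it's worth, the same calls are now exposed in the standard library on Linux as os.sched_setaffinity/os.sched_getaffinity (Python 3.3+), so a rough sketch without any extra package might look like this (the dummy run_sim and the dummy kwargs list are placeholders for the question's real code):

import os
from multiprocessing import Pool

def run_sim(sim_kwargs):
    pass   # stand-in for the simulation from the question

if __name__ == "__main__":
    # Restrict this process to 4 of the machine's cores (pid 0 means this process; Linux only).
    os.sched_setaffinity(0, {0, 1, 2, 3})
    # Pool workers forked from here inherit the affinity mask, so at most 4 cores get used,
    # leaving the rest of the machine free for I/O and other users.
    with Pool(processes=4) as pool:
        pool.map(run_sim, [dict(sigma=s) for s in range(8)])   # dummy kwargs list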
On my dual-core machine the total number of processes is honoured, i.e. if I do
p = Pool(1)
Then I only see one CPU in use at any given time. The process is free to migrate to a different processor, but then the other processor is idle. I don't see how all your processors can be in use at the same time, so I don't follow how this can be related to your I/O issues. Of course, if your simulation is I/O bound, then you will see sluggish I/O regardless of core usage...
Probably a dumb observation, please forgive my inexperience in Python.
But your while loop polling for the finished tasks is not going to sleep and is consuming one core all the time, isn't it? (A variant with a sleep added is sketched after this answer.)
The other thing to notice is that if your tasks are I/O bound, your M should be adjusted to the number of parallel disks(?) you have... if they are NFS-mounted on different machines you could potentially have M > L.
g'luck!
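To illustrate the polling point, a sketch of the question's Process loop with a sleep added and the premature join removed (it reuses run_sim and the kwargs list defined in the question, so it is only a sketch):

import time
from multiprocessing import Process

all_jobs = []
sim_index = 0
while sim_index < len(kwargs):
    all_jobs = [job for job in all_jobs if job.is_alive()]
    if len(all_jobs) < 4:
        p = Process(target=run_sim, args=(kwargs[sim_index],))
        p.start()                # no join() here, or the jobs would run one at a time
        all_jobs.append(p)
        sim_index += 1
    else:
        time.sleep(0.5)          # sleep instead of busy-waiting, so polling doesn't eat a core

for job in all_jobs:
    job.join()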
You might try using the pypar module.
I am not sure how to set the CPU affinity to a certain core using the affinity package, though.
