Python multiprocessing: restrict number of cores used - python

I want to know how to distribute N independent tasks to exactly M processors on a machine that has L cores, where L>M. I don't want to use all the processors because I still want to have I/O available. The solutions I've tried seem to get distributed to all processors, bogging down the system.
I assume the multiprocessing module is the way to go.
I do numerical simulations. My background is in physics, not computer science, so unfortunately, I often don't fully understand discussions involving standard tasking models like server/client, producer/consumer, etc.
Here are some simplified models that I've tried:
Suppose I have a function run_sim(kwargs) (defined further below) that runs a simulation, a long list of kwargs dictionaries for the simulations, and an 8-core machine.
from multiprocessing import Pool, Process
# using Pool
p = Pool(4)
p.map(run_sim, kwargs)
# using Process
number_of_live_jobs = 0
all_jobs = []
sim_index = 0
while sim_index < len(kwargs):
    number_of_live_jobs = len([1 for job in all_jobs if job.is_alive()])
    if number_of_live_jobs < 4:
        # run_sim takes a single dict, so pass it as a positional argument
        p = Process(target=run_sim, args=(kwargs[sim_index],))
        print("starting job %s" % kwargs[sim_index]["data_file_name"])
        print("number of live jobs: %d" % number_of_live_jobs)
        p.start()
        p.join()
        all_jobs.append(p)
        sim_index += 1
When I look at the processor usage with "top" and then "1", all processors seem to get used anyway in either case. It is not out of the question that I am misinterpreting the output of "top", but if run_sim() is processor intensive, the machine bogs down heavily.
Hypothetical simulation and data:
# simulation kwargs
numbers_of_steps = range(0, 10000000, 1000000)
sigmas = [x for x in range(11)]
kwargs = []
for number_of_steps in numbers_of_steps:
    for sigma in sigmas:
        kwargs.append(
            dict(
                number_of_steps=number_of_steps,
                sigma=sigma,
                # why do I need to cast to int?
                data_file_name="walk_steps=%i_sigma=%i" % (number_of_steps, sigma),
            )
        )
import random, time
random.seed(time.time())
# simulation of random walk
def run_sim(kwargs):
    number_of_steps = kwargs["number_of_steps"]
    sigma = kwargs["sigma"]
    data_file_name = kwargs["data_file_name"]
    data_file = open(data_file_name + ".dat", "w")
    current_position = 0
    print("running simulation %s" % data_file_name)
    for n in range(int(number_of_steps) + 1):
        data_file.write("step number %i position=%f\n" % (n, current_position))
        random_step = random.gauss(0, sigma)
        current_position += random_step
    data_file.close()

If you are on Linux, use taskset when you launch the program.
A child created via fork(2) inherits its parent’s CPU affinity mask. The affinity mask is preserved across an execve(2).
TASKSET(1)                Linux User’s Manual                TASKSET(1)
NAME
    taskset - retrieve or set a process’s CPU affinity
SYNOPSIS
    taskset [options] mask command [arg]...
    taskset [options] -p [mask] pid
DESCRIPTION
    taskset is used to set or retrieve the CPU affinity of a running process given its PID or to launch a new COMMAND with a given CPU affinity. CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs. Note that the Linux scheduler also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications.
    The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU. Not all CPUs may exist on a given system but a mask may specify more CPUs than are present. A retrieved mask will reflect only the bits that correspond to CPUs physically on the system. If an invalid mask is given (i.e., one that corresponds to no valid CPUs on the current system) an error is returned. The masks are typically given in hexadecimal.
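If you would rather set the affinity from inside Python than from the shell, the same idea can be sketched with os.sched_setaffinity, which only exists on some Unix platforms such as Linux. The names restrict_to_cpus and busy are made up for this illustration. Since children inherit the affinity mask, Pool workers started afterwards should stay on the same CPUs.

```python
import os
from multiprocessing import Pool


def restrict_to_cpus(cpus):
    """Pin the current process (pid 0 means "this process") to the given CPUs.

    Returns the resulting affinity set, or None on platforms where
    os.sched_setaffinity is not available.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    wanted = set(cpus) & allowed  # ignore CPUs we are not allowed to use
    if wanted:
        os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)


def busy(n):
    """Hypothetical CPU-bound stand-in for a simulation."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # Restrict the parent first; the workers forked below inherit the mask.
    print(restrict_to_cpus(range(4)))
    with Pool(4) as pool:
        print(pool.map(busy, [10000] * 8))
```

From the shell, something like taskset -c 0-3 python my_sims.py (with my_sims.py standing for your own script) should have a comparable effect without touching the code.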

You might want to look into the following package:
http://pypi.python.org/pypi/affinity
It is a package that uses sched_setaffinity and sched_getaffinity.
The drawback is that it is highly Linux-specific.

On my dual-core machine the total number of processes is honoured, i.e. if I do
p = Pool(1)
Then I only see one CPU in use at any given time. The process is free to migrate to a different processor, but then the other processor is idle. I don't see how all your processors can be in use at the same time, so I don't follow how this can be related to your I/O issues. Of course, if your simulation is I/O bound, then you will see sluggish I/O regardless of core usage...

Probably a dumb observation, pls forgive my inexperience in Python.
But your while loop polling for the finished tasks is not going to sleep and is consuming one core all the time, isn't it?
The other thing to notice is that if your tasks are I/O bound, your M should be adjusted to the number of parallel disks(?) you have ... if they are NFS-mounted on different machines you could potentially have M>L.
g'luck!
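A rough sketch of a polling loop that sleeps between checks, so the parent does not burn a core while waiting. The names run_throttled, work, poll_interval and job_factory are all invented here; job_factory only exists so the same loop can be tried with threads instead of processes.

```python
import time
from multiprocessing import Process


def work(seconds):
    """Stand-in for one simulation run."""
    time.sleep(seconds)


def run_throttled(target, arg_list, max_live, poll_interval=0.05,
                  job_factory=Process):
    """Run target(arg) for every arg, with at most max_live jobs alive.

    Returns the largest number of jobs that were alive at the same time.
    """
    pending = list(arg_list)
    live = []
    peak = 0
    while pending or live:
        # Forget the jobs that have finished since the last check.
        live = [job for job in live if job.is_alive()]
        # Top up to the limit.
        while pending and len(live) < max_live:
            job = job_factory(target=target, args=(pending.pop(0),))
            job.start()
            live.append(job)
        peak = max(peak, len(live))
        # Sleep so the polling loop itself does not occupy a core.
        time.sleep(poll_interval)
    return peak


if __name__ == "__main__":
    print(run_throttled(work, [0.2] * 6, max_live=2))
```

In practice Pool(M) does the same bookkeeping for you; the loop is only worth writing by hand if you need control over each Process.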

You might try using the pypar module.
I am not sure how to use the affinity package to pin a process to a certain core.

Related

Is splitting my program into 8 separate processes the best approach (performance wise) when I have 8 logical cores in python?

Intro
I have rewritten one of my previously sequential algorithms to work in a parallel fashion (we are talking about real parallelism, not concurrency or threads). I run a batch script that runs my "worker" python nodes and each will perform the same task but on a different offset (no data sharing between processes). If it helps visualize imagine a dummy example with an API endpoint which on [GET] request sends me a number and I have to guess if it is even or odd so I run my workers. This example gets the point across as I can't share the algorithm but let's say that the routine for a single process is already optimized to the maximum.
Important: the processes are executed on Windows10 with admin privileges and real_time priority
Diagnostics
Is the optimal number of work node processes equal to the number of logical cores (i.e. 8)? When I use task manager I see my CPU hit 100% limit on all cores but when I look at the processes they each take about 6%? With 6% * 8 = 48% how does this make sense? On idle (without the processes) my CPU sits at about 0-5% total.
I've tried to diagnose it with Performance Monitor but the results were even more confusing:
Reasoning
I didn't know how to configure Performance Monitor to track my processes across separate cores, so I used total CPU time as the Y-axis. How can I have a minimum of 20% usage on each of 8 processes, which would mean 160% utilization?
Question 1
This doesn't make much sense and the numbers are different from what the task manager shows. Worst of all is the bold blue line, which shows total (average) CPU performance across all cores: it doesn't seem to exceed 70%, while the task manager says all my cores run at 100%. What am I confusing here?
Question 2
Is running X processes where X is the number of logical cores on the system under real_time priority the best I can do? (and let the OS handle the scheduling logic)? In the second picture from the bar chart, I can see that it is doing a decent job as ideally, all those bars would be of an equal height which is roughly true.
I have found the answer to this question and have decided to post it rather than delete the question. I used the psutil library to set the affinity of each worker process manually and distribute the workers across cores myself instead of leaving it to the OS. I also had MANY I/O operations, both on the network and from debug prints, which kept my processor cores from reaching 100% (after all, Windows is not a real-time operating system).
In addition to this, since I've tested the code on my laptop, I've encountered thermal throttling which caused disturbances in the %usage calculations.
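The post does not show the actual code, so the following is only a guess at what manual pinning with psutil could look like. round_robin_cores and pin_self are hypothetical helpers. psutil is a third-party package, and its cpu_affinity() method is not offered on every platform, so both are checked before use.

```python
import os


def round_robin_cores(n_workers, cores):
    """Map worker index -> core index, cycling through the given cores."""
    return {w: cores[w % len(cores)] for w in range(n_workers)}


def pin_self(core):
    """Pin the calling process to one core; return True on success."""
    try:
        import psutil  # third-party, may not be installed
    except ImportError:
        return False
    proc = psutil.Process(os.getpid())
    if not hasattr(proc, "cpu_affinity"):
        return False  # platform without affinity support in psutil
    proc.cpu_affinity([core])
    return True


if __name__ == "__main__":
    cores = list(range(os.cpu_count() or 1))
    plan = round_robin_cores(n_workers=8, cores=cores)
    print(plan)
```

Each worker would then call pin_self(plan[worker_index]) once at startup, where worker_index is whatever offset the batch script hands to that worker.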

Multithreading regressions in Python

I have a project in Python that requires regressing many variables against many others. I am using a Jupyter Notebook for clarity but am also willing to use another IDE if it's easier. My code looks something like:
for a in dependent_variables:
    for b in independent_variables:
        regress a on b
My current dataset isn't huge, so this whole thing takes maybe 30 seconds, but I will soon have a much larger dataset that will significantly increase time required. I'm curious if this is a situation suitable for parallelization. Specifically, if I have a dual-threaded eight-core processor (meaning 16 CPUs total), is it possible to run simultaneous processes where each process regresses one of the first variables against one of the second variables, allowing me to complete, say, eight of these regressions at a time (if I allocate half of the CPUs to this process)? I am not super familiar with parallelization and most other answers I've found have discussed the parallelization of a single function call, not the simultaneous execution of multiple similar functions. I appreciate the help!
Nominally, this is
import itertools
import multiprocessing as mp
def regress_me(pair):
    ind_var, dep_var = pair
    # your regression may be better than mine...
    result = "{} {}".format(ind_var, dep_var)
    return result
if __name__ == "__main__":
    # placeholder names; substitute your own variables
    independent_variables = ["x1", "x2"]
    dependent_variables = ["y1", "y2"]
    analyse_this = list(itertools.product(independent_variables,
                                          dependent_variables))
    with mp.Pool(8) as pool:
        # map is a method of the pool object, not of the module
        result = pool.map(regress_me, analyse_this)
A lot depends on what is being passed between parent and child, and on whether you are using a forking system like Linux or a spawning system like Windows. If these datasets are being fetched from disk, it may be better to do the read in the worker regress_me instead of passing the data from the parent. You can read up on that in the documentation of the standard Python multiprocessing library.
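One way to "do the read in the worker" is a Pool initializer that loads the data once per worker process, so only the variable names travel between parent and child. This is a sketch under assumptions: load_columns is a made-up stand-in for reading your dataset from disk, and the slope calculation is plain least squares on toy numbers rather than your regression.

```python
import itertools
import multiprocessing as mp

_DATA = None  # filled in once per worker by init_worker


def load_columns():
    """Hypothetical stand-in for reading the dataset from disk."""
    return {
        "x1": [1.0, 2.0, 3.0, 4.0],
        "x2": [2.0, 1.0, 4.0, 3.0],
        "y1": [2.0, 4.0, 6.0, 8.0],
    }


def init_worker():
    """Runs once in every worker process when the pool starts."""
    global _DATA
    _DATA = load_columns()


def slope(pair):
    """Least-squares slope of the dependent on the independent column."""
    ind_name, dep_name = pair
    x, y = _DATA[ind_name], _DATA[dep_name]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ind_name, dep_name, sxy / sxx


if __name__ == "__main__":
    pairs = list(itertools.product(["x1", "x2"], ["y1"]))
    with mp.Pool(2, initializer=init_worker) as pool:
        for row in pool.map(slope, pairs):
            print(row)
```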

ProcessPoolExecutor overhead? Parallel processing takes more time than a single process for a large matrix operation

My python code contains a numpy dot operation of huge size (over 2^(tens...)) matrix and vector.
To reduce the computing time, I applied parallel processing by dividing the matrix suited for the number of cpu cores.
I used concurrent.futures.ProcessPoolExecutor.
My issue is that the parallel processing takes much more time than single processing.
The following is my code.
single process code.
self._vector = np.dot(matrix, self._vector)
parallel processing code.
each_worker_coverage = self._dimension // self.workers
args = []
for i in range(self.workers):
    start = i * each_worker_coverage
    # the last worker also takes any leftover rows
    stop = start + each_worker_coverage if i < self.workers - 1 else self._dimension
    args.append([i, gate[start:stop], self._vector])
pool = futures.ProcessPoolExecutor(self.workers)
results = list(pool.map(innerproduct, args, chunksize=1))
for i, answer in results:
    start = i * each_worker_coverage
    self._vector[start:start + len(answer)] = answer
The innerproduct function called in parallel is as follows.
def innerproduct(args):
    answer = np.dot(args[1], args[2])
    return args[0], answer
For a 2^14 x 2^14 matrix and a 2^14 vector, the single process code takes only 0.05 seconds, but the parallel processing code takes 6.2 seconds.
I also checked the time with the `innerproduct` method, and it only takes 0.02~0.03 sec.
I don't understand this situation.
Why does the parallel processing (multi-processing not multi-threading) take more time?
To exactly know the cause of the slowdown you would have to measure how long everything takes, and with multiprocessing and multithreading that can be tricky.
So what follows is my best guess. For multiprocessing to work, the parent process has to transfer the data used in the calculations to the worker processes. The time this takes depends on the amount of data. Transferring a 2^14 x 2^14 matrix is probably going to take a significant amount of time.
I suspect that this data transfer is what is taking the extra time.
If you are using an operating system where multiprocessing/concurrent.futures uses the fork start method, there is a way around this data transfer. That is the case on, for example, Linux and *BSD, but not on macOS (where spawn has been the default since Python 3.8) or MS Windows.
On the abovementioned operating systems, multiprocessing uses the fork system call to create its workers. This system call creates a copy of the parent process as the child processes. So if you create the vectors and matrices before creating the ProcessPoolExecutor, the workers will inherit that data. This is not a very costly or time consuming operation because all these OS's use copy-on-write for managing memory pages. As long as the original matrix isn't changed, all programs using it are reading from the same memory pages. This inheriting of the data means you don't have to pass the data explicitly to the worker. You just have to pass a small data structure that says on which index ranges a worker has to operate.
Unfortunately, due to technical limitations of the platform, this doesn't work on macOS and MS Windows. What you could do on those systems is store the original matrix and vector in memory-mapped binary files before you create the Executor. If you tag these mappings with a name, the worker processes should be able to map the same data into their memory without having to transfer them. I think it is possible to instruct numpy to use such a raw binary array without recreating it.
On both platforms you could use the same technique to "send data back" to the parent process; save the data in shared memory and return the filename or tagname to the parent process.
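The "pass only index ranges" idea can be sketched without numpy. The assumptions here: MATRIX and VECTOR are toy module-level data created before the executor, and row_ranges and partial_product are invented names. With fork the workers inherit these globals; with spawn each worker rebuilds them when it re-imports the module. Either way only the small (start, stop) tuples go through the pipe.

```python
from concurrent.futures import ProcessPoolExecutor

N = 8
# Created at module level, i.e. before any worker process exists.
MATRIX = [[(i + j) % 3 for j in range(N)] for i in range(N)]
VECTOR = [1] * N


def row_ranges(n_rows, n_workers):
    """Split range(n_rows) into n_workers contiguous (start, stop) pairs."""
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def partial_product(bounds):
    """Matrix-vector product for rows start..stop-1 of the global MATRIX."""
    start, stop = bounds
    rows = [sum(a * b for a, b in zip(MATRIX[r], VECTOR))
            for r in range(start, stop)]
    return start, rows


if __name__ == "__main__":
    result = [0] * N
    with ProcessPoolExecutor(max_workers=4) as pool:
        for start, rows in pool.map(partial_product, row_ranges(N, 4)):
            result[start:start + len(rows)] = rows
    print(result)
```

The results still travel back through the pipe, but for a matrix-vector product they are a factor of N smaller than the matrix itself.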
If you are using modern versions of NumPy and OS, it's most likely that
self._vector = np.dot(matrix, self._vector)
is already optimized and uses all your CPU cores.
If np.show_config() displays openblas or MKL you may run a simple test:
a = np.random.rand(7000, 7000)
b = np.random.rand(7000, 7000)
np.dot(a, b)
It should use all CPU cores for a couple of seconds.
If it doesn't, you may install OpenBLAS or MKL and reinstall NumPy. See Using MKL to boost Numpy performance on Ubuntu

Python multiprocess pool processes count

I am using a linux server with 128 cores, but I'm not the only one using it so I want to make sure that I use a maximum of 60 cores for my program. The program I'm writing is a series of simulations, where each simulation is parallelized itself. The number of cores of such a single simulation can be chosen, and I typically use 12.
So in theory, I can run 5 of these simulations at the same time, which would result in (5x12) 60 cores used in total. I want to start the simulations from python (since that's where all the preprocessing happens), and my eye has caught the multiprocessing library, especially the Pool class.
Now my question is: should I use
pool = Pool(processes=5)
or
pool = Pool(processes=60)
The idea being: does the processes argument signify the amount of workers used (where each worker is assigned 12 cores), or the total amount of processes available?
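The argument 'processes' in Pool is the total number of worker processes the pool creates in this program; Pool does not reserve cores or know how many cores each task uses. If every task used a single core, you would pass 60 to use all 60 cores. Since each of your simulations is itself parallelized over 12 cores, 5 workers already add up to about 60 cores.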
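Note that Pool only counts worker processes. If each worker starts a simulation that itself uses 12 cores, the pool size would be the core budget divided by the cores per simulation. A minimal sketch under that assumption follows; pool_size and launch_simulation are invented names, and the command being launched is a placeholder for the real simulation program.

```python
import subprocess
import sys
from multiprocessing import Pool

CORE_BUDGET = 60    # cores you allow yourself on the shared server
CORES_PER_SIM = 12  # cores a single simulation uses internally


def pool_size(core_budget, cores_per_sim):
    """Number of simulations that fit into the budget at once (at least 1)."""
    return max(1, core_budget // cores_per_sim)


def launch_simulation(sim_id):
    """Run one external simulation and wait for it to finish."""
    # Placeholder command: replace with the real simulation executable.
    cmd = [sys.executable, "-c", "print('simulation %d done')" % sim_id]
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


if __name__ == "__main__":
    with Pool(processes=pool_size(CORE_BUDGET, CORES_PER_SIM)) as pool:
        for line in pool.map(launch_simulation, range(10)):
            print(line)
```

The workers spend almost all their time waiting on the external program, so they cost next to nothing themselves.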

multiprocessing timing inconsistency

I have about 100 processes. Each process contains 10 inputs(logic expressions) and the task of each process is to find the fastest heuristic algorithm for solving each of the logic inputs(I have about 5 heuristic algorithms).
When I run each process separately, the results are different from when I run all of the processes in parallel (using python p1.py & python p2.py & ...). For example, when I run the processes separately, input 1 (in p1) finds the first heuristic algorithm to be the fastest, but when they run in parallel the same input finds the 5th heuristic algorithm to be faster!
Could the reason be that the CPU will switch between the parallel processes and messes up with the timing so it could not give the right time each heuristic algorithm spends to solve the input?
What is the solution? Can decreasing the number of processes to half reduce the false result? (I run my program on a server)
The operating system has to schedule all your processes on a much smaller number of CPUs. In order to do so, it runs one process on each CPU for a small amount of time. After that, the operating system schedules those processes out to let the other processes run, so that every process gets its fair share of running time. Thus each process has to wait for a running slot on a CPU. Those waiting times depend on the number of other processes waiting to run and are almost unpredictable.
If you use clock time for your measurements, the waiting times will pollute your measurements. For a more precise measurement, you could ask the operating system how much CPU time the process used. The function time.process_time() does that.
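To illustrate the difference, here is a small sketch that records both wall-clock time (time.perf_counter) and CPU time (time.process_time) around a call. The helper names measure and spin are made up. Sleeping stands in for waiting on a CPU slot: it adds to the wall-clock time but not to the CPU time of the process.

```python
import time


def measure(func, *args):
    """Return (result, wall_seconds, cpu_seconds) for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return result, wall_used, cpu_used


def spin(n):
    """CPU-bound stand-in for a heuristic: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    for label, func, arg in [("sleep", time.sleep, 0.2),
                             ("spin", spin, 200000)]:
        _, wall, cpu = measure(func, arg)
        print("%-5s wall=%.3fs cpu=%.3fs" % (label, wall, cpu))
```

process_time() only covers the current process, so work done in child processes would have to be measured inside those children.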
Switching between processes costs time. Multiple processes accessing the same resources (file, hard disk, CPU caches, memory, ...) costs time. For CPU-bound processes, having orders of magnitude more running processes than CPUs will slow down the execution. You'll get better results by starting slightly fewer processes than the number of CPUs. The spare CPUs remain available for work needed by the operating system or some other unrelated programs.
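A small sketch of how the "slightly fewer than the number of CPUs" rule could be computed. usable_cpus and worker_count are invented names, and leaving one CPU spare is just an assumed default, not a recommendation from the answer.

```python
import os


def usable_cpus():
    """CPUs this process may run on (falls back to the machine total)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def worker_count(spare=1, cpus=None):
    """Number of workers that leaves `spare` CPUs free, but at least 1."""
    if cpus is None:
        cpus = usable_cpus()
    return max(1, cpus - spare)


if __name__ == "__main__":
    print(worker_count())
```

The result can be used as the number of scripts to start at once, or passed as the processes argument of multiprocessing.Pool.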
