I am using multiprocessing.Pool.map to run my code in parallel on my workstation, which has 10 physical cores (20 cores if I include the logical ones as well).
To summarize my code: I have to do some calculations with 2080 matrices, so I divide the 2080 matrices into 130 groups, each containing 16 matrices.
The calculation of these 16 matrices is then distributed over 16 cores (should I use only 10, since I have only 10 physical cores?) using multiprocessing.Pool.map.
My questions are:
(1) When I monitor the CPU usage in the 'System Monitor' in Ubuntu, I often find that only 1 CPU shows 100% usage instead of all 16 CPUs showing 100%. All 16 CPUs show 100% usage only for short durations. Why does this happen, and how can I improve it?
(2) Will I be able to improve the calculation time by dividing the 2080 matrices into 104 groups of 20 matrices each and then distributing the calculation of these 20 matrices over 10 or 16 cores?
My code snippet is as below:
import numpy as np
from multiprocessing import Pool

def f(k):
    adj = np.zeros((9045, 9045), dtype='bool')
    # Calculate the elements of the k-th matrix
    return adj

n_CPU = 16
n_networks_window = 16
window = int(2080 / n_networks_window)  # dividing 2080 matrices into 130 segments of 16 matrices each

for i in range(window):
    range_window = range(int(i * 2080 / window), int((i + 1) * 2080 / window))
    p = Pool(processes=n_CPU)
    adj = p.map(f, range_window)
    p.close()
    p.join()
    for k in range_window:
        # Some calculations using adj
        np.savetxt(...)  # saving the output as a txt file
Any help would be really useful, as this is my first time parallelizing Python code.
Thank you.
EDIT:
I tried the following change in the code and it is working fine now:
pool.imap_unordered(f,range(2080),chunksize=260)
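For context, here is a minimal sketch of how that fix can slot into the original structure: a single Pool with all 2080 jobs submitted at once. The per-matrix post-processing and the output filename are placeholders, not the original code.

import numpy as np
from multiprocessing import Pool

def f(k):
    adj = np.zeros((9045, 9045), dtype='bool')
    # calculate the elements of the k-th matrix here
    return k, adj  # return the index too, since imap_unordered yields results in arbitrary order

if __name__ == '__main__':
    with Pool(processes=16) as pool:
        for k, adj in pool.imap_unordered(f, range(2080), chunksize=260):
            # post-process adj for index k and save it here, e.g. np.savetxt(...)
            pass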
Your problem is here:
for i in range(window):
    # [snip]
    p = Pool(processes=n_CPU)
    adj = p.map(f, range_window)
    p.close()
    p.join()
    # [snip]
You're creating a new Pool at every loop iteration and submitting only a few jobs to it. In order for the loop to continue, those few jobs have to complete before more jobs can be executed. In other words, you're not using parallelism at its full potential.
What you should do is create a single Pool, submit all the jobs, and then, outside the loop, close and join it:
p = Pool(processes=n_CPU)
for i in range(window):
    # [snip]
    p.map_async(f, range_window)
    # [snip]
p.close()
p.join()
Note the use of map_async instead of map: this is, again, to avoid waiting for a small portion of jobs to complete before submitting new jobs.
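If the per-group results are still needed (as in the original question), one hedged way to keep them with map_async is to hold on to the AsyncResult handles and fetch them after join(); variable names here follow the question:

p = Pool(processes=n_CPU)
pending = []
for i in range(window):
    range_window = range(int(i * 2080 / window), int((i + 1) * 2080 / window))
    # submit the group without waiting; keep the handle for later
    pending.append((range_window, p.map_async(f, range_window)))
p.close()
p.join()
for range_window, res in pending:
    adj = res.get()  # list of matrices for this group, in submission order
    # post-process adj for the indices in range_window here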
An even better approach is to call map/map_async only once, constructing a single range object and avoiding the for-loop:
with Pool(processes=n_CPU) as p:
    p.map(f, range(2080))  # This will block, but it's okay: we are
                           # submitting all the jobs at once
As for your question about the number of CPUs to use, first of all note that Pool will use all the CPUs available (as returned by os.cpu_count()) by default if you don't specify the processes argument -- give it a try.
It's not clear to me what you mean by having 10 physical cores and 20 logical ones. If you're talking about hyperthreading, then it's fine: use them all. If instead you're saying that you're using a virtual machine with more virtual CPUs than the host CPUs, then using 20 instead of 10 won't make much difference.
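To see what those defaults resolve to on a given machine, here is a quick check; the sched_getaffinity call is Linux-only, which matches the Ubuntu setup in the question:

import os
import multiprocessing as mp

print(os.cpu_count())                 # logical CPUs, e.g. 20 with hyperthreading
print(len(os.sched_getaffinity(0)))   # CPUs this process is actually allowed to use (Linux)
pool = mp.Pool()                      # with no argument, uses os.cpu_count() worker processes
pool.close()
pool.join()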
Related
I am working on a computer with 8 cores. I am trying to parallelize the execution of a piece of code. Shown below is how I set up the pool and the specifications of the worker processes.
The problem I have is that when I run the code, the CPU monitor in Windows shows that none of the CPUs ("the 7 cores") reaches its peak, as shown in the image.
Please let me know how to configure the pool and modify the code so that the execution utilizes the 7 cores.
Dissection:
self.segmentsContainer is composed of 7 or 8 segments, where each segment contains a large number of rows
totalNumOfTasksSegments is, as the name implies, the number of rows each segment carries
run is the parallelized method. It does the following: it receives each segment in self.segmentsContainer, processes it, and returns 28 lists
code:
NUM_OF_TASKS_SEGMENTS = cpu_count() - 1

with Pool(processes=GridCellsPool.NUM_OF_TASKS_SEGMENTS) as pool:
    for res in pool.map(self.run, self.segmentsContainer, chunksize=self.totalNumOfTasksSegments):
        mapResCnt += 1
        self.NDVIsPer10mX10mList.append(res[0])
        self.areasOfCoveragePerList.append(res[1])
        self.interceptionList.append(res[2])
        self.fourCornersInEPSG25832List.append(res[3])
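One thing worth checking in setups like this (a general observation about pool.map, not a diagnosis of this exact code): a chunksize close to or larger than len(iterable) / processes hands all the work to a few workers and leaves the rest idle. A toy sketch that prints which worker PID handled each item:

import os
from multiprocessing import Pool

def which_worker(x):
    return x, os.getpid()

if __name__ == '__main__':
    with Pool(processes=7) as pool:
        # 8 items with chunksize=4 form only 2 chunks: at most 2 of the 7 workers get any work
        print(pool.map(which_worker, range(8), chunksize=4))
        # 8 items with chunksize=1 form 8 chunks that can be spread across the workers
        print(pool.map(which_worker, range(8), chunksize=1))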
Intro
I have rewritten one of my previously sequential algorithms to work in a parallel fashion (we are talking about real parallelism, not concurrency or threads). I run a batch script that starts my "worker" Python nodes, and each performs the same task but at a different offset (no data sharing between processes). If it helps to visualize, imagine a dummy example with an API endpoint which, on a [GET] request, sends me a number and I have to guess whether it is even or odd, so I run my workers. This example gets the point across, as I can't share the algorithm, but let's say that the routine for a single process is already optimized to the maximum.
Important: the processes are executed on Windows 10 with admin privileges and real_time priority.
Diagnostics
Is the optimal number of worker node processes equal to the number of logical cores (i.e. 8)? When I use Task Manager I see my CPU hit the 100% limit on all cores, but when I look at the processes, they each take about 6%. With 6% * 8 = 48%, how does this make sense? On idle (without the processes) my CPU sits at about 0-5% total.
I've tried to diagnose it with Performance Monitor but the results were even more confusing:
Reasoning
I didn't know how to configure Performance Monitor to track my processes across separate cores, so I used total CPU time as the Y-axis. How can I have a minimum of 20% usage on 8 processes, which means 120% utilization?
Question 1
This doesn't make much sense, and the numbers are different from what Task Manager shows. Worst of all is the bolded blue line, which shows total (average) CPU usage across all cores and doesn't seem to exceed 70%, while Task Manager says all my cores run at 100%. What am I confusing here?
Question 2
Is running X processes, where X is the number of logical cores on the system, under real_time priority the best I can do (and let the OS handle the scheduling logic)? In the second picture, from the bar chart I can see that it is doing a decent job, as ideally all those bars would be of equal height, which is roughly true.
I have found the answer to this question and have decided to post it rather than delete the question. I used the psutil library to set the affinity of each worker process manually and distribute the workers myself instead of leaving it to the OS. I had MANY I/O operations on the network and from debug prints, which kept my processor cores from maxing out at 100% (after all, Windows is not a real-time operating system).
In addition to this, since I tested the code on my laptop, I encountered thermal throttling, which caused disturbances in the %usage readings.
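For reference, a minimal sketch of pinning each worker to a core with psutil; the assignment policy and the worker function are illustrative, not the original code:

import os
import psutil
from multiprocessing import Pool

CORES = list(range(psutil.cpu_count(logical=True)))

def init_worker():
    # pin this worker process to a single core, chosen from its PID
    # (a simple illustrative policy; any fixed mapping would do)
    core = CORES[os.getpid() % len(CORES)]
    psutil.Process(os.getpid()).cpu_affinity([core])

def work(offset):
    return offset * offset  # placeholder for the real per-offset routine

if __name__ == '__main__':
    with Pool(processes=len(CORES), initializer=init_worker) as pool:
        results = pool.map(work, range(100))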
I am using multiprocessing in Python 3.7.
Some articles say that a good number of processes to use in a Pool is the number of CPU cores.
My AMD Ryzen CPU has 8 cores and can run 16 threads.
So, should the number of processes be 8 or 16?
import multiprocessing as mp

pool = mp.Pool(processes=16)  # since 16 threads are supported?
Q : "So, should the number of processes be 8 or 16?"
If the herd of sub-processes' distributed workloads is cache-reuse intensive (not memory-I/O bound), the SpaceDOMAIN constraints rule: the size of the cacheable data will play the cardinal role in deciding between 8 and 16.
Why ?
Because the costs of memory-I/O are about a thousand times higher in the TimeDOMAIN, paying about 300-400 [ns] per memory-I/O, compared to 0.1 ~ 0.4 [ns] for in-cache data.
How to Make The Decision ?
Make a small scale test, before deciding on production scale configuration.
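A minimal sketch of such a small-scale test, assuming a stand-in compute kernel; the sizes and iteration counts are arbitrary, and only the relative timings on the actual machine matter:

import time
import numpy as np
from multiprocessing import Pool

def work(_):
    # stand-in kernel: adjust the array size to roughly match the real task's cache footprint
    a = np.random.rand(500, 500)
    return float(np.linalg.norm(a @ a))

if __name__ == '__main__':
    for n_proc in (8, 16):
        start = time.perf_counter()
        with Pool(processes=n_proc) as pool:
            pool.map(work, range(64))
        print(n_proc, 'processes:', round(time.perf_counter() - start, 2), 's')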
If the herd of to-be-distributed workloads depends on network-I/O, or on some other remarkable (locally non-singular) source of latency, the TimeDOMAIN may benefit from a latency-masking trick: running 16, 160 or merely 1600 threads (not processes in this case).
Why ?
Because the costs of over-the-network I/O provide so much waiting time (a few [ms] of network-I/O RTT latency is time enough to do about 1E7 ~ 10,000,000 uops per CPU core, which is quite a lot of work) that smart interleaving pays off. Here even simple latency-masked, thread-based concurrent processing may fit, as threads waiting for the remote "answer" from over-the-network I/O need not fight for the GIL lock: they have nothing to compute until they receive their expected I/O bytes back, have they?
How to Make The Decision ?
Review the code to determine how many over-the-network I/O fetches and how many cache-footprint-sized reads are in the game (in 2020/Q2+, L1 caches grew to about a few [MB]). For those cases where these operations repeat many times, do not hesitate to spin up one thread per "slow" network-I/O target, as the processing will benefit from the coincidentally created masking of the "long" waiting times, at the cost of only cheap ("fast") and (due to the "many" and "long" waiting times) rather sparse thread switching, or even of the O/S-driven process scheduler mapping full sub-processes onto a free CPU core.
If the herd of to-be-distributed workloads is some mix of the above cases, there is no other way than to experiment on the actual hardware with its local / non-local resources.
Why ?
Because there is no rule of thumb to fine-tune the mapping of the workload processing onto the actual CPU-core resources.
Still, one may easily find to have paid way more than one ever gets back: the known trap of achieving a SlowDown instead of a (just wished-for) SpeedUp.
In all cases, the overhead-strict, resources-aware and workload-atomicity-respecting revision of Amdahl's Law identifies a point of diminishing returns, after which adding more workers (CPU cores) will not improve the wished-for Speedup. Many surprises of getting S << 1 are described in Stack Overflow posts, so one may read as many examples of what not to do (learning by anti-patterns) as one wishes.
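For orientation only, the classic form of Amdahl's Law (without the overhead terms the revised form adds) already shows the diminishing returns, assuming p is the parallelizable fraction of the work:

def amdahl_speedup(p, n):
    # classic Amdahl's Law; the revised form referenced above would add
    # per-worker setup and communication overhead terms to the denominator
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 8, 16, 64, 1024):
    print(n, 'workers ->', round(amdahl_speedup(0.95, n), 2), 'x speedup')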
I am using a linux server with 128 cores, but I'm not the only one using it so I want to make sure that I use a maximum of 60 cores for my program. The program I'm writing is a series of simulations, where each simulation is parallelized itself. The number of cores of such a single simulation can be chosen, and I typically use 12.
So in theory, I can run 5 of these simulations at the same time, which would result in (5 x 12 =) 60 cores used in total. I want to start the simulations from Python (since that's where all the preprocessing happens), and my eye has caught the multiprocessing library, especially the Pool class.
Now my question is: should I use
pool = Pool(processes=5)
or
pool = Pool(processes=60)
The idea being: does the processes argument signify the number of workers used (where each worker is assigned 12 cores), or the total number of processes available?
The 'processes' argument in Pool is the total number of worker subprocesses you want this program to create. So if you want to use all 60 cores, it should be 60.
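For the first reading in the question (5 workers, each launching a simulation that parallelizes itself over 12 cores), a hedged sketch could look like the following; the simulate command, its thread flag, and the config names are hypothetical:

from multiprocessing import Pool
import subprocess

def run_simulation(config_path):
    # hypothetical external simulation that uses 12 cores internally
    subprocess.run(['./simulate', '--threads', '12', config_path], check=True)

if __name__ == '__main__':
    configs = ['sim_{}.cfg'.format(i) for i in range(20)]
    # 5 workers x 12 cores per simulation = 60 cores busy at once;
    # if each task were single-core, processes=60 (as suggested above) would apply instead
    with Pool(processes=5) as pool:
        pool.map(run_simulation, configs)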
Context
Spark provides RDDs, for which map functions can be used to lazily set up operations for processing in parallel. RDDs can be created with a specified partitioning parameter that determines how many partitions to create per RDD; preferably this parameter equals the number of systems (e.g. you have 12 files to process, so you create an RDD with 3 partitions, which splits the data into buckets of 4 files each for 3 systems, and all the files get processed concurrently in each system). It is my understanding that these partitions control the portion of data that goes to each system for processing.
Issue
I need to fine-tune and control how many functions run at the same time per system. If 2 or more functions run on the same GPU at the same time, the system will crash.
Question
If an RDD is not split evenly (like in the example below), how many threads run concurrently on the system?
Example
In:
sample_files = ['one.jpg', 'free.jpg', 'two.png', 'zero.png',
                'four.jpg', 'six.png', 'seven.png', 'eight.jpg',
                'nine.png', 'eleven.png', 'ten.png', 'ten.png',
                'one.jpg', 'free.jpg', 'two.png', 'zero.png',
                'four.jpg', 'six.png', 'seven.png', 'eight.jpg',
                'nine.png', 'eleven.png', 'ten.png', 'ten.png',
                'eleven.png', 'ten.png']

CLUSTER_SIZE = 3

example_rdd = sc.parallelize(sample_files, CLUSTER_SIZE)
example_partitions = example_rdd.glom().collect()

# Print elements per partition
for i, l in enumerate(example_partitions): print "partition #{} length: {}".format(i, len(l))

# Print partition distribution
print example_partitions

# How many map functions run concurrently when the action is called on this transformation?
mapped_rdd = example_rdd.map(lambda s: (s, len(s)))
action_results = mapped_rdd.reduceByKey(add)
Out:
partition #0 length: 8
partition #1 length: 8
partition #2 length: 10
[ ['one.jpg', 'free.jpg', 'two.png', 'zero.png', 'four.jpg', 'six.png', 'seven.png', 'eight.jpg'],
['nine.png', 'eleven.png', 'ten.png', 'ten.png', 'one.jpg', 'free.jpg', 'two.png', 'zero.png'],
['four.jpg', 'six.png', 'seven.png', 'eight.jpg', 'nine.png', 'eleven.png', 'ten.png', 'ten.png', 'eleven.png', 'ten.png'] ]
In Conclusion
What I need to know is: if the RDD is split the way it is, what controls how many threads are processed simultaneously? Is it the number of cores, or is there a global parameter that can be set so that only 4 are processed at a time on each partition (system)?
In what order does data get processed from RDDs in Spark?
Unless it is some border case, like only one partition, order is arbitrary or nondeterministic. This will depend on the cluster, on the data and on different runtime events.
The number of partitions only sets an upper limit on the overall parallelism for a given stage; in other words, a partition is the minimal unit of parallelism in Spark. No matter how many resources you allocate, a single stage won't process more partitions at a time than there are partitions. Once again, there can be border cases where a worker is not accessible and a task is rescheduled on another machine.
Another possible limit you can think of is the number of executor threads. Even if you increase the number of partitions, a single executor thread will process only one at a time.
Neither of the above tells you where or when a given partition will be processed. While you can use some dirty, inefficient and non-portable tricks at the configuration level (like a single worker with a single executor thread per machine) to make sure that only one partition is processed on a given machine at a time, it is not particularly useful in general.
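For illustration, a hedged sketch of those configuration-level knobs; the values are arbitrary, and setting spark.executor.cores equal to spark.task.cpus gives the one-task-per-executor behaviour described above, with all the caveats that go with it:

from pyspark import SparkConf, SparkContext

# at most 4 partitions are processed concurrently per executor here,
# since each task claims 1 of the 4 task slots (4 cores / 1 cpu per task)
conf = (SparkConf()
        .setAppName('gpu-batch')
        .set('spark.executor.cores', '4')
        .set('spark.task.cpus', '1'))
sc = SparkContext(conf=conf)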
As a rule of thumb, I would say that Spark code should never be concerned with the time and place it is executed. There are some low-level aspects of the API which provide means to set partition-specific preferences, but as far as I know these don't provide hard guarantees.
That being said, one can think of at least a few ways you can approach this problem:
long-running executor threads with configuration-level guarantees - it could be acceptable if Spark is responsible only for loading and saving data
singleton objects which control queuing jobs on the GPU (see the sketch after this list)
delegating GPU processing to specialized service which ensures proper access
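Regarding the second item, here is a minimal sketch of the queuing idea using an OS-level file lock instead of a true singleton; this is Linux-only, the lock path is arbitrary, and gpu_process is a stand-in for the real GPU routine:

import fcntl

def gpu_process(path):
    return (path, len(path))  # stand-in for the actual GPU call

def process_partition(paths):
    # at most one worker process on this machine holds the lock,
    # and therefore the GPU, at any one time
    with open('/tmp/gpu.lock', 'w') as lock_file:
        for path in paths:
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                yield gpu_process(path)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)

# hypothetical usage:
# results = example_rdd.mapPartitions(process_partition).collect()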
On a side note, you may be interested in Large Scale Distributed Deep Learning on Hadoop Clusters, which roughly describes an architecture that can be applicable here.