I have about 100 processes. Each process contains 10 inputs (logic expressions), and the task of each process is to find the fastest heuristic algorithm for solving each of its logic inputs (I have about 5 heuristic algorithms).
When I run each process separately, the results are different from when I run all of the processes in parallel (using python p1.py & python p2.py & ...). For example, when the processes run separately, input 1 (in p1) finds the first heuristic algorithm to be the fastest method, but in parallel the same input finds the 5th heuristic algorithm faster!
Could the reason be that the CPU switches between the parallel processes and disturbs the timing, so it cannot correctly measure the time each heuristic algorithm spends solving the input?
What is the solution? Would halving the number of processes reduce the false results? (I run my program on a server.)
The operating system has to schedule all your processes on a much smaller number of CPUs. To do so, it runs one process on each CPU for a small amount of time, then schedules it out to let the other processes run, so that every process gets its fair share of running time. Thus each process has to wait for a running slot on a CPU. Those waiting times depend on the number of other processes waiting to run and are almost unpredictable.
If you use wall-clock time for your measurements, the waiting times will pollute them. For a more precise measurement, you can ask the operating system how much CPU time the process actually used. The function time.process_time() does that.
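For illustration, a minimal sketch of the two measurements side by side (solve is a hypothetical stand-in for one heuristic run):

import time

def solve():
    # hypothetical stand-in for one heuristic solving one input
    return sum(i * i for i in range(10**6))

wall_start = time.perf_counter()
cpu_start = time.process_time()
solve()
wall = time.perf_counter() - wall_start   # includes time spent waiting for a CPU slot
cpu = time.process_time() - cpu_start     # only the time this process actually ran
print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")

Run alone, the two numbers are close; run alongside many other CPU-hungry processes, the wall time grows while the CPU time stays roughly stable.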
Switching between processes costs time. Multiple processes accessing the same resources (files, hard disk, CPU caches, memory, ...) costs time. For CPU-bound processes, having orders of magnitude more running processes than CPUs will slow down the execution. You'll get better results by starting slightly fewer processes than the number of CPUs; the spare CPUs remain available for the operating system or other unrelated programs.
I recently started using ray's ActorPool to parallelize my Python code on my local computer (using the code below), and it's definitely working. Specifically, I use it to process a list of arguments and return a list of results. (Note that depending on the inputs, the process function can take different amounts of time.)
However, while testing the script, it seems the processes are somewhat "blocking" each other: if there's one process that takes a long time, the other cores just stay more or less idle. It's definitely not completely blocking, since running this way still saves a lot of time compared to running on one core, but many of the processors stay idle (more than half the cores at <20% usage) even though I'm running the script on all 16 cores. This is especially noticeable when there is one long-running task, in which case only one or two cores are actually active. Also, the total amount of time saved is nowhere near 16x.
pool = ActorPool(actors)
poolmap = pool.map(
    lambda a, v: a.process.remote(v),  # was: arg, which is undefined inside the lambda
    args,
)
result_list = [r for r in tqdm(poolmap, total=len(args))]
I suspect this is because the way I collect the result values (the last line) is not optimal, but I'm not sure how to make it better. Could you help me improve it?
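One variant worth experimenting with (a sketch, not a confirmed fix) is ActorPool.map_unordered, which yields each result as soon as its actor finishes rather than in submission order, so one slow task does not hold back results that complete after it:

poolmap = pool.map_unordered(
    lambda a, v: a.process.remote(v),
    args,
)
# results now arrive in completion order, not submission order
result_list = [r for r in tqdm(poolmap, total=len(args))]

Whether this removes the idle cores depends on where the stall actually comes from, so it needs to be measured against the ordered version.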
Intro
I have rewritten one of my previously sequential algorithms to work in a parallel fashion (we are talking about real parallelism, not concurrency or threads). I run a batch script that starts my "worker" Python nodes, and each performs the same task but on a different offset (no data sharing between processes). If it helps to visualize, imagine a dummy example with an API endpoint that, on a [GET] request, sends me a number, and I have to guess whether it is even or odd, so I run my workers. This example gets the point across, as I can't share the algorithm, but let's say the routine for a single process is already optimized to the maximum.
Important: the processes are executed on Windows 10 with admin privileges and real_time priority.
Diagnostics
Is the optimal number of worker node processes equal to the number of logical cores (i.e. 8)? When I use Task Manager I see my CPU hit the 100% limit on all cores, but when I look at the processes, each takes about 6%. With 6% * 8 = 48%, how does this make sense? On idle (without the processes) my CPU sits at about 0-5% total.
I've tried to diagnose it with Performance Monitor, but the results were even more confusing.
Reasoning
I didn't know how to configure Performance Monitor to track my processes across separate cores, so I used total CPU time as the Y-axis. How can I have a minimum of 20% usage on 8 processes, which means 120% utilization?
Question 1
This doesn't make much sense, and the numbers differ from what Task Manager shows. Worst of all is the bolded blue line, which shows total (average) CPU usage across all cores and doesn't seem to exceed 70%, while Task Manager says all my cores run at 100%. What am I confusing here?
Question 2
Is running X processes, where X is the number of logical cores on the system, under real_time priority the best I can do (letting the OS handle the scheduling logic)? In the second picture, from the bar chart I can see that it is doing a decent job, as ideally all those bars would be of equal height, which is roughly true.
I have found the answer to this question and have decided to post it rather than delete. I used the psutil library to set the affinity of each worker process manually and distribute the workers myself instead of letting the OS do it. I had MANY IO operations on the network and from debug prints, which prevented my processor cores from maxing out at 100% (after all, Windows is no real-time operating system).
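A minimal sketch of that manual distribution, assuming a hypothetical worker function; psutil's cpu_affinity pins the calling process to the given logical cores:

import multiprocessing
import psutil

def worker(core_id):
    psutil.Process().cpu_affinity([core_id])  # pin this process to one logical core
    return sum(i * i for i in range(10**7))   # stand-in for the real workload

if __name__ == "__main__":
    cores = range(psutil.cpu_count(logical=True))
    procs = [multiprocessing.Process(target=worker, args=(c,)) for c in cores]
    for p in procs:
        p.start()
    for p in procs:
        p.join()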
In addition, since I tested the code on my laptop, I encountered thermal throttling, which caused disturbances in the %usage calculations.
Suppose I have a table with 100,000 rows and a Python script that performs some operations on each row of this table sequentially. Now, to speed this up, should I create 10 separate scripts and run them simultaneously, each processing a subsequent chunk of 10,000 rows, or should I create 10 threads to process the rows for better execution speed?
Threading
Due to the Global Interpreter Lock, Python threads are not truly parallel. In other words, only a single thread can be executing Python code at a time.
If you are performing CPU-bound tasks, then dividing the workload among threads will not speed up your computations. If anything, it will slow them down, because there are more threads for the interpreter to switch between.
Threading is much more useful for IO-bound tasks, for example communicating with a number of different clients/servers at the same time. In that case you can switch between threads while waiting for different clients/servers to respond.
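A minimal sketch of that IO-bound case with the standard library's ThreadPoolExecutor (the URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# placeholder URLs; any slow network endpoints make the point
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

def fetch(url):
    # the GIL is released while the thread waits on the socket,
    # so the downloads overlap even though this is "only" threading
    with urlopen(url) as response:
        return response.read()

with ThreadPoolExecutor(max_workers=3) as executor:
    pages = list(executor.map(fetch, urls))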
Multiprocessing
As Eman Hamed has pointed out, it can be difficult to share objects while multiprocessing: each process has its own memory space, so inputs must be pickled and sent to the workers, and results pickled and sent back.
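Still, for CPU-bound per-row work, processes sidestep the GIL entirely. A minimal sketch, assuming a hypothetical process_row function stands in for the real per-row operation:

from multiprocessing import Pool

def process_row(row):
    # hypothetical stand-in for the real per-row operation
    return sum(row)

if __name__ == "__main__":
    rows = [[i, i + 1, i + 2] for i in range(100_000)]
    with Pool(processes=10) as pool:
        # chunksize batches rows per worker to cut down pickling overhead
        results = pool.map(process_row, rows, chunksize=1000)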
Vectorization
Libraries like pandas allow you to use vectorized methods on tables. These are highly optimized operations, written in C, that execute very fast on an entire table or column. Depending on the structure of your table and the operations you want to perform, you should consider taking advantage of this.
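For instance, a sketch of what that replacement can look like (the price and qty columns are made up for illustration):

import pandas as pd

# made-up columns for illustration
df = pd.DataFrame({"price": range(100_000), "qty": range(100_000)})

# slow: one Python-level function call per row
totals_slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# fast: a single vectorized operation over whole columns
totals_fast = df["price"] * df["qty"]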
Threads within a process share a contiguous (virtual) memory block known as the heap; separate processes don't. Threads also consume fewer OS resources than whole processes (separate scripts), and switching between threads of the same process is cheaper than a full process context switch.
The single biggest performance factor in multithreaded execution, when there is no locking or barriers involved, is data access locality, e.g. in matrix multiplication kernels.
Suppose data is stored in the heap in a linear fashion, i.e. the 0th row in bytes [0-4095], the 1st row in bytes [4096-8191], etc. Then thread-0 should operate on rows 0, 10, 20, ..., thread-1 on rows 1, 11, 21, ..., etc.
The main idea is to have a set of 4K pages kept in physical RAM and 64-byte blocks kept in L3 cache and operate on them repeatedly. Computers usually assume that if you 'use' a particular memory location then you're also going to use adjacent ones, and you should do your best to do so in your program. The worst-case scenario is accessing memory locations that are ~10 MiB apart in a random fashion, so don't do that. E.g. if a single row is 1,310,720 doubles (64-bit) in size, then your threads should operate within a single row (intra-row) rather than across rows (inter-row, as above).
Benchmark your code. Depending on your results, if your algorithm can process around 21.3 GiB/s of rows (the bandwidth of DDR4-2666), then you have a memory-bound task. If your code processes something like 1 GiB/s, then you have a compute-bound task, meaning executing instructions on the data takes more time than fetching it from RAM; you then need to either optimize your code, reach higher IPC by utilizing AVX instruction sets, or buy a newer processor with more cores or a higher frequency.
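To see the locality effect directly, here is a small sketch that times the same reduction over contiguous versus strided memory with NumPy; the exact numbers will vary by machine:

import time
import numpy as np

c_order = np.ones((4096, 4096))        # rows are contiguous in memory (C order)
f_order = np.asfortranarray(c_order)   # same values, but columns are contiguous

start = time.perf_counter()
c_order.sum(axis=1)                    # walks memory sequentially, cache-friendly
print("contiguous rows:", time.perf_counter() - start)

start = time.perf_counter()
f_order.sum(axis=1)                    # strided access, poor cache locality
print("strided rows:", time.perf_counter() - start)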
This is probably a stupid question, but if I have a simple function that I want to run, say, 100 times, and I have 12 processors available, is it better to use 10 processors to run the multiprocessing code, or 12?
Basically, by using 12 cores will I be saving one iteration's worth of time? Or will it run 10 iterations at a time, then 2, then 10 again, and so on?
It's almost always better to use the number of processors available. However, some algorithms need processes to communicate partial results to achieve an end result (many image processing algorithms have this constraint). Those algorithms have a limit on the number of processes that should run in parallel, because beyond this limit the cost of communication impairs performance.
That said, it depends on a lot of things. Many algorithms are easily parallelizable, yet the cost of parallelism impairs their acceleration. Basically, for parallelism to be worth anything, the actual work to be done must be an order of magnitude greater than the cost of the parallelism itself.
In typical multi-threaded languages, you can easily reduce the cost of parallelism by re-using the same threads (thread pooling). However, Python being Python, you must use multiprocessing to achieve true parallelism, which has a significant cost. There is, however, a process pool if you wish to re-use processes.
You need to check how much time it takes to run your algorithm sequentially, how much time one iteration takes, and how many iterations you will have. Only then will you know whether parallelization is worth it. If it is, then run tests with the number of processes going from 1 to 100; this will let you find the sweet spot for your algorithm.
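A minimal sketch of such a sweep, assuming a hypothetical work function stands in for your iteration:

import time
from multiprocessing import Pool

def work(n):
    # hypothetical stand-in for one iteration of your function
    return sum(i * i for i in range(10**5))

if __name__ == "__main__":
    tasks = range(100)
    for n_procs in (1, 2, 4, 8, 10, 12):
        start = time.perf_counter()
        with Pool(processes=n_procs) as pool:
            pool.map(work, tasks)
        print(f"{n_procs:2d} processes: {time.perf_counter() - start:.2f}s")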
I am using a Linux server with 128 cores, but I'm not the only one using it, so I want to make sure that I use a maximum of 60 cores for my program. The program I'm writing is a series of simulations, where each simulation is parallelized itself. The number of cores for such a single simulation can be chosen, and I typically use 12.
So in theory, I can run 5 of these simulations at the same time, which would result in (5x12) 60 cores used in total. I want to start the simulations from Python (since that's where all the preprocessing happens), and my eye has caught the multiprocessing library, especially the Pool class.
Now my question is: should I use
pool = Pool(processes=5)
or
pool = Pool(processes=60)
The idea being: does the processes argument signify the number of workers used (where each worker is assigned 12 cores), or the total number of processes available?
The 'processes' argument in Pool is the total number of worker subprocesses the pool will create; the pool itself knows nothing about how many cores each task uses. So if each task were single-core and you wanted to use all 60 cores, you would pass 60. But since each of your simulations already parallelizes itself across 12 cores, 5 pool workers are enough to keep 60 cores busy.
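Under that reading, a minimal sketch of the 5-worker setup (run_simulation is a hypothetical wrapper that starts one simulation which itself uses 12 cores):

from multiprocessing import Pool

def run_simulation(params):
    # hypothetical wrapper: starts one simulation that internally
    # parallelizes itself across 12 cores
    ...

if __name__ == "__main__":
    simulations = [{"run": i, "cores": 12} for i in range(20)]  # made-up parameters
    # 5 workers x 12 cores per simulation = at most 60 cores busy
    with Pool(processes=5) as pool:
        pool.map(run_simulation, simulations)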