At my workplace there is a shared, powerful 24-core server on which we run our jobs. To utilize the full power of the multi-core CPU, I wrote a multi-threaded version of a long-running program, such that 24 threads run simultaneously, one on each core (via the threading library in Jython).
The program runs speedily if there are no other jobs running. However, I was running a big job on one core at the same time, and as a result the thread running on that particular core took a long time, slowing down the entire program (as the threads needed to join their data at the end). Meanwhile, the threads on the other CPUs had long finished execution, so I basically had 23 cores idle and 1 core running both the thread and the heavy job, or at least that is my diagnosis. This was further confirmed by the output of the time command: sys time was very low compared to user time (which suggests there was a lot of waiting).
Does the operating system (Linux in this case) not switch jobs to different CPUs if one CPU is loaded while others are idle? If not, can I do that myself in my program (in Jython)? It should not be difficult to query the different CPU loads once in a while and then switch to one that is relatively free.
Thanks.
Source: http://www.ibm.com/developerworks/linux/library/l-scheduler/
To maintain a balanced workload across CPUs, work can be
redistributed, taking work from an overloaded CPU and giving it to an
underloaded one. The Linux 2.6 scheduler provides this functionality
by using load balancing. Every 200ms, a processor checks to see
whether the CPU loads are unbalanced; if they are, the processor
performs a cross-CPU balancing of tasks.
A negative aspect of this process is that the new CPU's cache is cold
for a migrated task (needing to pull its data into the cache).
Looks like Linux has been balancing threads across cores for a while now.
However, even assuming Linux load balanced instantly (which it doesn't), your problem still reduces to one where you have 23 cores and 24 tasks. In the worst case, where all tasks take equally long, this takes twice as long as having only 23 tasks, because the last task still has to wait for another task to run to completion before a core frees up.
If the wall-clock time of the program suffers by a slowdown of around 2x, this is probably the issue.
If it is drastically worse than 2x, then you may be on an older version of the Linux scheduler.
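As a back-of-the-envelope check, here is a minimal sketch of that arithmetic, assuming perfectly equal task lengths and instant balancing:
import math

# Idealized makespan for N equal-length tasks on C cores:
# the last "round" of tasks has to wait for a free core.
def rounds(tasks, cores):
    return math.ceil(tasks / cores)

print(rounds(24, 23))  # 2 -> roughly the 2x slowdown described above
print(rounds(23, 23))  # 1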
Related
I have a Python program consisting of 5 processes besides the main process. Now I'm looking to get an AWS server or something similar on which I can run the script. But how can I find out how many vCPU cores the script uses, and how many are needed? I have looked at:
import multiprocessing
multiprocessing.cpu_count()
But it seems that it just returns the CPU count of the system. I need to know how many vCPU cores the script actually uses.
Thanks for your time.
EDIT:
Just for some more information. The Processes are running indefinitely.
On Linux you can use the top command at the command line to monitor the real-time activity of all threads of a process ID:
top -H -p <process id>
The answer to this post probably lies in the following question:
Multiprocessing : More processes than cpu.count
In short, you probably have hundreds of processes running, but that doesn't mean you will use hundreds of cores. It all depends on utilization and the workload of the processes.
You can also get some additional info from the psutil module:
import psutil
print(psutil.cpu_percent())
print(psutil.cpu_stats())
print(psutil.cpu_freq())
or use os together with psutil to get the current CPU usage in Python:
import os
import psutil

# 1-, 5- and 15-minute system load averages
l1, l2, l3 = psutil.getloadavg()
# Normalize the 15-minute average by the core count to get a percentage
CPU_use = (l3 / os.cpu_count()) * 100
print(CPU_use)
Credit: DelftStack
Edit
There might be some information for you in the following Medium article. Maybe there are some tools for measuring CPU usage too.
https://medium.com/survata-engineering-blog/monitoring-memory-usage-of-a-running-python-program-49f027e3d1ba
Edit 2
A good guideline for how many processes to start depends on the number of hardware threads available: basically Thread_Count + 1, which ensures your processor doesn't just sit around and wait. This works best when you are I/O bound, think of waiting for files from disk: once a process blocks on I/O, the others can take over. With 8 threads, for example, the one extra process is redundancy, so that if all 8 are blocked, the spare can take over right away. You can, however, increase or decrease this number if you see fit.
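A minimal sketch of that guideline (the file names and the worker task are placeholders, not from the original post):
import multiprocessing

# Rule-of-thumb pool size from the guideline above: one process per
# hardware thread, plus one spare so the CPU doesn't sit around and
# wait while a process blocks on I/O.
POOL_SIZE = multiprocessing.cpu_count() + 1

def work(path):
    # Placeholder I/O-bound task (assumption): read a file from disk.
    with open(path) as f:
        return len(f.read())

if __name__ == "__main__":
    files = ["a.txt", "b.txt"]  # hypothetical input files
    with multiprocessing.Pool(POOL_SIZE) as pool:
        print(pool.map(work, files))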
Your question uses some general terms and leaves much unspecified, so the answers must be general.
It is assumed you are managing the processes using either Process directly or ProcessPoolExecutor.
In some cases a vCPU is a logical processor, but per the following link there are services offering configurations with fractional vCPUs, such as those in shared environments...
What is vCPU in AWS
You mention/ask...
... Now I'm looking to get an AWS server or something similar on which I can run the script. ...
... But how can I find out how many vCPU cores are used by the script/how many are needed? ...
You state AWS or something like it. The answer depends on what your subprocesses do and how much of a vCPU, or fractional vCPU, each subprocess needs. Generally, a vCPU is analogous to a logical processor upon which a thread can execute. A fractional portion of a vCPU means some limited usage of a vCPU rather than full usage.
What one or more vCPUs (or fractions thereof) mean for your subprocesses really depends on what those subprocesses do. If one subprocess sits waiting on I/O most of the time, you hardly need a dedicated vCPU for it.
I recommend starting with a minimal, least-expensive configuration and seeing how it works with your app's expected workload. If you are not happy, increase the configuration as needed.
If it helps...
I usually use subprocesses if I need simultaneous execution that avoids Python's GIL limitations. I generally use a single active thread per subprocess; any other threads in the same subprocess are usually waiting on I/O and do not otherwise compete with the primary active thread. Of course, a subprocess could be dedicated to I/O if you want to separate such work from the threads you place in other subprocesses.
Since we do not know your app's purpose, architecture and many other factors, it's hard to say more than the generalities above.
Your computer has hundreds if not thousands of processes running at any given point. How does it handle all of those if it only has 5 cores? The thing is, each core runs a process for a certain amount of time, or until the process has nothing left to do.
For example, if I create a script that calculates the square root of all numbers from 1 to, say, a billion, you will see a single core hit max usage, then a split second later another core hits max while the first drops back to normal, and so on until the calculation is done.
Or, if a process waits for an I/O operation, the core has nothing to do, so it drops the process and moves on to another one; when the I/O operation is done, the core can pick the process back up and get back to work.
You can run your multiprocessing Python code on a single core or on 100 cores; you can't really do much about it. However, on Windows you can set the affinity of a process, which gives the process access to certain cores only. So, when the processes start, you can go to each one and set its affinity to, say, core 1, or set each one to a separate core. I'm not sure how you do that on Linux though.
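For what it's worth, on Linux the standard library can do this: os.sched_setaffinity pins a process to a set of cores. A minimal sketch (pid 0 means the calling process):
import os

# Pin the calling process (pid 0) to cores 0 and 1 (Linux only).
os.sched_setaffinity(0, {0, 1})
print(os.sched_getaffinity(0))  # -> {0, 1}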
In conclusion, if you want a short and direct answer: as many cores as the script has access to. Whether you give the processes one core or 200 cores, they will still work. However, performance may degrade if the processes are CPU intensive, so I recommend starting with one core on AWS, checking performance, and upgrading if needed.
I'll try to give my own summary of "I just need to know how many vCPU cores the script uses".
There is no way to answer that properly other than running your app and monitoring its resource usage. Assuming your Python processes do not spawn subprocesses (which could themselves be multithreaded applications), all we can say is that your app won't utilize more than 6 cores (the total number of processes, main one included). There are a ton of ways for a program to under-utilize CPU cores, like waiting for I/O (disk or network) or inter-process synchronization (shared resources). So to get any kind of understanding of CPU utilization, you really need to measure actual performance (e.g., with the htop utility on Linux or macOS) and investigate the causes of any underperformance.
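If psutil is available, a minimal sketch of such a measurement (sampling this process and the per-core load):
import psutil

p = psutil.Process()  # defaults to the current process
# Percent of one core used by this process over a 1-second window;
# a value near 100 * N means roughly N cores are kept busy.
print(p.cpu_percent(interval=1.0))
print(psutil.cpu_percent(percpu=True))  # system-wide load, per core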
Hope it helps.
time.clock() measures CPU time. Let's say there are multiple processes running on the system; in that case it's possible our Python process might be swapped out at various times during execution by the scheduler. So there will be periods when our Python process is in a waiting state.
In that case, can we still measure CPU ticks just for our process, ignoring everything else that's scheduled to run on the system?
Basically, the results should not change depending on what else is running on the CPU; the numbers should be the same as if there were a single core and our process were the only process bound to that core. I want to profile from within the code.
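For what it's worth, in Python 3 time.process_time() does roughly this: it counts only the CPU time (user plus system) of the current process, excluding time spent swapped out or sleeping. A minimal sketch:
import time

start = time.process_time()
total = sum(i * i for i in range(10**6))  # placeholder CPU work
elapsed = time.process_time() - start
# elapsed counts only this process's CPU time, so it should be
# largely unaffected by whatever else is running on the system.
print(f"CPU seconds used: {elapsed:.4f}")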
So this is more or less a theoretical question. I have a single-core machine which is supposedly powerful, but nevertheless only one core. Now I have two choices to make:
Multithreading: As far as I know, I cannot make use of multiple cores in my machine even if I had them, because of the GIL. Hence, in this situation it does not make any difference.
Multiprocessing: This is where I have a doubt. Can I do multiprocessing on a single-core machine? Or do I have to check the number of available cores every time and run exactly that many processes, or fewer?
Can someone please explain the relation between multiprocessing and the cores in a machine?
I know this is a theoretical question, but my concepts are not very clear on this.
This is a big topic but here are some pointers.
Think of threads as processes that share the same address space and can access the same memory. Communication is done by shared variables. Multiple threads can run within the same process.
Processes (in this context, and roughly speaking) have their own private data and if two processes want to communicate that communication has to be done more explicitly.
When you are writing a program where the bottleneck is CPU cycles, neither threads nor processes will give you a speedup on a single-core machine.
Processes and threads are still useful for multitasking (rapid switching between (sub)programs) - this is what your operating system does because it runs far more processes than you have cores.
Processes and threads (or even coroutines!) can give you considerable speedup even on a single core machine if the tasks you are executing are I/O bound - think of fetching data from a network. For example, instead of actively waiting for data to be sent or to arrive, another process or thread can initiate the next network operation.
Threads are preferable to processes when you don't need explicit encapsulation, due to their lower overhead. For most CPU-bound concurrent problems, and especially the large subset of "embarrassingly parallel" ones, it does not make much sense to spawn more processes than you have processors.
The Python GIL prevents two threads in the same process from running in parallel, i.e. from multiple cores executing instructions literally at the same time.
Therefore threads in Python are relatively useless for speeding up CPU-bound tasks, but they can still be very useful for I/O-bound tasks, because blocking operations (e.g. waiting for network data) release the GIL so that another thread can run while the first one waits.
If you have multiple processors, you can have true parallelism by spawning multiple processes despite the GIL. This is only worth it for CPU-bound tasks, and often you have to weigh the overhead of spawning processes and the cost of communication between them.
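A minimal sketch of that trade-off (the worker function is a placeholder): each process has its own interpreter and GIL, so CPU-bound work can run truly in parallel.
from multiprocessing import Pool

def cpu_bound(n):
    # Placeholder number-crunching task: sum of squares up to n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool() as pool:  # defaults to os.cpu_count() workers
        # Each task runs in its own process, so one process's GIL
        # does not block the others.
        print(pool.map(cpu_bound, [10**6] * 4))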
You CAN use both multithreading and multiprocessing on single-core systems.
The GIL limits the usefulness of multithreading in pure Python for computation-bound tasks, no matter your underlying architecture. For I/O-bound tasks, threads work perfectly fine. Had they not had any use, they probably would not have been implemented in the first place.
For pure Python software, multiprocessing is always a safer choice when it comes to parallel computing. Of course, multiple processes are more expensive than multiple threads (processes do not share memory, contrary to threads; processes also come with slightly higher overhead).
For single-processor machines, however, multiprocessing (and multithreading) buys you little to no extra speed for computationally heavy tasks, and may actually even slow you down a bit. But if the OS supports them (which is pretty common for desktops, workstations and clusters, but perhaps not for embedded systems), they allow you to effectively run multiple I/O-bound programs simultaneously.
Long story short, it depends a bit on what you are doing...
The multiprocessing module basically spawns multiple instances of the Python interpreter, so there is no worry about the GIL.
multiprocessing uses much the same API as the threading module, if you have used that previously.
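A minimal illustration of that API symmetry; essentially only the import changes:
from threading import Thread
from multiprocessing import Process

def task(label):
    print(f"hello from the {label}")

if __name__ == "__main__":
    t = Thread(target=task, args=("thread",))
    p = Process(target=task, args=("process",))
    t.start(); p.start()
    t.join(); p.join()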
You seem to be confusing multiprocessing, threading (which you refer to as multithreading) and X-core processors.
No matter what, when you start Python (CPython implementation) it will only use one core of your processor.
Threading distributes the load between the different components of the script. Suppose you have to interact with an external API: your script has to wait for the communication to finish before it proceeds. If you are making multiple similar calls, they will take linear time. Whereas if you use threading, you can make those calls in parallel.
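A sketch of that pattern with a thread pool (the URL is a placeholder):
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["https://example.com"] * 5  # placeholder endpoints

def fetch(url):
    # While a thread blocks on the network, it releases the GIL,
    # so the other threads can issue their own requests.
    with urlopen(url) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(fetch, URLS)))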
See also: PyPy implementation of Python
I have an AWS instance that has 32 CPUs:
ubuntu#ip-122-00-18-114:~$ cat /proc/cpuinfo | grep processor | wc -l
32
My question is: how can I make use of Python's multiprocessing so that the commands are spread across every CPU?
For example, with the following code, will the commands run on every single CPU available?
import multiprocessing
import os

POOL_SIZE = 32

cmdlist = []
for param in items:  # items is defined elsewhere in the original code
    cmd = """./cool_command %s""" % (param)
    cmdlist.append(cmd)

p = multiprocessing.Pool(POOL_SIZE)
p.map(os.system, cmdlist)
If not, what's the right way to do it?
And what happens if I set POOL_SIZE > # Processors (CPUs)?
First, a little correction on your wording. A CPU has multiple cores, and each core can have hyperthreads. Each hyperthread is the logical unit on which a process runs. On Amazon you have 32 vCPUs, which correspond to hyperthreads, not CPUs or cores. This is not important for this question, but just in case you do any further research, it is important to have the wording right. Below, I'll refer to this "lowest logical processing unit", the hyperthread, as a vCPU.
If you do not specify the pool size:
p = multiprocessing.Pool()
p.map(os.system, cmdlist)
then Python will figure out the number of available logical processors (in your case 32 vCPUs) itself (via os.cpu_count()).
In normal circumstances, all 32 processes run on separate vCPUs because Linux tries to balance the load evenly between them.
If, however, there are other heavy processes running at the same time, then two processes might run on the same vCPU.
The key thing to understand here is how the Linux scheduler works: it periodically reschedules processes so that all processing units are utilized about equally. That means that if you start only 16 processes, they will spread out across all 32 vCPUs and utilize them about equally (use htop to see how the load spreads).
And what happens if I set POOL_SIZE > # Processors (CPUs)?
If you start more processes than there are available vCPUs, then some processes need to share a vCPU, which means a process is periodically switched out by the scheduler in a context switch. If your processes are CPU bound (utilizing 100% CPU, e.g. when you do number crunching), then having more processes than vCPUs will slow down the overall job: you pay for the context switches, and also for any communication between the processes (not in your example, but something you'd normally do when doing multiprocessing).
However, if your processes are not CPU bound but, for example, disk bound (they need to wait for the disk to read/write) or network bound (e.g. they wait for another server to answer), then they are switched out by the scheduler to make room for another process, since they need to wait anyway.
The short answer is "not exactly". You can get the CPU count with the os.cpu_count() function and run that number of processes. But only the operating system assigns processes to CPUs, and more than that, it might switch a process to another CPU after some time. I won't explain how preemptive multitasking works here.
If you have other "heavy" processes running on this server, for example a database or even a web server, they might need some CPU time for their execution as well.
The good news is that there is a thing called processor affinity that could be of use for your needs. But it is a kind of fine-tuning of the OS.
When using a Python multiprocessing Pool, should the number of worker processes be the same as the number of CPUs or cores?
This article http://www.howtogeek.com/194756/cpu-basics-multiple-cpus-cores-and-hyper-threading-explained/ says each core is really a central processing unit on a CPU chip. Thus it seems there should not be a problem with having 1 process per core.
E.g., if I have a single CPU chip with 4 cores, can 1 process per core, for a total of 4 processes, be run without the possibility of slowing performance?
From what I've learned regarding Python and multiprocessing, the best course of action is...
One process per core, but skip logical ones.
Hyperthreading is no help for Python. It'll actually hurt performance in many cases, but test it yourself first, of course.
Use the affinity (pip install affinity) module to stick each process to a specific core.
At least in extensive tests on Windows using 32-bit Python, not doing this hurt performance significantly due to constant thrashing of the cache. And again: skip the logical cores! The logical ones, assuming you have an Intel CPU with hyperthreading, are 1, 3, 5, 7, etc.
More threads than real cores will gain you nothing, unless there's also I/O happening, which there shouldn't be if you're crunching numbers. Test my claim yourself, especially if you use Linux, as I didn't get to test on Linux at all.
It really depends on your workload. Case by case, the best approach is to run some benchmark tests and see the result.
Scheduling processes is an expensive operation: the more running processes, the more context switching you need.
If most of your processes are not running (because they are waiting for I/O, for example), then overcommitting might prove beneficial. On the opposite end, if your processes are running most of the time, adding more of them to contend for your CPU is going to be detrimental.
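A minimal benchmark sketch along those lines (the workload is a placeholder; substitute your own):
import time
from multiprocessing import Pool

def work(n):
    return sum(i * i for i in range(n))  # placeholder CPU-bound task

if __name__ == "__main__":
    for size in (2, 4, 8, 16):
        start = time.perf_counter()
        with Pool(size) as pool:
            pool.map(work, [10**6] * 32)
        print(f"pool size {size}: {time.perf_counter() - start:.2f}s")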