Why are threads spread between CPUs?

Why are threads spread between CPUs? - python

I am trying to get my head around threading vs. CPU usage. There are plenty of discussions about threading vs. multiprocessing (a good overview being this answer) so I decided to test this out by launching a maximum number of threads on my 8 CPU laptop running Windows 10, Python 3.4.
My assumption was that all the threads would be bound to a single CPU.
EDIT: it turns out that it was not a good assumption. I now understand that for multithreaded code, only one piece of python code can run at once (no matter where/on which core). This is different for multiprocessing code (where processes are independent and run indeed independently).
While I read about these differences, it is one answer which actually clarified this point.
I think it also explains the CPU view below: that it is an average view of many threads spread out on many CPUs, but only one of them running at one given time (which "averages" to all of them running all the time).
It is not a duplicate of the linked question (which addresses the opposite problem, i.e. all threads on one core) and I will leave it hanging in case someone has a similar question one day and is hopefully helped by my enlightenment.
The code
import threading
import time
def calc():
time.sleep(5)
while True:
a = 2356^36
n = 0
while True:
try:
n += 1
t = threading.Thread(target=calc)
t.start()
except RuntimeError:
print("max threads: {n}".format(n=n))
break
else:
print('.')
time.sleep(100000)
Led to 889 threads being started.
The load on the CPUs was however distributed (and surprisingly low for a pure CPU calculation, the laptop is otherwise idle with an empty load when not running my script):
Why is it so? Are the threads constantly moved as a pack between CPUs and what I see is just an average (the reality being that at a given moment all threads are on one CPU)? Or are they indeed distributed?

As of today it is still the case that 'one thread holds the GIL'. So one thread is running at a time.
The threads are managed on the operating system level. What happens is that every 100 'ticks' (=interpreter instruction) the running thread releases the GIL and resets the tick counter.
Because the threads in this example do continuous calculations, the tick limit of 100 instructions is reached very fast, leading to an almost immediate release of the GIL and a 'battle' between threads starts to acquire the GIL.
So, my assumption is that your operating system has a higher than expected load , because of (too) fast thread switching + almost continuous releasing and acquiring the GIL. The OS spends more time on switching than actually doing any useful calculation.
As you mention yourself, for using more than one core at a time, it's better to look at multiprocessing modules (joblib/Parallel).
Interesting read:
http://www.dabeaz.com/python/UnderstandingGIL.pdf

Um. The point of multithreading is to make sure they work gets spread out. A really easy cheat is to use as many threads as you have CPU cores. The point is they are all independent so they can actually run at the same time. If they were on the same core only one thread at a time could really run at all. They'd pass that core back and forth for processing at the OS level.
Your assumption is wrong and bizarre. What would ever lead you to think they should run on the same CPU and consequently go at 1/8th speed? As the only reason to thread them is typically to get the whole batch to go faster than a single core alone.
In fact, what the hell do you think writing parallel code is for if not to run independently on several cores at the same time? Like this would be pointless and hard to do, let's make complex fetching, branching, and forking routines to accomplish things slower than one core just plugging away at the data?

Related

Python: How many cores are used by my python program with five processes?

I have a python program consisting of 5 processes outside of the main process. Now I'm looking to get an AWS server or something similar on which I can run the script. But how can I find out how many vCPU cores are used by the script/how many are needed? I have looked at:
import multiprocessing
multiprocessing.cpu_count()
But it seems that it just returns the CPU count that's on the system. I just need to know how many vCPU cores the script uses.
Thanks for your time.
EDIT:
Just for some more information. The Processes are running indefinitely.

On Linux you can use the "top" command at the command line to monitor the real-time activity of all threads of a process id:
top -H -p <process id>

Answer to this post probably lies in the following question:
Multiprocessing : More processes than cpu.count
In short, you have probably hundreds of processes running, but that doesn't mean you will use hundreds of cores. It all depends on utilization, and the workload of the processes.
You can also get some additional info from the psutil module
import psutil
print(psutil.cpu_percent())
print(psutil.cpu_stats())
print(psutil.cpu_freq())
or using OS to receive current cpu usage in python:
import os
import psutil
l1, l2, l3 = psutil.getloadavg()
CPU_use = (l3/os.cpu_count()) * 100
print(CPU_use)
Credit: DelftStack
Edit
There might be some information for you in the following medium article. Maybe there are some tools for CPU usage too.
https://medium.com/survata-engineering-blog/monitoring-memory-usage-of-a-running-python-program-49f027e3d1ba
Edit 2
A good guideline for how many processes to start depends on the amount of threads available. It's basically just Thread_Count + 1, this ensures your processor doesn't just 'sit around and wait', this however is best used when you are IO bound, think of waiting for files from disk. Once it waits, that process is locked, thus you have 8 others to take over. The one extra is redundancy, in case all 8 are locked, the one that's left can take over right away. You can however in- or decrease this if you see fit.

Your question uses some general terms and leaves much unspecified so answers must be general.
It is assumed you are managing the processes using either Process directly or ProcessPoolExecutor.
In some cases, vCPU is a logical processor but per the following link there are services offering configurations of fractional vCPUs such as those in shared environments...
What is vCPU in AWS
You mention/ask...
... Now I'm looking to get an AWS server or something similar on which I can run the script. ...
... But how can I find out how many vCPU cores are used by the script/how many are needed? ...
You state AWS or something like it. The answer would depend on what your subprocess do, and how much of a vCPU or factional vCPU each subprocess needs. Generally, a vCPU is analogous to a logical processor upon which a thread can execute. A fractional portion of a vCPU will be some limited usage (than some otherwise "full" or complete "usage") of a vCPU.
The meaning of one or more vCPUs (or fractional vCPUs thereto) to your subprocesses really depends on those subprocesses, what they do. If one subprocess is sitting waiting on I/O most of the time, you hardly need a dedicated vCPU for it.
I recommend starting with some minimal least expensive configuration and see how it works with your app's expected workload. If you are not happy, increase the configuration as needed.
If it helps...
I usually use subprocesses if I need simultaneous execution that avoids Python's GIL limitations by breaking things into subprocesses. I generally use a single active thread per subprocess, where any other threads in the same subprocess are usually at a wait, waiting for I/O or do not otherwise compete with the primary active thread of the subprocess. Of course, a subprocess could be dedicated to I/O if you want to separate such from other threads you place in other subprocesses.
Since we do not know your app's purpose, architecture and many other factors, it's hard to say more than the generalities above.

Your computer has hundreds if not thousands of processes running at any given point. How does it handle all of those if it only has 5 cores? The thing is, each core takes a process for a certain amount of time or until it has nothing left to do inside that process.
For example, if I create a script that calculates the square root of all numbers from 1 to say a billion, you will see that a single core will hit max usage, then a split second later another core hits max while the first drops to normal and so on until the calculation is done.
Or if the process waits for an I/O process, then the core has nothing to do, so it drops the process, and goes to another process, when the I/O operation is done, the core can pick the process back, and get back to work.
You can run your multiprocessing python code on a single core, or on 100 cores, you can't really do much about it. However, on windows, you can set affinity of a process, which gives the process access to certain cores only. So, when the processes start, you can go to each one and set the affinity to say core 1 or each one to a separate core. Not sure how you do that on Linux though.
In conclusion, if you want a short and direct answer, I think we can say as many cores as it has access to. If you give them one core or 200 cores, they will still work. However, performance may degrade if the processes are CPU intensive, so I recommend starting with one core on AWS, check performance, and upgrade if needed.

I'll try to do my own summary about "I just need to know how many vCPU cores the script uses".
There is no way to answer that properly other than running your app and monitoring its resource usage. Assuming your Python processes do not spawn subprocesses (which could even be multithreaded applications), all we can say is that your app won't utilize more than 6 cores (as per total number of processes). There's a ton of ways for program to under-utilize CPU cores, like waiting for I/O (disk or network) or interprocess synchronization (shared resources). So to get any kind of understanding of CPU utilization, you really need to measure the actual performance (e.g., with htop utility on Linux or macOS) and investigating the causes of underperforming (if any).
Hope it helps.

Multiprocessing multithreading GIL?

So, since several days I do a lot of research about multiprocessing and multithreading on python and i'm very confused about many thing. So many times I see someone talking about GIL something that doesn't allow Python code to execute on several cpu cores, but when I code a program who create many threads I can see several cpu cores are active.
1st question: What's is really GIL? does it work? I think about something like when a process create too many thread the OS distributed task on multi cpu. Am I right?
Other thing, I want take advantage of my cpus. I think about something like create as much process as cpu core and on this each process create as much thread as cpu core. Am I on the right lane?

To start with, GIL only ensures that only one cpython bytecode instruction will run at any given time. It does not care about which CPU core runs the instruction. That is the job of the OS kernel.
So going over your questions:
GIL is just a piece of code. The CPython Virtual machine is the process which first compiles the code to Cpython bytecode but it's normal job is to interpret the CPython bytecode. GIL is a piece of code that ensures a single line of bytecode runs at a time no matter how many threads are running. Cpython Bytecode instructions is what constitutes the virtual machine stack. So in a way, GIL will ensure that only one thread holds the GIL at any given point of time. (also that it keeps releasing the GIL for other threads and not starve them.)
Now coming to your actual confusion. You mention that when you run a program with many threads, you can see multiple (may be all) CPU cores firing up. So I did some experimentation and found that your findings are right (which is obvious) but the behaviour is similar in a non threaded version too.
def do_nothing(i):
time.sleep(0.0001)
return i*2
ThreadPool(20).map(do_nothing, range(10000))
def do_nothing(i):
time.sleep(0.0001)
return i*2
[do_nothing(i) for i in range(10000)]
The first one in multithreaded and the second one is not. When you compare the CPU usage by by both the programs, you will find that in both the cases multiple CPU cores will fire up. So what you noticed, although right, has not much to do with GIL or threading. CPU usage going high in multiple cores is simply because OS kernel will distribute the execution of code to different cores based on availability.
Your last question is more of an experimental thing as different programs have different CPU/io usage. You just have to be aware of the cost of creation of a thread and a process and the working of GIL & PVM and optimize the number of threads and processes to get the maximum perf out.
You can go through this talk by David Beazley to understand how multithreading can make your code perform worse (or better).

There are answers about what the Global Interpreter Lock (GIL) is here. Buried among the answers is mention of Python "bytecode", which is central to the issue. When your program is compiled, the output is bytecode, i.e. low-level computer instructions for a fictitious "Python" computer, that gets interpreted by the Python interpreter. When the interpreter is executing a bytecode, it serializes execution by acquiring the Global Interpreter Lock. This means that two threads cannot be executing bytecode concurrently on two different cores. This also means that true multi-threading is not implemented. But does this mean that there is no reason to use threading? No! Here are a couple of situations where threading is still useful:
For certain operations the interpreter will release the GIL, i.e. when doing I/O. So consider as an example the case where you want to fetch a lot of URLs from different websites. Most of the time is spent waiting for a response to be returned once the request is made and this waiting can be overlapped even if formulating the requests has to be done serially.
Many Python functions and modules are implemented in the C language and are free to release the GIL after being called if their processing requirements allow it. The numpy module is one such highly optimized package.
Consequently, threading is best used when the tasks are not CPU-intensive, i.e. they do a lot of waiting for I/O to complete, or they do a lot of sleeping, etc.

Why do we blame GIL if CPU can execute one process (light weight) at a time? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Fastest way to handle threading with callback in Python

I'm playing around with some measurement instruments through PyVisa (Python wrapper for VISA). Specifically I need to read the measurement value from four instruments, similar to this:
current1 = instrument1.ask("READ?")
current2 = instrument2.ask("READ?")
current3 = instrument3.ask("READ?")
current4 = instrument4.ask("READ?")
For my application, speed is a must. Individually, I can get between 50 and 200 measurements per second from the four instruments, but unfortunately my current code evaluates the four instruments serially.
From what I've read, there's some options with Threading and Multiprocessing in Python, but it's not obvious what the best, and fastest, option for me.
Best case scenario I will be spawning ~4x50 threads per second, so overhead is a bit of a concern.
The task is in no way CPU intensive, it is simply a matter of waiting for the readout from the instrument.
Any advice on what the proper course of action might be?
Many thanks in advance!

Try using the multiprocessing module in python. Because of the way the python interpreter is written, the threading module for python actually slows down your application when using multiple threads on a computer with multiple processors. This is due to something called the Global Interpreter Lock (GIL) which only allows one python thread per process to be run at a time. This makes things like memory management and concurrency easier for the interpreter.
I would suggest trying to create a process for each instrument (producer process) and send the results to a shared queue that gets processed by a single process (consumer process).
I would also limit the number of processes that you create to the number of processors on your computer. Creating too many processes introduces a lot of overhead.
If you really feel like you need the 200 threads you suggested in your post, I would program this in Java or C++.

I'm not sure if this fits your requirements I'm not too familiar with pyVisa, however have you looked into pypy's stackless examples the speed increase is pretty impressive, and you're able to spawn almost infinite amount of 'threadlets' without much hindrance to your system / program.
pypy is kinda limited in terms of supporting various packages but looks like pyVisa isn't one of them.
https://pypi.python.org/pypi/PyVISA

Can standard C Python has more than one thread running at the same time? [duplicate]

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.