1) I have read that if I import the threading module in Python, CPU-bound loads won't see much benefit from using this library, because the GIL forces threads to run one at a time even on a multi-core machine. If this is the case, what sort of code would benefit from using Python's threading library?
2) If that's the case for the threading library, then to perform CPU-intensive tasks in parallel, such as cross-correlations of two signals, is the multiprocessing module the best one to use?
To make this more concrete, let's say the task I'd like to parallelize is the for loop in the following code, and my machine has only 12 cores. Suppose my template has a length of ~1000, my image has a length of ~2000, and I have ~1000 signals to sort through:
import numpy as np
###2-D array of shape (points, signals)
signals = np.load('signals.npy')
###1-D template array for cross correlation
templateSignal = np.load('template.npy')
for s in range(signals.shape[1]):  # shape[1] indexes the signals axis of the 2-D array
    xcorr = np.correlate(templateSignal, signals[:, s])
Even with the GIL, threading in Python is useful because input/output operations don't block the program: one thread can keep working while another waits for the completion of a disk operation or for a network event.
Threads also play a role in GUI applications, where a program can stay responsive to user input while performing calculations in the background (thanks @FogleBird).
As for 2), you are correct in your assumption that you can spread CPU-intensive programs across multiple cores with the multiprocessing module. Be aware, though, of the cost of communication between processes; a minimal sketch of the approach follows below.
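For instance, here is a minimal sketch of how the loop from your question could be spread over processes with multiprocessing.Pool. The file names and shapes come from your post; the worker function xcorr_one is my own naming, and on platforms that spawn workers (e.g. Windows) each process re-imports the module and re-loads the arrays:

import numpy as np
from multiprocessing import Pool

# Loaded at module level so each worker process gets its own copy on import.
signals = np.load('signals.npy')          # 2-D array of shape (points, signals)
templateSignal = np.load('template.npy')  # 1-D template array

def xcorr_one(s):
    # Cross-correlate the template against the s-th signal column.
    return np.correlate(templateSignal, signals[:, s])

if __name__ == '__main__':
    with Pool(processes=12) as pool:      # one worker per core on a 12-core box
        xcorrs = pool.map(xcorr_one, range(signals.shape[1]))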
Update: To save you time, I'll give the answer directly here: Python cannot utilize multiple CPU cores at the same time if you write your code in pure Python, but it can utilize multiple cores at the same time when it calls functions or packages written in C, like NumPy.
I have heard that "multithreading in Python is not real multithreading, because of the GIL", and also that "Python multithreading is okay for handling IO-intensive tasks rather than computationally intensive tasks, because only one thread runs at a time".
But my experience made me rethink this question. My experience shows that even for a computationally intensive task, Python multithreading can accelerate the computation nearly linearly. (Before multithreading it took 300 seconds to run the following program; with multithreading it took 100 seconds.)
The figures below show that 5 threads were created with the threading package under CPython, and all the CPU cores are at nearly 100% utilization.
I think the screenshots prove that the 5 CPU cores are running at the same time.
So can anyone give me an explanation? Can I apply multithreading to computationally intensive tasks in Python? Can multiple threads/cores run at the same time in Python?
My code:
import threading
import time
import numpy as np
from scipy import interpolate
number_list = list(range(10))
lock = threading.Lock()  # one shared lock; a fresh Lock() per iteration would not synchronize anything

def image_interpolation():
    while True:
        number = None
        with lock:
            if len(number_list):
                number = number_list.pop()
        if number is not None:
            # Make a fake image - you can use yours.
            image = np.ones((20000, 20000))
            # Make your orig array (skipping the extra dimensions).
            orig = np.random.rand(12800, 16000)
            # Make its coordinates; x is horizontal.
            x = np.linspace(0, image.shape[1], orig.shape[1])
            y = np.linspace(0, image.shape[0], orig.shape[0])
            # Make the interpolator function.
            f = interpolate.interp2d(x, y, orig, kind='linear')
        else:
            return 1
workers = 5
thd_list = []
t1 = time.time()
for i in range(workers):
    thd = threading.Thread(target=image_interpolation)
    thd.start()
    thd_list.append(thd)
for thd in thd_list:
    thd.join()
t2 = time.time()
print("total time cost with multithreading: " + str(t2 - t1))

number_list = list(range(10))
for i in range(10):
    image_interpolation()
t3 = time.time()
print("total time cost without multithreading: " + str(t3 - t2))
output is:
total time cost with multithreading: 112.71922039985657
total time cost without multithreading: 328.45561170578003
Screenshots:
- top during multithreading
- top -H during multithreading
- top with the per-core view (press 1) during multithreading
- top -H without multithreading
As you mentioned, Python has a "global interpreter lock" (GIL) that prevents two threads of Python code running at the same time. The reason multi-threading can speed up IO-bound tasks is that Python releases the GIL when, for example, listening on a network socket or waiting for a disk read. So the GIL does not prevent two lots of work being done simultaneously by your computer; it prevents two Python threads in the same Python process from running simultaneously.
In your example, you use numpy and scipy. These are largely written in C and utilise libraries (BLAS, LAPACK, etc.) written in C/Fortran/Assembly. When you perform operations on numpy arrays, it is akin to listening on a socket in that the GIL is released. Once the GIL is released and a numpy array operation is called, numpy gets to decide how to perform the work. If it wants, it can spawn other threads or processes, and the BLAS subroutines it calls might spawn other threads. Precisely if and how this is done can be configured at build time if you want to compile numpy from source.
So, to summarise, you have found an exception to the rule. If you were to repeat the experiment using only pure Python functions, you would get quite different results (e.g. see the "Comparison" section of the page linked to above). A small experiment below illustrates the difference.
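If you want to see the contrast directly, here is a small, hypothetical experiment: the pure-Python loop holds the GIL, so running it on several threads takes roughly the sum of the single-thread times, while the numpy matrix multiply releases the GIL and the threads can genuinely overlap (exact numbers depend on your BLAS build and core count):

import threading
import time
import numpy as np

a = np.random.rand(1500, 1500)

def pure_python():
    # A bytecode loop: the GIL serializes this across threads.
    s = 0
    for i in range(5_000_000):
        s += i

def numpy_work():
    # The GIL is released inside the C-level matrix multiply.
    a @ a

def timed(target, n_threads):
    threads = [threading.Thread(target=target) for _ in range(n_threads)]
    t0 = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0

print("pure Python, 4 threads:", timed(pure_python, 4))  # roughly 4x one thread
print("numpy matmul, 4 threads:", timed(numpy_work, 4))  # much less than 4x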
Python threading is real threading; it's just that no two threads can be in the interpreter at once (and this is what the GIL is about). The native parts of the code can run in parallel on multiple threads without contention; only when diving back into the interpreter do they have to serialize against each other.
The fact that you have all CPU cores loaded to 100% alone is not proof that you are using the machine "efficiently". You need to make sure that the CPU usage is not due to context switching.
If you switch to multiprocessing instead of threading (the APIs are very similar; see the sketch below), you won't have to second-guess, but then you'll have to marshal the payload when passing it between processes.
So you need to measure everything anyway.
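As a rough illustration of how similar the two APIs are, the pools are drop-in replacements for each other; the work function here is just a placeholder:

from multiprocessing import Pool              # real processes
from multiprocessing.pool import ThreadPool   # threads, same interface

def work(n):
    return n * n  # placeholder for your per-item workload

if __name__ == '__main__':
    with Pool(5) as p:                        # swap in ThreadPool(5) to compare
        print(p.map(work, range(10)))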
I'm using Pool from the multiprocessing package (from multiprocessing.dummy import Pool).
I wrote a function that reads a text file and preprocesses it for a future function.
I have about 20,000 such text files, so I wanted to parallelize the process, and for this I used the pool.
I have 32 cores on the remote server that is running the code, so I tried to open 70 processes (I also tried fewer; the problem remains). This is what my system monitor looks like:
As one can see, 16 out of 32 cores don't work at all!
Any help would be appreciated.
As I said in my comment, all multiprocessing.dummy structures are intended to simulate the multiprocessing interface using regular threads, which can be quite useful for testing, debugging, profiling, etc. Or, as the official docs say:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
While Python (CPython) threading uses real system threads, and hence it is in theory possible to have your threaded code execute on different CPUs, due to the dreaded GIL no two of those threads will ever run simultaneously. There are exceptions to that rule, though - all tasks that abstract system calls and wait for an event (like I/O) can execute in parallel, but the moment processing moves into the Python domain it will be locked out by the GIL and will not be allowed to continue execution until the interpreter switches thread context.
Long story short, if you want to utilize multiple cores through a multiprocessing pool, do not use the adaptations and abstractions in multiprocessing.dummy (that holds true for other dummy packages, too) and use the root multiprocessing module itself - in your case, multiprocessing.pool.Pool. A minimal sketch of the switch follows below.
That being said, given that the threading module doesn't come with a pool interface, I often find myself using multiprocessing.dummy.Pool (or multiprocessing.pool.ThreadPool) instead for I/O-heavy stuff (i.e. not restricted by the GIL) when shared memory is more important than shared processing and the overhead that it incurs. It's quite possible that even with a switch to multiprocessing.pool.Pool you won't notice much of a difference if you don't do heavy post-processing when you grab the files.
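For completeness, a minimal sketch of the recommended switch; preprocess() and the file list are hypothetical stand-ins for your actual code:

from multiprocessing import Pool  # note: not multiprocessing.dummy
import glob

def preprocess(path):
    with open(path) as f:
        text = f.read()
    return text.lower()           # placeholder for the real preprocessing

if __name__ == '__main__':
    files = glob.glob('texts/*.txt')
    with Pool(32) as pool:        # one worker per core on the 32-core server
        results = pool.map(preprocess, files)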
I'm slightly confused about whether multithreading works in Python or not.
I know there have been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience, and have seen others post their own answers and examples here on StackOverflow, that multithreading is indeed possible in Python. So why is it that everyone keeps saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say threads are still useful because they work concurrently and thus get the combined workload done faster. I mean, why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will run in parallel for some IO tasks, but CPU-bound tasks can only run one thread at a time, regardless of cores.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.
The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: any task that tries to get a speed boost from parallel execution using pure Python code will not see a speed-up, as threaded Python code is locked to one thread executing at a time. If you mix in C extensions or I/O, however (such as PIL or numpy operations), any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code; stick to the multiprocessing module for such tasks, or delegate to a dedicated external library. A sketch for your string-list example follows below.
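To make that concrete for your string-list example, here is a hedged sketch with processes instead of threads; the transform and the chunk size are illustrative:

from multiprocessing import Pool

def transform(s):
    # Stand-in for the "basic string operations" on each list item.
    return s.strip().upper()

if __name__ == '__main__':
    strings = ['foo ', ' bar'] * 500_000
    with Pool() as pool:  # defaults to os.cpu_count() workers
        results = pool.map(transform, strings, chunksize=10_000)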
Yes. :)
You have the low-level thread module and the higher-level threading module. But if you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Threading is allowed in Python; the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically, if you want to multi-thread the code to speed up a calculation, it won't speed it up, as just one thread is executed at a time; but if you use it to interact with a database, for example, it will. A minimal sketch of the I/O case follows.
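Here is a minimal sketch of the case where threads do pay off: each worker blocks on the network and releases the GIL while it waits. The URL is a placeholder:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # The GIL is released while the thread waits on the socket.
    with urllib.request.urlopen(url) as resp:
        return len(resp.read())

urls = ['https://example.com'] * 10
with ThreadPoolExecutor(max_workers=10) as ex:
    sizes = list(ex.map(fetch, urls))
print(sizes)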
I feel for the poster, because the answer is invariably "it depends what you want to do". However, parallel speed-up in Python has always been terrible in my experience, even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2, 4, 8, 16]
for task in tasks:
    tic = time.perf_counter()
    pool = mp.Pool(task)
    results = pool.map(howmany_within_range_rowonly, [row for row in data])
    pool.close()
    toc = time.perf_counter()
    time1 = toc - tic
    print(f"parallel {time1} tasks {task}")
I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I can't figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum:
from random import random
import threading
def ox():
    # Python 2 code: print statement and xrange.
    print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
    r = threading.Thread(target=ox)
    r.start()
    ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of each other. Why should this be slower?
I suspect ox() is being parallelized automatically, because if I look at the Windows Task Manager performance tab and call ox() in my Python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions and in development branches, which at the very least should make multi-threaded CPU-bound code as fast as single-threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
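For example, here is a minimal sketch of the same experiment using multiprocessing, ported to Python 3 syntax; timings will vary by machine:

from random import random
from multiprocessing import Process

def ox():
    print(max(random() for _ in range(20000000)))

if __name__ == '__main__':
    processes = [Process(target=ox) for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()  # both maxima print in roughly the time of one run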
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell).
The problem is in the function random(). If you remove random() from your code, the effect disappears. Both cores try to access the shared state of the random function, so the cores work sequentially and spend a lot of time on cache synchronization. Such behavior is known as false sharing. Read this article: False Sharing.
As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can either use the Python multiprocessing module to fix that or, if you are willing to use other open-source libraries, Ray is also a great option to get around the GIL problem; it is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray
ray.init()
@ray.remote
def ox():
    print(max([random() for x in range(20000000)]))
%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.