Can I apply multithreading for computationally intensive tasks in Python?

Update: To save your time, I'll give the answer directly here: Python cannot utilize multiple CPU cores at the same time if you write your code in pure Python, but it can utilize multiple cores at the same time when it calls functions or packages written in C, like NumPy, etc.
I have heard that "multithreading in Python is not real multithreading, because of the GIL", and also that "Python multithreading is fine for IO-intensive tasks but not for computationally intensive tasks, because only one thread runs at a time".
But my experience made me rethink this question. It shows that even for a computationally intensive task, Python multithreading can accelerate the computation nearly linearly. (Before multithreading, it took me 300 seconds to run the following program; with multithreading, it took 100 seconds.)
The following figures show that 5 threads were created by Python (CPython, using the threading package) and that all the CPU cores are at nearly 100%.
I think the screenshots prove that the 5 CPU cores are running at the same time.
So can anyone explain this? Can I apply multithreading to computationally intensive tasks in Python? Can multiple threads/cores run at the same time in Python?
My code:
import threading
import time
import numpy as np
from scipy import interpolate
number_list = list(range(10))
def image_interpolation():
    while True:
        number = None
        with threading.Lock():
            if len(number_list):
                number = number_list.pop()
        if number is not None:
            # Make a fake image - you can use yours.
            image = np.ones((20000, 20000))
            # Make your orig array (skipping the extra dimensions).
            orig = np.random.rand(12800, 16000)
            # Make its coordinates; x is horizontal.
            x = np.linspace(0, image.shape[1], orig.shape[1])
            y = np.linspace(0, image.shape[0], orig.shape[0])
            # Make the interpolator function.
            f = interpolate.interp2d(x, y, orig, kind='linear')
        else:
            return 1
workers=5
thd_list = []
t1 = time.time()
for i in range(workers):
    thd = threading.Thread(target=image_interpolation)
    thd.start()
    thd_list.append(thd)
for thd in thd_list:
    thd.join()
t2 = time.time()
print("total time cost with multithreading: " + str(t2-t1))
number_list = list(range(10))
for i in range(10):
    image_interpolation()
t3 = time.time()
print("total time cost without multithreading: " + str(t3-t2))
output is:
total time cost with multithreading: 112.71922039985657
total time cost without multithreading: 328.45561170578003
screenshot of top during multithreading
screenshot of top -H during multithreading
screenshot of top then press 1. during multithreading
screenshot of top -H without multithreading

As you mentioned, Python has a "global interpreter lock" (GIL) that prevents two threads of Python code running at the same time. The reason that multi-threading can speed up IO bound tasks is that Python releases the GIL when, for example, listening on a network socket or waiting for a disk read. So the GIL does not prevent two lots of work being done simultaneously by your computer, it prevents two Python threads in the same Python process being run simultaneously.
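As a rough illustration of that point (a minimal sketch of mine, not code from the question): time.sleep releases the GIL just like a socket or disk wait does, so five one-second "waits" overlap and finish in roughly one second rather than five:
import threading
import time

def fake_io_task():
    # time.sleep releases the GIL, much like waiting on a socket or a disk read
    time.sleep(1)

t0 = time.time()
threads = [threading.Thread(target=fake_io_task) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("5 one-second waits with 5 threads:", time.time() - t0, "seconds")  # roughly 1 second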
In your example, you use numpy and scipy. These are largely written in C and utilise libraries (BLAS, LAPACK, etc.) written in C/Fortran/assembly. When you perform operations on numpy arrays, it is akin to listening on a socket in that the GIL is released. Once the GIL is released and the numpy array operation is called, numpy gets to decide how to perform the work. If it wants, it can spawn other threads or processes, and the BLAS subroutines it calls might spawn other threads. Precisely if/how this is done can be configured at build time if you want to compile numpy from source.
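To make this concrete, here is a minimal sketch (my own, not from the question) contrasting a pure-Python loop with a BLAS-backed numpy matrix product when each is run from four threads. The numpy call releases the GIL (and the BLAS library may use its own threads), so the threads can overlap on separate cores, while the pure-Python loop cannot:
import threading
import time
import numpy as np

def pure_python_work():
    # holds the GIL for the whole loop
    s = 0
    for i in range(5000000):
        s += i * i

def numpy_work():
    # the matrix product runs inside BLAS with the GIL released
    a = np.random.rand(1500, 1500)
    a.dot(a)

def run_in_threads(target, n=4):
    t0 = time.time()
    threads = [threading.Thread(target=target) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0

print("pure Python in 4 threads:", run_in_threads(pure_python_work))
print("numpy matmul in 4 threads:", run_in_threads(numpy_work))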
So, to summarise, you have found an exception to the rule. If you were to repeat the experiment using only pure Python functions, you would get quite different results (e.g. see the "Comparison" section of the page linked to above).

Python threading is real threading; it's just that no two threads can be inside the interpreter at once (and this is what the GIL is about). The native parts of the code can run in parallel on multiple threads without contention; only when they dive back into the interpreter do they have to serialize with each other.
The fact that you have all CPU cores loaded to 100% is not, by itself, proof that you are using the machine "efficiently". You need to make sure that the CPU usage is not due to context switching.
If you switch to multiprocessing instead of threading (they are very similar), you won't have to second-guess, but then you'll have to marshal the payload when passing it between processes.
So you need to measure everything anyway.
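If you do try the multiprocessing route, a rough sketch adapted from the question's code could look like the following (my adaptation, untested on your data; each worker builds its own arrays so nothing large crosses the process boundary, and interp2d is kept as in the question even though it is deprecated in recent SciPy):
import multiprocessing
import time
import numpy as np
from scipy import interpolate

def image_interpolation(number):
    # same per-item work as in the question; nothing is returned, so no large
    # result has to be pickled back to the parent process
    image = np.ones((20000, 20000))
    orig = np.random.rand(12800, 16000)
    x = np.linspace(0, image.shape[1], orig.shape[1])
    y = np.linspace(0, image.shape[0], orig.shape[0])
    f = interpolate.interp2d(x, y, orig, kind='linear')

if __name__ == '__main__':
    t1 = time.time()
    with multiprocessing.Pool(processes=5) as pool:
        pool.map(image_interpolation, range(10))
    print("total time cost with multiprocessing:", time.time() - t1)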

Related

Why don't I get faster run-times with ThreadPoolExecutor?

In order to understand how threads work in Python, I wrote the following simple function:
def sum_list(thelist:list, start:int, end:int):
    s = 0
    for i in range(start,end):
        s += thelist[i]**3//10
    return s
Then I created a list and tested how much time it takes to compute its sum:
import time

LISTSIZE = 5000000
big_list = list(range(LISTSIZE))
start = time.perf_counter()
big_sum=sum_list(big_list, 0, LISTSIZE)
print(f"One thread: sum={big_sum}, time={time.perf_counter()-start} sec")
It took about 2 seconds.
Then I tried to partition the computation into threads, such that each thread computes the function on a subset of the list:
import concurrent.futures

THREADCOUNT = 4
SUBLISTSIZE = LISTSIZE//THREADCOUNT
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(THREADCOUNT) as executor:
    futures = [executor.submit(sum_list, big_list, i*SUBLISTSIZE, (i+1)*SUBLISTSIZE) for i in range(THREADCOUNT)]
    big_sum = 0
    for res in concurrent.futures.as_completed(futures):  # collect each result as soon as it is completed
        big_sum += res.result()
print(f"{THREADCOUNT} threads: sum={big_sum}, time={time.perf_counter()-start} sec")
Since I have a 4-core CPU, I expected it to run 4 times faster. But it did not: it ran in about 1.8 seconds on my Ubuntu machine (and on my Windows machine, with 8 cores, it ran even slower than the single-thread version: about 2.2 seconds).
Is there a way to use ThreadPoolExecutor (or another threads-based mechanism in Python) so that I can compute this function faster?
The problem is that the function you are trying to make faster is CPU-bound and the Python Global Interpreter Lock (GIL) prevents any performance gain from parallelisation of such code.
In Python, threads are wrappers around genuine OS threads. However, in order to avoid race conditions due to concurrent execution, only one thread can hold the Python interpreter and execute bytecode at a time. This restriction is enforced by a lock called the GIL.
Thus CPU-bound Python code cannot achieve true parallelism with threads, and multiprocessing should be used instead. However, note that the GIL is released during IO operations (file reading, networking, etc.) and by some library code (numpy, etc.), so those operations can still benefit from Python multithreading.
The function sum_list uses neither of those, so it will not benefit from Python multithreading.
You can use ProcessPoolExecutor to get effective parallelism, but this may copy the input list in your case. Multiprocessing is equivalent to launching multiple independent Python interpreters, so the GILs (one per interpreter) are no longer an issue. However, multiprocessing incurs performance penalties during inter-process communication.
"Since I have a 4-core CPU, I expected it to run 4 times faster."
ThreadPoolExecutor does not give you parallelism across CPUs for Python code, so this isn't a sensible expectation: because of the GIL, only one worker thread executes Python bytecode at any given moment.
Threads generally only help when you're IO-bound; large calculations are CPU-bound.
To take advantage of multiple CPUs, you may want to look at a ProcessPoolExecutor instead. However, spawning/forking additional processes has a much higher overhead than threading, and any objects sent across the process boundary need to be picklable. Since the workers in your example code all reference the same big_list instance, it may not work well with multiprocessing either - it will be copying the entire list in each worker process, even though the worker only intends to use a small segment of the list.
You can refactor it to only send the data you need for the calculation (easy), or you can use shared memory (difficult).
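A minimal sketch of the "easy" refactor (my version, not code from either answer): ship each worker only its own quarter of the list and sum the partial results, so the full big_list is never pickled into every process. Pickling 5 million ints is still noticeable overhead, so it is worth timing against the single-threaded version:
import concurrent.futures
import time

def sum_sublist(sublist):
    s = 0
    for v in sublist:
        s += v ** 3 // 10
    return s

if __name__ == '__main__':
    LISTSIZE = 5000000
    WORKERS = 4
    big_list = list(range(LISTSIZE))
    chunk = LISTSIZE // WORKERS
    chunks = [big_list[i * chunk:(i + 1) * chunk] for i in range(WORKERS)]

    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(WORKERS) as executor:
        # each process receives only its own chunk and returns a single number
        big_sum = sum(executor.map(sum_sublist, chunks))
    print(f"{WORKERS} processes: sum={big_sum}, time={time.perf_counter()-start} sec")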

Python multithreading/multiprocessing very slow with concurrent.futures

I am trying to use multithreading and/or multiprocessing to speed up my script somewhat. Essentially I have a list of 10,000 subnets I read in from CSV that I want to convert into IPv4 objects and then store in an array.
My base code is as follows and executes in roughly 300ms:
import ipaddress

aclsConverted = []
def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))
for y in acls:
    convertToIP(y['srcSubnet'])
If I try with concurrent.futures Threads it works but is 3-4x as slow, as follows:
import concurrent.futures

aclsConverted = []
def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
Then if I try with concurrent.futures ProcessPoolExecutor it is 10-15x as slow and the array is empty. Code is as follows:
aclsConverted = []
def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))
with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
The server I am running this on has 28 physical cores.
Any suggestions as to what I might be doing wrong will be gratefully received!
If tasks are too small, then the overhead of managing multiprocessing / multithreading is often more expensive than the benefit of running tasks in parallel.
You might try the following: create just two processes (not threads!), one handling the first 5000 subnets and the other handling the remaining 5000 (see the sketch after this answer).
There you might see some performance improvement, but the tasks you perform are not that CPU- or IO-intensive, so I'm not sure it will help.
Multithreading in Python, on the other hand, will give no performance improvement at all for tasks that do no IO and are pure Python code.
The reason is the infamous GIL (global interpreter lock). In Python you can never execute two pieces of Python bytecode in parallel within the same process.
Multithreading in Python still makes sense for tasks that do IO (network access), that sleep, or that call modules implemented in C which release the GIL. numpy, for example, releases the GIL and is thus a good candidate for multithreading.
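A hedged sketch of that two-process idea (the acls list here is a placeholder standing in for the rows read from the CSV): each worker converts half of the subnet strings and returns its own list, which also sidesteps the empty-array problem of appending to a global from child processes. Given how cheap each conversion is, process startup and pickling may still dominate, so measure it:
import ipaddress
from concurrent.futures import ProcessPoolExecutor

# placeholder for the rows read from the CSV in the question
acls = [{'srcSubnet': '10.0.{}.0/24'.format(i % 256)} for i in range(10000)]

def convert_chunk(subnets):
    # convert one chunk and return the result instead of appending to a global
    return [ipaddress.ip_network(s) for s in subnets]

if __name__ == '__main__':
    subnets = [y['srcSubnet'] for y in acls]
    half = len(subnets) // 2
    with ProcessPoolExecutor(max_workers=2) as executor:
        parts = executor.map(convert_chunk, [subnets[:half], subnets[half:]])
        aclsConverted = [net for part in parts for net in part]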

Why does threading sorting functions work slower than sequentially running them in python? [duplicate]

I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I can't figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum
from random import random
import threading
def ox():
    print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
    r = threading.Thread(target=ox)
    r.start()
    ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of each other. Why should this be slower?
I suspect ox() is being parallelized automatically, because if I look at the Windows Task Manager performance tab and call ox() in my Python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions or in development branches - which at the very least should make multi-threaded CPU bound code as fast as single threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)
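A minimal sketch of that multiprocessing route (written with Python 3 syntax, unlike the Python 2 code in the question): run ox() in two separate processes, so each one gets its own interpreter and its own GIL:
import multiprocessing
from random import random

def ox():
    print(max([random() for x in range(20000000)]))

if __name__ == '__main__':
    # two processes, each with its own interpreter and GIL
    procs = [multiprocessing.Process(target=ox) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()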
The problem is in the function random(); try removing random from your code. Both cores try to access the shared state of the random function. The cores end up running sequentially and spend a lot of time on cache synchronization. Such behavior is known as false sharing. Read this article: False Sharing.
As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can either use the Python multiprocessing module to fix that, or, if you are willing to use other open source libraries, Ray is also a great option to get around the GIL problem; it is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray
ray.init()

@ray.remote
def ox():
    print(max([random() for x in range(20000000)]))

%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.

Python multiprocessing.Pool() doesn't use 100% of each CPU

I am working on multiprocessing in Python.
For example, consider the example given in the Python multiprocessing documentation (I have changed 100 to 1000000 in the example, just to consume more time). When I run this, I do see that Pool() is using all 4 processes, but I don't see each CPU moving up to 100%. How can I achieve 100% usage of each CPU?
from multiprocessing import Pool
def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(f, range(10000000))
It is because multiprocessing requires interprocess communication between the main process and the worker processes behind the scenes, and the communication overhead took more (wall-clock) time than the "actual" computation (x * x) in your case.
Try a "heavier" computation kernel instead, like
import math

def f(x):
    return reduce(lambda a, b: math.log(a+b), xrange(10**5), x)
Update (clarification)
I pointed out that the low CPU usage observed by the OP was due to the IPC overhead inherent in multiprocessing, but the OP didn't need to worry about it too much because the original computation kernel was far too "light" to be used as a benchmark. In other words, multiprocessing performs worst with such a "light" kernel. If the OP implements real-world logic (which, I'm sure, will be somewhat "heavier" than x * x) on top of multiprocessing, I assure you the OP will achieve decent efficiency. My argument is backed up by the experiment with the "heavy" kernel I presented above.
@FilipMalczak, I hope my clarification makes sense to you.
By the way, there are some ways to improve the efficiency of x * x while using multiprocessing. For example, we can combine 1,000 jobs into one before we submit them to Pool, unless we are required to solve each job in real time (i.e. if you implement a REST API server, we shouldn't do it this way).
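One concrete way to batch jobs like that (a sketch based on the Pool example from the question): Pool.map already groups submissions internally, but passing an explicit chunksize makes the batching deliberate and cuts the per-item IPC overhead:
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)
    # ship the work to the workers in batches of 10,000 items at a time
    result = pool.map(f, range(10000000), chunksize=10000)
    pool.close()
    pool.join()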
You're asking the wrong kind of question. multiprocessing.Process represents a process as understood by your operating system. multiprocessing.Pool is just a simple way to run several processes to do your work. The Python environment has nothing to do with balancing load across cores/processors.
If you want to control how processor time is given to processes, you should try tweaking your OS, not the Python interpreter.
Of course, "heavier" computations will be recognised by the system and may look like they do just what you want, but in fact you have almost no control over process handling.
"Heavier" functions will just look heavier to your OS, and its usual reaction will be to assign more processor time to your processes, but that doesn't mean you did what you wanted to - and that's good, because that's the whole point of languages with a VM: you specify the logic, and the VM takes care of mapping that logic onto the operating system.

Benefit of using threads/Threading module python 2.7

1) I have read that if I import the threading module in Python, CPU-bound loads won't see much benefit from using this library, because the GIL forces threads to run one at a time even if I run the code on a multi-core machine. If this is the case, what sort of code would benefit from using Python's threading library?
2) If that's the case for the threading library, then to perform CPU intensive tasks in parallel, such as cross-correlations of two signals, is the multiprocessing module the best module to use?
To make this more concrete, let's say the task I'd like to parallelize is the for loop in the following code, and my machine has only 12 cores. Suppose my template has a length of ~1000, my image has a length of ~2000 and I have ~1000 signals to sort through:
import numpy as np

### 2-D array of shape (points, signals)
signals = np.load('signals.npy')
### 1-D template array for cross correlation
templateSignal = np.load('template.npy')

for s in range(signals.shape[1]):
    xcorr = np.correlate(templateSignal, signals[:, s])
Even with the GIL, threading in Python is useful because input/output operations don't block the program. You can perform operations while waiting for the completion of a disk operation or while waiting for a network event.
Threads also play a role in GUI applications, where a program can stay responsive to user input while performing calculations in the background (thanks @FogleBird).
As for 2), you are correct in your assumption that you can spread CPU-intensive programs to multiple cores with the multiprocessing module. Be aware of the costs of communication between processes.
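A hedged sketch of 2) applied to the loop in the question (file names and shapes are taken from the question; untested): hand each signal column to a worker process and collect the cross-correlations. Shipping each column to a worker costs something too, so time it against the plain loop before committing to it:
import numpy as np
from multiprocessing import Pool

### 2-D array of shape (points, signals), as in the question
signals = np.load('signals.npy')
### 1-D template array for cross correlation
templateSignal = np.load('template.npy')

def xcorr_one(column):
    # each worker correlates the template against one signal column
    return np.correlate(templateSignal, column)

if __name__ == '__main__':
    columns = [signals[:, s] for s in range(signals.shape[1])]
    pool = Pool(processes=12)
    xcorrs = pool.map(xcorr_one, columns)
    pool.close()
    pool.join()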
