I found the following lines of code, which compute the array mx by repeatedly calling a function fun.
However, I would like to understand better what it does.
Also, I assigned 16 processes to the parallel pool, yet I noticed that during the computation no more than 2 cores are running at the same time.
Could someone explain what this code does, and why only part of the workers might be doing any work?
Thank you!
from tqdm import tqdm
from multiprocessing import Pool
from functools import partial

with Pool(processes=16) as p_mx:
    mx = tqdm(p_mx.imap(partial(fun, L), nodes), total=n)
See: multiprocessing.Pool() slower than just using ordinary functions
The function you are trying to parallelize doesn't require enough CPU resources (i.e. CPU time) to justify parallelization!
It may also be caused by the way Python handles multi-threading and multi-processing with the GIL:
When to use threading and how many threads to use
Look at the GIL and you will have a better understanding of why.
If you want concurrent code in Python 3.8 and you have CPU-bound concurrency problems, then this could be the ticket!
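A minimal sketch of one common mitigation, reusing the names from the question (fun, L, nodes, n): give imap a chunksize so each worker receives batches of nodes, and consume the lazy iterator while the pool is still alive:

from functools import partial
from multiprocessing import Pool
from tqdm import tqdm

with Pool(processes=16) as p_mx:
    # chunksize batches many nodes per task, cutting per-item IPC overhead;
    # list(...) forces the lazy imap iterator to run before the pool closes.
    mx = list(tqdm(p_mx.imap(partial(fun, L), nodes, chunksize=64), total=n))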
Update: To save your time, I will give the answer directly here. Python cannot utilize multiple CPU cores at the same time if you write your code in pure Python. But Python can utilize multiple cores at the same time when it calls functions or packages written in C, like NumPy.
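For instance (a sketch; whether this actually spreads across cores depends on NumPy being linked against a multithreaded BLAS such as OpenBLAS or MKL):

import numpy as np

# A single Python thread, but the BLAS library underneath (build-dependent)
# may spread this matrix product across several cores.
a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)
c = a @ b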
I have heard that "multithreading in Python is not real multithreading, because of the GIL". And I have also heard that "Python multithreading is okay for handling IO-intensive tasks instead of computationally intensive tasks, because only one thread runs at any given time".
But my experience made me rethink this question. My experience shows that even for a computationally intensive task, Python multithreading can accelerate the computation nearly linearly. (Before multithreading, it took me 300 seconds to run the following program; with multithreading, it took 100 seconds.)
The following figures show that 5 threads were created by CPython with the threading package and that all the CPU cores are at nearly 100% utilization.
I think the screenshots prove that the 5 CPU cores are running at the same time.
So can anyone give me an explanation? Can I apply multithreading to computationally intensive tasks in Python? Can multiple threads/cores really run at the same time in Python?
My code:
import threading
import time
import numpy as np
from scipy import interpolate

number_list = list(range(10))
# One shared lock for all threads; creating a new threading.Lock() inside
# the loop would lock nothing, since each thread would hold its own lock.
list_lock = threading.Lock()

def image_interpolation():
    while True:
        number = None
        with list_lock:
            if len(number_list):
                number = number_list.pop()
        if number is not None:
            # Make a fake image - you can use yours.
            image = np.ones((20000, 20000))
            # Make your orig array (skipping the extra dimensions).
            orig = np.random.rand(12800, 16000)
            # Make its coordinates; x is horizontal.
            x = np.linspace(0, image.shape[1], orig.shape[1])
            y = np.linspace(0, image.shape[0], orig.shape[0])
            # Make the interpolator function.
            f = interpolate.interp2d(x, y, orig, kind='linear')
        else:
            return 1

workers = 5
thd_list = []
t1 = time.time()
for i in range(workers):
    thd = threading.Thread(target=image_interpolation)
    thd.start()
    thd_list.append(thd)
for thd in thd_list:
    thd.join()
t2 = time.time()
print("total time cost with multithreading: " + str(t2 - t1))

number_list = list(range(10))
for i in range(10):
    image_interpolation()
t3 = time.time()
print("total time cost without multithreading: " + str(t3 - t2))
The output is:
total time cost with multithreading: 112.71922039985657
total time cost without multithreading: 328.45561170578003
screenshot of top during multithreading
screenshot of top -H during multithreading
screenshot of top (after pressing 1) during multithreading
screenshot of top -H without multithreading
As you mentioned, Python has a "global interpreter lock" (GIL) that prevents two threads of Python code from running at the same time. The reason multi-threading can speed up IO-bound tasks is that Python releases the GIL when, for example, listening on a network socket or waiting for a disk read. So the GIL does not prevent two lots of work being done simultaneously by your computer; it prevents two Python threads in the same Python process from running simultaneously.
In your example, you use numpy and scipy. These are largely written in C and utilise libraries (BLAS, LAPACK, etc.) written in C/Fortran/Assembly. When you perform operations on numpy arrays, it is akin to listening on a socket in that the GIL is released. When the GIL is released and a numpy array operation is called, numpy gets to decide how to perform the work. If it wants, it can spawn other threads or processes, and the BLAS subroutines it calls might spawn other threads. Precisely if/how this is done can be configured at build time if you want to compile numpy from source.
So, to summarise, you have found an exception to the rule. If you were to repeat the experiment using only pure Python functions, you would get quite different results (e.g. see the "Comparison" section of the page linked to above).
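A minimal sketch of that difference (timings are machine- and build-dependent, so treat it as illustrative only):

import threading
import time
import numpy as np

def pure_python():
    # Runs Python bytecode the whole time, so it holds the GIL:
    # threads executing this cannot overlap.
    s = 0
    for i in range(10**7):
        s += i

def numpy_work():
    # The heavy lifting happens in C with the GIL released, so
    # threads executing this can overlap.
    a = np.random.rand(2000, 2000)
    b = a @ a

for target in (pure_python, numpy_work):
    t0 = time.time()
    threads = [threading.Thread(target=target) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(target.__name__, time.time() - t0)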
Python threading is real threading; it is just that no two threads can be inside the interpreter at once (and this is what the GIL is about). The native part of the code can run in parallel without contention on multiple threads; only when diving back into the interpreter do the threads have to serialize among each other.
The fact that you have all CPU cores loaded to 100% is not, by itself, proof that you are using the machine "efficiently". You need to make sure that the CPU usage is not due to context switching.
If you switch to multiprocessing instead of threading (the APIs are very similar), you won't have to second-guess this, but then you'll have to marshal the payload when passing it between processes.
So you need to measure everything anyway.
I am trying to use multithreading and/or multiprocessing to speed up my script somewhat. Essentially I have a list of 10,000 subnets I read in from a CSV, which I want to convert into IPv4 network objects and then store in an array.
My base code is as follows and executes in roughly 300ms:
import ipaddress

aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

for y in acls:  # acls is the list of dicts read from the CSV
    convertToIP(y['srcSubnet'])
If I try with concurrent.futures Threads it works but is 3-4x as slow, as follows:
import concurrent.futures
import ipaddress

aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
Then if I try with concurrent.futures processes, it is 10-15x as slow and the array is empty. The code is as follows:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
The server I am running this on has 28 physical cores.
Any suggestions as to what I might be doing wrong will be gratefully received!
If tasks are too small, then the overhead of managing multiprocessing / multithreading is often more expensive than the benefit of running tasks in parallel.
You might try the following:
Just create two processes (not threads!!!), one treating the first 5000 subnets, the other treating the remaining 5000.
There you might be able to see some performance improvement, but the tasks you perform are not that CPU- or IO-intensive, so I am not sure it will work.
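A minimal sketch of that two-process idea (assuming acls has already been read from the CSV, as in the question):

import ipaddress
from multiprocessing import Pool

def convert_chunk(subnets):
    # One task per chunk: the per-task IPC overhead is paid once per
    # 5000 subnets instead of once per subnet.
    return [ipaddress.ip_network(s) for s in subnets]

if __name__ == '__main__':
    subnets = [y['srcSubnet'] for y in acls]
    half = len(subnets) // 2
    with Pool(processes=2) as pool:
        results = pool.map(convert_chunk, [subnets[:half], subnets[half:]])
    aclsConverted = [net for chunk in results for net in chunk]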
Multithreading in Python, on the other hand, will give no performance improvement at all for tasks that have no IO and that are pure Python code.
The reason is the infamous GIL (global interpreter lock). In Python you can never execute two pieces of Python bytecode in parallel within the same process.
Multithreading in Python still makes sense for tasks that perform IO (network accesses), that sleep, or that call modules implemented in C which release the GIL. numpy, for example, releases the GIL and is thus a good candidate for multithreading.
Just being a noob in this context:
I am trying to run one function in multiple processes so I can process a huge file in a shorter time.
I tried:
from multiprocessing import Process

for file_chunk in file_chunks:
    p = Process(target=my_func, args=(file_chunk, my_arg2))
    p.start()
    # without .join(), otherwise the main proc has to wait
    # for proc1 to finish before it can start proc2
but it did not seem fast enough.
Now I am asking myself whether the jobs are really running in parallel. I thought about Pool too, but I am using Python 2 and it is ugly to make it map two arguments to the function.
Am I missing something in my code above, or do the processes created this way really run in parallel?
The speedup is proportional to the number of CPU cores your PC has, not the number of chunks.
Ideally, if you have 4 CPU cores, you should see a 4x speedup. Yet other factors such as IPC overhead must be taken into account when considering the performance improvement.
Spawning too many processes will also negatively affect your performance as they will compete against each other for the CPU.
I'd recommend using a multiprocessing.Pool to deal with most of the logic. If you have multiple arguments, just use the apply_async method:
from multiprocessing import Pool

pool = Pool()
for file_chunk in file_chunks:
    pool.apply_async(my_func, args=(file_chunk, arg1, arg2))
pool.close()
pool.join()  # wait for the submitted tasks to finish
I am not an expert either, but what you should try is using joblib's Parallel:
from joblib import Parallel, delayed
import multiprocessing as mp

def random_function(args):
    pass

proc = mp.cpu_count()
Parallel(n_jobs=proc)(delayed(random_function)(args) for args in args_list)
This will run a certain function (random_function) using the number of available CPUs (n_jobs).
Feel free to read the docs!
I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I can't figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum:
from random import random
import threading

def ox():
    print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
    r = threading.Thread(target=ox)
    r.start()
    ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of each other. Why should this be slower?
I suspect ox() is being parallelized automatically, because if I look at the Windows Task Manager performance tab and call ox() in my Python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) are able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions and in development branches, which at the very least should make multi-threaded CPU-bound code as fast as single-threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module, which ships with Python, makes that fairly easy.
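For example, a minimal multiprocessing version of the test in the question (a sketch written for Python 3; the dummy argument exists only so Pool.map has something to iterate over):

from random import random
from multiprocessing import Pool

def ox(_):
    # Same workload as the question, but each call runs in its own
    # process with its own GIL.
    return max([random() for x in range(20000000)])

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        print(pool.map(ox, range(2)))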
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)
The problem is in the function random(). If you remove random() from your code, the slowdown disappears.
Both cores try to access the shared state of the random function, so they work sequentially and spend a lot of time on cache synchronization. Such behavior is known as false sharing.
Read this article: False Sharing.
As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can use the Python multiprocessing module to fix that, or, if you are willing to use other open source libraries, Ray is also a great option for getting around the GIL problem; it is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray

ray.init()

@ray.remote
def ox():
    print(max([random() for x in range(20000000)]))

%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.
I am working on multiprocessing in Python.
For example, consider the example given in the Python multiprocessing documentation (I have changed 100 to 1000000 in the example, just to consume more time). When I run it, I do see that Pool() is using all 4 processes, but I don't see each CPU going up to 100%. How can I achieve 100% usage of each CPU?
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(f, range(10000000))
It is because multiprocessing requires interprocess communication between the main process and the worker processes behind the scenes, and the communication overhead takes more (wall-clock) time than the "actual" computation (x * x) in your case.
Try a "heavier" computation kernel instead, like:
import math

def f(x):
    # reduce is a builtin in Python 2; in Python 3, use functools.reduce
    return reduce(lambda a, b: math.log(a + b), xrange(10**5), x)
Update (clarification)
I pointed out that the low CPU usage observed by the OP was due to the IPC overhead inherent in multiprocessing, but the OP didn't need to worry about it too much because the original computation kernel was way too "light" to be used as a benchmark. In other words, multiprocessing performs at its worst with such an overly "light" kernel. If the OP implements real-world logic (which, I'm sure, will be somewhat "heavier" than x * x) on top of multiprocessing, the OP will achieve decent efficiency, I assure you. My argument is backed up by the experiment with the "heavy" kernel presented above.
@FilipMalczak, I hope my clarification makes sense to you.
By the way, there are some ways to improve the efficiency of x * x while using multiprocessing. For example, we can combine 1,000 jobs into one before submitting them to Pool, unless we are required to solve each job in real time (i.e., if you implement a REST API server, you shouldn't do it this way).
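A minimal sketch of that batching idea, using the chunksize parameter that Pool.map already provides (it does the combining for you):

from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # chunksize=1000 ships 1,000 inputs per IPC round trip instead
        # of one, amortizing the communication overhead over each batch.
        result = pool.map(f, range(10000000), chunksize=1000)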
You're asking the wrong kind of question. multiprocessing.Process represents a process as understood by your operating system. multiprocessing.Pool is just a simple way to run several processes to do your work. The Python environment has nothing to do with balancing the load on cores/processors.
If you want to control how processor time is given to processes, you should try tweaking your OS, not the Python interpreter.
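For example, on Linux you can pin a process to particular cores from Python, but the knob you are turning belongs to the OS scheduler, not the interpreter (a sketch; os.sched_setaffinity is Linux-only):

import os

# Pin the current process (pid 0 means "this process") to cores 0-3.
os.sched_setaffinity(0, {0, 1, 2, 3})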
Of course, "heavier" computations will be recognised by the system and may look like they do just what you want, but in fact you have almost no control over process handling.
"Heavier" functions will just look heavier to your OS, and its usual reaction will be to assign more processor time to your processes. But that doesn't mean you did what you wanted to, and that's good, because that's the whole point of languages with a VM: you specify the logic, and the VM takes care of mapping that logic onto the operating system.