Why don't I get faster run-times with ThreadPoolExecutor?

Why don't I get faster run-times with ThreadPoolExecutor? - python

In order to understand how threads work in Python, I wrote the following simple function:
def sum_list(thelist:list, start:int, end:int):
s = 0
for i in range(start,end):
s += thelist[i]**3//10
return s
Then I created a list and tested how much time it takes to compute its sum:
LISTSIZE = 5000000
big_list = list(range(LISTSIZE))
start = time.perf_counter()
big_sum=sum_list(big_list, 0, LISTSIZE)
print(f"One thread: sum={big_sum}, time={time.perf_counter()-start} sec")
It took about 2 seconds.
Then I tried to partition the computation into threads, such that each thread computes the function on a subset of the list:
THREADCOUNT=4
SUBLISTSIZE = LISTSIZE//THREADCOUNT
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(THREADCOUNT) as executor:
futures = [executor.submit(sum_list, big_list, i*SUBLISTSIZE, (i+1)*SUBLISTSIZE) for i in range(THREADCOUNT)]
big_sum = 0
for res in concurrent.futures.as_completed(futures): # return each result as soon as it is completed:
big_sum += res.result()
print(f"{THREADCOUNT} threads: sum={big_sum}, time={time.perf_counter()-start} sec")
Since I have a 4-cores CPU, I expected it to run 4 times faster. But it did not: it ran in about 1.8 seconds on my Ubuntu machine (on my Windows machine, with 8 cores, it ran even slower than the single-thread version: about 2.2 seconds).
Is there a way to use ThreadPoolExecutor (or another threads-based mechanism in Python) so that I can compute this function faster?

The problem is that the function you are trying to make faster is CPU-bound and the Python Global Interpreter Lock (GIL) prevents any performance gain from parallelisation of such code.
In Python, threads are wrapper around genuine OS thread. However, in order to avoid race conditions due to concurrent execution, only one thread can access the Python interpreter to execute bytecode at a time. This restriction is enforced by a lock called the GIL.
Thus in Python, true multithreading cannot be achieved and multiprocessing should be used instead. However, note that the GIL is not locked by IO operations (file reading, networking, etc.) and some library code (numpy, etc.) so these operations can still benefit from Python multithreading.
The function sum_list used neither of those operations so it will not benefit from Python multithreading.
You can use ProcessPoolExecutor to effectively get parallelism but this may copy the input list in your case. Multiprocessing is equivalent to launching multiple independent Python interpreters, thus the GIL's (one per intepreter) is not an issue anymore. However, multiprocessing incurs performance penalties during inter-process communication.

Since I have a 4-cores CPU, I expected it to run 4 times faster.
ThreadPoolExecutor does not use multiple CPUs, so this isn't a sensible expectation. All worker threads are executing on the same CPU.
Threads generally only help when you're IO bound, large calculations are CPU bound.
To take advantage of multiple CPUs, you may want to look at a ProcessPoolExecutor instead. However, spawning/forking additional processes has a much higher overhead than threading, and any objects sent across the process boundary need to be picklable. Since the workers in your example code all reference the same big_list instance, it may not work well with multiprocessing either - it will be copying the entire list in each worker process, even though the worker only intends to use a small segment of the list.
You can refactor it to only send the data you need for the calculation (easy), or you can use shared memory (difficult).

Related

Python multithreading/multiprocessing very slow with concurrent.futures

I am trying to use multithreading and/or multiprocessing to speed up my script somewhat. Essentially I have a list of 10,000 subnets I read in from CSV, that I want to convert into an IPv4 object and then store in an array.
My base code is as follows and executes in roughly 300ms:
aclsConverted = []
def convertToIP(ip):
aclsConverted.append(ipaddress.ip_network(ip))
for y in acls:
convertToIP(y['srcSubnet'])
If I try with concurrent.futures Threads it works but is 3-4x as slow, as follows:
aclsConverted = []
def convertToIP(ip):
aclsConverted.append(ipaddress.ip_network(ip))
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
for y in acls:
executor.submit(convertToIP,y['srcSubnet'])
Then if I try with concurrent.futures Process it 10-15x as slow and the array is empty. Code is as follows
aclsConverted = []
def convertToIP(ip):
aclsConverted.append(ipaddress.ip_network(ip))
with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
for y in acls:
executor.submit(convertToIP,y['srcSubnet'])
The server I am running this on has 28 physical cores.
Any suggestions as to what I might be doing wrong will be gratefully received!

If tasks are too small, then the overhead of managing multiprocessing / multithreading is often more expensive than the benefit of running tasks in parallel.
You might try following:
Just to create two processes (not threads!!!), one treating the first 5000 subnets, the other the the other 5000 subnets.
There you might be able to see some performance improvement. but the tasks you perform are not that CPU or IO intensive, so not sure it will work.
Multithreading in Python on the other hand will have no performance improvement at all for tasks, that have no IO and that are pure python code.
The reason is the infamous GIL (global interpreter lock). In python you can never execute two python byte codes in parallel within the same process.
Multithreading in python makes still sense for tasks, that have IO (performing network accesses), that perform sleeps, that call modules, that are implemented in C and that do release the GIL. numpy for example releases the GIL and is thus a good candidate for multi threading

What is the difference between ProcessPoolExecutor and ThreadPoolExecutor?

I read the docs trying to get a basic understanding but it only shows that ProcessPoolExecutor allows to side-step the Global Interpreter Lock which I think is the way to lock a variable or function so that parallel processes do not update its value at the same time.
What I am looking for is when to use ProcessPoolExecutor and when to use ThreadPoolExecutor and what should I keep in mind while using each approach!

ProcessPoolExecutor runs each of your workers in its own separate child process.
ThreadPoolExecutor runs each of your workers in separate threads within the main process.
The Global Interpreter Lock (GIL) doesn't just lock a variable or function; it locks the entire interpreter. This means that every builtin operation, including things like listodicts[3]['spam'] = eggs, is automatically thread-safe.
But it also means that if your code is CPU-bound (that is, it spends its time doing calculations rather than, e.g., waiting on network responses), and not spending most of its time in an external library designed to release the GIL (like NumPy), only one thread can own the GIL at a time. So, if you've got 4 threads, even if you have 4 or even 16 cores, most of the time, 3 of them will be sitting around waiting for the GIL. So, instead of getting 4x faster, your code gets a bit slower.
Again, for I/O-bound code (e.g., waiting on a bunch of servers to respond to a bunch of HTTP requests you made), threads are just fine; it's only for CPU-bound code that this is an issue.
Each separate child process has its own separate GIL, so this problem goes away—even if your code is CPU-bound, using 4 child processes can still make it run almost 4x as fast.
But child processes don't share any variables. Normally, this is a good thing—you pass (copies of) values in as the arguments to your function, and return (copies of) values back, and the process isolation guarantees that you're doing this safely. But occasionally (usually for performance reasons, but also sometimes because you're passing around objects that can't be copied via pickle), this is not acceptable, so you either need to use threads, or use the more complicated explicit shared data wrappers in the multiprocessing module.

ProcessPool is for CPU bound tasks so you can benefit from multiple CPU.
Threads is for io bound tasks so you can benefit from io wait.

Multithreaded parsing is slower then sequential

I am parsing 4 large XML files through threads and somehow the multithreaded code is slower then the sequential code?
Here is my multithreaded code:
def parse():
thread_list = []
for file_name in cve_file:
t = CVEParser(file_name)
t.start()
thread_list.append(t)
for t in thread_list:
t.join()
result = t.result
for res in result:
print res
PersistenceService.insert_data_from_file(res[0], res[1])
os.remove(res[0])
and thats the "faster" code:
def parse:
thread_list = []
for file_name in cve_file:
t = CVEParser(file_name)
t.start()
t.join()
thread_list.append(t)
for t in thread_list:
result = t.result
for res in result:
print res
PersistenceService.insert_data_from_file(res[0], res[1])
os.remove(res[0])
The sequential code is faster by 10 whole minutes, how is this possible?

Python uses the GIL (Global Interpreter Lock) to ensure only one thread executes Python code at a time. This is done to prevent data races and for some other reasons. That, however, means that multithreading in the default CPython will barely give you any code speedup (if it won't slow it down, as it did in your case).
To efficiently parallelize your workload, look into Python's multiprocessing module, which instead launches separate processes that are not affected by each other's GIL
Here's a SO question on that topic

Where did you read that multi-threading or even multi-processing should be always faster that sequential? That is simply wrong. Which one of the 3 modes is faster highly depends on the problem to solve, and where the bottleneck is.
if the algo needs plenty of memory, or if processing multiple parralel operation requires locking, sequential processing is often the best bet
if the bottleneck is IO, Python multithreading is the way to go: even if only one thread can be active at a time, the others will be waiting for io completion during that time and you will get a much better throughput - even if the really faster way is normally polling io with select when possible
only if the bottleneck is CPU processing - which IMHO is not the most common use case - parallelization over different cores is the winner. In Python that means multi-processing (*). That mainly concerns heavy computations
In your use case, there is one other potential cause: you wait for the threads in sequence in the join part. That means that if thread2 ends much before thread0, you will only process it after thread0 has ended which is subobtimal.
This kind of code is often more efficient because it allows processing as soon as one thread has finished:
active_list = thread_list[:]
while len(active_list) > 0:
for t in active_list:
if not t.is_active():
t.join()
active_list.remove[t]
# process t results
...
time.sleep(0.1)
(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well knows example for that is numpy: complex operations using numpy and executed in multiple threads can actually run simultaneously on different cores. Thechnically this means releasing the Global Interpreter Lock.

If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.
The disk can only really read one at a time, and will have to physically move the read/write head back and forth between them many many times to service different reading threads. This takes a lot longer than actually reading the data, and you will have to wait for it.
If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

Why do we blame GIL if CPU can execute one process (light weight) at a time? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Can standard C Python has more than one thread running at the same time? [duplicate]

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.