I am using Python threading to do some jobs at the same time. I leave the main thread to perform task_A, and create one thread to perform task_B at the same time. Below is the simplified version of the code I am working on:
import threading
import numpy as np
def task_B(inc):
for elem in array:
value = elem + inc
if __name__ == '__main__':
array = np.random.rand(10)
t1 = threading.Thread(target=task_B, args=(1))
t1.start()
# task_A
array_copy = list()
for elem in array:
array_copy.append(elem)
t1.join()
I know the above code doesn't do something meaningful. Please think of it as a simplified example. As you can see, variable array is read-only both in the main thread and the newly created thread t1. Therefore, there is no need to lock array in both the main thread and the t1 thread, since none of them modifies (or writes) the variable. However, when I timed the code, it seems that Python threading automatically locks variables that are shared between threads, even though they are read-only. Is there a way to make each thread run simultaneously without locking the read-only variables? I've found this code, but cannot figure out how to apply it to my situation.
You are correct saying that in this case "there is no need for a lock", but the CPython interpreter (that I guess you use to run your Python code) is not that smart.
Python code always execute while holding the GIL, so that both threads execute exclusively from one another (instead of concurrently), although in an interleaved manner (which would not be the case without threads, the execution would be purely sequential).
That's the reason why performance-critical code is often offloaded to other *processes (using the multiprocessing library) or written in Cython (here an example solving a problem similar to yours).
See that question for a little more details on why the GIL is there : Is there a way to release the GIL for pure functions using pure python?.
There is hope that in the future (2022+) the Gil may be relaxed, but for now you are stuck with it, so work around it.
Related
I'm trying to parallelize a for loop to speed-up my code, since the loop processing operations are all independent. Following online tutorials, it seems the standard multiprocessing library in Python is a good start, and I've got this working for basic examples.
However, for my actual use case, I find that parallel processing (using a dual core machine) is actually a little (<5%) slower, when run on Windows. Running the same code on Linux, however, results in a parallel processing speed-up of ~25%, compared to serial execution.
From the docs, I believe this may relate to Window's lack of fork() function, which means the process needs to be initialised fresh each time. However, I don't fully understand this and wonder if anyone can confirm this please?
Particularly,
--> Does this mean that all code in the calling python file gets run for each parallel process on Windows, even initialising classes and importing packages?
--> If so, can this be avoided by somehow passing a copy (e.g. using deepcopy) of the class into the new processes?
--> Are there any tips / other strategies for efficient parallelisation of code design for both unix and windows.
My exact code is long and uses many files, so I have created a pseucode-style example structure which hopefully shows the issue.
# Imports
from my_package import MyClass
imports many other packages / functions
# Initialization (instantiate class and call slow functions that get it ready for processing)
my_class = Class()
my_class.set_up(input1=1, input2=2)
# Define main processing function to be used in loop
def calculation(_input_data):
# Perform some functions on _input_data
......
# Call method of instantiate class to act on data
return my_class.class_func(_input_data)
input_data = np.linspace(0, 1, 50)
output_data = np.zeros_like(input_data)
# For Loop (SERIAL implementation)
for i, x in enumerate(input_data):
output_data[i] = calculation(x)
# PARALLEL implementation (this doesn't work well!)
with multiprocessing.Pool(processes=4) as pool:
results = pool.map_async(calculation, input_data)
results.wait()
output_data = results.get()
EDIT: I do not believe the question is a duplicate of the one suggested, since this relates to a difference in Windows and Linunx, which is not mentioned at all in the suggested duplicate question.
NT Operating Systems lack the UNIX fork primitive. When a new process is created, it starts as a blank process. It's responsibility of the parent to instruct the new process on how to bootstrap.
Python multiprocessing APIs abstracts the process creation trying to give the same feeling for the fork, forkserver and spawn start methods.
When you use the spawn starting method, this is what happens under the hood.
A blank process is created
The blank process starts a brand new Python interpreter
The Python interpreter is given the MFA (Module Function Arguments) you specified via the Process class initializer
The Python interpreter loads the given module resolving all the imports
The target function is looked up within the module and called with the given args and kwargs
The above flow brings few implications.
As you noticed yourself, it is a much more taxing operation compared to fork. That's why you notice such a difference in performance.
As the module gets imported from scratch in the child process, all import side effects are executed anew. This means that constants, global variables, decorators and first level instructions will be executed again.
On the other side, initializations made during the parent process execution will not be propagated to the child. See this example.
This is why in the multiprocessing documentation they added a specific paragraph for Windows in the Programming Guidelines. I highly recommend to read the Programming Guidelines as they already include all the required information to write portable multi-processing code.
for the following script (python 3.6, windows anaconda), I noticed that the libraries are imported as many as the number of the processors were invoked. And print('Hello') are also executed multiple same amount of times.
I thought the processors will only be invoked for func1 call rather than the whole program. The actual func1 is a heavy cpu bounded task which will be executed for millions of times.
Is this the right choice of framework for such task?
import pandas as pd
import numpy as np
from concurrent.futures import ProcessPoolExecutor
print("Hello")
def func1(x):
return x
if __name__ == '__main__':
print(datetime.datetime.now())
print('test start')
with ProcessPoolExecutor() as executor:
results = executor.map(func1, np.arange(1,1000))
for r in results:
print(r)
print('test end')
print(datetime.datetime.now())
concurrent.futures.ProcessPoolExecutor uses the multiprocessing module to do its multiprocessing.
And, as explained in the Programming guidelines, this means you have to protect any top-level code you don't want to run in every process in your __main__ block:
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such a starting a new process).
... one should protect the “entry point” of the program by using if __name__ == '__main__':…
Notice that this is only necessary if using the spawn or forkserver start methods. But if you're on Windows, spawn is the default. And, at any rate, it never hurts to do this, and usually makes the code clearer, so it's worth doing anyway.
You probably don't want to protect your imports this way. After all, the cost of calling import pandas as pd once per core may seem nontrivial, but that only happens at startup, and the cost of running a heavy CPU-bound function millions of times will completely swamp it. (If not, you probably didn't want to use multiprocessing in the first place…) And usually, the same goes for your def and class statements (especially if they're not capturing any closure variables or anything). It's only setup code that's incorrect to run multiple times (like that print('hello') in your example) that needs to be protected.
The examples in the concurrent.futures doc (and in PEP 3148) all handle this by using the "main function" idiom:
def main():
# all of your top-level code goes here
if __name__ == '__main__':
main()
This has the added benefit of turning your top-level globals into locals, to make sure you don't accidentally share them (which can especially be a problem with multiprocessing, where they get actually shared with fork, but copied with spawn, so the same code may work when testing on one platform, but then fail when deployed on the other).
If you want to know why this happens:
With the fork start method, multiprocessing creates each new child process by cloning the parent Python interpreter and then just starting the pool-servicing function up right where you (or concurrent.futures) created the pool. So, top-level code doesn't get re-run.
With the spawn start method, multiprocessing creates each new child process by starting a clean new Python interpreter, importing your code, and then starting the pool-servicing function. So, top-level code gets re-run as part of the import.
I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.
The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.
Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.
Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.
I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")
There were several question on this topic but I couldn't find answer for my questions. Even python docs isn't that descriptive.
My problem is simple: I want to break up a huge list into pieces and process each piece in parallel.
So my question is whether the interpreter waits till all threads are finished before it starts the downstream lines of the program (in my case- consolidation of the processed list) or do I have to define the downstream process as a separate thread and use join.
Although, I read the post on the topic (Thread vs. Threading) I couldn't still much understand what is the difference between thread and threading.
Please direct me to a good text on the topic. The docs are not very informative.
PS (#zzk)
So even if I use multiprocessing, how will I execute a common code after all processes end? For e.g. 5 processes produce 5 lists. And now I have to merge these lists, sort and write to a file.
[the code is not exact and is just for explaining the situation]
def fun(x,y):
y=someprocessing(x) #type(y)=List
if __name__ == '__main__':
for i in listofprocesses:
p = Process(target=fun, args=(i,y))
p.start()
# DOWNSTREAM CODE#
yy=y1+y2+y3+y4+y5;
yy.sort()
for j in yy:
outfile.write(j)
I want to combine y produced from different processes to be merged.
There are two doubts here:
since the variable name is the same, do I have to pass the output list (y) as an argument
Assuming so, and all the processed lists are saved as y1,y2,y3,y4& y5, will the downstream code be executed. How to make sure that all the processes have ended?
Threading or thread won't help you help due to the GIL.
In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe.
You may need multiprocessing
I am attempting to solve multiple linear systems using python and scipy using threads. I am an absolute beginner when it comes to python threads. I have attached code which distils what I'm trying to accomplish. This code works but the execution time actually increases as one increases totalThreads. My guess is that spsolve is being treated as a critical section and is not actually being run concurrently.
My questions are as follows:
Is spsolve thread-safe?
If spsolve is blocking, is there a way around it?
Is there another linear solver package that I can be using which parallelizes better?
Is there a better way to write this code segment that will increase performance?
I have been searching the web for answers but with no luck. Perhaps, I am just using the wrong keywords. Thanks for everyone's help.
def Worker(threadnum, totalThreads):
for i in range(threadnum,N,totalThreads):
x[:,i] = sparse.linalg.spsolve( A, b[:,i] )
threads = []
for threadnum in range(totalThreads):
t = threading.Thread(target=Worker, args=(threadnum, totalThreads))
threads.append(t)
t.start()
for threadnum in range(totalThreads): threads[threadnum].join()
The first thing you should understand is that, counterintuitively, Python's threading module will not let you take advantage of multiple cores. This is due to something called the Global Interpreter Lock (GIL), which is a critical part of the standard cPython implementation. See here for more info: What is a global interpreter lock (GIL)?
You should consider using the multiprocessing module instead, which gets around the GIL by spinning up multiple independent Python processes. This is a bit more difficult to work with, because different processes have different memory spaces, so you can't just share a thread-safe object between all processes and expect the object to stay synchronized between all processes. Here is a great introduction to multiprocessing: http://www.doughellmann.com/PyMOTW/multiprocessing/