Multiple processes with multiple threads in Python

I've heard something about "If you want to get maximum performance from parallel application, you should create as many processes as your computer has CPUs, and in each process -- create some (how many?) threads".
Is it true?
I wrote a piece of code implementing this idiom:
import multiprocessing, threading

number_of_processes = multiprocessing.cpu_count()
number_of_threads_in_process = 25  # some constant

def one_thread():
    # very heavyweight function with lots of CPU/IO/network usage
    do_main_work()

def one_process():
    for _ in range(number_of_threads_in_process):
        t = threading.Thread(target=one_thread, args=())
        t.start()

for _ in range(number_of_processes):
    p = multiprocessing.Process(target=one_process, args=())
    p.start()
Is it correct? Will my do_main_work function really run in parallel, not facing any GIL-issues?
Thank you.

It really depends very much on what you're doing.
Keep in mind that in CPython, only one thread at a time can be executing Python bytecode (because of the GIL). So for a computation-intensive problem in CPython threads won't help you that much.
One way to spread out work that can be done in parallel is to use a multiprocessing.Pool. By default this does not use more processes than your CPU has cores. Using many more processes will mainly have them fighting over resources (CPU, memory) rather than getting useful work done.
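A minimal sketch of that approach (square here is just a hypothetical stand-in for real CPU-bound work):

import multiprocessing

def square(x):
    # stand-in for a real CPU-bound task
    return x * x

if __name__ == '__main__':
    # Pool() with no argument creates os.cpu_count() worker processes
    with multiprocessing.Pool() as pool:
        print(pool.map(square, range(10)))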
But taking advantage of multiple processors requires that you have work for them to do! In other words, if the problem cannot be divided into smaller pieces that can be calculated separately and in parallel, many CPU cores will not be of much use.
Additionally, not all problems are bound by the amount of calculation that has to be done.
The RAM of a computer is much slower than the CPU. If the data-set that you're working on is much bigger than the CPU's caches, reading data from and returning the results to RAM might become the speed limit. This is called memory bound.
And if you are working on much more data than can fit in the machine's memory, your program will be doing a lot of reading and writing from disk. A disk is slow compared to RAM and very slow compared to a CPU, so your program becomes I/O-bound.

# very heavyweight function with lots of CPU/IO/network usage
Lots of CPU will suffer because of GIL, so you'll only get benefit from multiple processes.
IO and network (in fact, network is also a kind of IO) won't be affected much by the GIL, because the lock is explicitly released before a blocking IO operation and obtained again after it completes. There are macro definitions in CPython for this:
Py_BEGIN_ALLOW_THREADS
... Do some blocking I/O operation ...
Py_END_ALLOW_THREADS
There will still be a performance hit because the GIL has to be handled in the wrapping code, but you still get better performance with multiple threads.
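To see this effect without any real network, here is a small sketch where time.sleep stands in for a blocking I/O call (sleep releases the GIL just like real I/O does):

import threading
import time

def fake_io():
    # time.sleep releases the GIL, just like blocking I/O would
    time.sleep(1)

start = time.perf_counter()
threads = [threading.Thread(target=fake_io) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elapsed: %.2fs" % (time.perf_counter() - start))  # ~1s, not ~10s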
Finally, and this is a general rule, not only for Python: the optimal number of threads/processes depends on what the program is actually doing. Generally, if it utilizes the CPU intensively, there is almost no performance boost when the number of processes exceeds the number of CPU cores. For example, the Gentoo documentation says that the optimal number of threads for the compiler is CPU cores + 1.

I think the number of threads you are using per process is too high. Usually, for any Intel processor, the number of hardware threads per core is 2 (hyper-threading), and the number of cores varies from 2 (Intel Core i3) to 6 (Intel Core i7). So at a time when all the processes are running, the maximum number of truly concurrent hardware threads will be 6 * 2 = 12.

Related

Why is ThreadPoolExecutor's default max_workers decided based on the number of CPUs?

The documentation for concurrent.futures.ThreadPoolExecutor says:
Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.
I want to understand why the default max_workers value depends on the number of CPUs. Regardless of how many CPUs I have, only one Python thread can run at any point in time.
Let us assume each thread is I/O intensive and it spends only 10% of its time in the CPU and 90% of its time waiting for I/O. Let us then assume we have 2 CPUs. We can only run 10 threads to utilize 100% CPU. We can't utilize any more CPU because only one thread runs at any point in time. This holds true even if there are 4 CPUs.
So why is the default max_workers decided based on the number of CPUs?
It's a lot easier to check the number of processors than to check how I/O bound your program is, especially at thread pool startup, when your program hasn't really started working yet. There isn't really anything better to base the default on.
Also, adding the default was a pretty low-effort, low-discussion change. (Previously, there was no default.) Trying to get fancy would have been way more work.
That said, getting fancier might pay off. Maybe some kind of dynamic system that adjusts thread count based on load, so you don't have to decide the count at the time when you have the least information. It won't happen unless someone writes it, though.
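For what it's worth, a small sketch showing the default in action (fetch is a hypothetical I/O-bound task simulated with time.sleep); note that since Python 3.8 the default changed again, to min(32, os.cpu_count() + 4):

import concurrent.futures
import os
import time

def fetch(i):
    # pretend network call; the blocking wait releases the GIL
    time.sleep(0.5)
    return i

# max_workers=None means os.cpu_count() * 5 on Python 3.5-3.7
with concurrent.futures.ThreadPoolExecutor() as executor:
    print(list(executor.map(fetch, range(20))))
print("cpu_count:", os.cpu_count())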
The CPython thread implementation is light-weight. It mostly ships the thing to the OS, with some accounting for the GIL (and signal handling). Increasing the number of threads in proportion to cores usually does not work out. Since the threads are managed by the OS, with many cores the OS gets greedy and tries to run as many ready threads as possible on every thread context switch. All of them try to acquire the GIL and only one succeeds. This leads to a lot of waste - worse than the linear calculation of assuming only one thread can run at a given time. If you are using pure CPU-bound threads in the executor, there is no reason to link it to cores because of this. But we should not deprive users who really want the CPU power and are okay with a GIL release to utilise the cores. So, arguably, the default value should be linked to the number of cores in this case - if you assume most people running Python know what they are doing.
Now if the threads in the executor are I/O-bound, then you rightly mentioned the max capacity is 1/p, where p is the fraction of CPU each thread needs. For deciding the default, it is impossible to know p beforehand. The default's assumption of p = 0.2 (5 threads per core) does not look too bad. But usually my guess is that p will be much lower, so the limiting factor may never be the CPU (and if it is, we again get into the CPU-thrashing problem of multiple cores described above). So linking the default to the number of cores will probably not end up being unsafe (unless the threads have heavy processing or you have too many cores!).

Can I run multiprocessing Python programs on a single core machine?

So this is more or less a theoretical question. I have a single core machine which is supposedly powerful but nevertheless only one core. Now I have two choices to make :
Multithreading: As far as I know, I cannot make use of multiple cores in my machine even if I had them, because of the GIL. Hence, in this situation, it does not make any difference.
Multiprocessing: This is where I have a doubt. Can I do multiprocessing on a single core machine? Or do I have to check the cores available in my machine every time and run exactly that number of processes or fewer?
Can someone please guide me on the relation between multiprocessing and cores in a machine.
I know this is a theoretical question but my concepts are not very clear on this.
This is a big topic but here are some pointers.
Think of threads as processes that share the same address space and can access the same memory. Communication is done by shared variables. Multiple threads can run within the same process.
Processes (in this context, and roughly speaking) have their own private data and if two processes want to communicate that communication has to be done more explicitly.
When you are writing a program where the bottleneck is CPU cycles, neither threads nor processes will give you a speedup on a single core machine.
Processes and threads are still useful for multitasking (rapid switching between (sub)programs) - this is what your operating system does because it runs far more processes than you have cores.
Processes and threads (or even coroutines!) can give you considerable speedup even on a single core machine if the tasks you are executing are I/O bound - think of fetching data from a network. For example, instead of actively waiting for data to be sent or to arrive, another process or thread can initiate the next network operation.
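As an illustration of the coroutine case, a minimal asyncio sketch (asyncio.sleep stands in for waiting on the network; requires Python 3.7+ for asyncio.run):

import asyncio

async def fetch(i):
    # the event loop overlaps all ten waits on a single core,
    # with no threads or extra processes involved
    await asyncio.sleep(1)
    return i

async def main():
    print(await asyncio.gather(*(fetch(i) for i in range(10))))

asyncio.run(main())  # finishes in about 1 second, not 10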
Threads are preferable over processes when you don't need explicit encapsulation, due to their lower overhead. For most CPU-bound concurrent problems, and especially the large subset of "embarrassingly parallel" ones, it does not make much sense to spawn more processes than you have processors.
The Python GIL prevents two threads in the same process from running in parallel, i.e. from multiple cores executing instructions literally at the same time.
Therefore threads in Python are relatively useless for speeding up CPU-bound tasks, but can still be very useful for I/O bound tasks, because blocking operations (e.g. waiting for network data) release the GIL such that another thread can run while the other waits.
If you have multiple processors, you can have true parallelism by spawning multiple processes despite the GIL. This is only worth it for CPU bound tasks, and often you have to consider the overhead of spawning processes and the communication cost between processes.
You CAN use both multithreading and multiprocessing on single core systems.
The GIL limits the usefulness of multithreading in pure Python for computation-bound tasks, no matter your underlying architecture. For I/O-bound tasks, threads work perfectly fine. Had they been of no use, they probably would not have been implemented in the first place.
For pure Python software, multiprocessing is always a safer choice when it comes to parallel computing. Of course, multiple processes are more expensive than multiple threads (since processes do not share memory, contrary to threads; also, processes come with slightly higher overhead than threads).
For single processor machines, however, multiprocessing (and multithreading) buys you little to no extra speed for computationally heavy tasks, and may actually even slow you down a bit. But, if the OS supports them (which is pretty common for desktops, workstations, clusters, etc., but may not be common for embedded systems), they allow you to effectively run multiple I/O-bound programs simultaneously.
Long story short, it depends a bit on what you are doing...
The multiprocessing module basically spawns multiple instances of the Python interpreter, so there is no worry about the GIL.
multiprocessing uses the same API as the threading module, if you have used that previously.
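A minimal sketch of that API symmetry (work is just a hypothetical placeholder); this runs fine on a single-core machine too, where the OS simply time-slices the worker process:

import multiprocessing
import threading

def work():
    print("hello from a worker")

if __name__ == '__main__':
    # Process mirrors threading.Thread: same constructor, start() and join()
    p = multiprocessing.Process(target=work)
    t = threading.Thread(target=work)
    p.start(); t.start()
    p.join(); t.join()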
You seem to be confusing multiprocessing, threading (which you refer to as multithreading) and X-core processors.
No matter what, when you start Python (the CPython implementation), it will only use one core of your processor.
Threading distributes the load between the different components of the script. Suppose you have to interact with an external API: your script has to wait for the communication to finish before it can proceed. If you are making multiple similar calls, that takes linear time. Whereas if you use threading, you can make those calls in parallel.
See also: PyPy implementation of Python

When using multiprocessing Pool should the number of worker processes be the same as the number of CPU's or cores?

When using Python multiprocessing Pool should the number of worker processes be the same as the number of CPU's or cores?
This article http://www.howtogeek.com/194756/cpu-basics-multiple-cpus-cores-and-hyper-threading-explained/ says each core is really a central processing unit on a CPU chip. It thus seems like there should not be a problem with having 1 process per core.
E.g. if I have a single CPU chip with 4 cores, can 1 process per core, for a total of 4 processes, be run without the possibility of slowing performance?
From what I've learned regarding python and multiprocessing, the best course of action is...
One process per core, but skip logical ones.
Hyperthreading is no help for python. It'll actually hurt performance in many cases, but test it yourself first of course.
Use the affinity (pip install affinity) module to stick each process to a specific core.
At least, tested extensively on Windows using 32-bit Python: not doing this will hurt performance significantly due to constant thrashing of the cache. And again: skip logical cores! Logical ones, assuming you have an Intel CPU with hyper-threading, are 1, 3, 5, 7, etc.
More threads than real cores will gain you nothing, unless there's also IO happening, which there shouldn't be if you're crunching numbers. Test my claim yourself, especially on Linux, as I didn't get to test on Linux at all.
It really depends on your workload. Case by case, the best approach is to run some benchmark tests and see the results.
Scheduling processes is an expensive operation: the more running processes, the more often you need to switch context.
If most of your processes are not running (because they are waiting for IO, for example), then overcommitting might prove beneficial. Conversely, if your processes are running most of the time, adding more of them to contend for your CPU is going to be detrimental.
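A simple benchmark harness along those lines (crunch is a hypothetical CPU-bound stand-in; adjust the workload to your case):

import multiprocessing
import time

def crunch(n):
    # stand-in for a CPU-bound task
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()
    for workers in (1, 2, cores, 2 * cores):
        start = time.perf_counter()
        with multiprocessing.Pool(processes=workers) as pool:
            pool.map(crunch, [300000] * 16)
        print("%d workers: %.2fs" % (workers, time.perf_counter() - start))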

Why are threads spread between CPUs?

I am trying to get my head around threading vs. CPU usage. There are plenty of discussions about threading vs. multiprocessing (a good overview being this answer) so I decided to test this out by launching a maximum number of threads on my 8 CPU laptop running Windows 10, Python 3.4.
My assumption was that all the threads would be bound to a single CPU.
EDIT: it turns out that this was not a good assumption. I now understand that for multithreaded code, only one piece of Python code can run at once (no matter where/on which core). This is different for multiprocessing code (where processes are independent and indeed run independently).
While I read about these differences, it is one answer which actually clarified this point.
I think it also explains the CPU view below: that it is an average view of many threads spread out over many CPUs, with only one of them running at any given time (which "averages" to all of them running all the time).
It is not a duplicate of the linked question (which addresses the opposite problem, i.e. all threads on one core) and I will leave it up in case someone with a similar question one day is helped by my enlightenment.
The code
import threading
import time

def calc():
    time.sleep(5)
    while True:
        # note: ^ is XOR, not exponentiation, but the loop still spins the CPU
        a = 2356 ^ 36

n = 0
while True:
    try:
        n += 1
        t = threading.Thread(target=calc)
        t.start()
    except RuntimeError:
        print("max threads: {n}".format(n=n))
        break
    else:
        print('.')

time.sleep(100000)
Led to 889 threads being started.
The load on the CPUs was, however, distributed (and surprisingly low for a pure CPU calculation; the laptop is otherwise idle, with an empty load, when not running my script).
Why is it so? Are the threads constantly moved as a pack between CPUs and what I see is just an average (the reality being that at a given moment all threads are on one CPU)? Or are they indeed distributed?
As of today, it is still the case that only one thread at a time holds the GIL, so only one thread is running at any instant.
The threads are managed at the operating system level. In older CPython versions, what happened is that every 100 'ticks' (interpreter instructions) the running thread released the GIL and reset the tick counter; since Python 3.2 the switch happens on a time interval instead (5 ms by default).
Because the threads in this example do continuous calculations, the switch limit is reached very quickly, leading to an almost immediate release of the GIL and a 'battle' between threads to acquire it.
So my assumption is that your operating system shows a higher than expected load because of (too) fast thread switching plus the almost continuous releasing and acquiring of the GIL. The OS spends more time on switching than on actually doing any useful calculation.
As you mention yourself, for using more than one core at a time, it's better to look at multiprocessing modules (joblib/Parallel).
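For reference, on CPython 3.2+ you can inspect and tune the time-based switch interval directly:

import sys

# replaces the old 100-tick check; defaults to 0.005 seconds
print(sys.getswitchinterval())
sys.setswitchinterval(0.01)  # coarser switching, fewer GIL battles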
Interesting read:
http://www.dabeaz.com/python/UnderstandingGIL.pdf
Um. The point of multithreading is to make sure the work gets spread out. A really easy cheat is to use as many threads as you have CPU cores. The point is that they are all independent, so they can actually run at the same time. If they were on the same core, only one thread at a time could really run at all; they'd pass that core back and forth for processing at the OS level.
Your assumption is wrong and bizarre. What would ever lead you to think they should run on the same CPU and consequently go at 1/8th speed? The only reason to thread them is typically to get the whole batch to go faster than a single core alone would.
In fact, what do you think writing parallel code is for, if not to run independently on several cores at the same time? Otherwise it would be pointless and hard to do: complex fetching, branching, and forking routines to accomplish things more slowly than one core just plugging away at the data.

Threads writing each their own file is slower than writing all files sequentially

I am learning about threading in Python, and wrote a short test program which creates 10 csv files and writes 100k lines to each of them. I assumed it would be faster to let 10 threads each write their own file, but for some reason it is 2x slower than simply writing all the files in sequence.
I think this might have to do with the way threading is treated by the OS, but I am not sure. I am running this on Linux.
I would greatly appreciate if someone could shed some light on why this is the case.
Multi-thread version:
import thread, csv

N = 10  # number of threads
exitmutexes = [False] * N

def filewriter(id_):
    with open('files/' + str(id_) + '.csv', 'wb') as f:
        writer = csv.writer(f, delimiter=',')
        for i in xrange(100000):
            writer.writerow(["efweef", "wefwef", "666w6efw", "6555555"])
    exitmutexes[id_] = True

for i in range(N):
    thread.start_new_thread(filewriter, (i,))

while False in exitmutexes:  # checks whether all threads are done
    pass
Note: I have tried to include a sleep in the while-loop so that the main thread is free at intervals, but this had no substantial effect.
Regular version:
import time, csv

for i in range(10):
    with open('files2/' + str(i) + '.csv', 'wb') as f:
        writer = csv.writer(f, delimiter=',')
        for j in xrange(100000):
            writer.writerow(["efweef", "wefwef", "666w6efw", "6555555"])
There are several issues:
Due to the Global Interpreter Lock (GIL), Python will not use more than one CPU core at a time for the data generation part, so that part won't be sped up by running multiple threads. You need multiprocessing to improve a CPU-bound operation.
But that's not really the core of the problem here, because the GIL is released when you do I/O such as writing to disk. The core of the problem is that you're writing to ten different places at a time, which most likely causes the hard disk head to thrash around as it switches between ten different locations on the disk. Serial writes are almost always fastest on a hard disk.
Even if the operation were CPU-bound and you used multiprocessing, ten workers wouldn't give you any significant advantage in data generation unless you actually have ten CPU cores. If you use more threads than the number of CPU cores, you pay the cost of thread switching, but you never speed up the total runtime of a CPU-bound operation.
If you use more threads than available CPUs, the total run time always increases, or at most stays the same. The only reason to use more threads than CPU cores is if you are consuming the results of the threads interactively or in a pipeline with other systems. There are edge cases where you can speed up a poorly designed, I/O-bound program by using threads, but a well-designed single-threaded program will most likely perform just as well or better.
Sounds like the dreaded GIL (Global Interpreter Lock)
"In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.)"
This essentially means that within each Python interpreter (and thus script), no two threads execute bytecode simultaneously, effectively limiting it to one core's worth of work at a time, unless you decide to spawn separate processes.
Consult this page for more details:
https://wiki.python.org/moin/GlobalInterpreterLock
