Python multiprocessing.Pool() doesn't use 100% of each CPU - python

I am working on multiprocessing in Python.
For example, consider the example given in the Python multiprocessing documentation (I have changed 100 to 1000000 in the example, just to consume more time). When I run this, I do see that Pool() is using all 4 processes, but I don't see each CPU moving up to 100%. How can I get each CPU to 100% usage?
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(f, range(10000000))

It is because multiprocessing requires interprocess communication between the main process and the worker processes behind the scenes, and that communication overhead takes more (wall-clock) time than the "actual" computation (x * x) in your case.
Try a "heavier" computation kernel instead, like

import math
from functools import reduce  # reduce is no longer a builtin in Python 3

def f(x):
    # assumes x > 0, since the first step computes math.log(x + 0)
    return reduce(lambda a, b: math.log(a + b), range(10**5), x)
Update (clarification)
I pointed out that the low CPU usage observed by the OP was due to the IPC overhead inherent in multiprocessing, but the OP didn't need to worry about it too much, because the original computation kernel was way too "light" to be used as a benchmark. In other words, multiprocessing performs worst with such an overly "light" kernel. If the OP implements real-world logic (which, I'm sure, will be somewhat "heavier" than x * x) on top of multiprocessing, the OP will achieve decent efficiency, I assure you. My argument is backed up by the experiment with the "heavy" kernel I presented above.
@FilipMalczak, I hope my clarification makes sense to you.
By the way, there are some ways to improve the efficiency of x * x while using multiprocessing. For example, we can combine 1,000 jobs into one before submitting them to Pool, unless we are required to solve each job in real time (e.g. if you implement a REST API server, we shouldn't batch this way).

You're asking the wrong kind of question. multiprocessing.Process represents a process as understood by your operating system. multiprocessing.Pool is just a simple way to run several processes to do your work. The Python environment has nothing to do with balancing load across cores/processors.
If you want to control how processor time is given to processes, you should try tweaking your OS, not the Python interpreter.
Of course, "heavier" computations will be recognised by the system, and may look like they do just what you want, but in fact you have almost no control over process handling.
"Heavier" functions will just look heavier to your OS, and its usual reaction will be to assign more processor time to your processes. But that doesn't mean you did what you wanted to, and that's good, because that's the whole point of languages with a VM: you specify logic, and the VM takes care of mapping that logic onto the operating system.

Related

Can I apply multithreading for computationally intensive task in python?

Update: To save your time, I give the answer directly here. Python cannot utilize multiple CPU cores at the same time if you use pure Python to write your code. But Python can utilize multiple cores at the same time when it calls functions or packages written in C, like Numpy, etc.
I have heard that "multithreading in Python is not real multithreading, because of the GIL". And I have also heard that "Python multithreading is okay for handling IO-intensive tasks rather than computationally intensive tasks, because only one thread runs at a time".
But my experience made me rethink this question. My experience shows that even for a computationally intensive task, Python multithreading can accelerate the computation nearly linearly. (Before multithreading, it took 300 seconds to run the following program; with multithreading, it took 100 seconds.)
The following figures show that 5 threads were created by Python (CPython) using the threading package, and all the CPU cores are at nearly 100% usage.
I think the screenshots prove that the 5 CPU cores are running at the same time.
So can anyone give me an explanation? Can I apply multithreading to a computationally intensive task in Python? Or can multiple threads/cores run at the same time in Python?
My code:
import threading
import time
import numpy as np
from scipy import interpolate

number_list = list(range(10))

def image_interpolation():
    while True:
        number = None
        # NB: this creates a fresh Lock on every pass, so it does not
        # actually serialize access to number_list between threads
        with threading.Lock():
            if len(number_list):
                number = number_list.pop()
        if number is not None:
            # Make a fake image - you can use yours.
            image = np.ones((20000, 20000))
            # Make your orig array (skipping the extra dimensions).
            orig = np.random.rand(12800, 16000)
            # Make its coordinates; x is horizontal.
            x = np.linspace(0, image.shape[1], orig.shape[1])
            y = np.linspace(0, image.shape[0], orig.shape[0])
            # Make the interpolator function.
            f = interpolate.interp2d(x, y, orig, kind='linear')
        else:
            return 1

workers = 5

thd_list = []
t1 = time.time()
for i in range(workers):
    thd = threading.Thread(target=image_interpolation)
    thd.start()
    thd_list.append(thd)
for thd in thd_list:
    thd.join()
t2 = time.time()
print("total time cost with multithreading: " + str(t2 - t1))

number_list = list(range(10))
for i in range(10):
    image_interpolation()
t3 = time.time()
print("total time cost without multithreading: " + str(t3 - t2))
output is:
total time cost with multithreading: 112.71922039985657
total time cost without multithreading: 328.45561170578003
screenshot of top during multithreading
screenshot of top -H during multithreading
screenshot of top then press 1. during multithreading
screenshot of top -H without multithreading
As you mentioned, Python has a "global interpreter lock" (GIL) that prevents two threads of Python code running at the same time. The reason that multi-threading can speed up IO bound tasks is that Python releases the GIL when, for example, listening on a network socket or waiting for a disk read. So the GIL does not prevent two lots of work being done simultaneously by your computer, it prevents two Python threads in the same Python process being run simultaneously.
In your example, you use numpy and scipy. These are largely written in C and utilise libraries (BLAS, LAPACK, etc.) written in C/Fortran/Assembly. When you perform operations on numpy arrays, it is akin to listening on a socket in that the GIL is released. When the GIL is released and the numpy array operations are called, numpy gets to decide how to perform the work. If it wants, it can spawn other threads or processes, and the BLAS subroutines it calls might spawn other threads. Precisely if/how this is done can be configured at build time if you want to compile numpy from source.
So, to summarise, you have found an exception to the rule. If you were to repeat the experiment using only pure Python functions, you would get quite different results (e.g. see the "Comparison" section of the page linked to above).
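A small sketch of the effect (assuming numpy is installed): several threads each run a numpy matrix product. Because the C/BLAS code releases the GIL, the threads can occupy different cores at once, yet each still produces the correct result:

```python
import threading
import numpy as np

results = [None] * 4

def work(i):
    a = np.ones((200, 200))
    # np.dot executes in C/BLAS with the GIL released, so these
    # threads are not serialized the way pure-Python loops would be
    results[i] = float(np.dot(a, a).sum())

threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Each dot product of two 200x200 all-ones matrices has entries equal to 200, so every thread should report 200 * 200 * 200 = 8000000.0.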
Python threading is real threading; it's just that no two threads can be in the interpreter at once (and this is what the GIL is about). The native part of the code can run in parallel without contention on multiple threads; only when diving back into the interpreter do they have to serialize among each other.
The fact that you have all CPU cores loaded to 100% is not, by itself, proof that you are using the machine "efficiently". You need to make sure that the CPU usage is not due to context switching.
If you switch to multiprocessing instead of threading (they are very similar), you won't have to second-guess this, but then you'll have to marshal the payload when passing it between processes.
So you need to measure everything anyway.

Thread does not run fast enough when other applications are running

I am using the threading module to run a function in the background while the rest of my script executes. The threaded function contains a for loop which waits for external 5 volt triggers, occurring every 15 ms, before continuing to the next loop iteration.
When this code is the only thing running on the PC, everything works as expected. However, when I run other necessary applications, putting strain on the CPU, the for loop in the threaded function only executes and continues to the next iteration within the 15 ms time window about 90% of the time.
The input to the threaded function is a list of ctypes pointers.
I am running the threaded function from within a class, so using multiprocessing is tricky (I'm not sure if that would help at all).
I've tried to illustrate the problem below with a skeleton of the two classes
import ctypes
import time
from ctypes import POINTER, c_ubyte, c_int, c_bool, c_uint
from threading import Thread

import Write_transient_frames_func
import SendScriptCommands

class SlmInterface():
    def __init__(self, sdk):
        self.sdk = sdk

    def precalculate_masks(self, mask_list):
        '''takes input mask_list, a list of numpy arrays containing phase masks
        outputs pointers to memory location of masks
        '''
        # list of pointers to locations of phase mask arrays in memory
        mask_pointers = [mask.ctypes.data_as(POINTER(c_ubyte)) for mask in mask_list]
        return mask_pointers

    def load_precalculated_triggered(self, mask_pointers):
        okay = True
        print('Ready to trigger')
        for arr in mask_pointers:
            okay = self.Write_transient_frames_func(self.sdk, c_int(1), arr, c_bool(1), c_bool(1), c_uint(0))
            assert okay, 'Failed to write frames to board'
        print('completed trigger sequence')

class Experiment():
    def run_experiment(self, sdk, mask_list):
        slm = SlmInterface(sdk)
        # list of ctypes pointers
        mask_pointers = slm.precalculate_masks(mask_list)
        ## the threaded function
        slm_thread = Thread(target=slm.load_precalculated_triggered, args=[mask_pointers])
        slm_thread.start()
        time.sleep(0.1)
        # this function loads the 15 ms trigger sequences to the hardware and begins the sequence
        self.mp_output = SendScriptCommands()
Is it possible to speed up execution of the threaded function? Would parallel processing help? Or am I fundamentally limited by my CPU?
Unfortunately, Python will likely not be able to do much better. Python has a global interpreter lock, which means that multithreading doesn't work the way it does in other languages.
You should be aware that multithreading in Python can make the application run slower. A good alternative is asyncio, because it allows cooperative multitasking of several tasks within one thread (the OS doesn't need to actually switch threads, hence less overhead and faster execution). If you haven't used it before, it's kind of weird at first, but it's actually really nice.
However, your task really seems to be cpu bound. So maybe the only option is multiprocessing in python.
Probably Python isn't really the culprit here. The point is, with general-purpose, preemptive, multiuser operating systems, you are not going to get a guarantee of running continuously enough to catch triggers every 15 ms. CPU time is allocated in quanta of generally some tens of ms, and the OS can (and will) let your thread run more or less frequently depending on the CPU load, in an effort to give each process its fair share of available CPU time.
You may increase the priority of your thread to ask for it to have the precedence over the others, or, in the extreme case, change it to real-time priority to let it hog the CPU indefinitely (and potentially hang the system if stuff goes awry).
But really, the actual solution is to handle this at a lower level, either in kernel mode or in hardware. Polling at those rates from user mode is inadvisable if you cannot miss a signal, so you should probably investigate whether your hardware/driver provides some higher-level interface, for example an interrupt (translated e.g. into unblocking some blocking call, or producing a signal) on trigger.
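As a toy illustration of the blocking-call idea (POSIX-only; the pipe here is a made-up stand-in for whatever notification mechanism a real driver would expose): the waiting thread blocks in the kernel and consumes no CPU until the "trigger" arrives, instead of polling in a loop.

```python
import os
import selectors
import threading
import time

# Simulated trigger source: a pipe whose write end the "driver" signals.
r, w = os.pipe()
sel = selectors.DefaultSelector()
sel.register(r, selectors.EVENT_READ)

def fake_trigger():
    time.sleep(0.05)
    os.write(w, b"\x01")  # the driver signals a trigger

threading.Thread(target=fake_trigger).start()

events = sel.select(timeout=1.0)   # blocks in the kernel until the trigger
got = os.read(r, 1) if events else b""
print("trigger received:", got == b"\x01")
```

A real driver would more likely expose a blocking read, a semaphore, or a callback, but the principle is the same: let the kernel wake you, don't spin.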

Extracting outputs from a multiprocessed python function

I am wondering how to extract outputs from a multiprocessed function in Python. I am new to multiprocessing and have limited understanding of how it all works (not for lack of trying though).
I need to run the optimization with 31 different inputs for InfForecast and InitialStorage (for now... could be up to 10,000 inputs and independent optimizations being performed). I was hoping I could speed things up using multiprocessing to process more than one of these independent optimizations at a time. What I want is for the outputs (5 values for each optimization) to be put into the array "Nextday" which should have dimensions of (5,31). It seems the output Nextday as I've got the code written is either empty or not accessible. How do I extract/access the values and place them into Nextday?
Note: The function main(...) is a highly complex optimization problem. I hope the problem is easy enough to understand without providing it. It works when I loop over it and call it for each i in range(31).
from multiprocessing.pool import ThreadPool as Pool

Nextday = np.zeros((5,31))
pool_size = 4  # Should I set this to the number of cores my machine has?
pool = Pool(pool_size)

def optimizer(InfForecast, InitialStorage):
    O = main(InfForecast, InitialStorage)
    return [O[0][0], O[0][1], O[0][2], O[0][3], O[0][4]]

for i in range(31):
    pool.apply_async(optimizer, (InfForecast[i], InitialStorage[i]))

pool.close()
Nextday = pool.join()
In addition to this, I'm not sure whether this is the best way to do things. If it's working (which I'm not sure it is) it sure seems slow. I was reading that it may be better to do multiprocessing vs threading and this seems to be threading? Forgive me if I'm wrong.
I am also curious about how to select pool_size as you can see in my comment in the code. I may be running this on a cloud server eventually, so I expect the pool_size I would want to use there would be slightly different than the number I will be using on my own machine. Is it just the number of cores?
Any advice would be appreciated.
You should use
from multiprocessing.pool import Pool
if you want to do multiprocessing.
Pool size should start out as multiprocessing.cpu_count() if you have the machine to yourself, and adjusted manually for best effect. If your processes are cpu-bound, then leaving a core available will make your machine more responsive -- if your code is not cpu-bound you can have more processes than cores (tuning this is finicky though, you'll just have to try).
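A small sketch of that starting-point advice (the right numbers are workload-dependent, so treat these as rules of thumb, not constants):

```python
import multiprocessing

n_cores = multiprocessing.cpu_count()  # logical cores visible to the OS

# CPU-bound work: one worker per core, or n_cores - 1 to keep the
# machine responsive while the pool is busy.
cpu_bound_workers = max(1, n_cores - 1)

# IO-bound work: workers can exceed the core count, since most of
# them sit waiting on IO rather than burning CPU.
io_bound_workers = n_cores * 2

print(n_cores, cpu_bound_workers, io_bound_workers)
```

From there, benchmark: adjust the pool size up or down and measure wall-clock time for your actual job.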
You shouldn't have any code at the top-most level in your file when doing multiprocessing (or any other time really). Put everything in functions and call the start function from:
if __name__ == "__main__":
my_start_function()
(digression: using capital oh as a variable name is really bad, and you get statements that are almost unreadable in certain fonts like O[0][0]).
In regular python, the map function is "defined" by this equality:
map(fn, lst) == [fn(item) for item in lst]
so the Pool methods (imap/imap_unordered/map/map_async) have similar semantics. Since optimizer takes two arguments, you would use starmap, which unpacks each tuple, and call it like:
def my_start_function():
    ...
    results = pool.starmap(optimizer, zip(InfForecast, InitialStorage))
Since these functions take a function and a list, I've used the zip function to create a list where each item has one element from each of its arguments (it works like a zipper).

Multiprocessing: use only the physical cores?

I have a function foo which consumes a lot of memory and which I would like to run several instances of in parallel.
Suppose I have a CPU with 4 physical cores, each with two logical cores.
My system has enough memory to accommodate 4 instances of foo in parallel but not 8. Moreover, since 4 of these 8 cores are logical ones anyway, I also do not expect using all 8 cores will provide much gains above and beyond using the 4 physical ones only.
So I want to run foo on the 4 physical cores only. In other words, I would like to ensure that doing multiprocessing.Pool(4) (4 being the maximum number of concurrent run of the function I can accommodate on this machine due to memory limitations) dispatches the job to the four physical cores (and not, for example, to a combo of two physical cores and their two logical offsprings).
How do I do that in Python?
Edit:
I earlier used a code example from multiprocessing, but I am library-agnostic, so to avoid confusion, I removed it.
I know the topic is quite old now, but as it still appears as the first answer when typing 'multiprocessing logical core' into Google... I feel I have to give an additional answer, because I can see that it would be possible for people in 2018 (or even later...) to get easily confused here (some answers are indeed a little bit confusing).
I can see no better place than here to warn readers about some of the answers above, so sorry for bringing the topic back to life.
--> TO COUNT THE CPUs (LOGICAL/PHYSICAL), USE THE PSUTIL MODULE
For a 4-physical-core / 8-thread i7, for example, it will return:
import psutil

psutil.cpu_count(logical=False)  # 4
psutil.cpu_count(logical=True)   # 8
As simple as that.
There you won't have to worry about the OS, the platform, the hardware itself or whatever. I am convinced it is much better than multiprocessing.cpu_count() which can sometimes give weird results, from my own experience at least.
--> TO USE N PHYSICAL CORES (up to your choice) USE THE MULTIPROCESSING MODULE DESCRIBED BY YUGI
Just count how many physical cores you have and launch a multiprocessing.Pool with that many workers.
Or you can also try to use the joblib.Parallel() function
joblib in 2018 is not part of the standard distribution of Python; it is just a wrapper around the multiprocessing module, as described by Yugi.
--> MOST OF THE TIME, DON'T USE MORE CORES THAN AVAILABLE (unless you have benchmarked a very specific code and proved it was worth it)
Misinformation abounds that "the OS will handle things if you specify more cores than are available". It is absolutely 100% false. If you use more cores than available, you will face huge performance drops. The exception would be if the worker processes are IO bound. Because the OS scheduler will try its best to work on every task with the same attention, switching regularly from one to another, and depending on the OS, it can spend up to 100% of its working time to just switching between processes, which would be disastrous.
Don't just trust me: try it, benchmark it, you will see how clear it is.
IS IT POSSIBLE TO DECIDE WHETHER THE CODE WILL BE EXECUTED ON LOGICAL OR PHYSICAL CORE?
If you are asking this question, this means you don't understand the way physical and logical cores are designed, so maybe you should check a little bit more about a processor's architecture.
If you want to run on core 3 rather than core 1, for example, well, I guess there are indeed some solutions, but they are available only if you know how to code an OS kernel and scheduler, which I don't think is the case if you're asking this question.
If you launch 4 CPU-intensive processes on a 4-physical / 8-logical processor, the scheduler will attribute each of your processes to 1 distinct physical core (and 4 logical cores will remain unused or poorly used). But on an 8-thread processor whose processing units are paired as (0,1) (2,3) (4,5) (6,7), it makes no difference whether a process is executed on 0 or 1: it is the same processing unit.
From my knowledge at least (though an expert could confirm, and it may also differ with very specific hardware), I think there is no or very little difference between executing code on 0 or 1. In the processing unit (0,1), I am not sure that 0 is the logical core while 1 is the physical one, or vice versa. From my understanding (which may be wrong), both are processors of the same processing unit; they just share their cache memory and access to the hardware (RAM included), and 0 is no more a physical unit than 1.
More than that, you should let the OS decide: the OS scheduler can take advantage of hardware turbo boost features that exist on some platforms (e.g. i7, i5, i3), something else you have no power over, and that could be truly helpful to you.
If you launch 5 CPU-intensive tasks on a 4-physical / 8-logical core machine, the behaviour will be chaotic, almost unpredictable, mostly dependent on your hardware and OS. The scheduler will try its best. Almost every time, you will face really bad performance.
Let's presume for a moment that we are still talking about a classical 4(8) architecture: because the scheduler tries its best (and therefore often switches the attributions), depending on the process you are executing, it could be even worse to launch on 5 logical cores than on 8 logical cores (where at least it knows everything will be used at 100% anyway, so, lost for lost, it won't try much to avoid that, won't switch too often, and therefore won't lose too much time switching).
It is 99% sure however (benchmark it on your hardware to be certain) that almost any multiprocessing program will run slower if you run more CPU-bound processes than you have physical cores.
A lot of things can intervene... the program, the hardware, the state of the OS, the scheduler it uses, the fruit you ate this morning, your sister's name... If in doubt, just benchmark it; there is no other easy way to see whether you are losing performance. Sometimes computing can be really weird.
--> MOST OF THE TIME, ADDITIONAL LOGICAL CORES ARE INDEED USELESS IN PYTHON (but not always)
There are 2 main ways of doing really parallel tasks in python.
multiprocessing (cannot take advantage of logical cores)
multithreading (can take advantage of logical cores)
For example to run 4 tasks in parallel
--> multiprocessing will create 4 different Python interpreters. For each of them you have to start a Python interpreter, define the reading/writing rights, define the environment, allocate a lot of memory, etc. Let's say it as it is: you will start a whole new program instance from scratch. It can take a huge amount of time, so you have to be sure that this new program will work long enough to make it worth it.
If your program has enough work (say, a few seconds of work at least), then because the OS allocates CPU-consuming processes to different physical cores, it works, and you can gain a lot of performance, which is great. And because the OS almost always allows processes to communicate with each other (although it is slow), they can even exchange (a little bit of) data.
--> multithreading is different. Within your Python interpreter, it will just create a small amount of memory that the threads share and work on at the same time. It is WAY quicker to spawn (where spawning a new process on an old computer can sometimes take seconds, spawning a thread is done in a ridiculously small fraction of that time). You don't create new processes, but "threads", which are much lighter.
Threads can share memory very quickly, because they literally work together on the same memory (while it has to be copied/exchanged when working with different processes).
BUT: WHY CAN'T WE USE MULTITHREADING IN MOST SITUATIONS? IT LOOKS VERY CONVENIENT!
There is a very BIG limitation in Python: only one line of Python bytecode can be executed at a time in a Python interpreter. This is the GIL (Global Interpreter Lock). So most of the time, you will even LOSE performance by using multithreading, because different threads have to wait for access to the same resources. For pure computational processing (with no IO), multithreading is USELESS, and even WORSE, if your code is pure Python. However, if your threads involve any waiting for IO, multithreading can be very beneficial.
--> WHY SHOULDN'T I USE LOGICAL CORES WHEN USING MULTIPROCESSING ?
Logical cores don't have their own memory access. They can only work through the memory access and cache of their host physical core. For example, it is very likely (and often done indeed) that the logical and the physical core of the same processing unit both use the same C/C++ function on different parts of the cache memory at the same time, making the processing hugely faster.
But... these are C/C++ functions! Python is a big C/C++ wrapper that needs much more memory and CPU than its equivalent C++ code. It is very likely in 2018 that, whatever you want to do, 2 big Python processes will need much, much more memory and cache reading/writing than what a single physical+logical unit can afford, and much more than what the equivalent truly-multithreaded C/C++ code would consume. This, once again, would almost always cause performance to drop. Remember that every variable that is not available in the processor's cache takes on the order of 1000x longer to read from main memory. If your cache is already completely full for 1 single Python process, guess what happens if you force 2 processes to use it: they will use it one at a time, switching permanently, causing data to be stupidly flushed and re-read every time it switches. When data is being read from or written to memory, you might think that your CPU "is" working, but it's not. It's waiting for the data! Doing nothing.
--> HOW CAN YOU TAKE ADVANTAGE OF LOGICAL CORES THEN ?
Like I said, there is no true multithreading (so no true usage of logical cores) in default Python, because of the global interpreter lock. You can force the GIL to be released during some parts of a program, but I think it would be wise advice not to touch it if you don't know exactly what you are doing.
Removing the GIL has definitely been the subject of a lot of research (see for example the experimental PyPy and Cython projects, which both explore ways around it).
For now, no real solution exists, as it is a much more complex problem than it seems.
There is, I admit, another solution that can work:
Code your function in C
Wrap it in Python with ctypes
Use the Python threading module to call your wrapped C function
This works 100%, and you will be able to use all the logical cores, in Python, with multithreading, for real. The GIL won't bother you, because you won't be executing true Python functions but C functions instead.
For example, some libraries like Numpy can work on all available threads, because they are coded in C. But if you get to this point, I have always thought it might be wiser to write your program in C/C++ directly, because that is a consideration very far from the original pythonic spirit.
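You don't always have to write the C yourself, either: some standard-library functions are already C code that drops the GIL. CPython's hashlib, for example, documents that it releases the GIL when hashing data larger than 2047 bytes, so plain threads hashing large buffers can genuinely run on several cores at once. A minimal sketch:

```python
import hashlib
import threading

data = b"x" * (8 * 1024 * 1024)  # 8 MiB, far above the 2047-byte threshold
digests = [None] * 4

def hash_it(i):
    # sha256 runs in C with the GIL released for large inputs,
    # so these threads are not serialized by the interpreter
    digests[i] = hashlib.sha256(data).hexdigest()

threads = [threading.Thread(target=hash_it, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(digests[0][:16])
```

All four threads compute the same digest here, purely to keep the example checkable; in real code each thread would of course hash different data.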
--> DON'T ALWAYS USE ALL AVAILABLE PHYSICAL CORES
I often see people say "OK, I have 8 physical cores, so I'll take 8 cores for my job". It often works, but it sometimes turns out to be a poor idea, especially if your job needs a lot of I/O.
Try with N-1 cores (again, especially for highly I/O-demanding tasks), and you will see that, per task on average, single tasks almost always run faster on N-1 cores. Indeed, your computer does a lot of different things: USB, mouse, keyboard, network, hard drive, etc. Even on a workstation, periodic tasks you have no idea about run in the background all the time. If you don't leave 1 physical core to manage those tasks, your calculation will be regularly interrupted (flushed out of memory / placed back in memory), which can also lead to performance issues.
You might think "Well, background tasks will use only 5% of CPU-time so there is 95% left". But it's not the case.
The processor handles one task at a time. Every time it switches, a considerably high amount of time is wasted putting everything back in its place in the memory caches/registers. If for some weird reason the OS scheduler does this switching too often (something you have no control over), all of this computing time is lost forever, and there's nothing you can do about it.
If (and it sometimes happens) for some unknown reason this scheduler problem impacts the performance of not 1 but 30 tasks, it can result in really intriguing situations where running on 29/30 physical cores can be significantly faster than on 30/30.
MORE CPU IS NOT ALWAYS THE BEST
It is very frequent, when you use a multiprocessing.Pool, to use a multiprocessing.Queue or manager queue, shared between processes, to allow some basic communication between them. Sometimes (I must have said 100 times but I repeat it), in an hardware-dependent manner, it can occur (but you should benchmark it for your specific application, your code implementation and your hardware) that using more CPU might create a bottleneck when you make processes communicate / synchronize. In those specific cases, it could be interesting to run on a lower CPU number, or even try to deport the synchronization task on a faster processor (here I'm talking about scientific intensive calculation ran on a cluster of course). As multiprocessing is often meant to be used on clusters, you have to notice that clusters often are underclocked in frequency for energy-saving purposes. Because of that, single-core performances can be really bad (balanced by a way-much higher number of CPUs), making the problem even worse when you scale your code from your local computer (few cores, high single-core performance) to a cluster (lot of cores, lower single-core performance), because your code bottleneck according to the single_core_perf/nb_cpu ratio, making it sometimes really annoying
Everyone is tempted to use as many CPUs as possible. But a benchmark for those cases is mandatory.
The typical case (in data science, for example) is to have N processes running in parallel while you want to summarize the results in one file. Because you cannot wait for the whole job to be done, you do it through a dedicated writer process. The writer writes to the output file everything that is pushed into its multiprocessing.Queue (a single-core, hard-drive-limited process), while the N processes fill that multiprocessing.Queue.
It is easy then to imagine that if you have 31 CPUs writing information to one really slow CPU, your performance will drop (and possibly something will crash if you exceed the system's capacity to handle the temporary data).
--> Take-home message
Use psutil to count logical/physical processors, rather than multiprocessing.cpu_count() or anything else
Multiprocessing can only work on physical cores (or at least benchmark it to prove this is not true in your case)
Multithreading will work on logical cores, BUT you will have to code and wrap your functions in C, or remove the global interpreter lock (and every time you do so, a kitten dies atrociously somewhere in the world)
If you are trying to run multithreading on pure Python code, you will see huge performance drops, so 99% of the time you should use multiprocessing instead
Unless your processes/threads have long pauses that you can exploit, never use more cores than are available, and benchmark properly if you want to try
If your task is I/O intensive, you should leave 1 physical core to handle the I/O; if you have enough physical cores, it will be worth it. For multiprocessing implementations, that means using N-1 physical cores. For classical 2-way multithreading, it means using N-2 logical cores.
If you need more performance, try PyPy (not production ready) or Cython, or even write it in C
Last but not least, the most important of all: if you are really seeking performance, you should absolutely, always benchmark, and never guess. Benchmarks often reveal strange platform/hardware/driver-specific behaviour that you would have no idea about.
Note: This approach doesn't work on Windows, and it has been tested only on Linux.
Using multiprocessing.Process:
Assigning a physical core to each process is quite easy when using Process(). You can create a for loop that iterates through each core and assigns the new process to it using taskset -p [mask] [pid]:
import multiprocessing
import os

def foo():
    return

if __name__ == "__main__" :
    for process_idx in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=foo)
        # taskset is applied to os.getpid() (the parent), so the child
        # forked by the following p.start() inherits this affinity
        os.system("taskset -p -c %d %d" % (process_idx % multiprocessing.cpu_count(), os.getpid()))
        p.start()
I have 32 cores on my workstation so I'll put partial results here:
pid 520811's current affinity list: 0-31
pid 520811's new affinity list: 0
pid 520811's current affinity list: 0
pid 520811's new affinity list: 1
pid 520811's current affinity list: 1
pid 520811's new affinity list: 2
pid 520811's current affinity list: 2
pid 520811's new affinity list: 3
pid 520811's current affinity list: 3
pid 520811's new affinity list: 4
pid 520811's current affinity list: 4
pid 520811's new affinity list: 5
...
As you can see, taskset prints the previous and new affinity of each process. The first process starts with affinity for all cores (0-31) and is then assigned to core 0; the second process is by default assigned to core 0 and then has its affinity changed to the next core (1), and so forth.
Using multiprocessing.Pool:
Warning: This approach requires tweaking the pool.py module, since there is no way that I know of to extract the pid from Pool(). Also, these changes have been tested on Python 2.7 and multiprocessing.__version__ = '0.70a1'.
In pool.py, find the line where the _task_handler_start() method is being called. In the next line, you can assign the processes in the pool to each "physical" core using the following (I put the import os there so that the reader doesn't forget to import it):
import os
for worker in range(len(self._pool)):
    p = self._pool[worker]
    os.system("taskset -p -c %d %d" % (worker % cpu_count(), p.pid))
and you're done. Test:
import multiprocessing

def foo(i):
    return

if __name__ == "__main__":
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    pool.map(foo, 'iterable here')
result:
pid 524730's current affinity list: 0-31
pid 524730's new affinity list: 0
pid 524731's current affinity list: 0-31
pid 524731's new affinity list: 1
pid 524732's current affinity list: 0-31
pid 524732's new affinity list: 2
pid 524733's current affinity list: 0-31
pid 524733's new affinity list: 3
pid 524734's current affinity list: 0-31
pid 524734's new affinity list: 4
pid 524735's current affinity list: 0-31
pid 524735's new affinity list: 5
...
Note that this modification to pool.py assigns the jobs to the cores in round-robin fashion. So if you assign more jobs than there are CPU cores, you will end up with several of them on the same core.
EDIT:
What the OP is looking for is a Pool() that is capable of starting the pool on specific cores. For this, more tweaks to multiprocessing are needed (undo the above-mentioned changes first).
Warning:
Don't copy-paste the function definitions and function calls. Only copy-paste the part that is supposed to be added after self._worker_handler.start() (you'll see it below). Note that my multiprocessing.__version__ tells me the version is '0.70a1', but that doesn't matter as long as you just add what you need to add:
multiprocessing's pool.py:
add a cores_idx=None argument to the __init__() definition. In my version it looks like this after adding it:
def __init__(self, processes=None, initializer=None, initargs=(),
             maxtasksperchild=None, cores_idx=None):
Also, add the following code after self._worker_handler.start():
if cores_idx is not None:
    import os
    for worker in range(len(self._pool)):
        p = self._pool[worker]
        os.system("taskset -p -c %d %d" % (cores_idx[worker % len(cores_idx)], p.pid))
multiprocessing's __init__.py:
Add a cores_idx=None argument to the definition of Pool(), as well as to the Pool() call in the return part. In my version it looks like:
def Pool(processes=None, initializer=None, initargs=(), maxtasksperchild=None, cores_idx=None):
    '''
    Returns a process pool object
    '''
    from multiprocessing.pool import Pool
    return Pool(processes, initializer, initargs, maxtasksperchild, cores_idx)
And you're done. The following example runs a pool of 5 workers on cores 0 and 2 only:
import multiprocessing

def foo(i):
    return

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=5, cores_idx=[0, 2])
    pool.map(foo, 'iterable here')
result:
pid 705235's current affinity list: 0-31
pid 705235's new affinity list: 0
pid 705236's current affinity list: 0-31
pid 705236's new affinity list: 2
pid 705237's current affinity list: 0-31
pid 705237's new affinity list: 0
pid 705238's current affinity list: 0-31
pid 705238's new affinity list: 2
pid 705239's current affinity list: 0-31
pid 705239's new affinity list: 0
Of course you can still have the usual functionality of multiprocessing.Pool() as well, by omitting the cores_idx argument.
I found a solution that doesn't involve changing the source code of a Python module. It uses the approach suggested here. After running that script, one can check that only the physical cores are active by doing:
lscpu
in bash, which returns:
CPU(s): 8
On-line CPU(s) list: 0,2,4,6
Off-line CPU(s) list: 1,3,5,7
Thread(s) per core: 1
(One can run the script linked above from within Python.) In any case, after running it, typing these commands in Python:
import multiprocessing
multiprocessing.cpu_count()
returns 4.

Why does threading sorting functions work slower than sequentially running them in python? [duplicate]

I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I can't figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum
from random import random
import threading

def ox():
    print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
    r = threading.Thread(target=ox)
    r.start()
    ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of each other. Why should this be slower?
I suspect ox() is being parallelized automatically, because if I look at the Windows Task Manager performance tab and call ox() in my Python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions and in development branches, which at the very least should make multi-threaded CPU-bound code as fast as single-threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)
The problem is in the function random(). If you remove random() from your code, the slowdown disappears. Both cores try to access the shared state of the random function, so the cores end up working sequentially and spend a lot of time on cache synchronization. Such behavior is known as false sharing. Read this article: False Sharing
As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can use the Python multiprocessing module to fix that; or, if you are willing to use other open source libraries, Ray is also a great option for getting around the GIL, and it is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray

ray.init()

@ray.remote
def ox():
    print(max([random() for x in range(20000000)]))

%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.
