Multiprocessing: use only the physical cores? - python

I have a function foo which consumes a lot of memory and which I would like to run several instances of in parallel.
Suppose I have a CPU with 4 physical cores, each with two logical cores.
My system has enough memory to accommodate 4 instances of foo in parallel but not 8. Moreover, since 4 of these 8 cores are logical ones anyway, I also do not expect that using all 8 cores will provide much gain above and beyond using the 4 physical ones only.
So I want to run foo on the 4 physical cores only. In other words, I would like to ensure that doing multiprocessing.Pool(4) (4 being the maximum number of concurrent runs of the function I can accommodate on this machine due to memory limitations) dispatches the job to the four physical cores (and not, for example, to a combination of two physical cores and their two logical siblings).
How do I do that in Python?
Edit:
I earlier used a code example from multiprocessing, but I am library agnostic, so to avoid confusion I removed it.

I know this topic is quite old now, but since it still appears as the first result when typing 'multiprocessing logical core' into Google, I feel I have to give an additional answer, because I can see how people in 2018 (or even later) could easily get confused here (some answers are indeed a little bit confusing).
I can see no better place than here to warn readers about some of the answers above, so sorry for bringing the topic back to life.
--> TO COUNT THE CPUs (LOGICAL/PHYSICAL) USE THE PSUTIL MODULE
For a 4-physical-core / 8-thread i7, for example, it will return:
>>> import psutil
>>> psutil.cpu_count(logical=False)
4
>>> psutil.cpu_count(logical=True)
8
As simple as that.
That way you won't have to worry about the OS, the platform, the hardware itself or whatever. I am convinced it is much better than multiprocessing.cpu_count(), which can sometimes give weird results, from my own experience at least.
--> TO USE N PHYSICAL CORES (OF YOUR CHOICE) USE THE MULTIPROCESSING MODULE DESCRIBED BY YUGI
Just count how many physical cores you have and launch a multiprocessing.Pool of 4 workers.
Or you can also try to use the joblib.Parallel() function.
joblib (as of 2018) is not part of the standard distribution of Python, but is just a wrapper around the multiprocessing module described by Yugi.
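As a minimal sketch of the first option (assuming psutil is installed, and with a trivial stand-in for the memory-hungry foo from the question):

import psutil
from multiprocessing import Pool

def foo(x):
    # stand-in for the memory-hungry function from the question
    return x * x

if __name__ == "__main__":
    n_physical = psutil.cpu_count(logical=False)  # 4 on the machine described above
    with Pool(n_physical) as pool:
        results = pool.map(foo, range(100))
    print(results[:5])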
--> MOST OF THE TIME, DON'T USE MORE CORES THAN AVAILABLE (unless you have benchmarked a very specific code and proved it was worth it)
Misinformation abounds that "the OS will handle things if you specify more cores than are available". It is absolutely, 100% false. If you use more cores than are available, you will face huge performance drops. The exception is when the worker processes are I/O bound. The OS scheduler will try its best to work on every task with the same attention, switching regularly from one to another, and, depending on the OS, it can spend up to 100% of its working time just switching between processes, which would be disastrous.
Don't just trust me: try it, benchmark it, you will see how clear it is.
IS IT POSSIBLE TO DECIDE WHETHER THE CODE WILL BE EXECUTED ON A LOGICAL OR A PHYSICAL CORE?
If you are asking this question, it means you don't understand the way physical and logical cores are designed, so maybe you should read a little more about processor architecture.
If you want to run on core 3 rather than core 1, for example, well, I guess there are indeed some solutions, but they are available only if you know how to code an OS's kernel and scheduler, which I think is not the case if you're asking this question.
If you launch 4 CPU-intensive processes on a 4-physical / 8-logical processor, the scheduler will assign each of your processes to a distinct physical core (and the 4 remaining logical cores will stay unused or poorly used). But on such a 4-core / 8-thread processor, if the processing units are (0,1) (2,3) (4,5) (6,7), then it makes no difference whether the process is executed on 0 or on 1: it is the same processing unit.
To my knowledge at least (but an expert could confirm, and maybe it also differs for very specific hardware) I think there is no, or very little, difference between executing code on 0 or on 1. Within the processing unit (0,1), I am not even sure that 0 is the logical core and 1 the physical one, or vice versa. To my understanding (which could be wrong), both are processors within the same processing unit; they simply share their cache memory and access to the hardware (RAM included), and 0 is no more a physical unit than 1 is.
More than that, you should let the OS decide: the OS scheduler can take advantage of hardware turbo boost behaviour that exists on some platforms (e.g. i7, i5, i3), something you have no power over, and that could genuinely help you.
If you launch 5 CPU-intensive tasks on 4 physical / 8 logical cores, the behaviour will be chaotic and almost unpredictable, mostly dependent on your hardware and OS. The scheduler will try its best, but almost every time you will face really bad performance.
Let's presume for a moment that we are still talking about a classical 4(8) architecture: because the scheduler tries its best (and therefore often switches the assignments), depending on the process you are executing, launching on 5 logical cores could even be worse than launching on 8 logical cores (where at least the scheduler knows everything will be used at 100% anyway, so, lost for lost, it won't try hard to avoid it, won't switch too often, and therefore won't lose too much time switching).
It is 99% sure, however (but benchmark it on your hardware to be sure), that almost any multiprocessing program will run slower if you use more worker processes than you have physical cores.
A lot of things can intervene... the program, the hardware, the state of the OS, the scheduler it uses, the fruit you ate this morning, your sister's name... If you have any doubt, just benchmark it; there is no other easy way to see whether you are losing performance. Sometimes computing can be really weird.
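A rough benchmarking sketch along those lines (the dummy workload burn and the job sizes are made up; substitute your own code and adapt the pool sizes to your machine):

import time
from multiprocessing import Pool

def burn(n):
    # purely CPU-bound dummy workload
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    jobs = [2_000_000] * 16
    for workers in (2, 4, 8, 16):
        start = time.perf_counter()
        with Pool(workers) as pool:
            pool.map(burn, jobs)
        print(f"{workers:2d} workers: {time.perf_counter() - start:.2f} s")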
--> MOST OF THE TIME, ADDITIONAL LOGICAL CORES ARE INDEED USELESS IN PYTHON (but not always)
There are 2 main ways of doing really parallel tasks in python.
multiprocessing (cannot take advantage of logical cores)
multithreading (can take advantage of logical cores)
For example, to run 4 tasks in parallel:
--> multiprocessing will create 4 different Python interpreters. For each of them you have to start a Python interpreter, set up read/write permissions, set up the environment, allocate a lot of memory, etc. Let's say it as it is: you will start a whole new program instance from scratch. It can take a huge amount of time, so you have to be sure this new program will run long enough to make it worth it.
If your program has enough work (let's say, at least a few seconds of work), then because the OS places CPU-consuming processes on different physical cores, it works, and you can gain a lot of performance, which is great. And because the OS almost always allows processes to communicate with each other (although it is slow), they can even exchange (a little) data.
--> multithreading is different. Within your Python interpreter, it just creates a small amount of memory that many CPUs can share and work on at the same time. It is WAY quicker to spawn (spawning a new process on an old computer can sometimes take seconds; spawning a thread is done within a ridiculously small fraction of that time). You don't create new processes, but "threads", which are much lighter.
Threads can share memory very quickly, because they literally work on the same memory (whereas data has to be copied/exchanged when working with different processes).
BUT: WHY CAN'T WE USE MULTITHREADING IN MOST SITUATIONS? IT LOOKS VERY CONVENIENT
There is a very BIG limitation in Python: only one line of Python can be executed at a time per interpreter, because of the GIL (Global Interpreter Lock). So most of the time you will even LOSE performance by using multithreading, because different threads have to wait for access to the same resource. For pure computational work (with no I/O), multithreading is USELESS and even WORSE if your code is pure Python. However, if your threads involve any waiting on I/O, multithreading can be very beneficial.
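A tiny illustration of that last point, using time.sleep() as a stand-in for real I/O (sleeping releases the GIL just like waiting on a socket or a disk would):

import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    time.sleep(1)  # stand-in for an I/O wait; the GIL is released while sleeping

if __name__ == "__main__":
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as executor:
        list(executor.map(fake_io, range(8)))
    # the 8 one-second waits overlap, so this prints roughly 1 s, not 8 s
    print(f"elapsed: {time.perf_counter() - start:.1f} s")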
--> WHY SHOULDN'T I USE LOGICAL CORES WHEN USING MULTIPROCESSING ?
Logical cores don't have their own memory access. They can only work with the memory bandwidth and the cache of their hosting physical core. For example, it is very likely (and indeed often exploited) that the logical and the physical core of the same processing unit both use the same C/C++ function on different locations of the cache memory at the same time, making the computation hugely faster.
But... these are C/C++ functions! Python is a big C/C++ wrapper that needs much more memory and CPU than its equivalent C++ code. It is very likely in 2018 that, whatever you want to do, 2 big Python processes will need much, much more memory and cache reading/writing than what a single physical+logical unit can afford, and much more than what the equivalent, truly multithreaded C/C++ code would consume. This, once again, would almost always cause performance to drop. Remember that every variable that is not available in the processor's cache takes orders of magnitude longer to read from main memory. If your cache is already completely full for 1 single Python process, guess what happens if you force 2 processes to use it: they will use it one at a time and switch constantly, causing data to be stupidly flushed and re-read every time they switch. While the data is being read or written from memory, you might think that your CPU "is" working, but it's not: it's waiting for the data! Doing nothing.
--> HOW CAN YOU TAKE ADVANTAGE OF LOGICAL CORES THEN ?
As I said, there is no true multithreading (and so no true usage of logical cores) in default Python, because of the global interpreter lock. You can force the GIL to be released during some parts of the program, but I think it is wise advice not to touch it if you don't know exactly what you are doing.
Removing the GIL has definitely been the subject of a lot of research (see, for example, the experimental PyPy and Cython projects, which both explore ways around it).
For now, no real solution exists, as it is a much more complex problem than it seems.
There is, I admit, another solution that can work:
Code your function in C
Wrap it in Python with ctypes
Use Python's threading module to call your wrapped C function
This will work 100%, and you will be able to use all the logical cores, in Python, with multithreading, for real. The GIL won't bother you, because you won't be executing true Python functions, but C functions instead.
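A minimal sketch of that approach, assuming you have already compiled a shared library (the file name libwork.so and the function heavy_sum are hypothetical placeholders): ctypes drops the GIL for the duration of the foreign call, so the threads really do run in parallel.

# assumes something like:  gcc -O2 -shared -fPIC -o libwork.so work.c
# where work.c defines:    double heavy_sum(long n) { ... }
import ctypes
from concurrent.futures import ThreadPoolExecutor

lib = ctypes.CDLL("./libwork.so")            # hypothetical library name
lib.heavy_sum.argtypes = [ctypes.c_long]
lib.heavy_sum.restype = ctypes.c_double

# ctypes releases the GIL while the C function runs, so these calls
# can occupy several logical cores at the same time
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lib.heavy_sum, [10_000_000] * 8))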
For example, some libraries like NumPy can work on all available threads, because they are coded in C. But if you get to this point, I've always thought it wise to consider writing your program directly in C/C++, because this is a consideration very far from the original Pythonic spirit.
--> DON'T ALWAYS USE ALL AVAILABLE PHYSICAL CORES
I often see people say "OK, I have 8 physical cores, so I will use 8 cores for my job". It often works, but it sometimes turns out to be a poor idea, especially if your job needs a lot of I/O.
Try with N-1 cores (once again, especially for highly I/O-demanding tasks), and you will see that, per task and on average, single tasks run faster on N-1 cores. Indeed, your computer does a lot of other things: USB, mouse, keyboard, network, hard drive, etc. Even on a workstation, periodic background tasks you have no idea about run all the time. If you don't leave 1 physical core to manage them, your calculation will be regularly interrupted (flushed out of memory / placed back into memory), which can also lead to performance issues.
You might think "Well, background tasks will use only 5% of CPU time, so there is 95% left". But it's not the case.
Each core handles one task at a time, and every time it switches, a considerable amount of time is wasted putting everything back in its place in the cache and registers. Then, if for some weird reason the OS scheduler does this switching too often (something you have no control over), all of that computing time is lost forever and there's nothing you can do about it.
If (and it sometimes happens) for some unknown reason this scheduler problem impacts the performance of not 1 but 30 tasks, it can result in really intriguing situations where working on 29/30 physical cores can be significantly faster than on 30/30.
MORE CPUs ARE NOT ALWAYS BETTER
It is very frequent, when you use a multiprocessing.Pool, to also use a multiprocessing.Queue or a manager queue shared between processes, to allow some basic communication between them. Sometimes (I must have said it 100 times, but I repeat it), in a hardware-dependent manner, it can happen (though you should benchmark it for your specific application, code implementation and hardware) that using more CPUs creates a bottleneck when you make the processes communicate / synchronize. In those specific cases it can be worth running on a lower CPU count, or even trying to offload the synchronization task to a faster processor (here I'm talking about scientific intensive calculation run on a cluster, of course). Since multiprocessing is often meant to be used on clusters, note that clusters are often underclocked for energy-saving purposes. Because of that, single-core performance can be really bad (balanced by a much higher number of CPUs), which makes the problem even worse when you scale your code from your local computer (few cores, high single-core performance) to a cluster (lots of cores, lower single-core performance), because your code's bottleneck shifts according to the single_core_perf/nb_cpu ratio, which can be really annoying sometimes.
Everyone is tempted to use as many CPUs as possible, but benchmarking in those cases is mandatory.
The typical case (in data science, for example) is to have N processes running in parallel while you want to summarize the results in one file. Because you cannot wait for the whole job to be done, you do it through a dedicated writer process. The writer writes to the output file everything that is pushed into its multiprocessing.Queue (a single-core and hard-drive-limited process). The N processes fill the multiprocessing.Queue.
It is then easy to imagine that if you have 31 CPUs writing information to one really slow CPU, your performance will drop (and possibly something will crash if you exceed the system's capacity to handle the temporary data).
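A hedged sketch of that writer-process pattern (the file name and the workload are made up; a Manager queue is used because a plain multiprocessing.Queue cannot be passed to pool workers as an argument):

from multiprocessing import Pool, Manager

def writer(queue, path):
    # a single process owns the output file and drains the queue until it sees None
    with open(path, "w") as f:
        while True:
            item = queue.get()
            if item is None:
                break
            f.write(f"{item}\n")

def worker(args):
    queue, x = args
    queue.put(x * x)  # stand-in for a real result

if __name__ == "__main__":
    with Manager() as manager:
        queue = manager.Queue()
        with Pool(5) as pool:  # 4 workers + 1 dedicated writer
            writing = pool.apply_async(writer, (queue, "results.txt"))
            pool.map(worker, [(queue, i) for i in range(100)])
            queue.put(None)    # tell the writer it can stop
            writing.get()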
--> Take home message
Use psutil to count logical/physical processors, rather than multiprocessing.cpu_count() or anything else.
Multiprocessing can only really benefit from physical cores (or at least benchmark it to prove that it is not true in your case).
Multithreading will work on logical cores, BUT you will have to code and wrap your functions in C, or remove the global interpreter lock (and every time you do so, one kitten atrociously dies somewhere in the world).
If you are trying to run multithreading on pure Python code, you will have huge performance drops, so you should use multiprocessing instead 99% of the time.
Unless your processes/threads have long pauses that you can exploit, never use more cores than are available, and benchmark properly if you want to try.
If your task is I/O intensive, you should leave 1 physical core to handle the I/O, and if you have enough physical cores it will be worth it. For multiprocessing implementations that means using N-1 physical cores. For classical 2-way multithreading, it means using N-2 logical cores.
If you need more performance, try PyPy (not production ready) or Cython, or even code it in C.
Last but not least, and most important of all: if you are really seeking performance, you should absolutely, always benchmark, and not guess anything. Benchmarks often reveal strange, very platform/hardware/driver-specific behaviour that you would have no idea about.

Note: This approach doesn't work on Windows, and it has been tested only on Linux.
Using multiprocessing.Process:
Assigning a physical core to each process is quite easy when using Process(). You can create a for loop that iterates through the cores and assigns each new process to a core using taskset -p [mask] [pid]:
import multiprocessing
import os

def foo():
    return

if __name__ == "__main__":
    for process_idx in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=foo)
        # taskset changes the affinity of the current (parent) process;
        # the child started on the next line inherits that affinity when it is forked
        os.system("taskset -p -c %d %d" % (process_idx % multiprocessing.cpu_count(), os.getpid()))
        p.start()
I have 32 cores on my workstation so I'll put partial results here:
pid 520811's current affinity list: 0-31
pid 520811's new affinity list: 0
pid 520811's current affinity list: 0
pid 520811's new affinity list: 1
pid 520811's current affinity list: 1
pid 520811's new affinity list: 2
pid 520811's current affinity list: 2
pid 520811's new affinity list: 3
pid 520811's current affinity list: 3
pid 520811's new affinity list: 4
pid 520811's current affinity list: 4
pid 520811's new affinity list: 5
...
As you can see, the previous and the new affinity are printed at each step (the pid shown is the parent's; each child inherits the affinity in force when it is started). The first call covers all cores (0-31) and then pins to core 0; the next step starts from core 0 and changes the affinity to the next core (1), and so forth.
Using multiprocessing.Pool:
Warning: This approach needs tweaking of the pool.py module, since there is no way that I know of to extract the pid of the workers from Pool(). Also, these changes have been tested on Python 2.7 and multiprocessing.__version__ = '0.70a1'.
In pool.py, find the line where the self._task_handler.start() method is called. On the next line, you can assign the processes in the pool to the "physical" cores using the snippet below (I put the import os here so that the reader doesn't forget to import it):
import os

for worker in range(len(self._pool)):
    p = self._pool[worker]
    # pin each pool worker to a core, round-robin over the available cores
    os.system("taskset -p -c %d %d" % (worker % cpu_count(), p.pid))
and you're done. Test:
import multiprocessing

def foo(i):
    return

if __name__ == "__main__":
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    pool.map(foo, 'iterable here')
result:
pid 524730's current affinity list: 0-31
pid 524730's new affinity list: 0
pid 524731's current affinity list: 0-31
pid 524731's new affinity list: 1
pid 524732's current affinity list: 0-31
pid 524732's new affinity list: 2
pid 524733's current affinity list: 0-31
pid 524733's new affinity list: 3
pid 524734's current affinity list: 0-31
pid 524734's new affinity list: 4
pid 524735's current affinity list: 0-31
pid 524735's new affinity list: 5
...
Note that this modification to pool.py assigns the workers to the cores round-robin. So if you assign more jobs than there are CPU cores, you will end up having multiple of them on the same core.
EDIT:
What the OP is looking for is a Pool() that is capable of starting the pool on specific cores. For this, more tweaks to multiprocessing are needed (undo the above-mentioned changes first).
Warning:
Don't try to copy-paste the function definitions and function calls. Only copy paste the part that is supposed to be added after self._worker_handler.start() (you'll see it below). Note that my multiprocessing.__version__ tells me the version is '0.70a1', but it doesn't matter as long as you just add what you need to add:
multiprocessing's pool.py:
Add a cores_idx=None argument to the __init__() definition. In my version it looks like this after adding it:
def __init__(self, processes=None, initializer=None, initargs=(),
             maxtasksperchild=None, cores_idx=None):
Also, you should add the following code after self._worker_handler.start():

if cores_idx is not None:
    import os
    for worker in range(len(self._pool)):
        p = self._pool[worker]
        # pin each pool worker to one of the requested cores, round-robin
        os.system("taskset -p -c %d %d" % (cores_idx[worker % len(cores_idx)], p.pid))
multiprocessing's __init__.py:
Add a cores_idx=None argument to the definition of Pool(), as well as to the Pool() call in the return statement. In my version it looks like:

def Pool(processes=None, initializer=None, initargs=(), maxtasksperchild=None, cores_idx=None):
    '''
    Returns a process pool object
    '''
    from multiprocessing.pool import Pool
    return Pool(processes, initializer, initargs, maxtasksperchild, cores_idx)
And you're done. The following example runs a pool of 5 workers on cores 0 and 2 only:
import multiprocessing

def foo(i):
    return

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=5, cores_idx=[0, 2])
    pool.map(foo, 'iterable here')
result:
pid 705235's current affinity list: 0-31
pid 705235's new affinity list: 0
pid 705236's current affinity list: 0-31
pid 705236's new affinity list: 2
pid 705237's current affinity list: 0-31
pid 705237's new affinity list: 0
pid 705238's current affinity list: 0-31
pid 705238's new affinity list: 2
pid 705239's current affinity list: 0-31
pid 705239's new affinity list: 0
Of course you can still have the usual functionality of multiprocessing.Pool() as well by omitting the cores_idx argument.

I found a solution that doesn't involve changing the source code of a Python module. It uses the approach suggested here. One can check that only the physical cores are active after running that script by doing:
lscpu
in the bash returns:
CPU(s): 8
On-line CPU(s) list: 0,2,4,6
Off-line CPU(s) list: 1,3,5,7
Thread(s) per core: 1
[One can run the script linked above from within Python.] In any case, after running the script above, typing these commands in Python:
import multiprocessing
multiprocessing.cpu_count()
returns 4.
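For reference, a minimal sketch of roughly what such a script does on Linux (this is not the linked script; it assumes, as in the lscpu output above, that the hyperthread siblings are the odd-numbered CPUs, and it needs root because it writes to sysfs):

import multiprocessing

# take the hyperthread siblings offline (assumed here to be CPUs 1, 3, 5, 7)
for cpu in (1, 3, 5, 7):
    with open(f"/sys/devices/system/cpu/cpu{cpu}/online", "w") as f:
        f.write("0")

print(multiprocessing.cpu_count())  # now reports 4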

Related

Python: How many cores are used by my python program with five processes?

I have a python program consisting of 5 processes outside of the main process. Now I'm looking to get an AWS server or something similar on which I can run the script. But how can I find out how many vCPU cores are used by the script/how many are needed? I have looked at:
import multiprocessing
multiprocessing.cpu_count()
But it seems that it just returns the CPU count that's on the system. I just need to know how many vCPU cores the script uses.
Thanks for your time.
EDIT:
Just for some more information. The Processes are running indefinitely.
On Linux you can use the "top" command at the command line to monitor the real-time activity of all threads of a process id:
top -H -p <process id>
The answer to this post probably lies in the following question:
Multiprocessing : More processes than cpu.count
In short, you have probably hundreds of processes running, but that doesn't mean you will use hundreds of cores. It all depends on utilization, and the workload of the processes.
You can also get some additional info from the psutil module
import psutil
print(psutil.cpu_percent())
print(psutil.cpu_stats())
print(psutil.cpu_freq())
or using the os module together with psutil to get the current CPU usage in Python:

import os
import psutil

# load averages over the last 1, 5 and 15 minutes
l1, l2, l3 = psutil.getloadavg()
CPU_use = (l3 / os.cpu_count()) * 100
print(CPU_use)
Credit: DelftStack
Edit
There might be some information for you in the following medium article. Maybe there are some tools for CPU usage too.
https://medium.com/survata-engineering-blog/monitoring-memory-usage-of-a-running-python-program-49f027e3d1ba
Edit 2
A good guideline for how many processes to start depends on the number of hardware threads available. It's basically just thread_count + 1; this ensures your processor doesn't just sit around and wait. This works best when you are I/O bound, e.g. waiting for files from disk: while a process waits it is blocked, so you have 8 others to take over. The one extra is redundancy, so that if all 8 are blocked, the one that's left can take over right away. You can, however, increase or decrease this number as you see fit.
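A small sketch of that sizing rule for an I/O-bound job (the workload here just reads whatever files happen to be in the current directory, as a stand-in for real disk-bound work):

import os
from multiprocessing import Pool

def read_size(path):
    # disk-bound stand-in: read a file and return its size
    with open(path, "rb") as f:
        return len(f.read())

if __name__ == "__main__":
    pool_size = (os.cpu_count() or 1) + 1  # "thread count + 1" as described above
    files = [p for p in os.listdir(".") if os.path.isfile(p)]
    with Pool(pool_size) as pool:
        print(sum(pool.map(read_size, files)))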
Your question uses some general terms and leaves much unspecified so answers must be general.
It is assumed you are managing the processes using either Process directly or ProcessPoolExecutor.
In some cases, vCPU is a logical processor but per the following link there are services offering configurations of fractional vCPUs such as those in shared environments...
What is vCPU in AWS
You mention/ask...
... Now I'm looking to get an AWS server or something similar on which I can run the script. ...
... But how can I find out how many vCPU cores are used by the script/how many are needed? ...
You state AWS or something like it. The answer would depend on what your subprocesses do and how much of a vCPU or fractional vCPU each subprocess needs. Generally, a vCPU is analogous to a logical processor upon which a thread can execute. A fractional portion of a vCPU is some limited usage (rather than some otherwise "full" or complete usage) of a vCPU.
The meaning of one or more vCPUs (or fractional vCPUs) for your subprocesses really depends on those subprocesses and what they do. If one subprocess sits waiting on I/O most of the time, you hardly need a dedicated vCPU for it.
I recommend starting with some minimal least expensive configuration and see how it works with your app's expected workload. If you are not happy, increase the configuration as needed.
If it helps...
I usually use subprocesses if I need simultaneous execution that avoids Python's GIL limitations. I generally use a single active thread per subprocess, where any other threads in the same subprocess are usually waiting, either on I/O or on something that does not otherwise compete with the primary active thread of the subprocess. Of course, a subprocess could be dedicated to I/O if you want to separate such work from other threads you place in other subprocesses.
Since we do not know your app's purpose, architecture and many other factors, it's hard to say more than the generalities above.
Your computer has hundreds if not thousands of processes running at any given point. How does it handle all of those if it only has 5 cores? The thing is, each core takes a process for a certain amount of time or until it has nothing left to do inside that process.
For example, if I create a script that calculates the square root of all numbers from 1 to say a billion, you will see that a single core will hit max usage, then a split second later another core hits max while the first drops to normal and so on until the calculation is done.
Or, if the process waits for an I/O operation, the core has nothing to do, so it drops the process and moves on to another one; when the I/O operation is done, the core can pick the process back up and get back to work.
You can run your multiprocessing Python code on a single core or on 100 cores; you can't really do much about it. However, on Windows you can set the affinity of a process, which gives the process access to certain cores only. So, when the processes start, you can go to each one and set its affinity to, say, core 1, or set each one to a separate core. Not sure how you do that on Linux though.
In conclusion, if you want a short and direct answer, I think we can say: as many cores as it has access to. Whether you give them one core or 200 cores, they will still work. However, performance may degrade if the processes are CPU intensive, so I recommend starting with one core on AWS, checking performance, and upgrading if needed.
I'll try to do my own summary about "I just need to know how many vCPU cores the script uses".
There is no way to answer that properly other than running your app and monitoring its resource usage. Assuming your Python processes do not spawn subprocesses (which could themselves be multithreaded applications), all we can say is that your app won't utilize more than 6 cores (as per the total number of processes). There are a ton of ways for a program to under-utilize CPU cores, like waiting for I/O (disk or network) or interprocess synchronization (shared resources). So to get any kind of understanding of CPU utilization, you really need to measure the actual performance (e.g. with the htop utility on Linux or macOS) and investigate the causes of underperformance (if any).
Hope it helps.

Why is ThreadPoolExecutor's default max_workers decided based on the number of CPUs?

The documentation for concurrent.futures.ThreadPoolExecutor says:
Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.
I want to understand why the default max_workers value depends on the number of CPUs. Regardless of how many CPUs I have, only one Python thread can run at any point in time.
Let us assume each thread is I/O intensive and it spends only 10% of its time in the CPU and 90% of its time waiting for I/O. Let us then assume we have 2 CPUs. We can only run 10 threads to utilize 100% CPU. We can't utilize any more CPU because only one thread runs at any point in time. This holds true even if there are 4 CPUs.
So why is the default max_workers decided based on the number of CPUs?
It's a lot easier to check the number of processors than to check how I/O bound your program is, especially at thread pool startup, when your program hasn't really started working yet. There isn't really anything better to base the default on.
Also, adding the default was a pretty low-effort, low-discussion change. (Previously, there was no default.) Trying to get fancy would have been way more work.
That said, getting fancier might pay off. Maybe some kind of dynamic system that adjusts thread count based on load, so you don't have to decide the count at the time when you have the least information. It won't happen unless someone writes it, though.
The CPython thread implementation is lightweight. It mostly ships things off to the OS, with some accounting for the GIL (and signal handling). Increasing the number of threads in proportion to cores usually does not work out. Since the threads are managed by the OS, with many cores the OS gets greedy and tries to run as many ready threads as possible whenever there is a thread context switch. They all try to acquire the GIL and only one succeeds. This leads to a lot of waste, worse than the linear model that assumes only one thread can run at a given time. If you are using purely CPU-bound threads in the executor, there is no reason to link the default to cores. But we should not deprive users who really want the CPU power, and are okay with a GIL release, of the chance to utilise the cores. So arguably, the default value should be linked to the number of cores in this case, if you assume most people running Python know what they are doing.
Now if the threads in the executor are I/O-bound, then, as you rightly mention, the maximum useful capacity is 1/p, where p is the fraction of CPU each thread needs. For deciding the default, it is impossible to know what p is beforehand. The default's implied minimum of 0.2 (at least 5 threads per core) does not look too bad. But usually my guess is that p will be much lower, so the limiting factor may never be the CPU (and if it is, we again get into the CPU-thrashing problem of multiple cores described above). So linking the default to the number of cores will probably not end up being unsafe (unless the threads do heavy processing or you have too many cores!).
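If you want to see what that default works out to on your own machine, you can peek at the pool's worker count (note that Python 3.8 later changed the formula to min(32, os.cpu_count() + 4)):

import os
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()          # max_workers=None -> use the default
# _max_workers is a private attribute, inspected here purely for illustration
print(os.cpu_count(), executor._max_workers)
executor.shutdown()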

How to make Python's multiprocessing spawn processes that make use of all the available CPUs

I have an AWS instance that has 32 CPUs:
ubuntu#ip-122-00-18-114:~$ cat /proc/cpuinfo | grep processor | wc -l
32
My question is: how can I make use of Python's multiprocessing so that each command runs on a separate CPU?
For example, with the following code, will each command run on a separate CPU, using every CPU available?
import multiprocessing
import os

POOL_SIZE = 32

cmdlist = []
for param in items:
    cmd = """./cool_command %s""" % (param)
    cmdlist.append(cmd)

p = multiprocessing.Pool(POOL_SIZE)
p.map(os.system, cmdlist)
If not, what's the right way to do it?
And what happened if I set POOL_SIZE > # Processors (CPUs)?
First, a little correction on your wording. A CPU has multiple cores, and each core has hyperthreads. Each hyperthread is a logical unit on which a process can run. On Amazon you have 32 vCPUs, which correspond to hyperthreads, not CPUs or cores. This is not important for this question, but in case you do any further research it is important to have the wording right. I'll refer to this "lowest logical processing unit" (the hyperthread) as a vCPU below.
If you do not specify the pool size:
p = multiprocessing.Pool()
p.map(os.system, cmdlist)
then Python will find out the number of available logical processors (in your case, 32 vCPUs) itself (via os.cpu_count()).
In normal circumstances, all 32 processes run on separate vCPUs because Linux tries to balance the load evenly between them.
If, however there are other heavy processes running at the same time, then two processes might run on the same vCPU.
The key to understand here is how the Linux scheduler works: It periodically reschedules processes so all processing units are utilized about the same. That means if you start only 16 processes then they will spread out to all 32 vCPUs and utilize them about the same (use htop to see how the load spreads).
And what happened if I set POOL_SIZE > # Processors (CPUs)?
If you start more processes than there are available vCPUs, then some processes need to share a vCPU. That means a process is periodically switched out by the scheduler in a context switch. If your processes are CPU bound (using 100% CPU, e.g. when you do number crunching), then having more processes than vCPUs will slow down the overall job, both because of the context switches themselves and because of any communication between the processes (not in your example, but something you'd normally do when doing multiprocessing), which slows things down as well.
However, if your processes are not CPU bound but e.g. disk bound (waiting for the disk to read/write) or network bound (e.g. waiting for another server to answer), then they are switched out by the scheduler to make room for another process, since they need to wait anyway.
The easy answer is "not exactly". You can get the CPU count with the os.cpu_count() function and run that number of processes, but only the operating system assigns a process to a CPU. More than that, it might switch the process to another CPU after some time. I won't explain how preemptive multitasking works here.
If you have other "heavy" processes running on this server, for example a database or even a web server, they might need some CPU time as well.
The good news is that there is a thing named process affinity that could be of use for your needs, but it is a kind of fine-tuning of the OS.

When using multiprocessing Pool should the number of worker processes be the same as the number of CPU's or cores?

When using Python multiprocessing Pool should the number of worker processes be the same as the number of CPU's or cores?
This article http://www.howtogeek.com/194756/cpu-basics-multiple-cpus-cores-and-hyper-threading-explained/ says each core is really a central processing unit on a CPU chip. And thus it seems like there should not be a problem with having 1 process/core
e.g. if I have a single CPU chip with 4 cores, can 1 process per core (4 processes in total) be run without the possibility of slowing performance?
From what I've learned regarding python and multiprocessing, the best course of action is...
One process per core, but skip logical ones.
Hyperthreading is no help for Python. It'll actually hurt performance in many cases, but test it yourself first, of course.
Use the affinity (pip install affinity) module to stick each process to a specific core.
At least on Windows with 32-bit Python, where I tested extensively, not doing this will hurt performance significantly due to constant thrashing of the cache. And again: skip the logical cores! The logical ones, assuming you have an Intel CPU with hyperthreading, are 1, 3, 5, 7, etc.
More threads than real cores will gain you nothing, unless there's also I/O happening, which there shouldn't be if you're crunching numbers. Test my claim yourself, especially on Linux, as I didn't get to test on Linux at all.
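If the affinity package is not an option, psutil (a different library, swapped in here for illustration) exposes the same idea. A sketch that restricts every pool worker to the even-numbered cores, assuming the Intel numbering described above where the odd IDs are the hyperthread siblings:

import multiprocessing
import psutil

def pin_to_physical(cores):
    # initializer run in each worker: restrict this process to the given cores;
    # the OS scheduler then spreads the workers across them
    psutil.Process().cpu_affinity(cores)

def crunch(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    even_cores = list(range(0, psutil.cpu_count(), 2))  # assumption: even IDs are the physical cores
    with multiprocessing.Pool(len(even_cores),
                              initializer=pin_to_physical,
                              initargs=(even_cores,)) as pool:
        print(pool.map(crunch, [1_000_000] * len(even_cores)))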
It really depends on your workload. Case by case, the best approach is to run some benchmark test and see what is the result.
Scheduling processes is an expensive operation, the more running processes, the more you need to change context.
If most of your processes are not running (they are waiting for I/O, for example), then overcommitting might prove beneficial. On the contrary, if your processes are running most of the time, adding more of them to contend for your CPU is going to be detrimental.

Why are threads spread between CPUs?

I am trying to get my head around threading vs. CPU usage. There are plenty of discussions about threading vs. multiprocessing (a good overview being this answer) so I decided to test this out by launching a maximum number of threads on my 8 CPU laptop running Windows 10, Python 3.4.
My assumption was that all the threads would be bound to a single CPU.
EDIT: it turns out that it was not a good assumption. I now understand that for multithreaded code, only one piece of python code can run at once (no matter where/on which core). This is different for multiprocessing code (where processes are independent and run indeed independently).
While I had read about these differences, it was one answer which actually clarified this point.
I think it also explains the CPU view below: that it is an average view of many threads spread out on many CPUs, but only one of them running at one given time (which "averages" to all of them running all the time).
It is not a duplicate of the linked question (which addresses the opposite problem, i.e. all threads on one core) and I will leave it hanging in case someone has a similar question one day and is hopefully helped by my enlightenment.
The code
import threading
import time

def calc():
    time.sleep(5)
    while True:
        a = 2356 ^ 36   # note: ^ is bitwise XOR, not exponentiation

n = 0
while True:
    try:
        n += 1
        t = threading.Thread(target=calc)
        t.start()
    except RuntimeError:
        print("max threads: {n}".format(n=n))
        break
    else:
        print('.')

time.sleep(100000)
Led to 889 threads being started.
The load on the CPUs was however distributed (and surprisingly low for a pure CPU calculation, the laptop is otherwise idle with an empty load when not running my script):
Why is it so? Are the threads constantly moved as a pack between CPUs and what I see is just an average (the reality being that at a given moment all threads are on one CPU)? Or are they indeed distributed?
As of today it is still the case that 'one thread holds the GIL'. So one thread is running at a time.
The threads are managed at the operating system level. What happens is that every 100 "ticks" (= interpreter instructions) the running thread releases the GIL and resets the tick counter.
Because the threads in this example do continuous calculations, the tick limit of 100 instructions is reached very quickly, leading to an almost immediate release of the GIL, and a "battle" between the threads starts to acquire it.
So, my assumption is that your operating system has a higher than expected load because of (too) fast thread switching plus the almost continuous releasing and acquiring of the GIL. The OS spends more time on switching than actually doing any useful calculation.
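For what it's worth, the 100-tick counter described above is how older CPython versions behaved; current CPython uses a time-based switch interval instead, which you can inspect and tune:

import sys

print(sys.getswitchinterval())  # 0.005 s by default in current CPython
sys.setswitchinterval(0.01)     # make GIL hand-offs less frequent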
As you mention yourself, for using more than one core at a time, it's better to look at multiprocessing modules (joblib/Parallel).
Interesting read:
http://www.dabeaz.com/python/UnderstandingGIL.pdf
Um. The point of multithreading is to make sure the work gets spread out. A really easy cheat is to use as many threads as you have CPU cores. The point is that they are all independent, so they can actually run at the same time. If they were on the same core, only one thread at a time could really run at all; they'd pass that core back and forth for processing at the OS level.
Your assumption is wrong and bizarre. What would ever lead you to think they should run on the same CPU and consequently go at 1/8th speed? The only reason to thread them is typically to get the whole batch to go faster than on a single core alone.
In fact, what do you think writing parallel code is for, if not to run independently on several cores at the same time? As if it would make sense to build complex fetching, branching and forking routines just to accomplish things slower than one core plugging away at the data.
