Background
I am trying to speed up a computation by parallelizing it (via joblib) across more of the available cores in Python 3.8, but I observed that it scales poorly.
Trials
I wrote a little script to test and demonstrate the behavior (see below). The script is designed to run a completely independent task doing some iterations of dummy operations using NumPy and Pandas. There is no input and no output to the task, no disk or other I/O, nor any communication or shared memory, just plain CPU and RAM usage. The processes do not use any other resources either, other than the occasional request for the current time. Amdahl's Law should not apply to the code here, since there is no common code at all except for process setup.
I ran some experiments with increasing workloads by duplicating the task, using sequential vs. parallel processing, and measured the time it takes for each iteration and for the whole (parallel) run to complete. I ran the script on my Windows 10 laptop and on two AWS EC2 Linux (Amazon Linux 2) machines. The number of parallel processes never exceeded the number of available cores.
Observation
I observed the following (see results later for details, duration in seconds):
When the number of parallel processes was less than the number of available cores, the total average CPU utilization (user) never exceeded 93%, system calls did not exceed 4%, and there was no iowait (measured with iostat -hxm 10)
The workload seems to be distributed equally over the available cores, though, which might be an indication for frequent switches between processes even though there are plenty of cores available
Interestingly, for sequential processing, the CPU utilization (user) was around 48%
The summed duration of all iterations is only slightly less than the total duration of a process, hence the process setup does not seem to be a major factor
For each doubling of the number of parallel processes, the duration of each iteration increases by roughly 50%
Whereas the duration for sequential processing approx. doubles as expected with doubling the workload (total number of iterations),
the duration for the parallel processing also increased significantly by approx. 50% per each doubling
These findings in this magnitude are unexpected to me.
Questions
What is the cause for this behavior?
Am I missing something?
How can it be remedied in order to utilize the full prospect of using more cores?
Detailed results
Windows 10
6 physical cores, 12 logical processors
Call: python .\time_parallel_processing.py 1,2,4,8 10
                     Duration/Iter            Duration total TotalIterCount
                              mean       std            mean           mean
Mode   ParallelCount
Joblib 1                  4.363902  0.195268       43.673971             10
       2                  6.322100  0.140654       63.870973             20
       4                  9.270582  0.464706       93.631790             40
       8                 15.489000  0.222859      156.670544             80
Seq    1                  4.409772  0.126686       44.133441             10
       2                  4.465326  0.113183       89.377296             20
       4                  4.534959  0.125097      181.528372             40
       8                  4.444790  0.083315      355.849860             80
AWS c5.4xlarge
8 physical cores, 16 vCPUs
Call: python time_parallel_processing.py 1,2,4,8,16 10
                Duration/Iter            Duration total TotalIterCount
                         mean       std            mean           mean
Mode   ParCount
Joblib 1             2.196086  0.009798       21.987626             10
       2             3.392873  0.010025       34.297323             20
       4             4.519174  0.126054       45.967140             40
       8             6.888763  0.676024       71.815990             80
       16           12.191278  0.156941      123.287779            160
Seq    1             2.192089  0.010873       21.945536             10
       2             2.184294  0.008955       43.735713             20
       4             2.201437  0.027537       88.156621             40
       8             2.145312  0.009631      171.805374             80
       16            2.137723  0.018985      342.393953            160
AWS c5.9xlarge
18 physical cores, 36 vCPUs
Call: python time_parallel_processing.py 1,2,4,8,16,32 10
                Duration/Iter            Duration total TotalIterCount
                         mean       std            mean           mean
Mode   ParCount
Joblib 1             1.888071  0.023799       18.905295             10
       2             2.797132  0.009859       28.307708             20
       4             3.349333  0.106755       34.199839             40
       8             4.273267  0.705345       45.998927             80
       16            6.383214  1.455857       70.469109            160
       32           10.974141  4.220783      129.671016            320
Seq    1             1.891170  0.030131       18.934494             10
       2             1.866365  0.007283       37.373133             20
       4             1.893082  0.041085       75.813468             40
       8             1.855832  0.007025      148.643725             80
       16            1.896622  0.007573      303.828529            160
       32            1.864366  0.009142      597.301383            320
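To put the tables above into perspective, the total durations can be condensed into a speedup (sequential total divided by parallel total) and a parallel efficiency (speedup divided by the number of processes). The following is a minimal sketch that uses only the totals reported above; the helper name speedup_and_efficiency is made up for illustration.

```python
from typing import Dict, Tuple

# Total durations in seconds, copied from the tables above:
# {machine: {ParCount: (Joblib total, Seq total)}}
TOTALS: Dict[str, Dict[int, Tuple[float, float]]] = {
    "Windows 10": {
        1: (43.673971, 44.133441),
        2: (63.870973, 89.377296),
        4: (93.631790, 181.528372),
        8: (156.670544, 355.849860),
    },
    "AWS c5.4xlarge": {
        1: (21.987626, 21.945536),
        2: (34.297323, 43.735713),
        4: (45.967140, 88.156621),
        8: (71.815990, 171.805374),
        16: (123.287779, 342.393953),
    },
    "AWS c5.9xlarge": {
        1: (18.905295, 18.934494),
        2: (28.307708, 37.373133),
        4: (34.199839, 75.813468),
        8: (45.998927, 148.643725),
        16: (70.469109, 303.828529),
        32: (129.671016, 597.301383),
    },
}


def speedup_and_efficiency(parallel_total: float, sequential_total: float, par_count: int) -> Tuple[float, float]:
    """Return (speedup, efficiency) for one row of the tables."""
    speedup = sequential_total / parallel_total
    return speedup, speedup / par_count


if __name__ == "__main__":
    for machine, rows in TOTALS.items():
        print(machine)
        for par_count, (joblib_total, seq_total) in rows.items():
            speedup, efficiency = speedup_and_efficiency(joblib_total, seq_total, par_count)
            print(f"  ParCount={par_count:>2}  speedup={speedup:5.2f}  efficiency={efficiency:6.1%}")
```

With perfect scaling the efficiency would stay near 100%. With the numbers above it falls to about 28% at 8 processes on Windows 10, about 17% at 16 processes on c5.4xlarge, and about 14% at 32 processes on c5.9xlarge.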
Script code
import argparse
import sys
import time
from argparse import Namespace
from typing import List

import numpy as np
import pandas as pd
from joblib import delayed
from joblib import Parallel
from tqdm import tqdm

RESULT_COLUMNS = {"Mode": str, "ParCount": int, "ProcessId": int, "IterId": int, "Duration": float}


def _create_empty_data_frame() -> pd.DataFrame:
    return pd.DataFrame({key: [] for key, _ in RESULT_COLUMNS.items()}).astype(RESULT_COLUMNS)


def _do_task() -> None:
    for _ in range(10):
        array: np.ndarray = np.random.rand(2500, 2500)
        _ = np.matmul(array, array)
        data_frame: pd.DataFrame = pd.DataFrame(np.random.rand(250, 250), columns=list(map(str, list(range(250)))))
        _ = data_frame.merge(data_frame)


def _process(process_id: int, iter_count: int) -> pd.DataFrame:
    durations: pd.DataFrame = _create_empty_data_frame()
    for i in tqdm(range(iter_count)):
        iter_start_time: float = time.time()
        _do_task()
        durations = durations.append(
            {
                "Mode": "",
                "ParCount": 0,
                "ProcessId": process_id,
                "IterId": i,
                "Duration": time.time() - iter_start_time,
            },
            ignore_index=True,
        )
    return durations


def main(args: Namespace) -> None:
    """Execute main script."""
    iter_durations: List[pd.DataFrame] = []
    mode_durations: List[pd.DataFrame] = []

    for par_count in list(map(int, args.par_counts.split(","))):
        total_iter_count: int = par_count * int(args.iter_count)

        print(f"\nRunning {par_count} processes in parallel and {total_iter_count} iterations in total")
        start_time_joblib: float = time.time()
        with Parallel(n_jobs=par_count) as parallel:
            joblib_durations: List[pd.DataFrame] = parallel(
                delayed(_process)(process_id, int(args.iter_count)) for process_id in range(par_count)
            )
        iter_durations.append(pd.concat(joblib_durations).assign(**{"Mode": "Joblib", "ParCount": par_count}))
        end_time_joblib: float = time.time()

        print(f"\nRunning {par_count} processes sequentially with {total_iter_count} iterations in total")
        start_time_seq: float = time.time()
        seq_durations: List[pd.DataFrame] = []
        for process_id in range(par_count):
            seq_durations.append(_process(process_id, int(args.iter_count)))
        iter_durations.append(pd.concat(seq_durations).assign(**{"Mode": "Seq", "ParCount": par_count}))
        end_time_seq: float = time.time()

        mode_durations.append(
            pd.DataFrame(
                {
                    "Mode": ["Joblib", "Seq"],
                    "ParCount": [par_count] * 2,
                    "Duration": [end_time_joblib - start_time_joblib, end_time_seq - start_time_seq],
                    "TotalIterCount": [total_iter_count] * 2,
                }
            )
        )

    print("\nDuration in seconds")
    grouping_columns: List[str] = ["Mode", "ParCount"]
    print(
        pd.concat(iter_durations)
        .groupby(grouping_columns)
        .agg({"Duration": ["mean", "std"]})
        .merge(
            pd.concat(mode_durations).groupby(grouping_columns).agg({"Duration": ["mean"], "TotalIterCount": "mean"}),
            on=grouping_columns,
            suffixes=["/Iter", " total"],
            how="inner",
        )
    )


if __name__ == "__main__":
    print(f"Command line: {sys.argv}")
    parser: argparse.ArgumentParser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "par_counts",
        help="Comma separated list of parallel processes counts to start trials for (e.g. '1,2,4,8,16,32')",
    )
    parser.add_argument("iter_count", help="Number of iterations per parallel process to carry out")
    args: argparse.Namespace = parser.parse_args()

    start_time: float = time.time()
    main(args)
    print(f"\nTotal elapsed time: {time.time() - start_time:.2f} seconds")
Environment
Created with: conda env create -f environment.time_parallel.yaml
environment.time_parallel.yaml:
name: time_parallel
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=20.3.3
  - pandas=1.2.0
  - numpy=1.19.2
  - joblib=1.0.0
  - tqdm=4.55.1
Update 1
Thanks to the comment of #sholderbach I looked into the NumPy/Pandas usage and found out a couple of things.
1)
NumPy uses a linear algebra backend that automatically runs some operations (including matrix multiplication) in parallel threads. With several processes running at once, this adds up to too many threads altogether, which clogs the system. The more parallel processes there are, the worse it gets, hence the increasing duration per iteration.
I tested this hypothesis by removing the NumPy and Pandas operations in the function _do_task and replacing them with simple math operations only:
def _do_task() -> None:
    for _ in range(10):
        for i in range(10000000):
            _ = 1000 ^ 2 % 200
The results are exactly as expected in that the duration of an iteration does not change when increasing the number of processes (beyond the number of cores available).
Windows 10
Call: python time_parallel_processing.py 1,2,4,8 5
Duration in seconds
                Duration/Iter            Duration total TotalIterCount
                         mean       std            mean           mean
Mode   ParCount
Joblib 1             2.562570  0.015496       13.468393              5
       2             2.556241  0.021074       13.781174             10
       4             2.565614  0.054754       16.171828             20
       8             2.630463  0.258474       20.328055             40
Seq    2             2.576542  0.033270       25.874965             10
AWS c5.9xlarge
Call: python time_parallel_processing.py 1,2,4,8,16,32 10
                Duration/Iter            Duration total TotalIterCount
                         mean       std            mean           mean
Mode   ParCount
Joblib 1             2.082849  0.022352       20.854512             10
       2             2.126195  0.034078       21.596860             20
       4             2.287874  0.254493       27.420978             40
       8             2.141553  0.030316       21.912917             80
       16            2.156828  0.137937       24.483243            160
       32            3.581366  1.197282       42.884399            320
Seq    2             2.076256  0.004231       41.571033             20
2)
Following the hint of #sholderbach I found a number of other links which cover the topic of linear algebra backends using multiple threads automatically and how to turn this off:
NumPy issue (from #sholderbach)
threadpoolctl package
Nice article
Pinning process to a specific CPU with Python (and package psutil)
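Besides the approaches in the links above, a commonly used way to turn the automatic threading off is to cap the backends' thread pools through environment variables before NumPy is imported. The sketch below assumes the usual variable names read by OpenMP, OpenBLAS and MKL (OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS); which of them has an effect depends on the backend NumPy was built against, and the helper name limit_blas_threads is hypothetical.

```python
import os

# Environment variables commonly read by the threading layers behind NumPy.
# Assumption: the installed backend honors at least one of these.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(max_threads: int = 1) -> dict:
    """Cap backend thread pools; call this BEFORE the first `import numpy`."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(max_threads)
    # Return what was set so the caller can log or check it.
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


if __name__ == "__main__":
    print(limit_blas_threads(1))
    # import numpy as np  # import only after the limits are in place
```

Processes started afterwards inherit the variables, so setting them once in the parent before the workers are created is typically enough.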
Add to _process:
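import psutil  # module level
from threadpoolctl import threadpool_limits  # module level

proc = psutil.Process()
proc.cpu_affinity([process_id])  # pin this process to a single logical CPU
with threadpool_limits(limits=1):  # restrict BLAS/OpenMP pools to one thread
    ...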
Add to environment:
  - threadpoolctl=2.1.0
  - psutil=5.8.0
Note: I had to replace joblib by multiprocessing, since pinning did not work properly with joblib (only one half of the processes got spawned at a time on Linux).
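For reference, on Linux the pinning itself can also be done with the standard library instead of psutil. The following is a rough sketch, assuming os.sched_setaffinity is available (it is Linux-only, so the helper does nothing elsewhere); pin_to_core is a made-up helper name, and it maps the process index onto the CPUs the process is actually allowed to use.

```python
import os
from typing import Optional, Set


def pin_to_core(process_id: int) -> Optional[Set[int]]:
    """Pin the calling process to one of its allowed CPUs (Linux only).

    Returns the new affinity set, or None where pinning is not supported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. Windows or macOS: use psutil or skip pinning
    allowed = sorted(os.sched_getaffinity(0))  # CPUs this process may run on
    # Wrap around if there are more processes than allowed CPUs.
    target = allowed[process_id % len(allowed)]
    os.sched_setaffinity(0, {target})
    return set(os.sched_getaffinity(0))


if __name__ == "__main__":
    print(f"Pinned to: {pin_to_core(0)}")
```

The call is meant to run inside each worker, for example as the first statement of _process. A child forked from an already pinned parent inherits the restricted affinity, so the parent itself should not be pinned before the workers are started.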
I did some tests with mixed results. Monitoring shows that pinning and restricting to one thread per process works for both Windows 10 and Linux/AWS c5.9xlarge. Unfortunately, the absolute duration per iteration increases with these "fixes".
Also, the duration per iteration still begins to increase at some point of parallelization.
Here are the results:
Windows 10
Call: python time_parallel_processing.py 1,2,4,8 5
                Duration/Iter            Duration total TotalIterCount
                         mean       std            mean           mean
Mode   ParCount
Joblib 1             9.502184  0.046554       47.542230              5
       2             9.557120  0.092897       49.488612             10
       4             9.602235  0.078271       50.249238             20
       8            10.518716  0.422020       60.186707             40
Seq    2             9.493682  0.062105       95.083382             10
AWS c5.9xlarge
Call: python time_parallel_processing.py 1,2,4,8,16,20,24,28,32 5
                  Duration/Iter            Duration total TotalIterCount
                           mean       std            mean           mean
Mode     ParCount
Parallel 1             5.271010  0.008730       15.862883              3
         2             5.400430  0.016094       16.271649              6
         4             5.708021  0.069001       17.428172             12
         8             6.088623  0.179789       18.745922             24
         16            8.330902  0.177772       25.566504             48
         20           10.515132  3.081697       47.895538             60
         24           13.506221  4.589382       53.348917             72
         28           16.318631  4.961513       57.536180             84
         32           19.800182  4.435462       64.717435             96
Seq      2             5.212529  0.037129       31.332297              6
What is the cause for this behavior?
Very generally, this type of slowdown usually indicates some combination of being blocked by the GIL, context switching between cores, or doing a lot of pickling.
Am I missing something?
You may be missing some small issues. Try profiling to see where the time is spent (a sampling profiler may add much less overhead than cProfile)!
However, there's still a finite limit to how fast this can be made before you are reimplementing the suggestions below.
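As a starting point for such profiling, here is a minimal sketch using the standard library's cProfile and pstats on a stand-in workload. The function busy_work is a placeholder for the real task (for example _do_task), profile_call is a made-up helper, and sampling profilers are separate third-party tools that are not shown here.

```python
import cProfile
import io
import pstats


def busy_work(n: int = 200_000) -> int:
    """Placeholder workload standing in for the real task."""
    return sum(i * i % 7 for i in range(n))


def profile_call(func, *args, top: int = 5) -> str:
    """Run func under cProfile and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_call(busy_work))
```

Note that this only profiles the process it runs in, so for the parallel case it would have to be called inside the worker function.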
How can it be remedied in order to utilize the full prospect of using more cores?
Take a look at numba and dask, which can give you tremendous speedups on NumPy and Pandas code through parallelization that steps outside of the GIL.
numba compiles numerical Python and NumPy code to machine code at run time and can cache the compiled result, which speeds up tight numeric loops.
dask is a framework full of good tricks for efficient parallelization on a single machine or across multiple machines.
When I had a scaling issue with ipyparallel, it was caused by garbage collection during the run.
The source code of timeit shows how to disable gc properly.
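The pattern used by timeit boils down to remembering whether the collector was enabled, disabling it for the timed section, and restoring the previous state afterwards. A minimal sketch of that idea as a context manager (the name gc_disabled is made up here, and the list allocation is only a stand-in workload):

```python
import gc
import time
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Temporarily switch off the cyclic garbage collector."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Only re-enable if it was enabled before, as timeit does.
        if was_enabled:
            gc.enable()


if __name__ == "__main__":
    start = time.perf_counter()
    with gc_disabled():
        data = [[i] for i in range(100_000)]  # allocation-heavy stand-in workload
    print(f"elapsed: {time.perf_counter() - start:.4f} s")
```

The setting is per process, so in a multi-process run it would have to wrap the work inside each worker.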
I am trying to learn the multiprocessing library in Python 3.9. One thing I compared was the performance of a repeated computation on a dataset consisting of 220500 samples. I did this using the multiprocessing library and then using for loops.
Throughout my tests I am consistently getting better performance using for loops. Here is the code for the test I am running. I am computing the FFT of a signal with 220500 samples. My experiment involves running this process a certain number of times in each test. I am testing this out with the number of processes set to 10, 100, and 1000 respectively.
import time
import numpy as np
from scipy.signal import get_window
from scipy.fftpack import fft
import multiprocessing
from itertools import product


def make_signal():
    # moved this code into a function to make threading portion of code clearer
    DUR = 5
    FREQ_HZ = 10
    Fs = 44100
    # precompute the size
    N = DUR * Fs
    # get a windowing function ('hann' is the canonical name; 'hanning' is a legacy alias)
    w = get_window('hann', N)
    t = np.linspace(0, DUR, N)
    x = np.zeros_like(t)
    b = 2*np.pi*FREQ_HZ*t
    for i in range(50):
        x += np.sin(b*i)
    return x*w, Fs


def fft_(x, Fs):
    yfft = fft(x)[:x.size//2]
    xfft = np.linspace(0,Fs//2,yfft.size)
    return 2/yfft.size * np.abs(yfft), xfft


if __name__ == "__main__":
    # grab the raw sample data which will be computed by the fft function
    x = make_signal()
    # x is the tuple (signal, Fs); the signal has 220500 samples
    # create 3 different tests, each with the amount of processes below
    # array([  10,  100, 1000])
    tests_sweep = np.logspace(1,3,3, dtype=int)
    # sweep through the processes
    for iteration, test_num in enumerate(tests_sweep):
        # create a list of the amount of processes to give for each iteration
        fft_processes = []
        for i in range(test_num):
            fft_processes.append(x)
        start = time.time()
        # repeat the process for test_num amount of times (e.g. 10, 100, 1000)
        with multiprocessing.Pool() as pool:
            results = pool.starmap(fft_, fft_processes)
        end = time.time()
        print(f'{iteration}: Multiprocessing method with {test_num} processes took: {end - start:.2f} sec')
        start = time.time()
        # loop variable renamed so that it no longer shadows the fft_processes list
        for fft_args in fft_processes:
            # repeat the process the same amount of time as the multiprocessing method using for loops
            fft_(*fft_args)
        end = time.time()
        print(f'{iteration}: For-loop method with {test_num} processes took: {end - start:.2f} sec')
        print('----------')
Here are the results of my test.
0: Multiprocessing method with 10 processes took: 0.84 sec
0: For-loop method with 10 processes took: 0.05 sec
----------
1: Multiprocessing method with 100 processes took: 1.46 sec
1: For-loop method with 100 processes took: 0.45 sec
----------
2: Multiprocessing method with 1000 processes took: 6.70 sec
2: For-loop method with 1000 processes took: 4.21 sec
----------
Why is the for-loop method considerably faster? Am I using the multiprocessing library correctly? Thanks.
There is a nontrivial amount of overhead to starting a new process. In addition the data has to be copied from one process to another (again with some overhead compared to a normal memory copy).
Another aspect is that you should limit the number of processes to the number of cores you have. Going over will make you incur process switching costs as well.
This, coupled with the fact that you have little computation per process, makes the switch not worthwhile.
I think if you make the signal significantly longer (10x or 100x) you should start seeing some benefits from using multiple cores.
Also check if the operations you are running are already using some parallelism. They might be implemented with threads, which are significantly cheaper than processes (but historically didn't work well in Python, due to the GIL).
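The copy overhead can be estimated without starting any process: arguments and results are pickled on the way to and from a worker, so timing a pickle round trip of one payload gives a rough idea of the cost per task. The sketch below uses a plain list of 220500 floats as a stand-in for the signal (an assumption; a NumPy array pickles differently and usually faster), and pickle_round_trip is a hypothetical helper.

```python
import pickle
import time
from typing import Any, Tuple


def pickle_round_trip(payload: Any) -> Tuple[Any, int, float]:
    """Serialize and deserialize payload; return (copy, size in bytes, seconds)."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    copy = pickle.loads(blob)
    return copy, len(blob), time.perf_counter() - start


if __name__ == "__main__":
    signal = [float(i) for i in range(220_500)]  # stand-in for the 220500-sample signal
    _, size, seconds = pickle_round_trip(signal)
    print(f"{size / 1e6:.2f} MB pickled and restored in {seconds * 1e3:.2f} ms")
```

Multiplying the measured time by the number of tasks, and comparing it with the time of one FFT, indicates whether the transfer or the computation dominates.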
I have a parallelised program using concurrent.futures/ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor as PoolExecutor
import numpy as np, timeit

start = timeit.default_timer()
n = 2

def f(samp):
    t = samp ** 10

samps = np.random.uniform(low=0, high=1, size=(100000,))
with PoolExecutor(max_workers=n) as executor:
    for _ in executor.map(f, samps):
        pass
print(f"time: {timeit.default_timer() - start}")
It takes about 3s to run.
If I run it sequentially without parallelising, i.e.:
for samp in samps: t = samp ** 10
It takes about 0.05s to run (i.e. 100,000 iterations).
Why is the parallelised version taking so much longer? NB: increasing max_workers also increases the run time. Also, this may be a silly code example, but my original code was processing 800 files and it also took longer than the sequential version.
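One way to probe where the time goes is to compare submitting one tiny task per sample with submitting a few large chunks: the amount of arithmetic stays the same, while the number of tasks the executor has to schedule changes. This is only a sketch for experimentation, not a statement about the original 800-file workload; the helper names (power_sum, split, run) and the task counts are made up, and plain Python floats are used instead of a NumPy array to keep it dependency-free.

```python
from concurrent.futures import ThreadPoolExecutor
import random
import timeit


def power_sum(chunk):
    """Do the same arithmetic as f, but for a whole chunk in one task."""
    return sum(samp ** 10 for samp in chunk)


def split(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run(samps, tasks, workers=2):
    """Return (result, seconds) when the work is split into `tasks` tasks."""
    start = timeit.default_timer()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        total = sum(executor.map(power_sum, split(samps, tasks)))
    return total, timeit.default_timer() - start


if __name__ == "__main__":
    samps = [random.random() for _ in range(100_000)]
    for tasks in (len(samps), 2):
        _, seconds = run(samps, tasks)
        print(f"{tasks:>6} tasks: {seconds:.3f} s")
```

If the chunked variant turns out much faster, the time is going into scheduling the tasks rather than into the arithmetic itself.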
I just deployed a m5.4xlarge on AWS to test the multiprocessing performance and I'm getting weird results.
multiprocessing.cpu_count() returns 16
#home I5-3570K 4cores/4threads, with a pool size of 4 : Computation took 5.15700006485 seconds
#aws m5.4xlarge 16 threads, with a pool size of 4 : Computation took 3.80112195015 seconds
#aws m5.4xlarge 16 threads, with a pool size of 8 : Computation took 3.77861309052 seconds
#aws m5.4xlarge 16 threads, with a pool size of 15 : Computation took 3.26295304298 seconds
#aws m5.4xlarge 16 threads, with a pool size of 16 : Computation took 4.16541814804 seconds
Did I do something wrong in my script?
# coding: utf-8
import hashlib
import time
from multiprocessing import Pool

#on a fresh AWS linux instance run :
#sudo yum groupinstall "Development Tools"
#sudo easy_install hashlib

def compute_hash_256(very_random_string):
    return hashlib.sha256(very_random_string).hexdigest()

if __name__ == '__main__':
    POOL_SIZE = 16 #number of threads of our computer
    pool = Pool(processes=POOL_SIZE)

    ########################### generates strings for hashing
    N_STRINGS = 3000000
    print "Generating {} strings for hashing...".format(N_STRINGS)
    random_strings = []
    padding_size = len(str(N_STRINGS))
    for i in range(N_STRINGS):
        random_strings.append(str(i).zfill(padding_size))

    ############################ hashes the strings using multiprocessing
    print "Computing {} hashes".format(len(random_strings))
    start = time.time()
    hashes = pool.map(compute_hash_256, random_strings)
    end = time.time()
    print "Computation took {} seconds".format(end-start)
Thanks
There is a rule of thumb for allocating threads: whenever you are doing computationally intensive work, the number of threads should always be less than the number of cores in the machine. If the thread count goes beyond that, the threads compete for the cores and your algorithm will take more time to return its result.
NoOfThreads < NoOfCores
you can use this code to check the number of cores
import multiprocessing
multiprocessing.cpu_count()
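Applied to a pool, that rule could look like the following sketch. Note that cpu_count() reports logical CPUs (hardware threads), and leaving exactly one of them free is just one reading of the rule above; suggested_pool_size is a made-up helper.

```python
import multiprocessing


def suggested_pool_size(reserve=1):
    """Logical CPU count minus `reserve`, but never less than 1."""
    return max(1, multiprocessing.cpu_count() - reserve)


if __name__ == "__main__":
    print("cpu_count() = {}, pool size = {}".format(multiprocessing.cpu_count(), suggested_pool_size()))
```

On the m5.4xlarge from the question, where cpu_count() returns 16, this gives a pool size of 15, which was also the fastest setting in the timings listed there.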
I'm using Parallel Python for executing a computation heavy code on multiple cores.
I have an i7-4600M processor, which has 2 cores and 4 threads.
The interesting thing is that the computation takes nearly the same time whether I use 2 or 4 threads. I wrote a little example code which demonstrates this phenomenon.
import itertools
import pp
import time

def cc(data, n):
    count = 0
    for A in data:
        for B in itertools.product((-1,0,1), repeat=n):
            inner_product = sum(a*b for a,b in zip(A,B))
            if inner_product == 0:
                count += 1
    return count

n = 9

for thread_count in (1, 2, 3, 4):
    print("Thread_count = {}".format(thread_count))
    ppservers = ()
    job_server = pp.Server(thread_count, ppservers=ppservers)

    datas = [[] for _ in range(thread_count)]
    for index, A in enumerate(itertools.product((0,1), repeat=n)):
        datas[index%thread_count].append(A)
    print("Data sizes: {}".format(map(len, datas)))

    time_start = time.time()
    jobs = [job_server.submit(cc,(data,n), (), ("itertools",)) for data in datas]
    result = sum(job() for job in jobs)
    time_end = time.time()
    print("Time = {}".format(time_end - time_start))
    print("Result = {}".format(result))
    print
Here's a short video of running the program and the cpu usage: https://www.screenr.com/1ULN When I use 2 threads, the cpu has 50% usage, if I use 4 threads, it uses 100%. But it's only slightly faster. Using 2 threads, I get a speedup of 1.8x, using 3 threads a speedup of 1.9x, and using 4 threads a speedup of 2x.
If the code is too fast, use n = 10 or n = 11. But be careful, the complexity is 6^n. So n = 10 will take 6x as long as n = 9.
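The 6^n figure follows from the two nested products: there are 2^n vectors A and, for each of them, 3^n vectors B, so 2^n * 3^n = 6^n inner products in total. A quick check for small n (pair_count is a hypothetical helper that mirrors the loops of cc without computing the inner products):

```python
import itertools


def pair_count(n):
    """Count the (A, B) pairs visited by the nested loops for a given n."""
    count = 0
    for _A in itertools.product((0, 1), repeat=n):
        for _B in itertools.product((-1, 0, 1), repeat=n):
            count += 1
    return count


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, pair_count(n), 6 ** n)
```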
2 cores and 4 threads means you have two hyperthreads on each core, which won't scale linearly, since they share resources and can get in each other's way, depending on the workload. Parallel Python uses processes and IPC behind the scenes. Each core is scheduling two distinct processes, so you're probably seeing cache thrashing (a core's cache is shared between hyperthreads).
I know this thread is a bit old but I figured some added data points might help. I ran this on a VM with 4 virtual CPUs (2.93 GHz X5670 Xeon) and 8 GB of RAM allocated. The VM was hosted on Hyper-V and is running Python 2.7.8 on Ubuntu 14.10 64-bit, but my version of PP is the fork PPFT.
In the first run the number of threads was 4. In the second I modified the for loop to go to 8.
Output: http://pastebin.com/ByF7nbfm
Adding 4 more cores, and doubling the ram, same for loop, looping for 8:
Output: http://pastebin.com/irKGWMRy