scipy.fftpack.fft with multiprocessing, how to avoid performance losses? - python

I would like to use scipy.fftpack.fft (and rfft) inside a multiprocessing structure.
I have observed significant performance losses due to an apparent incompatibility between scipy.fftpack and multiprocessing, which makes the parallelization almost pointless.
Although the issue seems well known, I could not find a solution on the web to avoid these performance losses.
Below is a minimal example showing the issue:
import time
import multiprocessing as mp
from scipy.fftpack import fft, ifft
import numpy as np
def costly_function(n_mean: int):
    start = time.time()
    x = np.ones(16385, dtype=float)
    for n in range(n_mean):
        fft(ifft(x))
    return (time.time() - start) * 1000.
n_run = 24
# ===== sequential test
sequential_times = [costly_function(500) for _ in range(n_run)]
print(f"time per run (sequential): {np.mean(sequential_times):.2f}+-{np.std(sequential_times):.2f}ms")
# ===== parallel test
with mp.Pool(12) as pool:
    parallel_times = pool.map(costly_function, [500 for _ in range(n_run)])
print(f"time per run (parallel): {np.mean(parallel_times):.2f}+-{np.std(parallel_times):.2f}ms")
On a 12-core machine running Ubuntu and Python 3.10, I get the following result:
>> time per run (sequential): 510.55+-64.64ms
>> time per run (parallel): 1254.26+-114.47ms
Note: none of these additions resolved the problem:
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
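One thing worth benchmarking (this is an addition, not part of the original post, and whether it removes the contention is only an assumption to verify) is scipy.fft, the documented successor of the legacy scipy.fftpack module, dropped into the same worker function:
import time
import numpy as np
from scipy import fft as sp_fft  # scipy.fft is the successor of the legacy scipy.fftpack

def costly_function_pocketfft(n_mean: int) -> float:
    # Same benchmark body as above, but routed through scipy.fft.
    start = time.time()
    x = np.ones(16385, dtype=float)
    for _ in range(n_mean):
        sp_fft.fft(sp_fft.ifft(x))
    return (time.time() - start) * 1000.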

Related

Simulation of Markov chain slower than in Matlab

I run the same test code in Python+NumPy and in Matlab and see that the Matlab code is faster by an order of magnitude. I want to know what the bottleneck of the Python code is and how to speed it up.
I run the following test code using Python+NumPy (the last part is the performance-sensitive one):
# Packages
import numpy as np
import time
# Number of possible outcomes
num_outcomes = 20
# Dimension of the system
dim = 50
# Number of iterations
num_iterations = int(1e7)
# Possible outcomes
outcomes = np.arange(num_outcomes)
# Possible transition matrices
matrices = [np.random.rand(dim, dim) for k in outcomes]
matrices = [mat/np.sum(mat, axis=0) for mat in matrices]
# Initial state
state = np.random.rand(dim)
state = state/np.sum(state)
# List of samples
samples = np.random.choice(outcomes, size=(num_iterations,))
samples = samples.tolist()
# === PERFORMANCE-SENSITIVE PART OF THE CODE ===
# Update the state over all iterations
start_time = time.time()
for k in range(num_iterations):
    sample = samples[k]
    matrix = matrices[sample]
    state = np.matmul(matrix, state)
end_time = time.time()
# Print the execution time
print(end_time - start_time)
I then run an equivalent code using Matlab (the last part is the performance-sensitive one):
% Number of possible outcomes
num_outcomes = 20;
% Number of dimensions
dim = 50;
% Number of iterations
num_iterations = 1e7;
% Possible outcomes
outcomes = 1:num_outcomes;
% Possible transition matrices
matrices = rand(num_outcomes, dim, dim);
matrices = matrices./sum(matrices,2);
matrices = num2cell(matrices,[2,3]);
matrices = cellfun(@shiftdim, matrices, 'UniformOutput', false);
% Initial state
state = rand(dim,1);
state = state./sum(state);
% List of samples
samples = datasample(outcomes, num_iterations);
% === PERFORMANCE-SENSITIVE PART OF THE CODE ===
% Update the state over all iterations
tic;
for k = 1:num_iterations
    sample = samples(k);
    matrix = matrices{sample};
    state = matrix * state;
end
toc;
The Python code is consistently slower than the Matlab code by an order of magnitude, and I am not sure why.
Any idea where to start?
I run the Python code with the Python 3.10 interpreter and NumPy 1.22.4. I run the Matlab code with Matlab R2022a. Both codes are run on Windows 11 Pro 64-bit on a Lenovo T14 ThinkPad with the following processor:
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 2803 MHz, 4 Core(s), 8 Logical Processor(s)
EDIT 1: I made some additional tests, and it looks like the culprit is some kind of Python-specific constant overhead at small matrix sizes.
As hpaulj and MSS suggest, this might mean that a JIT compiler could solve some of these issues. I will do my best to try this in the near future.
EDIT 2: I ran the code under PyPy 3.9-v7.3.11-win64, and although it does change the scaling and even beats CPython at small matrix sizes, it generally incurs a big overhead for this particular code.
So a JIT compiler could help if there are ways to mitigate this overhead. Otherwise a Cython implementation is probably the remaining way to go...
In the loop, the main hindrance is np.matmul(matrix, state).
If we unroll the loop:
st[1] = m[0] @ st[0]
st[2] = m[1] @ st[1] = m[1] @ m[0] @ st[0]
st[3] = m[2] @ m[1] @ m[0] @ st[0]
There is no obvious way to vectorize this chain of dependent np.matmul calls as written.
A better approach is to reduce the chain of matrices pairwise, which takes only log2(n) passes:
import numpy as np
outcomes = 20
dim = 50
num_iter = int(1e7)
mat = np.random.rand(outcomes, dim, dim)
mat = mat/mat.sum(axis=1)[..., None]
state = np.random.rand(dim)
state = state/np.sum(state)
samples = np.random.choice(np.arange(outcomes), size=(num_iter,))
a = mat[samples, ...]
# This while loop takes log2(num_iter) iterations.
# The operand order (a[1::2] @ a[::2]) keeps later matrices on the left,
# matching m[2] @ m[1] @ m[0] from the unrolled loop above; an odd-length
# remainder is carried over to the next pass.
while len(a) > 1:
    if len(a) % 2:
        a = np.concatenate([np.matmul(a[1:-1:2, ...], a[:-1:2, ...]), a[-1:]])
    else:
        a = np.matmul(a[1::2, ...], a[::2, ...])
state = np.matmul(a[0], state)
The time may be further reduced by using Numba's JIT.
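As a rough sketch of that suggestion (not part of the original answer; compiling the plain update loop with @nb.njit is an assumption about what would help), the original loop can be run over the stacked 3-D array mat defined above:
import numba as nb
import numpy as np

@nb.njit(cache=True)
def propagate(mat, samples, state):
    # Apply the sampled transition matrices one after the other,
    # exactly like the original Python loop, but compiled by Numba.
    for s in samples:
        state = np.dot(mat[s], state)
    return state

# usage: final_state = propagate(mat, samples, state)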

Is there any simple method to parallel np.einsum?

I would like to know: is there any simple method to parallelize einsum in NumPy?
I found some discussions
Numpy np.einsum array multiplication using multiple cores
Any chance of making this faster? (numpy.einsum)
numpy.tensordot() only covers binary contractions over given axes, and Numba requires writing out specific loops. Is there any simple and robust approach to parallelizing einsum (possibly including opt-einsum, tf-einsum, etc.) with arbitrary contractions?
A sample code is as follows (if necessary I can use a more complicated contraction as the example):
import numpy as np
import timeit
import time
na = nc = 1000
nb = 1000
n_iter = 10
A = np.random.random((na,nb))
B = np.random.random((nb,nc))
t_total = 0.
for i in range(n_iter):
    start = time.time()
    C = np.einsum('ij,jk->ik', A, B)
    end = time.time()
    t_total += end - start
print('AB->C',(t_total)/n_iter)
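For what it's worth (this is an addition, not part of the original post): for a GEMM-shaped contraction like 'ij,jk->ik', a commonly suggested route is to let the contraction fall through to the multithreaded BLAS backing NumPy, for example via the opt-einsum package mentioned above, which dispatches BLAS-compatible contractions to dot/tensordot under the hood:
import numpy as np
import opt_einsum as oe  # third-party package: pip install opt-einsum

na = nb = nc = 1000
A = np.random.random((na, nb))
B = np.random.random((nb, nc))

# oe.contract accepts the same subscript syntax as np.einsum, but maps
# BLAS-friendly contractions onto dot/tensordot, so the multithreaded
# BLAS (MKL/OpenBLAS) does the work across cores.
C = oe.contract('ij,jk->ik', A, B)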

Numpy matrix multiplications with multiprocessing suddenly slow down as dimension increase

I want to do some large matrix multiplications using multiprocessing.Pool.
Suddenly, when the dimension is higher than 50, it takes an extremely long time to compute.
Is there any easy way to make this faster?
I don't want to use shared memory like RawArray here, because my original code generates a random matrix each time.
The sample code is as follows.
import numpy as np
from time import time
from multiprocessing import Pool
from functools import partial
def f(d):
    a = int(10*d)
    N = int(10000/d)
    for _ in range(N):
        X = np.random.randn(a,10) @ np.random.randn(10,10)
    return X
# Dimensions
ds = [1,2,3,4,5,6,8,10,20,35,40,45,50,60,62,64,66,68,70,80,90,100]
# Serial processing
serial = []
for d in ds:
    t1 = time()
    for i in range(20):
        f(d)
    serial.append(time()-t1)
# Parallel processing
parallel = []
for d in ds:
    t1 = time()
    pool = Pool()
    for i in range(20):
        pool.apply_async(partial(f,d), args=())
    pool.close()
    pool.join()
    parallel.append(time()-t1)
# Plot
import matplotlib.pyplot as plt
plt.title('Matrix multiplication time with 10000/d repetitions')
plt.plot(ds,serial,label='serial')
plt.plot(ds,parallel,label='parallel')
plt.xlabel('d (dimension)')
plt.ylabel('Total time (sec)')
plt.legend()
plt.show()
Since the total computation cost of f(d) is the same for every d, the parallel processing times should be about equal.
But the actual output is not.
System info:
Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz
NOTE: I want to use parallel computation for a complicated internal simulation, not for sending data to the child processes.
This is for self-reference.
Here is the solution I found.
My NumPy uses MKL as its backend; the problem seems to be that MKL multithreading collides with multiprocessing.
If I run the code:
import os
os.environ['MKL_NUM_THREADS'] = '1'
before importing numpy, the problem is solved.
I just found an explanation here: https://github.com/numpy/numpy/issues/10145.
Looks like the CPU caching gets messed up when you have conflicting MKL matrix multiplications going at the same time.
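As an aside not contained in the original answer: the third-party threadpoolctl package offers a way to limit BLAS threads at runtime, without depending on the environment variable being set before NumPy is imported. A sketch (the function name is made up for illustration):
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

def f_single_threaded(d):
    # Same work as f(d) above, but with MKL/OpenBLAS pinned to one thread
    # for the duration of the matrix multiplications.
    a = int(10*d)
    N = int(10000/d)
    with threadpool_limits(limits=1, user_api="blas"):
        for _ in range(N):
            X = np.random.randn(a,10) @ np.random.randn(10,10)
    return X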

Numeric Integration Python versus Matlab

My python code takes about 6.2 seconds to run. The Matlab code runs in under 0.05 seconds. Why is this and what can I do to speed up the Python code? Is Cython the solution?
Matlab:
function X=Test
nIter=1000000;
Step=.001;
X0=1;
X=zeros(1,nIter+1); X(1)=X0;
tic
for i=1:nIter
    X(i+1)=X(i)+Step*(X(i)^2*cos(i*Step+X(i)));
end
toc
figure(1)
plot(0:nIter,X)
Python:
import numpy as np
import time

nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
x[0] = 1
start = time.time()
for i in range(1,1+nIter):
    x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
end = time.time()
print(end - start)
How to speed up your Python code
Your largest time sink is np.cos, which performs several checks on the format of the input.
These are relevant and usually negligible for high-dimensional inputs, but for your scalar input they become the bottleneck.
The solution is to use math.cos, which only accepts plain scalars as input and is therefore faster (though less flexible).
Another time sink is indexing x multiple times.
You can speed this up by having one state variable which you update and only writing to x once per iteration.
With all of this, you can speed up things by a factor of roughly ten:
import numpy as np
from math import cos

nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
state = x[0] = 1
for i in range(nIter):
    state += Step*state**2*cos(Step*i+state)
    x[i+1] = state
Now, your main problem is that your truly innermost loop happens completely in Python, i.e., you have a lot of wrapping operations that eat up time.
You can avoid this by using uFuncs (e.g., created with SymPy’s ufuncify) and using NumPy’s accumulate:
import numpy as np
from sympy.utilities.autowrap import ufuncify
from sympy.abc import t,y
from sympy import cos

nIter = 1000000
Step = 0.001

f = ufuncify([y,t], y + Step*y**2*cos(t+y))
times = np.arange(0, nIter*Step, Step)
times[0] = 1  # the first entry seeds accumulate, i.e., it is the initial condition x[0] = 1
x = f.accumulate(times)
This runs practically within an instant.
… and why that’s not what you should worry about
If your exact code (and only that) is what you care about, then you shouldn’t worry about runtime anyway, because it’s very short either way.
If, on the other hand, you use this to gauge efficiency for problems with a considerable runtime, your example will fail because it considers only one initial condition and a very simple dynamic.
Moreover, you are using the Euler method, which is either not very efficient or not very robust, depending on your step size.
The latter (Step) is absurdly low in your case, yielding much more data than you probably need.
Even with a step size of 1, you can see what's going on just fine.
If you want a robust integration in such cases, it’s almost always best to use a modern adaptive integrator, that can adjust its step size itself, e.g., here is a solution to your problem using a native Python integrator:
from math import cos
import numpy as np
from scipy.integrate import solve_ivp
T = 1000
dt = 0.001
x = solve_ivp(
    lambda t,state: state**2*cos(t+state),
    t_span = (0,T),
    t_eval = np.arange(0,T,dt),
    y0 = [1],
    rtol = 1e-5
).y
This automatically adjusts the step size to something higher, depending on the error tolerance rtol.
It still returns the same amount of output data, but that’s via interpolation of the solution.
It runs in 0.3 s for me.
How to speed up things in a scalable manner
If you still need to speed up something like this, chances are that your derivative (f) is considerably more complex than in your example and thus it is the bottleneck.
Depending on your problem, you may be able to vectorise its calculation (using NumPy or similar).
If you can’t vectorise, I wrote a module that specifically focusses on this by hard-coding your derivative under the hood.
Here is your example with a sampling step of 1:
import numpy as np
from jitcode import jitcode,y,t
from symengine import cos
T = 1000
dt = 1
ODE = jitcode([y(0)**2*cos(t+y(0))])
ODE.set_initial_value([1])
ODE.set_integrator("dop853")
x = np.hstack([ODE.integrate(t) for t in np.arange(0,T,dt)])
This runs again within an instant. While this may not be a relevant speed boost here, this is scalable to huge systems.
The difference is JIT compilation, which Matlab uses by default. Let's try your example with Numba (a Python JIT compiler).
Code
import numba as nb
import numpy as np
import time

nIter = 1000000
Step = .001

@nb.njit()
def integrate(nIter,Step):
    x = np.zeros(1+nIter)
    x[0] = 1
    for i in range(1,1+nIter):
        x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
    return x

# Avoid measuring the compilation time;
# this would also be recommendable for Matlab to get a fair comparison
res = integrate(nIter,Step)

start = time.time()
for i in range(100):
    res = integrate(nIter,Step)
end = time.time()
print((end - start)/100)
This results in 0.022s runtime per call.

How to decrease time of execution using multi-threading python

I am performing a DCT (on a Raspberry Pi). I've broken the image into 8x8 blocks. Initially I performed the DCT in a nested for loop (without multithreading) and observed that it takes about 18 seconds for a 512x512 image.
Here's the code with multiple threads:
#!/usr/bin/env python
from __future__ import print_function,division
import time
start_time = time.time()
import cv2
import numpy as np
import sys
import pylab as plt
import threading
import Queue
from numpy import empty,arange,exp,real,imag,pi
from numpy.fft import rfft,irfft
from pprint import pprint

queue = Queue.Queue()

if len(sys.argv)>1:
    im = cv2.imread(sys.argv[1])
else:
    im = cv2.imread('baboon.jpg')

im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
h, w = im.shape[:2]
DF = np.zeros((h,w))
Nb = 8

def dct2(y):
    M = y.shape[0]
    N = y.shape[1]
    a = empty([M,N],float)
    b = empty([M,N],float)
    for i in range(M):
        a[i,:] = dct(y[i,:])
    for j in range(N):
        b[:,j] = dct(a[:,j])
    queue.put(b)

def dct(y):
    N = len(y)
    y2 = empty(2*N,float)
    y2[:N] = y[:]
    y2[N:] = y[::-1]
    c = rfft(y2)
    phi = exp(-1j*pi*arange(N)/(2*N))
    return real(phi*c[:N])

def Main():
    jobs = []
    for row in range(0, h, Nb):
        for col in range(0, w, Nb):
            f = im[(row):(row+Nb), (col):(col+Nb)]
            thread = threading.Thread(target=dct2(f))
            jobs.append(thread)
            df = queue.get()
            DF[row:row+Nb, col:col+Nb] = df
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

if __name__ == "__main__":
    Main()
    cv2.imwrite('dct_img.jpg', DF)
    print("--- %s seconds ---" % (time.time() - start_time))
    plt.imshow(DF1, cmap = 'Greys')
    plt.show()
    cv2.waitKey(0)
    cv2.destroyAllWindows()
After switching to multiple threads, this code takes about 25 seconds to execute. What's wrong? Have I implemented multi-threading incorrectly? I want to reduce the time taken to perform the DCT as much as possible (1-5 seconds). Any suggestions?
Is there any other concept or method (I've read posts on multiprocessing) that would significantly reduce my execution and processing time?
Due to the GIL, all your threads are executed in sequence (not in parallel).
So you might want to switch to multiprocessing. Another option is to use Numba, which can greatly increase the speed of ordinary Python code and can also release the GIL.
In Python, you should use multithreading for performance only when mixing I/O-bound and CPU-bound tasks.
For your problem you should use multiprocessing.
Maybe the other posters are right about the GIL. But OpenCV as well as NumPy release the GIL, so I would at least expect a speedup from a multithreaded solution.
I would have a look at how many threads you are creating simultaneously. It's probably a lot, since you start one for each 8x8-pixel sub-picture. (Each time a thread is taken off the CPU and replaced by another, it incurs a small overhead, which in sum gets quite noticeable if you have a lot of threads.)
If this is the case, you would probably gain performance by not starting them all at once, but only starting as many as you have CPU cores (a few more, a few less... just experiment) and only starting the next one once a thread has finished.
Look at the answers to this question on how to do this with minimal effort.
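A minimal sketch of that bounded-worker idea (not from the original answers; it assumes a dct2_block function that returns its transformed block instead of pushing it onto a queue, and it uses a process pool rather than threads so the GIL is not an issue):
import numpy as np
from multiprocessing import Pool, cpu_count

def transform_blocks(im, dct2_block, Nb=8):
    # Cut the image into Nb x Nb blocks, transform them with a fixed-size
    # pool of workers, and stitch the results back together.
    h, w = im.shape[:2]
    coords = [(r, c) for r in range(0, h, Nb) for c in range(0, w, Nb)]
    blocks = [im[r:r+Nb, c:c+Nb] for r, c in coords]
    pool = Pool(cpu_count())
    results = pool.map(dct2_block, blocks)  # dct2_block must be a picklable, module-level function
    pool.close()
    pool.join()
    out = np.zeros((h, w))
    for (r, c), block in zip(coords, results):
        out[r:r+Nb, c:c+Nb] = block
    return out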
