I've been working on optimizing some Euclidean distance transform calculations for a program that I'm building. To preface, I have little formal training in computer science other than some MOOCs I've been taking.
I've learned through empirical testing in Python that assigning values to individual variables and performing operations on them is faster than performing operations on arrays. Is this observation reproducible for others?
If so, could someone provide a deeper explanation as to why there are such speed differences between these two forms of syntax?
Please see some example code below.
import numpy as np
from math import sqrt
import time
# Numpy array math
def test1(coords):
    results = []
    for coord in coords:
        mins = np.array([1,1,1])
        # The three lines below seem faster than np.linalg.norm()
        mins = (coord - mins)**2
        mins = np.sum(mins)
        results.append(sqrt(mins))
# Individual variable assignment math
def test2(coords):
    results = []
    for point in coords:
        z, y, x = 1, 1, 1
        z = (point[0] - z)**2
        y = (point[1] - y)**2
        x = (point[2] - x)**2
        mins = sqrt(z + y + x)
        results.append(mins)
a = np.random.randint(0, 10, (500000,3))
t = time.perf_counter()
test1(a)
print ("Test 1 speed:", time.perf_counter() - t)
t = time.perf_counter()
test2(a)
print ("Test 2 speed:", time.perf_counter() - t)
Test 1 speed: 3.261552719 s
Test 2 speed: 0.716983475 s
Plain Python operations and memory allocations are generally much slower than NumPy's highly optimized, vectorized array operations. Since you are looping over the array in Python and allocating new arrays inside the loop, you don't get any of the benefits that NumPy offers. It's especially bad in your first function because every iteration allocates several small temporary arrays.
Compare your code to one that offloads all the operations to Numpy instead of having Python do the operations one by one:
def test3(coords):
    mins = (coords - 1)**2
    results = np.sqrt(np.sum(mins, axis=1))
    return results
On my system, this results in:
Test 1 speed: 4.995761550962925
Test 2 speed: 1.3881473205983639
Test 3 speed: 0.05562112480401993
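For what it's worth, the same fully vectorized result can also be written with np.linalg.norm (a quick sketch, not timed above; it assumes a NumPy version that supports the axis argument). It should land in the same ballpark as test3, since all of the work again happens inside NumPy:
def test4(coords):
    # Row-wise Euclidean distance from each coordinate to the point (1, 1, 1)
    return np.linalg.norm(coords - 1, axis=1)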
Context:
I have 3 3D arrays ("precursor arrays") that I am upsampling with an Inverse Distance Weighting method. To do that, I calculate a 3D weights array that I use in a for loop on each point of my precursor arrays.
Each 2D slice of my weights array is used to calculate a partial array. Once I generate all 28 of them, they are summed to give one final host array.
I would like to parallelize this for loop in order to reduce my computing time. I have tried to do so, but I cannot manage to update my host arrays correctly.
Question:
How could I parallelize my main function (last section of my code)?
EDIT: Or is there a way I could "slice" my i for loop (for example, one core running from i = 0 to 5 and another core running from i = 6 to 9)?
Summary:
3 precursor arrays (temperatures, precipitations, snow): 10x4x7 (10 is a time dimension)
1 weight array (w): 28x1101x2101
28x3 partial arrays: 1101x2101
3 host arrays (temp, prec, Eprec): 1101x2101
Here is my code (runnable as it is aside from the MAIN ALGORITHM PARALLEL section; see the MAIN ALGORITHM NOT PARALLEL section at the end for the non-parallelized version of my code):
import numpy as np
import multiprocessing as mp
import time
#%% ------ Create data ------ ###
temperatures = np.random.rand(10,4,7)*100
precipitation = np.random.rand(10,4,7)
snow = np.random.rand(10,4,7)
# Array of altitudes to "adjust" the temperatures
alt = np.random.rand(4,7)*1000
#%% ------ Functions to run in parallel ------ ###
# This function upsamples the precursor arrays and creates the partial arrays
def interpolator(i, k, mx, my):
    T = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000)) * w[k,:,:]
    P = (precipitation[i,mx,my])*w[k,:,:]
    S = (snow[i,mx,my])*w[k,:,:]
    return(T, P, S)
# We add each partial array to each other to create the host array
def get_results(results):
    global temp, prec, Eprec
    temp += results[0]
    prec += results[1]
    Eprec += results[2]
#%% ------ IDW Interpolation ------ ###
# We create a weight matrix that we use to upsample our temperatures, precipitations and snow matrices
# This part is not that important, it works well as it is
MX,MY = np.shape(temperatures[0])
N = 300
T = np.zeros([N*MX+1, N*MY+1])
# create NxM inverse distance weight matrices based on Gaussian interpolation
x = np.arange(0,N*MX+1)
y = np.arange(0,N*MY+1)
X,Y = np.meshgrid(x,y)
k = 0
w = np.zeros([MX*MY,N*MX+1,N*MY+1])
for mx in range(MX):
    for my in range(MY):
        # Gaussian
        add_point = np.exp(-((mx*N-X.T)**2+(my*N-Y.T)**2)/N**2)
        w[k,:,:] += add_point
        k += 1
sum_weights = np.sum(w, axis=0)
for k in range(MX*MY):
    w[k,:,:] /= sum_weights
#%% ------ MAIN ALGORITHM PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    # Start a timer
    ts = time.time()
    # Iterate over the time dimension
    for i in range(temperatures.shape[0]):
        # Initialize the host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        # Create the pool based on my amount of cores
        pool = mp.Pool(mp.cpu_count())
        # Loop through every weight slice, for every cell of the temperatures, precipitations and snow arrays
        for k in range(0,w.shape[0]):
            for mx in range(MX):
                for my in range(MY):
                    # Upsample the temperatures, precipitations and snow arrays by adding the contribution of each weight slice
                    pool.apply_async(interpolator, args = (i, k, mx, my), callback = get_results)
        pool.close()
        pool.join()
    # Print the time spent on the loop
    print("Time spent: ", time.time()-ts)
#%% ------ MAIN ALGORITHM NOT PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    ts = time.time()
    for i in range(temperatures.shape[0]):
        # Create empty host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        k = 0
        for mx in range(MX):
            for my in range(MY):
                get_results(interpolator(i, k, mx, my))
                k += 1
    print("Time spent:", time.time()-ts)
The problem with multiprocessing is that it creates many new processes that execute all the code before the main section (i.e. before if __name__ == '__main__'). This causes a very slow initialization (since every process does it) and a huge amount of RAM being used for nothing. You should move everything into the main section or, if possible, into functions (which generally results in faster execution and is good software-engineering practice anyway, especially for parallel code). Even then, there is another huge problem with multiprocessing: inter-process communication is slow. One solution is a multi-threaded approach made possible by Numba or Cython (they can release the GIL, as opposed to plain CPython threads). In fact, they are often simpler to use than multiprocessing. However, you should be more careful with them, since parallel accesses are unprotected and data races can appear in buggy parallel code.
In your case, the computation is mostly memory-bound. This means multiprocessing is pretty useless. In fact, parallelism is barely useful here unless you are running this code on a machine with high memory throughput. Indeed, memory is a shared resource, and using more computing cores does not help much, since one core can almost saturate the memory bandwidth on a regular PC (whereas a few cores are needed to do so on computing servers).
The key to speeding up memory-bound code is to avoid creating temporary arrays and to use cache-friendly algorithms. In your case, T, P and S are filled just to be read back later to update the temp, prec and Eprec arrays. This temporary step is pretty expensive and not necessary here (especially filling the arrays). Removing it increases the arithmetic intensity, resulting in code that will certainly be faster sequentially and that can scale better on multiple cores. That is the case on my machine.
Here is an example that uses Numba to parallelize the code:
import numba as nb

# get_results + interpolator merged into a single kernel
@nb.njit('void(float64[:,::1], float64[:,::1], float64[:,::1], float64[:,:,::1], int_, int_, int_, int_)', parallel=True)
def interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my):
    factor1 = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000))
    factor2 = precipitation[i,mx,my]
    factor3 = snow[i,mx,my]
    for i in nb.prange(w.shape[1]):
        for j in range(w.shape[2]):
            val = w[k, i, j]
            temp[i, j] += factor1 * val
            prec[i, j] += factor2 * val
            Eprec[i, j] += factor3 * val

# Example of usage:
interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
Note that the string in nb.njit is called a signature; it specifies the argument types to the JIT so that it can compile the function eagerly.
This code is 4.6 times faster on my 6-core machine (while it was barely faster without merging get_results and interpolator). In fact, it is 3.8 times faster in sequential mode, so the threads do not help much, since the computation is still memory-bound. Indeed, the cost of the multiply-add is negligible compared to the memory reads/writes.
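For completeness, here is a sketch of how the merged kernel could replace the get_results(interpolator(...)) call in the non-parallel main loop of the question (variable names follow the question's code; this is only an illustration, not part of the timed benchmark):
if __name__ == '__main__':
    for i in range(temperatures.shape[0]):
        # Fresh host arrays for each time step
        temp = np.zeros((w.shape[1], w.shape[2]))
        prec = np.zeros((w.shape[1], w.shape[2]))
        Eprec = np.zeros((w.shape[1], w.shape[2]))
        k = 0
        for mx in range(MX):
            for my in range(MY):
                interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
                k += 1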
My python code takes about 6.2 seconds to run. The Matlab code runs in under 0.05 seconds. Why is this and what can I do to speed up the Python code? Is Cython the solution?
Matlab:
function X=Test
nIter=1000000;
Step=.001;
X0=1;
X=zeros(1,nIter+1); X(1)=X0;
tic
for i=1:nIter
    X(i+1)=X(i)+Step*(X(i)^2*cos(i*Step+X(i)));
end
toc
figure(1)
plot(0:nIter,X)
Python:
import numpy as np
import time

nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
x[0] = 1
start = time.time()
for i in range(1,1+nIter):
    x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
end = time.time()
print(end - start)
How to speed up your Python code
Your largest time sink is np.cos, which performs several checks on the format of the input.
These checks are relevant and usually negligible for large array inputs, but for your scalar input they become the bottleneck.
The solution to this is to use math.cos, which only accepts plain scalars as input and is thus faster (though less flexible).
Another time sink is indexing x multiple times.
You can speed this up by having one state variable which you update and only writing to x once per iteration.
With all of this, you can speed up things by a factor of roughly ten:
import numpy as np
from math import cos
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
state = x[0] = 1
for i in range(nIter):
    state += Step*state**2*cos(Step*i+state)
    x[i+1] = state
Now, your main problem is that your truly innermost loop happens completely in Python, i.e., you have a lot of wrapping operations that eat up time.
You can avoid this by using uFuncs (e.g., created with SymPy’s ufuncify) and using NumPy’s accumulate:
import numpy as np
from sympy.utilities.autowrap import ufuncify
from sympy.abc import t,y
from sympy import cos
nIter = 1000000
Step = 0.001
f = ufuncify([y,t],y+Step*y**2*cos(t+y))
times = np.arange(0,nIter*Step,Step)
times[0] = 1
x = f.accumulate(times)
This runs practically within an instant.
… and why that’s not what you should worry about
If your exact code (and only that) is what you care about, then you shouldn’t worry about runtime anyway, because it’s very short either way.
If, on the other hand, you use this to gauge efficiency for problems with a considerable runtime, your example will fail because it considers only one initial condition and a very simple dynamics.
Moreover, you are using the Euler method, which, depending on your step size, is either inefficient or not robust.
The latter (Step) is absurdly low in your case, yielding much more data than you probably need: with a step size of 1, you can see what's going on just fine.
If you want a robust integration in such cases, it's almost always best to use a modern adaptive integrator that can adjust its step size itself. For example, here is a solution to your problem using a native Python integrator:
from math import cos
import numpy as np
from scipy.integrate import solve_ivp
T = 1000
dt = 0.001
x = solve_ivp(
        lambda t,state: state**2*cos(t+state),
        t_span = (0,T),
        t_eval = np.arange(0,T,dt),
        y0 = [1],
        rtol = 1e-5
    ).y
This automatically adjusts the step size to something higher, depending on the error tolerance rtol.
It still returns the same amount of output data, but that’s via interpolation of the solution.
It runs in 0.3 s for me.
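If you then want the equivalent of the plot in the Matlab script, note that solve_ivp returns the states with shape (1, n_points), so a minimal plotting sketch (assuming matplotlib is available) would be:
import matplotlib.pyplot as plt

t = np.arange(0, T, dt)   # the same evaluation times passed as t_eval
plt.plot(t, x[0])         # x has shape (1, len(t)); take the single state component
plt.show()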
How to speed up things in a scalable manner
If you still need to speed up something like this, chances are that your derivative (f) is considerably more complex than in your example and thus it is the bottleneck.
Depending on your problem, you may be able to vectorise its calculation (using NumPy or similar).
If you can’t vectorise, I wrote a module that specifically focusses on this by hard-coding your derivative under the hood.
Here is your example with a sampling step of 1:
import numpy as np
from jitcode import jitcode,y,t
from symengine import cos
T = 1000
dt = 1
ODE = jitcode([y(0)**2*cos(t+y(0))])
ODE.set_initial_value([1])
ODE.set_integrator("dop853")
x = np.hstack([ODE.integrate(t) for t in np.arange(0,T,dt)])
This runs again within an instant. While this may not be a relevant speed boost here, this is scalable to huge systems.
The difference is JIT compilation, which Matlab uses by default. Let's try your example with Numba (a Python JIT compiler).
Code
import numba as nb
import numpy as np
import time
nIter = 1000000
Step = .001
@nb.njit()
def integrate(nIter,Step):
    x = np.zeros(1+nIter)
    x[0] = 1
    for i in range(1,1+nIter):
        x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
    return x
#Avoid measuring the compilation time;
#this would also be advisable for Matlab, to get a fair comparison
res=integrate(nIter,Step)
start = time.time()
for i in range(100):
    res=integrate(nIter,Step)
end=time.time()
print((end - start)/100)
This results in 0.022s runtime per call.
I was having problems with the accuracy of floats in Python. I need high accuracy because I want to use the explicitly written spherical Bessel functions j_n(x), which deviate (especially for n>5) from their theoretical values at low x values if numpy floats (15 precise digits) are used.
I have tried many options, especially from mpmath and sympy, in order to keep more precise numbers. I had problems combining the accuracy of mpmath inside the functions with numpy arrays, until I found the function numpy.vectorize. Finally I arrived at this solution to my initial problem:
import time
%matplotlib qt
import scipy
import numpy as np
from scipy import special
import matplotlib.pyplot as plt
from sympy import *
from mpmath import *
mp.dps=100
#explicit inaccurate
def bessel6_expi(z):
    return -((z**6-210*z**4+4725*z**2-10395)*np.sin(z)+(21*z**5-1260*z**3+10395*z)*np.cos(z))/z**7

#explicit inaccurate 1, computation time increases, a bit less inaccuracy
def bessel6_exp1(z):
    def bv(z):
        return -((z**6-210*z**4+4725*z**2-10395)*mp.sin(z)+(21*z**5-1260*z**3+10395*z)*mp.cos(z))/z**7
    bvec=np.vectorize(bv)
    return bvec(z)

#explicit accurate 2, computation time increases markedly, accurate
def bessel6_exp2(z):
    def bv(z):
        return -((mpf(z)**mpf(6)-mpf(210)*mpf(z)**mpf(4)+mpf(4725)*mpf(z)**mpf(2)-mpf(10395))*mp.sin(mpf(z))+(mpf(21)*mpf(z)**mpf(5)-mpf(1260)*mpf(z)**mpf(3)+mpf(10395)*mpf(z))*mp.cos(mpf(z)))/mpf(z)**mpf(7)
    bvec=np.vectorize(bv)
    return bvec(z)

#explicit accurate 3, computation time increases markedly, accurate
def bessel6_exp3(z):
    def bv(z):
        return -((mpf(z)**6-210*mpf(z)**4+4725*mpf(z)**2-10395)*mp.sin(mpf(z))+(21*mpf(z)**5-1260*mpf(z)**3+10395*mpf(z))*mp.cos(mpf(z)))/mpf(z)**7
    bvec=np.vectorize(bv)
    return bvec(z)

#implemented in scipy, accurate, fast
def bessel6_imp(z):
    def bv(z):
        return scipy.special.sph_jn(6,(z))[0][6]
    bvec=np.vectorize(bv)
    return bvec(z)
a=np.arange(0.0001,17,0.0001)
plt.figure()
start = time.time()
plt.plot(a,bessel6_expi(a),'b',lw=1,label='expi')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_exp1(a),'m',lw=1,label='exp1')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_exp2(a),'c',lw=3,label='exp2')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_exp3(a),'y',lw=5,linestyle='--',label='exp3')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_imp(a),'r',lw=1,label='imp')
end = time.time()
print(end - start)
plt.ylim(-0.5/10**7,2.5/10**7)
plt.xlim(0,2.0)
plt.legend()
plt.show()
The problem I have now is that just for plotting the explicit, accurate ones, it takes quite a long time (about 31 times slower than the scipy function for mp.dps=100). Smaller dps do not make these processes much faster, even with mp.dps=15, they are still 26 times slower. Is there a way to make this faster?
Note that the loss of accuracy you observe near zero comes from the fact that you are subtracting two nearly equal terms both of the form 10395 z^-6 + O(z^-4). As the true value is 1/135135 z^6 + O(z^8) you will lose a factor of ~1.4 x 10^9 z^-12 in accuracy. So if you want to calculate the value at z=0.01 to, say, 7 decimals you need to start with >40 decimals precision.
The solution is of course to avoid this cancellation. A straight-forward way of achieving this is to compute the power series around 0.
You could use sympy to obtain the power series:
>>> z = sympy.Symbol('z')
>>> f = -((z**6-210*z**4+4725*z**2-10395)*sympy.sin(z)+(21*z**5-1260*z**3+10395*z)*sympy.cos(z))/z**7
>>> f.nseries(n=20)
z**6/135135 - z**8/4054050 + z**10/275675400 - z**12/31426995600 + z**14/5279735260800 - z**16/1214339109984000 + z**18/364301732995200000 + O(z**20)
For small z a small number of terms appear to be enough for good accuracy.
>>> ply = f.nseries(n=20).removeO().as_poly()
>>> float(ply.subs(z, 0.1))
7.397541093587708e-12
You can export the coefficients for use with numpy.
>>> monoms = np.array(ply.monoms(), dtype=int).ravel()
>>> coeffs = np.array(ply.coeffs(), dtype=float)
>>>
>>> (np.linspace(-0.1, 0.1, 21)[:, None]**monoms * coeffs).sum(axis=1)
array([7.39754109e-12, 3.93160564e-12, 1.93945374e-12, 8.70461282e-13,
3.45213317e-13, 1.15615481e-13, 3.03088138e-14, 5.39444356e-15,
4.73594159e-16, 7.39998273e-18, 0.00000000e+00, 7.39998273e-18,
4.73594159e-16, 5.39444356e-15, 3.03088138e-14, 1.15615481e-13,
3.45213317e-13, 8.70461282e-13, 1.93945374e-12, 3.93160564e-12,
7.39754109e-12])
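If you need double-precision accuracy across the whole range of z without paying for mpmath, one possible follow-up (a sketch, not part of the answer above; the crossover value is a guess you would want to validate against scipy.special.spherical_jn) is to evaluate the series near zero and fall back to the closed form elsewhere:
import numpy as np

# Series coefficients for the z^6 ... z^18 terms, copied from the nseries output above
series_coeffs = np.array([1/135135, -1/4054050, 1/275675400, -1/31426995600,
                          1/5279735260800, -1/1214339109984000, 1/364301732995200000])
series_powers = np.arange(6, 20, 2)

def bessel6_hybrid(z, cutoff=2.0):
    # cutoff is an assumed crossover point; validate it against a reference implementation
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    small = np.abs(z) < cutoff
    zs = z[small]
    out[small] = (zs[:, None]**series_powers * series_coeffs).sum(axis=1)
    zl = z[~small]
    out[~small] = -((zl**6 - 210*zl**4 + 4725*zl**2 - 10395)*np.sin(zl)
                    + (21*zl**5 - 1260*zl**3 + 10395*zl)*np.cos(zl)) / zl**7
    return out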
I'm converting a Matlab script to Python and I am getting differences on the order of 10**-4 in the results.
In matlab:
f_mean=f_mean+nanmean(f);
f = f - nanmean(f);
f_t = gradient(f);
f_tt = gradient(f_t);
if n_loop==1
    theta = atan2( sum(f.*f_tt), sum(f.^2) );
end
theta = -2.2011167e+03
In Python:
f_mean = f_mean + np.nanmean(vel)
vel = vel - np.nanmean(vel)
firstDerivative = np.gradient(vel)
secondDerivative = np.gradient(firstDerivative)
if numberLoop == 1:
    theta = np.arctan2(np.sum(vel * secondDerivative),
                       np.sum([vel**2]))
Although firstDerivative and secondDerivative give the same results in Python and Matlab, f_mean is slightly different: -0.0066412 (Matlab) and -0.0066414 (Python); and so is theta: -0.4126186 (M) and -0.4124718 (P). It is a small difference, but in the end it leads to different results in my scripts.
I know some people have asked about this kind of difference before, but always regarding std (which I understand), not regarding mean values. I wonder why that is.
One possible source of the initial difference you describe (between means) could be numpy's use of pairwise summation which on large arrays will typically be appreciably more accurate than the naive method:
a = np.random.uniform(-1, 1, (10**6,))
a = np.r_[-a, a]
# so the sum should be zero
a.sum()
# 7.815970093361102e-14
# use cumsum to get naive summation:
a.cumsum()[-1]
# -1.3716805469243809e-11
Edit (thanks @sascha): for the last word, and as a "provably exact" reference, you could use math.fsum:
import math
math.fsum(a)
# 0.0
I don't have Matlab, so I can't check what it is doing.
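To get a feel for how much the summation order alone can shift a mean at this array size, you can compare a few variants directly (a small illustration, not a reproduction of Matlab's internals):
import math
import numpy as np

a = np.random.uniform(-1, 1, 2 * 10**6)
print(np.mean(a))               # pairwise summation
print(a.cumsum()[-1] / a.size)  # naive left-to-right summation
print(math.fsum(a) / a.size)    # exactly rounded sum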
Ok, so I have a matrix with 17000 rows (examples) and 300 columns (features). I basically want to compute the Euclidean distance between each possible combination of rows, i.e. the sum of the squared differences for each possible pair of rows.
Obviously it's a lot, and IPython, while not completely crashing my laptop, says "(busy)" for a while; then I can't run anything anymore, and it certainly seems to have given up, even though I can move my mouse and everything.
Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the differences in a difference matrix for each possible combination. I'm aware that the lower triangular part of the matrix equals the upper triangular part, but that would only save half the computation time (better than nothing, but not a game changer, I think).
EDIT: I just tried using scipy.spatial.distance.pdist, but it's been running for a good minute now with no end in sight. Is there a better way? I should also mention that I have NaN values in there... but that's not a problem for numpy apparently.
features = np.array(dataframe)
distances = np.zeros((17000, 17000))
def sum_diff():
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).
But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (in v0.18, see the docs)?
You would use it as:
from sklearn.metrics import pairwise
import numpy as np
a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
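Note that pairwise_distances returns the Euclidean distance by default, whereas the question asks for the sum of squared differences. Assuming SciPy is installed, you can request the squared metric directly (or simply square the result):
# Squared Euclidean distances, i.e. the sum of squared differences per pair of rows
pairwise.pairwise_distances(a, metric='sqeuclidean')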
The big thing with numpy is to avoid using loops and to let it do its magic with the vectorised operations, so there are a few basic improvements that will save you some computation time:
import numpy as np
import timeit
#I reduced the problem size to 1000*300 to keep the timing in reasonable range
n=1000
features = np.random.rand(n,300)
distances = np.zeros((n,n))
def sum_diff():
    for i in range(n):
        for j in range(n):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

#Here I removed the unnecessary copy induced by calling np.array
# -> some improvement
def sum_diff_v0():
    for i in range(n):
        for j in range(n):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

#Collapsing of the statements -> no improvement
def sum_diff_v1():
    for i in range(n):
        for j in range(n):
            distances[i][j] = np.sum(np.square(features[i] - features[j]))

# Using broadcasting and vectorized operations -> big improvement
def sum_diff_v2():
    for i in range(n):
        distances[i] = np.sum(np.square(features[i] - features),axis=1)

# Computing only half the distances -> 1/2 computation time
def sum_diff_v3():
    for i in range(n):
        distances[i][i+1:] = np.sum(np.square(features[i] - features[i+1:]),axis=1)
    distances[:] = distances + distances.T
print("original :",timeit.timeit(sum_diff, number=10))
print("v0 :",timeit.timeit(sum_diff_v0, number=10))
print("v1 :",timeit.timeit(sum_diff_v1, number=10))
print("v2 :",timeit.timeit(sum_diff_v2, number=10))
print("v3 :",timeit.timeit(sum_diff_v3, number=10))
Edit: For completeness, I also timed Camilleri's solution, which is much faster:
from sklearn.metrics import pairwise
def Camilleri_solution():
    distances=pairwise.pairwise_distances(features)
Timing results (in seconds, each function run 10 times on a 1000x300 input):
original : 138.36921879299916
v0 : 111.39915344800102
v1 : 117.7582511530054
v2 : 23.702392491002684
v3 : 9.712442981006461
Camilleri's : 0.6131987979897531
So as you can see, we can easily gain an order of magnitude by using the proper numpy syntax. Note that with only about 1/17th of the rows, the fastest loop version (v3) already takes about one second per call, so I would expect the full 17000-row problem to take several minutes, since the computation scales as N^2.
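If you want to push this further, the whole matrix of squared distances can be computed without any Python loop by expanding (a - b)^2 = a^2 - 2ab + b^2, which is roughly what the library routines do internally. A sketch (memory permitting for the full 17000x17000 result; note that any NaNs in the features will propagate into it):
def sum_diff_v4(features):
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * x_i . x_j
    sq = np.einsum('ij,ij->i', features, features)    # row-wise squared norms
    d = sq[:, None] + sq[None, :] - 2 * features @ features.T
    np.maximum(d, 0, out=d)    # clip tiny negative values caused by round-off
    return d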