Numba parallel code slower than its sequential counterpart - python

I'm new to Numba and I'm trying to port an old Fortran code to Python using Numba (version 0.54.1), but when I add parallel = True the program actually slows down. My program is very simple: I sweep the positions x and y over an L x L grid, and for each position in the grid I perform a summation:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv
# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)
# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5
# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi
# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)
lypnv = lyapunov_grid(x, y, k, N)
With parallel=False it takes about 8 s to run, whereas with parallel=True it takes about 14 s. I also tested with another code from https://github.com/animator/mandelbrot-numba, and in that case the parallelization works:
import math
import numpy as np
import numba as nb
WIDTH = 1000
MAX_ITER = 1000
@nb.njit(parallel=True)
def mandelbrot(width, max_iter):
    pixels = np.zeros((width, width, 3), dtype=np.uint8)
    for y in nb.prange(width):
        for x in range(width):
            c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
            c = 0
            for i in range(1, max_iter):
                if abs(c) > 2:
                    log_iter = math.log(i)
                    pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
                                                int(255*(1+math.cos(0.774*log_iter))/2),
                                                int(255*(1+math.cos(0.412*log_iter))/2)],
                                               dtype=np.uint8)
                    break
                c = c * c + c0
    return pixels
# compile
_ = mandelbrot(WIDTH, 10)
calcpixels = mandelbrot(WIDTH, MAX_ITER)

One main issue is that the second function call compiles the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (an int, transformed to an np.int_), while in the second call the third argument (k) is a floating-point number (a float, transformed to an np.float64). Numba recompiles the function for different parameter types because the types are deduced from the arguments, and it does not know you want an np.float64 type for the third argument (since the first time the function is compiled for an np.int_ type). One simple solution is to change the first call to:
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
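You can watch the recompilation happen via the dispatcher's signatures attribute (a quick diagnostic sketch of mine, assuming the lazily compiled version of lyapunov_grid from above):
# Each distinct argument-type combination adds an entry to .signatures.
lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)    # k is an int
lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.5, 10)  # k is a float
print(lyapunov_grid.signatures)  # two signatures: one with an integer k, one with a float64 k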
However, passing dummy arguments of the right types is not a robust way to fix the problem. You can instead specify the parameter types to Numba so that it compiles the function at declaration time. This also removes the need to artificially call the function with useless parameters. Note that N is used as a loop bound, so it is declared as an integer here:
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, int64)', parallel=True)
Note that the denominator (J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)) can become zero during the iteration, resulting in a division by zero.
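If robustness against that is needed, one option (my suggestion, not part of the original fix) is np.arctan2, which handles a zero denominator gracefully; note it returns an angle in (-pi, pi] rather than arctan's (-pi/2, pi/2), so check that this matches the intended branch:
# sketch: drop-in replacement for the beta line inside the innermost loop
beta = np.arctan2(-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0),
                  J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0))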
Another main issue comes from the allocation of many small arrays in the loop, causing contention in the standard allocator (see this post for more information). While Numba could theoretically optimize this away (i.e. replace the array with local variables), it actually does not, resulting in a huge slowdown and contention. Fortunately, in your case, you do not need to create the array on every iteration: you can create it once per outer-loop iteration and modify it in the innermost loop. Here is the optimized code:
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, int64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        J = np.ones((2, 2), dtype=np.float64)
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J[0, 1] = -k*np.cos(x)
                J[1, 1] = 1.0 - k*np.cos(x)
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv
Here are the results on an old 2-core machine (with 4 hardware threads):
Original sequential: 15.9 s
Original parallel: 11.9 s
Fix-build sequential: 15.7 s
Fix-build parallel: 10.1 s
Optimized sequential: 2.73 s
Optimized parallel: 0.94 s
The optimized implementation is much faster than the others. The optimized parallel version also scales well compared with the original one (it is 2.9 times faster than the optimized sequential version). Finally, the best version is about 12 times faster than the original parallel version. I would expect an even faster computation on a recent machine with many more cores.
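If you want to reproduce these measurements, a minimal timing sketch (my addition; the numbers are machine-dependent) looks like:
import time

lyapunov_grid(x, y, k, 10)   # warm-up call, so compilation time is excluded
t0 = time.perf_counter()
lyapunov_grid(x, y, k, N)
print(f"{time.perf_counter() - t0:.2f} s")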

Related

how to use scipy.special.expi (Exponential integral) in tensorflow?

I have to implement scipy.special.expi in TensorFlow, but I don't know how. There is no direct equivalent of this function in TensorFlow, so I'm stuck. Can someone please help?
I don't really know much about this function, but based on the Fortran implementation in SciPy (the function EIX in scipy/special/specfun/specfun.f), I have put together a TensorFlow implementation following each step there. It only covers positive values, though, as the computation for negative values includes a loop that is harder to vectorize.
import math
import tensorflow as tf

def expi(x):
    x = tf.convert_to_tensor(x)
    # When X is zero
    m_0 = tf.equal(x, 0)
    y_0 = -math.inf + tf.zeros_like(x)
    # When X is negative
    m_neg = x < 0
    # This should be -e1xb(-x) according to SciPy
    # (negative exponential integral -1)
    # Here it is just left as NaN
    y_neg = math.nan + tf.zeros_like(x)
    # When X is less or equal to 40 - Power series around x = 0
    m_le40 = x <= 40
    k = tf.range(1, 101, dtype=x.dtype)
    r = tf.cumprod(tf.expand_dims(x, -1) * k / tf.square(k + 1), axis=-1)
    ga = tf.constant(0.5772156649015328, dtype=x.dtype)  # Euler-Mascheroni constant
    y_le40 = ga + tf.log(x) + x * (1 + tf.reduce_sum(r, axis=-1))
    # Otherwise (X is greater than 40) - Asymptotic expansion (the series is not convergent)
    k = tf.range(1, 21, dtype=x.dtype)
    r = tf.cumprod(k / tf.expand_dims(x, -1), axis=-1)
    y_gt40 = tf.exp(x) / x * (1 + tf.reduce_sum(r, axis=-1))
    # Select values
    return tf.where(
        m_0, y_0, tf.where(
            m_neg, y_neg, tf.where(
                m_le40, y_le40, y_gt40)))
A small test
import tensorflow as tf
import scipy.special
import numpy as np
# Test
x = np.linspace(0, 100, 20)
y = scipy.special.expi(x)
with tf.Graph().as_default(), tf.Session() as sess:
    y_tf = sess.run(expi(x))
print(np.allclose(y, y_tf))
# True
Note however this will take more memory than SciPy, because it is unrolling the approximation loops in memory instead of computing one step at a time.
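For what it's worth, the same function should also run under TensorFlow 2.x with only minor renames (an assumption on my part: tf.log and tf.cumprod became tf.math.log and tf.math.cumprod, and sessions are no longer needed in eager mode):
# Sketch: TF2 usage of the expi defined above, after replacing tf.log with
# tf.math.log and tf.cumprod with tf.math.cumprod inside the function.
import numpy as np
import scipy.special
import tensorflow as tf

x = np.linspace(0, 100, 20)
y_tf = expi(x)  # eager execution: no Graph/Session boilerplate
print(np.allclose(scipy.special.expi(x), y_tf.numpy()))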

Fastest way to perform calculations on every NXN sub-array in 2D numpy array

I have a 2D numpy array which represents a grayscale image. I need to extract every N x N sub-array within that array, with a specified overlap between sub-arrays, and calculate a property such as the mean, standard deviation, or median.
The code below performs this task but is quite slow because it uses Python for loops. Any ideas on how to vectorize this calculation or otherwise speed it up?
import numpy as np

img = np.random.randn(100, 100)
N = 4
step = 2
h, w = img.shape
out = []
for i in range(0, h - N, step):
    outr = []
    for j in range(0, w - N, step):
        outr.append(np.mean(img[i:i+N, j:j+N]))
    out.append(outr)
out = np.array(out)
For mean and standard deviation, there is a fast cumsum based solution.
Here are timings for a 500x200 image, 30x20 window and step sizes 5 and 3. For comparison I use skimage.util.view_as_windows with numpy mean and std.
mn + sd using cumsum 1.1531693299184553 ms
mn using view_as_windows 3.495307120028883 ms
sd using view_as_windows 21.855629019846674 ms
Code:
import numpy as np
from math import gcd
from timeit import timeit
def wsum2d(A, winsz, stepsz, canoverwriteA=False):
    M, N = A.shape
    m, n = winsz
    i, j = stepsz
    for X, x, s in ((M, m, i), (N, n, j)):
        g = gcd(x, s)
        if g > 1:
            X //= g
            x //= g
            s //= g
            A = A[:X*g].reshape(X, g, -1).sum(axis=1)
        elif not canoverwriteA:
            A = A.copy()
            canoverwriteA = True
        A[x:] -= A[:-x]
        A = A.cumsum(axis=0)[x-1::s]
        A = A.T
    return A

def w2dmnsd(A, winsz, stepsz):
    # combine A and A*A into a complex, so overheads apply only once
    M21 = wsum2d(A*(A+1j), winsz, stepsz, True)
    M2, mean_ = M21.real / np.prod(winsz), M21.imag / np.prod(winsz)
    sd = np.sqrt(M2 - mean_*mean_)
    return mean_, sd
# test
np.random.seed(0)
A = np.random.random((500, 200))
wsz = (30, 20)
stpsz = (5, 3)
mn, sd = w2dmnsd(A, wsz, stpsz)
from skimage.util import view_as_windows
Av = view_as_windows(A, wsz, stpsz) # this emits a warning on my system
assert np.allclose(mn, np.mean(Av, axis=(2, 3)))
assert np.allclose(sd, np.std(Av, axis=(2, 3)))
from timeit import repeat
print('mn + sd using cumsum ', min(repeat(lambda: w2dmnsd(A, wsz, stpsz), number=100))*10, 'ms')
print('mn using view_as_windows', min(repeat(lambda: np.mean(Av, axis=(2, 3)), number=100))*10, 'ms')
print('sd using view_as_windows', min(repeat(lambda: np.std(Av, axis=(2, 3)), number=100))*10, 'ms')
If Numba is an option, the main thing to do is to avoid the list appends (it does work with list appends too, but more slowly). To make use of parallelization as well, I rewrote the implementation a bit to avoid the step argument within range, which is not supported when using parfor.
Example
import numpy as np
import numba as nb

@nb.njit(error_model='numpy', parallel=True)
def calc_p(img, N, step):
    h, w = img.shape
    i_w = (h - N) // step
    j_w = (w - N) // step
    out = np.empty((i_w, j_w))
    for i in nb.prange(0, i_w):
        for j in range(0, j_w):
            out[i, j] = np.std(img[i*step:i*step+N, j*step:j*step+N])
    return out
def calc_n(img, N, step):
    h, w = img.shape
    out = []
    for i in range(0, h - N, step):
        outr = []
        for j in range(0, w - N, step):
            outr.append(np.std(img[i:i+N, j:j+N]))
        out.append(outr)
    return np.array(out)
Timings
All timings are without compilation overhead of about 0.5s (the first call to the function is excluded from the timings).
# Data
img = np.random.randn(100, 100)
N = 4
step = 2

calc_n: 17 ms
calc_p: 0.033 ms
Because this is actually a rolling mean, there is further room for improvement if N gets larger (see the sketch below).
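For example, a summed-area table gives each window in O(1) independent of N for the mean (a sketch of my own, not the answer's code):
import numpy as np

def windowed_mean_sat(img, N, step):
    # summed-area table with a zero border, so S[i, j] == img[:i, :j].sum()
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    S[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    i = np.arange(0, img.shape[0] - N, step)
    j = np.arange(0, img.shape[1] - N, step)
    ii, jj = np.meshgrid(i, j, indexing='ij')
    # inclusion-exclusion over the four corners of each window
    tot = S[ii + N, jj + N] - S[ii, jj + N] - S[ii + N, jj] + S[ii, jj]
    return tot / (N * N)

# agrees with the question's loop (using mean instead of std)
img = np.random.randn(100, 100)
ref = np.array([[img[i:i+4, j:j+4].mean() for j in range(0, 96, 2)]
                for i in range(0, 96, 2)])
assert np.allclose(windowed_mean_sat(img, 4, 2), ref)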
You could use scikit-image's block_reduce, so your code becomes:
import numpy as np
import skimage.measure
N = 4
# Your main array
a = np.arange(9).reshape(3,3)
mean = skimage.measure.block_reduce(a, (N,N), np.mean)
std_dev = skimage.measure.block_reduce(a, (N,N), np.std)
median = skimage.measure.block_reduce(a, (N,N), np.median)
However, the above code only works for strides/steps of size 1.
For the mean, you could use mean pooling, which is available in any modern ML package (see the sketch below). As for the median and standard deviation, this seems the right approach.
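For the pooling route, here is a minimal sketch with PyTorch (an assumption on my part; any framework with average pooling would do, and the set of edge windows may differ slightly from the question's loop):
import numpy as np
import torch
import torch.nn.functional as F

img = np.random.randn(100, 100).astype(np.float32)
N, step = 4, 2
t = torch.from_numpy(img)[None, None]  # add batch and channel dimensions
means = F.avg_pool2d(t, kernel_size=N, stride=step)[0, 0].numpy()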
The general case can be solved using scipy.ndimage.generic_filter:
import numpy as np
from scipy.ndimage import generic_filter
img = np.random.randn(100, 100)
N = 4
filtered = generic_filter(img.astype(float), np.std, size=N)  # np.float is a removed NumPy alias; use the builtin float
step = 2
output = filtered[::step, ::step]
However, this may not actually run much faster than a simple for loop.
To apply a mean or median filter you can use skimage.filters.rank.mean and skimage.filters.rank.median, respectively, which should be faster; there is also scipy.ndimage.median_filter. Otherwise, the mean can also be computed effectively through a simple convolution with an (N, N) array with values 1./N^2 (see the sketch below). For the standard deviation you probably have to bite the bullet and use generic_filter, unless your step size is larger than or equal to N.
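A sketch of that convolution idea for the mean (my addition), with the stepping applied afterwards:
import numpy as np
from scipy.signal import convolve2d

img = np.random.randn(100, 100)
N, step = 4, 2
kernel = np.ones((N, N)) / N**2  # averaging kernel
means = convolve2d(img, kernel, mode='valid')[::step, ::step]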

Is there any good way to optimize the speed of this python code?

I have the following piece of code, which evaluates a numerical expression and uses it to integrate over a certain range of values. The current code runs in about 8.6 s, but I am just using mock values, and my actual arrays are much larger. In particular, my actual freq_c has shape (3800, 101) and number_bin has shape (3800, 100), which makes the code really inefficient: the total execution time will be close to 9 minutes for the actual arrays. One part of the code that is quite slow is the evaluation of k_one_third and k_two_third, for which I have also used numexpr.evaluate("..."), which speeds up the code by about 10-20%; I have avoided numexpr below so that anyone can run the code without having to install the package. Are there any more ways to improve the speed of this code? An improvement by a factor of a few would be good enough. Please note that the for loop is almost unavoidable due to memory issues, as the arrays are really large; I am manipulating one axis at a time through the loop. I also wonder whether numba jit optimisation is possible here.
import numpy as np
import scipy
from scipy.integrate import simps
import time

def k_one_third(x):
    return (2.*np.exp(-x**2)/x**(1/3) + 4./x**(1/6)*np.exp(-x)/(1+x**(1/3)))**2

def k_two_third(x):
    return (np.exp(-x**2)/x**(2/3) + 2.*x**(5/2)*np.exp(-x)/(6.+x**3))**2

def spectrum(freq_c, number_bin, frequency, gamma, theta):
    theta_gamma_factor = np.einsum('i,j->ij', theta**2, gamma**2)
    theta_gamma_factor += 1.
    t_g_bessel_factor = 1.-1./theta_gamma_factor
    number = np.concatenate((number_bin, np.zeros((number_bin.shape[0], 1), dtype=number_bin.dtype)), axis=1)
    number_theta_gamma = np.einsum('jk, ik->ijk', theta_gamma_factor**2*1./gamma**3, number)
    final = np.zeros((np.size(freq_c[:,0]), np.size(theta), np.size(frequency)))
    for i in range(np.size(frequency)):  # xrange is Python 2 only
        b_n_omega_theta_gamma = frequency[i]**2*number_theta_gamma
        eta = theta_gamma_factor**(1.5)*frequency[i]/2.
        eta = np.einsum('jk, ik->ijk', eta, 1./freq_c)
        bessel_eta = np.einsum('jl, ijl->ijl', t_g_bessel_factor, k_one_third(eta))
        bessel_eta += k_two_third(eta)
        eta = None
        integrand = np.multiply(bessel_eta, b_n_omega_theta_gamma, out=bessel_eta)
        final[:,:,i] = simps(integrand, gamma)
        integrand = None
    return final
frequency = np.linspace(1, 100, 100)
theta = np.linspace(1, 3, 100)
gamma = np.linspace(2, 200, 101)
freq_c = np.random.randint(1, 200, size=(50, 101))
number_bin = np.random.randint(1, 100, size=(50, 100))
time1 = time.time()
spectra = spectrum(freq_c, number_bin, frequency, gamma, theta)
print(time.time()-time1)
I profiled the code and found that k_one_third() and k_two_third() are slow. There are some duplicated calculations in the two functions.
By merging the two functions into one, and decorating it with @numba.jit(parallel=True), I got a 4x speedup.
import numpy as np
from numba import jit

@jit(parallel=True)
def k_one_two_third(x):
    x0 = x ** (1/3)
    x1 = np.exp(-x ** 2)
    x2 = np.exp(-x)
    one = (2*x1/x0 + 4*x2/(x**(1/6)*(x0 + 1)))**2
    two = (2*x**(5/2)*x2/(x**3 + 6) + x1/x**(2/3))**2
    return one, two
As said in the comments, large parts of the code should be rewritten to get the best performance.
I have only modified the Simpson integration and tweaked @HYRY's answer a bit. This speeds up the calculation from 26.15 s to 1.76 s (15x) on the test data you provided. By replacing the np.einsum calls with simple loops, this should end up at less than a second (about 0.4 s is left from the improved integration; 24 s are spent in k_one_two_third(x)).
For getting performance out of Numba, read the Numba performance documentation. The latest Numba version (0.39), the Intel SVML package and options like fastmath=True make quite a big impact on your example.
Code
# a bit faster than @HYRY's version
@nb.njit(parallel=True, fastmath=True, error_model='numpy')
def k_one_two_third(x):
    one = np.empty(x.shape, dtype=x.dtype)
    two = np.empty(x.shape, dtype=x.dtype)
    for i in nb.prange(x.shape[0]):
        for j in range(x.shape[1]):
            for k in range(x.shape[2]):
                x0 = x[i,j,k] ** (1/3)
                x1 = np.exp(-x[i,j,k] ** 2)
                x2 = np.exp(-x[i,j,k])
                one[i,j,k] = (2*x1/x0 + 4*x2/(x[i,j,k]**(1/6)*(x0 + 1)))**2
                two[i,j,k] = (2*x[i,j,k]**(5/2)*x2/(x[i,j,k]**3 + 6) + x1/x[i,j,k]**(2/3))**2
    return one, two

# improved integration (composite Simpson's rule with constant dx)
@nb.njit(fastmath=True)
def simpson_nb(y_in, dx):
    s = y_in[0] + y_in[-1]
    n = y_in.shape[0] // 2
    for i in range(n-1):
        s += 4. * y_in[i*2+1]
        s += 2. * y_in[i*2+2]
    s += 4 * y_in[(n-1)*2+1]
    return (dx / 3.) * s
@nb.jit(fastmath=True)
def spectrum(freq_c, number_bin, frequency, gamma, theta):
    theta_gamma_factor = np.einsum('i,j->ij', theta**2, gamma**2)
    theta_gamma_factor += 1.
    t_g_bessel_factor = 1.-1./theta_gamma_factor
    number = np.concatenate((number_bin, np.zeros((number_bin.shape[0], 1), dtype=number_bin.dtype)), axis=1)
    number_theta_gamma = np.einsum('jk, ik->ijk', theta_gamma_factor**2*1./gamma**3, number)
    final = np.empty((np.size(frequency), np.size(freq_c[:,0]), np.size(theta)))
    # assume that dx is constant over the integration
    # speed improvement over scipy.simps is about 4x
    # the numba version compared to scipy.simps(y, x) is about 60x
    dx = gamma[1] - gamma[0]
    for i in range(np.size(frequency)):
        b_n_omega_theta_gamma = frequency[i]**2*number_theta_gamma
        eta = theta_gamma_factor**(1.5)*frequency[i]/2.
        eta = np.einsum('jk, ik->ijk', eta, 1./freq_c)
        one, two = k_one_two_third(eta)
        bessel_eta = np.einsum('jl, ijl->ijl', t_g_bessel_factor, one)
        bessel_eta += two
        integrand = np.multiply(bessel_eta, b_n_omega_theta_gamma, out=bessel_eta)
        # reorder array
        for j in range(integrand.shape[0]):
            for k in range(integrand.shape[1]):
                final[i, j, k] = simpson_nb(integrand[j, k, :], dx)
    return final
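As a quick sanity check (my addition, not part of the answer), the Numba Simpson routine can be compared against SciPy on an odd-length grid like the gamma array above:
import numpy as np
from scipy.integrate import simpson  # called simps in older SciPy

gamma = np.linspace(2, 200, 101)     # odd number of points, as Simpson's rule needs
y = np.sin(gamma)
print(simpson_nb(y, gamma[1] - gamma[0]))
print(simpson(y, x=gamma))           # the two results should agree closely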

Vectorizing for loop with repeated indices in python

I am trying to optimize a snippet that gets called a lot (millions of times) so any type of speed improvement (hopefully removing the for-loop) would be great.
I am computing a correlation function of some j'th particle with all others
C_j(|r-r'|) = sqrt(E((s_j(r')-s_k(r))^2)) averaged over k.
My idea is to have a variable corrfun which bins data into some bins (the r, defined elsewhere). I find which bin of r each s_k belongs to, and this is stored in ind. So ind[0] is the index of r (and thus of corrfun) to which the j=0 point corresponds. Multiple points can fall into the same bin (in fact I want bins to be big enough to contain multiple points), so I sum together all of the (s_j(r')-s_k(r))^2 and then divide by the number of points in that bin (stored in the variable rw). The code I ended up with is the following (np is numpy):
for k, v in enumerate(ind):
    if j == k:
        continue
    corrfun[v] += (s[k]-s[j])**2
    rw[v] += 1
rw2 = rw.copy()  # copy, otherwise rw2[rw < 1] = 1 would also overwrite rw's zeros
rw2[rw < 1] = 1
corrfun = np.sqrt(np.divide(corrfun, rw2))
Note, the rw2 business is because I want to avoid divide-by-zero problems, but I do return the rw array and I want to be able to differentiate between the rw=0 and rw=1 elements. Perhaps there is a more elegant solution for this as well.
Is there a way to make the for-loop faster? While I would like to not add the self interaction (j==k) I am even ok with having self interaction if it means I can get significantly faster calculation (length of ind ~ 1E6 so self interaction is probably insignificant anyways).
Thank you!
Ilya
Edit:
Here is the full code. Note, in the full code I am averaging over j as well.
import numpy as np

def twopointcorr(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    print(r)
    corrfun = r*0
    rw = r*0
    print(maxR)
    # go through all points
    for j in range(0, n-1):
        hypot = np.sqrt((x[j]-x)**2 + (y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j == k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1
    rw2 = rw.copy()  # copy, so rw keeps its zeros
    rw2[rw < 1] = 1
    corrfun = np.sqrt(np.divide(corrfun, rw2))
    return r, corrfun, rw
I test it the following way:
from twopointcorr import twopointcorr
import numpy as np
import matplotlib.pyplot as plt
import time
n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)
print('running two point corr function')
start_time = time.time()
r,corrfun,rw = twopointcorr(x,y,s,0.1)
print("--- Execution time is %s seconds ---" % (time.time() - start_time))
fig1=plt.figure()
plt.plot(r, corrfun,'-x')
fig2=plt.figure()
plt.plot(r, rw,'-x')
plt.show()
Again, the main issue is that in the real dataset n~1E6. I can resample to make it smaller, of course, but I would love to actually crank through the dataset.
Here is code that uses broadcasting, hypot, round and bincount to remove all the loops:
def twopointcorr2(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    osub = lambda x: np.subtract.outer(x, x)   # all pairwise differences
    ind = np.clip(np.round(np.hypot(osub(x), osub(y)) / dr), 0, len(r)-1).astype(int)
    rw = np.bincount(ind.ravel())
    rw[0] -= len(x)                            # remove the n self-pairs that land in bin 0
    corrfun = np.bincount(ind.ravel(), (osub(s)**2).ravel())
    return r, corrfun, rw
To compare, I modified your code as follows:
def twopointcorr(x, y, s, dr):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    corrfun = r*0
    rw = r*0
    for j in range(0, n):
        hypot = np.sqrt((x[j]-x)**2 + (y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j == k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1
    return r, corrfun, rw
and here is the code to check the results:
import numpy as np
n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)
r1, corrfun1, rw1 = twopointcorr(x,y,s,0.1)
r2, corrfun2, rw2 = twopointcorr2(x,y,s,0.1)
assert np.allclose(r1, r2)
assert np.allclose(corrfun1, corrfun2)
assert np.allclose(rw1, rw2)
and the %timeit results:
%timeit twopointcorr(x,y,s,0.1)
%timeit twopointcorr2(x,y,s,0.1)
outputs:
1 loop, best of 3: 5.16 s per loop
10 loops, best of 3: 134 ms per loop
Your original code on my system runs in about 5.7 seconds. I fully vectorized the inner loop and got it to run in 0.39 seconds. Simply replace your "go through all points" loop with this:
import scipy.spatial

points = np.column_stack((x, y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(int)  # np.int is a removed NumPy alias
# go through all points
for j in range(n):  # n.b. previously n-1, not sure why
    ind = inds[j]
    np.add.at(corrfun, ind, (s - s[j])**2)
    np.add.at(rw, ind, 1)
    rw[ind[j]] -= 1  # subtract self
The first observation was that your hypot code was computing 2D distances, so I replaced that with cdist from SciPy to do it all in a single call. The second was that the inner for loop was slow, and thanks to an insightful comment from @hpaulj I vectorized that as well using np.add.at().
Since you asked how to vectorize the inner loop as well, I did that later. It now takes 0.25 seconds to run, for a total speedup of over 20x. Here's the final code:
points = np.column_stack((x, y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(int)  # np.int is a removed NumPy alias
sn = np.tile(s, (n, 1))        # n copies of s
diffs = (sn - sn.T)**2         # squares of pairwise differences
np.add.at(corrfun, inds, diffs)
rw = np.bincount(inds.flatten(), minlength=len(r))
np.subtract.at(rw, inds.diagonal(), 1)  # subtract self
This uses more memory but does produce a substantial speedup vs. the single-loop version above.
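To put a number on that memory cost (my arithmetic, not from the answers): each pairwise matrix is n x n float64, which is fine at the test size but explodes at the real one.
# Rough memory arithmetic for one pairwise (outer-product) matrix:
n = 1_000_000
bytes_per_matrix = 8 * n * n          # float64 entries
print(bytes_per_matrix / 1e12, "TB")  # ~8 TB, far beyond RAM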
OK, so as it turns out outer products are incredibly memory-expensive. However, using the answers from @HYRY and @JohnZwinck, I was able to write code that is still roughly linear in n in memory and computes fast (0.5 seconds for the test case):
import numpy as np

def twopointcorr(x, y, s, dr, maxR=-1):
    width = np.max(x) - np.min(x)
    height = np.max(y) - np.min(y)
    n = len(x)
    if maxR < dr:
        maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR+dr, dr)
    corrfun = r*0
    rw = r*0
    for j in range(0, n):
        ind = np.clip(np.round(np.hypot(x[j]-x, y[j]-y) / dr), 0, len(r)-1).astype(int)
        np.add.at(corrfun, ind, (s - s[j])**2)
        np.add.at(rw, ind, 1)
    rw[0] -= n
    corrfun = np.sqrt(np.divide(corrfun, np.maximum(rw, 1)))
    r = np.delete(r, -1)
    rw = np.delete(rw, -1)
    corrfun = np.delete(corrfun, -1)
    return r, corrfun, rw
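Usage mirrors the earlier test harness, e.g.:
import numpy as np

n = 1000
x, y, s = np.random.rand(n), np.random.rand(n), np.random.rand(n)
r, corrfun, rw = twopointcorr(x, y, s, 0.1)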

Correctly annotate a numba function using jit

I started with this code to calculate a simple matrix multiplication. It runs with %timeit in around 7.85s on my machine.
To try to speed this up I tried Cython, which reduced the time to 0.4s. I also want to try the Numba JIT compiler to see if I can get a similar speedup (with less effort). But adding the @jit annotation appears to give exactly the same timings (~7.8s). I know it can't figure out the types of the calculate_z_numpy() call, but I'm not sure what I can do to coerce it. Any ideas?
from numba import jit
import numpy as np

@jit('f8(c8[:],c8[:],uint)')
def calculate_z_numpy(q, z, maxiter):
    """use vector operations to update all zs and qs to create new output array"""
    output = np.resize(np.array(0, dtype=np.int32), q.shape)
    for iteration in range(maxiter):
        z = z*z + q
        done = np.greater(abs(z), 2.0)
        q = np.where(done, 0+0j, q)
        z = np.where(done, 0+0j, z)
        output = np.where(done, iteration, output)
    return output
def calc_test():
    w = h = 1000
    maxiter = 1000
    # make a list of x and y values which will represent q
    # xx and yy are the co-ordinates, for the default configuration they'll look like:
    # if we have a 1000x1000 plot
    # xx = [-2.13, -2.1242, -2.1184000000000003, ..., 0.7526000000000064, 0.7584000000000064, 0.7642000000000064]
    # yy = [1.3, 1.2948, 1.2895999999999999, ..., -1.2844000000000058, -1.2896000000000059, -1.294800000000006]
    x1, x2, y1, y2 = -2.13, 0.77, -1.3, 1.3
    x_step = (float(x2 - x1) / float(w)) * 2
    y_step = (float(y1 - y2) / float(h)) * 2
    y = np.arange(y2, y1-y_step, y_step, dtype=complex)  # np.complex is a removed alias for complex
    x = np.arange(x1, x2, x_step)
    q1 = np.empty(y.shape[0], dtype=complex)
    q1.real = x
    q1.imag = y
    # Transpose y
    x_y_square_matrix = x + y[:, np.newaxis]  # it is np.complex128
    # convert square matrix to a flattened vector using ravel
    q2 = np.ravel(x_y_square_matrix)
    # create z as a 0+0j array of the same length as q
    # note that it defaults to reals (float64) unless told otherwise
    z = np.zeros(q2.shape, np.complex128)
    output = calculate_z_numpy(q2, z, maxiter)
    print(output)

calc_test()
I figured out how to do this with some help from someone else.
@jit('i4[:](c16[:],c16[:],i4,i4[:])', nopython=True)
def calculate_z_numpy(q, z, maxiter, output):
    """update all zs and qs elementwise to fill the output array"""
    for iteration in range(maxiter):
        for i in range(len(z)):
            z[i] = z[i]*z[i] + q[i]
            if abs(z[i]) > 2:
                output[i] = iteration
                z[i] = 0+0j
                q[i] = 0+0j
    return output
What I learnt is to use numpy data structures as inputs (for the typing), but within the function to use C-like paradigms for looping.
This runs in 402 ms, which is a touch faster than the Cython code (0.45 s), so for fairly minimal work in rewriting the loop explicitly we have a Python version that is (just) faster than C.
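For completeness, a usage sketch of the typed kernel (assuming q2, z and maxiter as built in calc_test above; note the caller now allocates the output buffer):
import numpy as np

# The explicit signature expects an i4 maxiter and an i4[:] output array.
output = np.zeros(q2.shape[0], dtype=np.int32)
output = calculate_z_numpy(q2, z, maxiter, output)
print(output)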
