Cython anomalous speed differences when casting variable from float to short?

Cython anomalous speed differences when casting variable from float to short? - python

I've noticed that when attempting to optimize a Cython loop, casting a float to a short take significantly more time for a defined (and ctyped) variable. Here is an example function with OPTION 1 and OPTION 2 denoted, one of which will be commented out when comparing the performance:
cpdef np.ndarray[np.int16_t, ndim=2] test_func(short[:, :, :] data_array):
cdef Py_ssize_t i, k, n
cdef Py_ssize_t n_pts = data_array.shape[0], length = data_array.shape[1], width = data_array.shape[2]
cdef float x_diff, y_diff, xy_sum
coeffs_array = np.zeros((length, width), dtype=np.int16)
cdef short[:, :] coeffs = coeffs_array
for i in range(length):
for k in range(width):
xy_sum = 0
for n in range(n_pts):
x_diff = data_array[n, i, k]
y_diff = data_array[n, 0, 0]
xy_sum = xy_sum + (x_diff * y_diff)
# OPTION 1
coeffs[i, k] = <short> xy_sum
# OPTION 2
coeffs[i, k] = <short> (7.235 + 2.31 + 78.123)
return coeffs_array
After compiling with one of the two options active, I tested with the following:
import numpy as np
from gen_libs.test import test_func
import time
np.random.seed(0)
jn = np.random.choice(100, size=(500, 5000, 500)).astype(np.int16)
start_time = time.time()
a = test_func(jn)
print(time.time() - start_time)
The performance of the two options changes drastically:
OPTION 1: 1.5317 seconds
OPTION 2: 0.0025 seconds
What am I missing here? It seems that xy_sum should be a simple ctyped float, just like the sum of the decimal numbers. I tested again by defining a new float variable like so:
cdef float a
a = (7.235 + 2.31 + 78.123)
coeffs[i, k] = <short> a
But again the timing was 0.0026 seconds, so what is it about xy_sum that is causing this ~1,000x slowdown? Is there something connected to the array access of x_diff and y_diff that could be an issue? I'm stumped.
EDIT:
Here's a test that appears to narrow it down to whether or not the variable getting cast from float to short was accumulative or not. No idea why this would make any difference:
Accumulative (xy_sum += x_diff)
cpdef np.ndarray[np.int16_t, ndim=2] test_func(short[:, :, :] data_array):
cdef Py_ssize_t i, k, n
cdef Py_ssize_t n_pts = data_array.shape[0], length = data_array.shape[1], width = data_array.shape[2]
cdef float x_diff, y_diff, xy_sum, a
coeffs_array = np.zeros((length, width), dtype=np.int16)
cdef short[:, :] coeffs = coeffs_array
for i in range(length):
for k in range(width):
xy_sum = 0
for n in range(n_pts):
x_diff = 0.152
# XY_SUM ACCUMULATING
xy_sum += x_diff
coeffs[i, k] = <short> xy_sum
return coeffs_array
Time: 1.068 seconds
Non-accumulative (xy_sum = x_diff)
cpdef np.ndarray[np.int16_t, ndim=2] test_func(short[:, :, :] data_array):
cdef Py_ssize_t i, k, n
cdef Py_ssize_t n_pts = data_array.shape[0], length = data_array.shape[1], width = data_array.shape[2]
cdef float x_diff, y_diff, xy_sum, a
coeffs_array = np.zeros((length, width), dtype=np.int16)
cdef short[:, :] coeffs = coeffs_array
for i in range(length):
for k in range(width):
xy_sum = 0
for n in range(n_pts):
x_diff = 0.152
# XY_SUM NON-ACCUMULATING
xy_sum = x_diff
coeffs[i, k] = <short> xy_sum
return coeffs_array
Time: 0.0025 seconds

Related

Can I make my gaussian moving average on high resolution data any faster?

I need a gaussian moving average to smooth and resample data. For example data with starting with a resolution of 1 m and then resampling so the data has a resolution of 2 km.
The width of the boxes for the Gaussian need to include 3 times the standard deviation. There are also some assumptions I make because I can, the initial data is a square, and the final resolution needed equally divides the original data into boxes.
I am currently using python 3.7 with cython. How could I make this faster?
import numpy as np
cimport numpy as np
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
cdef symmetric_g(int length_window, long center, int std):
"""
:param length_window: length L
:param center: center of window
:param std: roll off for Gaussian standard deviation
:return: gaussian weights size L x L
"""
cdef np.ndarray[dtype = double, ndim = 1] L, weights
cdef np.ndarray[dtype = double, ndim = 2 ] weights2D
L = np.linspace(0, length_window, length_window)
weights = np.exp(-(L - center) ** 2 / (2 * std) ** 2)
weights2D = np.outer(weights, weights)
return weights2D
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
def gauss_moving(data, int resolution, int final_res, int std):
cdef:
int xdim, ydim, box_width, half_width, half_pad, padded_width
int wx, wy, count, end_point, i , j
long sum_weights, accume
list ave = []
#Pad data 3*std for allow for boxes on edges to be appropriate size
data_padded = np.pad(data, 3 * std, 'constant', constant_values=np.nan)
xdim, ydim = data_padded.shape
box_width = final_res // resolution
half_width = box_width // 2
padded_width = box_width + (3 * std) * 2
half_pad = padded_width // 2
weights_central = symmetric_g(padded_width, half_width, std)
sum_weights = np.sum(weights_central)
wx, wy = np.asarray(weights_central).shape
count = 0
end_point = xdim - 3 * std
for i in range(half_pad, end_point, box_width):
for j in range(half_pad, end_point, box_width):
box = data_padded[i - half_pad:i + half_pad, j - half_pad: j + half_pad]
count += 1
if np.sum(np.isnan(box)) == 0:
print('if')
accume = 0
for k in range(wx):
for L in range(wy):
accume += box[k, L] * weights_central[k, L]
ave.append(accume / sum_weights)
else:
print('else')
#check for edge nan values and adjust weights
#nans will be excluded from weight calculation
loc_nans = np.where(np.isnan(box))
weight_nans = weights_central.copy()
weight_nans[loc_nans] = np.nan
weight_nan_sum = np.nansum(weight_nans)
accume = 0
if np.sum(np.isnan(weight_nans)) == np.sum(np.isnan(box)):
for k in range(wx):
for L in range(wy):
accume += np.nansum(box[k, L] * weight_nans[k, L])
average = np.nansum(accume / weight_nan_sum)
ave.append(average)
return ave

Cython optimization of the code

I'm struggling to boost the performance of my python particle tracking code with Cython.
Here's my pure Python code:
from scipy.integrate import odeint
import numpy as np
from numpy import sqrt, pi, sin, cos
from time import time as Time
import multiprocessing as mp
from functools import partial
cLight = 299792458.
Dim = 6
class Integrator:
def __init__(self, ring):
self.ring = ring
def equations(self, X, s):
dXds = np.zeros(Dim)
E, B = self.ring.getEMField( [X[0], X[2], s], X[4] )
h = 1 + X[0]/self.ring.ringRadius
p_s = np.sqrt(X[5]**2 - self.ring.particle.mass**2 - X[1]**2 - X[3]**2)
dtds = h*X[5]/p_s
gamma = X[5]/self.ring.particle.mass
beta = np.array( [X[1], X[3], p_s] ) / X[5]
dXds[0] = dtds*beta[0]
dXds[2] = dtds*beta[1]
dXds[1] = p_s/self.ring.ringRadius + self.ring.particle.charge*(dtds*E[0] + dXds[2]*B[2] - h*B[1])
dXds[3] = self.ring.particle.charge*(dtds*E[1] + h*B[0] - dXds[0]*B[2])
dXds[4] = dtds
dXds[5] = self.ring.particle.charge*(dXds[0]*E[0] + dXds[2]*E[1] + h*E[2])
return dXds
def odeSolve(self, X0, sRange):
sol = odeint(self.equations, X0, sRange)
return sol
class Ring:
def __init__(self, particle):
self.particle = particle
self.ringRadius = 7.112
self.magicB0 = self.particle.magicMomentum/self.ringRadius
def getEMField(self, pos, time):
x, y, s = pos
theta = (s/self.ringRadius*180/pi) % 360
r = sqrt(x**2 + y**2)
arg = 0 if r == 0 else np.angle( complex(x/r, y/r) )
rn = r/0.045
k2 = 37*24e3
k10 = -4*24e3
E = np.zeros(3)
B = np.array( [ 0, self.magicB0, 0 ] )
for i in range(4):
if ((21.9+90*i < theta < 34.9+90*i or 38.9+90*i < theta < 64.9+90*i) and (-0.05 < x < 0.05 and -0.05 < y < 0.05)):
E = np.array( [ k2*x/0.045 + k10*rn**9*cos(9*arg), -k2*y/0.045 -k10*rn**9*sin(9*arg), 0] )
break
return E, B
class Particle:
def __init__(self):
self.mass = 105.65837e6
self.charge = 1.
self.gm2 = 0.001165921
self.magicMomentum = self.mass/sqrt(self.gm2)
self.magicEnergy = sqrt(self.magicMomentum**2 + self.mass**2)
self.magicGamma = self.magicEnergy/self.mass
self.magicBeta = self.magicMomentum/(self.magicGamma*self.mass)
def runSimulation(nParticles, tEnd):
particle = Particle()
ring = Ring(particle)
integrator = Integrator(ring)
Xs = np.array( [ np.array( [45e-3*(np.random.rand()-0.5)*2, 0, 0, 0, 0, particle.magicEnergy] ) for i in range(nParticles) ] )
sRange = np.arange(0, tEnd, 1e-9)*particle.magicBeta*cLight
ode = partial(integrator.odeSolve, sRange=sRange)
t1 = Time()
pool = mp.Pool()
sol = np.array(pool.map(ode, Xs))
t2 = Time()
print ("%.3f sec" %(t2-t1))
return t2-t1
Obviously, the most time-consuming process is integrating the ODE, defined as odeSolve() and equations() in class Integrator. Also, getEMField() method in class Ring is called as much as equations() method during the solving process.
I tried to get significant amount of speed up (at least 10x~20x) using Cython, but I only got ~1.5x level of speed up by the following Cython script:
import cython
import numpy as np
cimport numpy as np
from libc.math cimport sqrt, pi, sin, cos
from scipy.integrate import odeint
from time import time as Time
import multiprocessing as mp
from functools import partial
cdef double cLight = 299792458.
cdef int Dim = 6
#cython.boundscheck(False)
cdef class Integrator:
cdef Ring ring
def __init__(self, ring):
self.ring = ring
cpdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] equations(self,
np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] X,
double s):
cdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] dXds = np.zeros(Dim)
cdef double h, p_s, dtds, gamma
cdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] beta, E, B
E, B = self.ring.getEMField( [X[0], X[2], s], X[4] )
h = 1 + X[0]/self.ring.ringRadius
p_s = np.sqrt(X[5]*X[5] - self.ring.particle.mass*self.ring.particle.mass - X[1]*X[1] - X[3]*X[3])
dtds = h*X[5]/p_s
gamma = X[5]/self.ring.particle.mass
beta = np.array( [X[1], X[3], p_s] ) / X[5]
dXds[0] = dtds*beta[0]
dXds[2] = dtds*beta[1]
dXds[1] = p_s/self.ring.ringRadius + self.ring.particle.charge*(dtds*E[0] + dXds[2]*B[2] - h*B[1])
dXds[3] = self.ring.particle.charge*(dtds*E[1] + h*B[0] - dXds[0]*B[2])
dXds[4] = dtds
dXds[5] = self.ring.particle.charge*(dXds[0]*E[0] + dXds[2]*E[1] + h*E[2])
return dXds
cpdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] odeSolve(self,
np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] X0,
np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] sRange):
sol = odeint(self.equations, X0, sRange)
return sol
#cython.boundscheck(False)
cdef class Ring:
cdef Particle particle
cdef double ringRadius
cdef double magicB0
def __init__(self, particle):
self.particle = particle
self.ringRadius = 7.112
self.magicB0 = self.particle.magicMomentum/self.ringRadius
cpdef tuple getEMField(self,
list pos,
double time):
cdef double x, y, s
cdef double theta, r, rn, arg, k2, k10
cdef np.ndarray[np.double_t, ndim=1, negative_indices=False, mode="c"] E, B
x, y, s = pos
theta = (s/self.ringRadius*180/pi) % 360
r = sqrt(x*x + y*y)
arg = 0 if r == 0 else np.angle( complex(x/r, y/r) )
rn = r/0.045
k2 = 37*24e3
k10 = -4*24e3
E = np.zeros(3)
B = np.array( [ 0, self.magicB0, 0 ] )
for i in range(4):
if ((21.9+90*i < theta < 34.9+90*i or 38.9+90*i < theta < 64.9+90*i) and (-0.05 < x < 0.05 and -0.05 < y < 0.05)):
E = np.array( [ k2*x/0.045 + k10*rn**9*cos(9*arg), -k2*y/0.045 -k10*rn**9*sin(9*arg), 0] )
#E = np.array( [ k2*x/0.045, -k2*y/0.045, 0] )
break
return E, B
cdef class Particle:
cdef double mass
cdef double charge
cdef double gm2
cdef double magicMomentum
cdef double magicEnergy
cdef double magicGamma
cdef double magicBeta
def __init__(self):
self.mass = 105.65837e6
self.charge = 1.
self.gm2 = 0.001165921
self.magicMomentum = self.mass/sqrt(self.gm2)
self.magicEnergy = sqrt(self.magicMomentum**2 + self.mass**2)
self.magicGamma = self.magicEnergy/self.mass
self.magicBeta = self.magicMomentum/(self.magicGamma*self.mass)
def runSimulation(nParticles, tEnd):
particle = Particle()
ring = Ring(particle)
integrator = Integrator(ring)
#nParticles = 5
Xs = np.array( [ np.array( [45e-3*(np.random.rand()-0.5)*2, 0, 0, 0, 0, particle.magicEnergy] ) for i in range(nParticles) ] )
sRange = np.arange(0, tEnd, 1e-9)*particle.magicBeta*cLight
ode = partial(integrator.odeSolve, sRange=sRange)
t1 = Time()
pool = mp.Pool()
sol = np.array(pool.map(ode, Xs))
t2 = Time()
print ("%.3f sec" %(t2-t1))
return t2-t1
What should I do to get the maximum effect from Cython?
(I tried Numba instead of Cython, and actually the performance gain from Numba was enormous (around ~20x speedup). But I had extremely hard time to utilize Numba with python class instances, and I decided to use Cython instead of Numba).
For reference, the following is cython annotation on its compilation:

This is a very incomplete answer since I haven't profiled or timed anything or even checked that it gives the same answer. However here are some suggestions that reduce the amount of Python code that Cython generates:
Add the #cython.cdivision(True) compilation directive. This means that a ZeroDivisionError won't be raised on float division and you'll get a NaN value instead. (Only do this if you don't want the error to be raised).
Change p_s = np.sqrt(...) to p_s = sqrt(...). This removes a numpy call that only operates on a single value. You seem to have done this elsewhere so I don't know why you missed this line.
Where possible use fixed size C arrays instead of numpy arrays:
cdef double beta[3]
# ...
beta[0] = X[1]/X[5]
beta[1] = X[3]/X[5]
beta[2] = p_s/X[5]
You can do this when the size is known at compile time (and fairly small) and when you don't want to return it. This avoids a call to np.zeros and some subsequent type-checking to assign it the the typed numpy array. I think beta is the only place you can do this.
np.angle( complex(x/r, y/r) ) can be replaced by atan2(y/r, x/r) (using atan2 from libc.math. You can also lose the division by r
cdef int i helps make your for loop faster in getEMField (Cython is often good at automatically picking up the types of loop variables but seems to have failed here)
I suspect it's quicker to assign E element-by-element than as a whole array:
E[0] = k2*x/0.045 + k10*rn**9*cos(9*arg)
E[1] = -k2*y/0.045 -k10*rn**9*sin(9*arg)
There isn't much value in specifying types like list and tuple and it may actually make the code slightly slower (because it will waste time checking the types).
A bigger change would be to pass E and B into GetEMField as pointers rather than using allocating them np.zeros. This would let you allocate them as static C arrays in equations (cdef double E[3]). The downside is that GetEMField would have to be cdef so no longer callable from Python (but you could make a Python callable wrapper function too if you like).

Nested loop optimization in cython: is there a faster way to setup this code example?

As part of a large piece of code, I need to call the (simplified) function example (pasted below) multiple (hundreds of thousands of) times, with different arguments. As such, I need this module to run quickly.
The main issue with the module seems to be the multiple nested loops. However, I am not sure if there is actually unnecessary overhead associated with these loops (as written), or if the code is really as fast it can get.
In general, when dealing with multiple nested for loops in cython, are there loop optimization techniques that can be used to reduce overhead and speed up the code? Do any of these techniques apply to the example code pasted below?
I am also compiling the cython with extra_compile_args=["-ffast-math",'-O3'], though this doesn't seem to make a huge difference.
If this code really can't get any faster in cython (which I hope is not the case), would there be any advantage to writing all or part of this module in C or Fortran?
import numpy as np
import math
cimport numpy as np
cimport cython
DTYPE = np.float
ctypedef np.float_t DTYPE_t
cdef extern from "math.h":
double log(double x) nogil
double exp(double x) nogil
double pow(double x, double y) nogil
def example(double[::1] xbg_PSF_compressed, double[::1] theta, double[::1] f_ary, double[::1] df_rho_div_f_ary, double[::1] PS_dist_compressed, int[::1] data, double Sc = 1000.0):
return example_int(xbg_PSF_compressed,theta, f_ary, df_rho_div_f_ary, PS_dist_compressed, data, Sc)
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
#cython.initializedcheck(False)
cdef double example_int(double[::1] xbg_PSF_compressed, double[::1] theta, double[::1] f_ary, double[::1] df_rho_div_f_ary, double[::1] PS_dist_compressed, int[::1] data, double Sc ):
cdef int k_max = np.max(data) + 1
cdef double A = np.float(theta[0])
cdef double n1 = np.float(theta[1])
cdef double n2 = np.float(theta[2])
cdef double Sb = np.float(theta[3])
cdef int npixROI = len(xbg_PSF_compressed)
cdef double f2 = 0.0
cdef double df_rho_div_f2 = 0.0
cdef double[:,::1] x_m_ary = np.zeros((k_max + 1,npixROI), dtype=DTYPE)
cdef double[::1] x_m_sum = np.zeros(npixROI, dtype=DTYPE)
cdef double[:,::1] x_m_ary_f = np.zeros((k_max + 1, npixROI), dtype=DTYPE)
cdef double[::1] x_m_sum_f = np.zeros(npixROI, dtype=DTYPE)
cdef double[::1] g1_ary_f = np.zeros(k_max + 1, dtype=DTYPE)
cdef double[::1] g2_ary_f = np.zeros(k_max + 1, dtype=DTYPE)
cdef Py_ssize_t f_index, p, k, n
#calculations for PS
cdef int do_half = 0
cdef double term1 = 0.0
cdef double term2 = 0.0
cdef double second_2_a = 0.0
cdef double second_2_b = 0.0
cdef double second_2_c = 0.0
cdef double second_2_d = 0.0
cdef double second_1_a = 0.0
cdef double second_1_b = 0.0
cdef double second_1_c = 0.0
cdef double second_1_d = 0.0
for f_index in range(len(f_ary)):
f2 = f_ary[f_index]
df_rho_div_f2 = df_rho_div_f_ary[f_index]
g1_ary_f = np.random.random(k_max+1)
g2_ary_f = np.random.random(k_max+1)
term1 = (A * Sb * f2) \
* (1./(n1-1.) + 1./(1.-n2) - pow(Sb / Sc, n1-1.)/(n1-1.) \
- (pow(Sb * f2, n1-1.) * g1_ary_f[0] + pow(Sb * f2, n2-1.) * g2_ary_f[0]))
second_1_a = A * pow(Sb * f2, n1)
second_1_b = A * pow(Sb * f2, n2)
for p in range(npixROI):
x_m_sum_f[p] = term1 * PS_dist_compressed[p]
x_m_sum[p] += df_rho_div_f2*x_m_sum_f[p]
second_1_c = second_1_a * PS_dist_compressed[p]
second_1_d = second_1_b * PS_dist_compressed[p]
for k in range(data[p]+1):
x_m_ary_f[k,p] = second_1_c * g1_ary_f[k] + second_1_d * g2_ary_f[k]
x_m_ary[k,p] += df_rho_div_f2*x_m_ary_f[k,p]
cdef double[::1] nu_ary = np.zeros(k_max + 1, dtype=DTYPE)
cdef double[::1] f0_ary = np.zeros(npixROI, dtype=DTYPE)
cdef double[::1] f1_ary = np.zeros(npixROI, dtype=DTYPE)
cdef double[:,::1] nu_mat = np.zeros((k_max+1, npixROI), dtype=DTYPE)
cdef double ll = 0.
for p in range(npixROI):
f0_ary[p] = -(xbg_PSF_compressed[p] + x_m_sum[p])
f1_ary[p] = (xbg_PSF_compressed[p] + x_m_ary[1,p])
nu_mat[0,p] = exp(f0_ary[p])
nu_mat[1,p] = nu_mat[0,p] * f1_ary[p]
for k in range(2,data[p]+1):
for n in range(0, k - 1):
nu_mat[k,p] += (k-n)/ float(k) * x_m_ary[k-n,p] * nu_mat[n,p]
nu_mat[k,p] += f1_ary[p] * nu_mat[k-1,p] / float(k)
ll+=log( nu_mat[data[p],p])
if math.isnan(ll) ==True or math.isinf(ll) ==True:
ll = -10.1**10.
return ll
For reference, when trying to run this code, example arguments are
f_ary=np.array([ 0.05, 0.15, 0.25 , 0.35 , 0.45 ,0.55 , 0.65 , 0.75, 0.85 , 0.95])
df_rho_div_f_ary = np.array([ 24.27277928, 2.83852471 , 1.14224844 , 0.61687863 , 0.39948536,
0.30138642 , 0.24715899 , 0.22077999 , 0.21594814 , 0.19035121])
theta=[.002, 3.01,0.01, 10.013]
n_p=1000
data= np.random.randint(1,400,n_p).astype(dtype='int32')
k_max=int(np.max(data))+1
xbg_PSF_compressed= np.ones(n_p)*20
PS_dist_compressed= np.ones(n_p)
and the example may then be called as example(k_max,xbg_PSF_compressed,theta,f_ary,df_rho_div_f_ary, PS_dist_compressed). For timing, I find that this example evaluates in ~10 loops, best of 3: 147 ms per loop. Since the full code takes hours to run, any decrease in this run time would make a big overall difference in performance.

Calling cython -a on your code shows that almost all relevant part run in pure C, so there's not much to gain here.
Still, you're overusing arrays, where a scalar could be enough. or You're using matrices when a 1D array would be enough. Doing this optimization removes a lot of memory accesses, as showcased here:
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
#cython.initializedcheck(False)
cdef double example_int(double[::1] xbg_PSF_compressed, double[::1] theta, double[::1] f_ary, double[::1] df_rho_div_f_ary, double[::1] PS_dist_compressed, int[::1] data, double Sc ):
cdef int k_max = np.max(data) + 1
cdef double A = np.float(theta[0])
cdef double n1 = np.float(theta[1])
cdef double n2 = np.float(theta[2])
cdef double Sb = np.float(theta[3])
cdef int npixROI = len(xbg_PSF_compressed)
cdef double f2 = 0.0
cdef double df_rho_div_f2 = 0.0
cdef double[:,::1] x_m_ary = np.zeros((k_max + 1,npixROI), dtype=DTYPE)
cdef double[::1] x_m_sum = np.zeros(npixROI, dtype=DTYPE)
cdef double x_m_ary_f
cdef double x_m_sum_f
cdef double[::1] g1_ary_f = np.zeros(k_max + 1, dtype=DTYPE)
cdef double[::1] g2_ary_f = np.zeros(k_max + 1, dtype=DTYPE)
cdef Py_ssize_t f_index, p, k, n
#calculations for PS
cdef int do_half = 0
cdef double term1 = 0.0
cdef double term2 = 0.0
cdef double second_2_a = 0.0
cdef double second_2_b = 0.0
cdef double second_2_c = 0.0
cdef double second_2_d = 0.0
cdef double second_1_a = 0.0
cdef double second_1_b = 0.0
cdef double second_1_c = 0.0
cdef double second_1_d = 0.0
for f_index in range(len(f_ary)):
f2 = f_ary[f_index]
df_rho_div_f2 = df_rho_div_f_ary[f_index]
g1_ary_f = np.random.random(k_max+1)
g2_ary_f = np.random.random(k_max+1)
term1 = (A * Sb * f2) \
* (1./(n1-1.) + 1./(1.-n2) - pow(Sb / Sc, n1-1.)/(n1-1.) \
- (pow(Sb * f2, n1-1.) * g1_ary_f[0] + pow(Sb * f2, n2-1.) * g2_ary_f[0]))
second_1_a = A * pow(Sb * f2, n1)
second_1_b = A * pow(Sb * f2, n2)
for p in range(npixROI):
x_m_sum_f = term1 * PS_dist_compressed[p]
x_m_sum[p] += df_rho_div_f2*x_m_sum_f
second_1_c = second_1_a * PS_dist_compressed[p]
second_1_d = second_1_b * PS_dist_compressed[p]
for k in range(data[p]+1):
x_m_ary_f = second_1_c * g1_ary_f[k] + second_1_d * g2_ary_f[k]
x_m_ary[k,p] += df_rho_div_f2*x_m_ary_f
cdef double[::1] nu_ary = np.zeros(k_max + 1, dtype=DTYPE)
cdef double f0_ary
cdef double f1_ary
cdef double[:] nu_mat = np.zeros((k_max+1), dtype=DTYPE)
cdef double ll = 0.
for p in range(npixROI):
f0_ary = -(xbg_PSF_compressed[p] + x_m_sum[p])
f1_ary = (xbg_PSF_compressed[p] + x_m_ary[1,p])
nu_mat[0] = exp(f0_ary)
nu_mat[1] = nu_mat[0] * f1_ary
for k in range(2,data[p]+1):
for n in range(0, k - 1):
nu_mat[k] += (k-n)/ float(k) * x_m_ary[k-n,p] * nu_mat[n]
nu_mat[k] += f1_ary * nu_mat[k-1] / float(k)
ll+=log( nu_mat[data[p]])
if math.isnan(ll) or math.isinf(ll):
ll = -10.1**10.
return ll
Running your benchmark on this version yields:
>>> %timeit example(xbg_PSF_compressed, theta, f_ary, df_rho_div_f_ary, PS_dist_compressed, data)
10 loops, best of 3: 74.1 ms per loop
When the original code was running much slower:
>>> %timeit example(xbg_PSF_compressed, theta, f_ary, df_rho_div_f_ary, PS_dist_compressed, data)
1 loops, best of 3: 146 ms per loop

Numba or Cython acceleration in reaction-diffusion algorithm

I'd like to accelerate the code written in Python and NumPy. I use Gray-Skott algorithm (http://mrob.com/pub/comp/xmorphia/index.html) for reaction-diffusion model, but with Numba and Cython it's even slower! Is it possible to speed it up? Thanks in advance!
Python+NumPy
def GrayScott(counts, Du, Dv, F, k):
n = 300
U = np.zeros((n+2,n+2), dtype=np.float_)
V = np.zeros((n+2,n+2), dtype=np.float_)
u, v = U[1:-1,1:-1], V[1:-1,1:-1]
r = 20
u[:] = 1.0
U[n/2-r:n/2+r,n/2-r:n/2+r] = 0.50
V[n/2-r:n/2+r,n/2-r:n/2+r] = 0.25
u += 0.15*np.random.random((n,n))
v += 0.15*np.random.random((n,n))
for i in range(counts):
Lu = ( U[0:-2,1:-1] +
U[1:-1,0:-2] - 4*U[1:-1,1:-1] + U[1:-1,2:] +
U[2: ,1:-1] )
Lv = ( V[0:-2,1:-1] +
V[1:-1,0:-2] - 4*V[1:-1,1:-1] + V[1:-1,2:] +
V[2: ,1:-1] )
uvv = u*v*v
u += Du*Lu - uvv + F*(1 - u)
v += Dv*Lv + uvv - (F + k)*v
return V
Numba
from numba import jit, autojit
#autojit
def numbaGrayScott(counts, Du, Dv, F, k):
n = 300
U = np.zeros((n+2,n+2), dtype=np.float_)
V = np.zeros((n+2,n+2), dtype=np.float_)
u, v = U[1:-1,1:-1], V[1:-1,1:-1]
r = 20
u[:] = 1.0
U[n/2-r:n/2+r,n/2-r:n/2+r] = 0.50
V[n/2-r:n/2+r,n/2-r:n/2+r] = 0.25
u += 0.15*np.random.random((n,n))
v += 0.15*np.random.random((n,n))
Lu = np.zeros_like(u)
Lv = np.zeros_like(v)
for i in range(counts):
for row in range(n):
for col in range(n):
Lu[row,col] = U[row+1,col+2] + U[row+1,col] + U[row+2,col+1] + U[row,col+1] - 4*U[row+1,col+1]
Lv[row,col] = V[row+1,col+2] + V[row+1,col] + V[row+2,col+1] + V[row,col+1] - 4*V[row+1,col+1]
uvv = u*v*v
u += Du*Lu - uvv + F*(1 - u)
v += Dv*Lv + uvv - (F + k)*v
return V
Cython
%%cython
cimport cython
import numpy as np
cimport numpy as np
cpdef cythonGrayScott(int counts, double Du, double Dv, double F, double k):
cdef int n = 300
cdef np.ndarray U = np.zeros((n+2,n+2), dtype=np.float_)
cdef np.ndarray V = np.zeros((n+2,n+2), dtype=np.float_)
cdef np.ndarray u = U[1:-1,1:-1]
cdef np.ndarray v = V[1:-1,1:-1]
cdef int r = 20
u[:] = 1.0
U[n/2-r:n/2+r,n/2-r:n/2+r] = 0.50
V[n/2-r:n/2+r,n/2-r:n/2+r] = 0.25
u += 0.15*np.random.random((n,n))
v += 0.15*np.random.random((n,n))
cdef np.ndarray Lu = np.zeros_like(u)
cdef np.ndarray Lv = np.zeros_like(v)
cdef int i, row, col
cdef np.ndarray uvv
for i in range(counts):
for row in range(n):
for col in range(n):
Lu[row,col] = U[row+1,col+2] + U[row+1,col] + U[row+2,col+1] + U[row,col+1] - 4*U[row+1,col+1]
Lv[row,col] = V[row+1,col+2] + V[row+1,col] + V[row+2,col+1] + V[row,col+1] - 4*V[row+1,col+1]
uvv = u*v*v
u += Du*Lu - uvv + F*(1 - u)
v += Dv*Lv + uvv - (F + k)*v
return V
Usage example:
GrayScott(4000, 0.16, 0.08, 0.04, 0.06)

Here is steps to speedup the cython version:
cdef np.ndarray doen't make element access faster, you need use memoryview in cython: cdef double[:, ::1] bU = U.
Turn off boundscheck and wraparound.
Do all the calculations in for-loop.
Here is the modified cython code:
%%cython
#cython: boundscheck=False
#cython: wraparound=False
cimport cython
import numpy as np
cimport numpy as np
cpdef cythonGrayScott(int counts, double Du, double Dv, double F, double k):
cdef int n = 300
cdef np.ndarray U = np.zeros((n+2,n+2), dtype=np.float_)
cdef np.ndarray V = np.zeros((n+2,n+2), dtype=np.float_)
cdef np.ndarray u = U[1:-1,1:-1]
cdef np.ndarray v = V[1:-1,1:-1]
cdef int r = 20
u[:] = 1.0
U[n/2-r:n/2+r,n/2-r:n/2+r] = 0.50
V[n/2-r:n/2+r,n/2-r:n/2+r] = 0.25
u += 0.15*np.random.random((n,n))
v += 0.15*np.random.random((n,n))
cdef np.ndarray Lu = np.zeros_like(u)
cdef np.ndarray Lv = np.zeros_like(v)
cdef int i, c, r1, c1, r2, c2
cdef double uvv
cdef double[:, ::1] bU = U
cdef double[:, ::1] bV = V
cdef double[:, ::1] bLu = Lu
cdef double[:, ::1] bLv = Lv
for i in range(counts):
for r in range(n):
r1 = r + 1
r2 = r + 2
for c in range(n):
c1 = c + 1
c2 = c + 2
bLu[r,c] = bU[r1,c2] + bU[r1,c] + bU[r2,c1] + bU[r,c1] - 4*bU[r1,c1]
bLv[r,c] = bV[r1,c2] + bV[r1,c] + bV[r2,c1] + bV[r,c1] - 4*bV[r1,c1]
for r in range(n):
r1 = r + 1
for c in range(n):
c1 = c + 1
uvv = bU[r1,c1]*bV[r1,c1]*bV[r1,c1]
bU[r1,c1] += Du*bLu[r,c] - uvv + F*(1 - bU[r1,c1])
bV[r1,c1] += Dv*bLv[r,c] + uvv - (F + k)*bV[r1,c1]
return V
It's about 11x faster than the numpy version.

Aside from the looping and the sheer volume of operations involved, what is most likely killing performance in your case is array allocation. I don't know why your Numba and Cython versions are not living up to your expectation, but you can make your numpy code 2x faster (at the cost of some readability), by doing all operations in-place, i.e. replacing your current loop with:
Lu, Lv, uvv = np.empty_like(u), np.empty_like(v), np.empty_like(u)
for i in range(counts):
Lu[:] = u
Lu *= -4
Lu += U[:-2,1:-1]
Lu += U[1:-1,:-2]
Lu += U[1:-1,2:]
Lu += U[2:,1:-1]
Lu *= Du
Lv[:] = v
Lv *= -4
Lv += V[:-2,1:-1]
Lv += V[1:-1,:-2]
Lv += V[1:-1,2:]
Lv += V[2:,1:-1]
Lv *= Dv
uvv[:] = u
uvv *= v
uvv *= v
Lu -= uvv
Lv += uvv
u *= 1 - F
u += F
u += Lu
v *= 1 - F - k
v += Lv

Calculate special correlation distance matrix faster

I would like to build a distance matrix using Pearson correlation distance.
I first tried the scipy.spatial.distance.pdist(df,'correlation') which is very fast for my 5000 rows * 20 features dataset.
Since I want to build a recommender, I wanted to slightly change the distance, only considering features which are distinct for NaN for both users. Indeed, scipy.spatial.distance.pdist(df,'correlation') output NaN when it meets any feature whose value is float('nan').
Here is my code, df being my 5000*20 pandas DataFrame
dist_mat = []
d = df.shape[1]
for i,row_i in enumerate(df.itertuples()):
for j,row_j in enumerate(df.itertuples()):
if i<j:
print(i,j)
ind = [False if (math.isnan(row_i[t+1]) or math.isnan(row_j[t+1])) else True for t in range(d)]
dist_mat.append(scipy.spatial.distance.correlation([row_i[t] for t in ind],[row_j[t] for t in ind]))
This code works but it is ashtoningly slow compared to the scipy.spatial.distance.pdist(df,'correlation') one. My question is: how can I improve my code so it runs a lot faster? Or where can I find a library which calculates correlation between two vectors which only take in consideration features which appears in both of them?
Thank you for your answers.

I think you need to do this with Cython, here is an example:
#cython: boundscheck=False, wraparound=False, cdivision=True
import numpy as np
cdef extern from "math.h":
bint isnan(double x)
double sqrt(double x)
def pair_correlation(double[:, ::1] x):
cdef double[:, ::] res = np.empty((x.shape[0], x.shape[0]))
cdef double u, v
cdef int i, j, k, count
cdef double du, dv, d, n, r
cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv
for i in range(x.shape[0]):
for j in range(i, x.shape[0]):
sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
count = 0
for k in range(x.shape[1]):
u = x[i, k]
v = x[j, k]
if u == u and v == v:
sum_u += u
sum_v += v
sum_u2 += u*u
sum_v2 += v*v
sum_uv += u*v
count += 1
if count == 0:
res[i, j] = res[j, i] = -9999
continue
um = sum_u / count
vm = sum_v / count
n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count)
dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
r = 1 - n / (du * dv)
res[i, j] = res[j, i] = r
return res.base
To check the output without NAN:
import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))
To check the output with NAN:
x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)
i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cython anomalous speed differences when casting variable from float to short? - python

Related

Can I make my gaussian moving average on high resolution data any faster?

Cython optimization of the code

Nested loop optimization in cython: is there a faster way to setup this code example?

Numba or Cython acceleration in reaction-diffusion algorithm

Calculate special correlation distance matrix faster

Categories

Resources