Efficient way of computing Kullback–Leibler divergence in Python - python

I have to compute the Kullback-Leibler Divergence (KLD) between thousands of discrete probability vectors. Currently I am using the following code but it's way too slow for my purposes. I was wondering if there is any faster way to compute KL Divergence?
import numpy as np
import scipy.stats as sc
#n is the number of data points
kld = np.zeros(n, n)
for i in range(0, n):
for j in range(0, n):
if(i != j):
kld[i, j] = sc.entropy(distributions[i, :], distributions[j, :])

Scipy's stats.entropy in its default sense invites inputs as 1D arrays giving us a scalar, which is being done in the listed question. Internally this function also allows broadcasting, which we can abuse in here for a vectorized solution.
From the docs -
scipy.stats.entropy(pk, qk=None, base=None)
If only probabilities pk
are given, the entropy is calculated as S = -sum(pk * log(pk),
axis=0).
If qk is not None, then compute the Kullback-Leibler divergence S =
sum(pk * log(pk / qk), axis=0).
In our case, we are doing these entropy calculations for each row against all rows, performing sum reductions to have a scalar at each iteration with those two nested loops. Thus, the output array would be of shape (M,M), where M is the number of rows in input array.
Now, the catch here is that stats.entropy() would sum along axis=0, so we will feed it two versions of distributions, both of whom would have the rowth-dimension brought to axis=0 for reduction along it and the other two axes interleaved - (M,1) & (1,M) to give us a (M,M) shaped output array using broadcasting.
Thus, a vectorized and much more efficient way to solve our case would be -
from scipy import stats
kld = stats.entropy(distributions.T[:,:,None], distributions.T[:,None,:])
Runtime tests and verify -
In [15]: def entropy_loopy(distrib):
...: n = distrib.shape[0] #n is the number of data points
...: kld = np.zeros((n, n))
...: for i in range(0, n):
...: for j in range(0, n):
...: if(i != j):
...: kld[i, j] = stats.entropy(distrib[i, :], distrib[j, :])
...: return kld
...:
In [16]: distrib = np.random.randint(0,9,(100,100)) # Setup input
In [17]: out = stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])
In [18]: np.allclose(entropy_loopy(distrib),out) # Verify
Out[18]: True
In [19]: %timeit entropy_loopy(distrib)
1 loops, best of 3: 800 ms per loop
In [20]: %timeit stats.entropy(distrib.T[:,:,None], distrib.T[:,None,:])
10 loops, best of 3: 104 ms per loop

Related

Loop speed up of FFT in python (with `np.einsum`)

Problem: I want to speed up my python loop containing a lot of products and summations with np.einsum, but I'm also open to any other solutions.
My function takes an vector configuration S of shape (n,n,3) (my case: n=72) and does a Fourier-Transformation on the correlation function for N*N points. The correlation function is defined as the product of every vector with every other. This gets multiplied by a cosine function of the postions of vectors times the kx and ky values. Every position i,j is in the end summed to get one point in k-space p,m:
def spin_spin(S,N):
n= len(S)
conf = np.reshape(S,(n**2,3))
chi = np.zeros((N,N))
kx = np.linspace(-5*np.pi/3,5*np.pi/3,N)
ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N)
x=np.reshape(triangular(n)[0],(n**2))
y=np.reshape(triangular(n)[1],(n**2))
for p in range(N):
for m in range(N):
for i in range(n**2):
for j in range(n**2):
chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))
return(chi,kx,ky)
My problem is that I need roughly 100*100 points which are denoted by kx*ky and the loop needs to many hours to finish this job for a lattice with 72*72 vectors.
Number of calculations: 72*72*72*72*100*100
I cannot use the built-in FFT of numpy, because of my triangular grid, so I need some other option to reduce here the computional cost.
My idea: First I recognized that reshaping the configuration into a list of vectors instead of a matrix reduces the computational cost. Furthermore I used the numba package, which also has reduced the cost, but its still too slow. I found out that a good way of calculating these kind of objects is the np.einsum function. Calculating the product of every vector with every vector is done with the following:
np.einsum('ij,kj -> ik',np.reshape(S,(72**2,3)),np.reshape(S,(72**2,3)))
The tricky part is the calculation of the term inside the np.cos. Here I want to caclulate the product between a list of shape (100,1) with the positions of the vectors (e.g. np.shape(x)=(72**2,1)). Especially I really dont know how to implement the distance in x-direction and y-direction with np.einsum.
To reproduce the code (Probably you won't need this): First you need a vector configuration. You can do it simply with np.ones((72,72,3) or you take random vectors as example with:
def spherical_to_cartesian(r, theta, phi):
'''Convert spherical coordinates (physics convention) to cartesian coordinates'''
sin_theta = np.sin(theta)
x = r * sin_theta * np.cos(phi)
y = r * sin_theta * np.sin(phi)
z = r * np.cos(theta)
return x, y, z # return a tuple
def random_directions(n, r):
'''Return ``n`` 3-vectors in random directions with radius ``r``'''
out = np.empty(shape=(n,3), dtype=np.float64)
for i in range(n):
# Pick directions randomly in solid angle
phi = random.uniform(0, 2*np.pi)
theta = np.arccos(random.uniform(-1, 1))
# unpack a tuple
x, y, z = spherical_to_cartesian(r, theta, phi)
out[i] = x, y, z
return out
S = np.reshape(random_directions(72**2,1),(72,72,3))
(The reshape in this example is needed to shape it in the function spin_spin back to the (72**2,3) shape.)
For the positions of vectors I use a triangular grid defined by
def triangular(nsize):
'''Positional arguments of the spin configuration'''
X=np.zeros((nsize,nsize))
Y=np.zeros((nsize,nsize))
for i in range(nsize):
for j in range(nsize):
X[i,j]+=1/2*j+i
Y[i,j]+=np.sqrt(3)/2*j
return(X,Y)
Optimized Numba implementation
The main problem in your code is calling external BLAS function np.dot repeatedly with extremely small data. In this code it would make more sense to calculate them only once, but if you have to do this calculations in a loop write a Numba implementation. Example
Optimized function (brute-force)
import numpy as np
import numba as nb
#nb.njit(fastmath=True,error_model="numpy",parallel=True)
def spin_spin(S,N):
n= len(S)
conf = np.reshape(S,(n**2,3))
chi = np.zeros((N,N))
kx = np.linspace(-5*np.pi/3,5*np.pi/3,N).astype(np.float32)
ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N).astype(np.float32)
x=np.reshape(triangular(n)[0],(n**2)).astype(np.float32)
y=np.reshape(triangular(n)[1],(n**2)).astype(np.float32)
#precalc some values
fact=nb.float32(2/(n**2))
conf_dot=np.dot(conf,conf.T).astype(np.float32)
for p in nb.prange(N):
for m in range(N):
#accumulating on a scalar is often beneficial
acc=nb.float32(0)
for i in range(n**2):
for j in range(n**2):
acc+= conf_dot[i,j]*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))
chi[p,m]=fact*acc
return(chi,kx,ky)
Optimized function (removing of redundant calculations)
There are a lot of redundant calculations done. This is an example on how to remove them. This is also a version which does the calculations in double precision.
#nb.njit()
def precalc(S):
#There may not be all redundancies removed
n= len(S)
conf = np.reshape(S,(n**2,3))
conf_dot=np.dot(conf,conf.T)
x=np.reshape(triangular(n)[0],(n**2))
y=np.reshape(triangular(n)[1],(n**2))
x_s=set()
y_s=set()
for i in range(n**2):
for j in range(n**2):
x_s.add((x[i]-x[j]))
y_s.add((y[i]-y[j]))
x_arr=np.sort(np.array(list(x_s)))
y_arr=np.sort(np.array(list(y_s)))
conf_dot_sel=np.zeros((x_arr.shape[0],y_arr.shape[0]))
for i in range(n**2):
for j in range(n**2):
ii=np.searchsorted(x_arr,x[i]-x[j])
jj=np.searchsorted(y_arr,y[i]-y[j])
conf_dot_sel[ii,jj]+=conf_dot[i,j]
return x_arr,y_arr,conf_dot_sel
#nb.njit(fastmath=True,error_model="numpy",parallel=True)
def spin_spin_opt_2(S,N):
chi = np.empty((N,N))
n= len(S)
kx = np.linspace(-5*np.pi/3,5*np.pi/3,N)
ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N)
x_arr,y_arr,conf_dot_sel=precalc(S)
fact=2/(n**2)
for p in nb.prange(N):
for m in range(N):
acc=nb.float32(0)
for i in range(x_arr.shape[0]):
for j in range(y_arr.shape[0]):
acc+= fact*conf_dot_sel[i,j]*np.cos(kx[p]*x_arr[i]+ ky[m]*y_arr[j])
chi[p,m]=acc
return(chi,kx,ky)
#nb.njit()
def precalc(S):
#There may not be all redundancies removed
n= len(S)
conf = np.reshape(S,(n**2,3))
conf_dot=np.dot(conf,conf.T)
x=np.reshape(triangular(n)[0],(n**2))
y=np.reshape(triangular(n)[1],(n**2))
x_s=set()
y_s=set()
for i in range(n**2):
for j in range(n**2):
x_s.add((x[i]-x[j]))
y_s.add((y[i]-y[j]))
x_arr=np.sort(np.array(list(x_s)))
y_arr=np.sort(np.array(list(y_s)))
conf_dot_sel=np.zeros((x_arr.shape[0],y_arr.shape[0]))
for i in range(n**2):
for j in range(n**2):
ii=np.searchsorted(x_arr,x[i]-x[j])
jj=np.searchsorted(y_arr,y[i]-y[j])
conf_dot_sel[ii,jj]+=conf_dot[i,j]
return x_arr,y_arr,conf_dot_sel
#nb.njit(fastmath=True,error_model="numpy",parallel=True)
def spin_spin_opt_2(S,N):
chi = np.empty((N,N))
n= len(S)
kx = np.linspace(-5*np.pi/3,5*np.pi/3,N)
ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N)
x_arr,y_arr,conf_dot_sel=precalc(S)
fact=2/(n**2)
for p in nb.prange(N):
for m in range(N):
acc=nb.float32(0)
for i in range(x_arr.shape[0]):
for j in range(y_arr.shape[0]):
acc+= fact*conf_dot_sel[i,j]*np.cos(kx[p]*x_arr[i]+ ky[m]*y_arr[j])
chi[p,m]=acc
return(chi,kx,ky)
Timings
#brute-force
%timeit res=spin_spin(S,100)
#48 s ± 671 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#new version
%timeit res_2=spin_spin_opt_2(S,100)
#5.33 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=spin_spin_opt_2(S,1000)
#1min 23s ± 2.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Edit (SVML-check)
import numba as nb
import numpy as np
#nb.njit(fastmath=True)
def foo(n):
x = np.empty(n*8, dtype=np.float64)
ret = np.empty_like(x)
for i in range(ret.size):
ret[i] += np.cos(x[i])
return ret
foo(1000)
if 'intel_svmlcc' in foo.inspect_llvm(foo.signatures[0]):
print("found")
else:
print("not found")
#found
If there is a not found read this link. It should work on Linux and Windows, but I haven't tested it on macOS.
Here is one approach to speed things up. I didn't start using np.einsum because a little tweaking of your loops was sufficient.
The main thing slowing down your code was redundant recalculations of the same thing. The nested loop here is the perpetrator:
for p in range(N):
for m in range(N):
for i in range(n**2):
for j in range(n**2):
chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))
It contains a lot of redundancy, recalculating vector operations many times.
Consider the np.dot(...): this calculation is completely independent of the points kx and ky. But only the points kx and ky required indexing with m and n. So you can run the dot products over all i and j just once, and save the result, as opposed to recalculating for each m,n (which would be 10,000 times!).
In a similar approach, no need for the vector differences between to be recalculated at each point in the lattice. At every point you calculate every vector distance, when all that is needed is to calculate the vector distances once and merely multiply this result by each lattice point.
So, having fixed the loops and used dictionaries with indices (i,j) as keys to store all the values, you can just look up the relevant value during the loop over i, j. Here is my code:
def spin_spin(S, N):
n = len(S)
conf = np.reshape(S,(n**2, 3))
chi = np.zeros((N, N))
kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)
# Minor point; no need to use triangular twice
x, y = triangular(n)
x, y = np.reshape(x,(n**2)), np.reshape(y,(n**2))
# Build a look-up for all the dot products to save calculating them many times
dot_prods = dict()
x_diffs, y_diffs = dict(), dict()
for i, j in itertools.product(range(n**2), range(n**2)):
dot_prods[(i, j)] = np.dot(conf[i], conf[j])
x_diffs[(i, j)], y_diffs[(i, j)] = x[i] - x[j], y[i] - y[j]
# Minor point; improve syntax by converting nested for loops to one line
for p, m in itertools.product(range(N), range(N)):
for i, j in itertools.product(range(n**2), range(n**2)):
# All vector operations are replaced by look ups to the dictionaries defined above
chi[p, m] += 2/(n**2)*dot_prods[(i, j)]*np.cos(kx[p]*(x_diffs[(i, j)]) + ky[m]*(y_diffs[(i, j)]))
return(chi, kx, ky)
I am running this at the moment with the dimensions you provide, on a decent machine, and the loop over i,j finishes in two minutes. That only needs to happen once; then it is just a loop over m, n. Each one of these is taking about 90 seconds, so still a 2-3 hour run time. I welcome any suggestions on how to optimise that cos calculation to speed that up!
I hit the low hanging fruit of optimization, but to give a sense of speed, the loop of i, j takes 2 minutes, and this way it runs 9,999 fewer times!

Fast inner product of two 2-d masked arrays in numpy

My problem is the following. I have two arrays X and Y of shape n, p where p >> n (e.g. n = 50, p = 10000).
I also have a mask mask (1-d array of booleans of size p) with respect to p, of small density (e.g. np.mean(mask) is 0.05).
I try to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape n, n, and is such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think it is a problem that .dot does not know that the mask is only with respect to p but I don't if its possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot !
We can use matrix-multiplication with and without the masked versions as the masked subtraction from the full version yields to us the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the reversed mask, would be slower though for a sparsey mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...: for j in range(X.shape[0]):
...: inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop

NumPy - vectorizing matrix-matrix column correlation coefficient [duplicate]

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
The numpy function correlate requires input arrays to be one-dimensional.
The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
The scipy.stats function pearsonr requires input arrays to be one-dimensional.
Correlation (default 'valid' case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so -
out = np.dot(arr_one,arr_two.T)
Correlation with the default "valid" case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
# Rowwise mean of input arrays & subtract from input arrays themeselves
A_mA = A - A.mean(1)[:, None]
B_mB = B - B.mean(1)[:, None]
# Sum of squares across rows
ssA = (A_mA**2).sum(1)
ssB = (B_mB**2).sum(1)
# Finally get corr coeff
return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
Benchmarking
This section compares runtime performance with the proposed approach against generate_correlation_map & loopy pearsonr based approach listed in the other answer.(taken from the function test_generate_correlation_map() without the value correctness verification code at the end of it). Please note the timings for the proposed approach also include a check at the start to check for equal number of columns in the two input arrays, as also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small datasize -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
#Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np
def generate_correlation_map(x, y):
"""Correlate each n with each m.
Parameters
----------
x : np.array
Shape N X T.
y : np.array
Shape M X T.
Returns
-------
np.array
N X M array in which each element is a correlation coefficient.
"""
mu_x = x.mean(1)
mu_y = y.mean(1)
n = x.shape[1]
if n != y.shape[1]:
raise ValueError('x and y must ' +
'have the same number of timepoints.')
s_x = x.std(1, ddof=n - 1)
s_y = y.std(1, ddof=n - 1)
cov = np.dot(x,
y.T) - n * np.dot(mu_x[:, np.newaxis],
mu_y[np.newaxis, :])
return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
from scipy.stats import pearsonr
def test_generate_correlation_map():
x = np.random.rand(10, 10)
y = np.random.rand(20, 10)
desired = np.empty((10, 20))
for n in range(x.shape[0]):
for m in range(y.shape[0]):
desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
actual = generate_correlation_map(x, y)
np.testing.assert_array_almost_equal(actual, desired)
For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
"""computes pearson correlation coefficient
where x is a 1D and y a 2D array"""
upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)
lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))
rho = upper / lower
return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1, 2, 3],
[ 6, 7, 12],
[ 9, 3, 1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])

Efficient weighted vector distance calculation with numpy

I want to calculate the squared euclidean distance between two sets of points, inputs and testing. inputs is typically a real array of size ~(200, N), whereas testing is typically ~(1e8, N), and N is around 10. The distances should be scaled in each dimension in N, so I'd be aggregating the expression scale[j]*(inputs[i,j] - testing[ii,j])**2 (where scale is the scaling vector) over N times. I am trying to make this as fast as possible, particularly as N can be large. My first test is
def old_version (inputs, testing, x0):
nn, d1 = testing.shape
n, d1 = inputs.shape
b = np.zeros((n, nn))
for d in xrange(d1):
b += x0[d] * (((np.tile(inputs[:, d], (nn, 1)) -
np.tile (testing[:, d], (n, 1)).T))**2).T
return b
Nothing too fancy. I then tried using scipy.spatial.distance.cdist, although I still have to loop through it to get the scaling right
def new_version (inputs, testing, x0):
# import scipy.spatial.distance as dist
nn, d1 = testing.shape
n, d1 = inputs.shape
b = np.zeros ((n, nn))
for d in xrange(d1):
b += x0[d] * dist.cdist(inputs[:, d][:, None],
testing[:, d][:, None], 'sqeuclidean')
return b
It would appear that new_version scales better (as N > 1000), but I'm not sure that I've gone as fast as possible here. Any further ideas much appreciated!
This code gave me a factor of 10 over your implementation, give it a try:
x = np.random.randn(200, 10)
y = np.random.randn(1e5, 10)
scale = np.abs(np.random.randn(1, 10))
scale_sqrt = np.sqrt(scale)
dist_map = dist.cdist(x*scale_sqrt, y*scale_sqrt, 'sqeuclidean')
These are the test results:
In [135]: %timeit suggested_version(inputs, testing, x0)
1 loops, best of 3: 341 ms per loop
In [136]: %timeit op_version(inputs, testing, x00) (NOTICE: x00 is a reshape of x0)
1 loops, best of 3: 3.37 s per loop
Just make sure than when you go for the larger N you don't get low on memory. It can really slow things down.

Why doesn't Numba improve this iteration ...?

I am trying out Numba in speeding up a function that computes a minimum conditional probability of joint occurrence.
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,2))
def cooccurance_probability(X):
P = X.shape[1]
CS = np.sum(X, axis=0) #Column Sums
D = np.empty((P, P), dtype=np.float) #Return Matrix
for i in range(P):
for j in range(P):
D[i, j] = (X[:,i] * X[:,j]).sum() / max(CS[i], CS[j])
return D
cooccurance_probability_numba = autojit(cooccurance_probability)
However I am finding that the performance of cooccurance_probability and cooccurance_probability_numba to be pretty much the same.
%timeit cooccurance_probability(X)
1 loops, best of 3: 302 ms per loop
%timeit cooccurance_probability_numba(X)
1 loops, best of 3: 307 ms per loop
Why is this? Could it be due to the numpy element by element operation?
I am following as an example:
http://nbviewer.ipython.org/github/ellisonbg/talk-sicm2-2013/blob/master/NumbaCython.ipynb
[Note: I could half the execution time due to the symmetric nature of the problem - but that isn't my main concern]
My guess would be that you're hitting the object layer instead of generating native code due to the calls to sum, which means that Numba isn't going to speed things up significantly. It just doesn't know how to optimize/translate sum (at this point). Additionally it's usually better to unroll vectorized operations into explicit loops with Numba. Notice that the ipynb that you link to only calls out to np.sqrt which I believe does get translated to machine code, and it operates on elements, not slices. I would try to expand out the sum in the inner loop as an explicit additional loop over elements, rather than taking slices and using the sum method.
My experience is that Numba can work wonders sometimes, but it doesn't speed-up arbitrary python code. You need to get a sense of the limitations and what it can optimize effectively. Also note that v0.11 is a bit different in this regard as compared to 0.12 and 0.13 due to the major refactoring that Numba went through between those versions.
Below is a solution using Josh's advice, which is spot on. It appears however the max() works fine in the below implementation. It would be great if there was a list of "safe" python / numpy functions.
Note: I reduced the dimensionality of the original matrix to 100 x 200]
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,200))
def cooccurance_probability_explicit(X):
C = X.shape[0]
P = X.shape[1]
# - Column Sums - #
CS = np.zeros((P,), dtype=np.float)
for p in range(P):
for c in range(C):
CS[p] += X[c,p]
D = np.empty((P, P), dtype=np.float) #Return Matrix
for i in range(P):
for j in range(P):
# - Compute Elemental Pairwise Sums over each Product Vector - #
pws = 0
for c in range(C):
pws += (X[c,i] * X[c,j])
D[i,j] = pws / max(CS[i], CS[j])
return D
cooccurance_probability_explicit_numba = autojit(cooccurance_probability_explicit)
%timeit results:
%timeit cooccurance_probability(X)
10 loops, best of 3: 83 ms per loop
%timeit cooccurance_probability_explicit(X)
1 loops, best of 3: 2.55s per loop
%timeit cooccurance_probability_explicit_numba(X)
100 loops, best of 3: 7.72 ms per loop
The interesting thing about the results is that the explicitly written version executed by python is very slow due to the large type checking overheads. But passing through Numba works it's magic. (Numba is ~11.5 times faster than the python solution using Numpy).
Update: Added a Cython Function for Comparison (thanks to moarningsun: Cython function with variable sized matrix input)
%load_ext cythonmagic
%%cython
import numpy as np
cimport numpy as np
def cooccurance_probability_cy(double[:,:] X):
cdef int C, P, i, j, k
C = X.shape[0]
P = X.shape[1]
cdef double pws
cdef double [:] CS = np.sum(X, axis=0)
cdef double [:,:] D = np.empty((P,P), dtype=np.float)
for i in range(P):
for j in range(P):
pws = 0.0
for c in range(C):
pws += (X[c, i] * X[c, j])
D[i,j] = pws / max(CS[i], CS[j])
return D
%timeit results:
%timeit cooccurance_probability_cy(X)
100 loops, best of 3: 12 ms per loop

Categories