Why doesn't Numba improve this iteration ...?

I am trying out Numba to speed up a function that computes a minimum conditional probability of joint occurrence.
import numpy as np
from numba import double
from numba.decorators import jit, autojit

X = np.random.random((100,2))

def cooccurance_probability(X):
    P = X.shape[1]
    CS = np.sum(X, axis=0)  # Column Sums
    D = np.empty((P, P), dtype=np.float)  # Return Matrix
    for i in range(P):
        for j in range(P):
            D[i, j] = (X[:,i] * X[:,j]).sum() / max(CS[i], CS[j])
    return D

cooccurance_probability_numba = autojit(cooccurance_probability)
However, I am finding that the performance of cooccurance_probability and cooccurance_probability_numba is pretty much the same.
%timeit cooccurance_probability(X)
1 loops, best of 3: 302 ms per loop
%timeit cooccurance_probability_numba(X)
1 loops, best of 3: 307 ms per loop
Why is this? Could it be due to the numpy element-by-element operations?
I am following this notebook as an example:
http://nbviewer.ipython.org/github/ellisonbg/talk-sicm2-2013/blob/master/NumbaCython.ipynb
[Note: I could halve the execution time due to the symmetric nature of the problem - but that isn't my main concern]

My guess would be that you're hitting the object layer instead of generating native code due to the calls to sum, which means that Numba isn't going to speed things up significantly. It just doesn't know how to optimize/translate sum (at this point). Additionally it's usually better to unroll vectorized operations into explicit loops with Numba. Notice that the ipynb that you link to only calls out to np.sqrt which I believe does get translated to machine code, and it operates on elements, not slices. I would try to expand out the sum in the inner loop as an explicit additional loop over elements, rather than taking slices and using the sum method.
My experience is that Numba can work wonders sometimes, but it doesn't speed up arbitrary Python code. You need to get a sense of the limitations and what it can optimize effectively. Also note that v0.11 is a bit different in this regard compared to 0.12 and 0.13, due to the major refactoring that Numba went through between those versions.
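A minimal way to surface that fallback (a sketch added here, not from the original answer) is to request nopython mode explicitly, so Numba raises a compilation error instead of silently dropping into the object layer. With the 0.1x versions discussed above this would likely fail on the unsupported slice/sum operations; recent Numba releases compile this function natively (and make nopython the default via njit).

import numpy as np
from numba import jit

X = np.random.random((100, 2))

@jit(nopython=True)  # raise an error instead of silently using the object layer
def cooccurance_probability_forced(X):
    P = X.shape[1]
    CS = np.sum(X, axis=0)                  # column sums
    D = np.empty((P, P), dtype=np.float64)  # return matrix
    for i in range(P):
        for j in range(P):
            D[i, j] = (X[:, i] * X[:, j]).sum() / max(CS[i], CS[j])
    return D

cooccurance_probability_forced(X)  # compilation happens on the first call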

Below is a solution using Josh's advice, which is spot on. It appears, however, that max() works fine in the implementation below. It would be great if there were a list of "safe" python / numpy functions.
[Note: I reduced the dimensionality of the original matrix to 100 x 200.]
import numpy as np
from numba import double
from numba.decorators import jit, autojit

X = np.random.random((100,200))

def cooccurance_probability_explicit(X):
    C = X.shape[0]
    P = X.shape[1]
    # - Column Sums - #
    CS = np.zeros((P,), dtype=np.float)
    for p in range(P):
        for c in range(C):
            CS[p] += X[c,p]
    D = np.empty((P, P), dtype=np.float)  # Return Matrix
    for i in range(P):
        for j in range(P):
            # - Compute Elemental Pairwise Sums over each Product Vector - #
            pws = 0
            for c in range(C):
                pws += (X[c,i] * X[c,j])
            D[i,j] = pws / max(CS[i], CS[j])
    return D

cooccurance_probability_explicit_numba = autojit(cooccurance_probability_explicit)
%timeit results:
%timeit cooccurance_probability(X)
10 loops, best of 3: 83 ms per loop
%timeit cooccurance_probability_explicit(X)
1 loops, best of 3: 2.55s per loop
%timeit cooccurance_probability_explicit_numba(X)
100 loops, best of 3: 7.72 ms per loop
The interesting thing about the results is that the explicitly written version executed by pure Python is very slow due to the large type-checking overhead, but passing it through Numba works its magic (the Numba version is roughly 11 times faster than the original NumPy-based solution).
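(A side note, not from the original post: autojit was removed in later Numba releases. On current versions the equivalent one-liner is njit, which compiles in nopython mode by default.)

from numba import njit

# Modern replacement for the autojit call above, assuming the same function definition
cooccurance_probability_explicit_numba = njit(cooccurance_probability_explicit)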
Update: Added a Cython Function for Comparison (thanks to moarningsun: Cython function with variable sized matrix input)
%load_ext cythonmagic
%%cython
import numpy as np
cimport numpy as np
def cooccurance_probability_cy(double[:,:] X):
    cdef int C, P, i, j, c
    C = X.shape[0]
    P = X.shape[1]
    cdef double pws
    cdef double [:] CS = np.sum(X, axis=0)
    cdef double [:,:] D = np.empty((P,P), dtype=np.float)
    for i in range(P):
        for j in range(P):
            pws = 0.0
            for c in range(C):
                pws += (X[c, i] * X[c, j])
            D[i,j] = pws / max(CS[i], CS[j])
    return D
%timeit results:
%timeit cooccurance_probability_cy(X)
100 loops, best of 3: 12 ms per loop

Related

Why does it run faster when converted from a dense to a sparse array?

To simplify, I have the test code below:
from scipy.sparse import csr_matrix, dok_array, issparse
import numpy as np
from tqdm import tqdm

X = np.load('dense.npy')
# convert it to csr sparse matrix
#X = csr_matrix(X)
print(repr(X))

n = X.shape[0]
with tqdm(total=n*(n-1)//2) as pbar:
    cooccur = dok_array((n, n), dtype='float32')
    for i in range(n):
        for j in range(i+1, n):
            u, v = X[i], X[j]
            if issparse(u):
                u = u.toarray()[0]
                v = v.toarray()[0]
            #import pdb; pdb.set_trace()
            m = u - v
            min_uv = u - np.maximum(m, 0)
            val = np.sum(min_uv - np.abs(m) * min_uv)
            pbar.update()
Case 1: Run as is - the time usage is 2 min 54 sec.
Case 2: Uncomment the line X = csr_matrix(X) (just for the sake of comparison) - the running time is 1 min 56 sec.
It is weird and I can't figure out why it is even slower to operate on the dense array. I subsampled the array for this test; for the original array, the run-time difference between sparse and dense is big (due to the large number of iterations).
I put the code into a function and used line_profiler to see the time usage. My findings are: 1. slicing is indeed much slower for the sparse matrix; 2. the three lines above the last line are much faster in Case 2; 3. the total run time is smaller for Case 2 even though it takes extra time for slicing and converting to dense vectors.
I am confused why these three lines cost different run times in Case 1 and Case 2 - they operate on exactly the same numpy vectors in both cases. Any explanations?
The dense.npy file is uploaded here to reproduce the observation.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import issparse

n = 1_000
sparsity = 0.98
A = np.random.rand(n, n)
A[A < sparsity] = 0
As = csr_matrix(A)

def _test(X):
    n = X.shape[0]
    for i in range(n):
        for j in range(i+1, n):
            u, v = X[i], X[j]
            if issparse(u):
                u = u.toarray()[0]
                v = v.toarray()[0]
            m = u - v
            min_uv = u - np.maximum(m, 0)
            val = np.sum(min_uv - np.abs(m) * min_uv)
Running this on dense:
%timeit _test(A)
5.3 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Running this on sparse:
%timeit _test(As)
1min 10s ± 1.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
This makes sense as you're not actually using a sparse data structure for anything - you're just expensively and inefficiently converting it back to a dense data structure every time your inner loop iterates.
I don't know how you got the runtimes you report; an order-of-magnitude difference between dense and sparse is what I would expect for the code you have provided.
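To illustrate the point (a sketch, not part of the original answer): if the data fits in memory as a dense array anyway, densify once before the loop instead of calling .toarray() on every row slice, so the inner loop only ever touches plain NumPy rows.

import numpy as np
from scipy.sparse import issparse

def _test_convert_once(X):
    # hypothetical variant of _test: convert to dense once, outside the hot loop
    if issparse(X):
        X = X.toarray()
    n = X.shape[0]
    for i in range(n):
        for j in range(i+1, n):
            u, v = X[i], X[j]
            m = u - v
            min_uv = u - np.maximum(m, 0)
            val = np.sum(min_uv - np.abs(m) * min_uv)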

How can I improve a for loop in Python code?

I am translating this code from MATLAB to Python. The code functions fine, but it is painfully slow in Python. In MATLAB the code runs in well under a minute; in Python it took 30 minutes! Could someone with more experience in Python help me?
# P({ai})
somai = 0
for i in range(1, n):
    somaj = 0
    for j in range(1, n):
        exponencial = math.exp(-((a[i] - a[j]) * (a[i] - a[j])) / dev_a2 - ((b[i] - b[j]) * (b[i] - b[j])) / dev_b2)
        somaj = somaj + exponencial
    somai = somai + somaj
As with MATLAB, I'd recommend you vectorize your code. Iterating with for loops can be much slower than the lower-level implementations behind MATLAB and numpy.
Your operations (a[i] - a[j])*(a[i] - a[j]) are the pairwise squared Euclidean distances for all N data points. You can calculate a pairwise distance matrix using scipy's pdist and squareform functions -- pdist, squareform.
Then you calculate the difference between the pairwise distance matrices A and B, and sum the exponential decay. So you could get vectorized code like:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
# Example data
N = 1000
a = np.random.rand(N,1)
b = np.random.rand(N,1)
dev_a2 = np.random.rand()
dev_b2 = np.random.rand()
# `a` is an [N,1] matrix (i.e. column vector)
A = pdist(a, 'sqeuclidean')
# Change to pairwise distance matrix
A = squareform(A)
# Divide all elements by same divisor
A = A / dev_a2
# Then do the same for `b`'s
# `b` is an [N,1] matrix (i.e. column vector)
B = pdist(b, 'sqeuclidean')
B = squareform(B)
B = B / dev_b2
# Calculate exponential decay
expo = np.exp(-(A-B))
# Sum all elements
total = np.sum(expo)
Here's a quick timing comparison between the iterative method and this vectorized code.
N: 1000 | Iter Output: 2729989.851117 | Vect Output: 2732194.924364
Iter time: 6.759 secs | Vect time: 0.031 secs
N: 5000 | Iter Output: 24855530.997400 | Vect Output: 24864471.007726
Iter time: 171.795 secs | Vect time: 0.784 secs
Note that the final results are not exactly the same. I'm not sure why this is; it might be rounding error or a math error on my part, but I'll leave that to you.
TLDR
Use numpy
Why Numpy?
Python, by default, is slow. One of the strengths of Python is that it plays nicely with C and has tons of libraries. The one that will help you here is numpy. Numpy is mostly implemented in C and, when used properly, is blazing fast. The trick is to phrase the code in such a way that you keep the execution inside numpy and outside of Python proper.
Code and Results
import math
import numpy as np

n = 1000
np_a = np.random.rand(n)
a = list(np_a)
np_b = np.random.rand(n)
b = list(np_b)
dev_a2, dev_b2 = (1, 1)

def old():
    somai = 0.0
    for i in range(0, n):
        somaj = 0.0
        for j in range(0, n):
            tmp_1 = -((a[i] - a[j]) * (a[i] - a[j])) / dev_a2
            tmp_2 = -((b[i] - b[j]) * (b[i] - b[j])) / dev_b2
            exponencial = math.exp(tmp_1 + tmp_2)
            somaj += exponencial
        somai += somaj
    return somai

def new():
    tmp_1 = -np.square(np.subtract.outer(np_a, np_a)) / dev_a2
    tmp_2 = -np.square(np.subtract.outer(np_b, np_b)) / dev_b2
    exponential = np.exp(tmp_1 + tmp_2)
    somai = np.sum(exponential)
    return somai
old = 1.76 s ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
new = 24.6 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is about a 70x improvement
old yields 740919.6020840995
new yields 740919.602084099
Explanation
You'll notice I broke up your code with the tmp_1 and tmp_2 a bit for clarity.
np.random.rand(n): This creates an array of length n that has random floats going from 0 to 1 (excluding 1) (documented here).
np.subtract.outer(a, b): Numpy has modules for all the operators that allow you to do various things with them. Let's say you had np_a = [1, 2, 3]; np.subtract.outer(np_a, np_a) would yield
array([[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0]])
Here's a stackoverflow link if you want to go deeper on this. (Also, the word "outer" comes from "outer product", as in linear algebra.)
np.square: simply squares every element in the matrix.
/: In numpy, when you use arithmetic operators between scalars and matrices, it does the appropriate thing and applies the operation to every element of the matrix.
np.exp: like np.square
np.sum: sums every element together and returns a scalar.
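For intuition, np.subtract.outer is the same as the broadcasting expression below (a small illustrative check, not part of the original answer):

import numpy as np

np_a = np.array([1, 2, 3])
outer_diff = np.subtract.outer(np_a, np_a)      # element [i, j] is np_a[i] - np_a[j]
broadcast_diff = np_a[:, None] - np_a[None, :]  # same result via broadcasting
print(np.array_equal(outer_diff, broadcast_diff))  # True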

Numpy array and for loop: how to improve it

I am new to Python and I have a question about speeding up a for loop.
Let "u" be a numpy array of dimension (N,K) and let "kernel_vect" be a numpy array of dimension (K,), both of float64 numbers.
I would like to speed up the following code (by eliminating the for loop, for example):
Kernel_appo = np.zeros((N**2,))
for k in range(K):
    uk = u[:,k]
    Mat_appo = np.outer(uk,uk)
    # routines.vec presumably flattens the (N, N) matrix into a vector of length N**2 (helper not shown)
    Kernel_appo = Kernel_appo + kernel_vect[k] * routines.vec(Mat_appo)
Any idea? Thanks!
Here is a faster implementation without a loop over k:
# Version 2
Kernel_appo = np.zeros((N**2,))
for n1 in range(N):
    for n2 in range(N):
        Kernel_appo[n1*N+n2] = (u[n1,:] * u[n2,:] * kernel_vect).sum()
We can make it faster by using the symmetry of the u product:
# Version 3
Kernel_appo = np.zeros((N,N))
for n1 in range(N):
    for n2 in range(n1,N):
        Kernel_appo[n1,n2] = (u[n1,:] * u[n2,:] * kernel_vect).sum()
Kernel_appo = np.triu(Kernel_appo, 1) + np.tril(Kernel_appo.transpose(), 0)  # make the matrix symmetric
Kernel_appo = np.ravel(Kernel_appo, order='C')
Here is a version that removes one of the loops:
# Version 4
Kernel_appo = np.zeros((N,N))
for n1 in range(N):
    Kernel_appo[n1,n1:N] = ((u[n1,:] * kernel_vect) * u[n1:N,:]).sum(axis=1)
Kernel_appo = np.triu(Kernel_appo, 1) + np.tril(Kernel_appo.transpose(), 0)
Kernel_appo = np.ravel(Kernel_appo, order='C')
We still have a loop over N. However, it seems reasonable to keep it since N is small. Removing it would force numpy to create huge temporary matrices in memory, which should cause a performance drop (and can even crash if N and K are very big).
Note that Version 4 will probably not be so fast if K is much bigger (since the temporary numpy arrays may no longer fit in the CPU cache).
UPDATE: I just discovered that it is possible to use the awesome np.einsum in this case:
# Version 5
Kernel_appo = np.ravel(np.einsum('ji,ki,i->jk', u, u, kernel_vect, optimize=True), order='C')
Be prepared, because this simpler implementation is also much faster (because numpy is able to vectorize the code and run it in parallel).
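A quick way to sanity-check Version 5 against the explicit loops of Version 2 (a small sketch with made-up sizes, not part of the original answer):

import numpy as np

# hypothetical small sizes, just for the check
N, K = 20, 100
u = np.random.rand(N, K)
kernel_vect = np.random.rand(K)

# Version 2 (explicit loops) as the reference
ref = np.zeros((N**2,))
for n1 in range(N):
    for n2 in range(N):
        ref[n1*N + n2] = (u[n1, :] * u[n2, :] * kernel_vect).sum()

# Version 5 (einsum)
fast = np.ravel(np.einsum('ji,ki,i->jk', u, u, kernel_vect, optimize=True), order='C')

print(np.allclose(ref, fast))  # True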
Here are performance results with N=50 and K=5000 on my machine:
Initial code: 58.15 ms
Version 2: 19.94 ms
Version 3: 10.11 ms
Version 4: 5.08 ms
Version 5: 0.57 ms
The final implementation is now about 100 times faster than the initial one!

Loop speed up of FFT in python (with `np.einsum`)

Problem: I want to speed up my python loop containing a lot of products and summations with np.einsum, but I'm also open to any other solutions.
My function takes a vector configuration S of shape (n,n,3) (in my case n=72) and does a Fourier transformation of the correlation function for N*N points. The correlation function is defined as the product of every vector with every other vector. This gets multiplied by a cosine of the positions of the vectors times the kx and ky values. Every position i,j is summed in the end to get one point p,m in k-space:
def spin_spin(S,N):
    n = len(S)
    conf = np.reshape(S,(n**2,3))
    chi = np.zeros((N,N))
    kx = np.linspace(-5*np.pi/3,5*np.pi/3,N)
    ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N)
    x = np.reshape(triangular(n)[0],(n**2))
    y = np.reshape(triangular(n)[1],(n**2))
    for p in range(N):
        for m in range(N):
            for i in range(n**2):
                for j in range(n**2):
                    chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j]) + ky[m]*(y[i]-y[j]))
    return(chi,kx,ky)
My problem is that I need roughly 100*100 points in k-space (denoted by kx*ky), and the loop needs too many hours to finish the job for a lattice with 72*72 vectors.
Number of calculations: 72*72*72*72*100*100
I cannot use numpy's built-in FFT because of my triangular grid, so I need some other option to reduce the computational cost.
My idea: First I recognized that reshaping the configuration into a list of vectors instead of a matrix reduces the computational cost. Furthermore, I used the numba package, which also reduced the cost, but it's still too slow. I found out that a good way of calculating these kinds of objects is the np.einsum function. Calculating the product of every vector with every other vector is done with the following:
np.einsum('ij,kj -> ik',np.reshape(S,(72**2,3)),np.reshape(S,(72**2,3)))
The tricky part is the calculation of the term inside np.cos. Here I want to calculate the product of a list of shape (100,1) with the positions of the vectors (e.g. np.shape(x) = (72**2,1)). In particular, I really don't know how to implement the distances in the x-direction and y-direction with np.einsum.
To reproduce the code (you probably won't need this): First you need a vector configuration. You can do it simply with np.ones((72,72,3)) or you can take random vectors as an example with:
import random
import numpy as np

def spherical_to_cartesian(r, theta, phi):
    '''Convert spherical coordinates (physics convention) to cartesian coordinates'''
    sin_theta = np.sin(theta)
    x = r * sin_theta * np.cos(phi)
    y = r * sin_theta * np.sin(phi)
    z = r * np.cos(theta)
    return x, y, z  # return a tuple

def random_directions(n, r):
    '''Return ``n`` 3-vectors in random directions with radius ``r``'''
    out = np.empty(shape=(n,3), dtype=np.float64)
    for i in range(n):
        # Pick directions randomly in solid angle
        phi = random.uniform(0, 2*np.pi)
        theta = np.arccos(random.uniform(-1, 1))
        # unpack a tuple
        x, y, z = spherical_to_cartesian(r, theta, phi)
        out[i] = x, y, z
    return out

S = np.reshape(random_directions(72**2,1),(72,72,3))
(The reshape in this example is needed so that spin_spin can reshape it back to the (72**2,3) shape.)
For the positions of the vectors I use a triangular grid defined by:
def triangular(nsize):
    '''Positional arguments of the spin configuration'''
    X = np.zeros((nsize,nsize))
    Y = np.zeros((nsize,nsize))
    for i in range(nsize):
        for j in range(nsize):
            X[i,j] += 1/2*j + i
            Y[i,j] += np.sqrt(3)/2*j
    return(X,Y)
Optimized Numba implementation
The main problem in your code is calling the external BLAS function np.dot repeatedly with extremely small data. In this code it makes more sense to calculate the dot products only once; but if you have to do such calculations in a loop, write a Numba implementation (example below).
Optimized function (brute-force)
import numpy as np
import numba as nb

@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def spin_spin(S,N):
    n = len(S)
    conf = np.reshape(S,(n**2,3))
    chi = np.zeros((N,N))
    kx = np.linspace(-5*np.pi/3,5*np.pi/3,N).astype(np.float32)
    ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N).astype(np.float32)
    # note: triangular() must itself be Numba-compiled (e.g. decorated with @nb.njit) to be callable in nopython mode
    x = np.reshape(triangular(n)[0],(n**2)).astype(np.float32)
    y = np.reshape(triangular(n)[1],(n**2)).astype(np.float32)
    # precalc some values
    fact = nb.float32(2/(n**2))
    conf_dot = np.dot(conf,conf.T).astype(np.float32)
    for p in nb.prange(N):
        for m in range(N):
            # accumulating on a scalar is often beneficial
            acc = nb.float32(0)
            for i in range(n**2):
                for j in range(n**2):
                    acc += conf_dot[i,j]*np.cos(kx[p]*(x[i]-x[j]) + ky[m]*(y[i]-y[j]))
            chi[p,m] = fact*acc
    return(chi,kx,ky)
Optimized function (removal of redundant calculations)
A lot of redundant calculations are done. This is an example of how to remove them. This version also does the calculations in double precision.
@nb.njit()
def precalc(S):
    # Not all redundancies may be removed here
    n = len(S)
    conf = np.reshape(S,(n**2,3))
    conf_dot = np.dot(conf,conf.T)
    x = np.reshape(triangular(n)[0],(n**2))
    y = np.reshape(triangular(n)[1],(n**2))

    x_s = set()
    y_s = set()
    for i in range(n**2):
        for j in range(n**2):
            x_s.add((x[i]-x[j]))
            y_s.add((y[i]-y[j]))

    x_arr = np.sort(np.array(list(x_s)))
    y_arr = np.sort(np.array(list(y_s)))

    conf_dot_sel = np.zeros((x_arr.shape[0],y_arr.shape[0]))
    for i in range(n**2):
        for j in range(n**2):
            ii = np.searchsorted(x_arr,x[i]-x[j])
            jj = np.searchsorted(y_arr,y[i]-y[j])
            conf_dot_sel[ii,jj] += conf_dot[i,j]

    return x_arr,y_arr,conf_dot_sel

@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def spin_spin_opt_2(S,N):
    chi = np.empty((N,N))
    n = len(S)
    kx = np.linspace(-5*np.pi/3,5*np.pi/3,N)
    ky = np.linspace(-3*np.pi/np.sqrt(3),3*np.pi/np.sqrt(3),N)

    x_arr,y_arr,conf_dot_sel = precalc(S)
    fact = 2/(n**2)

    for p in nb.prange(N):
        for m in range(N):
            acc = nb.float32(0)
            for i in range(x_arr.shape[0]):
                for j in range(y_arr.shape[0]):
                    acc += fact*conf_dot_sel[i,j]*np.cos(kx[p]*x_arr[i] + ky[m]*y_arr[j])
            chi[p,m] = acc
    return(chi,kx,ky)
Timings
#brute-force
%timeit res=spin_spin(S,100)
#48 s ± 671 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#new version
%timeit res_2=spin_spin_opt_2(S,100)
#5.33 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=spin_spin_opt_2(S,1000)
#1min 23s ± 2.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Edit (SVML-check)
import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def foo(n):
    x = np.empty(n*8, dtype=np.float64)
    ret = np.empty_like(x)
    for i in range(ret.size):
        ret[i] += np.cos(x[i])
    return ret

foo(1000)

if 'intel_svmlcc' in foo.inspect_llvm(foo.signatures[0]):
    print("found")
else:
    print("not found")
#found
If the output is "not found", read this link. It should work on Linux and Windows, but I haven't tested it on macOS.
Here is one approach to speed things up. I didn't start using np.einsum because a little tweaking of your loops was sufficient.
The main thing slowing down your code was redundant recalculations of the same thing. The nested loop here is the perpetrator:
for p in range(N):
    for m in range(N):
        for i in range(n**2):
            for j in range(n**2):
                chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j]) + ky[m]*(y[i]-y[j]))
It contains a lot of redundancy, recalculating vector operations many times.
Consider the np.dot(...): this calculation is completely independent of the points kx and ky; only kx and ky need the indices p and m. So you can run the dot products over all i and j just once and save the result, as opposed to recalculating them for each p,m (which would be 10,000 times!).
Similarly, there is no need for the vector differences between positions to be recalculated at each point in the lattice. At every lattice point you currently recompute every difference, when all that is needed is to calculate the differences once and reuse the result at each lattice point.
So, having fixed the loops and used dictionaries with indices (i,j) as keys to store all the values, you can just look up the relevant value during the loop over i, j. Here is my code:
import itertools
import numpy as np

def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S,(n**2, 3))
    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)

    # Minor point; no need to use triangular twice
    x, y = triangular(n)
    x, y = np.reshape(x,(n**2)), np.reshape(y,(n**2))

    # Build a look-up for all the dot products to save calculating them many times
    dot_prods = dict()
    x_diffs, y_diffs = dict(), dict()
    for i, j in itertools.product(range(n**2), range(n**2)):
        dot_prods[(i, j)] = np.dot(conf[i], conf[j])
        x_diffs[(i, j)], y_diffs[(i, j)] = x[i] - x[j], y[i] - y[j]

    # Minor point; improve syntax by converting nested for loops to one line
    for p, m in itertools.product(range(N), range(N)):
        for i, j in itertools.product(range(n**2), range(n**2)):
            # All vector operations are replaced by look-ups to the dictionaries defined above
            chi[p, m] += 2/(n**2)*dot_prods[(i, j)]*np.cos(kx[p]*(x_diffs[(i, j)]) + ky[m]*(y_diffs[(i, j)]))

    return(chi, kx, ky)
I am running this at the moment with the dimensions you provide, on a decent machine, and the loop over i,j finishes in two minutes. That only needs to happen once; then it is just the loop over p, m. Each of these is taking about 90 seconds, so it is still a 2-3 hour run time. I welcome any suggestions on how to optimise that cos calculation to speed it up!
I only hit the low-hanging fruit of optimization, but to give a sense of the speed-up: the loop over i, j takes 2 minutes, and this way it runs 9,999 fewer times!

Optimal numba implementation for matrix multiplication depends significantly on matrix size

This question relates to one I posted a while back:
Python, numpy, einsum multiply a stack of matrices
I am trying to understand why I get the speedups I get with Numba when it is used in a particular manner to multiply a stack of stacks of matrices. As before, I am putting in a (500,201,2,2) array and multiplying the (2x2) matrices at the end along the first axis (so 500 multiplications) to get a (201,2,2) array as the result.
Here is the Python code:
import numpy as np
from numpy.random import rand
from numba import jit  # numba 0.24, numpy 1.9.3, python 2.7.11

Arr = rand(500,201,2,2)

def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1,len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

@jit(nopython=True)
def loopMultJit(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = np.dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_2X2(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            x1 = ArrMult[i,0,0] * Arr[j,i,0,0] + ArrMult[i,0,1] * Arr[j,i,1,0]
            y1 = ArrMult[i,0,0] * Arr[j,i,0,1] + ArrMult[i,0,1] * Arr[j,i,1,1]
            x2 = ArrMult[i,1,0] * Arr[j,i,0,0] + ArrMult[i,1,1] * Arr[j,i,1,0]
            y2 = ArrMult[i,1,0] * Arr[j,i,0,1] + ArrMult[i,1,1] * Arr[j,i,1,1]
            ArrMult[i,0,0] = x1
            ArrMult[i,0,1] = y1
            ArrMult[i,1,0] = x2
            ArrMult[i,1,1] = y2
    return ArrMult

A1 = loopMult(Arr)
A2 = loopMultJit(Arr)
A3 = loopMultJit_2X2(Arr)
print np.allclose(A1, A2)
print np.allclose(A1, A3)

%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
Here is the output:
True
True
10 loops, best of 3: 40.5 ms per loop
10 loops, best of 3: 36 ms per loop
1000 loops, best of 3: 808 µs per loop
In the prior question, the accepted answer showed that with f2py there was an 8x speedup without detailed optimization. Here, with Numba, I get about a 10% speedup over the einsum loop, but a 45x speedup if, instead of using np.dot in the loop, I simply do the 2x2 matrix multiplication by hand. Why is this? I should mention that I have also implemented both of these jit functions with proper type signatures as guvectorize versions, which give basically the same speedup factors, so I left them out. Also, the speedup from iterating over a (201,500,2,2) array instead is minimal.
Two comments responded that the speedup is just due to Python overhead, and I think that's right. The overhead is mostly function calls, but also for loops, and np.dot has some extra overhead on top of that. I set up a naive dot product function:
@jit(nopython=True)
def dot(mat1, mat2):
    s = 0
    mat = np.empty(shape=(mat1.shape[0], mat2.shape[1]), dtype=mat1.dtype)
    for r1 in range(mat1.shape[0]):
        for c2 in range(mat2.shape[1]):
            s = 0
            for j in range(mat2.shape[0]):
                s += mat1[r1,j] * mat2[j,c2]
            mat[r1,c2] = s
    return mat
Then I set up two functions to multiply the arrays: one that calls the dot function, and one that has the dot logic built into the loop so that it executes without an extra function call:
@jit(nopython=True)
def loopMultJit_dot(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_dotInternal(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            s = 0.0
            for r1 in range(ArrMult.shape[1]):
                for c2 in range(Arr.shape[3]):
                    s = 0.0
                    for r2 in range(Arr.shape[2]):
                        s += ArrMult[i,r1,r2] * Arr[j,i,r2,c2]
                    ArrMult[i,r1,c2] = s
    return ArrMult
Then I can run 2 comparisons: 2x2 arrays, and 10x10 arrays. With these I get some idea of the penalties paid for function calls in general, and for the np.dot function call in particular, and the gains from BLAS optimizations in np.dot:
print "2x2 Time Test:"
Arr = rand(500,201,2,2)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
print "10x10 Time Test:"
Arr = rand(500,201,10,10)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
which yields:
2x2 Time Test:
10 loops, best of 3: 55.8 ms per loop # einsum
10 loops, best of 3: 48.7 ms per loop # np.dot
1000 loops, best of 3: 1.09 ms per loop # 2x2
10 loops, best of 3: 28.3 ms per loop # naive dot, separate function
100 loops, best of 3: 2.58 ms per loop # naive dot internal
10x10 Time Test:
1 loop, best of 3: 499 ms per loop # einsum
10 loops, best of 3: 91.3 ms per loop # np.dot
10 loops, best of 3: 170 ms per loop # naive dot, separate function
10 loops, best of 3: 161 ms per loop # naive dot internal
I suppose the take home messages are:
einsum is nice if you're not using numba, or need one-liners, but for matrix multiplication, there are faster options
if you're working with small matrices, it can be faster to do things by hand and not call separate functions
for large matrices, there is a reason BLAS was invented, and in fact, speedups are quite noticeable at sizes as small as 10x10.
