How To Optimise This Cython Function? - python

I have a Cython module:
#!python
#cython: language_level=3, boundscheck=False, nonecheck=False
import numpy as np
cimport numpy as np
def portfolio_s2( double[:,:] cv, double[:] weights ):
    """ Calculate portfolio variance """
    cdef double s0
    cdef double s1
    cdef double s2

    s0 = 0.0
    for i in range( weights.shape[0] ):
        s0 += weights[i]*weights[i]*cv[i,i]

    s1 = 0.0
    for i in range( weights.shape[0]-1 ):
        s2 = 0.0
        for j in range( i+1, weights.shape[0] ):
            s2 += weights[j]*cv[i,j]
        s1 += weights[i]*s2

    return s0 + 2.0*s1
I have the equivalent function in Numba:
import numba as nb

@nb.jit( nopython=True )
def portfolio_s2( cv, weights ):
    """ Calculate portfolio variance using numba """
    s0 = 0.0
    for i in range( weights.shape[0] ):
        s0 += weights[i]*weights[i]*cv[i,i]

    s1 = 0.0
    for i in range( weights.shape[0]-1 ):
        s2 = 0.0
        for j in range( i+1, weights.shape[0] ):
            s2 += weights[j]*cv[i,j]
        s1 += weights[i]*s2

    return s0 + 2.0*s1
For a covariance matrix of size 10, the Numba version is 20 times faster than Cython. I assume that this is due to something I am doing wrong in Cython, but I am new to Cython and am not sure what to do.
Using Cel's Optimisation...
I have written a script to test Cel's code vs the Numba version:
sizes = [ 2, 3, 4, 6, 8, 12, 16, 32, 48, 64, 96, 128, 196, 256 ]
cython_timings = []
numba_timings = []

for size in sizes:
    X = np.random.randn(100,size)
    cv = np.cov( X, rowvar=0 )
    w = np.ones( cv.shape[0] )
    num_tests = 10

    pm.portfolio_s2( cv, w )        # warm up the Cython module before timing
    with Timer( 'Cython' ) as cython_timer:
        for _ in range( num_tests ):
            s2_cython = pm.portfolio_s2_opt( cv, w )
    cython_timings.append( cython_timer.interval )

    helpers.portfolio_s2( cv, w )   # warm up (and JIT-compile) the Numba version
    with Timer( 'Numba' ) as numba_timer:
        for _ in range( num_tests ):
            s2_numba = helpers.portfolio_s2( cv, w )
    numba_timings.append( numba_timer.interval )

plt.plot( sizes, cython_timings, label='Cython' )
plt.plot( sizes, numba_timings, label='Numba' )
plt.title( 'Execution Time By Covariance Size' )
plt.legend()
plt.show()
The resulting chart looks like this:
The chart shows that for small covariance matrices, Numba performs better. But as the covariance matrix size increases, Cython scales better and eventually outperforms by a large margin.
Is there some sort of function-call overhead that is causing Cython to perform so poorly for small matrices? My use case involves evaluating this function on lots of small covariance matrices, so I need better performance for small matrices rather than large ones.

The important thing when using Cython is to make sure that everything is statically typed.
In your example the loop variables i and j were not typed. The declaration cdef size_t i, j already gives you a massive speedup.
There are nice examples in the Working with NumPy section of Cython's docs.
This is my setup and the evaluation:
import numpy as np
n = 100
cv = np.random.rand(n,n)
weights= np.random.rand(n)
The original version:
%timeit portfolio_s2(cv, weights)
10000 loops, best of 3: 147 µs per loop
The optimized version:
%timeit portfolio_s2_opt(cv, weights)
100000 loops, best of 3: 10 µs per loop
And here is the code:
import numpy as np
cimport numpy as np
def portfolio_s2_opt(double[:,:] cv, double[:] weights):
    """ Calculate portfolio variance """
    cdef double s0
    cdef double s1
    cdef double s2
    cdef size_t i, j

    s0 = 0.0
    for i in range( weights.shape[0] ):
        s0 += weights[i]*weights[i]*cv[i,i]

    s1 = 0.0
    for i in range( weights.shape[0]-1 ):
        s2 = 0.0
        for j in range( i+1, weights.shape[0] ):
            s2 += weights[j]*cv[i,j]
        s1 += weights[i]*s2

    return s0 + 2.0*s1
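Regarding the follow-up about small matrices: most of the remaining cost at that size is the fixed Python-call and memoryview-acquisition overhead, not the arithmetic. One common way to amortize it is to move the loop over matrices into Cython as well, so Python is entered only once per batch. A rough, untested sketch along those lines (the batched signature and names here are illustrative, not from the question):

# portfolio_batch.pyx -- sketch only
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def portfolio_s2_batch(double[:, :, :] cvs, double[:, :] weights):
    """ Variances for a batch of portfolios; cvs[k] is the k-th covariance matrix. """
    cdef Py_ssize_t k, i, j
    cdef Py_ssize_t nmat = cvs.shape[0]
    cdef Py_ssize_t n = weights.shape[1]
    cdef double s0, s1, s2
    cdef np.ndarray[np.float64_t, ndim=1] out = np.empty(nmat)

    for k in range(nmat):
        s0 = 0.0
        for i in range(n):
            s0 += weights[k, i] * weights[k, i] * cvs[k, i, i]
        s1 = 0.0
        for i in range(n - 1):
            s2 = 0.0
            for j in range(i + 1, n):
                s2 += weights[k, j] * cvs[k, i, j]
            s1 += weights[k, i] * s2
        out[k] = s0 + 2.0 * s1
    return out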

Related

Cython call optimization

I've got a Python function I am trying to port to Cython. I have tested two implementations, but I don't understand why the second one is slower than the first one. Furthermore, I am looking for ways to improve the speed a little more, but I have no clue how.
Base code
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.int
ctypedef np.int_t DTYPE_t

cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b

cdef extern from "math.h":
    double exp(double x)

@cython.boundscheck(False)
@cython.wraparound(False)
def bilateral_filter_C(np.ndarray[np.float_t, ndim=1] samples, int w=20):
    # Filter Parameters
    cdef Py_ssize_t size = samples.shape[0]
    cdef float rang
    cdef float sigma = 2*3.0*3.0
    cdef int j, L
    cdef unsigned int a, b
    cdef np.float_t W, num, sub_sample, intensity

    # Initialization
    cdef np.ndarray[np.float_t, ndim=1] gauss = np.zeros(2*w+1, dtype=np.float)
    cdef np.ndarray[np.float_t, ndim=1] sub_samples, intensities = np.empty(size, dtype=np.float)
    cdef np.ndarray[np.float_t, ndim=1] samples_filtered = np.empty(size, dtype=np.float)

    L = 2*w+1
    for j in xrange(L):
        rang = -w+1.0/L
        rang *= rang
        gauss[j] = exp(-rang/sigma)

    <CODE TO IMPROVE>

    return samples_filtered
I tried to inject those two code samples in the <CODE TO IMPROVE> section:
Most efficient approach
for i in xrange(size):
    a = <unsigned int>int_max(i-w, 0)
    b = <unsigned int>int_min(i+w, size-1)
    L = b-a

    sub_samples = samples[a:b]-samples[i]
    sub_samples *= sub_samples
    for j in xrange(L):
        sub_samples[j] = exp(-sub_samples[j]/sigma)

    intensities = gauss[w-i+a:w-i+b]*sub_samples

    num = 0.0
    W = 0.0
    for j in xrange(L):
        W += intensities[j]
        num += intensities[j]*samples[a+j]
    samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 45 ms per loop
Less efficient
for i in xrange(size):
    a = <unsigned int>int_max(i-w, 0)
    b = <unsigned int>int_min(i+w, size-1)

    num = 0.0
    W = 0.0
    for j in xrange(b-a):
        sub_sample = samples[a+j]-samples[i]
        intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma)
        W += intensity
        num += intensity*samples[a+j]
    samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 125 ms per loop
You have a few typos:
1) You forgot to define i; just add cdef int i, j, L.
2) In the second algorithm you wrote intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma); it should be intensity, without the 1.
3) I would add @cython.cdivision(True) to avoid the check for division by zero.
With those changes, and with x = np.random.rand(10000), I got the following results:
%timeit bilateral_filter_C1(x, 20) # First code
10 loops, best of 3: 74.1 ms per loop
%timeit bilateral_filter_C2(x, 20) # Second code
100 loops, best of 3: 9.5 ms per loop
And, to check the results:
np.all(np.equal(bilateral_filter_C1(x, 20), bilateral_filter_C2(x, 20)))
True
To avoid these problems I suggest using the option cython my_file.pyx -a; it generates an HTML file that shows you the possible problems in your code.
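If you build through a setup script rather than calling cython directly, the same annotated HTML can be produced at build time with the annotate option of cythonize. A minimal sketch (the file name is illustrative):

# setup.py -- sketch; writes my_file.html next to the generated C file
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    include_dirs=[numpy.get_include()],
    ext_modules=cythonize("my_file.pyx", annotate=True),
)

In the generated HTML, the more yellow a line is, the more Python-level interaction it still has.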
EDIT
Reading the code again, it seems to have more errors:
for j in xrange(L):
    rang = -w+1.0/L
    rang *= rang
    gauss[j] = exp(-rang/sigma)
gauss always has the same value; what is the intended definition of rang?
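(Presumably the intention was for rang to step across the window, something like the guess below, but only the author can confirm:)

# A guess at the intended spatial Gaussian -- not from the original post
for j in xrange(L):
    rang = float(j - w)              # offset from the window centre, -w .. +w
    gauss[j] = exp(-rang*rang/sigma)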

Cython speedup isn't as large as expected

I have written a Python function that computes pairwise electromagnetic interactions between a largish number (N ~ 10^3) of particles and stores the results in an NxN complex128 ndarray. It runs, but it is the slowest part of a larger program, taking about 40 seconds when N=900 [corrected]. The original code looks like this:
import numpy as np

def interaction(s, alpha, kprop):   # s is an Nx3 real array
                                    # alpha is complex
                                    # kprop is float
    ndipoles = s.shape[0]
    Amat = np.zeros((ndipoles, 3, ndipoles, 3), dtype=np.complex128)
    I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    im = complex(0,1)
    k2 = kprop*kprop

    for i in range(ndipoles):
        xi = s[i,:]
        for j in range(ndipoles):
            if i != j:
                xj = s[j,:]
                dx = xi-xj
                R = np.sqrt(dx.dot(dx))
                n = dx/R
                kR = kprop*R
                kR2 = kR*kR
                A = ((1./kR2) - im/kR)
                nxn = np.outer(n, n)
                nxn = (3*A-1)*nxn + (1-A)*I
                nxn *= -alpha*(k2*np.exp(im*kR))/R
            else:
                nxn = I
            Amat[i,:,j,:] = nxn

    return(Amat.reshape((3*ndipoles, 3*ndipoles)))
I had never previously used Cython, but that seemed like a good place to start in my effort to speed things up, so I pretty much blindly adapted the techniques I found in online tutorials. I got some speedup (30 seconds vs. 40 seconds), but not nearly as dramatic as I expected, so I'm wondering whether I'm doing something wrong or am missing a critical step. The following is my best attempt at cythonizing the above routine:
import numpy as np
cimport numpy as np

DTYPE = np.complex128
ctypedef np.complex128_t DTYPE_t

def interaction(np.ndarray s, DTYPE_t alpha, float kprop):
    cdef float k2 = kprop*kprop
    cdef int i,j
    cdef np.ndarray xi, xj, dx, n, nxn
    cdef float R, kR, kR2
    cdef DTYPE_t A
    cdef int ndipoles = s.shape[0]
    cdef np.ndarray Amat = np.zeros((ndipoles, 3, ndipoles, 3), dtype=DTYPE)
    cdef np.ndarray I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    cdef DTYPE_t im = complex(0,1)

    for i in range(ndipoles):
        xi = s[i,:]
        for j in range(ndipoles):
            if i != j:
                xj = s[j,:]
                dx = xi-xj
                R = np.sqrt(dx.dot(dx))
                n = dx/R
                kR = kprop*R
                kR2 = kR*kR
                A = ((1./kR2) - im/kR)
                nxn = np.outer(n, n)
                nxn = (3*A-1)*nxn + (1-A)*I
                nxn *= -alpha*(k2*np.exp(im*kR))/R
            else:
                nxn = I
            Amat[i,:,j,:] = nxn

    return(Amat.reshape((3*ndipoles, 3*ndipoles)))
The real power of NumPy lies in performing an operation across a huge number of elements in a vectorized manner instead of applying that operation in small chunks spread across loops. In your case, you are using two nested loops and one IF conditional statement. I would propose extending the dimensions of the intermediate arrays so that NumPy's broadcasting comes into play, letting the same operations be applied to all elements in one go instead of to small chunks of data inside the loops.
For extending the dimensions, None/np.newaxis could be used. The vectorized implementation following this approach would look like this -
def vectorized_interaction(s, alpha, kprop):
    im = complex(0,1)
    I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    k2 = kprop*kprop
    N = s.shape[0]

    # Vectorized calculations for dx, R, n, kR, A
    sd = s[:,None] - s
    Rv = np.sqrt((sd**2).sum(2))
    nv = sd/Rv[:,:,None]
    kRv = Rv*kprop
    Av = (1./(kRv*kRv)) - im/kRv

    # Vectorized calculation for: "nxn = np.outer(n, n)"
    nxnv = nv[:,:,:,None]*nv[:,:,None,:]

    # Vectorized calculation for: "(3*A-1)*nxn + (1-A)*I"
    P = (3*Av[:,:,None,None]-1)*nxnv + (1-Av[:,:,None,None])*I

    # Vectorized calculation for: "-alpha*(k2*np.exp(im*kR))/R"
    multv = -alpha*(k2*np.exp(im*kRv))/Rv

    # Vectorized calculation for: "nxn *= -alpha*(k2*np.exp(im*kR))/R"
    outv = P*multv[:,:,None,None]

    # Simulate the ELSE part of the conditional "if i != j:"
    # by setting the diagonal 3x3 blocks to I
    outv[np.eye(N, dtype=bool)] = I

    return outv.transpose(0,2,1,3).reshape(N*3, -1)
Runtime tests and output verification -
Case #1:
In [703]: N = 10
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [704]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [705]: %timeit interaction(s,alpha,kprop)
100 loops, best of 3: 7.6 ms per loop
In [706]: %timeit vectorized_interaction(s,alpha,kprop)
1000 loops, best of 3: 304 µs per loop
Case #2:
In [707]: N = 100
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [708]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [709]: %timeit interaction(s,alpha,kprop)
1 loops, best of 3: 826 ms per loop
In [710]: %timeit vectorized_interaction(s,alpha,kprop)
100 loops, best of 3: 14 ms per loop
Case #3:
In [711]: N = 900
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [712]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [713]: %timeit interaction(s,alpha,kprop)
1 loops, best of 3: 1min 7s per loop
In [714]: %timeit vectorized_interaction(s,alpha,kprop)
1 loops, best of 3: 1.59 s per loop

Unexpectedly slow cython convolution code

I need to quickly compute a matrix whose entries are obtained by convolving a filter with a vector for each row, subsampling the entries of the resulting vector, and then taking the dot product of the result with another vector. Specifically, I want to compute
M = [conv(e_j, f)*P_i*v_i ]_{i,j},
where i varies from 1 to n and j varies from 1 to m. Here e_j is the indicator (row) vector of size n with a one only in column j, f is the filter of length s, P_i is an (n+s-1)-by-k matrix which samples the appropriate k entries from the convolution, and v_i is a column vector of length k.
It takes O(n*s) operations to compute each entry of M, so O(n*s*n*m) overall to compute M. For n=6, m=7, s=3, one core of my computer (8GLOPs) should be able compute M in roughly .094 microseconds. Yet my very simple cython implementation, following the example given in the Cython documentation, takes more than 2 milliseconds to compute an example with those parameters. That is about 4 orders of magnitude difference!
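(As a sanity check on that estimate: n*s*n*m = 6*3*6*7 = 756 floating-point operations, and 756 / (8x10^9 FLOP/s) is roughly 0.094 microseconds, which is where the figure comes from.)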
Here is a shar file with the Cython implementation and test code. Copy and paste it to a file and run 'bash <fname>' in a clean directory to get the code, then run 'bash ./test.sh' to see the abysmal performance.
cat > fastcalcM.pyx <<'EOF'
import numpy as np
cimport numpy as np
cimport cython
from scipy.signal import convolve

DTYPE=np.float32
ctypedef np.float32_t DTYPE_t

@cython.boundscheck(False)
def calcM(np.ndarray[DTYPE_t, ndim=1, negative_indices=False] filtertaps,
          int n, int m,
          np.ndarray[np.int_t, ndim=2, negative_indices=False] keep_indices,
          np.ndarray[DTYPE_t, ndim=2, negative_indices=False] V):
    """ Computes a numrows-by-k matrix M whose entries satisfy
        M_{i,k} = [conv(e_j, f)^T * P_i * v_i],
        where v_i^T is the i-th row of V, and P_i samples the entries from
        conv(e_j, f)^T indicated by the ith row of the keep_indices matrix """
    cdef int k = keep_indices.shape[1]
    cdef np.ndarray M = np.zeros((n, m), dtype=DTYPE)
    cdef np.ndarray ej = np.zeros((m,), dtype=DTYPE)
    cdef np.ndarray convolution
    cdef int rowidx, colidx, kidx

    for rowidx in range(n):
        for colidx in range(m):
            ej[colidx] = 1
            convolution = convolve(ej, filtertaps, mode='full')
            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
            ej[colidx] = 0
    return M
EOF
#-----------------------------------------------------------------------------
cat > test_calcM.py << 'EOF'
import numpy as np
from fastcalcM import calcM

filtertaps = np.array([-1, 2, -1]).astype(np.float32)
n, m = 6, 7
keep_indices = np.array([[1, 3],
                         [4, 5],
                         [2, 2],
                         [5, 5],
                         [3, 4],
                         [4, 5]]).astype(np.int)
V = np.random.random_integers(-5, 5, size=(6, 2)).astype(np.float32)

print calcM(filtertaps, n, m, keep_indices, V)
EOF
#-----------------------------------------------------------------------------
cat > test.sh << 'EOF'
python setup.py build_ext --inplace
echo -e "%run test_calcM\n%timeit calcM(filtertaps, n, m, keep_indices, V)" > script.ipy
ipython script.ipy
EOF
#-----------------------------------------------------------------------------
cat > setup.py << 'EOF'
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    name="Fast convolutions",
    include_dirs = [numpy.get_include()],
    ext_modules = cythonize("fastcalcM.pyx")
)
EOF
I thought maybe the call to scipy's convolve might be the culprit (I'm not certain that cython and scipy play well together), so I implemented my own convolution code ala the same example in Cython documentation, but this resulted in the overall code being about 10 times slower.
Any ideas on how to get closer to the theoretically possible speed, or reasons why the difference is so great?
For one thing, the typing of M, ej and convolution doesn't allow fast indexing. The typing you've done is not particularly helpful at all, actually.
But it doesn't matter, because you have two overheads. The first is converting between Cython and Python types. You should keep untyped arrays around if you want to pass them to Python a lot, to prevent the need to convert. Just moving this to Python sped it up for that reason (1ms → 0.65μs).
Then I profiled it:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 def calcM(filtertaps, n, m, keep_indices, V):
16 4111 3615 0.9 0.1 k = keep_indices.shape[1]
17 4111 8024 2.0 0.1 M = np.zeros((n, m), dtype=np.float32)
18 4111 6090 1.5 0.1 ej = np.zeros((m,), dtype=np.float32)
19
20 28777 18690 0.6 0.3 for rowidx in range(n):
21 197328 123284 0.6 2.2 for colidx in range(m):
22 172662 112348 0.7 2.0 ej[colidx] = 1
23 172662 4076225 23.6 73.6 convolution = convolve(ej, filtertaps, mode='full')
24 517986 395513 0.8 7.1 for kidx in range(k):
25 345324 668309 1.9 12.1 M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
26 172662 120271 0.7 2.2 ej[colidx] = 0
27
28 4111 2374 0.6 0.0 return M
Before you consider anything else, deal with convolve.
Why is convolve slow? Well, it's got a lot of overhead. numpy/scipy normally does; it's best for large datasets. If you know the size of your array is going to stay small, just reimplement convolve in Cython.
Oh, try to use the buffer syntax. Use DTYPE[:, :] for a 2D array, DTYPE[:] for a 1D array, etc. It's the memoryview protocol, and it's way better. There are cases where it has more overhead, but those are typically possible to work around and it's way better in most other ways.
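As an illustration of the "reimplement convolve in Cython" suggestion, here is a minimal, untested sketch of a hand-rolled full convolution on typed memoryviews (DTYPE_t here is the np.float32_t ctypedef from the module above; the helper name is my own):

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void convolve_full(DTYPE_t[:] x, DTYPE_t[:] h, DTYPE_t[:] out) nogil:
    # 'full' convolution: out must have length x.shape[0] + h.shape[0] - 1
    cdef Py_ssize_t i, j
    cdef Py_ssize_t n = x.shape[0]
    cdef Py_ssize_t m = h.shape[0]
    for i in range(n + m - 1):
        out[i] = 0
    for i in range(n):
        for j in range(m):
            out[i + j] += x[i] * h[j]

For a filter of length 3 this is a handful of multiply-adds per output sample, with none of the per-call overhead of scipy.signal.convolve.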
EDIT:
You can try (recursively) inlining the scipy function:
import numpy as np
from scipy.signal.sigtools import _correlateND

def calcM(filtertaps, n, m, keep_indices, V):
    k = keep_indices.shape[1]
    M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            in1zpadded[colidx] = 1
            convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)
            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
            in1zpadded[colidx] = 0
    return M
Note that this uses private implementation details.
This is tailored for the particular dimensions, so I don't know if it'll work on your actual data. But it removes the vast majority of overhead. You can then improve this by typing things again:
import numpy as np
cimport numpy as np
from scipy.signal.sigtools import _correlateND

DTYPE=np.float32
ctypedef np.float32_t DTYPE_t

def calcM(filtertaps, int n, int m, np.int_t[:, :] t_keep_indices, DTYPE_t[:, :] t_V):
    cdef int rowidx, colidx, kidx, k
    cdef DTYPE_t[:, :] t_M
    cdef DTYPE_t[:] t_in1zpadded, t_convolution

    k = t_keep_indices.shape[1]
    t_M = M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)

    slice_obj = [slice(None, None, -1)] * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]

    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    t_in1zpadded = in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)

    for rowidx in range(n):
        for colidx in range(m):
            t_in1zpadded[colidx] = 1
            t_convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)
            for kidx in range(k):
                t_M[rowidx, colidx] += t_convolution[<int>t_keep_indices[rowidx, kidx]] * t_V[rowidx, kidx]
            t_in1zpadded[colidx] = 0
    return M
It's over 10x as fast, but not as high as your pie-in-the-sky estimate. Then again, that calculation was a bit bogus to begin with ;).

Subset of a matrix multiplication, fast, and sparse

Converting a collaborative filtering code to use sparse matrices, I'm puzzling over the following problem: given two full matrices X (m by l) and Theta (n by l), and a sparse matrix R (m by n), is there a fast way to calculate the sparse inner product, i.e. X.dot(Theta.T) evaluated only at the non-zero positions of R? Large dimensions are m and n (order 100000), while l is small (order 10). This is probably a fairly common operation for big data since it shows up in the cost function of most linear regression problems, so I'd expect a solution built into scipy.sparse, but I haven't found anything obvious yet.
The naive way to do this in Python is R.multiply(X.dot(Theta.T)), but this evaluates the full dense matrix X.dot(Theta.T) (m by n, order 100000**2), which occupies far too much memory, only to throw away most of the entries because R is sparse.
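(At 8 bytes per float64 entry, a dense 100000-by-100000 intermediate is on the order of 10^10 entries, roughly 80 GB, which is why the dense route is a non-starter.)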
There is a pseudo solution already here on stackoverflow, but it is non-sparse in one step:
def sparse_mult_notreally(a, b, coords):
    rows, cols = coords
    rows, r_idx = np.unique(rows, return_inverse=True)
    cols, c_idx = np.unique(cols, return_inverse=True)
    C = np.array(np.dot(a[rows, :], b[:, cols]))   # this operation is dense
    return sp.coo_matrix( (C[r_idx, c_idx], coords), (a.shape[0], b.shape[1]) )
This works fine, and fast, for me on small enough arrays, but it barfs on my big datasets with the following error:
... in sparse_mult(a, b, coords)
132 rows, r_idx = np.unique(rows, return_inverse=True)
133 cols, c_idx = np.unique(cols, return_inverse=True)
--> 134 C = np.array(np.dot(a[rows, :], b[:, cols])) # this operation is not sparse
135 return sp.coo_matrix( (C[r_idx,c_idx],coords), (a.shape[0],b.shape[1]) )
ValueError: array is too big.
A solution which IS actually sparse, but very slow, is:
def sparse_mult(a, b, coords):
    rows, cols = coords
    n = len(rows)
    C = np.array([ float(a[rows[i],:]*b[:,cols[i]]) for i in range(n) ])   # this is sparse, but VERY slow
    return sp.coo_matrix( (C, coords), (a.shape[0], b.shape[1]) )
Does anyone know a fast, fully sparse way to do this?
I profiled 4 different solutions to your problem, and it looks like for any size of the array, the numba jit solution is the best. A close second is @Alexander's cython solution.
Here are the results (M is the number of rows in the x array):
M = 1000
function sparse_dense took 0.03 sec.
function sparse_loop took 0.07 sec.
function sparse_numba took 0.00 sec.
function sparse_cython took 0.09 sec.
M = 10000
function sparse_dense took 2.88 sec.
function sparse_loop took 0.68 sec.
function sparse_numba took 0.00 sec.
function sparse_cython took 0.01 sec.
M = 100000
function sparse_dense ran out of memory
function sparse_loop took 6.84 sec.
function sparse_numba took 0.09 sec.
function sparse_cython took 0.12 sec.
The script I used to profile these methods is:
import numpy as np
from scipy.sparse import coo_matrix
from numba import autojit, jit, float64, int32
import pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"],
                              "include_dirs":np.get_include()},
                  reload_support=True)

def sparse_dense(a,b,c):
    return coo_matrix(c.multiply(np.dot(a,b)))

def sparse_loop(a,b,c):
    """Multiply sparse matrix `c` by np.dot(a,b) by looping over non-zero
    entries in `c` and using `np.dot()` for each entry."""
    N = c.size
    data = np.empty(N,dtype=float)
    for i in range(N):
        data[i] = c.data[i]*np.dot(a[c.row[i],:],b[:,c.col[i]])
    return coo_matrix((data,(c.row,c.col)),shape=(a.shape[0],b.shape[1]))

#@autojit
def _sparse_mult4(a,b,cd,cr,cc):
    N = cd.size
    data = np.empty_like(cd)
    for i in range(N):
        num = 0.0
        for j in range(a.shape[1]):
            num += a[cr[i],j]*b[j,cc[i]]
        data[i] = cd[i]*num
    return data

_fast_sparse_mult4 = \
    jit(float64[:,:](float64[:,:],float64[:,:],float64[:],int32[:],int32[:]))(_sparse_mult4)

def sparse_numba(a,b,c):
    """Multiply sparse matrix `c` by np.dot(a,b) using Numba's jit."""
    assert c.shape == (a.shape[0],b.shape[1])
    data = _fast_sparse_mult4(a,b,c.data,c.row,c.col)
    return coo_matrix((data,(c.row,c.col)),shape=(a.shape[0],b.shape[1]))

def sparse_cython(a, b, c):
    """Computes c.multiply(np.dot(a,b)) using cython."""
    from sparse_mult_c import sparse_mult_c
    data = np.empty_like(c.data)
    sparse_mult_c(a,b,c.data,c.row,c.col,data)
    return coo_matrix((data,(c.row,c.col)),shape=(a.shape[0],b.shape[1]))

def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

if __name__ == '__main__':
    import time

    for M in [1000,10000,100000]:
        print 'M = %i' % M
        N = M + 2
        L = 10

        x = np.random.rand(M,L)
        t = np.random.rand(N,L).T

        # number of non-zero entries in sparse r matrix
        S = M*10
        row = np.random.randint(M,size=S)
        col = np.random.randint(N,size=S)
        # remove duplicate rows and columns
        row, col = unique_rows(np.dstack((row,col)).squeeze()).T
        data = np.random.rand(row.size)
        r = coo_matrix((data,(row,col)),shape=(M,N))

        a2 = sparse_loop(x,t,r)
        for f in [sparse_dense,sparse_loop,sparse_numba,sparse_cython]:
            t0 = time.time()
            try:
                a = f(x,t,r)
            except MemoryError:
                print 'function %s ran out of memory' % f.__name__
                continue
            elapsed = time.time()-t0
            try:
                diff = abs(a-a2)
                if diff.nnz > 0:
                    assert np.max(abs(a-a2).data) < 1e-5
            except AssertionError:
                print f.__name__
                raise
            print 'function %s took %.2f sec.' % (f.__name__,elapsed)
The cython function is a slightly modified version of @Alexander's code:
# working from tutorial at: http://docs.cython.org/src/tutorial/numpy.html
cimport numpy as np

# Turn bounds checking back on if there are ANY problems!
cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
def sparse_mult_c(np.ndarray[np.float64_t, ndim=2] a,
                  np.ndarray[np.float64_t, ndim=2] b,
                  np.ndarray[np.float64_t, ndim=1] data,
                  np.ndarray[np.int32_t, ndim=1] rows,
                  np.ndarray[np.int32_t, ndim=1] cols,
                  np.ndarray[np.float64_t, ndim=1] out):
    cdef int n = rows.shape[0]
    cdef int k = a.shape[1]
    cdef int i,j
    cdef double num

    for i in range(n):
        num = 0.0
        for j in range(k):
            num += a[rows[i],j] * b[j,cols[i]]
        out[i] = data[i]*num
Based on the extra information in the comments, I think what's throwing you off is the call to np.unique. Try the following approach:
import numpy as np
import scipy.sparse as sps
from numpy.core.umath_tests import inner1d

n = 100000
x = np.random.rand(n, 10)
theta = np.random.rand(n, 10)
rows = np.arange(n)
cols = np.arange(n)
np.random.shuffle(rows)
np.random.shuffle(cols)

def sparse_multiply(x, theta, rows, cols):
    data = inner1d(x[rows], theta[cols])
    return sps.coo_matrix((data, (rows, cols)),
                          shape=(x.shape[0], theta.shape[0]))
I get the following timings:
n = 1000
%timeit sparse_multiply(x, theta, rows, cols)
1000 loops, best of 3: 465 us per loop
n = 10000
%timeit sparse_multiply(x, theta, rows, cols)
100 loops, best of 3: 4.29 ms per loop
n = 100000
%timeit sparse_multiply(x, theta, rows, cols)
10 loops, best of 3: 61.5 ms per loop
And of course, with n = 100:
>>> np.allclose(sparse_multiply(x, theta, rows, cols).toarray()[rows, cols],
...             x.dot(theta.T)[rows, cols])
True
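(Side note: numpy.core.umath_tests is a private module and inner1d is deprecated in newer NumPy releases; if it is unavailable, an equivalent row-wise dot product can be written with einsum. A sketch, assuming the same x, theta, rows, cols as above:)

def sparse_multiply_einsum(x, theta, rows, cols):
    # row-wise dot products of the selected rows, same result as inner1d
    data = np.einsum('ij,ij->i', x[rows], theta[cols])
    return sps.coo_matrix((data, (rows, cols)),
                          shape=(x.shape[0], theta.shape[0]))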
I haven't tested Jaime's answer yet (thanks again!), but in the meantime I implemented another approach that works, using Cython.
file sparse_mult_c.pyx:
# working from tutorial at: http://docs.cython.org/src/tutorial/numpy.html
cimport numpy as np

# Turn bounds checking back on if there are ANY problems!
cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
def sparse_mult_c(np.ndarray[np.float64_t, ndim=2] a,
                  np.ndarray[np.float64_t, ndim=2] b,
                  np.ndarray[np.int32_t, ndim=1] rows,
                  np.ndarray[np.int32_t, ndim=1] cols,
                  np.ndarray[np.float64_t, ndim=1] C ):
    cdef int n = rows.shape[0]
    cdef int k = a.shape[1]
    cdef int i,j

    for i in range(n):
        for j in range(k):
            C[i] += a[rows[i],j] * b[j,cols[i]]
Then compile it as per http://docs.cython.org/src/userguide/tutorial.html
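(For reference, a minimal build script along the lines of that tutorial might look like the following; the file name matches the .pyx above, everything else is standard boilerplate:)

# setup.py -- minimal sketch for building sparse_mult_c.pyx
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    include_dirs=[numpy.get_include()],
    ext_modules=cythonize("sparse_mult_c.pyx"),
)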
Then in my python code, I include the following:
def sparse_mult(a, b, coords):
    # a, b are np.ndarrays
    from sparse_mult_c import sparse_mult_c
    rows, cols = coords
    C = np.zeros(rows.shape[0])
    sparse_mult_c(a,b,rows,cols,C)
    return sp.coo_matrix( (C,coords), (a.shape[0],b.shape[1]) )
This works fully sparse and also runs faster than even the original (memory-inefficient for me) solution.

How do I speed up a Python nested loop?

I'm trying to calculate the gravity effect of a buried object by calculating the effect on each side of the body, then summing up the contributions to get one measurement at one station, and repeating for a number of stations. The code is as follows (the body is a square and the code works clockwise around it, which is why it goes from -x back to -x coordinates):
import time
import scipy as si

grav = []
x = si.arange(-30.0,30.0,0.5)

#-9.79742526 9.78716693 22.32153704 27.07382349 2138.27146193
xcorn = (-9.79742526,9.78716693 ,9.78716693 ,-9.79742526,-9.79742526)
zcorn = (22.32153704,22.32153704,27.07382349,27.07382349,22.32153704)
gamma = (6.672*(10**-11))   #'N m^2 / Kg^2'
rho = 2138.27146193         #'Kg / m^3'

grav = []
iter_time = []

def procedure():
    for i in si.arange(len(x)):              # cycles position
        t0 = time.clock()
        sum_lines = 0.0
        for n in si.arange(len(xcorn)-1):    # cycles corners
            x1 = xcorn[n]-x[i]
            x2 = xcorn[n+1]-x[i]
            z1 = zcorn[n]-0.0   # just depth to corner since all observations are on the surface
            z2 = zcorn[n+1]-0.0
            r1 = ((z1**2) + (x1**2))**0.5
            r2 = ((z2**2) + (x2**2))**0.5
            O1 = si.arctan2(z1,x1)
            O2 = si.arctan2(z2,x2)
            denom = z2-z1
            if denom == 0.0:
                denom = 1.0e-6
            alpha = (x2-x1)/denom
            beta = ((x1*z2)-(x2*z1))/denom
            factor = (beta/(1.0+(alpha**2)))
            term1 = si.log(r2/r1)   # si.log is the natural log, not base 10
            term2 = alpha*(O2-O1)
            sum_lines = sum_lines + (factor*(term1-term2))
        sum_lines = sum_lines*2*gamma*rho
        grav.append(sum_lines)
        t1 = time.clock()
        dt = t1-t0
        iter_time.append(dt)
Any help in speeding this loop up would be appreciated. Thanks.
Your xcorn and zcorn values repeat, so consider caching the result of some of the computations.
Take a look at the timeit and profile modules to get more information about what is taking the most computational time.
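(For instance, a quick way to time and profile the routine as written, replacing the manual time.clock() bookkeeping; this assumes the code above lives in the main script:)

import cProfile
import timeit

# total time over 10 repetitions of the whole routine
print timeit.timeit("procedure()", setup="from __main__ import procedure", number=10)

# per-function breakdown showing where the time goes
cProfile.run("procedure()")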
It is very inefficient to access individual elements of a numpy array in a Python loop. For example, this Python loop:
for i in xrange(0, len(a), 2):
    a[i] = i
would be much slower than:
a[::2] = np.arange(0, len(a), 2)
You could use a better algorithm (less time complexity) or use vector operations on numpy arrays as in the example above. But the quicker way might be just to compile the code using Cython:
#cython: boundscheck=False, wraparound=False
#procedure_module.pyx
import numpy as np
cimport numpy as np

ctypedef np.float64_t dtype_t

def procedure(np.ndarray[dtype_t,ndim=1] x,
              np.ndarray[dtype_t,ndim=1] xcorn):
    cdef:
        Py_ssize_t i, j
        dtype_t x1, x2, z1, z2, r1, r2, O1, O2
        np.ndarray[dtype_t,ndim=1] grav = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(xcorn.shape[0]-1):
            x1 = xcorn[j]-x[i]
            x2 = xcorn[j+1]-x[i]
            ...
        grav[i] = ...
    return grav
It is not necessary to define all types, but if you need a significant speed-up compared to Python you should define at least the types of the arrays and the loop indexes.
You could use cProfile (Cython supports it) instead of manual calls to time.clock().
To call procedure():
#!/usr/bin/env python
import pyximport; pyximport.install() # pip install cython
import numpy as np
from procedure_module import procedure
x = np.arange(-30.0,30.0,0.5)
xcorn = np.array((-9.79742526,9.78716693 ,9.78716693 ,-9.79742526,-9.79742526))
grav = procedure(x, xcorn)
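(To illustrate the pure-NumPy alternative mentioned above, the double loop can also be expressed with broadcasting. This is a sketch only, not validated against the original output; it takes the same xcorn/zcorn/gamma/rho as before and returns the whole grav array at once:)

import numpy as np

def procedure_vectorized(x, xcorn, zcorn, gamma, rho):
    # stations along rows, polygon sides along columns
    xc = np.asarray(xcorn)
    zc = np.asarray(zcorn)
    x1 = xc[None, :-1] - x[:, None]
    x2 = xc[None, 1:]  - x[:, None]
    z1 = zc[None, :-1]
    z2 = zc[None, 1:]
    r1 = np.hypot(x1, z1)
    r2 = np.hypot(x2, z2)
    O1 = np.arctan2(z1, x1)
    O2 = np.arctan2(z2, x2)
    denom = np.where(z2 - z1 == 0.0, 1.0e-6, z2 - z1)   # same guard as the loop version
    alpha = (x2 - x1) / denom
    beta = (x1 * z2 - x2 * z1) / denom
    factor = beta / (1.0 + alpha**2)
    term = np.log(r2 / r1) - alpha * (O2 - O1)
    return 2.0 * gamma * rho * (factor * term).sum(axis=1)

Called as procedure_vectorized(x, xcorn, zcorn, gamma, rho), it computes all stations in one pass with no Python-level inner loop.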
