Unexpectedly slow cython convolution code - python

I need to quickly compute a matrix whose entries are obtained by convolving a filter with a vector for each row, subsampling the entries of the resulting vector, and then taking the dot product of the result with another vector. Specifically, I want to compute
M = [conv(e_j, f)*P_i*v_i ]_{i,j},
where i varies from 1 to n and j varies from 1 to m. Here e_j is the indicator (row) vector of size n with a one only in column j, f is the filter of length s, P_i is an (n+s-1)-by-k matrix which samples the appropriate k entries from the convolution, and v_i is a column vector of length k.
It takes O(n*s) operations to compute each entry of M, so O(n*s*n*m) overall to compute M. For n=6, m=7, s=3, one core of my computer (8 GFLOPS) should be able to compute M in roughly 0.094 microseconds. Yet my very simple Cython implementation, following the example given in the Cython documentation, takes more than 2 milliseconds to compute an example with those parameters. That is about 4 orders of magnitude difference!
Here is a shar file with the Cython implementation and test code. Copy and paste it to a file and run 'bash <fname>' in a clean directory to get the code, then run 'bash ./test.sh' to see the abysmal performance.
cat > fastcalcM.pyx <<'EOF'
import numpy as np
cimport numpy as np
cimport cython
from scipy.signal import convolve
DTYPE=np.float32
ctypedef np.float32_t DTYPE_t
@cython.boundscheck(False)
def calcM(np.ndarray[DTYPE_t, ndim=1, negative_indices=False] filtertaps,
          int n, int m,
          np.ndarray[np.int_t, ndim=2, negative_indices=False] keep_indices,
          np.ndarray[DTYPE_t, ndim=2, negative_indices=False] V):
    """ Computes an n-by-m matrix M whose entries satisfy
    M_{i,j} = [conv(e_j, f)^T * P_i * v_i],
    where v_i^T is the i-th row of V, and P_i samples the entries from
    conv(e_j, f)^T indicated by the ith row of the keep_indices matrix """
    cdef int k = keep_indices.shape[1]
    cdef np.ndarray M = np.zeros((n, m), dtype=DTYPE)
    cdef np.ndarray ej = np.zeros((m,), dtype=DTYPE)
    cdef np.ndarray convolution
    cdef int rowidx, colidx, kidx
    for rowidx in range(n):
        for colidx in range(m):
            ej[colidx] = 1
            convolution = convolve(ej, filtertaps, mode='full')
            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
            ej[colidx] = 0
    return M
EOF
#-----------------------------------------------------------------------------
cat > test_calcM.py << 'EOF'
import numpy as np
from fastcalcM import calcM
filtertaps = np.array([-1, 2, -1]).astype(np.float32)
n, m = 6, 7
keep_indices = np.array([[1, 3],
[4, 5],
[2, 2],
[5, 5],
[3, 4],
[4, 5]]).astype(np.int)
V = np.random.random_integers(-5, 5, size=(6, 2)).astype(np.float32)
print calcM(filtertaps, n, m, keep_indices, V)
EOF
#-----------------------------------------------------------------------------
cat > test.sh << 'EOF'
python setup.py build_ext --inplace
echo -e "%run test_calcM\n%timeit calcM(filtertaps, n, m, keep_indices, V)" > script.ipy
ipython script.ipy
EOF
#-----------------------------------------------------------------------------
cat > setup.py << 'EOF'
from distutils.core import setup
from Cython.Build import cythonize
import numpy
setup(
name="Fast convolutions",
include_dirs = [numpy.get_include()],
ext_modules = cythonize("fastcalcM.pyx")
)
EOF
I thought the call to scipy's convolve might be the culprit (I'm not certain that Cython and scipy play well together), so I implemented my own convolution code, following the same example in the Cython documentation, but this made the overall code about 10 times slower.
Any ideas on how to get closer to the theoretically possible speed, or reasons why the difference is so great?

For one thing, the typing of M, ej and convolution doesn't allow fast indexing. The typing you've done is not particularly helpful at all, actually.
But it doesn't matter, because you have two overheads. The first is converting between Cython and Python types. You should keep untyped arrays around if you want to pass them to Python a lot, to avoid the conversion. Just moving this to Python sped it up for that reason (1 ms → 0.65 ms).
Then I profiled it:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           def calcM(filtertaps, n, m, keep_indices, V):
    16      4111         3615      0.9      0.1      k = keep_indices.shape[1]
    17      4111         8024      2.0      0.1      M = np.zeros((n, m), dtype=np.float32)
    18      4111         6090      1.5      0.1      ej = np.zeros((m,), dtype=np.float32)
    19
    20     28777        18690      0.6      0.3      for rowidx in range(n):
    21    197328       123284      0.6      2.2          for colidx in range(m):
    22    172662       112348      0.7      2.0              ej[colidx] = 1
    23    172662      4076225     23.6     73.6              convolution = convolve(ej, filtertaps, mode='full')
    24    517986       395513      0.8      7.1              for kidx in range(k):
    25    345324       668309      1.9     12.1                  M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
    26    172662       120271      0.7      2.2              ej[colidx] = 0
    27
    28      4111         2374      0.6      0.0      return M
Before you consider anything else, deal with convolve.
Why is convolve slow? Well, it's got a lot of overhead. numpy/scipy normally does; it's best for large datasets. If you know the size of your array is going to stay small, just reimplement convolve in Cython.
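To illustrate why a hand-written replacement can be so cheap here: convolving the indicator vector e_j with f just places a copy of f starting at position j, so conv(e_j, f)[t] equals f[t - j] when 0 <= t - j < s and zero otherwise. The following is a minimal NumPy sketch of that idea, not the answerer's code; the function name calc_m_direct is hypothetical, and it assumes every entry of keep_indices is a valid index into the full convolution (0 to m+s-2).

```python
import numpy as np

def calc_m_direct(filtertaps, m, keep_indices, V):
    """M[i, j] = sum_k conv(e_j, f)[keep_indices[i, k]] * V[i, k], without calling convolve."""
    s = filtertaps.shape[0]
    # offsets[i, j, k] = keep_indices[i, k] - j is the position inside the filter
    offsets = keep_indices[:, None, :] - np.arange(m)[None, :, None]
    valid = (offsets >= 0) & (offsets < s)
    # Outside the filter support the convolution is zero
    taps = np.where(valid, filtertaps[np.clip(offsets, 0, s - 1)], 0.0)
    return (taps * V[:, None, :]).sum(axis=-1)

filtertaps = np.array([-1, 2, -1], dtype=np.float32)
keep_indices = np.array([[1, 3], [4, 5], [2, 2], [5, 5], [3, 4], [4, 5]])
V = np.ones((6, 2), dtype=np.float32)
M = calc_m_direct(filtertaps, 7, keep_indices, V)
print(M.shape)  # (6, 7)
```

The same shifted-filter lookup could be written as typed loops in Cython; the point of the sketch is only that no general-purpose convolution routine is needed for this particular input.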
Also, try the typed memoryview syntax: use DTYPE_t[:, :] for a 2D array, DTYPE_t[:] for a 1D array, and so on. It's the memoryview protocol, and it's way better. There are cases where it has more overhead, but those can typically be worked around, and it's better in most other ways.
EDIT:
You can try (recursively) inlining the scipy function:
import numpy as np
from scipy.signal.sigtools import _correlateND
def calcM(filtertaps, n, m, keep_indices, V):
    k = keep_indices.shape[1]
    M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)
    # Reverse the filter so that correlation computes a convolution.
    # A tuple is used because recent NumPy versions no longer accept a list of slices.
    slice_obj = (slice(None, None, -1),) * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]
    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)
    for rowidx in range(n):
        for colidx in range(m):
            in1zpadded[colidx] = 1
            convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)
            for kidx in range(k):
                M[rowidx, colidx] += convolution[keep_indices[rowidx, kidx]] * V[rowidx, kidx]
            in1zpadded[colidx] = 0
    return M
Note that this uses private implementation details.
This is tailored for the particular dimensions, so I don't know if it'll work on your actual data. But it removes the vast majority of overhead. You can then improve this by typing things again:
import numpy as np
cimport numpy as np
from scipy.signal.sigtools import _correlateND
DTYPE=np.float32
ctypedef np.float32_t DTYPE_t
def calcM(filtertaps, int n, int m, np.int_t[:, :] t_keep_indices, DTYPE_t[:, :] t_V):
    cdef int rowidx, colidx, kidx, k
    cdef DTYPE_t[:, :] t_M
    cdef DTYPE_t[:] t_in1zpadded, t_convolution
    k = t_keep_indices.shape[1]
    t_M = M = np.zeros((n, m), dtype=np.float32)
    ej = np.zeros((m,), dtype=np.float32)
    # A tuple is used because recent NumPy versions no longer accept a list of slices.
    slice_obj = (slice(None, None, -1),) * len(filtertaps.shape)
    sliced_filtertaps_view = filtertaps[slice_obj]
    ps = ej.shape[0] + sliced_filtertaps_view.shape[0] - 1
    t_in1zpadded = in1zpadded = np.zeros(ps, ej.dtype)
    out = np.empty(ps, ej.dtype)
    for rowidx in range(n):
        for colidx in range(m):
            t_in1zpadded[colidx] = 1
            t_convolution = _correlateND(in1zpadded, sliced_filtertaps_view, out, 2)
            for kidx in range(k):
                t_M[rowidx, colidx] += t_convolution[<int>t_keep_indices[rowidx, kidx]] * t_V[rowidx, kidx]
            t_in1zpadded[colidx] = 0
    return M
It's over 10x as fast, but not as high as your pie-in-the-sky estimate. Then again, that calculation was a bit bogus to begin with ;).

Related

How to efficiently compute euclidean distance matrices without for loops in python?

I have a (51266,20,25,3) (N,F,J,C) matrix, where N is the example number, F is the frame number, J is the joint, and C is the xyz coordinates of the joint. I want to calculate the Euclidean distance matrix for each frame in each example, giving a matrix of dimensions (51266,20,25,25). My code is:
from sklearn.metrics.pairwise import euclidean_distances as euc
from tqdm import tqdm
import numpy as np
Examples = np.load('allExamples.npy')
theEuclideanMethod = np.zeros((0,20,25,25))
for example in tqdm(range(Examples.shape[0])):
    euclideanBox = np.zeros((0,25,25))
    for frame in range(20):
        euclideanBox = np.concatenate((euclideanBox,euc(Examples[example,frame,:,:])[np.newaxis,...]),axis=0)
    euclideanBox = euclideanBox[np.newaxis,...]
    theEuclideanMethod = np.concatenate((theEuclideanMethod,euclideanBox))
np.save("Euclidean examples.npy",theEuclideanMethod)
print(theEuclideanMethod.shape,"Euclidean shape")
The problem is that I'm using for loops, which are super slow. How else can I modify my code to run faster?
This should run pretty fast. Float32 is used to keep the memory usage low, but is optional. Increase batch_size for more speed, or decrease it for lower memory usage.
import numpy as np
# Adjust batch_size depending on your memory
batch_size = 500
# Make some fake data
x = np.random.randn(51266,20,25,3).astype(np.float32)
y = np.random.randn(51266,20,25,3).astype(np.float32)
# distance_matrix
d = np.empty(x.shape[:-1] + (x.shape[-2],), dtype=np.float32)
# Number of batches
N = (x.shape[0]-1) // batch_size + 1
for i in range(N):
    d[i*batch_size:(i+1)*batch_size] = np.sqrt(np.sum((
        x[i*batch_size:(i+1)*batch_size,:,:,None] - \
        y[i*batch_size:(i+1)*batch_size,:,None,:])**2, axis=-1))
You can use array broadcasting, like this:
import numpy as np
examples = np.random.uniform(size=(5, 6, 7, 3))
N, F, J, C = examples.shape
# deltas.shape == (N, F, J, J, C) - Cartesian deltas
deltas = examples.reshape(N, F, J, 1, C) - examples.reshape(N, F, 1, J, C)
# distances.shape == (N, F, J, J)
distances = np.sqrt((deltas**2).sum(axis=-1), dtype=np.float32)
del deltas # release memory (only needed for interactive use)
This is a bit memory-hungry: with the values of N, F, J, C that you mentioned, the intermediate results (deltas) will take 16 GB, assuming double precision. It will be more efficient (6x less memory and better use of cache) if you preallocate the output array in single precision and loop over the N axis:
distances = np.empty((N, F, J, J), dtype=np.float32)
for i, ex in enumerate(examples):
    # deltas.shape = (F, J, J, C) - Cartesian deltas
    deltas = ex.reshape(F, J, 1, C) - ex.reshape(F, 1, J, C)
    distances[i] = np.sqrt((deltas**2).sum(axis=-1))
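The memory figures quoted in this answer can be checked with a little arithmetic. The snippet below is only a back-of-the-envelope sketch using the shape from the question; it suggests the full intermediate array is closer to 15.4 GB (which the answer rounds to 16 GB) and that a single-precision output is about six times smaller.

```python
import numpy as np

N, F, J, C = 51266, 20, 25, 3

# Full broadcast intermediate: shape (N, F, J, J, C) in float64
deltas_bytes = N * F * J * J * C * np.dtype(np.float64).itemsize
# Final output: shape (N, F, J, J) in float32
out_bytes = N * F * J * J * np.dtype(np.float32).itemsize

print(f"deltas (float64):    {deltas_bytes / 1e9:.1f} GB")   # 15.4 GB
print(f"distances (float32): {out_bytes / 1e9:.1f} GB")      # 2.6 GB
print(f"ratio: {deltas_bytes / out_bytes:.0f}x")             # 6x
```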

How To Optimise This Cython Function?

I have a Cython module:
#!python
#cython: language_level=3, boundscheck=False, nonecheck=False
import numpy as np
cimport numpy as np
def portfolio_s2( double[:,:] cv, double[:] weights ):
    """ Calculate portfolio variance"""
    cdef double s0
    cdef double s1
    cdef double s2
    s0 = 0.0
    for i in range( weights.shape[0] ):
        s0 += weights[i]*weights[i]*cv[i,i]
    s1 = 0.0
    for i in range( weights.shape[0]-1 ):
        s2 = 0.0
        for j in range( i+1, weights.shape[0] ):
            s2 += weights[j]*cv[i,j]
        s1+= weights[i]*s2
    return s0+2.0*s1
I have the equivalent function in Numba:
@nb.jit( nopython=True )
def portfolio_s2( cv, weights ):
    """ Calculate portfolio variance using numba """
    s0 = 0.0
    for i in range( weights.shape[0] ):
        s0 += weights[i]*weights[i]*cv[i,i]
    s1 = 0.0
    for i in range( weights.shape[0]-1 ):
        s2 = 0.0
        for j in range( i+1, weights.shape[0] ):
            s2 += weights[j]*cv[i,j]
        s1+= weights[i]*s2
    return s0+2.0*s1
For a covariance matrix of size 10, the Numba version is 20 times faster than Cython. I assume that this is due to something I am doing wrong in Cython, but I am new to Cython and am not sure what to do.
Using Cel's Optimisation...
I have written a script to test Cel's code vs the Numba version:
sizes = [ 2, 3, 4, 6, 8, 12, 16, 32, 48, 64, 96, 128, 196, 256 ]
cython_timings = []
numba_timings = []
for size in sizes:
    X = np.random.randn(100,size)
    cv = np.cov( X, rowvar=0 )
    w = np.ones( cv.shape[0] )
    num_tests=10
    pm.portfolio_s2( cv, w )
    with Timer( 'Cython' ) as cython_timer:
        for _ in range( num_tests ):
            s2_cython = pm.portfolio_s2_opt( cv, w )
    cython_timings.append( cython_timer.interval )
    helpers.portfolio_s2( cv, w )
    with Timer( 'Numba' ) as numba_timer:
        for _ in range( num_tests ):
            s2_numba = helpers.portfolio_s2( cv, w )
    numba_timings.append( numba_timer.interval )
plt.plot( sizes, cython_timings, label='Cython' )
plt.plot( sizes, numba_timings, label='Numba' )
plt.title( 'Execution Time By Covariance Size' )
plt.legend()
plt.show()
The resulting chart shows that for small covariance matrices, Numba performs better. But as the covariance matrix size increases, Cython scales better and eventually outperforms Numba by a large margin.
Is there some sort of function call overhead that is causing Cython to have such poor performance for small matrices? My use case for this code will involve calculating covariances for lots of small covariance matrices. So I need better performance for small matrices rather than large.
The important thing when using Cython is to make sure that everything is statically typed.
In your example the loop variables i and j were not typed. The declaration cdef size_t i, j already gives you a massive speedup.
There are nice examples in the Working with NumPy section of cython's docs.
This is my setup and the evaluation:
import numpy as np
n = 100
cv = np.random.rand(n,n)
weights= np.random.rand(n)
The original version:
%timeit portfolio_s2(cv, weights)
10000 loops, best of 3: 147 µs per loop
The optimized version:
%timeit portfolio_s2_opt(cv, weights)
100000 loops, best of 3: 10 µs per loop
And here is the code:
import numpy as np
cimport numpy as np
def portfolio_s2_opt(double[:,:] cv, double[:] weights):
    """ Calculate portfolio variance"""
    cdef double s0
    cdef double s1
    cdef double s2
    cdef size_t i, j
    s0 = 0.0
    for i in range( weights.shape[0] ):
        s0 += weights[i]*weights[i]*cv[i,i]
    s1 = 0.0
    for i in range( weights.shape[0]-1 ):
        s2 = 0.0
        for j in range( i+1, weights.shape[0] ):
            s2 += weights[j]*cv[i,j]
        s1+= weights[i]*s2
    return s0+2.0*s1

Using Cython correctly in sample code with numpy

I was wondering if I'm missing something when using Cython with Numpy because I haven't seen much of an improvement. I wrote this code as an example.
Naive version:
import numpy as np
from skimage.util import view_as_windows
it = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000,1000)
windows = view_as_windows(arr, (it, it), it)
container = np.zeros((windows.shape[0], windows.shape[1]))
def test(windows):
    for i in range(windows.shape[0]):
        for j in range(windows.shape[1]):
            container[i,j] = np.mean(windows[i,j])
    return container
%%timeit
test(windows)
1 loops, best of 3: 131 ms per loop
Cythonized version:
%%cython --annotate
import numpy as np
cimport numpy as np
from skimage.util import view_as_windows
import cython
cdef int step = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000,1000)
windows = view_as_windows(arr, (step, step), step)
@cython.boundscheck(False)
def cython_test(np.ndarray[np.float64_t, ndim=4] windows):
    cdef np.ndarray[np.float64_t, ndim=2] container = np.zeros((windows.shape[0], windows.shape[1]),dtype=np.float64)
    cdef int i, j
    I = windows.shape[0]
    J = windows.shape[1]
    for i in range(I):
        for j in range(J):
            container[i,j] = np.mean(windows[i,j])
    return container
%timeit cython_test(windows)
10 loops, best of 3: 126 ms per loop
As you can see, there is a very modest improvement, so maybe I'm doing something wrong. By the way, in the annotation that Cython produces, the numpy lines have a yellow background even after including the efficient indexing syntax np.ndarray[DTYPE_t, ndim=2]. Why?
By the way, in my view the ideal outcome is being able to use most numpy functions but still get some reasonable improvement after taking advantage of efficient indexing syntax or maybe memory views as in HYRY's answer.
UPDATE
It seems I'm not doing anything wrong in the code I posted above and that the yellow background in some lines is normal, so I was left wondering the following: in which situations can I get a benefit from typing cdef np.ndarray[np.float64_t, ndim=2] in front of numpy arrays? I suppose there are specific instances where this is helpful, otherwise there wouldn't be much purpose in doing it.
You need to implement the mean() function yourself to speed up the code, because the overhead of calling a numpy function is very high.
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cython_test(double[:, :, :, :] windows):
    cdef double[:, ::1] container
    cdef int i, j, k, l
    cdef int n0, n1, n2, n3
    cdef double inv_n
    cdef double s
    n0, n1, n2, n3 = windows.base.shape
    container = np.zeros((n0, n1))
    inv_n = 1.0 / (n2 * n3)
    for i in range(n0):
        for j in range(n1):
            s = 0
            for k in range(n2):
                for l in range(n3):
                    s += windows[i, j, k, l]
            container[i,j] = s * inv_n
    return container.base
Here are the %timeit results:
python_test(windows): 63.7 ms
cython_test(windows): 1.24 ms
np.mean(windows, axis=(2, 3)): 2.66 ms
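The last timing line hints that, for this particular task, a single vectorized call avoids the per-window overhead without any Cython. As a rough sketch (not from the answer), the same non-overlapping tile means can also be computed in plain NumPy with a reshape; the function name block_mean is hypothetical, and it assumes that partial tiles at the edges are dropped, as they are with view_as_windows and a step equal to the window size.

```python
import numpy as np

def block_mean(arr, step):
    """Mean of each non-overlapping step-by-step tile of a 2D array."""
    h, w = arr.shape
    nh, nw = h // step, w // step
    # Trim any ragged edge, then split each axis into (tile index, offset within tile)
    tiles = arr[:nh * step, :nw * step].reshape(nh, step, nw, step)
    return tiles.mean(axis=(1, 3))

arr = np.arange(1000 * 1000, dtype=np.float64).reshape(1000, 1000)
means = block_mean(arr, 16)
print(means.shape)  # (62, 62)
```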

Subset of a matrix multiplication, fast, and sparse

While converting a collaborative filtering code to use sparse matrices, I'm puzzling over the following problem: given two full matrices X (m by l) and Theta (n by l), and a sparse matrix R (m by n), is there a fast way to calculate the sparse inner product, i.e. the entries of X*Theta.T only at the positions where R is nonzero? The large dimensions are m and n (order 100000), while l is small (order 10). This is probably a fairly common operation for big data since it shows up in the cost function of most linear regression problems, so I'd expect a solution built into scipy.sparse, but I haven't found anything obvious yet.
The naive way to do this in python is R.multiply(X*Theta.T), but this evaluates the full matrix X*Theta.T (m by n, order 100000**2), which occupies too much memory, and then discards most of the entries since R is sparse.
There is a pseudo solution already here on stackoverflow, but it is non-sparse in one step:
def sparse_mult_notreally(a, b, coords):
    rows, cols = coords
    rows, r_idx = np.unique(rows, return_inverse=True)
    cols, c_idx = np.unique(cols, return_inverse=True)
    C = np.array(np.dot(a[rows, :], b[:, cols])) # this operation is dense
    return sp.coo_matrix( (C[r_idx,c_idx],coords), (a.shape[0],b.shape[1]) )
This works fine, and fast, for me on small enough arrays, but it barfs on my big datasets with the following error:
... in sparse_mult(a, b, coords)
132 rows, r_idx = np.unique(rows, return_inverse=True)
133 cols, c_idx = np.unique(cols, return_inverse=True)
--> 134 C = np.array(np.dot(a[rows, :], b[:, cols])) # this operation is not sparse
135 return sp.coo_matrix( (C[r_idx,c_idx],coords), (a.shape[0],b.shape[1]) )
ValueError: array is too big.
A solution which IS actually sparse, but very slow, is:
def sparse_mult(a, b, coords):
    rows, cols = coords
    n = len(rows)
    C = np.array([ float(a[rows[i],:]*b[:,cols[i]]) for i in range(n) ]) # this is sparse, but VERY slow
    return sp.coo_matrix( (C,coords), (a.shape[0],b.shape[1]) )
Does anyone know a fast, fully sparse way to do this?
I profiled 4 different solutions to your problem, and it looks like for any size of the array, the numba jit solution is the best. A close second is @Alexander's cython solution.
Here are the results (M is the number of rows in the x array):
M = 1000
function sparse_dense took 0.03 sec.
function sparse_loop took 0.07 sec.
function sparse_numba took 0.00 sec.
function sparse_cython took 0.09 sec.
M = 10000
function sparse_dense took 2.88 sec.
function sparse_loop took 0.68 sec.
function sparse_numba took 0.00 sec.
function sparse_cython took 0.01 sec.
M = 100000
function sparse_dense ran out of memory
function sparse_loop took 6.84 sec.
function sparse_numba took 0.09 sec.
function sparse_cython took 0.12 sec.
The script I used to profile these methods is:
import numpy as np
from scipy.sparse import coo_matrix
from numba import autojit, jit, float64, int32
import pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"],
"include_dirs":np.get_include()},
reload_support=True)
def sparse_dense(a,b,c):
    return coo_matrix(c.multiply(np.dot(a,b)))
def sparse_loop(a,b,c):
    """Multiply sparse matrix `c` by np.dot(a,b) by looping over non-zero
    entries in `c` and using `np.dot()` for each entry."""
    N = c.size
    data = np.empty(N,dtype=float)
    for i in range(N):
        data[i] = c.data[i]*np.dot(a[c.row[i],:],b[:,c.col[i]])
    return coo_matrix((data,(c.row,c.col)),shape=(a.shape[0],b.shape[1]))
#@autojit
def _sparse_mult4(a,b,cd,cr,cc):
    N = cd.size
    data = np.empty_like(cd)
    for i in range(N):
        num = 0.0
        for j in range(a.shape[1]):
            num += a[cr[i],j]*b[j,cc[i]]
        data[i] = cd[i]*num
    return data
_fast_sparse_mult4 = \
    jit(float64[:,:](float64[:,:],float64[:,:],float64[:],int32[:],int32[:]))(_sparse_mult4)
def sparse_numba(a,b,c):
    """Multiply sparse matrix `c` by np.dot(a,b) using Numba's jit."""
    assert c.shape == (a.shape[0],b.shape[1])
    data = _fast_sparse_mult4(a,b,c.data,c.row,c.col)
    return coo_matrix((data,(c.row,c.col)),shape=(a.shape[0],b.shape[1]))
def sparse_cython(a, b, c):
    """Computes c.multiply(np.dot(a,b)) using cython."""
    from sparse_mult_c import sparse_mult_c
    data = np.empty_like(c.data)
    sparse_mult_c(a,b,c.data,c.row,c.col,data)
    return coo_matrix((data,(c.row,c.col)),shape=(a.shape[0],b.shape[1]))
def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))
if __name__ == '__main__':
    import time
    for M in [1000,10000,100000]:
        print 'M = %i' % M
        N = M + 2
        L = 10
        x = np.random.rand(M,L)
        t = np.random.rand(N,L).T
        # number of non-zero entries in sparse r matrix
        S = M*10
        row = np.random.randint(M,size=S)
        col = np.random.randint(N,size=S)
        # remove duplicate rows and columns
        row, col = unique_rows(np.dstack((row,col)).squeeze()).T
        data = np.random.rand(row.size)
        r = coo_matrix((data,(row,col)),shape=(M,N))
        a2 = sparse_loop(x,t,r)
        for f in [sparse_dense,sparse_loop,sparse_numba,sparse_cython]:
            t0 = time.time()
            try:
                a = f(x,t,r)
            except MemoryError:
                print 'function %s ran out of memory' % f.__name__
                continue
            elapsed = time.time()-t0
            try:
                diff = abs(a-a2)
                if diff.nnz > 0:
                    assert np.max(abs(a-a2).data) < 1e-5
            except AssertionError:
                print f.__name__
                raise
            print 'function %s took %.2f sec.' % (f.__name__,elapsed)
The cython function is a slightly modified version of @Alexander's code:
# working from tutorial at: http://docs.cython.org/src/tutorial/numpy.html
cimport numpy as np
# Turn bounds checking back on if there are ANY problems!
cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
def sparse_mult_c(np.ndarray[np.float64_t, ndim=2] a,
                  np.ndarray[np.float64_t, ndim=2] b,
                  np.ndarray[np.float64_t, ndim=1] data,
                  np.ndarray[np.int32_t, ndim=1] rows,
                  np.ndarray[np.int32_t, ndim=1] cols,
                  np.ndarray[np.float64_t, ndim=1] out):
    cdef int n = rows.shape[0]
    cdef int k = a.shape[1]
    cdef int i,j
    cdef double num
    for i in range(n):
        num = 0.0
        for j in range(k):
            num += a[rows[i],j] * b[j,cols[i]]
        out[i] = data[i]*num
Based on the extra information in the comments, I think what's throwing you off is the call to np.unique. Try the following approach:
import numpy as np
import scipy.sparse as sps
from numpy.core.umath_tests import inner1d
n = 100000
x = np.random.rand(n, 10)
theta = np.random.rand(n, 10)
rows = np.arange(n)
cols = np.arange(n)
np.random.shuffle(rows)
np.random.shuffle(cols)
def sparse_multiply(x, theta, rows, cols):
    data = inner1d(x[rows], theta[cols])
    return sps.coo_matrix((data, (rows, cols)),
                          shape=(x.shape[0], theta.shape[0]))
I get the following timings:
n = 1000
%timeit sparse_multiply(x, theta, rows, cols)
1000 loops, best of 3: 465 us per loop
n = 10000
%timeit sparse_multiply(x, theta, rows, cols)
100 loops, best of 3: 4.29 ms per loop
n = 100000
%timeit sparse_multiply(x, theta, rows, cols)
10 loops, best of 3: 61.5 ms per loop
And of course, with n = 100:
>>> np.allclose(sparse_multiply(x, theta, rows, cols).toarray()[rows, cols],
...             x.dot(theta.T)[rows, cols])
True
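One caveat with the approach above: inner1d lives in the private module numpy.core.umath_tests, which may not be importable in recent NumPy releases. A possible substitute for the row-wise dot product is np.einsum. The sketch below only computes the data vector (the helper name rowwise_dot is hypothetical); its result can be passed to sps.coo_matrix exactly as in sparse_multiply.

```python
import numpy as np

def rowwise_dot(x, theta, rows, cols):
    """data[i] = dot(x[rows[i]], theta[cols[i]]), computed without a dense m-by-n product."""
    return np.einsum('ij,ij->i', x[rows], theta[cols])

rng = np.random.default_rng(0)
x = rng.random((5, 3))
theta = rng.random((4, 3))
rows = np.array([0, 2, 4, 1])
cols = np.array([3, 0, 1, 1])
data = rowwise_dot(x, theta, rows, cols)
print(np.allclose(data, x.dot(theta.T)[rows, cols]))  # True
```

Only len(rows) dot products of length l are evaluated, so the memory use stays proportional to the number of nonzero entries.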
Haven't tested Jaime's answer yet (thanks again!), but I implemented another answer that works in the meantime using cython.
file sparse_mult_c.pyx:
# working from tutorial at: http://docs.cython.org/src/tutorial/numpy.html
cimport numpy as np
# Turn bounds checking back on if there are ANY problems!
cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
def sparse_mult_c(np.ndarray[np.float64_t, ndim=2] a,
                  np.ndarray[np.float64_t, ndim=2] b,
                  np.ndarray[np.int32_t, ndim=1] rows,
                  np.ndarray[np.int32_t, ndim=1] cols,
                  np.ndarray[np.float64_t, ndim=1] C ):
    cdef int n = rows.shape[0]
    cdef int k = a.shape[1]
    cdef int i,j
    for i in range(n):
        for j in range(k):
            C[i] += a[rows[i],j] * b[j,cols[i]]
Then compile it as per http://docs.cython.org/src/userguide/tutorial.html
Then in my python code, I include the following:
def sparse_mult(a, b, coords):
    #a,b are np.ndarrays
    from sparse_mult_c import sparse_mult_c
    rows, cols = coords
    C = np.zeros(rows.shape[0])
    sparse_mult_c(a,b,rows,cols,C)
    return sp.coo_matrix( (C,coords), (a.shape[0],b.shape[1]) )
This works fully sparse and also runs faster than even the original (memory-inefficient for me) solution.

How do I speed up a python nested loop?

I'm trying to calculate the gravity effect of a buried object by calculating the effect on each side of the body, then summing up the contributions to get one measurement at one station, and repeating for a number of stations. The code is as follows (the body is a square and the code calculates clockwise around it, which is why it goes from -x back to -x coordinates):
grav = []
x=si.arange(-30.0,30.0,0.5)
#-9.79742526 9.78716693 22.32153704 27.07382349 2138.27146193
xcorn = (-9.79742526,9.78716693 ,9.78716693 ,-9.79742526,-9.79742526)
zcorn = (22.32153704,22.32153704,27.07382349,27.07382349,22.32153704)
gamma = (6.672*(10**-11))#'N m^2 / Kg^2'
rho = 2138.27146193#'Kg / m^3'
grav = []
iter_time=[]
def procedure():
    for i in si.arange(len(x)):# cycles position
        t0=time.clock()
        sum_lines = 0.0
        for n in si.arange(len(xcorn)-1):#cycles corners
            x1 = xcorn[n]-x[i]
            x2 = xcorn[n+1]-x[i]
            z1 = zcorn[n]-0.0 #just depth to corner since all observations are on the surface.
            z2 = zcorn[n+1]-0.0
            r1 = ((z1**2) + (x1**2))**0.5
            r2 = ((z2**2) + (x2**2))**0.5
            O1 = si.arctan2(z1,x1)
            O2 = si.arctan2(z2,x2)
            denom = z2-z1
            if denom == 0.0:
                denom = 1.0e-6
            alpha = (x2-x1)/denom
            beta = ((x1*z2)-(x2*z1))/denom
            factor = (beta/(1.0+(alpha**2)))
            term1 = si.log(r2/r1)#note: log here is the natural logarithm, not base 10
            term2 = alpha*(O2-O1)
            sum_lines = sum_lines + (factor*(term1-term2))
        sum_lines = sum_lines*2*gamma*rho
        grav.append(sum_lines)
        t1 = time.clock()
        dt = t1-t0
        iter_time.append(dt)
Any help in speeding this loop up would be appreciated. Thanks.
Your xcorn and zcorn values repeat, so consider caching the result of some of the computations.
Take a look at the timeit and profile modules to get more information about what is taking the most computational time.
It is very inefficient to access individual elements of a numpy array in a Python loop. For example, this Python loop:
for i in xrange(0, len(a), 2):
    a[i] = i
would be much slower than:
a[::2] = np.arange(0, len(a), 2)
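Applied to the loop in the question, the same idea might look like the sketch below, which broadcasts over a (stations, edges) grid instead of looping in Python. This is an illustration rather than the answerer's code: the function name gravity_profile is hypothetical, and the formulas are copied from the question as-is, including the 1.0e-6 substitute for a zero denominator and the natural logarithm.

```python
import numpy as np

def gravity_profile(x, xcorn, zcorn, gamma, rho):
    """Vectorized version of the question's double loop; returns one value per station."""
    x = np.asarray(x, dtype=float)
    xcorn = np.asarray(xcorn, dtype=float)
    zcorn = np.asarray(zcorn, dtype=float)
    # Arrays of shape (stations, edges): start and end corner of every edge
    x1 = xcorn[:-1][None, :] - x[:, None]
    x2 = xcorn[1:][None, :] - x[:, None]
    z1 = zcorn[:-1][None, :]
    z2 = zcorn[1:][None, :]
    r1 = np.sqrt(z1**2 + x1**2)
    r2 = np.sqrt(z2**2 + x2**2)
    theta1 = np.arctan2(z1, x1)
    theta2 = np.arctan2(z2, x2)
    denom = z2 - z1
    denom = np.where(denom == 0.0, 1.0e-6, denom)  # same guard as the original loop
    alpha = (x2 - x1) / denom
    beta = (x1 * z2 - x2 * z1) / denom
    factor = beta / (1.0 + alpha**2)
    terms = factor * (np.log(r2 / r1) - alpha * (theta2 - theta1))
    return 2.0 * gamma * rho * terms.sum(axis=1)

x = np.arange(-30.0, 30.0, 0.5)
xcorn = (-9.79742526, 9.78716693, 9.78716693, -9.79742526, -9.79742526)
zcorn = (22.32153704, 22.32153704, 27.07382349, 27.07382349, 22.32153704)
gamma = 6.672e-11   # N m^2 / kg^2
rho = 2138.27146193  # kg / m^3
grav = gravity_profile(x, xcorn, zcorn, gamma, rho)
print(grav.shape)  # (120,)
```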
You could use a better algorithm (less time complexity) or use vector operations on numpy arrays as in the example above. But the quicker way might be just to compile the code using Cython:
#cython: boundscheck=False, wraparound=False
#procedure_module.pyx
import numpy as np
cimport numpy as np
ctypedef np.float64_t dtype_t
def procedure(np.ndarray[dtype_t,ndim=1] x,
              np.ndarray[dtype_t,ndim=1] xcorn):
    cdef:
        Py_ssize_t i, j
        dtype_t x1, x2, z1, z2, r1, r2, O1, O2
        np.ndarray[dtype_t,ndim=1] grav = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(xcorn.shape[0]-1):
            x1 = xcorn[j]-x[i]
            x2 = xcorn[j+1]-x[i]
            ...
        grav[i] = ...
    return grav
It is not necessary to define all types but if you need a significant speed up compared to Python you should define at least types of arrays and loop indexes.
You could use cProfile (Cython supports it) instead of manual calls to time.clock().
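For reference, a minimal cProfile session could look like the sketch below. The function work is a hypothetical stand-in for the code being measured; for a Cython module, the functions generally only show up in the report if the module is compiled with the profile=True compiler directive.

```python
import cProfile
import io
import pstats

def work(n):
    """Stand-in workload (hypothetical); replace with the function under test."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = work(100_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Unlike manual time.clock() calls around each iteration, this gives per-function call counts and cumulative times in one pass.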
To call procedure():
#!/usr/bin/env python
import pyximport; pyximport.install() # pip install cython
import numpy as np
from procedure_module import procedure
x = np.arange(-30.0,30.0,0.5)
xcorn = np.array((-9.79742526,9.78716693 ,9.78716693 ,-9.79742526,-9.79742526))
grav = procedure(x, xcorn)
