I am a researcher working on geophysical inversion, which often requires solving a linear system Au = rhs. Here A is usually a sparse matrix, while rhs and u can be either dense matrices or vectors. Gradient-based inversion requires sensitivity computations, which involve a number of matrix-matrix and matrix-vector multiplications. Recently I found some weird behaviour in sparse-matrix times dense-matrix multiplication, and below is an example:
import numpy as np
import scipy.sparse as sp
n = int(1e6)
m = int(100)
e = np.ones(n)
A = sp.spdiags(np.vstack((e, e, e)), np.array([-1, 0, 1]), n, n)
A = A.tocsr()
u = np.random.randn(n,m)
%timeit rhs = A*u[:,0]
#10 loops, best of 3: 22 ms per loop
%timeit rhs = A*u[:,:10]
#10 loops, best of 3: 98.4 ms per loop
%timeit rhs = A*u
#1 loop, best of 3: 570 ms per loop
I was expecting an almost linear increase in computation time as I increase the number of columns of the dense matrix u multiplied by the sparse matrix A (e.g. the second case, A*u[:,:10], should take about 220 ms and the last one, A*u, about 2.2 s). However, it is much faster than I expected. Conversely, matrix-vector multiplication is much slower than matrix-matrix multiplication. Can someone explain why? Further, is there an effective way to boost matrix-vector multiplication to a similar level of efficiency as matrix-matrix multiplication?
If you look at the source code, you can see that csr_matvec (which implements matrix-vector multiplication) is implemented as a straightforward sum loop in C code, while csr_matvecs (which implements matrix-matrix multiplication) is implemented as a call to the axpy BLAS routine. Depending on what BLAS library your installation is linked to, such a call can be far more efficient than the straightforward C implementation used for matrix-vector multiplication. That's likely why you're seeing matrix-vector multiplication be so slow.
Changing scipy so that it calls BLAS in the matrix-vector case could be a useful contribution to the package.
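Whether the BLAS-backed path actually wins for a single right-hand side depends on the BLAS your installation uses, so it is worth timing both paths yourself. A quick sketch using the arrays from the question, keeping the operand two-dimensional so it goes through the matrix-matrix kernel:
v = u[:, 0]       # 1-D vector   -> csr_matvec path
v2d = u[:, 0:1]   # (n, 1) array -> csr_matvecs path
r1 = A.dot(v)
r2 = np.asarray(A.dot(v2d)).ravel()
print(np.allclose(r1, r2))   # same result either way; compare the two calls with %timeit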
Related
I am computing a matrix multiplication a few thousand times during my algorithm. Therefore, I compute:
import numpy as np
import time
def mat_mul(mat1, mat2, mat3, mat4):
    return np.dot(np.transpose(mat1), np.multiply(np.diag(mat2)[:, None], mat3)) + mat4
n = 2000
mat1 = np.random.rand(n,n)
mat2 = np.diag(np.random.rand(n))
mat3 = np.random.rand(n,n)
mat4 = np.random.rand(n,n)
t0=time.time()
cov_11=mat_mul(mat1, mat2, mat1, mat4)
t1=time.time()
print('time ',t1-t0, 's')
The matrices are 2000 x 2000 (n = 2000), and mat2 only has entries along its diagonal; the remaining entries are zero.
On my machine I get the following:
time 0.3473696708679199 s
How can I speed this up?
Thanks.
The NumPy implementation can be optimized a bit by reducing the number of temporary arrays and reusing them as much as possible (i.e. multiple times). Indeed, while matrix multiplications are generally heavily optimized by BLAS implementations, filling/copying newly allocated arrays adds a non-negligible overhead.
Here is the implementation:
def mat_mul_opt(mat1, mat2, mat3, mat4):
    n = mat1.shape[0]   # derive the size locally instead of relying on the global n
    tmp1 = np.empty((n, n))
    tmp2 = np.empty((n, n))
    vect = np.diag(mat2)[:, None]
    np.dot(np.transpose(mat1), np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1
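As a quick sanity check, the optimized version should produce the same result as mat_mul from the question:
assert np.allclose(mat_mul(mat1, mat2, mat1, mat4),
                   mat_mul_opt(mat1, mat2, mat1, mat4))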
The code can be optimized further if it is fine to mutate the input matrices, or if you can pre-allocate tmp1 and tmp2 outside the function once and then reuse them multiple times. Here is an example:
def mat_mul_opt2(mat1, mat2, mat3, mat4, tmp1, tmp2):
    vect = np.diag(mat2)[:, None]
    np.dot(np.transpose(mat1), np.multiply(vect, mat3, out=tmp1), out=tmp2)
    np.add(mat4, tmp2, out=tmp1)
    return tmp1
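A usage sketch, assuming the matrices from the question are in scope: allocate the temporaries once and reuse them on every call. Note that the returned array aliases tmp1, so copy it if you need to keep a result across iterations.
tmp1 = np.empty((n, n))
tmp2 = np.empty((n, n))
for _ in range(10):   # stands in for the few thousand iterations of the algorithm
    cov_11 = mat_mul_opt2(mat1, mat2, mat1, mat4, tmp1, tmp2)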
Here are performance results on my i5-9600KF processor (6-cores):
mat_mul: 103.6 ms
mat_mul_opt: 96.7 ms
mat_mul_opt2: 83.5 ms
np.dot time only: 74.4 ms (kind of practical lower-bound)
Optimal lower bound: 55 ms (quite optimistic)
Cython is not going to speed this up, simply because NumPy already uses other tricks such as threading and SIMD (through its BLAS backend); anyone who tries to implement such a function with plain Cython is going to end up with much worse performance.
Only two things are likely to help further:
use a GPU-based version of NumPy such as CuPy (see the sketch after this list)
use a different, more optimized BLAS backend for NumPy if you aren't already using the best one available (such as Intel MKL)
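For the CuPy option, a rough sketch of the same computation on the GPU might look like the following (untested here; it assumes a CUDA-capable GPU with cupy installed, and reuses the matrix names from the question):
import cupy as cp

mat1_gpu = cp.asarray(mat1)
mat2_gpu = cp.asarray(mat2)
mat3_gpu = cp.asarray(mat3)
mat4_gpu = cp.asarray(mat4)

# same expression as mat_mul, evaluated on the GPU
cov_gpu = cp.dot(mat1_gpu.T, cp.diag(mat2_gpu)[:, None] * mat3_gpu) + mat4_gpu
cov_11 = cp.asnumpy(cov_gpu)   # copy the result back to host memory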
I have a scipy.sparse.csc_matrix sparse matrix A of shape (N, N) where N is about 15000. A has less than 1 % non-zero elements.
I need to solve Ax = b as time-efficiently as possible.
Using scipy.sparse.linalg.spsolve takes about 350 ms using scikit-umfpack.
scipy.sparse.linalg.gmres, at about 50 ms, is significantly faster when using an ILU preconditioner. Without a preconditioner it takes more than a minute.
However, creating the preconditioner takes about 1.5 s. Given that, it would be more efficient to just use scipy.sparse.linalg.spsolve.
I'm creating the preconditioner M with
from scipy.sparse.linalg import LinearOperator, spilu
ilu = spilu(A)
Mx = lambda x: ilu.solve(x)
M = LinearOperator((N, N), Mx)
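which I then pass to gmres roughly like this (b being the right-hand side from Ax = b):
from scipy.sparse.linalg import gmres

# info == 0 indicates convergence
x, info = gmres(A, b, M=M)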
Is there a more efficient way of doing this, such that using scipy.sparse.linalg.gmres would be more profitable?
I have a set of 2D arrays for which I have to compute the 2D correlation. I have been trying many different things (even programming it in Fortran), but I think the fastest way will be to calculate it using the FFT.
Based on my tests and on this answer I can use scipy.signal.fftconvolve and it works fine if I'm trying to reproduce the output of scipy.signal.correlate2d with boundary='fill'. So basically this
scipy.signal.fftconvolve(a, a[::-1, ::-1], mode='same')
is equal to this (with the exception of a slight shift)
scipy.signal.correlate2d(a, a, boundary='fill', mode='same')
The thing is that the correlation should be computed in wrapped mode, since these are 2D periodic arrays (i.e., boundary='wrap'). So if I'm trying to reproduce the output of
scipy.signal.correlate2d(a, a, boundary='wrap', mode='same')
I can't, or at least I don't see how to do it. (And I want to use the FFT method, since it's way faster.)
Apparently SciPy used to have something like that which might have done the trick, but it seems to have been left behind and I can't find it, so I think SciPy may have dropped support for it.
Anyway, is there a way to use SciPy's or NumPy's FFT routines to calculate this correlation for periodic arrays?
The wrapped correlation can be implemented using the FFT. Here's some code to demonstrate how:
In [276]: import numpy as np
In [277]: from scipy.signal import correlate2d
Create a random array a to work with:
In [278]: a = np.random.randn(200, 200)
Compute the 2D correlation using scipy.signal.correlate2d:
In [279]: c = correlate2d(a, a, boundary='wrap', mode='same')
Now compute the same result, using the 2D FFT functions from numpy.fft. (This code assumes a is square.)
In [280]: from numpy.fft import fft2, ifft2
In [281]: fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
Verify that both methods give the same result:
In [282]: np.allclose(c, fc)
Out[282]: True
And as you point out, using the FFT is much faster. For this example, it is about 1000 times faster:
In [283]: %timeit c = correlate2d(a, a, boundary='wrap', mode='same')
1 loop, best of 3: 3.2 s per loop
In [284]: %timeit fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
100 loops, best of 3: 3.19 ms per loop
And that includes the duplicated computation of fft2(a). Of course, fft2(a) should only be computed once:
In [285]: fta = fft2(a)
In [286]: fc = np.roll(ifft2(fta.conj()*fta).real, (a.shape[0] - 1)//2, axis=(0,1))
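If a is not square, the same idea should work with a per-axis shift, since the centering is independent along each axis (a sketch worth verifying against correlate2d for your shapes):
In [287]: b = np.random.randn(150, 200)
In [288]: ftb = fft2(b)
In [289]: fcb = np.roll(ifft2(ftb.conj()*ftb).real,
     ...:               ((b.shape[0] - 1)//2, (b.shape[1] - 1)//2), axis=(0, 1))
In [290]: np.allclose(correlate2d(b, b, boundary='wrap', mode='same'), fcb)  # should be True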
I have a large two-dimensional ndarray A and I want to compute its SVD, retrieving the largest singular value and the associated pair of singular vectors. Looking at the NumPy docs it seems that NumPy can compute the complete SVD only (numpy.linalg.svd), while SciPy has a method that does exactly what I need (scipy.sparse.linalg.svds), but for sparse matrices, and I don't want to convert A, since that would require additional computational time.
Until now I have used SciPy's svds directly on the dense array A; however, the documentation discourages passing ndarrays to these methods.
Is there a way to perform this task with a method that accepts ndarray objects?
If svds works with your dense A array, then continue to use it. You don't need to convert it to anything. svds does all the adaptation that it needs.
Its documentation says
A : {sparse matrix, LinearOperator}
Array to compute the SVD on, of shape (M, N)
But what is a LinearOperator? It is a wrapper around something that can perform a matrix product. For a dense array A.dot qualifies.
Look at the code for svds. The first thing it does is A = np.asarray(A) if A isn't already a LinearOperator or sparse matrix. Then it grabs A.dot and the .dot of A's Hermitian transpose and makes a new LinearOperator.
There's nothing special about a sparse matrix in this function. All that matters is having a compatible matrix product.
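For example, to get just the largest singular value and its vectors from a dense array, something along these lines should work (a small sketch):
import numpy as np
from scipy.sparse.linalg import svds

A_dense = np.random.randn(1000, 500)
u, s, vt = svds(A_dense, k=1)        # k=1 keeps only the largest singular triple
sigma_max, v_max = s[0], vt[0]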
Look at these times:
In [357]: import numpy as np; from scipy import sparse; import scipy.sparse.linalg as splg
In [358]: A = np.eye(10)
In [359]: Alg=splg.aslinearoperator(A)
In [360]: Am=sparse.csr_matrix(A)
In [361]: timeit splg.svds(A)
1000 loops, best of 3: 541 µs per loop
In [362]: timeit splg.svds(Alg)
1000 loops, best of 3: 964 µs per loop
In [363]: timeit splg.svds(Am)
1000 loops, best of 3: 939 µs per loop
Direct use of A is fastest. The conversions don't help, even when they are outside of the timing loop.
I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, int(1e6)) + 1j*np.random.rand(8, 8, int(1e6))
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython
np.import_array()
DTYPE = np.complex64
@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within NumPy/Python to do this? I don't yet know where the bottleneck in my code is, i.e. whether it is CPU-bound or memory-bound, or something else entirely.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it computes the singular values of each of the N matrices and takes the maximum for each.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time is spent in svd, which produces an (N, 8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses np.linalg._umath_linalg functions, which live in a compiled .so file.
The C code is at https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop; any looping is in that C code or in the LAPACK functions it calls.
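As a side note, if you want to see which BLAS/LAPACK build your NumPy is linked against, you can check with:
import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries NumPy was built with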
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np
def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A
N = 100000
k = 8
A = random_symmetric(N, k)
norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
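The same check should hold for complex Hermitian matrices, which is closer to the setup in the question (a sketch along the same lines):
B = np.random.randn(N, k, k) + 1j*np.random.randn(N, k, k)
B += B.conj().transpose(0, 2, 1)                  # make each k x k block Hermitian
norm1c = np.abs(np.linalg.eigvalsh(B)).max(1)
norm2c = np.linalg.norm(B, ord=2, axis=(1, 2))
print(np.allclose(norm1c, norm2c))                # should also print True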
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop