Sparse matrix solver with preconditioner - python

I have a scipy.sparse.csc_matrix sparse matrix A of shape (N, N) where N is about 15000. A has less than 1 % non-zero elements.
I need to solve Ax = b as time-efficiently as possible.
Using scipy.sparse.linalg.spsolve takes about 350 ms using scikit-umfpack.
scipy.sparse.linalg.gmres is significantly faster at about 50 ms when using an ILU preconditioner. Without a preconditioner it takes more than a minute.
However, creating the preconditioner takes about 1.5 s. Given that, it would be more efficient to just use scipy.sparse.linalg.spsolve.
I'm creating the preconditioner M with
from scipy.sparse.linalg import LinearOperator, spilu
ilu = spilu(A)
Mx = lambda x: ilu.solve(x)
M = LinearOperator((N, N), Mx)
Is there a more efficient way of doing this, such that using scipy.sparse.linalg.gmres would be more profitable?
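One knob worth trying before giving up on GMRES: spilu accepts drop_tol and fill_factor arguments, and a coarser incomplete factorization is usually much cheaper to build, at the cost of a weaker preconditioner. A minimal sketch, assuming A, b and N as in the question (the values below are illustrative, not tuned):
from scipy.sparse.linalg import LinearOperator, gmres, spilu
# larger drop_tol / smaller fill_factor -> cheaper, more approximate ILU
ilu = spilu(A, drop_tol=1e-3, fill_factor=5)
M = LinearOperator((N, N), ilu.solve)
x, info = gmres(A, b, M=M)  # info == 0 indicates convergence
How coarse the factorization can be before GMRES convergence degrades is problem-dependent, so the setup cost and the iteration count have to be balanced against each other.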

Related

Can I use the numpy normal distribution sampler to efficiently create a sample using few or no loops?

I would like to generate a sample that follows a normal distribution from M source values each with a standard deviation, with N samples per source value. Can this be done efficiently with numpy arrays?
My desired output is an MxN array. I expected this pseudocode to work, but it fails with an error:
import numpy as np
# initial data
M = 100
x = np.arange(M)
y = x**2
y_err = y * 0.1
# sample the data N times per datapoint
N = 1000
N_samples = np.random.normal(loc=y, scale=y_err, size=N)
Running this yields a broadcasting error since N and M are not the same:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I can imagine solutions that use loops, but is there a better/faster method that minimizes the use of loops? For example, many numpy functions are vectorized so I would expect there to be some numpy method that would be faster or at least avoid the use of loops.
I was able to create two methods: one that uses loops, and one that uses numpy functions. However, the numpy method is slower for large arrays, so I am curious as to why this is and whether there is an alternative method.
Method one: loop through each of the M source values and sample N points from that value, and proceed through the whole dataset so that the numpy sampler is used M times:
# initialize the sample array
y_sampled = np.zeros([M, N])
# draw N samples for each of the M source values, one sampler call per value
for i in range(M):
    y_sampled[i] = np.random.normal(loc=y[i], scale=y_err[i], size=N)
Method two: use numpy's vectorized methods on an adjusted dataset, wherein the source data is duplicated to be an MxN array, on which the numpy sampler is applied once
# duplicate the source data and error arrays horizontally N times
y_dup = np.repeat(np.vstack(y), N, axis=1)
y_err_dup = np.repeat(np.vstack(y_err), N, axis=1)
# apply the numpy sampler once on the entire 2D array
y_sampled = np.random.normal(loc=y_dup, scale=y_err_dup, size=(M, N))
I expected the second method to be faster since the sampler is applied only once, albeit on a 2D array. The walltime is similar for small arrays (M = 100) but different by a factor of ~2x for larger arrays (M = 1E5). Timing:
M = 100, N = 1000
Time used by loop method: 0.0156 seconds
Time used by numpy resize/duplicating method: 0.0199 seconds

M = 100000, N = 1000
Time used by loop method: 3.9298 seconds
Time used by numpy resize/duplicating method: 7.3371 seconds
I would expect there to be a built-in method to sample N times, instead of duplicating the dataset N times, but these methods work.
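There is in fact a built-in way: np.random.normal broadcasts loc and scale against the requested output shape, so the duplication can be avoided entirely by asking for shape (N, M) and transposing. A minimal sketch using the same y and y_err as above:
# loc and scale of shape (M,) broadcast against size=(N, M); row i of the
# transposed result holds N samples drawn around y[i] with spread y_err[i]
y_sampled = np.random.normal(loc=y, scale=y_err, size=(N, M)).T  # shape (M, N)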

Using cupy to create a distance matrix from another matrix on GPU

I have written numpy code that takes an array of size (m x n), where the m rows are individual observations with n features each, and creates a square distance matrix of size (m x m). Each entry of this matrix is the distance of one observation from another, e.g. row 0, column 9 is the distance between observation 0 and observation 9.
import numpy as np
#import cupy as np

def l1_distance(arr):
    return np.linalg.norm(arr, 1)

X = np.random.randint(low=0, high=255, size=(700, 4096))
distance = np.empty((700, 700))
for i in range(700):
    for j in range(700):
        distance[i, j] = l1_distance(X[i, :] - X[j, :])
I attempted this on the GPU using cupy by uncommenting the second import statement, but obviously the double for loop is drastically inefficient: numpy takes about 6 seconds, while cupy takes 26 seconds. I understand why, but it's not immediately clear to me how to parallelize this process.
I know I'm going to need to write a reduction kernel of some sort, but I can't think of how to construct one cupy array from iterative operations on elements of another array.
Using broadcasting, CuPy takes 0.10 seconds on an A100 GPU, compared to 6.6 seconds for NumPy:
for i in range(700):
    distance[i, :] = np.abs(np.broadcast_to(X[i, :], X.shape) - X).sum(axis=1)
This vectorizes the inner loop, computing the distances from one vector to all others in parallel.
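For completeness, the same row-wise broadcasting written explicitly with cupy (a sketch assuming a CUDA-capable GPU with cupy installed; X is the array from the question):
import cupy as cp
X_gpu = cp.asarray(X)                    # move the data to the GPU once
distance_gpu = cp.empty((700, 700))
for i in range(700):
    # |X[i] - X[j]| summed over the feature axis is the L1 distance to every row j
    distance_gpu[i, :] = cp.abs(X_gpu[i, :] - X_gpu).sum(axis=1)
distance = cp.asnumpy(distance_gpu)      # copy the result back to the host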

Matrix (scipy sparse) - Matrix (dense; numpy array) multiplication efficiency

I am a researcher working on geophysical inversion, which requires solving the linear system Au = rhs. Here A is often a sparse matrix, while rhs and u can be either dense matrices or vectors. Gradient-based inversion needs a sensitivity computation, which in turn requires a number of matrix-matrix and matrix-vector multiplications. Recently I found some odd behaviour in matrix (sparse) - matrix (dense) multiplication; below is an example:
import numpy as np
import scipy.sparse as sp
n = int(1e6)
m = int(100)
e = np.ones(n)
A = sp.spdiags(np.vstack((e, e, e)), np.array([-1, 0, 1]), n, n)
A = A.tocsr()
u = np.random.randn(n,m)
%timeit rhs = A*u[:,0]
#10 loops, best of 3: 22 ms per loop
%timeit rhs = A*u[:,:10]
#10 loops, best of 3: 98.4 ms per loop
%timeit rhs = A*u
#1 loop, best of 3: 570 ms per loop
I was expecting an almost linear increase in computation time as I increase the number of columns of the dense matrix u multiplied by the sparse matrix A (e.g. the second case, A*u[:,:10], should take about 220 ms and the final case, A*u, about 2.2 s). However, it is much faster than I expected; put the other way around, matrix-vector multiplication is much slower, per column, than matrix-matrix multiplication. Can someone explain why? Further, is there an effective way to boost matrix-vector multiplication to a similar level of efficiency as matrix-matrix multiplication?
If you look at the source code, you can see that csr_matvec (which implements matrix-vector multiplication) is implemented as a straightforward sum loop in C code, while csr_matvecs (which implements matrix-matrix multiplication) is implemented as a call to the axpy BLAS routine. Depending on what BLAS library your installation is linked to, such a call can be far more efficient than the straightforward C implementation used for matrix-vector multiplication. That's likely why you're seeing matrix-vector multiplication be so slow.
Changing scipy so that it calls BLAS in the matrix-vector case could be a useful contribution to the package.
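If the matrix-matrix path is indeed faster on your installation, a cheap experiment is to keep the right-hand side two-dimensional (shape (n, 1) instead of (n,)) so the multiplication is routed through the multi-vector kernel; a sketch using the A and u from the question:
v1 = A * u[:, 0]     # 1-D right-hand side: matrix-vector kernel
v2 = A * u[:, [0]]   # (n, 1) right-hand side: matrix-matrix kernel
print(np.allclose(v1, v2.ravel()))  # same result, possibly different timing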

Speeding up nested loops in python

How can I speed up this code in python?
while norm_corr > corr_len:
    correlation = 0.0
    for i in xrange(6):
        for j in xrange(6):
            correlation += (p[i] * T_n[j][i]) * ((F[j] - Fbar) * (F[i] - Fbar))
    Integral += correlation
    T_n = np.mat(T_n) * np.mat(TT)
    T_n = T_n.tolist()
    norm_corr = correlation / variance
Here, TT is a fixed 6x6 matrix, p is a fixed 1x6 matrix, and F is a fixed 1x6 matrix. T_n is the nth power of TT.
This while loop might be repeated about 10^4 times.
The way to do these things quickly is to use Numpy's built-in functions and operators to perform the operations. Numpy is implemented internally with optimized C code and if you set up your computation properly, it will run much faster.
But leveraging Numpy effectively can sometimes be tricky. It's called "vectorizing" your code - you have to figure out how to express it in a way that acts on whole arrays, rather than with explicit loops.
For example, in your loop you have p[i] * T_n[j][i], which IMHO can be done with a vector-by-matrix multiplication: if v is 1x6 and m is 6x6, then v.dot(m) is a 1x6 vector containing the dot products of v with the columns of m. You can use transposes and reshapes to work along different dimensions if necessary.
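Concretely, with d = F - Fbar the double sum is sum over i, j of p[i] * T_n[j][i] * d[j] * d[i], which collapses to two dot products. A sketch of the whole loop using plain arrays and dot in place of np.mat, assuming p, F, Fbar, TT, T_n, variance, corr_len, norm_corr and Integral are set up as in the question:
import numpy as np
p, F, TT, T_n = np.asarray(p), np.asarray(F), np.asarray(TT), np.asarray(T_n)
d = F - Fbar                               # the (F[i] - Fbar) deviations
while norm_corr > corr_len:
    correlation = d.dot(T_n).dot(p * d)    # equal to the original double loop
    Integral += correlation
    T_n = T_n.dot(TT)                      # next power of TT
    norm_corr = correlation / variance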

Computing the spectral norms of ~1m Hermitian matrices: `numpy.linalg.norm` is too slow

I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, int(1e6)) + 1j*np.random.rand(8, 8, int(1e6))
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython

np.import_array()
DTYPE = np.complex64

@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks, and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck is in my code, i.e. whether it is CPU or memory-bound, or perhaps something different.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it is finding the maximum of the matrix singular values over the N dimension.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time is spent in svd, which produces an (N, 8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses the np.linalg._umath_linalg functions, which live in a compiled .so file.
The C code is at https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop; any looping is in that C code or the LAPACK functions it calls.
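As for the chunking idea mentioned in the question, a minimal multiprocessing sketch is below; whether it pays off depends on how much of the time is spent in LAPACK versus the cost of shipping the chunks to worker processes:
import numpy as np
from multiprocessing import Pool

def chunk_norms(chunk):
    # spectral norm of every 8x8 matrix in a (chunk_size, 8, 8) block
    return np.linalg.norm(chunk, ord=2, axis=(1, 2))

if __name__ == '__main__':
    N = int(1e5)
    A = np.random.rand(N, 8, 8) + 1j * np.random.rand(N, 8, 8)
    with Pool(4) as pool:
        norms = np.concatenate(pool.map(chunk_norms, np.array_split(A, 16)))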
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values are equal to the maximum of their absolute eigenvalues:
import numpy as np

def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A

N = 100000
k = 8
A = random_symmetric(N, k)

norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
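Applied back to the question's (8, 8, N) complex array, the same trick only needs the matrix axes moved to the end, since eigvalsh works on the last two axes. Note that the question's random example matrices are not actually Hermitian, so the sketch below symmetrizes them first; with genuinely Hermitian data that step is unnecessary:
import numpy as np
N = int(1e6)
array = np.random.rand(8, 8, N) + 1j * np.random.rand(8, 8, N)
array = (array + array.conj().transpose(1, 0, 2)) / 2      # make each 8x8 slice Hermitian
stack = np.moveaxis(array, -1, 0)                          # shape (N, 8, 8), as eigvalsh expects
norms = np.abs(np.linalg.eigvalsh(stack)).max(axis=-1)     # spectral norm of each matrix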
