I have a large two-dimensional ndarray A and I want to compute its SVD, retrieving the largest singular value and the associated pair of singular vectors. Looking at the NumPy docs, it seems that NumPy can only compute the complete SVD (numpy.linalg.svd), while SciPy has a method that does exactly what I need (scipy.sparse.linalg.svds), but for sparse matrices, and I don't want to convert A to a sparse representation, since that would require additional computation time.
Until now I have used SciPy's svds directly on the dense A, but the documentation discourages passing ndarrays to these methods.
Is there a way to perform this task with a method that accepts ndarray objects?
If svds works with your dense A array, then continue to use it. You don't need to convert it to anything. svds does all the adaptation that it needs.
Its documentation says:
A : {sparse matrix, LinearOperator}
Array to compute the SVD on, of shape (M, N)
But what is a LinearOperator? It is a wrapper around something that can perform a matrix product. For a dense array A.dot qualifies.
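For instance, here is a minimal sketch of wrapping a dense array by hand (the names are just for illustration):

import numpy as np
from scipy.sparse.linalg import LinearOperator

A = np.random.randn(1000, 300)   # dense 2D ndarray

# A LinearOperator only needs to know how to multiply by A (and by its
# conjugate transpose); for a dense array, A.dot already provides that.
Aop = LinearOperator(A.shape, matvec=A.dot, rmatvec=A.conj().T.dot)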
Look at the code for svds. The first thing it does is A = np.asarray(A), if A isn't already a LinearOperator or sparse matrix. Then it grabs A.dot and the .dot of A's Hermitian transpose, and makes a new LinearOperator.
There's nothing special about a sparse matrix in this function. All that matters is having a compatible matrix product.
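So for the question as asked, a minimal sketch looks like this (the only parameter I'm adding is k=1, to request just the largest singular value and its vectors):

import numpy as np
from scipy.sparse.linalg import svds

A = np.random.randn(1000, 300)    # stand-in for the large dense array

u, s, vt = svds(A, k=1)           # k=1: only the largest singular triplet
sigma_max = s[0]                  # largest singular value
u1, v1 = u[:, 0], vt[0]           # associated left and right singular vectors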
Look at these times:
In [358]: A=np.eye(10)
In [359]: Alg=splg.aslinearoperator(A)
In [360]: Am=sparse.csr_matrix(A)
In [361]: timeit splg.svds(A)
1000 loops, best of 3: 541 µs per loop
In [362]: timeit splg.svds(Alg)
1000 loops, best of 3: 964 µs per loop
In [363]: timeit splg.svds(Am)
1000 loops, best of 3: 939 µs per loop
Direct use of A is fastest. The conversions don't help, even when they are outside of the timing loop.
Related
I have a set of 2D arrays for which I have to compute the 2D correlation. I have tried many different things (even programming it in Fortran), but I think the fastest way will be to calculate it using an FFT.
Based on my tests and on this answer I can use scipy.signal.fftconvolve and it works fine if I'm trying to reproduce the output of scipy.signal.correlate2d with boundary='fill'. So basically this
scipy.signal.fftconvolve(a, a[::-1, ::-1], mode='same')
is equal to this (with the exception of a slight shift)
scipy.signal.correlate2d(a, a, boundary='fill', mode='same')
The thing is that the correlation should be computed in wrapped mode, since the arrays are 2D periodic (i.e., boundary='wrap'). So if I'm trying to reproduce the output of
scipy.signal.correlate2d(a, a, boundary='wrap', mode='same')
I can't, or at least I don't see how to do it. (And I want to use the FFT method, since it's way faster.)
Apparently SciPy used to have something that might have done the trick, but it seems to have been left behind and I can't find it, so I think SciPy may have dropped support for it.
Anyway, is there a way to use SciPy's or NumPy's FFT routines to calculate this correlation for periodic arrays?
The wrapped correlation can be implemented using the FFT. Here's some code to demonstrate how:
In [276]: import numpy as np
In [277]: from scipy.signal import correlate2d
Create a random array a to work with:
In [278]: a = np.random.randn(200, 200)
Compute the 2D correlation using scipy.signal.correlate2d:
In [279]: c = correlate2d(a, a, boundary='wrap', mode='same')
Now compute the same result, using the 2D FFT functions from numpy.fft. (This code assumes a is square.)
In [280]: from numpy.fft import fft2, ifft2
In [281]: fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
Verify that both methods give the same result:
In [282]: np.allclose(c, fc)
Out[282]: True
And as you point out, using the FFT is much faster. For this example, it is about 1000 times faster:
In [283]: %timeit c = correlate2d(a, a, boundary='wrap', mode='same')
1 loop, best of 3: 3.2 s per loop
In [284]: %timeit fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
100 loops, best of 3: 3.19 ms per loop
And that includes the duplicated computation of fft2(a). Of course, fft2(a) should only be computed once:
In [285]: fta = fft2(a)
In [286]: fc = np.roll(ifft2(fta.conj()*fta).real, (a.shape[0] - 1)//2, axis=(0,1))
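Packaged as a small helper (the function name is mine, and I'm assuming the (size - 1)//2 roll offset also holds per axis for non-square arrays; worth checking against correlate2d for your shapes):

import numpy as np
from numpy.fft import fft2, ifft2

def periodic_autocorr2d(a):
    # Circular (wrapped) autocorrelation of a real 2D array, rolled to line up
    # with scipy.signal.correlate2d(a, a, boundary='wrap', mode='same').
    fta = fft2(a)
    cf = ifft2(fta.conj() * fta).real
    shifts = tuple((s - 1) // 2 for s in a.shape)  # assumed per-axis offset
    return np.roll(cf, shifts, axis=(0, 1))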
I am a researcher working on geophysical inversion, which often requires solving a linear system Au = rhs. Here A is usually a sparse matrix, while rhs and u can be either dense matrices or vectors. Gradient-based inversion needs sensitivity computations, which require a number of matrix-matrix and matrix-vector multiplications. Recently I found some unexpected behaviour in sparse-matrix times dense-matrix multiplication; below is an example:
import numpy as np
import scipy.sparse as sp
n = int(1e6)
m = int(100)
e = np.ones(n)
A = sp.spdiags(np.vstack((e, e, e)), np.array([-1, 0, 1]), n, n)
A = A.tocsr()
u = np.random.randn(n,m)
%timeit rhs = A*u[:,0]
#10 loops, best of 3: 22 ms per loop
%timeit rhs = A*u[:,:10]
#10 loops, best of 3: 98.4 ms per loop
%timeit rhs = A*u
#1 loop, best of 3: 570 ms per loop
I was expecting an almost linear increase in computation time as I increase the number of columns of the dense matrix u multiplied by the sparse matrix A (e.g. the second one, A*u[:,:10], should take about 220 ms and the final one, A*u, about 2.2 s). However, it is much faster than I expected. Conversely, matrix-vector multiplication is much slower per column than matrix-matrix multiplication. Can someone explain why? Further, is there an effective way to boost matrix-vector multiplication to a similar level of efficiency as matrix-matrix multiplication?
If you look at the source code, you can see that csr_matvec (which implements matrix-vector multiplication) is implemented as a straightforward sum loop in C code, while csr_matvecs (which implements matrix-matrix multiplication) is implemented as a call to the axpy BLAS routine. Depending on which BLAS library your installation is linked to, such a call can be far more efficient than the straightforward C implementation used for matrix-vector multiplication. That's likely why matrix-vector multiplication looks comparatively slow here.
Changing scipy so that it calls BLAS in the matrix-vector case could be a useful contribution to the package.
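In the meantime, a practical workaround when you have many right-hand sides is to multiply them as one dense block, so the work goes through the faster matrix-matrix kernel; here is a sketch reusing the setup from the question:

import numpy as np
import scipy.sparse as sp

n, m = int(1e6), 100
e = np.ones(n)
A = sp.spdiags(np.vstack((e, e, e)), np.array([-1, 0, 1]), n, n).tocsr()
u = np.random.randn(n, m)

cols = [A * u[:, j] for j in range(m)]   # m calls to the matrix-vector kernel
block = A * u                            # one call to the matrix-matrix kernel
# block[:, j] matches cols[j]; the block form is what the timings above favour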
I've got some big datasets to which I'd like to fit monoexponential time decays.
The data consists of multiple 4D datasets, acquired at different times, and the fit should thus run along a 5th dimension (through datasets).
The code I'm currently using is the following:
import numpy as np
import scipy.optimize as opt
[... load 4D datasets ....]
data = (dataset1, dataset2, dataset3)
times = (10, 20, 30)
def monoexponential(t, M0, t_const):
    return M0*np.exp(-t/t_const)
# Starting guesses to initiate descent.
M0_init = 80.0
t_const_init = 50.0
init_guess = (M0_init, t_const_init)
def fit(vector):
    try:
        nlfit, nlpcov = opt.curve_fit(monoexponential, times, vector,
                                      p0=init_guess,
                                      sigma=None,
                                      check_finite=False,
                                      maxfev=100, ftol=0.5, xtol=1,
                                      bounds=([0, 0], [2000, 800]))  # (lower, upper) for (M0, t_const)
        M0, t_const = nlfit
    except:
        t_const = 0
    return t_const
# Concatenate datasets in data into a single 5D array.
concat5D = np.concatenate([block[..., np.newaxis] for block in data],
                          axis=len(data[0].shape))
# And apply the curve fitting along the last dimension.
decay_map = np.apply_along_axis(fit, len(concat5D.shape) - 1, concat5D)
The code works fine, but takes forever (e.g., for dataset1.shape == (100,100,50,500)). I've read some other topics mentioning that apply_along_axis is very slow, so I'm guessing that's the culprit. Unfortunately, I don't really know what could be used as an alternative here (except maybe an explicit for loop?).
Does anyone have an idea of what I can do to avoid apply_along_axis and speed up curve_fit being called multiple times?
So you are applying a fit operation 100*100*50*500 times, to a 1d array (of 3 values in the example, more in real life?)?
apply_along_axis just iterates over all the dimensions of the input array except one. There is no compilation, and no applying the fit over multiple axes at once.
Without apply_along_axis, the easiest approach is to reshape the array into a 2d one, compressing (100,100,50,500) into a single (100*100*50*500,) dimension, iterating on that, and then reshaping the result.
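Using the names from the question, that reshape-and-iterate version would look roughly like this (a sketch, not a tested drop-in replacement):

# Collapse all spatial dimensions into one, keeping the time axis last
flat = concat5D.reshape(-1, concat5D.shape[-1])

# Plain Python loop over voxels, then restore the spatial shape
decay_map = np.array([fit(vec) for vec in flat]).reshape(concat5D.shape[:-1])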
I was thinking that concatenating the datasets on a last axis might be slower than doing so on the first, but timings suggest otherwise.
np.stack is a newer version of concatenate that makes it easy to add the new axis anywhere.
In [319]: x=np.ones((2,3,4,5),int)
In [320]: d=[x,x,x,x,x,x]
In [321]: np.stack(d,axis=0).shape # same as np.array(d)
Out[321]: (6, 2, 3, 4, 5)
In [322]: np.stack(d,axis=-1).shape
Out[322]: (2, 3, 4, 5, 6)
for a larger list (with a trivial sum function):
In [295]: d1=[x]*1000 # make a big list
In [296]: timeit np.apply_along_axis(sum,-1,np.stack(d1,-1)).shape
10 loops, best of 3: 39.7 ms per loop
In [297]: timeit np.apply_along_axis(sum,0,np.stack(d1,0)).shape
10 loops, best of 3: 39.2 ms per loop
an explicit loop over the reshaped array times about the same:
In [312]: %%timeit
.....: d2=np.stack(d1,-1)
.....: d2=d2.reshape(-1,1000)
.....: res=np.stack([sum(i) for i in d2],0).reshape(d1[0].shape)
.....:
10 loops, best of 3: 39.1 ms per loop
But a function like sum can work on the whole array at once, and do so much faster:
In [315]: timeit np.stack(d1,-1).sum(-1).shape
100 loops, best of 3: 3.52 ms per loop
So changing the stacking and iteration methods doesn't make much difference in speed. But changing the fit so it can work over more than one dimension can be a big help. I don't know enough about optimize.curve_fit to know if that is possible.
====================
I just dug into the code for apply_along_axis. It basically constructs an index that looks like ind=(0,1,slice(None),2,1), does func(arr[ind]), and then increments it, sort of like long arithmetic with carry. So it is just systematically stepping through all elements, while keeping one axis as a full : slice.
In this particular case, where you're fitting a single exponential, you're likely better off taking the log of your data. Then the fit becomes linear, which is much faster than nonlinear least squares and can likely be vectorized, since it becomes pretty much a linear algebra problem.
(And of course, if you have an idea of how to improve least_squares, that might be appreciated by the scipy devs.)
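To make the log idea concrete, here is a sketch of a fully vectorized log-linear fit (it assumes the data are strictly positive, and note that taking the log reweights the noise, so this is an approximation to the nonlinear fit rather than an exact replacement):

import numpy as np

# times and concat5D as in the question; small toy stand-ins shown here
times = np.array([10.0, 20.0, 30.0])
concat5D = 80.0 * np.exp(-times / 50.0) + 0.1 * np.random.rand(10, 10, 5, 50, 3)

# log(M0 * exp(-t/t_const)) = log(M0) - t/t_const, a straight line in t
y = np.log(concat5D.reshape(-1, len(times)).T)   # shape (n_times, n_voxels)

# np.polyfit accepts a 2D y, so every voxel's line is fitted in one call
slope, intercept = np.polyfit(times, y, 1)

t_const = -1.0 / slope
M0 = np.exp(intercept)
decay_map = t_const.reshape(concat5D.shape[:-1])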
I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, 10**6) + 1j*np.random.rand(8, 8, 10**6)
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython
np.import_array()
DTYPE = np.complex64
@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck is in my code, i.e. whether it is CPU-bound or memory-bound, or something else entirely.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it is finding the maximum of the singular values of each of the N matrices.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time is spent in svd, which produces an (N,8) array of singular values:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses np.linalg._umath_linalg functions - that is a compiled .so file.
The C code is at https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop. Any looping is in that C code, or in the LAPACK functions it calls.
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np
def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A
N = 100000
k = 8
A = random_symmetric(N, k)
norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
I was just trying to figure out why my program is so slow and found the following result.
In [11]: n = 1000000
In [12]: x = randn(n)
In [13]: %timeit norm(x)
100 loops, best of 3: 2.25 ms per loop
In [14]: %timeit (x.dot(x))**0.5
1000 loops, best of 3: 387 µs per loop
I know the norm function contains many if/else branches to detect the input and select the right norm, but I am still wondering about this big difference, especially when it is called in a loop.
Is this normal in numpy?
Another example is computing the eigenvalues and eigenvectors of a 10000x10000 matrix randomly generated with randn. Matlab computes the result in several minutes, but NumPy took a very, very long time and I finally killed the process with Ctrl+C. Both use their respective eig functions.