Curve_fit to apply_along_axis. How to speed it up? - python

I've got some big datasets to which I'd like to fit monoexponential time decays.
The data consists of multiple 4D datasets, acquired at different times, and the fit should thus run along a 5th dimension (through datasets).
The code I'm currently using is the following:
import numpy as np
import scipy.optimize as opt

[... load 4D datasets ....]
data = (dataset1, dataset2, dataset3)
times = (10, 20, 30)

def monoexponential(t, M0, t_const):
    return M0*np.exp(-t/t_const)

# Starting guesses to initiate descent.
M0_init = 80.0
t_const_init = 50.0
init_guess = (M0_init, t_const_init)
def fit(vector):
    try:
        nlfit, nlpcov = opt.curve_fit(monoexponential, times, vector,
                                      p0=init_guess,
                                      sigma=None,
                                      check_finite=False,
                                      maxfev=100, ftol=0.5, xtol=1,
                                      # (lower bounds, upper bounds) for (M0, t_const)
                                      bounds=([0, 0], [2000, 800]))
        M0, t_const = nlfit
    except Exception:
        t_const = 0
    return t_const
# Concatenate datasets in data into a single 5D array.
concat5D = np.concatenate([block[..., np.newaxis] for block in data],
                          axis=len(data[0].shape))

# And apply the curve fitting along the last dimension.
decay_map = np.apply_along_axis(fit, len(concat5D.shape) - 1, concat5D)
The code works fine, but takes forever (e.g., for dataset1.shape == (100,100,50,500)). I've read some other topics mentioning that apply_along_axis is very slow, so I'm guessing that's the culprit. Unfortunately, I don't really know what could be used as an alternative here (except maybe an explicit for loop?).
Does anyone have an idea of what I can do to avoid apply_along_axis and speed up curve_fit being called multiple times?

So you are applying a fit operation 100*100*50*500 times, to a 1d array (of 3 values in the example, more in real life?)?
apply_along_axis iterates over all the dimensions of the input array except one. There's no compiled code involved, and no way to do this fit over multiple axes at once.
Without apply_along_axis, the easiest approach is to reshape the array into a 2d one, compressing (100,100,50,500) into one (250...,) dimension, iterating on that, and then reshaping the result. For example:
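A minimal sketch of that reshape-and-iterate approach (reusing fit and concat5D from the question; the names of the intermediates are mine):
flat = concat5D.reshape(-1, concat5D.shape[-1])        # (n_voxels, n_times)
t_consts = np.array([fit(v) for v in flat])            # one scalar per voxel
decay_map = t_consts.reshape(concat5D.shape[:-1])      # back to the 4D shape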
I was thinking that concatenating the datasets on the last axis might be slower than doing so on the first, but the timings suggest otherwise.
np.stack is a newer version of concatenate that makes it easy to add a new axis anywhere.
In [319]: x=np.ones((2,3,4,5),int)
In [320]: d=[x,x,x,x,x,x]
In [321]: np.stack(d,axis=0).shape # same as np.array(d)
Out[321]: (6, 2, 3, 4, 5)
In [322]: np.stack(d,axis=-1).shape
Out[322]: (2, 3, 4, 5, 6)
for a larger list (with a trivial sum function):
In [295]: d1=[x]*1000 # make a big list
In [296]: timeit np.apply_along_axis(sum,-1,np.stack(d1,-1)).shape
10 loops, best of 3: 39.7 ms per loop
In [297]: timeit np.apply_along_axis(sum,0,np.stack(d1,0)).shape
10 loops, best of 3: 39.2 ms per loop
An explicit loop using an array reshape times about the same:
In [312]: %%timeit
.....: d2=np.stack(d1,-1)
.....: d2=d2.reshape(-1,1000)
.....: res=np.stack([sum(i) for i in d2],0).reshape(d1[0].shape)
.....:
10 loops, best of 3: 39.1 ms per loop
But a function like sum can work on the whole array at once, and does so much faster:
In [315]: timeit np.stack(d1,-1).sum(-1).shape
100 loops, best of 3: 3.52 ms per loop
So changing the stacking and iteration methods doesn't make much difference in speed. But changing the 'fit' so it can work over more than one dimension can be a big help. I don't know enough about optimize.curve_fit to know if that is possible.
====================
I just dug into the code for apply_along_axis. It basically constructs an index that looks like ind=(0,1,slice(None),2,1), does func(arr[ind]), and then increments it, sort of like long arithmetic with a carry. So it is just systematically stepping through all the elements, while keeping one axis as a : slice.
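Roughly equivalent, as a sketch (not the actual NumPy source), for a function func applied along the last axis of arr and returning a scalar:
out = np.empty(arr.shape[:-1])
for idx in np.ndindex(*arr.shape[:-1]):          # every index combination of the other axes
    out[idx] = func(arr[idx + (slice(None),)])   # last axis kept as a full slice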

In this particular case, where you're fitting a single exponential, you're likely better off taking the log of your data. The fit then becomes linear, which is much faster than nonlinear least squares, and it can likely be vectorized since it turns into essentially a linear algebra problem.
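A minimal sketch of that idea, assuming all values in concat5D are positive and reusing times and concat5D from the question (note that fitting in log space weights the errors differently from the nonlinear fit):
flat = concat5D.reshape(-1, len(times))               # (n_voxels, n_times)
logy = np.log(flat)
# np.polyfit accepts a 2-D y and fits one line per column, so all voxels at once:
slope, intercept = np.polyfit(np.asarray(times, dtype=float), logy.T, 1)
t_const = -1.0 / slope                                # since log y = log M0 - t/t_const
M0 = np.exp(intercept)
decay_map = t_const.reshape(concat5D.shape[:-1])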
(And of course, if you have an idea of how to improve least_squares, that might be appreciated by the scipy devs.)

Related

K-Means: assign clusters to new data points

I've implemented a k-means clustering algorithm in Python, and now I want to label new data with the clusters I got from my algorithm. My approach is to iterate through every data point and every centroid to find the minimum distance and the centroid associated with it. But I wonder if there are simpler or shorter ways to do it.
def assign_cluster(clusterDict, data):
    clusterList = []
    label = []
    cen = list(clusterDict.values())
    for i in range(len(data)):
        for j in range(len(cen)):
            # if cen[j] has the minimum distance with data[i]
            # then clusterList[i] = cen[j]
Here clusterDict is a dictionary whose keys are the labels [0, 1, 2, ...] and whose values are the coordinates of the centroids.
Can someone help me implementing this?
This is a good use case for numba, because it lets you express this as a simple double loop without a big performance penalty, which in turn allows you to avoid the excessive extra memory of using np.tile to replicate the data across a third dimension just to do it in a vectorized manner.
Borrowing the standard vectorized numpy implementation from the other answer, I have these two implementations:
import numba
import numpy as np

def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape

    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])

    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)

@numba.jit
def kmeans_assignment2(centroids, points):
    P, C = points.shape[0], centroids.shape[0]
    distances = np.zeros((P, C), dtype=np.float32)
    for p in range(P):
        for c in range(C):
            distances[p, c] = np.sum(np.square(centroids[c] - points[p]))
    return np.argmin(distances, axis=1)
Then for some sample data, I did a few timing experiments:
In [12]: points = np.random.rand(10000, 50)
In [13]: centroids = np.random.rand(30, 50)
In [14]: %timeit kmeans_assignment(centroids, points)
196 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit kmeans_assignment2(centroids, points)
127 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I won't go so far as to say that the numba version is certainly faster than the np.tile version, but clearly it's very close while not incurring the extra memory cost of np.tile.
In fact, I noticed on my laptop that when I make the shapes larger and use (10000, 1000) for the shape of points and (200, 1000) for the shape of centroids, np.tile generated a MemoryError, while the numba function runs in under 5 seconds with no memory error.
Separately, I actually noticed a slowdown when using numba.jit on the first version (with np.tile), which is likely due to the extra array creation inside the jitted function combined with the fact that there's not much numba can optimize when you're already calling all vectorized functions.
And I also did not notice any significant improvement in the second version when trying to shorten the code by using broadcasting. E.g. shortening the double loop to be
for p in range(P):
    distances[p, :] = np.sum(np.square(centroids - points[p, :]), axis=1)
did not really help anything (and would use more memory when repeatedly broadcasting points[p, :] across all of centroids).
This is one of the really nice benefits of numba. You really can write the algorithm in a straightforward, loop-based way that matches standard descriptions of the algorithm, and it gives you finer control over how the code translates into memory consumption or broadcasting... all without giving up runtime performance.
An efficient way to perform the assignment phase is with a vectorized computation. This approach assumes that you start with two 2D arrays, points and centroids, with the same number of columns (the dimensionality of the space) but possibly different numbers of rows. Using tiling (np.tile) we can compute the distance matrix in one batch and then select the closest centroid for each point.
Here's the code:
def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape

    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])

    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)
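A hypothetical call, just for illustration (assuming numpy is imported as np):
points = np.random.rand(500, 2)                  # 500 points in 2-D
centroids = np.random.rand(3, 2)                 # 3 cluster centres
labels = kmeans_assignment(centroids, points)    # shape (500,), values in {0, 1, 2}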
See this GitHub gist for a complete runnable example.

Computing wrapped 2D correlation with fftconvolve

I have a set of 2D arrays for which I have to compute the 2D correlation. I have been trying many different things (even programming it in Fortran), but I think the fastest way will be calculating it using the FFT.
Based on my tests and on this answer I can use scipy.signal.fftconvolve and it works fine if I'm trying to reproduce the output of scipy.signal.correlate2d with boundary='fill'. So basically this
scipy.signal.fftconvolve(a, a[::-1, ::-1], mode='same')
is equal to this (with the exception of a slight shift)
scipy.signal.correlate2d(a, a, boundary='fill', mode='same')
The thing is that the arrays should be computed in wrapped mode, since they are 2D periodic arrays (i.e., boundary='wrap'). So if I'm trying to reproduce the output of
scipy.signal.correlate2d(a, a, boundary='wrap', mode='same')
I can't, or at least I don't see how to do it. (And I want to use the FFT method, since it's way faster.)
Apparently SciPy used to have something that might have done the trick, but it seems to have been left behind and I can't find it, so I think support for it was dropped.
Anyway, is there a way to use scipy's or numpy's FFT routines to calculate this correlation for periodic arrays?
The wrapped correlation can be implemented using the FFT. Here's some code to demonstrate how:
In [276]: import numpy as np
In [277]: from scipy.signal import correlate2d
Create a random array a to work with:
In [278]: a = np.random.randn(200, 200)
Compute the 2D correlation using scipy.signal.correlate2d:
In [279]: c = correlate2d(a, a, boundary='wrap', mode='same')
Now compute the same result, using the 2D FFT functions from numpy.fft. (This code assumes a is square.)
In [280]: from numpy.fft import fft2, ifft2
In [281]: fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
Verify that both methods give the same result:
In [282]: np.allclose(c, fc)
Out[282]: True
And as you point out, using the FFT is much faster. For this example, it is about 1000 times faster:
In [283]: %timeit c = correlate2d(a, a, boundary='wrap', mode='same')
1 loop, best of 3: 3.2 s per loop
In [284]: %timeit fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
100 loops, best of 3: 3.19 ms per loop
And that includes the duplicated computation of fft2(a). Of course, fft2(a) should only be computed once:
In [285]: fta = fft2(a)
In [286]: fc = np.roll(ifft2(fta.conj()*fta).real, (a.shape[0] - 1)//2, axis=(0,1))
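For reuse, the recipe above could be wrapped in a small helper (a sketch; it assumes a square 2-D real input, reuses the fft2/ifft2 imports from above, and matches correlate2d(a, a, boundary='wrap', mode='same') up to the shift handled by np.roll):
def wrapped_autocorrelate2d(a):
    fta = fft2(a)                     # compute the 2D FFT once
    shift = (a.shape[0] - 1) // 2     # recenter to match mode='same'
    return np.roll(ifft2(fta.conj() * fta).real, shift, axis=(0, 1))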

Largest singular value of a NumPy `ndarray`

I have a large two-dimensional ndarray A and I want to compute its SVD, retrieving the largest singular value and the associated singular vector pair. Looking at the NumPy docs, it seems that NumPy can only compute the complete SVD (numpy.linalg.svd), while SciPy has a method that does exactly what I need (scipy.sparse.linalg.svds), but for sparse matrices, and I don't want to convert A, since that would require additional computational time.
Until now I have used SciPy's svds directly on A; however, the documentation discourages passing ndarrays to these methods.
Is there a way to perform this task with a method that accepts ndarray objects?
If svds works with your dense A array, then continue to use it. You don't need to convert it to anything. svds does all the adaptation that it needs.
Its documentation says:
A : {sparse matrix, LinearOperator}
    Array to compute the SVD on, of shape (M, N)
But what is a LinearOperator? It is a wrapper around something that can perform a matrix product. For a dense array, A.dot qualifies.
Look at the code for svds. The first thing it does is A = np.asarray(A) if A isn't already a LinearOperator or sparse matrix. Then it grabs A.dot and (the Hermitian transpose of A).dot and makes a new LinearOperator.
There's nothing special about a sparse matrix in this function. All that matters is having a compatible matrix product.
Look at these times:
In [358]: A=np.eye(10)
In [359]: Alg=splg.aslinearoperator(A)
In [360]: Am=sparse.csr_matrix(A)
In [361]: timeit splg.svds(A)
1000 loops, best of 3: 541 µs per loop
In [362]: timeit splg.svds(Alg)
1000 loops, best of 3: 964 µs per loop
In [363]: timeit splg.svds(Am)
1000 loops, best of 3: 939 µs per loop
Direct use of A is fastest. The conversions don't help, even when they are outside of the timing loop.
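So for the original problem, a call along these lines should work directly on the dense array (a sketch; k=1 asks for only the largest singular value and its vectors, and splg is scipy.sparse.linalg as in the timings above):
u, s, vt = splg.svds(A, k=1, which='LM')   # largest-magnitude singular triplet
sigma_max = s[0]                           # the largest singular value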

Computing the spectral norms of ~1m Hermitian matrices: `numpy.linalg.norm` is too slow

I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, int(1e6)) + 1j*np.random.rand(8, 8, int(1e6))
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython

np.import_array()

DTYPE = np.complex64

@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck is in my code, i.e. whether it is CPU-bound or memory-bound, or perhaps something different.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it is finding the maximum of the matrix singular values over the N dimension.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time is spent in svd, which produces an (N,8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses np.linalg._umath_linalg functions - that is a compiled .so file.
The C code is in https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop. Any looping is in that C code, or in the LAPACK functions it calls.
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np

def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A

N = 100000
k = 8

A = random_symmetric(N, k)
norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))

print(np.allclose(norm1, norm2))
# True
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
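The same idea carries over to complex Hermitian matrices like the ones in the question; a sketch with a hypothetical (N, 8, 8) stack (eigvalsh operates on the last two axes):
B = np.random.randn(1000, 8, 8) + 1j * np.random.randn(1000, 8, 8)
B = B + B.conj().transpose(0, 2, 1)                          # make each 8x8 block Hermitian
spectral_norms = np.abs(np.linalg.eigvalsh(B)).max(axis=1)   # one spectral norm per matrix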

How to speed up a vector cross product calculation

Hi, I'm relatively new here and trying to do some calculations with numpy. One particular calculation takes a very long time, and I can't work out any faster way to achieve the same thing.
Basically it's part of a ray-triangle intersection algorithm, and I need to calculate all the vector cross products between two matrices of different sizes.
The code I was using was:
allhvals1 = numpy.cross( dirvectors[:,None,:], trivectors2[None,:,:] )
where dirvectors is an array of n vectors (xyz) and trivectors2 is an array of m vectors (xyz). allhvals1 is an n*m array of the cross products (each an xyz vector).
This works but is very slow. It's essentially the n*m matrix of cross products of each vector from each array. Hope that you understand. The sizes of each vary from approx. 1 to 4000 depending on parameters (I basically chunk dirvectors depending on size).
Any advice appreciated. Unfortunately my matrix math is somewhat flaky.
If you look at the source code of np.cross, it basically moves the xyz dimension to the front of the shape tuple for all arrays, and then has the calculation of each of the components spelled out like this:
x = a[1]*b[2] - a[2]*b[1]
y = a[2]*b[0] - a[0]*b[2]
z = a[0]*b[1] - a[1]*b[0]
In your case, each of those products requires allocating huge arrays, so the overall behavior is not very efficient.
Let's set up some test data:
u = np.random.rand(1000, 3)
v = np.random.rand(2000, 3)
In [13]: %timeit s1 = np.cross(u[:, None, :], v[None, :, :])
1 loops, best of 3: 591 ms per loop
We can try to compute it using Levi-Civita symbols and np.einsum as follows:
eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1
In [14]: %timeit s2 = np.einsum('ijk,uj,vk->uvi', eijk, u, v)
1 loops, best of 3: 706 ms per loop
In [15]: np.allclose(s1, s2)
Out[15]: True
So while it works, performance is worse. The thing is that np.einsum has trouble when there are more than two operands, but has optimized pathways for two or fewer. So we can try to rewrite it in two steps to see if that helps:
In [16]: %timeit s3 = np.einsum('iuk,vk->uvi', np.einsum('ijk,uj->iuk', eijk, u), v)
10 loops, best of 3: 63.4 ms per loop
In [17]: np.allclose(s1, s3)
Out[17]: True
Bingo! Close to an order of magnitude improvement...
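Wrapped up as a small helper, just to give the two-step einsum a name (eijk is the Levi-Civita array built above; the function name is mine):
def pairwise_cross(u, v):
    """Cross product of every row of u (n, 3) with every row of v (m, 3) -> (n, m, 3)."""
    return np.einsum('iuk,vk->uvi', np.einsum('ijk,uj->iuk', eijk, u), v)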
Some performance figures for NumPy 1.11.0, with a = numpy.random.rand(n,3) and b = numpy.random.rand(n,3): the nested einsum is about twice as fast as cross for the largest n tested.
While writing dynamic simulations for underwater vehicles, I found this method for a fast cross product:
https://github.com/simena86/Simulink-Underwater-Robotics-Simulator/blob/master/3rdparty/gnc_mfiles/Smtrx.m
It works well. It is written in Matlab, but the code is very simple; just read the comments at the top.
