Hi, I'm relatively new here and trying to do some calculations with numpy. One particular calculation is taking a very long time, and I can't work out any faster way to achieve the same thing.
Basically it's part of a ray-triangle intersection algorithm, and I need to calculate all the vector cross products between two matrices of different sizes.
The code I was using was:
allhvals1 = numpy.cross( dirvectors[:,None,:], trivectors2[None,:,:] )
where dirvectors is an array of n vectors (xyz) and trivectors2 is an array of m vectors (xyz). allhvals1 is the resulting n*m array of cross products (xyz).
This works but is very slow. It's essentially the n*m matrix of cross products between every vector in one array and every vector in the other; I hope that makes sense. The size of each array varies from roughly 1 to 4000 depending on parameters (I basically chunk dirvectors depending on size).
Any advice appreciated. Unfortunately my matrix math is somewhat flaky.
If you look at the source code of np.cross, it basically moves the xyz dimension to the front of the shape tuple for all arrays, and then has the calculation of each of the components spelled out like this:
x = a[1]*b[2] - a[2]*b[1]
y = a[2]*b[0] - a[0]*b[2]
z = a[0]*b[1] - a[1]*b[0]
In your case, each of those products requires allocating huge arrays, so the overall behavior is not very efficient.
Let's set up some test data:
u = np.random.rand(1000, 3)
v = np.random.rand(2000, 3)
In [13]: %timeit s1 = np.cross(u[:, None, :], v[None, :, :])
1 loops, best of 3: 591 ms per loop
We can try to compute it using Levi-Civita symbols and np.einsum as follows:
eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1
In [14]: %timeit s2 = np.einsum('ijk,uj,vk->uvi', eijk, u, v)
1 loops, best of 3: 706 ms per loop
In [15]: np.allclose(s1, s2)
Out[15]: True
So while it works, it actually performs worse. The thing is that np.einsum has trouble when there are more than two operands, but has optimized pathways for two or fewer. So we can try to rewrite it in two steps, to see if that helps:
In [16]: %timeit s3 = np.einsum('iuk,vk->uvi', np.einsum('ijk,uj->iuk', eijk, u), v)
10 loops, best of 3: 63.4 ms per loop
In [17]: np.allclose(s1, s3)
Out[17]: True
Bingo! Close to an order of magnitude improvement...
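For convenience, the two-step einsum can be wrapped in a small helper (a sketch; eijk is the Levi-Civita array built above, and the commented call at the end uses the question's variable names):
import numpy as np
def cross_all(u, v):
    # All pairwise cross products: out[i, j] == np.cross(u[i], v[j])
    partial = np.einsum('ijk,uj->iuk', eijk, u)   # contract u with the symbol first
    return np.einsum('iuk,vk->uvi', partial, v)
# allhvals1 = cross_all(dirvectors, trivectors2)   # shape (n, m, 3)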
Some performance figures for NumPy 1.11.0, with a = numpy.random.rand(n, 3) and b = numpy.random.rand(n, 3): the nested einsum is about twice as fast as cross for the largest n tested.
While writing dynamic simulations for underwater vehicles, I found this method for a fast cross product:
https://github.com/simena86/Simulink-Underwater-Robotics-Simulator/blob/master/3rdparty/gnc_mfiles/Smtrx.m
It works well. It is written in Matlab, but the code is very simple; just read the comments at the top.
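The same idea carries over directly to NumPy: the cross product a x b equals S(a) @ b, where S(a) is the skew-symmetric matrix built from a. A rough sketch of that construction (not a translation of the linked file, just the same trick):
import numpy as np
def smtrx(a):
    # Skew-symmetric matrix S(a) such that S(a) @ b == np.cross(a, b)
    return np.array([[    0, -a[2],  a[1]],
                     [ a[2],     0, -a[0]],
                     [-a[1],  a[0],     0]])
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
assert np.allclose(smtrx(a) @ b, np.cross(a, b))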
I've implemented a k-means clustering algorithm in Python, and now I want to label new data with the clusters I got from my algorithm. My approach is to iterate over every data point and every centroid to find the minimum distance and the centroid associated with it. But I wonder whether there is a simpler or shorter way to do it.
def assign_cluster(clusterDict, data):
    clusterList = []
    label = []
    cen = list(clusterDict.values())
    for i in range(len(data)):
        for j in range(len(cen)):
            # if cen[j] has the minimum distance with data[i]
            # then clusterList[i] = cen[j]
where clusterDict is a dictionary whose keys are labels [0, 1, 2, ...] and whose values are the coordinates of the centroids.
Can someone help me implement this?
This is a good use case for numba, because it lets you express this as a simple double loop without a big performance penalty, which in turn allows you to avoid the excessive extra memory of using np.tile to replicate the data across a third dimension just to do it in a vectorized manner.
Borrowing the standard vectorized numpy implementation from the other answer, I have these two implementations:
import numba
import numpy as np
def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape
    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])
    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)
@numba.jit
def kmeans_assignment2(centroids, points):
    P, C = points.shape[0], centroids.shape[0]
    distances = np.zeros((P, C), dtype=np.float32)
    for p in range(P):
        for c in range(C):
            distances[p, c] = np.sum(np.square(centroids[c] - points[p]))
    return np.argmin(distances, axis=1)
Then for some sample data, I did a few timing experiments:
In [12]: points = np.random.rand(10000, 50)
In [13]: centroids = np.random.rand(30, 50)
In [14]: %timeit kmeans_assignment(centroids, points)
196 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit kmeans_assignment2(centroids, points)
127 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I won't go so far as to say that the numba version is certainly faster than the np.tile version, but clearly it's very close while not incurring the extra memory cost of np.tile.
In fact, I noticed on my laptop that when I make the shapes larger and use (10000, 1000) for the shape of points and (200, 1000) for the shape of centroids, np.tile generates a MemoryError, while the numba function runs in under 5 seconds with no memory error.
Separately, I actually noticed a slowdown when using numba.jit on the first version (with np.tile), which is likely due to the extra array creation inside the jitted function combined with the fact that there's not much numba can optimize when you're already calling all vectorized functions.
And I also did not notice any significant improvement in the second version when trying to shorten the code by using broadcasting. E.g. shortening the double loop to be
for p in range(P):
    distances[p, :] = np.sum(np.square(centroids - points[p, :]), axis=1)
did not really help anything (and would use more memory when repeatedly broadcasting points[p, :] across all of centroids).
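For completeness, the fully broadcast version of the assignment (no np.tile, though it still materializes a (num_points, num_centroids, dim) intermediate, so its memory profile is similar to the tiled implementation) would look roughly like this sketch:
import numpy as np
def kmeans_assignment_broadcast(centroids, points):
    # Broadcast to a (num_points, num_centroids, dim) difference array,
    # then reduce to squared distances and pick the closest centroid per point.
    diffs = points[:, None, :] - centroids[None, :, :]
    return np.argmin((diffs ** 2).sum(axis=2), axis=1)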
This is one of the really nice benefits of numba. You really can write the algorithm in a very straightforward, loop-based way that matches standard descriptions of the algorithm, and it gives you finer control over how the code translates into memory consumption or broadcasting... all without giving up runtime performance.
An efficient way to perform the assignment phase is with a vectorized computation. This approach assumes that you start with two 2D arrays, points and centroids, with the same number of columns (the dimensionality of the space) but possibly different numbers of rows. By using tiling (np.tile) we can compute the distance matrix in one batch and then select the closest centroid for each point.
Here's the code:
def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape
    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])
    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)
See this GitHub gist for a complete runnable example.
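A minimal usage sketch (the shapes here are made up; labels[i] is the row index in centroids of the centroid closest to points[i]):
import numpy as np
points = np.random.rand(500, 2)      # 500 points in 2D
centroids = np.random.rand(5, 2)     # 5 centroids in the same space
labels = kmeans_assignment(centroids, points)   # shape (500,), values in 0..4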
Consider a regular matrix that represents nodes numbered row by row, as shown in the figure:
I want to make a list of all the triangles represented in the figure, which would result in the following two-dimensional list: [[0,1,4],[1,5,4],[1,2,5],[2,6,5],...,[11,15,14]]
Assuming that the dimensions of the matrix are (Nr x Nc) ((4 x 4) in this case), I was able to achieve this result with the following code:
def MakeFaces(Nr, Nc):
    Nfaces = (Nr-1)*(Nc-1)*2
    Faces = np.zeros((Nfaces, 3), dtype=np.int32)
    for r in range(Nr-1):
        for c in range(Nc-1):
            fi = (r*(Nc-1) + c)*2
            l1 = r*Nc + c
            l2 = l1 + 1
            l3 = l1 + Nc
            l4 = l3 + 1
            Faces[fi] = [l1, l2, l3]
            Faces[fi+1] = [l2, l4, l3]
    return Faces
However, the double loop operations make this approach quite slow. Is there a way of using numpy in a smart way to do this faster?
We can play a multi-dimensional game based on slicing and multi-dimensional assignment, both of which NumPy handles very efficiently:
def MakeFacesVectorized1(Nr, Nc):
    out = np.empty((Nr-1, Nc-1, 2, 3), dtype=int)
    r = np.arange(Nr*Nc).reshape(Nr, Nc)
    out[:, :, 0, 0] = r[:-1, :-1]
    out[:, :, 1, 0] = r[:-1, 1:]
    out[:, :, 0, 1] = r[:-1, 1:]
    out[:, :, 1, 1] = r[1:, 1:]
    out[:, :, :, 2] = r[1:, :-1, None]
    out.shape = (-1, 3)
    return out
Runtime test and verification -
In [226]: Nr,Nc = 100, 100
In [227]: np.allclose(MakeFaces(Nr, Nc), MakeFacesVectorized1(Nr, Nc))
Out[227]: True
In [228]: %timeit MakeFaces(Nr, Nc)
100 loops, best of 3: 11.9 ms per loop
In [229]: %timeit MakeFacesVectorized1(Nr, Nc)
10000 loops, best of 3: 133 µs per loop
In [230]: 11900/133.0
Out[230]: 89.47368421052632
Around 90x speedup for Nr, Nc = 100, 100!
You can achieve a similar result without any explicit loops if you recast the problem correctly. One way would be to imagine the result as three arrays, each containing one of the vertices: first, second and third. You can then zip up or otherwise convert the arrays into whatever format you like in a fairly inexpensive operation.
You start with the actual matrix. This will make indexing and selecting elements much easier:
m = np.arange(Nr * Nc).reshape(Nr, Nc)
The first array will contain all the 90-degree corners:
c1 = np.concatenate((m[:-1, :-1].ravel(), m[1:, 1:].ravel()))
m[:-1, :-1] are the right-angle corners of the upper triangles (the top-left node of each quad), and m[1:, 1:] are the right-angle corners of the lower triangles (the bottom-right node of each quad).
The second array will contain the corresponding top-right corners (shared by both triangles of each quad):
c2 = np.concatenate((m[:-1, 1:].ravel(), m[:-1, 1:].ravel()))
And the third array will contain the bottom-left corners (also shared by both triangles):
c3 = np.concatenate((m[1:, :-1].ravel(), m[1:, :-1].ravel()))
You can now get an array like your original one back by zipping:
faces = list(zip(c1, c2, c3))
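If you would rather have a NumPy array than a list of tuples, np.column_stack does the same job without a Python-level zip (a small variation on the above; note that the vertex ordering within each triangle may differ from the loop version):
faces = np.column_stack((c1, c2, c3))   # shape (2*(Nr-1)*(Nc-1), 3)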
I am sure that you can find ways to improve this algorithm, but it is a start.
I am working on a python project and making use of numpy. I frequently have to compute Kronecker products of matrices by the identity matrix. These are a pretty big bottleneck in my code so I would like to optimize them. There are two kinds of products I have to take. The first one is:
np.kron(np.eye(N), A)
This one is pretty easy to optimize by simply using scipy.linalg.block_diag. The product is equivalent to:
la.block_diag(*[A]*N)
Which is about 10 times faster. However, I am unsure how to optimize the second kind of product:
np.kron(A, np.eye(N))
Is there a similar trick I can use?
One approach would be to initialize a 4D output array and then assign values into it from A. Such an assignment broadcasts values, and that is where we get efficiency in NumPy.
Thus, a solution would be like so -
# Get shape of A
m,n = A.shape
# Initialize output array as 4D
out = np.zeros((m,N,n,N))
# Get range array for indexing into the second and fourth axes
r = np.arange(N)
# Index into the second and fourth axes and selecting all elements along
# the rest to assign values from A. The values are broadcasted.
out[:,r,:,r] = A
# Finally reshape back to 2D
out.shape = (m*N,n*N)
Put as a function -
def kron_A_N(A, N):  # Simulates np.kron(A, np.eye(N))
    m, n = A.shape
    out = np.zeros((m, N, n, N), dtype=A.dtype)
    r = np.arange(N)
    out[:, r, :, r] = A
    out.shape = (m*N, n*N)
    return out
To simulate np.kron(np.eye(N), A), simply swap the roles of the first and second axes, and similarly the third and fourth:
def kron_N_A(A, N):  # Simulates np.kron(np.eye(N), A)
    m, n = A.shape
    out = np.zeros((N, m, N, n), dtype=A.dtype)
    r = np.arange(N)
    out[r, :, r, :] = A
    out.shape = (m*N, n*N)
    return out
Timings -
In [174]: N = 100
...: A = np.random.rand(100,100)
...:
In [175]: np.allclose(np.kron(A, np.eye(N)), kron_A_N(A,N))
Out[175]: True
In [176]: %timeit np.kron(A, np.eye(N))
1 loops, best of 3: 458 ms per loop
In [177]: %timeit kron_A_N(A, N)
10 loops, best of 3: 58.4 ms per loop
In [178]: 458/58.4
Out[178]: 7.842465753424658
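For completeness, the np.kron(np.eye(N), A) helper can be sanity-checked the same way (a quick sketch reusing the A and N defined above):
# Expected to print True: kron_N_A builds the same block-diagonal result
print(np.allclose(np.kron(np.eye(N), A), kron_N_A(A, N)))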
I've got some big datasets to which I'd like to fit monoexponential time decays.
The data consists of multiple 4D datasets, acquired at different times, and the fit should thus run along a 5th dimension (through datasets).
The code I'm currently using is the following:
import numpy as np
import scipy.optimize as opt
[... load 4D datasets ....]
data = (dataset1, dataset2, dataset3)
times = (10, 20, 30)
def monoexponential(t, M0, t_const):
    return M0*np.exp(-t/t_const)
# Starting guesses to initiate descent.
M0_init = 80.0
t_const_init = 50.0
init_guess = (M0_init, t_const_init)
def fit(vector):
    try:
        nlfit, nlpcov = opt.curve_fit(monoexponential, times, vector,
                                      p0=init_guess,
                                      sigma=None,
                                      check_finite=False,
                                      maxfev=100, ftol=0.5, xtol=1,
                                      bounds=([0, 0], [2000, 800]))  # (lower, upper) bounds for (M0, t_const)
        M0, t_const = nlfit
    except:
        t_const = 0
    return t_const
# Concatenate datasets in data into a single 5D array.
concat5D = np.concatenate([block[..., np.newaxis] for block in data],
                          axis=len(data[0].shape))
# And apply the curve fitting along the last dimension.
decay_map = np.apply_along_axis(fit, len(concat5D.shape) - 1, concat5D)
The code works fine, but takes forever (e.g., for dataset1.shape == (100,100,50,500)). I've read some other topics mentioning that apply_along_axis is very slow, so I'm guessing that's the culprit. Unfortunately, I don't really know what could be used as an alternative here (except maybe an explicit for loop?).
Does anyone have an idea of what I can do to avoid apply_along_axis and speed up curve_fit being called multiple times?
So you are applying a fit operation 100*100*50*500 times, to a 1d array (of 3 values in the example, more in real life?)?
apply_along_axis does iterate over all the dimensions of the input array, except for one. There's no compiling or doing this fit over multiple axes at once.
Without apply_along_axis, the easiest approach is to reshape the array into a 2D one, compressing (100,100,50,500) into a single dimension of 250,000,000 elements, then iterating over that, and then reshaping the result.
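In code, that reshape-iterate-reshape pattern would look roughly like this (a sketch using the question's concat5D and fit; the loop is still plain Python, so the per-call cost of curve_fit dominates):
n_times = concat5D.shape[-1]                        # length of `times` (3 in the example)
flat = concat5D.reshape(-1, n_times)                # (100*100*50*500, 3)
t_consts = np.array([fit(vec) for vec in flat])     # one curve_fit call per voxel
decay_map = t_consts.reshape(concat5D.shape[:-1])   # back to the 4D spatial shape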
I was thinking that concatenating the datasets on a last axis might be slower than doing so on the first, but timings suggest otherwise.
np.stack is a newer version of concatenate that makes it easy to add a new axis anywhere.
In [319]: x=np.ones((2,3,4,5),int)
In [320]: d=[x,x,x,x,x,x]
In [321]: np.stack(d,axis=0).shape # same as np.array(d)
Out[321]: (6, 2, 3, 4, 5)
In [322]: np.stack(d,axis=-1).shape
Out[322]: (2, 3, 4, 5, 6)
for a larger list (with a trivial sum function):
In [295]: d1=[x]*1000 # make a big list
In [296]: timeit np.apply_along_axis(sum,-1,np.stack(d1,-1)).shape
10 loops, best of 3: 39.7 ms per loop
In [297]: timeit np.apply_along_axis(sum,0,np.stack(d1,0)).shape
10 loops, best of 3: 39.2 ms per loop
An explicit loop using array reshape times about the same:
In [312]: %%timeit
.....: d2=np.stack(d1,-1)
.....: d2=d2.reshape(-1,1000)
.....: res=np.stack([sum(i) for i in d2],0).reshape(d1[0].shape)
.....:
10 loops, best of 3: 39.1 ms per loop
But a function like sum can work on the whole array at once, and do so much faster:
In [315]: timeit np.stack(d1,-1).sum(-1).shape
100 loops, best of 3: 3.52 ms per loop
So changing the stacking and iteration methods doesn't make much difference in speed. But changing the 'fit' so it can work over more than one dimension at a time could be a big help. I don't know enough about optimize.curve_fit to know if that is possible.
====================
I just dug into the code for apply_along_axis. It basically constructs an index that looks like ind = (0, 1, slice(None), 2, 1), does func(arr[ind]), and then increments it, sort of like long arithmetic with a carry. So it is just systematically stepping through all elements, while keeping one axis as a : slice.
In this particular case, where you're fitting a single exponential, you're likely better off taking the log of your data. The fit then becomes linear, which is much faster than a nonlinear least squares fit and can likely be vectorized, since it turns into pretty much a linear algebra problem.
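As a sketch of what that could look like (reusing the question's concat5D and times; this assumes the data values are strictly positive, and np.polyfit fits every voxel at once when given a 2D y):
import numpy as np
t = np.asarray(times, dtype=float)
# log(M0 * exp(-t/t_const)) = log(M0) - t/t_const, i.e. a straight line in t
y = np.log(concat5D.reshape(-1, t.size).T)      # shape (len(times), n_voxels)
slope, intercept = np.polyfit(t, y, 1)          # one linear fit per voxel (column)
t_const = -1.0 / slope
M0 = np.exp(intercept)
decay_map = t_const.reshape(concat5D.shape[:-1])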
(And of course, if you have an idea of how to improve least_squares, that might be appreciated by the scipy devs.)
I have a 3D image with dimensions rows x cols x deps. For every voxel in the image, I have computed a 3x3 real symmetric matrix. They are stored in the array D, which therefore has shape (rows, cols, deps, 6).
D stores the 6 unique elements of the 3x3 symmetric matrix for every voxel in my image. I need to find the Moore-Penrose pseudo-inverse of all rows*cols*deps matrices simultaneously, in vectorized code (looping through every image voxel and inverting is far too slow in Python).
Some of these 3x3 symmetric matrices are non-singular, and I can find their inverses, in vectorized code, using the analytical formula for the true inverse of a non-singular 3x3 symmetric matrix, and I've done that.
However, for those matrices that ARE singular (and there are sure to be some) I need the Moore-Penrose pseudo-inverse. I could derive an analytical formula for the pseudo-inverse of a real, singular, symmetric 3x3 matrix, but it's a really nasty, lengthy formula, and would therefore involve a very large amount of (element-wise) matrix arithmetic and quite a bit of confusing-looking code.
Hence, I would like to know if there is a way to simultaneously find the MP pseudo inverse for all these matrices at once numerically. Is there a way to do this?
Gratefully,
GF
NumPy 1.8 included linear algebra gufuncs, which do exactly what you are after. While np.linalg.pinv is not gufunc-ed, np.linalg.svd is, and behind the scenes that is the function that gets called. So you can define your own gu_pinv function, based on the source code of the original function, as follows:
def gu_pinv(a, rcond=1e-15):
    a = np.asarray(a)
    swap = np.arange(a.ndim)
    swap[[-2, -1]] = swap[[-1, -2]]
    u, s, v = np.linalg.svd(a)
    cutoff = np.maximum.reduce(s, axis=-1, keepdims=True) * rcond
    mask = s > cutoff
    s[mask] = 1. / s[mask]
    s[~mask] = 0
    return np.einsum('...uv,...vw->...uw',
                     np.transpose(v, swap) * s[..., None, :],
                     np.transpose(u, swap))
And you can now do things like:
a = np.random.rand(50, 40, 30, 6)
b = np.empty(a.shape[:-1] + (3, 3), dtype=a.dtype)
# Expand the unique items into a full symmetrical matrix
b[..., 0, :] = a[..., :3]
b[..., 1:, 0] = a[..., 1:3]
b[..., 1, 1:] = a[..., 3:5]
b[..., 2, 1:] = a[..., 4:]
# make matrix at [1, 2, 3] singular
b[1, 2, 3, 2] = b[1, 2, 3, 0] + b[1, 2, 3, 1]
# Find all the pseudo-inverses
pi = gu_pinv(b)
And of course the results are correct, both for singular and non-singular matrices:
>>> np.allclose(pi[0, 0, 0], np.linalg.pinv(b[0, 0, 0]))
True
>>> np.allclose(pi[1, 2, 3], np.linalg.pinv(b[1, 2, 3]))
True
And for this example, with 50 * 40 * 30 = 60,000 pseudo-inverses calculated:
In [2]: %timeit pi = gu_pinv(b)
1 loops, best of 3: 422 ms per loop
Which is really not that bad, although it is noticeably (about 4x) slower than simply calling np.linalg.inv; np.linalg.inv, of course, fails to handle the singular matrices properly:
In [8]: %timeit np.linalg.inv(b)
10 loops, best of 3: 98.8 ms per loop
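(For what it's worth, sufficiently recent NumPy versions have since made np.linalg.pinv itself broadcast over stacked matrices, in which case the single call below should be all that is needed; with the NumPy available when this answer was written, the gu_pinv above is the way to go.)
pi = np.linalg.pinv(b)   # on newer NumPy, works directly on the (50, 40, 30, 3, 3) stack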
EDIT: See @Jaime's answer. Only the discussion in the comments to this answer is useful now, and only for the specific problem at hand.
You can do this matrix by matrix using scipy, which provides pinv (link) to calculate the Moore-Penrose pseudo-inverse. An example follows:
from scipy.linalg import det, eig, pinv
from numpy.random import randint
# generate a random singular matrix M first
while True:
    M = randint(0, 10, 9).reshape(3, 3)
    if det(M) == 0:
        break
M = M.astype(float)
# this is the method you need
MPpseudoinverse = pinv(M)
This does not exploit the fact that the matrix is symmetric, though. You may also want to try the version of pinv exposed by numpy, which is supposedly faster, and different. See this post.
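Applied to the question's data, this matrix-by-matrix route is just a loop over the voxels (a rough sketch; D_full is an assumed name for the (rows, cols, deps, 3, 3) array of expanded symmetric matrices, and, as the question notes, a plain Python loop like this will be slow for large images):
import numpy as np
from scipy.linalg import pinv
pinvs = np.empty_like(D_full)
for idx in np.ndindex(D_full.shape[:-2]):   # iterate over all (row, col, dep) triples
    pinvs[idx] = pinv(D_full[idx])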