This question is focused on the performance of the calculation.
I have two matrices of vectors, i.e. each has a depth dimension of 3 holding X, Y, Z components. Each element of one matrix has to form a dot product with the element at the same position in the other matrix.
A simple but inefficient implementation would be:
import math
import numpy as np

a = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
b = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
c = np.zeros((1000,1000))

numRow, numCol, numDepth = np.shape(a)
for idRow in range(numRow):
    for idCol in range(numCol):
        # Angle in radians
        c[idRow,idCol] = math.acos(a[idRow,idCol,0]*b[idRow,idCol,0] + a[idRow,idCol,1]*b[idRow,idCol,1] + a[idRow,idCol,2]*b[idRow,idCol,2])
However, NumPy's vectorized functions can speed up the calculation considerably:
# Angle in radians
d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) + np.multiply(a[:,:,1],b[:,:,1]) + np.multiply(a[:,:,2],b[:,:,2]))
However, I would like to know whether there is other syntax, perhaps using different functions or indexing, that improves on the version above.
The first version takes 4.658 s, while the second takes 0.354 s.
You can do this with np.einsum, which multiplies and then sums over any axes:
np.arccos(np.einsum('ijk,ijk->ij', a, b))
The more straightforward way to do what you posted in the question is to use np.sum, where you sum along the last axis (-1):
np.arccos(np.sum(a*b, -1))
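As a quick sanity check, the two reductions agree (comparing the dot products before arccos, since these random vectors are not normalized and values outside [-1, 1] would turn into NaN):
dots_einsum = np.einsum('ijk,ijk->ij', a, b)
dots_sum = np.sum(a*b, -1)
np.allclose(dots_einsum, dots_sum)   # True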
They all give the same answer but einsum is the fastest and sum is next:
In [36]: timeit np.arccos(np.einsum('ijk,ijk->ij', a, b))
10000 loops, best of 3: 20.4 µs per loop
In [37]: timeit e = np.arccos(np.sum(a*b, -1))
10000 loops, best of 3: 29.8 µs per loop
In [38]: %%timeit
....: d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) +
....: np.multiply(a[:,:,1],b[:,:,1]) +
....: np.multiply(a[:,:,2],b[:,:,2]))
....:
10000 loops, best of 3: 34.6 µs per loop
The Pythran compiler can further optimize your original expression by:
Removing temporary arrays
Using SIMD instructions
Using multithreading
As showcased by this example:
$ cat cross.py
#pythran export cross(float[][][], float[][][])
import numpy as np
def cross(a, b):
    return np.arccos(np.multiply(a[:, :, 0], b[:, :, 0]) + np.multiply(a[:, :, 1], b[:, :, 1]) + np.multiply(a[:, :, 2], b[:, :, 2]))
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
10 loops, best of 3: 35.4 msec per loop
$ pythran cross.py -DUSE_BOOST_SIMD -fopenmp -march=native
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
100 loops, best of 3: 11.8 msec per loop
I have a NumPy array of arrays Ai, and I want each outer product (np.outer(Ai[i], Ai[j])) to be summed with a scaling multiplier to produce H. I can step through and build them, then tensordot them with a matrix of scaling factors, but I think this could be simplified significantly; I haven't figured out a general/efficient way to do it for N-D. How can Arr2D and H be produced more easily? Note: Arr2D could be 64 2-D arrays rather than an 8x8 grid of 2-D arrays.
Ai = np.random.random((8,101))
Arr2D = np.zeros((Ai.shape[0], Ai.shape[0], Ai.shape[1], Ai.shape[1]))
Arr2D[:,:,:,:] = np.asarray([np.outer(Ai[i], Ai[j]) for i in range(Ai.shape[0])
                             for j in range(Ai.shape[0])]).reshape(Ai.shape[0], Ai.shape[0], Ai[0].size, Ai[0].size)
arr = np.random.random( (Ai.shape[0] * Ai.shape[0]) )
arr2D = arr.reshape(Ai.shape[0], Ai.shape[0])
H = np.tensordot(Arr2D, arr2D, axes=([0,1],[0,1]))
Good setup to leverage einsum!
np.einsum('ij,kl,ik->jl',Ai,Ai,arr2D,optimize=True)
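As a quick equivalence check against the original tensordot construction (reusing Ai and arr2D from the question):
Arr2D = np.asarray([np.outer(Ai[i], Ai[j]) for i in range(Ai.shape[0])
                    for j in range(Ai.shape[0])]).reshape(Ai.shape[0], Ai.shape[0], Ai[0].size, Ai[0].size)
H = np.tensordot(Arr2D, arr2D, axes=([0, 1], [0, 1]))
np.allclose(H, np.einsum('ij,kl,ik->jl', Ai, Ai, arr2D, optimize=True))   # True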
Timings -
In [71]: # Setup inputs
...: Ai = np.random.random((8,101))
...: arr = np.random.random( (Ai.shape[0] * Ai.shape[0]) )
...: arr2D = arr.reshape(Ai.shape[0], Ai.shape[0])
In [74]: %%timeit # Original soln
...: Arr2D = np.zeros((Ai.shape[0], Ai.shape[0], Ai.shape[1], Ai.shape[1]))
...: Arr2D[:,:,:,:] = np.asarray([ np.outer(Ai[i], Ai[j]) for i in range(Ai.shape[0])
...: for j in range(Ai.shape[0]) ]).reshape(Ai.shape[0],Ai.shape[0],Ai[0].size,Ai[0].size)
...: H = np.tensordot(Arr2D, arr2D, axes=([0,1],[0,1]))
100 loops, best of 3: 4.5 ms per loop
In [75]: %timeit np.einsum('ij,kl,ik->jl',Ai,Ai,arr2D,optimize=True)
10000 loops, best of 3: 146 µs per loop
30x+ speedup there!
I have an input matrix A of size I*J
And an output matrix B of size N*M
And some precalculated map of size N*M*2 that dictates for each coordinate in B, which coordinate in A to take. The map has no specific rule or linearity that I can use. Just a map that seems random.
The matrices are pretty big (~5000*~3000) so creating a mapping matrix is out of the question (5000*3000*5000*3000)
I managed to do it using a simple map and loop:
for i in range(N):
    for j in range(M):
        B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
And I managed to do it using indexing:
B[coords_y, coords_x] = A[some_mapping[:, 0], some_mapping[:, 1]]
# Where coords_x, coords_y are defined as all of the coordinates:
# [[0,0],[0,1]..[0,M-1],[1,0],[1,1]...[N-1,M-1]]
This works much better, but still kind of slow.
I have infinite time in advance to calculate the mapping or any other utility calculation. But after these precalculations, this mapping should happen as fast as possible.
Currently, the only other option that I see is just to reimplement this in C or something faster...
(Just to make it clear if someone is curious: I'm creating an image out of some other, differently shaped and oriented image with some encoding. But its mapping is very complicated and not something simple or linear that can be exploited.)
If you have infinite time for precomputing you can get a slight speedup by going to flat indexing:
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
Then simply do:
A.ravel()[map_f]
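For a 2-D A in C order, this is the same flat index you could compute by hand, so an equivalent (though less general) precomputation would be:
map_f_manual = mapping[..., 0] * A.shape[1] + mapping[..., 1]
np.array_equal(map_f, map_f_manual)   # True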
Please note that this speedup is on top of the large speedup we get from fancy indexing. For example:
>>> A = np.random.random((5000, 3000))
>>> mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
>>>
>>> map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
>>>
>>> np.all(A.ravel()[map_f] == A[mapping[..., 0], mapping[..., 1]])
True
>>>
>>> timeit('A[mapping[:, :, 0], mapping[:, :, 1]]', globals=globals(), number=10)
4.101239089999581
>>> timeit('A.ravel()[map_f]', globals=globals(), number=10)
2.7831342950012186
If we were to compare to the original loopy code, the speedup would be more like ~40x.
Finally, note that this solution not only avoids the additional dependency and potential installation hassle of numba, but is also simpler, shorter and faster:
numba:
precomp: 132.957 ms
main 238.359 ms
flat indexing:
precomp: 76.223 ms
main: 219.910 ms
Code:
import numpy as np
from numba import jit
@jit
def fast(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
from timeit import timeit
A = np.random.random((5000, 3000))
mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
a = np.random.random((5, 3))
m = np.random.randint(0, 15, (5, 3, 2)) % [5, 3]
print('numba:')
print(f"precomp: {timeit('b = fast(a, np.empty_like(a), m)', globals=globals(), number=1)*1000:10.3f} ms")
print(f"main {timeit('B = fast(A, np.empty_like(A), mapping)', globals=globals(), number=10)*100:10.3f} ms")
print('\nflat indexing:')
print(f"precomp: {timeit('map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)', globals=globals(), number=10)*100:10.3f} ms")
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
print(f"main: {timeit('B = A.ravel()[map_f]', globals=globals(), number=10)*100:10.3f} ms")
One very nice solution to these types of performance critical problems is to keep it simple and utilize one of the high performance packages. The easiest might be Numba which provides the jit decorator that compiles array and loop heavy code to optimized LLVM. Below is a full example:
from time import time
import numpy as np
from numba import jit
# Function doing the computation
def normal(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# The same exact function, but with the Numba jit decorator
@jit
def fast(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# Create sample data
def create_sample_data(I, J, N, M):
    A = np.random.random((I, J))
    B = np.empty((N, M))
    mapping = np.asarray(np.stack((
        np.random.random((N, M))*I,
        np.random.random((N, M))*J,
    ), axis=2), dtype=int)
    return A, B, mapping
A, B, mapping = create_sample_data(500, 600, 700, 800)
# Run normally
t0 = time()
B = normal(A, B, mapping)
t1 = time()
print('normal took', t1 - t0, 'seconds')
# Run using Numba.
# First we should run the function with smaller arrays,
# just to compile the code.
fast(*create_sample_data(5, 6, 7, 8))
# Now, run with real data
t0 = time()
B = fast(A, B, mapping)
t1 = time()
print('fast took', t1 - t0, 'seconds')
This uses your own looping solution, which is inherently slow using standard Python, but as fast as C when using Numba. On my machine the normal function executes in 0.270 seconds, while the fast function executes in 0.00248 seconds. That is, Numba gives us a 109x speedup (!) pretty much for free.
Note that the fast Numba function is called twice, first with small input arrays and only then with the real data. This is a critical step which is often neglected. Without it, you will find that the performance increase is not nearly as good, as the first call is used to compile the code. The types and dimensions of the input arrays should be the same in this initial call, but the size in each dimension is not important.
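If the warm-up call is inconvenient, Numba also supports eager compilation when you supply an explicit type signature. This is only a sketch: the dtypes in the signature must match the arrays you actually pass in (here assuming a 64-bit platform where dtype=int maps to int64).
from numba import jit

# Eager compilation: compiles at decoration time, so no warm-up call is needed.
# The signature assumes float64 arrays and an int64 mapping (platform dependent).
@jit('float64[:,:](float64[:,:], float64[:,:], int64[:,:,:])', nopython=True)
def fast_eager(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B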
I created B outside of the function(s) and passed it in as an argument (to be "filled with values"). You might just as well allocate B inside of the function; Numba does not care.
The easiest way to get Numba is probably via the Anaconda distribution.
One option would be to use numba, which can often provide substantial improvements in this kind of simple algorithmic code.
import numpy as np
from numba import njit
I, J = 5000, 5000
N, M = 3000, 3000
A = np.random.randint(0, 10, [I, J])
B = np.random.randint(0, 10, [N, M])
mapping = np.dstack([np.random.randint(0, I - 1, (N, M)),
np.random.randint(0, J - 1, (N, M))])
B0 = B.copy()
def orig(A, B, mapping):
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
new = njit(orig)
which gives us matching results:
In [313]: Bold = B0.copy()
In [314]: orig(A, Bold, mapping)
In [315]: Bnew = B0.copy()
In [316]: new(A, Bnew, mapping)
In [317]: (Bold == Bnew).all()
Out[317]: True
and is much faster:
In [320]: %time orig(A, B0.copy(), mapping)
Wall time: 6.11 s
In [321]: %time new(A, B0.copy(), mapping)
Wall time: 257 ms
and faster still after the first call, which is when it has to do its one-time JIT compilation work:
In [322]: %time new(A, B0.copy(), mapping)
Wall time: 171 ms
In [323]: %time new(A, B0.copy(), mapping)
Wall time: 163 ms
for a 30x+ improvement from adding two lines of code.
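If you want to squeeze out more, numba can also multithread the outer loop with prange. A sketch (hypothetical new_par; gains depend on core count and memory bandwidth):
from numba import njit, prange

@njit(parallel=True)
def new_par(A, B, mapping):
    # prange distributes iterations of the outer loop across threads
    for i in prange(B.shape[0]):
        for j in range(B.shape[1]):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]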
The most straightforward optimization you can do is drop the native python loops and use fancy numpy indexing. You already have the array to do that:
import numpy as np
A = np.random.rand(2000,3000)
B = np.empty((2500,3500)) # just for shape, really
# this is the same as your original, but with random indices
mapping = np.stack([np.random.randint(0, A.shape[0] - 1, B.shape),
np.random.randint(0, A.shape[1] - 1, B.shape)],
axis=-1)
# your loopy original
def loopy(A, B, mapping):
    B = B.copy()
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B

# vectorization with fancy indexing
def fancy(A, mapping):
    return A[mapping[...,0], mapping[...,1]]
Note that the fancy advanced-indexing function doesn't need preallocation of a B array, as a new array is constructed by the indexing operation.
There's a slight variation of the fancy indexing version which could be marginally more efficient: put your last dimension of mapping first, in this way both indexing arrays are contiguous blocks of memory. It turns out from my timing test that this happens to be slower in the above setup. Anyway:
mapping_T = mapping.transpose(2, 0, 1).copy() # but it's actually `mapping` without axis=-1 kwarg
# has shape (2, N, M)
def fancy_T(A, mapping_T):
    return A[tuple(mapping_T)]
As Paul Panzer noted in a comment, just calling .transpose on mapping will not create a copy, but rather implement the transpose using stride tricks. In order to end up with a contiguous array (which is the point of the optimization) we need to force the creation of a copy.
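You can inspect the memory layout via the flags attribute:
>>> mapping.transpose(2, 0, 1).flags['C_CONTIGUOUS']
False
>>> mapping.transpose(2, 0, 1).copy().flags['C_CONTIGUOUS']
True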
I get the following timings in ipython:
# loopy(A, B, mapping)
6.63 s ± 141 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy(A, mapping)
250 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy_T(A, mapping_T)
277 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To be honest I don't understand why the original array order is faster compared to the transposed, but there's that.
I noticed that indexing a multi-dimensional array takes more time than indexing a single-dimensional array:
a1 = np.arange(1000000)
a2 = np.arange(1000000).reshape(1000, 1000)
a3 = np.arange(1000000).reshape(100, 100, 100)
When I index a1
%%timeit
a1[500000]
The slowest run took 39.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 84.6 ns per loop
%%timeit
a2[500, 0]
The slowest run took 31.85 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 102 ns per loop
%%timeit
a3[50, 0, 0]
The slowest run took 46.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 119 ns per loop
At what point should I consider an alternative way to index or slice a multi-dimensional array? What are the circumstances that make it worth the effort and loss of transparency?
One alternative to indexing an (n, m) array is to flatten the array and derive what its one-dimensional position must be.
Consider a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]).
We can get the 2nd row, 3rd column with a[1, 2] and get 5.
Or we can calculate that 1 * a.shape[1] + 2 is the one-dimensional position if we flatten a with order='C'.
Thus we can perform the equivalent lookup with a.ravel()[1 * a.shape[1] + 2].
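NumPy can also do this index arithmetic for you via np.ravel_multi_index, which returns the same flat position (here both the position and the stored value happen to be 5):
>>> a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
>>> np.ravel_multi_index((1, 2), a.shape)
5
>>> a.ravel()[np.ravel_multi_index((1, 2), a.shape)]
5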
Is this efficient? No, for indexing a single number from an array, it isn't worth the trouble.
What if we want to pull many numbers from the array? I devised the following test for a 2-D array.
2-D test
import numpy as np
import pandas as pd
from timeit import timeit

n, m = 10000, 10000
a = np.random.rand(n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])

for k in r.index:
    b = np.random.randint(n, size=k)
    c = np.random.randint(m, size=k)
    kw = dict(setup='from __main__ import a, b, c', number=100)
    r.loc[k, 'Multi'] = timeit('a[b, c]', **kw)
    r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] + c]', **kw)

r.div(r.sum(1), 0).plot.bar()
It appears that when slicing more than 100,000 numbers, it's better to flatten the array.
What about 3-D
3-D test
from timeit import timeit

l, n, m = 1000, 1000, 1000
a = np.random.rand(l, n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])

for k in r.index:
    b = np.random.randint(l, size=k)
    c = np.random.randint(n, size=k)
    d = np.random.randint(m, size=k)
    kw = dict(setup='from __main__ import a, b, c, d', number=100)
    r.loc[k, 'Multi'] = timeit('a[b, c, d]', **kw)
    # row-major flat index for shape (l, n, m): the stride for c is a.shape[2]
    r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] * a.shape[2] + c * a.shape[2] + d]', **kw)

r.div(r.sum(1), 0).plot.bar()
Similar results, maybe more dramatic.
Conclusion
For 2-dimensional arrays, consider flattening and deriving flat positions if you need to pull more than 100,000 elements from the array.
For 3 or more dimensions, it seems clear that flattening the array is almost always better.
Criticism is welcome
Did I do something wrong? Did I not think of something obvious?
I perform the cross product of contiguous segments of a trajectory (xy coordinates) using the following script:
In [129]:
def func1(xy, s):
    size = xy.shape[0]-2*s
    out = np.zeros(size)
    for i in range(size):
        p1, p2 = xy[i], xy[i+s]      # segment 1
        p3, p4 = xy[i+s], xy[i+2*s]  # segment 2
        out[i] = np.cross(p1-p2, p4-p3)
    return out

def func2(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    tmp1 = p1-p2
    tmp2 = p4-p3
    return tmp1[:, 0] * tmp2[:, 1] - tmp2[:, 0] * tmp1[:, 1]
In [136]:
xy = np.array([[1,2],[2,3],[3,4],[5,6],[7,8],[2,4],[5,2],[9,9],[1,1]])
func2(xy, 2)
Out[136]:
array([ 0, -3, 16, 1, 22])
func1 is particularly slow because of the inner loop so I rewrote the cross-product myself (func2) which is orders of magnitude faster.
Is it possible to use the numpy einsum function to make the same calculation?
einsum computes sums of products only, but you can shoehorn the cross product into a sum of products by reversing the columns of tmp2 and changing the sign of its second column:
def func3(xy, s):
    size = xy.shape[0]-2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    tmp2 = tmp2[:, ::-1]
    tmp2[:, 1] *= -1
    return np.einsum('ij,ij->i', tmp1, tmp2)
But func3 is slower than func2.
In [80]: xy = np.tile(xy, (1000, 1))
In [104]: %timeit func1(xy, 2)
10 loops, best of 3: 67.5 ms per loop
In [105]: %timeit func2(xy, 2)
10000 loops, best of 3: 73.2 µs per loop
In [106]: %timeit func3(xy, 2)
10000 loops, best of 3: 108 µs per loop
Sanity check:
In [86]: np.allclose(func1(xy, 2), func3(xy, 2))
Out[86]: True
I think the reason why func2 is beating einsum here is because the cost of setting of the loop in einsum for just 2 iterations is too expensive compared to just manually writing out the sum, and the reversing and multiplying eat up some time as well.
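For what it's worth, the column reversal and sign flip can also be folded into a constant matrix inside a single einsum call. A sketch (hypothetical func4); it avoids mutating tmp2, but still carries einsum's setup overhead, so it is unlikely to beat func2 either:
def func4(xy, s):
    # J encodes the 2-D cross product: row-wise tmp1 @ J @ tmp2
    J = np.array([[0, 1], [-1, 0]])
    size = xy.shape[0]-2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    return np.einsum('ij,jk,ik->i', tmp1, J, tmp2)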
np.cross is a smart little beast that can handle broadcasting without any issue. So you can rewrite your func2 as:
def func2(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    return np.cross(p1-p2, p4-p3)
and it will produce the correct result:
>>> func2(xy, 2)
array([ 0, -3, 16, 1, 22])
In the latest numpy it will likely run a tad faster than your code, as it was rewritten to minimize intermediate array creation. You can look at the source code (pure Python) here.
I have this function to calculate squared Mahalanobis distance of vector x to mean:
def mahalanobis_sqdist(x, mean, Sigma):
    '''
    Calculates squared Mahalanobis Distance of vector x
    to the distribution's mean
    '''
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
    return sqmdist
I have a NumPy array with shape (25, 4), and I want to apply this function to all 25 rows of the array without a for loop. Basically, how can I write the vectorized form of this loop:
for r in d1:
    mahalanobis_sqdist(r[0:4], mean1, Sig1)
where mean1 and Sig1 are:
>>> mean1
array([ 5.028, 3.48 , 1.46 , 0.248])
>>> Sig1 = np.cov(d1[0:25, 0:4].T)
>>> Sig1
array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
I have tried the following but it didn't work:
>>> vecdist = np.vectorize(mahalanobis_sqdist)
>>> vecdist(d1, mean1, Sig1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1862, in __call__
theout = self.thefunc(*newargs)
File "<stdin>", line 6, in mahalanobis_sqdist
File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range
To apply a function to each row of an array, you could use:
np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
In this case, however, there is a better way. You don't have to apply a function to each row. Instead, you can apply NumPy operations to the entire d1 array to calculate the same result. np.einsum can replace the for-loop and the two calls to np.dot:
def mahalanobis_sqdist2(d, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = d - mean
    return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)
Here are some benchmarks:
import numpy as np
np.random.seed(1)
def mahalanobis_sqdist(x, mean, Sigma):
    '''
    Calculates squared Mahalanobis Distance of vector x
    to the distribution's mean
    '''
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
    return sqmdist

def mahalanobis_sqdist2(d, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = d - mean
    return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)

def using_loop(d1, mean, Sigma):
    expected = []
    for r in d1:
        expected.append(mahalanobis_sqdist(r[0:4], mean, Sigma))
    return np.array(expected)
d1 = np.random.random((25,4))
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
Sig1 = np.cov(d1[0:25, 0:4].T)
expected = using_loop(d1, mean1, Sig1)
result = np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
result2 = mahalanobis_sqdist2(d1, mean1, Sig1)
assert np.allclose(expected, result)
assert np.allclose(expected, result2)
In [92]: %timeit mahalanobis_sqdist2(d1, mean1, Sig1)
10000 loops, best of 3: 31.1 µs per loop
In [94]: %timeit using_loop(d1, mean1, Sig1)
1000 loops, best of 3: 569 µs per loop
In [91]: %timeit np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
1000 loops, best of 3: 806 µs per loop
Thus mahalanobis_sqdist2 is about 18x faster than a for-loop, and 26x faster than using np.apply_along_axis.
Note that np.apply_along_axis, np.vectorize, np.frompyfunc are Python utility functions. Under the hood they use for- or while-loops. There is no real "vectorization" going on here. They can provide syntactic assistance, but don't expect them to make your code perform any better than a for-loop you write yourself.
The answer by @unutbu works very nicely for applying any function to the rows of an array.
In this particular case, there are some mathematical symmetries you can use that will speed things up considerably if you are working with large arrays.
Here is a modified version of your function:
def mahalanobis_sqdist3(x, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    return (xdiff.dot(Sigma_inv)*xdiff).sum(axis=-1)
If you end up using any sort of large Sigma, I would recommend that you cache Sigma_inv and pass that in as an argument to your function instead.
Since it is 4x4 in this example, this doesn't matter.
I'll show how to deal with large Sigma anyway for anyone else who comes across this.
If you aren't going to be using the same Sigma repeatedly, you won't be able to cache it, so, instead of inverting the matrix, you could use a different method to solve the linear system.
Here I'll use the LU decomposition built in to SciPy.
This only improves the time if the number of columns of x is large relative to its number of rows.
Here is a function that shows that approach:
from scipy.linalg import lu_factor, lu_solve

def mahalanobis_sqdist4(x, mean, Sigma):
    xdiff = x - mean
    # lu_factor returns an LU factorization, not an inverse
    Sigma_lu = lu_factor(Sigma)
    return (xdiff.T*lu_solve(Sigma_lu, xdiff.T)).sum(axis=0)
Here are some timings.
I'll include the version with einsum as mentioned in the other answer.
import numpy as np
Sig1 = np.array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
x = np.random.rand(25, 4)
%timeit np.apply_along_axis(mahalanobis_sqdist, 1, x, mean1, Sig1)
%timeit mahalanobis_sqdist2(x, mean1, Sig1)
%timeit mahalanobis_sqdist3(x, mean1, Sig1)
%timeit mahalanobis_sqdist4(x, mean1, Sig1)
giving:
1000 loops, best of 3: 973 µs per loop
10000 loops, best of 3: 36.2 µs per loop
10000 loops, best of 3: 40.8 µs per loop
10000 loops, best of 3: 83.2 µs per loop
However, changing the sizes of the arrays involved changes the timing results.
For example, letting x = np.random.rand(2500, 4), the timings are:
10 loops, best of 3: 95 ms per loop
1000 loops, best of 3: 355 µs per loop
10000 loops, best of 3: 131 µs per loop
1000 loops, best of 3: 337 µs per loop
And letting x = np.random.rand(1000, 1000), Sigma1 = np.random.rand(1000, 1000), and mean1 = np.random.rand(1000), the timings are:
1 loops, best of 3: 1min 24s per loop
1 loops, best of 3: 2.39 s per loop
10 loops, best of 3: 155 ms per loop
10 loops, best of 3: 99.9 ms per loop
Edit: I noticed that one of the other answers used the Cholesky decomposition.
Given that Sigma is symmetric and positive definite, we can actually do better than my above results.
There are some good routines from BLAS and LAPACK available through SciPy that can work with symmetric positive-definite matrices.
Here are two faster versions.
from scipy.linalg.blas import dsymm

def mahalanobis_sqdist5(x, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    return np.einsum('...i,...i->...', dsymm(1., Sigma_inv, xdiff.T).T, xdiff)

from scipy.linalg.lapack import dposv

def mahalanobis_sqdist6(x, mean, Sigma):
    xdiff = x - mean
    return np.einsum('...i,...i->...', xdiff, dposv(Sigma, xdiff.T)[1].T)
The first one still inverts Sigma.
If you pre-compute the inverse and reuse it, it is much faster (the 1000x1000 case takes 35.6ms on my machine with the pre-computed inverse).
I also used einsum to take the product then sum along the last axis.
This ended up being marginally faster than doing something like (A * B).sum(axis=-1).
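For illustration, the two spellings compute the same row-wise dot products (a small standalone check):
A = np.random.rand(1000, 4)
B = np.random.rand(1000, 4)
np.allclose(np.einsum('...i,...i->...', A, B), (A * B).sum(axis=-1))   # True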
These two functions give the following timings:
First test case:
10000 loops, best of 3: 55.3 µs per loop
100000 loops, best of 3: 14.2 µs per loop
Second test case:
10000 loops, best of 3: 121 µs per loop
10000 loops, best of 3: 79 µs per loop
Third test case:
10 loops, best of 3: 92.5 ms per loop
10 loops, best of 3: 48.2 ms per loop
Just saw a really nice comment on reddit that might speed things up even a little more:
This is not surprising to anyone who uses numpy regularly. For loops in python are horribly slow. Actually, einsum is pretty slow too. Here's a version that is faster if you have lots of vectors (500 vectors in 4 dimensions is enough to make this version faster than einsum on my machine):
def no_einsum(d, mean, Sigma):
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma))
    xdiff = d - mean
    return np.sum(np.dot(xdiff, L_inv.T)**2, axis=1)
If your points are also high dimensional then computing the inverse is slow (and generally a bad idea anyway) and you can save time by solving the system directly (500 vectors in 250 dimensions is enough to make this version the fastest on my machine):
def no_einsum_solve(d, mean, Sigma):
    L = np.linalg.cholesky(Sigma)
    xdiff = d - mean
    return np.sum(np.linalg.solve(L, xdiff.T)**2, axis=0)
The problem is that np.vectorize vectorizes over all arguments, but you need to vectorize only over the first one. You need to use the excluded keyword argument to vectorize:
np.vectorize(mahalanobis_sqdist, excluded=[1, 2])
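Note that this is still a Python-level loop under the hood (see the remark about np.vectorize above). On newer NumPy versions you can alternatively pass a generalized signature so the function is applied per row rather than per element; a sketch:
# Applies mahalanobis_sqdist to each length-4 row of d1; still loops in Python.
vecdist = np.vectorize(mahalanobis_sqdist, signature='(n),(n),(n,n)->()')
result = vecdist(d1, mean1, Sig1)   # shape (25,)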