For a project, I need to generate sample from function. I would like to be able to generate those samples as quickly as possible.
I have this example (in the final version, the function lambda will be provided in the arguments) The goal is to generate ys of the n points linespaced xs between start and stop using the lambda function.
def get_ys(coefficients, num_outputs=20, start=0., stop=1.):
function = lambda x, args: args[0]*(x-args[1])**2 + args[2]*(x-args[3]) + args[4]
xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
ys = [function(x, coefficients) for x in xs]
return ys
%%time
n = 1000
xs = np.random.random((n,5))
ys = np.apply_along_axis(get_ys, 1, xs)
Wall time: 616 ms
I am trying to vectorize it, and found numpy.apply_along_axis
%%time
for i in range(1000):
xs = np.random.random(5)
ys = get_ys(xs)
Wall time: 622 ms
Unfortunately it is still pretty slow :/
I am not so familiar with function vectorization, can someone guide me a little bit on how to improve the speed of the script ?
Thanks!
Edit:
example of input/output:
xs = np.ones(5)
ys = get_ys(xs)
[1.0, 0.9501385041551247, 0.9058171745152355, 0.8670360110803323, 0.8337950138504155,0.8060941828254848, 0.7839335180055402, 0.7673130193905817, 0.7562326869806094, 0.7506925207756232, 0.7506925207756232, 0.7562326869806094, 0.7673130193905817, 0.7839335180055401, 0.8060941828254847, 0.8337950138504155, 0.8670360110803323, 0.9058171745152354, 0.9501385041551246, 1.0]
def get_ys(coefficients, num_outputs=20, start=0., stop=1.):
function = lambda x, args: args[0]*(x-args[1])**2 + args[2]*(x-args[3]) + args[4]
xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
ys = [function(x, coefficients) for x in xs]
return ys
You are trying to get around calling get_ys 1000 times, once for each row of xs.
What will it take to pass xs as a whole to get_ys? In other words, what if coefficients was (n,5) instead of (5,)?
xs is (20,), and the ys will be same (right)?
The lambda is write to expect a scalar x and (5,) args. Can it be changed to work with a (20,) x and (n,5) args?
As a first step, what does function produce if given xs? That is instead of
ys = [function(x, coefficients) for x in xs]
ys = function(xs, coefficients)
As written your code iterates (at slow Python speeds) of the n (1000) rows, and the 20 linspace. So function is called 20,000 times. That's what makes your code slow.
Lets try that change
A sample run with your function:
In [126]: np.array(get_ys(np.arange(5)))
Out[126]:
array([-2. , -1.89473684, -1.78947368, -1.68421053, -1.57894737,
-1.47368421, -1.36842105, -1.26315789, -1.15789474, -1.05263158,
-0.94736842, -0.84210526, -0.73684211, -0.63157895, -0.52631579,
-0.42105263, -0.31578947, -0.21052632, -0.10526316, 0. ])
Replace the list comprehension with just one call to function:
In [127]: def get_ys1(coefficients, num_outputs=20, start=0., stop=1.):
...: function = lambda x, args: args[0]*(x-args[1])**2 + args[2]*(x-args[3]) + args[4]
...:
...: xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
...: ys = function(xs, coefficients)
...: return ys
...:
...:
Same values:
In [128]: get_ys1(np.arange(5))
Out[128]:
array([-2. , -1.89473684, -1.78947368, -1.68421053, -1.57894737,
-1.47368421, -1.36842105, -1.26315789, -1.15789474, -1.05263158,
-0.94736842, -0.84210526, -0.73684211, -0.63157895, -0.52631579,
-0.42105263, -0.31578947, -0.21052632, -0.10526316, 0. ])
Comparative timings:
In [129]: timeit np.array(get_ys(np.arange(5)))
345 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: timeit get_ys1(np.arange(5))
89.2 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That's what we mean by "vectorization" - replacing python level iterations (a list comprehension) with an equivalent that makes fuller user of numpy array methods.
I suspect we can move on to work with a (n,5) coefficients, but this should be enough to get you started.
fully vectorized
By broadcasting the (n,5) against (20,) I can get a function that does not have any python loops:
def get_ys2(coefficients, num_outputs=20, start=0., stop=1.):
function = lambda x, args: args[:,0]*(x-args[:,1])**2 + args[:,2]*(x-args[:,3]) + args[:,4]
xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
ys = function(xs[:,None], coefficients)
return ys.T
And with a (1,5) input:
In [156]: get_ys2(np.arange(5)[None,:])
Out[156]:
array([[-2. , -1.89473684, -1.78947368, -1.68421053, -1.57894737,
-1.47368421, -1.36842105, -1.26315789, -1.15789474, -1.05263158,
-0.94736842, -0.84210526, -0.73684211, -0.63157895, -0.52631579,
-0.42105263, -0.31578947, -0.21052632, -0.10526316, 0. ]])
With your test case:
In [146]: n = 1000
...: xs = np.random.random((n,5))
...: ys = np.apply_along_axis(get_ys, 1, xs)
In [147]: ys.shape
Out[147]: (1000, 20)
Two timings:
In [148]: timeit ys = np.apply_along_axis(get_ys, 1, xs)
...:
106 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [149]: timeit ys = np.apply_along_axis(get_ys1, 1, xs)
...:
88 ms ± 98.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and testing this
In [150]: ys2 = get_ys2(xs)
In [151]: ys2.shape
Out[151]: (1000, 20)
In [152]: np.allclose(ys, ys2)
Out[152]: True
In [153]: timeit ys2 = get_ys2(xs)
424 µs ± 484 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It matches values, and improves speed a lot.
In the new function, args can now be (n,5). And if x is (20,1), the result is (20,n), which I transpose on the return.
Related
I would like to write a function f that takes an arbitrary function g with the signature g : R^n -> R^n -> int that "lifts" g so that it operates on (R^{nxm}, R^{kxm}) by behaving like a dot product. Meaning I want f to have the signature f : R^{nxm} -> R^{mxk} -> R^{nxk} by applying g to all pairs of rows and columns in constructing a matrix M where M_ij = g(A[i,:], B[:,j]).
Is that possible?
For example scipy.spatial.distance.cosine expects two vectors. Now I would lift cosine with f:
from scipy.spatial.distance import cosine
A = np.random.randint(0, 3, (3,4))
B = np.random.randint(0, 3, (5,4))
cosine_lifted = f(cosine)
cosine_lifted(A, B)
This would then produce the same output as
def sim(A, B):
ignored_states = np.seterr(divide='raise')
return 1 - np.divide(np.dot(A, B.T), np.outer(np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)))
Which is the same as sklearn.metrics.pairwise.cosine_similarity plus the 1 - blah part.
But if there was not sklearn.metrics.pairwise.cosine_similarity, I would have to implement this lifted version of cosine myself (which I of course did here...). But I don't want to do that for all function that behave basically the same as the dot product in regard to how they mechanically process their argument. Therefore, I would like o have this f function.
I wrote my other answer assuming your
np.dot(A, B.T)
with a (3,4) and (5,4) inputs was the primary dot functionality that you were trying to emulate. In other words, (3,4), (4,5) => (3,5) with summation on the common size 4 dimension. My answer showed how that 2d calculation can be performed with element-wise multiplications.
For what it's worth, np.dot gets much of its speed by passing the task to BLAS (or similar) optimized libraries. These have been written in C or Fortran, and optimized by generations of numerical-analysis coders.
But your signature description may be talking about a different thing. It's a bit confusing.
g : R^n -> R^n -> int
Does this mean that g(x,y) takes two (n,) shape arrays, and returns an integer? And it can't be generalized to work with 2d arrays?
f : R^{nxm} -> R^{kxm} -> R^{nxm}
Does this mean f(A, B) takes a (n,m) shape, and a (k,m) shape, and returns a (n,m) shape? What happened to the k shape? Is that k a typo?
Alternatively you talk about doing (I believe)
M = np.zeros((N,N)) # (N,M) ok?
for i in range(N):
for j in range(N):
x = A[i,:]; y = B[:,j]
M[i,j] = g(x, y)
alternatively:
M = np.array([[g(x,y) for y in B.T] for x in A])
Assuming g is a python function that can only work with 2 1d arrays (of matching length), and cannot be generalized to 2d arrays, there isn't any mechanism in numpy to compile the above double loop. g has to be evaluated N**2 times. And assuming g is not trivial, those N*2 evaluations will dominate the total evaluation time, not the iteration mechanism.
np.vectorize normally takes a function that accepts scalar inputs, but with a signature parameter it can work with your g:
f = np.vectorize(g, signature='(n),(n)') # signature syntax may be wrong
M = f(A, B.T)
but in my testing vectorize has always been slower than an explicit iteration. With a signature it's even slower. So I kind of hesitate even mentioning it.
Are you asking for a function with a signature as simple as what follows (able to multiply two matrices), or do you want to emulate the entire np.dot api surface?
def lift(f):
def dot(A, B):
return np.array([[f(v,w) for w in zip(*B)] for v in A])
return dot
A major source of inefficiency in the above code is the allocations for all the intermediate lists. Since we know the final return value those are easy to avoid:
def lift(f):
def dot(A, B):
result = np.empty((A.shape[0], B.shape[1]))
for i,v in enumerate(A):
for j,w in enumerate(zip(*B)):
result[i,j] = f(v,w)
return result
return dot
Loops are fairly expensive in Python, but since f is operating on k elements it seems reasonable to assume that this overhead is small. You could reduce it further by compiling with pypy or cython.
matmul has been cast as a ufunc, and formally has a signature. np.dot is an earlier version and doesn't have a signature.
But given 2d arrays, np.dot is effectively a broadcasted form of multiplication followed by summation, or 'sum of products':
In [587]: A = np.arange(12).reshape(3,4)
In [588]: B = np.arange(8).reshape(2,4)
In [589]: np.dot(A, B.T)
Out[589]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
equivalent:
In [591]: (A[:,None,:]*B[None,:,:]).sum(axis=2)
Out[591]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
Some find the einsum style of signature easier to follow:
In [594]: np.einsum('ij,kj->ik', A, B)
Out[594]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
where the repeated j signals dot like summation.
===
Illustrating the iteration in my other answer:
In [601]: def g(x,y):
...: return (x*y).sum()
...:
In [602]: A.shape, B.shape
Out[602]: ((3, 4), (2, 4))
In [603]: np.array([[g(x,y) for y in B] for x in A])
Out[603]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
and the vectorize version:
In [614]: f = np.vectorize(g, signature='(n),(n)->()')
In [615]: f(A[:,None,:], B[None,:,:])
Out[615]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
comparative times:
In [616]: timeit f(A[:,None,:], B[None,:,:])
255 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [617]: timeit np.array([[g(x,y) for y in B] for x in A])
69.4 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [618]: timeit np.dot(A, B.T)
3.15 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and using #hans' 2nd lift:
In [623]: h = lift(g)
In [624]: h(A,B.T)
Out[624]:
array([[ 14., 38.],
[ 38., 126.],
[ 62., 214.]])
In [625]: timeit h(A,B.T)
102 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Assume you have a numpy array with shape (a,b,c) and a boolean mask of shape (a,b,c,d).
I would like to apply the mask to the array iterating over the last axis, sum the masked array along the first three axes, and obtain a list (or an array) of length/shape (d,).
I tried to do this with a list comprehension:
Result = [np.sum(Array[Mask[:,:,:,i]], axis=(0,1,2)) for i in range(d)]
It works, but it does not look very pythonic and it is a bit slow as well.
I also tried something like
Array = Array[:,:,:,np.newaxis]
Result = np.sum(Array[Mask], axis=(0,1,2))
but of course this doesn't work, since the dimension of the Mask along the last axis, d, is larger than the dimension of the last axis of the Array, 1.
Also, consider that each axis could have dimension of order 100 or 200, so repeating the Array d times along a new last axis using np.repeat would be really memory intensive, and I would like to avoid this.
Are there any other faster and more pythonic alternatives to the list comprehension?
The most straightforward way of broadcasting a N-dimensional array to a matching (N+1)-dimensional array is to use np.broadcast_to():
import numpy as np
arr = np.random.randint(0, 100, (2, 3))
mask = np.random.randint(0, 2, (2, 3, 4), dtype=bool)
b_arr = np.broadcast_to(arr[..., None], mask.shape)
print(mask.shape == b_arr.shape)
# True
However, as #hpaulj already pointed out, you cannot use mask for slicing b_arr without loosing the dimensions.
Given that you want to just sum the elements together and summing zeroes "does not hurt", you could simply multiply element-wise your array and your mask so as to keep the correct dimension but the elements that are False in the mask are irrelevant for the subsequent sum of the corresponding array elements:
result = np.sum(b_arr * mask, axis=tuple(range(mask.ndim - 1)))
or, since * will do the broadcasting automatically:
result = np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
without the need to use np.broadcast_to() in the first place (but you still need to match the number of dimension, i.e. using arr[..., None] and not just arr).
As #PaulPanzer already pointed out, since you want to sum up over all but one dimensions, this can be further simplified using np.matmul()/#:
result2 = arr.ravel() # mask.reshape(-1, mask.shape[-1])
print(np.all(result == result2))
# True
For fancier operations involving the summation, please have a look at np.einsum().
EDIT
The catch with broadcasting is that it will create temporary arrays during the evaluation of your expressions.
With the number you seems to be dealing with, I simply cannot use the broadcasted arrays as I run into MemoryError, but time-wise the element-wise multiplication may still be a better approach than what you originally proposed.
Alternatively, if you are after speed, you could do this at a somewhat lower level with explicit looping in Cython or Numba.
Below you can find a couple of Numba-based solutions (working on ravel()-ed data):
_vector_matrix_product(): does not use any temporary array
_vector_matrix_product_mp(): some as above but using parallel execution
_vector_matrix_product_sum(): uses np.sum() and parallel execution
import numpy as np
import numba as nb
#nb.jit(nopython=True)
def _vector_matrix_product(
vect_arr,
mat_arr,
result_arr):
rows, cols = mat_arr.shape
if vect_arr.shape == result_arr.shape:
for i in range(rows):
for j in range(cols):
result_arr[i] += vect_arr[j] * mat_arr[i, j]
else:
for i in range(rows):
for j in range(cols):
result_arr[j] += vect_arr[i] * mat_arr[i, j]
#nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_mp(
vect_arr,
mat_arr,
result_arr):
rows, cols = mat_arr.shape
if vect_arr.shape == result_arr.shape:
for i in nb.prange(rows):
for j in nb.prange(cols):
result_arr[i] += vect_arr[j] * mat_arr[i, j]
else:
for i in nb.prange(rows):
for j in nb.prange(cols):
result_arr[j] += vect_arr[i] * mat_arr[i, j]
#nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_sum(
vect_arr,
mat_arr,
result_arr):
rows, cols = mat_arr.shape
if vect_arr.shape == result_arr.shape:
for i in nb.prange(rows):
result_arr[i] = np.sum(vect_arr * mat_arr[i, :])
else:
for j in nb.prange(cols):
result_arr[j] = np.sum(vect_arr * mat_arr[:, j])
def vector_matrix_product(
vect_arr,
mat_arr,
swap=False,
dtype=None,
mode=None):
rows, cols = mat_arr.shape
if not dtype:
dtype = (vect_arr[0] * mat_arr[0, 0]).dtype
if not swap:
result_arr = np.zeros(cols, dtype=dtype)
else:
result_arr = np.zeros(rows, dtype=dtype)
if mode == 'sum':
_vector_matrix_product_sum(vect_arr, mat_arr, result_arr)
elif mode == 'mp':
_vector_matrix_product_mp(vect_arr, mat_arr, result_arr)
else:
_vector_matrix_product(vect_arr, mat_arr, result_arr)
return result_arr
np.random.seed(0)
arr = np.random.randint(0, 100, (2, 3, 4))
mask = np.random.randint(0, 2, (2, 3, 4, 5), dtype=bool)
target = arr.ravel() # mask.reshape(-1, mask.shape[-1])
print(target)
# [820 723 861 486 408]
result1 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
print(result1)
# [820 723 861 486 408]
result2 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
print(result2)
# [820 723 861 486 408]
result3 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
print(result3)
# [820 723 861 486 408]
with improved timing over any list-comprehension-based solutions:
arr = np.random.randint(0, 100, (256, 256, 256))
mask = np.random.randint(0, 2, (256, 256, 256, 128), dtype=bool)
%timeit np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
# MemoryError
%timeit arr.ravel() # mask.reshape(-1, mask.shape[-1])
# MemoryError
%timeit np.array([np.sum(arr * mask[..., i], axis=tuple(range(mask.ndim - 1))) for i in range(mask.shape[-1])])
# 24.1 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.array([np.sum(arr[mask[..., i]]) for i in range(mask.shape[-1])])
# 46 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
# 408 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
# 1.63 s ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
# 7.17 s ± 258 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As expected, the JIT accelerated version is the fastest, and enforcing parallelism on the code does not result in improved speed-ups.
Note also that the approach with element-wise multiplication is faster than slicing (approx. twice as fast for these benchmarks).
EDIT 2
Following #max9111 suggestion, looping first by rows and then by cols cause the most time-consuming loop to run on contiguous data, resulting in significant speed-up.
Without this trick, _vector_matrix_product_sum() and _vector_matrix_product_mp() would run at essentially the same speed.
How about
Array.reshape(-1)#Mask.reshape(-1,d)
Since you are summing over the first three axes anyway you may as well merge them after which it is easy to see that the operation can be written as matrix-vector product
Example:
a,b,c,d = 4,5,6,7
Mask = np.random.randint(0,2,(a,b,c,d),bool)
Array = np.random.randint(0,10,(a,b,c))
[np.sum(Array[Mask[:,:,:,i]]) for i in range(d)]
# [310, 237, 253, 261, 229, 268, 184]
Array.reshape(-1)#Mask.reshape(-1,d)
# array([310, 237, 253, 261, 229, 268, 184])
I have an array
import numpy as np
X = np.array([[0.7513, 0.6991, 0.5472, 0.2575],
[0.2551, 0.8909, 0.1386, 0.8407],
[0.5060, 0.9593, 0.1493, 0.2543],
[0.5060, 0.9593, 0.1493, 0.2543]])
y = np.array([[1,2,3,4]])
How to replace the diagonal of X with y. We can write a loop but any faster way?
A fast and reliable method is np.einsum:
>>> diag_view = np.einsum('ii->i', X)
This creates a view of the diagonal:
>>> diag_view
array([0.7513, 0.8909, 0.1493, 0.2543])
This view is writable:
>>> diag_view[None] = y
>>> X
array([[1. , 0.6991, 0.5472, 0.2575],
[0.2551, 2. , 0.1386, 0.8407],
[0.506 , 0.9593, 3. , 0.2543],
[0.506 , 0.9593, 0.1493, 4. ]])
This works for contiguous and non-contiguous arrays and is very fast:
contiguous:
loop 21.146424998732982
diag_indices 2.595232878000388
einsum 1.0271988900003635
flatten 1.5372659160002513
non contiguous:
loop 20.133818001340842
diag_indices 2.618005960001028
einsum 1.0305795049989683
Traceback (most recent call last): <- flatten does not work here
...
How does it work? Under the hood einsum does an advanced version of #Julien's trick: It adds the strides of arr:
>>> arr.strides
(3200, 16)
>>> np.einsum('ii->i', arr).strides
(3216,)
One can convince oneself that this will always work as long as arr is organized in strides, which is the case for numpy arrays.
While this use of einsum is pretty neat it is also almost impossible to find if one doesn't know. So spread the word!
Code to recreate the timings and the crash:
import numpy as np
n = 100
arr = np.zeros((n, n))
replace = np.ones(n)
def loop():
for i in range(len(arr)):
arr[i,i] = replace[i]
def other():
l = len(arr)
arr.shape = -1
arr[::l+1] = replace
arr.shape = l,l
def di():
arr[np.diag_indices(arr.shape[0])] = replace
def es():
np.einsum('ii->i', arr)[...] = replace
from timeit import timeit
print('\ncontiguous:')
print('loop ', timeit(loop, number=1000)*1000)
print('diag_indices ', timeit(di))
print('einsum ', timeit(es))
print('flatten ', timeit(other))
arr = np.zeros((2*n, 2*n))[::2, ::2]
print('\nnon contiguous:')
print('loop ', timeit(loop, number=1000)*1000)
print('diag_indices ', timeit(di))
print('einsum ', timeit(es))
print('flatten ', timeit(other))
This should be pretty fast (especially for bigger arrays, for your example it's about twice slower):
arr = np.zeros((4,4))
replace = [1,2,3,4]
l = len(arr)
arr.shape = -1
arr[::l+1] = replace
arr.shape = l,l
Test on bigger array:
n = 100
arr = np.zeros((n,n))
replace = np.ones(n)
def loop():
for i in range(len(arr)):
arr[i,i] = replace[i]
def other():
l = len(arr)
arr.shape = -1
arr[::l+1] = replace
arr.shape = l,l
%timeit(loop())
%timeit(other())
14.7 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.55 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Use diag_indices for a vectorized solution:
X[np.diag_indices(X.shape[0])] = y
array([[1. , 0.6991, 0.5472, 0.2575],
[0.2551, 2. , 0.1386, 0.8407],
[0.506 , 0.9593, 3. , 0.2543],
[0.506 , 0.9593, 0.1493, 4. ]])
The question is more focused on performance of calculation.
I have 2 vector-matrix. This means that they have a 3 depth dimension for X,Y,Z. Each element of the matrix has to make dot product with the element on the same position of the other matriz.
A simple and non efficient code will be this one:
import numpy as np
a = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
b = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
c = np.zeros((1000,1000))
numRow,numCol,numDepth = np.shape(a)
for idRow in range(numRow):
for idCol in range(numCol):
# Angle in radians
c[idRow,idCol] = math.acos(a[idRow,idCol,0]*b[idRow,idCol,0] + a[idRow,idCol,1]*b[idRow,idCol,1] + a[idRow,idCol,2]*b[idRow,idCol,2])
However, the numpy functions can speed up the calculations as the following ones, making code much faster:
# Angle in radians
d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) + np.multiply(a[:,:,1],b[:,:,1]) + np.multiply(a[:,:,2],b[:,:,2]))
However, I would like to know if there are other sintaxis that improve this one above with maybe other functions, indices,...
First code takes 4.658s while second takes 0.354s
You can do this with np.einsum, which multiplies and then sums over any axes:
np.arccos(np.einsum('ijk,ijk->ij', a, b))
The more straightforward way to do what you posted in the question is to use np.sum, where you sum along the last axis (-1):
np.arccos(np.sum(a*b, -1))
They all give the same answer but einsum is the fastest and sum is next:
In [36]: timeit np.arccos(np.einsum('ijk,ijk->ij', a, b))
10000 loops, best of 3: 20.4 µs per loop
In [37]: timeit e = np.arccos(np.sum(a*b, -1))
10000 loops, best of 3: 29.8 µs per loop
In [38]: %%timeit
....: d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) +
....: np.multiply(a[:,:,1],b[:,:,1]) +
....: np.multiply(a[:,:,2],b[:,:,2]))
....:
10000 loops, best of 3: 34.6 µs per loop
The Pythran compiler can further optimize your original expression by:
Removing temporary arrays
Using SIMD instructions
Using multithreading
As showcased by this example:
$ cat cross.py
#pythran export cross(float[][][], float[][][])
import numpy as np
def cross(a,b):
return np.arccos(np.multiply(a[:, :, 0], b[:, :, 0]) + np.multiply(a[:, :, 1],b[:, :, 1]) + np.multiply(a[:, :, 2], b[:, :, 2]))
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
10 loops, best of 3: 35.4 msec per loop
$ pythran cross.py -DUSE_BOOST_SIMD -fopenmp -march=native
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
100 loops, best of 3: 11.8 msec per loop
I have this function to calculate squared Mahalanobis distance of vector x to mean:
def mahalanobis_sqdist(x, mean, Sigma):
'''
Calculates squared Mahalanobis Distance of vector x
to distibutions' mean
'''
Sigma_inv = np.linalg.inv(Sigma)
xdiff = x - mean
sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
return sqmdist
I have an numpy array that has a shape of (25, 4). So, I want to apply that function to all 25 rows of my array without a for loop. So, basically, how can I write the vectorized form of this loop:
for r in d1:
mahalanobis_sqdist(r[0:4], mean1, Sig1)
where mean1 and Sig1 are :
>>> mean1
array([ 5.028, 3.48 , 1.46 , 0.248])
>>> Sig1 = np.cov(d1[0:25, 0:4].T)
>>> Sig1
array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
I have tried the following but it didn't work:
>>> vecdist = np.vectorize(mahalanobis_sqdist)
>>> vecdist(d1, mean1, Sig1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1862, in __call__
theout = self.thefunc(*newargs)
File "<stdin>", line 6, in mahalanobis_sqdist
File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range
To apply a function to each row of an array, you could use:
np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
In this case, however, there is a better way. You don't have to apply a function to each row. Instead, you can apply NumPy operations to the entire d1 array to calculate the same result. np.einsum can replace the for-loop and the two calls to np.dot:
def mahalanobis_sqdist2(d, mean, Sigma):
Sigma_inv = np.linalg.inv(Sigma)
xdiff = d - mean
return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)
Here are some benchmarks:
import numpy as np
np.random.seed(1)
def mahalanobis_sqdist(x, mean, Sigma):
'''
Calculates squared Mahalanobis Distance of vector x
to distibutions mean
'''
Sigma_inv = np.linalg.inv(Sigma)
xdiff = x - mean
sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
return sqmdist
def mahalanobis_sqdist2(d, mean, Sigma):
Sigma_inv = np.linalg.inv(Sigma)
xdiff = d - mean
return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)
def using_loop(d1, mean, Sigma):
expected = []
for r in d1:
expected.append(mahalanobis_sqdist(r[0:4], mean1, Sig1))
return np.array(expected)
d1 = np.random.random((25,4))
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
Sig1 = np.cov(d1[0:25, 0:4].T)
expected = using_loop(d1, mean1, Sig1)
result = np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
result2 = mahalanobis_sqdist2(d1, mean1, Sig1)
assert np.allclose(expected, result)
assert np.allclose(expected, result2)
In [92]: %timeit mahalanobis_sqdist2(d1, mean1, Sig1)
10000 loops, best of 3: 31.1 µs per loop
In [94]: %timeit using_loop(d1, mean1, Sig1)
1000 loops, best of 3: 569 µs per loop
In [91]: %timeit np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
1000 loops, best of 3: 806 µs per loop
Thus mahalanobis_sqdist2 is about 18x faster than a for-loop, and 26x faster than using np.apply_along_axis.
Note that np.apply_along_axis, np.vectorize, np.frompyfunc are Python utility functions. Under the hood they use for- or while-loops. There is no real "vectorization" going on here. They can provide syntactic assistance, but don't expect them to make your code perform any better than a for-loop you write yourself.
The answer by #unutbu works very nicely for applying any function to the rows of an array.
In this particular case, there are some mathematical symmetries you can use that will speed things up considerably if you are working with large arrays.
Here is a modified version of your function:
def mahalanobis_sqdist3(x, mean, Sigma):
Sigma_inv = np.linalg.inv(Sigma)
xdiff = x - mean
return (xdiff.dot(Sigma_inv)*xdiff).sum(axis=-1)
If you end up using any sort of large Sigma, I would recommend that you cache Sigma_inv and pass that in as an argument to your function instead.
Since it is 4x4 in this example, this doesn't matter.
I'll show how to deal with large Sigma anyway for anyone else who comes across this.
If you aren't going to be using the same Sigma repeatedly, you won't be able to cache it, so, instead of inverting the matrix, you could use a different method to solve the linear system.
Here I'll use the LU decomposition built in to SciPy.
This only improves the time if the number of columns of x is large relative to its number of rows.
Here is a function that shows that approach:
from scipy.linalg import lu_factor, lu_solve
def mahalanobis_sqdist4(x, mean, Sigma):
xdiff = x - mean
Sigma_inv = lu_factor(Sigma)
return (xdiff.T*lu_solve(Sigma_inv, xdiff.T)).sum(axis=0)
Here are some timings.
I'll include the version with einsum as mentioned in the other answer.
import numpy as np
Sig1 = np.array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
x = np.random.rand(25, 4)
%timeit np.apply_along_axis(mahalanobis_sqdist, 1, x, mean1, Sig1)
%timeit mahalanobis_sqdist2(x, mean1, Sig1)
%timeit mahalanobis_sqdist3(x, mean1, Sig1)
%timeit mahalanobis_sqdist4(x, mean1, Sig1)
giving:
1000 loops, best of 3: 973 µs per loop
10000 loops, best of 3: 36.2 µs per loop
10000 loops, best of 3: 40.8 µs per loop
10000 loops, best of 3: 83.2 µs per loop
However, changing the sizes of the arrays involved changes the timing results.
For example, letting x = np.random.rand(2500, 4), the timings are:
10 loops, best of 3: 95 ms per loop
1000 loops, best of 3: 355 µs per loop
10000 loops, best of 3: 131 µs per loop
1000 loops, best of 3: 337 µs per loop
And letting x = np.random.rand(1000, 1000), Sigma1 = np.random.rand(1000, 1000), and mean1 = np.random.rand(1000), the timings are:
1 loops, best of 3: 1min 24s per loop
1 loops, best of 3: 2.39 s per loop
10 loops, best of 3: 155 ms per loop
10 loops, best of 3: 99.9 ms per loop
Edit: I noticed that one of the other answers used the Cholesky decomposition.
Given that Sigma is symmetric and positive definite, we can actually do better than my above results.
There are some good routines from BLAS and LAPACK available through SciPy that can work with symmetric positive-definite matrices.
Here are two faster versions.
from scipy.linalg.fblas import dsymm
def mahalanobis_sqdist5(x, mean, Sigma_inv):
xdiff = x - mean
Sigma_inv = la.inv(Sigma)
return np.einsum('...i,...i->...',dsymm(1., Sigma_inv, xdiff.T).T, xdiff)
from scipy.linalg.flapack import dposv
def mahalanobis_sqdist6(x, mean, Sigma):
xdiff = x - mean
return np.einsum('...i,...i->...', xdiff, dposv(Sigma, xdiff.T)[1].T)
The first one still inverts Sigma.
If you pre-compute the inverse and reuse it, it is much faster (the 1000x1000 case takes 35.6ms on my machine with the pre-computed inverse).
I also used einsum to take the product then sum along the last axis.
This ended up being marginally faster than doing something like (A * B).sum(axis=-1).
These two functions give the following timings:
First test case:
10000 loops, best of 3: 55.3 µs per loop
100000 loops, best of 3: 14.2 µs per loop
Second test case:
10000 loops, best of 3: 121 µs per loop
10000 loops, best of 3: 79 µs per loop
Third test case:
10 loops, best of 3: 92.5 ms per loop
10 loops, best of 3: 48.2 ms per loop
Just saw a really nice comment on reddit that might speed things up even a little more:
This is not surprising to anyone who uses numpy regularly. For loops
in python are horribly slow. Actually, einsum is pretty slow too.
Here's a version that is faster if you have lots of vectors (500
vectors in 4 dimensions is enough to make this version faster than
einsum on my machine):
def no_einsum(d, mean, Sigma):
L_inv = np.linalg.inv(numpy.linalg.cholesky(Sigma))
xdiff = d - mean
return np.sum(np.dot(xdiff, L_inv.T)**2, axis=1)
If your points are also high dimensional then computing the inverse is
slow (and generally a bad idea anyway) and you can save time by
solving the system directly (500 vectors in 250 dimensions is enough
to make this version the fastest on my machine):
def no_einsum_solve(d, mean, Sigma):
L = numpy.linalg.cholesky(Sigma)
xdiff = d - mean
return np.sum(np.linalg.solve(L, xdiff.T)**2, axis=0)
The problem is that np.vectorize vectorizes over all arguments, but you need to vectorize only over the first one. You need to use excluded keyword argument to vectorize:
np.vectorize(mahalanobis_sqdist, excluded=[1, 2])