I have this function to calculate squared Mahalanobis distance of vector x to mean:
def mahalanobis_sqdist(x, mean, Sigma):
    '''
    Calculates squared Mahalanobis Distance of vector x
    to distribution's mean
    '''
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
    return sqmdist
I have a NumPy array with shape (25, 4), and I want to apply that function to all 25 rows without a for loop. Basically, how can I write the vectorized form of this loop:
for r in d1:
    mahalanobis_sqdist(r[0:4], mean1, Sig1)
where mean1 and Sig1 are :
>>> mean1
array([ 5.028, 3.48 , 1.46 , 0.248])
>>> Sig1 = np.cov(d1[0:25, 0:4].T)
>>> Sig1
array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
I have tried the following but it didn't work:
>>> vecdist = np.vectorize(mahalanobis_sqdist)
>>> vecdist(d1, mean1, Sig1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1862, in __call__
theout = self.thefunc(*newargs)
File "<stdin>", line 6, in mahalanobis_sqdist
File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range
To apply a function to each row of an array, you could use:
np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
In this case, however, there is a better way. You don't have to apply a function to each row. Instead, you can apply NumPy operations to the entire d1 array to calculate the same result. np.einsum can replace the for-loop and the two calls to np.dot:
def mahalanobis_sqdist2(d, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = d - mean
    return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)
Here are some benchmarks:
import numpy as np
np.random.seed(1)
def mahalanobis_sqdist(x, mean, Sigma):
    '''
    Calculates squared Mahalanobis Distance of vector x
    to distribution's mean
    '''
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
    return sqmdist
def mahalanobis_sqdist2(d, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = d - mean
    return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)
def using_loop(d1, mean, Sigma):
    expected = []
    for r in d1:
        # use the function's own arguments rather than the globals mean1 and Sig1
        expected.append(mahalanobis_sqdist(r[0:4], mean, Sigma))
    return np.array(expected)
d1 = np.random.random((25,4))
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
Sig1 = np.cov(d1[0:25, 0:4].T)
expected = using_loop(d1, mean1, Sig1)
result = np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
result2 = mahalanobis_sqdist2(d1, mean1, Sig1)
assert np.allclose(expected, result)
assert np.allclose(expected, result2)
In [92]: %timeit mahalanobis_sqdist2(d1, mean1, Sig1)
10000 loops, best of 3: 31.1 µs per loop
In [94]: %timeit using_loop(d1, mean1, Sig1)
1000 loops, best of 3: 569 µs per loop
In [91]: %timeit np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
1000 loops, best of 3: 806 µs per loop
Thus mahalanobis_sqdist2 is about 18x faster than a for-loop, and 26x faster than using np.apply_along_axis.
Note that np.apply_along_axis, np.vectorize, np.frompyfunc are Python utility functions. Under the hood they use for- or while-loops. There is no real "vectorization" going on here. They can provide syntactic assistance, but don't expect them to make your code perform any better than a for-loop you write yourself.
The answer by @unutbu works very nicely for applying any function to the rows of an array.
In this particular case, there are some mathematical symmetries you can use that will speed things up considerably if you are working with large arrays.
Here is a modified version of your function:
def mahalanobis_sqdist3(x, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    return (xdiff.dot(Sigma_inv)*xdiff).sum(axis=-1)
If you end up using any sort of large Sigma, I would recommend that you cache Sigma_inv and pass that in as an argument to your function instead.
Since it is 4x4 in this example, this doesn't matter.
I'll show how to deal with large Sigma anyway for anyone else who comes across this.
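Before that, here is a minimal sketch of the caching suggestion above. The name mahalanobis_sqdist_cached and the sample data are made up for illustration; the only idea is that the caller inverts Sigma once and passes the inverse in:

```python
import numpy as np

def mahalanobis_sqdist_cached(x, mean, Sigma_inv):
    # Sigma_inv is assumed to be np.linalg.inv(Sigma), computed once by the caller.
    xdiff = x - mean
    return (xdiff.dot(Sigma_inv) * xdiff).sum(axis=-1)

# Illustrative data only (not from the question).
rng = np.random.default_rng(0)
data = rng.random((25, 4))
mean = data.mean(axis=0)
Sigma = np.cov(data.T)

Sigma_inv = np.linalg.inv(Sigma)  # invert once ...
batches = [rng.random((10, 4)) for _ in range(3)]
# ... and reuse the inverse for every batch of rows.
results = [mahalanobis_sqdist_cached(b, mean, Sigma_inv) for b in batches]
```

Each entry of results holds one squared distance per row of the corresponding batch.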
If you aren't going to be using the same Sigma repeatedly, you won't be able to cache it, so, instead of inverting the matrix, you could use a different method to solve the linear system.
Here I'll use the LU decomposition built in to SciPy.
This only improves the time if the number of columns of x is large relative to its number of rows.
Here is a function that shows that approach:
from scipy.linalg import lu_factor, lu_solve
def mahalanobis_sqdist4(x, mean, Sigma):
    xdiff = x - mean
    Sigma_inv = lu_factor(Sigma)
    return (xdiff.T*lu_solve(Sigma_inv, xdiff.T)).sum(axis=0)
Here are some timings.
I'll include the version with einsum as mentioned in the other answer.
import numpy as np
Sig1 = np.array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
x = np.random.rand(25, 4)
%timeit np.apply_along_axis(mahalanobis_sqdist, 1, x, mean1, Sig1)
%timeit mahalanobis_sqdist2(x, mean1, Sig1)
%timeit mahalanobis_sqdist3(x, mean1, Sig1)
%timeit mahalanobis_sqdist4(x, mean1, Sig1)
giving:
1000 loops, best of 3: 973 µs per loop
10000 loops, best of 3: 36.2 µs per loop
10000 loops, best of 3: 40.8 µs per loop
10000 loops, best of 3: 83.2 µs per loop
However, changing the sizes of the arrays involved changes the timing results.
For example, letting x = np.random.rand(2500, 4), the timings are:
10 loops, best of 3: 95 ms per loop
1000 loops, best of 3: 355 µs per loop
10000 loops, best of 3: 131 µs per loop
1000 loops, best of 3: 337 µs per loop
And letting x = np.random.rand(1000, 1000), Sig1 = np.random.rand(1000, 1000), and mean1 = np.random.rand(1000), the timings are:
1 loops, best of 3: 1min 24s per loop
1 loops, best of 3: 2.39 s per loop
10 loops, best of 3: 155 ms per loop
10 loops, best of 3: 99.9 ms per loop
Edit: I noticed that one of the other answers used the Cholesky decomposition.
Given that Sigma is symmetric and positive definite, we can actually do better than my above results.
There are some good routines from BLAS and LAPACK available through SciPy that can work with symmetric positive-definite matrices.
Here are two faster versions.
from scipy.linalg.blas import dsymm  # scipy.linalg.fblas in older SciPy releases
def mahalanobis_sqdist5(x, mean, Sigma):
    xdiff = x - mean
    Sigma_inv = np.linalg.inv(Sigma)
    return np.einsum('...i,...i->...', dsymm(1., Sigma_inv, xdiff.T).T, xdiff)
from scipy.linalg.lapack import dposv  # scipy.linalg.flapack in older SciPy releases
def mahalanobis_sqdist6(x, mean, Sigma):
    xdiff = x - mean
    return np.einsum('...i,...i->...', xdiff, dposv(Sigma, xdiff.T)[1].T)
The first one still inverts Sigma.
If you pre-compute the inverse and reuse it, it is much faster (the 1000x1000 case takes 35.6ms on my machine with the pre-computed inverse).
I also used einsum to take the product then sum along the last axis.
This ended up being marginally faster than doing something like (A * B).sum(axis=-1).
These two functions give the following timings:
First test case:
10000 loops, best of 3: 55.3 µs per loop
100000 loops, best of 3: 14.2 µs per loop
Second test case:
10000 loops, best of 3: 121 µs per loop
10000 loops, best of 3: 79 µs per loop
Third test case:
10 loops, best of 3: 92.5 ms per loop
10 loops, best of 3: 48.2 ms per loop
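As a small illustration of the einsum remark above (multiply, then sum along the last axis), the two spellings below should give the same row-wise dot products. The arrays are arbitrary examples, and this sketch does not try to reproduce the timing difference:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))
B = rng.random((6, 4))

# Product of matching entries, summed over the last axis, written two ways.
row_dots_einsum = np.einsum('...i,...i->...', A, B)
row_dots_sum = (A * B).sum(axis=-1)
```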
Just saw a really nice comment on reddit that might speed things up even a little more:
This is not surprising to anyone who uses numpy regularly. For loops
in python are horribly slow. Actually, einsum is pretty slow too.
Here's a version that is faster if you have lots of vectors (500
vectors in 4 dimensions is enough to make this version faster than
einsum on my machine):
def no_einsum(d, mean, Sigma):
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma))
    xdiff = d - mean
    return np.sum(np.dot(xdiff, L_inv.T)**2, axis=1)
If your points are also high dimensional then computing the inverse is
slow (and generally a bad idea anyway) and you can save time by
solving the system directly (500 vectors in 250 dimensions is enough
to make this version the fastest on my machine):
def no_einsum_solve(d, mean, Sigma):
    L = np.linalg.cholesky(Sigma)
    xdiff = d - mean
    return np.sum(np.linalg.solve(L, xdiff.T)**2, axis=0)
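A possible variation on the quoted no_einsum_solve, offered only as a sketch: since the Cholesky factor L is lower triangular, scipy.linalg.solve_triangular can be used in place of the general np.linalg.solve. The function name cholesky_sqdist and the sample data are assumptions for illustration, Sigma is assumed to be symmetric positive definite, and no timing claim is made here:

```python
import numpy as np
from scipy.linalg import solve_triangular

def cholesky_sqdist(d, mean, Sigma):
    # Sigma = L @ L.T with L lower triangular, so
    # xdiff^T Sigma^{-1} xdiff = ||L^{-1} xdiff||^2.
    L = np.linalg.cholesky(Sigma)
    xdiff = d - mean
    z = solve_triangular(L, xdiff.T, lower=True)  # solves L z = xdiff.T
    return np.sum(z**2, axis=0)

# Illustrative data only.
rng = np.random.default_rng(0)
pts = rng.random((50, 4))
mu = pts.mean(axis=0)
Sigma = np.cov(pts.T)
dists = cholesky_sqdist(pts, mu, Sigma)
```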
The problem is that np.vectorize vectorizes over all arguments, but you need to vectorize only over the first one. You need to use the excluded keyword argument of np.vectorize:
np.vectorize(mahalanobis_sqdist, excluded=[1, 2])
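One caveat, as far as I can tell: with excluded alone, np.vectorize still hands the function one scalar element of d1 at a time rather than a whole row. A sketch of one way around this is to give np.vectorize a signature so that it iterates over rows. The closure and the random d1 below are illustrative assumptions, and this is still a Python-level loop under the hood:

```python
import numpy as np

def mahalanobis_sqdist(x, mean, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    xdiff = x - mean
    return np.dot(np.dot(xdiff, Sigma_inv), xdiff)

# Illustrative stand-ins for the question's d1, mean1 and Sig1.
rng = np.random.default_rng(0)
d1 = rng.random((25, 4))
mean1 = d1.mean(axis=0)
Sig1 = np.cov(d1.T)

# '(n)->()' means: take a length-n vector, return a scalar.
# mean1 and Sig1 are bound in the closure instead of being vectorized over.
row_dist = np.vectorize(lambda r: mahalanobis_sqdist(r, mean1, Sig1),
                        signature='(n)->()')
vec_result = row_dist(d1)

# Reference: the explicit loop from the question.
loop_result = np.array([mahalanobis_sqdist(r, mean1, Sig1) for r in d1])
```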
Related
For a project, I need to generate samples from a function. I would like to be able to generate those samples as quickly as possible.
I have this example (in the final version, the lambda function will be provided as an argument). The goal is to generate the ys for n linearly spaced xs between start and stop using the lambda function.
def get_ys(coefficients, num_outputs=20, start=0., stop=1.):
    function = lambda x, args: args[0]*(x-args[1])**2 + args[2]*(x-args[3]) + args[4]
    xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
    ys = [function(x, coefficients) for x in xs]
    return ys
%%time
n = 1000
xs = np.random.random((n,5))
ys = np.apply_along_axis(get_ys, 1, xs)
Wall time: 616 ms
I am trying to vectorize it, and found numpy.apply_along_axis
%%time
for i in range(1000):
    xs = np.random.random(5)
    ys = get_ys(xs)
Wall time: 622 ms
Unfortunately it is still pretty slow :/
I am not so familiar with function vectorization, can someone guide me a little bit on how to improve the speed of the script ?
Thanks!
Edit:
example of input/output:
xs = np.ones(5)
ys = get_ys(xs)
[1.0, 0.9501385041551247, 0.9058171745152355, 0.8670360110803323, 0.8337950138504155,0.8060941828254848, 0.7839335180055402, 0.7673130193905817, 0.7562326869806094, 0.7506925207756232, 0.7506925207756232, 0.7562326869806094, 0.7673130193905817, 0.7839335180055401, 0.8060941828254847, 0.8337950138504155, 0.8670360110803323, 0.9058171745152354, 0.9501385041551246, 1.0]
def get_ys(coefficients, num_outputs=20, start=0., stop=1.):
    function = lambda x, args: args[0]*(x-args[1])**2 + args[2]*(x-args[3]) + args[4]
    xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
    ys = [function(x, coefficients) for x in xs]
    return ys
You are trying to get around calling get_ys 1000 times, once for each row of xs.
What will it take to pass xs as a whole to get_ys? In other words, what if coefficients was (n,5) instead of (5,)?
xs is (20,), and the ys will be same (right)?
The lambda is written to expect a scalar x and (5,) args. Can it be changed to work with a (20,) x and (n,5) args?
As a first step, what does function produce if given xs? That is, instead of
ys = [function(x, coefficients) for x in xs]
use
ys = function(xs, coefficients)
As written, your code iterates (at slow Python speeds) over the n (1000) rows and the 20 linspace points, so function is called 20,000 times. That's what makes your code slow.
Let's try that change.
A sample run with your function:
In [126]: np.array(get_ys(np.arange(5)))
Out[126]:
array([-2. , -1.89473684, -1.78947368, -1.68421053, -1.57894737,
-1.47368421, -1.36842105, -1.26315789, -1.15789474, -1.05263158,
-0.94736842, -0.84210526, -0.73684211, -0.63157895, -0.52631579,
-0.42105263, -0.31578947, -0.21052632, -0.10526316, 0. ])
Replace the list comprehension with just one call to function:
In [127]: def get_ys1(coefficients, num_outputs=20, start=0., stop=1.):
     ...:     function = lambda x, args: args[0]*(x-args[1])**2 + args[2]*(x-args[3]) + args[4]
     ...:
     ...:     xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
     ...:     ys = function(xs, coefficients)
     ...:     return ys
     ...:
Same values:
In [128]: get_ys1(np.arange(5))
Out[128]:
array([-2. , -1.89473684, -1.78947368, -1.68421053, -1.57894737,
-1.47368421, -1.36842105, -1.26315789, -1.15789474, -1.05263158,
-0.94736842, -0.84210526, -0.73684211, -0.63157895, -0.52631579,
-0.42105263, -0.31578947, -0.21052632, -0.10526316, 0. ])
Comparative timings:
In [129]: timeit np.array(get_ys(np.arange(5)))
345 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: timeit get_ys1(np.arange(5))
89.2 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That's what we mean by "vectorization" - replacing Python-level iterations (a list comprehension) with an equivalent that makes fuller use of numpy array methods.
I suspect we can move on to work with a (n,5) coefficients, but this should be enough to get you started.
fully vectorized
By broadcasting the (n,5) against (20,) I can get a function that does not have any python loops:
def get_ys2(coefficients, num_outputs=20, start=0., stop=1.):
    function = lambda x, args: args[:,0]*(x-args[:,1])**2 + args[:,2]*(x-args[:,3]) + args[:,4]
    xs = np.linspace(start, stop, num=num_outputs, endpoint=True)
    ys = function(xs[:,None], coefficients)
    return ys.T
And with a (1,5) input:
In [156]: get_ys2(np.arange(5)[None,:])
Out[156]:
array([[-2. , -1.89473684, -1.78947368, -1.68421053, -1.57894737,
-1.47368421, -1.36842105, -1.26315789, -1.15789474, -1.05263158,
-0.94736842, -0.84210526, -0.73684211, -0.63157895, -0.52631579,
-0.42105263, -0.31578947, -0.21052632, -0.10526316, 0. ]])
With your test case:
In [146]: n = 1000
...: xs = np.random.random((n,5))
...: ys = np.apply_along_axis(get_ys, 1, xs)
In [147]: ys.shape
Out[147]: (1000, 20)
Two timings:
In [148]: timeit ys = np.apply_along_axis(get_ys, 1, xs)
...:
106 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [149]: timeit ys = np.apply_along_axis(get_ys1, 1, xs)
...:
88 ms ± 98.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and testing this
In [150]: ys2 = get_ys2(xs)
In [151]: ys2.shape
Out[151]: (1000, 20)
In [152]: np.allclose(ys, ys2)
Out[152]: True
In [153]: timeit ys2 = get_ys2(xs)
424 µs ± 484 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It matches values, and improves speed a lot.
In the new function, args can now be (n,5). And if x is (20,1), the result is (20,n), which I transpose on the return.
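The shape reasoning above can also be arranged so that no transpose is needed. The sketch below is an assumed variant (get_ys_rows is not from the original posts): each coefficient column is sliced to shape (n,1) and broadcast against xs of shape (20,), so the result comes out as (n,20) directly:

```python
import numpy as np

def get_ys_rows(coefficients, num_outputs=20, start=0., stop=1.):
    # coefficients has shape (n, 5); c[:, k, None] has shape (n, 1).
    c = np.asarray(coefficients)
    xs = np.linspace(start, stop, num=num_outputs, endpoint=True)  # (num_outputs,)
    c0, c1, c2, c3, c4 = (c[:, k, None] for k in range(5))
    # (n, 1) combined with (num_outputs,) broadcasts to (n, num_outputs).
    return c0 * (xs - c1)**2 + c2 * (xs - c3) + c4

coeffs = np.random.default_rng(0).random((1000, 5))
ys = get_ys_rows(coeffs)
```

Which axis gets the extra dimension is a matter of taste; the arithmetic is the same as in get_ys2.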
I would like to write a function f that takes an arbitrary function g with the signature g : R^n -> R^n -> int that "lifts" g so that it operates on (R^{nxm}, R^{kxm}) by behaving like a dot product. Meaning I want f to have the signature f : R^{nxm} -> R^{mxk} -> R^{nxk} by applying g to all pairs of rows and columns in constructing a matrix M where M_ij = g(A[i,:], B[:,j]).
Is that possible?
For example scipy.spatial.distance.cosine expects two vectors. Now I would lift cosine with f:
from scipy.spatial.distance import cosine
A = np.random.randint(0, 3, (3,4))
B = np.random.randint(0, 3, (5,4))
cosine_lifted = f(cosine)
cosine_lifted(A, B)
This would then produce the same output as
def sim(A, B):
    ignored_states = np.seterr(divide='raise')
    return 1 - np.divide(np.dot(A, B.T), np.outer(np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)))
Which is the same as sklearn.metrics.pairwise.cosine_similarity plus the 1 - blah part.
But if there were no sklearn.metrics.pairwise.cosine_similarity, I would have to implement this lifted version of cosine myself (which I of course did here...). But I don't want to do that for every function that behaves basically like the dot product in how it mechanically processes its arguments. Therefore, I would like to have this f function.
I wrote my other answer assuming your
np.dot(A, B.T)
with a (3,4) and (5,4) inputs was the primary dot functionality that you were trying to emulate. In other words, (3,4), (4,5) => (3,5) with summation on the common size 4 dimension. My answer showed how that 2d calculation can be performed with element-wise multiplications.
For what it's worth, np.dot gets much of its speed by passing the task to BLAS (or similar) optimized libraries. These have been written in C or Fortran, and optimized by generations of numerical-analysis coders.
But your signature description may be talking about a different thing. It's a bit confusing.
g : R^n -> R^n -> int
Does this mean that g(x,y) takes two (n,) shape arrays, and returns an integer? And it can't be generalized to work with 2d arrays?
f : R^{nxm} -> R^{kxm} -> R^{nxm}
Does this mean f(A, B) takes a (n,m) shape, and a (k,m) shape, and returns a (n,m) shape? What happened to the k shape? Is that k a typo?
Alternatively you talk about doing (I believe)
M = np.zeros((N,N)) # (N,M) ok?
for i in range(N):
    for j in range(N):
        x = A[i,:]; y = B[:,j]
        M[i,j] = g(x, y)
alternatively:
M = np.array([[g(x,y) for y in B.T] for x in A])
Assuming g is a Python function that can only work with two 1d arrays (of matching length), and cannot be generalized to 2d arrays, there isn't any mechanism in numpy to compile the above double loop. g has to be evaluated N**2 times. And assuming g is not trivial, those N**2 evaluations will dominate the total evaluation time, not the iteration mechanism.
np.vectorize normally takes a function that accepts scalar inputs, but with a signature parameter it can work with your g:
f = np.vectorize(g, signature='(n),(n)->()')
M = f(A[:, None, :], B.T[None, :, :])  # broadcast so each row of A meets each column of B
but in my testing vectorize has always been slower than an explicit iteration. With a signature it's even slower. So I kind of hesitate even mentioning it.
Are you asking for a function with a signature as simple as what follows (able to multiply two matrices), or do you want to emulate the entire np.dot api surface?
def lift(f):
    def dot(A, B):
        return np.array([[f(v,w) for w in zip(*B)] for v in A])
    return dot
A major source of inefficiency in the above code is the allocations for all the intermediate lists. Since we know the final return value those are easy to avoid:
def lift(f):
    def dot(A, B):
        result = np.empty((A.shape[0], B.shape[1]))
        for i,v in enumerate(A):
            for j,w in enumerate(zip(*B)):
                result[i,j] = f(v,w)
        return result
    return dot
Loops are fairly expensive in Python, but since f is operating on k elements it seems reasonable to assume that this overhead is small. You could reduce it further by compiling with pypy or cython.
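For the cosine example from the question specifically, a possible shortcut is scipy.spatial.distance.cdist, which accepts a callable metric and applies it to every pair of rows. This is only a sketch with made-up random inputs; note that cdist pairs rows of A with rows of B, so B is passed as (5,4) as in the question rather than transposed, and with a Python callable it still evaluates the function once per pair:

```python
import numpy as np
from scipy.spatial.distance import cdist, cosine

# Illustrative inputs; the offset keeps every row away from the zero vector.
rng = np.random.default_rng(0)
A = rng.random((3, 4)) + 0.1
B = rng.random((5, 4)) + 0.1

# M[i, j] = cosine(A[i], B[j]) for every pair of rows.
M = cdist(A, B, metric=cosine)

# Reference: the same matrix built with an explicit double loop.
M_loop = np.array([[cosine(a, b) for b in B] for a in A])
```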
matmul has been cast as a ufunc, and formally has a signature. np.dot is an earlier version and doesn't have a signature.
But given 2d arrays, np.dot is effectively a broadcasted form of multiplication followed by summation, or 'sum of products':
In [587]: A = np.arange(12).reshape(3,4)
In [588]: B = np.arange(8).reshape(2,4)
In [589]: np.dot(A, B.T)
Out[589]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
equivalent:
In [591]: (A[:,None,:]*B[None,:,:]).sum(axis=2)
Out[591]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
Some find the einsum style of signature easier to follow:
In [594]: np.einsum('ij,kj->ik', A, B)
Out[594]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
where the repeated j signals dot like summation.
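The same broadcast-then-reduce pattern can be sketched for a function other than multiplication. As an assumed example (not from the question), a pairwise sum of absolute differences replaces the product with a difference and keeps the sum over the shared axis:

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
B = np.arange(8).reshape(2, 4)

# (3,1,4) against (1,2,4) broadcasts to (3,2,4); summing axis 2 leaves (3,2).
pairwise_l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

# Reference: explicit iteration over all pairs of rows.
loop_l1 = np.array([[np.abs(x - y).sum() for y in B] for x in A])
```

This only works when g can be expressed with whole-array operations; the intermediate (3,2,4) array can get large for big inputs.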
===
Illustrating the iteration in my other answer:
In [601]: def g(x,y):
     ...:     return (x*y).sum()
     ...:
In [602]: A.shape, B.shape
Out[602]: ((3, 4), (2, 4))
In [603]: np.array([[g(x,y) for y in B] for x in A])
Out[603]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
and the vectorize version:
In [614]: f = np.vectorize(g, signature='(n),(n)->()')
In [615]: f(A[:,None,:], B[None,:,:])
Out[615]:
array([[ 14, 38],
[ 38, 126],
[ 62, 214]])
comparative times:
In [616]: timeit f(A[:,None,:], B[None,:,:])
255 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [617]: timeit np.array([[g(x,y) for y in B] for x in A])
69.4 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [618]: timeit np.dot(A, B.T)
3.15 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and using @hans' 2nd lift:
In [623]: h = lift(g)
In [624]: h(A,B.T)
Out[624]:
array([[ 14., 38.],
[ 38., 126.],
[ 62., 214.]])
In [625]: timeit h(A,B.T)
102 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a conceptual idea of what I mean, I have 2 data points:
x_0 = np.array([0.6, 1.4])[:, None]
x_1 = np.array([2.6, 3.4])[:, None]
And a 2x2 matrix:
y = np.array([[2, 2], [2, 2]])
If I perform x_0.T @ y @ x_0, I get array([[ 8.]]). Similarly, x_1.T @ y @ x_1 returns array([[ 72.]]).
But is there a way to perform both of these calculations in one go, without a for loop? Obviously the speed-up here is negligible, but I am working with many more data points than presented here.
With x as the column stacked version of x_0, x_1 and so on, we can use np.einsum -
np.einsum('ji,jk,ki->i',x,y,x)
With a mix of np.einsum and matrix multiplication -
np.einsum('ij,ji->i',x.T.dot(y),x)
As stated earlier, x was assumed to be column-stacked, like so :
x = np.column_stack((x_0, x_1))
Runtime test -
In [236]: x = np.random.randint(0,255,(3,100000))
In [237]: y = np.random.randint(0,255,(3,3))
# Proposed in @titipata's post/comments under this post
In [238]: %timeit (x.T.dot(y)*x.T).sum(1)
100 loops, best of 3: 3.45 ms per loop
# Proposed earlier in this post
In [239]: %timeit np.einsum('ji,jk,ki->i',x,y,x)
1000 loops, best of 3: 832 µs per loop
# Proposed earlier in this post
In [240]: %timeit np.einsum('ij,ji->i',x.T.dot(y),x)
100 loops, best of 3: 2.6 ms per loop
Basically, you want to do the operation (x.T).dot(y).dot(x) for all x that you have.
x_0 = np.array([0.6, 1.4])[:, None]
x_1 = np.array([2.6, 3.4])[:, None]
x = np.hstack((x_0, x_1)) # [[ 0.6 2.6], [ 1.4 3.4]]
The easy way to think about it is to do multiplication for all x_i that you have with y as
[x_i.dot(y).dot(x_i) for x_i in x.T]
>> [8.0, 72.0]
But of course this is not very efficient. Instead, you can take the dot product of x with y first, multiply the result elementwise by x, and sum over the columns, i.e. perform the second dot product manually. This makes the calculation much faster:
x = x.T
(x.dot(y) * x).sum(axis=1)
>> array([ 8., 72.])
Note that I transpose the matrix first because we want to multiply each row of x by the columns of y.
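Another way to write the same batch of quadratic forms, sketched here as an alternative rather than a recommendation, is to stack the column vectors and let the @ operator broadcast over the leading batch axis. The name X and the stacking layout are assumptions of this sketch:

```python
import numpy as np

x_0 = np.array([0.6, 1.4])[:, None]
x_1 = np.array([2.6, 3.4])[:, None]
y = np.array([[2, 2], [2, 2]])

# Stack the column vectors into a batch of shape (2, 2, 1).
X = np.stack([x_0, x_1])

# (2,1,2) @ (2,2) @ (2,2,1) gives (2,1,1): one x.T @ y @ x per batch entry.
quad = (X.transpose(0, 2, 1) @ y @ X).reshape(-1)
# quad holds the same two values as in the question, approximately 8 and 72.
```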
The question is more focused on performance of calculation.
I have two arrays of vectors, that is, each has a third dimension of depth 3 holding the X, Y, Z components. Each element of one array has to be dotted with the element at the same position in the other array.
A simple but inefficient implementation is:
import math
import numpy as np
a = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
b = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
c = np.zeros((1000,1000))
numRow,numCol,numDepth = np.shape(a)
for idRow in range(numRow):
    for idCol in range(numCol):
        # Angle in radians
        c[idRow,idCol] = math.acos(a[idRow,idCol,0]*b[idRow,idCol,0] + a[idRow,idCol,1]*b[idRow,idCol,1] + a[idRow,idCol,2]*b[idRow,idCol,2])
However, NumPy functions can speed up the calculation considerably, for example:
# Angle in radians
d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) + np.multiply(a[:,:,1],b[:,:,1]) + np.multiply(a[:,:,2],b[:,:,2]))
However, I would like to know whether there is other syntax that improves on the version above, perhaps using other functions or indexing.
First code takes 4.658s while second takes 0.354s
You can do this with np.einsum, which multiplies and then sums over any axes:
np.arccos(np.einsum('ijk,ijk->ij', a, b))
The more straightforward way to do what you posted in the question is to use np.sum, where you sum along the last axis (-1):
np.arccos(np.sum(a*b, -1))
They all give the same answer but einsum is the fastest and sum is next:
In [36]: timeit np.arccos(np.einsum('ijk,ijk->ij', a, b))
10000 loops, best of 3: 20.4 µs per loop
In [37]: timeit e = np.arccos(np.sum(a*b, -1))
10000 loops, best of 3: 29.8 µs per loop
In [38]: %%timeit
....: d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) +
....: np.multiply(a[:,:,1],b[:,:,1]) +
....: np.multiply(a[:,:,2],b[:,:,2]))
....:
10000 loops, best of 3: 34.6 µs per loop
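One thing to watch with the sample data from the question: the random vectors are not unit length, so their dot product can fall outside [-1, 1], where np.arccos returns nan (and math.acos raises an error). Assuming the intent is the angle between the two vectors at each position, a hedged sketch is to normalize first and clip before taking the arccos. The smaller array size here is just to keep the example quick:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=(100, 100, 3))
b = rng.uniform(-1.0, 1.0, size=(100, 100, 3))

# Scale every 3-vector to unit length along the last axis.
a_unit = a / np.linalg.norm(a, axis=-1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=-1, keepdims=True)

# Dot product at each (i, j) position, clipped to guard against rounding.
cos_angle = np.einsum('ijk,ijk->ij', a_unit, b_unit)
angles = np.arccos(np.clip(cos_angle, -1.0, 1.0))  # radians
```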
The Pythran compiler can further optimize your original expression by removing temporary arrays, using SIMD instructions, and using multithreading.
As showcased by this example:
$ cat cross.py
#pythran export cross(float[][][], float[][][])
import numpy as np
def cross(a,b):
    return np.arccos(np.multiply(a[:, :, 0], b[:, :, 0]) + np.multiply(a[:, :, 1],b[:, :, 1]) + np.multiply(a[:, :, 2], b[:, :, 2]))
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
10 loops, best of 3: 35.4 msec per loop
$ pythran cross.py -DUSE_BOOST_SIMD -fopenmp -march=native
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
100 loops, best of 3: 11.8 msec per loop
I have a numpy array containing 10^8 floats and want to count how many of them are >= a given threshold. Speed is crucial because the operation has to be done on large numbers of such arrays. The contestants so far are
np.sum(myarray >= thresh)
np.size(np.where(np.reshape(myarray,-1) >= thresh))
The answers at Count all values in a matrix greater than a value suggest that np.where() would be faster, but I've found inconsistent timing results. What I mean by this is for some realizations and Boolean conditions np.size(np.where(cond)) is faster than np.sum(cond), but for some it is slower.
Specifically, if a large fraction of entries fulfil the condition then np.sum(cond) is significantly faster but if a small fraction (maybe less than a tenth) do then np.size(np.where(cond)) wins.
The question breaks down into 2 parts:
Any other suggestions?
Does it make sense that the time taken by np.size(np.where(cond)) increases with the number of entries for which cond is true?
Using cython might be a decent alternative.
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange
DTYPE_f64 = np.float64
ctypedef np.float64_t DTYPE_f64_t
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef int count_above_cython(DTYPE_f64_t [:] arr_view, DTYPE_f64_t thresh) nogil:
    cdef int length, i, total
    total = 0
    length = arr_view.shape[0]
    for i in prange(length):
        if arr_view[i] >= thresh:
            total += 1
    return total
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def count_above(np.ndarray arr, DTYPE_f64_t thresh):
    cdef DTYPE_f64_t [:] arr_view = arr.ravel()
    cdef int total
    with nogil:
        total = count_above_cython(arr_view, thresh)
    return total
Timing of different proposed methods.
myarr = np.random.random((1000,1000))
thresh = 0.33
In [6]: %timeit count_above(myarr, thresh)
1000 loops, best of 3: 693 µs per loop
In [9]: %timeit np.count_nonzero(myarr >= thresh)
100 loops, best of 3: 4.45 ms per loop
In [11]: %timeit np.sum(myarr >= thresh)
100 loops, best of 3: 4.86 ms per loop
In [12]: %timeit np.size(np.where(np.reshape(myarr,-1) >= thresh))
10 loops, best of 3: 61.6 ms per loop
With a larger array:
In [13]: myarr = np.random.random(10**8)
In [14]: %timeit count_above(myarr, thresh)
10 loops, best of 3: 63.4 ms per loop
In [15]: %timeit np.count_nonzero(myarr >= thresh)
1 loops, best of 3: 473 ms per loop
In [16]: %timeit np.sum(myarr >= thresh)
1 loops, best of 3: 511 ms per loop
In [17]: %timeit np.size(np.where(np.reshape(myarr,-1) >= thresh))
1 loops, best of 3: 6.07 s per loop
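On the second part of the question: np.where has to build index arrays whose size grows with the number of True entries, so it is plausible that its cost rises as more entries satisfy the condition. If Cython is not an option, one pure-NumPy idea is to count in chunks so the temporary Boolean array stays small. This is an untimed sketch; count_above_chunked and the chunk size are assumptions, and whether it beats np.count_nonzero on the whole array will depend on the machine:

```python
import numpy as np

def count_above_chunked(arr, thresh, chunk=1 << 20):
    # Flatten (a view when the array is contiguous) and count chunk by chunk,
    # so only a chunk-sized Boolean temporary exists at any time.
    flat = arr.reshape(-1)
    total = 0
    for start in range(0, flat.size, chunk):
        total += int(np.count_nonzero(flat[start:start + chunk] >= thresh))
    return total

myarr = np.random.default_rng(0).random(2_500_003)
thresh = 0.33
n_above = count_above_chunked(myarr, thresh)
```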