Fastest way to count array values above a threshold in numpy - python

I have a numpy array containing 10^8 floats and want to count how many of them are >= a given threshold. Speed is crucial because the operation has to be done on large numbers of such arrays. The contestants so far are
np.sum(myarray >= thresh)
np.size(np.where(np.reshape(myarray,-1) >= thresh))
The answers at Count all values in a matrix greater than a value suggest that np.where() would be faster, but I've found inconsistent timing results. What I mean by this is for some realizations and Boolean conditions np.size(np.where(cond)) is faster than np.sum(cond), but for some it is slower.
Specifically, if a large fraction of entries fulfil the condition then np.sum(cond) is significantly faster but if a small fraction (maybe less than a tenth) do then np.size(np.where(cond)) wins.
The question breaks down into 2 parts:
Any other suggestions?
Does it make sense that the time taken by np.size(np.where(cond)) increases with the number of entries for which cond is true?

Using cython might be a decent alternative.
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange
DTYPE_f64 = np.float64
ctypedef np.float64_t DTYPE_f64_t
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
cdef int count_above_cython(DTYPE_f64_t [:] arr_view, DTYPE_f64_t thresh) nogil:
cdef int length, i, total
total = 0
length = arr_view.shape[0]
for i in prange(length):
if arr_view[i] >= thresh:
total += 1
return total
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
def count_above(np.ndarray arr, DTYPE_f64_t thresh):
cdef DTYPE_f64_t [:] arr_view = arr.ravel()
cdef int total
with nogil:
total = count_above_cython(arr_view, thresh)
return total
Timing of different proposed methods.
myarr = np.random.random((1000,1000))
thresh = 0.33
In [6]: %timeit count_above(myarr, thresh)
1000 loops, best of 3: 693 µs per loop
In [9]: %timeit np.count_nonzero(myarr >= thresh)
100 loops, best of 3: 4.45 ms per loop
In [11]: %timeit np.sum(myarr >= thresh)
100 loops, best of 3: 4.86 ms per loop
In [12]: %timeit np.size(np.where(np.reshape(myarr,-1) >= thresh))
10 loops, best of 3: 61.6 ms per loop
With a larger array:
In [13]: myarr = np.random.random(10**8)
In [14]: %timeit count_above(myarr, thresh)
10 loops, best of 3: 63.4 ms per loop
In [15]: %timeit np.count_nonzero(myarr >= thresh)
1 loops, best of 3: 473 ms per loop
In [16]: %timeit np.sum(myarr >= thresh)
1 loops, best of 3: 511 ms per loop
In [17]: %timeit np.size(np.where(np.reshape(myarr,-1) >= thresh))
1 loops, best of 3: 6.07 s per loop

Related

Broadcasting N-dim array to (N+1)-dim array and summing on all but 1 dim

Assume you have a numpy array with shape (a,b,c) and a boolean mask of shape (a,b,c,d).
I would like to apply the mask to the array iterating over the last axis, sum the masked array along the first three axes, and obtain a list (or an array) of length/shape (d,).
I tried to do this with a list comprehension:
Result = [np.sum(Array[Mask[:,:,:,i]], axis=(0,1,2)) for i in range(d)]
It works, but it does not look very pythonic and it is a bit slow as well.
I also tried something like
Array = Array[:,:,:,np.newaxis]
Result = np.sum(Array[Mask], axis=(0,1,2))
but of course this doesn't work, since the dimension of the Mask along the last axis, d, is larger than the dimension of the last axis of the Array, 1.
Also, consider that each axis could have dimension of order 100 or 200, so repeating the Array d times along a new last axis using np.repeat would be really memory intensive, and I would like to avoid this.
Are there any other faster and more pythonic alternatives to the list comprehension?
The most straightforward way of broadcasting a N-dimensional array to a matching (N+1)-dimensional array is to use np.broadcast_to():
import numpy as np
arr = np.random.randint(0, 100, (2, 3))
mask = np.random.randint(0, 2, (2, 3, 4), dtype=bool)
b_arr = np.broadcast_to(arr[..., None], mask.shape)
print(mask.shape == b_arr.shape)
# True
However, as #hpaulj already pointed out, you cannot use mask for slicing b_arr without loosing the dimensions.
Given that you want to just sum the elements together and summing zeroes "does not hurt", you could simply multiply element-wise your array and your mask so as to keep the correct dimension but the elements that are False in the mask are irrelevant for the subsequent sum of the corresponding array elements:
result = np.sum(b_arr * mask, axis=tuple(range(mask.ndim - 1)))
or, since * will do the broadcasting automatically:
result = np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
without the need to use np.broadcast_to() in the first place (but you still need to match the number of dimension, i.e. using arr[..., None] and not just arr).
As #PaulPanzer already pointed out, since you want to sum up over all but one dimensions, this can be further simplified using np.matmul()/#:
result2 = arr.ravel() # mask.reshape(-1, mask.shape[-1])
print(np.all(result == result2))
# True
For fancier operations involving the summation, please have a look at np.einsum().
EDIT
The catch with broadcasting is that it will create temporary arrays during the evaluation of your expressions.
With the number you seems to be dealing with, I simply cannot use the broadcasted arrays as I run into MemoryError, but time-wise the element-wise multiplication may still be a better approach than what you originally proposed.
Alternatively, if you are after speed, you could do this at a somewhat lower level with explicit looping in Cython or Numba.
Below you can find a couple of Numba-based solutions (working on ravel()-ed data):
_vector_matrix_product(): does not use any temporary array
_vector_matrix_product_mp(): some as above but using parallel execution
_vector_matrix_product_sum(): uses np.sum() and parallel execution
import numpy as np
import numba as nb
#nb.jit(nopython=True)
def _vector_matrix_product(
vect_arr,
mat_arr,
result_arr):
rows, cols = mat_arr.shape
if vect_arr.shape == result_arr.shape:
for i in range(rows):
for j in range(cols):
result_arr[i] += vect_arr[j] * mat_arr[i, j]
else:
for i in range(rows):
for j in range(cols):
result_arr[j] += vect_arr[i] * mat_arr[i, j]
#nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_mp(
vect_arr,
mat_arr,
result_arr):
rows, cols = mat_arr.shape
if vect_arr.shape == result_arr.shape:
for i in nb.prange(rows):
for j in nb.prange(cols):
result_arr[i] += vect_arr[j] * mat_arr[i, j]
else:
for i in nb.prange(rows):
for j in nb.prange(cols):
result_arr[j] += vect_arr[i] * mat_arr[i, j]
#nb.jit(nopython=True, parallel=True)
def _vector_matrix_product_sum(
vect_arr,
mat_arr,
result_arr):
rows, cols = mat_arr.shape
if vect_arr.shape == result_arr.shape:
for i in nb.prange(rows):
result_arr[i] = np.sum(vect_arr * mat_arr[i, :])
else:
for j in nb.prange(cols):
result_arr[j] = np.sum(vect_arr * mat_arr[:, j])
def vector_matrix_product(
vect_arr,
mat_arr,
swap=False,
dtype=None,
mode=None):
rows, cols = mat_arr.shape
if not dtype:
dtype = (vect_arr[0] * mat_arr[0, 0]).dtype
if not swap:
result_arr = np.zeros(cols, dtype=dtype)
else:
result_arr = np.zeros(rows, dtype=dtype)
if mode == 'sum':
_vector_matrix_product_sum(vect_arr, mat_arr, result_arr)
elif mode == 'mp':
_vector_matrix_product_mp(vect_arr, mat_arr, result_arr)
else:
_vector_matrix_product(vect_arr, mat_arr, result_arr)
return result_arr
np.random.seed(0)
arr = np.random.randint(0, 100, (2, 3, 4))
mask = np.random.randint(0, 2, (2, 3, 4, 5), dtype=bool)
target = arr.ravel() # mask.reshape(-1, mask.shape[-1])
print(target)
# [820 723 861 486 408]
result1 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
print(result1)
# [820 723 861 486 408]
result2 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
print(result2)
# [820 723 861 486 408]
result3 = vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
print(result3)
# [820 723 861 486 408]
with improved timing over any list-comprehension-based solutions:
arr = np.random.randint(0, 100, (256, 256, 256))
mask = np.random.randint(0, 2, (256, 256, 256, 128), dtype=bool)
%timeit np.sum(arr[..., None] * mask, axis=tuple(range(mask.ndim - 1)))
# MemoryError
%timeit arr.ravel() # mask.reshape(-1, mask.shape[-1])
# MemoryError
%timeit np.array([np.sum(arr * mask[..., i], axis=tuple(range(mask.ndim - 1))) for i in range(mask.shape[-1])])
# 24.1 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.array([np.sum(arr[mask[..., i]]) for i in range(mask.shape[-1])])
# 46 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]))
# 408 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='mp')
# 1.63 s ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vector_matrix_product(arr.ravel(), mask.reshape(-1, mask.shape[-1]), mode='sum')
# 7.17 s ± 258 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As expected, the JIT accelerated version is the fastest, and enforcing parallelism on the code does not result in improved speed-ups.
Note also that the approach with element-wise multiplication is faster than slicing (approx. twice as fast for these benchmarks).
EDIT 2
Following #max9111 suggestion, looping first by rows and then by cols cause the most time-consuming loop to run on contiguous data, resulting in significant speed-up.
Without this trick, _vector_matrix_product_sum() and _vector_matrix_product_mp() would run at essentially the same speed.
How about
Array.reshape(-1)#Mask.reshape(-1,d)
Since you are summing over the first three axes anyway you may as well merge them after which it is easy to see that the operation can be written as matrix-vector product
Example:
a,b,c,d = 4,5,6,7
Mask = np.random.randint(0,2,(a,b,c,d),bool)
Array = np.random.randint(0,10,(a,b,c))
[np.sum(Array[Mask[:,:,:,i]]) for i in range(d)]
# [310, 237, 253, 261, 229, 268, 184]
Array.reshape(-1)#Mask.reshape(-1,d)
# array([310, 237, 253, 261, 229, 268, 184])

Python: Sum of all permutations of outer products of numpy arrays of arrays

I have a numpy array of arrays Ai and I want each outer product (np.outer(Ai[i],Ai[j])) to be summed with a scaling multiplier to produce H. I can step through and make them then tensordot them with a matrix of scaling factors. I think things could be significantly simplified, but haven't figured out a general/efficient way to do this for ND. How can Arr2D and H more easily be produced? Note: Arr2D could be 64 2D arrays rather than 8x8 2D arrays.
Ai = np.random.random((8,101))
Arr2D = np.zeros((Ai.shape[0], Ai.shape[0], Ai.shape[1], Ai.shape[1]))
Arr2D[:,:,:,:] = np.asarray([ np.outer(Ai[i], Ai[j]) for i in range(Ai.shape[0])
for j in range(Ai.shape[0]) ]).reshape(Ai.shape[0],Ai.shape[0],Ai[0].size,Ai[0].size)
arr = np.random.random( (Ai.shape[0] * Ai.shape[0]) )
arr2D = arr.reshape(Ai.shape[0], Ai.shape[0])
H = np.tensordot(Arr2D, arr2D, axes=([0,1],[0,1]))
Good setup to leverage einsum!
np.einsum('ij,kl,ik->jl',Ai,Ai,arr2D,optimize=True)
Timings -
In [71]: # Setup inputs
...: Ai = np.random.random((8,101))
...: arr = np.random.random( (Ai.shape[0] * Ai.shape[0]) )
...: arr2D = arr.reshape(Ai.shape[0], Ai.shape[0])
In [74]: %%timeit # Original soln
...: Arr2D = np.zeros((Ai.shape[0], Ai.shape[0], Ai.shape[1], Ai.shape[1]))
...: Arr2D[:,:,:,:] = np.asarray([ np.outer(Ai[i], Ai[j]) for i in range(Ai.shape[0])
...: for j in range(Ai.shape[0]) ]).reshape(Ai.shape[0],Ai.shape[0],Ai[0].size,Ai[0].size)
...: H = np.tensordot(Arr2D, arr2D, axes=([0,1],[0,1]))
100 loops, best of 3: 4.5 ms per loop
In [75]: %timeit np.einsum('ij,kl,ik->jl',Ai,Ai,arr2D,optimize=True)
10000 loops, best of 3: 146 µs per loop
30x+ speedup there!

Performance of 2 vector-matrix dot product

The question is more focused on performance of calculation.
I have 2 vector-matrix. This means that they have a 3 depth dimension for X,Y,Z. Each element of the matrix has to make dot product with the element on the same position of the other matriz.
A simple and non efficient code will be this one:
import numpy as np
a = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
b = np.random.uniform(low=-1.0, high=1.0, size=(1000,1000,3))
c = np.zeros((1000,1000))
numRow,numCol,numDepth = np.shape(a)
for idRow in range(numRow):
for idCol in range(numCol):
# Angle in radians
c[idRow,idCol] = math.acos(a[idRow,idCol,0]*b[idRow,idCol,0] + a[idRow,idCol,1]*b[idRow,idCol,1] + a[idRow,idCol,2]*b[idRow,idCol,2])
However, the numpy functions can speed up the calculations as the following ones, making code much faster:
# Angle in radians
d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) + np.multiply(a[:,:,1],b[:,:,1]) + np.multiply(a[:,:,2],b[:,:,2]))
However, I would like to know if there are other sintaxis that improve this one above with maybe other functions, indices,...
First code takes 4.658s while second takes 0.354s
You can do this with np.einsum, which multiplies and then sums over any axes:
np.arccos(np.einsum('ijk,ijk->ij', a, b))
The more straightforward way to do what you posted in the question is to use np.sum, where you sum along the last axis (-1):
np.arccos(np.sum(a*b, -1))
They all give the same answer but einsum is the fastest and sum is next:
In [36]: timeit np.arccos(np.einsum('ijk,ijk->ij', a, b))
10000 loops, best of 3: 20.4 µs per loop
In [37]: timeit e = np.arccos(np.sum(a*b, -1))
10000 loops, best of 3: 29.8 µs per loop
In [38]: %%timeit
....: d = np.arccos(np.multiply(a[:,:,0],b[:,:,0]) +
....: np.multiply(a[:,:,1],b[:,:,1]) +
....: np.multiply(a[:,:,2],b[:,:,2]))
....:
10000 loops, best of 3: 34.6 µs per loop
The Pythran compiler can further optimize your original expression by:
Removing temporary arrays
Using SIMD instructions
Using multithreading
As showcased by this example:
$ cat cross.py
#pythran export cross(float[][][], float[][][])
import numpy as np
def cross(a,b):
return np.arccos(np.multiply(a[:, :, 0], b[:, :, 0]) + np.multiply(a[:, :, 1],b[:, :, 1]) + np.multiply(a[:, :, 2], b[:, :, 2]))
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
10 loops, best of 3: 35.4 msec per loop
$ pythran cross.py -DUSE_BOOST_SIMD -fopenmp -march=native
$ python -m timeit -s 'import numpy as np; a = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); b = np.random.uniform(low=-1.0, high=1.0, size=(1000, 1000, 3)); c = np.zeros((1000, 1000)); from cross import cross' 'cross(a,b)'
100 loops, best of 3: 11.8 msec per loop

Apply a function to each row of a ndarray

I have this function to calculate squared Mahalanobis distance of vector x to mean:
def mahalanobis_sqdist(x, mean, Sigma):
'''
Calculates squared Mahalanobis Distance of vector x
to distibutions' mean
'''
Sigma_inv = np.linalg.inv(Sigma)
xdiff = x - mean
sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
return sqmdist
I have an numpy array that has a shape of (25, 4). So, I want to apply that function to all 25 rows of my array without a for loop. So, basically, how can I write the vectorized form of this loop:
for r in d1:
mahalanobis_sqdist(r[0:4], mean1, Sig1)
where mean1 and Sig1 are :
>>> mean1
array([ 5.028, 3.48 , 1.46 , 0.248])
>>> Sig1 = np.cov(d1[0:25, 0:4].T)
>>> Sig1
array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
I have tried the following but it didn't work:
>>> vecdist = np.vectorize(mahalanobis_sqdist)
>>> vecdist(d1, mean1, Sig1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1862, in __call__
theout = self.thefunc(*newargs)
File "<stdin>", line 6, in mahalanobis_sqdist
File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range
To apply a function to each row of an array, you could use:
np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
In this case, however, there is a better way. You don't have to apply a function to each row. Instead, you can apply NumPy operations to the entire d1 array to calculate the same result. np.einsum can replace the for-loop and the two calls to np.dot:
def mahalanobis_sqdist2(d, mean, Sigma):
Sigma_inv = np.linalg.inv(Sigma)
xdiff = d - mean
return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)
Here are some benchmarks:
import numpy as np
np.random.seed(1)
def mahalanobis_sqdist(x, mean, Sigma):
'''
Calculates squared Mahalanobis Distance of vector x
to distibutions mean
'''
Sigma_inv = np.linalg.inv(Sigma)
xdiff = x - mean
sqmdist = np.dot(np.dot(xdiff, Sigma_inv), xdiff)
return sqmdist
def mahalanobis_sqdist2(d, mean, Sigma):
Sigma_inv = np.linalg.inv(Sigma)
xdiff = d - mean
return np.einsum('ij,im,mj->i', xdiff, xdiff, Sigma_inv)
def using_loop(d1, mean, Sigma):
expected = []
for r in d1:
expected.append(mahalanobis_sqdist(r[0:4], mean1, Sig1))
return np.array(expected)
d1 = np.random.random((25,4))
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
Sig1 = np.cov(d1[0:25, 0:4].T)
expected = using_loop(d1, mean1, Sig1)
result = np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
result2 = mahalanobis_sqdist2(d1, mean1, Sig1)
assert np.allclose(expected, result)
assert np.allclose(expected, result2)
In [92]: %timeit mahalanobis_sqdist2(d1, mean1, Sig1)
10000 loops, best of 3: 31.1 µs per loop
In [94]: %timeit using_loop(d1, mean1, Sig1)
1000 loops, best of 3: 569 µs per loop
In [91]: %timeit np.apply_along_axis(mahalanobis_sqdist, 1, d1, mean1, Sig1)
1000 loops, best of 3: 806 µs per loop
Thus mahalanobis_sqdist2 is about 18x faster than a for-loop, and 26x faster than using np.apply_along_axis.
Note that np.apply_along_axis, np.vectorize, np.frompyfunc are Python utility functions. Under the hood they use for- or while-loops. There is no real "vectorization" going on here. They can provide syntactic assistance, but don't expect them to make your code perform any better than a for-loop you write yourself.
The answer by #unutbu works very nicely for applying any function to the rows of an array.
In this particular case, there are some mathematical symmetries you can use that will speed things up considerably if you are working with large arrays.
Here is a modified version of your function:
def mahalanobis_sqdist3(x, mean, Sigma):
Sigma_inv = np.linalg.inv(Sigma)
xdiff = x - mean
return (xdiff.dot(Sigma_inv)*xdiff).sum(axis=-1)
If you end up using any sort of large Sigma, I would recommend that you cache Sigma_inv and pass that in as an argument to your function instead.
Since it is 4x4 in this example, this doesn't matter.
I'll show how to deal with large Sigma anyway for anyone else who comes across this.
If you aren't going to be using the same Sigma repeatedly, you won't be able to cache it, so, instead of inverting the matrix, you could use a different method to solve the linear system.
Here I'll use the LU decomposition built in to SciPy.
This only improves the time if the number of columns of x is large relative to its number of rows.
Here is a function that shows that approach:
from scipy.linalg import lu_factor, lu_solve
def mahalanobis_sqdist4(x, mean, Sigma):
xdiff = x - mean
Sigma_inv = lu_factor(Sigma)
return (xdiff.T*lu_solve(Sigma_inv, xdiff.T)).sum(axis=0)
Here are some timings.
I'll include the version with einsum as mentioned in the other answer.
import numpy as np
Sig1 = np.array([[ 0.16043333, 0.11808333, 0.02408333, 0.01943333],
[ 0.11808333, 0.13583333, 0.00625 , 0.02225 ],
[ 0.02408333, 0.00625 , 0.03916667, 0.00658333],
[ 0.01943333, 0.02225 , 0.00658333, 0.01093333]])
mean1 = np.array([ 5.028, 3.48 , 1.46 , 0.248])
x = np.random.rand(25, 4)
%timeit np.apply_along_axis(mahalanobis_sqdist, 1, x, mean1, Sig1)
%timeit mahalanobis_sqdist2(x, mean1, Sig1)
%timeit mahalanobis_sqdist3(x, mean1, Sig1)
%timeit mahalanobis_sqdist4(x, mean1, Sig1)
giving:
1000 loops, best of 3: 973 µs per loop
10000 loops, best of 3: 36.2 µs per loop
10000 loops, best of 3: 40.8 µs per loop
10000 loops, best of 3: 83.2 µs per loop
However, changing the sizes of the arrays involved changes the timing results.
For example, letting x = np.random.rand(2500, 4), the timings are:
10 loops, best of 3: 95 ms per loop
1000 loops, best of 3: 355 µs per loop
10000 loops, best of 3: 131 µs per loop
1000 loops, best of 3: 337 µs per loop
And letting x = np.random.rand(1000, 1000), Sigma1 = np.random.rand(1000, 1000), and mean1 = np.random.rand(1000), the timings are:
1 loops, best of 3: 1min 24s per loop
1 loops, best of 3: 2.39 s per loop
10 loops, best of 3: 155 ms per loop
10 loops, best of 3: 99.9 ms per loop
Edit: I noticed that one of the other answers used the Cholesky decomposition.
Given that Sigma is symmetric and positive definite, we can actually do better than my above results.
There are some good routines from BLAS and LAPACK available through SciPy that can work with symmetric positive-definite matrices.
Here are two faster versions.
from scipy.linalg.fblas import dsymm
def mahalanobis_sqdist5(x, mean, Sigma_inv):
xdiff = x - mean
Sigma_inv = la.inv(Sigma)
return np.einsum('...i,...i->...',dsymm(1., Sigma_inv, xdiff.T).T, xdiff)
from scipy.linalg.flapack import dposv
def mahalanobis_sqdist6(x, mean, Sigma):
xdiff = x - mean
return np.einsum('...i,...i->...', xdiff, dposv(Sigma, xdiff.T)[1].T)
The first one still inverts Sigma.
If you pre-compute the inverse and reuse it, it is much faster (the 1000x1000 case takes 35.6ms on my machine with the pre-computed inverse).
I also used einsum to take the product then sum along the last axis.
This ended up being marginally faster than doing something like (A * B).sum(axis=-1).
These two functions give the following timings:
First test case:
10000 loops, best of 3: 55.3 µs per loop
100000 loops, best of 3: 14.2 µs per loop
Second test case:
10000 loops, best of 3: 121 µs per loop
10000 loops, best of 3: 79 µs per loop
Third test case:
10 loops, best of 3: 92.5 ms per loop
10 loops, best of 3: 48.2 ms per loop
Just saw a really nice comment on reddit that might speed things up even a little more:
This is not surprising to anyone who uses numpy regularly. For loops
in python are horribly slow. Actually, einsum is pretty slow too.
Here's a version that is faster if you have lots of vectors (500
vectors in 4 dimensions is enough to make this version faster than
einsum on my machine):
def no_einsum(d, mean, Sigma):
L_inv = np.linalg.inv(numpy.linalg.cholesky(Sigma))
xdiff = d - mean
return np.sum(np.dot(xdiff, L_inv.T)**2, axis=1)
If your points are also high dimensional then computing the inverse is
slow (and generally a bad idea anyway) and you can save time by
solving the system directly (500 vectors in 250 dimensions is enough
to make this version the fastest on my machine):
def no_einsum_solve(d, mean, Sigma):
L = numpy.linalg.cholesky(Sigma)
xdiff = d - mean
return np.sum(np.linalg.solve(L, xdiff.T)**2, axis=0)
The problem is that np.vectorize vectorizes over all arguments, but you need to vectorize only over the first one. You need to use excluded keyword argument to vectorize:
np.vectorize(mahalanobis_sqdist, excluded=[1, 2])

Why is numpy's einsum faster than numpy's built in functions?

Lets start with three arrays of dtype=np.double. Timings are performed on a intel CPU using numpy 1.7.1 compiled with icc and linked to intel's mkl. A AMD cpu with numpy 1.6.1 compiled with gcc without mkl was also used to verify the timings. Please note the timings scale nearly linearly with system size and are not due to the small overhead incurred in the numpy functions if statements these difference will show up in microseconds not milliseconds:
arr_1D=np.arange(500,dtype=np.double)
large_arr_1D=np.arange(100000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)
First lets look at the np.sum function:
np.all(np.sum(arr_3D)==np.einsum('ijk->',arr_3D))
True
%timeit np.sum(arr_3D)
10 loops, best of 3: 142 ms per loop
%timeit np.einsum('ijk->', arr_3D)
10 loops, best of 3: 70.2 ms per loop
Powers:
np.allclose(arr_3D*arr_3D*arr_3D,np.einsum('ijk,ijk,ijk->ijk',arr_3D,arr_3D,arr_3D))
True
%timeit arr_3D*arr_3D*arr_3D
1 loops, best of 3: 1.32 s per loop
%timeit np.einsum('ijk,ijk,ijk->ijk', arr_3D, arr_3D, arr_3D)
1 loops, best of 3: 694 ms per loop
Outer product:
np.all(np.outer(arr_1D,arr_1D)==np.einsum('i,k->ik',arr_1D,arr_1D))
True
%timeit np.outer(arr_1D, arr_1D)
1000 loops, best of 3: 411 us per loop
%timeit np.einsum('i,k->ik', arr_1D, arr_1D)
1000 loops, best of 3: 245 us per loop
All of the above are twice as fast with np.einsum. These should be apples to apples comparisons as everything is specifically of dtype=np.double. I would expect the speed up in an operation like this:
np.allclose(np.sum(arr_2D*arr_3D),np.einsum('ij,oij->',arr_2D,arr_3D))
True
%timeit np.sum(arr_2D*arr_3D)
1 loops, best of 3: 813 ms per loop
%timeit np.einsum('ij,oij->', arr_2D, arr_3D)
10 loops, best of 3: 85.1 ms per loop
Einsum seems to be at least twice as fast for np.inner, np.outer, np.kron, and np.sum regardless of axes selection. The primary exception being np.dot as it calls DGEMM from a BLAS library. So why is np.einsum faster that other numpy functions that are equivalent?
The DGEMM case for completeness:
np.allclose(np.dot(arr_2D,arr_2D),np.einsum('ij,jk',arr_2D,arr_2D))
True
%timeit np.einsum('ij,jk',arr_2D,arr_2D)
10 loops, best of 3: 56.1 ms per loop
%timeit np.dot(arr_2D,arr_2D)
100 loops, best of 3: 5.17 ms per loop
The leading theory is from #sebergs comment that np.einsum can make use of SSE2, but numpy's ufuncs will not until numpy 1.8 (see the change log). I believe this is the correct answer, but have not been able to confirm it. Some limited proof can be found by changing the dtype of input array and observing speed difference and the fact that not everyone observes the same trends in timings.
First off, there's been a lot of past discussion about this on the numpy list. For example, see:
http://numpy-discussion.10968.n7.nabble.com/poor-performance-of-sum-with-sub-machine-word-integer-types-td41.html
http://numpy-discussion.10968.n7.nabble.com/odd-performance-of-sum-td3332.html
Some of boils down to the fact that einsum is new, and is presumably trying to be better about cache alignment and other memory access issues, while many of the older numpy functions focus on a easily portable implementation over a heavily optimized one. I'm just speculating, there, though.
However, some of what you're doing isn't quite an "apples-to-apples" comparison.
In addition to what #Jamie already said, sum uses a more appropriate accumulator for arrays
For example, sum is more careful about checking the type of the input and using an appropriate accumulator. For example, consider the following:
In [1]: x = 255 * np.ones(100, dtype=np.uint8)
In [2]: x
Out[2]:
array([255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255], dtype=uint8)
Note that the sum is correct:
In [3]: x.sum()
Out[3]: 25500
While einsum will give the wrong result:
In [4]: np.einsum('i->', x)
Out[4]: 156
But if we use a less limited dtype, we'll still get the result you'd expect:
In [5]: y = 255 * np.ones(100)
In [6]: np.einsum('i->', y)
Out[6]: 25500.0
Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double check that Seberg's comment about SSE2 was valid.
To perform the test a new python 2.7 install was created- numpy 1.7 and 1.8 were compiled with icc using standard options on a AMD opteron core running Ubuntu.
This is the test run both before and after the 1.8 upgrade:
import numpy as np
import timeit
arr_1D=np.arange(5000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)
print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
number=5)/5
print '----------------------\n'
Numpy 1.7.1:
Summation test:
0.172988510132
0.0934836149216
----------------------
Power test:
1.93524689674
0.839519000053
----------------------
Outer test:
0.130380821228
0.121401786804
----------------------
Einsum test:
0.979052495956
0.126066613197
Numpy 1.8:
Summation test:
0.116551589966
0.0920487880707
----------------------
Power test:
1.23683619499
0.815982818604
----------------------
Outer test:
0.131808176041
0.127472200394
----------------------
Einsum test:
0.781750011444
0.129271841049
I think this is fairly conclusive that SSE plays a large role in the timing differences, it should be noted that repeating these tests the timings very by only ~0.003s. The remaining difference should be covered in the other answers to this question.
I think these timings explain what's going on:
a = np.arange(1000, dtype=np.double)
%timeit np.einsum('i->', a)
100000 loops, best of 3: 3.32 us per loop
%timeit np.sum(a)
100000 loops, best of 3: 6.84 us per loop
a = np.arange(10000, dtype=np.double)
%timeit np.einsum('i->', a)
100000 loops, best of 3: 12.6 us per loop
%timeit np.sum(a)
100000 loops, best of 3: 16.5 us per loop
a = np.arange(100000, dtype=np.double)
%timeit np.einsum('i->', a)
10000 loops, best of 3: 103 us per loop
%timeit np.sum(a)
10000 loops, best of 3: 109 us per loop
So you basically have an almost constant 3us overhead when calling np.sum over np.einsum, so they basically run as fast, but one takes a little longer to get going. Why could that be? My money is on the following:
a = np.arange(1000, dtype=object)
%timeit np.einsum('i->', a)
Traceback (most recent call last):
...
TypeError: invalid data type for einsum
%timeit np.sum(a)
10000 loops, best of 3: 20.3 us per loop
Not sure what is going on exactly, but it seems that np.einsum is skipping some checks to extract type specific functions to do the multiplications and additions, and is going directly with * and + for standard C types only.
The multidimensional cases are not different:
n = 10; a = np.arange(n**3, dtype=np.double).reshape(n, n, n)
%timeit np.einsum('ijk->', a)
100000 loops, best of 3: 3.79 us per loop
%timeit np.sum(a)
100000 loops, best of 3: 7.33 us per loop
n = 100; a = np.arange(n**3, dtype=np.double).reshape(n, n, n)
%timeit np.einsum('ijk->', a)
1000 loops, best of 3: 1.2 ms per loop
%timeit np.sum(a)
1000 loops, best of 3: 1.23 ms per loop
So a mostly constant overhead, not a faster running once they get down to it.
An update for numpy 1.21.2: Numpy's native functions are faster than einsums in almost all cases. Only einsum's outer variant and the sum23 test faster than the non-einsum versions.
If you can use numpy's native functions, do that.
(Images created with perfplot, a project of mine.)
Code to reproduce the plots:
import numpy
import perfplot
def setup1(n):
return numpy.arange(n, dtype=numpy.double)
def setup2(n):
return numpy.arange(n ** 2, dtype=numpy.double).reshape(n, n)
def setup3(n):
return numpy.arange(n ** 3, dtype=numpy.double).reshape(n, n, n)
def setup23(n):
return (
numpy.arange(n ** 2, dtype=numpy.double).reshape(n, n),
numpy.arange(n ** 3, dtype=numpy.double).reshape(n, n, n),
)
def numpy_sum(a):
return numpy.sum(a)
def einsum_sum(a):
return numpy.einsum("ijk->", a)
perfplot.save(
"sum.png",
setup=setup3,
kernels=[numpy_sum, einsum_sum],
n_range=[2 ** k for k in range(10)],
)
def numpy_power(a):
return a * a * a
def einsum_power(a):
return numpy.einsum("ijk,ijk,ijk->ijk", a, a, a)
perfplot.save(
"power.png",
setup=setup3,
kernels=[numpy_power, einsum_power],
n_range=[2 ** k for k in range(9)],
)
def numpy_outer(a):
return numpy.outer(a, a)
def einsum_outer(a):
return numpy.einsum("i,k->ik", a, a)
perfplot.save(
"outer.png",
setup=setup1,
kernels=[numpy_outer, einsum_outer],
n_range=[2 ** k for k in range(13)],
)
def dgemm_numpy(a):
return numpy.dot(a, a)
def dgemm_einsum(a):
return numpy.einsum("ij,jk", a, a)
def dgemm_einsum_optimize(a):
return numpy.einsum("ij,jk", a, a, optimize=True)
perfplot.save(
"dgemm.png",
setup=setup2,
kernels=[dgemm_numpy, dgemm_einsum],
n_range=[2 ** k for k in range(13)],
)
def dot_numpy(a):
return numpy.dot(a, a)
def dot_einsum(a):
return numpy.einsum("i,i->", a, a)
perfplot.save(
"dot.png",
setup=setup1,
kernels=[dot_numpy, dot_einsum],
n_range=[2 ** k for k in range(20)],
)
def sum23_numpy(data):
a, b = data
return numpy.sum(a * b)
def sum23_einsum(data):
a, b = data
return numpy.einsum("ij,oij->", a, b)
perfplot.save(
"sum23.png",
setup=setup23,
kernels=[sum23_numpy, sum23_einsum],
n_range=[2 ** k for k in range(10)],
)

Categories