Optimizing an operation with a sparse NumPy array - Python

I am struggling with a slow numpy operation, using python 3.
I have the following operation:
np.sum(np.log(X.T * b + a).T, 1)
where
X.shape == (30000, 1000)
b.shape == (1000, 1)
a.shape == (1000, 1)
My problem is that this operation is pretty slow (around 1.5 seconds), and it sits inside a loop that repeats it around 100 times, which makes the running time of my code very long.
I am wondering if there is a faster implementation of this function.
Maybe useful fact: X is extremely sparse (only 0.08% of its entries are nonzero), but it is stored as a dense NumPy array.

We can optimize the logarithm operation, which seems to be the bottleneck. Being one of the transcendental functions, it can be sped up with the numexpr module, while the sum-reduction is left to NumPy, which does that part much better. That gives us a hybrid approach, like so -
import numexpr as ne
def numexpr_app(X, a, b):
    XT = X.T
    return ne.evaluate('log(XT * b + a)').sum(0)
Looking closely at the broadcasting operation XT * b + a, we see that there are two stages of broadcasting, on which we can optimize further. The intention is to reduce that to one stage, which is possible here with some division: since every entry of b is positive here, log(X.T * b + a) = log(b) + log(X.T + a/b), so the extra term only costs a single np.log(b).sum(). This gives us a slightly modified version, shown below -
def numexpr_app2(X, a, b):
    ab = (a/b)
    XT = X.T
    return np.log(b).sum() + ne.evaluate('log(ab + XT)').sum(0)
Runtime test and verification
Original approach -
def numpy_app(X, a, b):
    return np.sum(np.log(X.T * b + a).T, 1)
Timings -
In [111]: # Setup inputs
...: density = 0.08/100 # 0.08 % sparse
...: m,n = 30000, 1000
...: X = scipy.sparse.rand(m,n,density=density,format="csr").toarray()
...: a = np.random.rand(n,1)
...: b = np.random.rand(n,1)
...:
In [112]: out0 = numpy_app(X, a, b)
...: out1 = numexpr_app(X, a, b)
...: out2 = numexpr_app2(X, a, b)
...: print(np.allclose(out0, out1))
...: print(np.allclose(out0, out2))
...:
True
True
In [114]: %timeit numpy_app(X, a, b)
1 loop, best of 3: 691 ms per loop
In [115]: %timeit numexpr_app(X, a, b)
10 loops, best of 3: 153 ms per loop
In [116]: %timeit numexpr_app2(X, a, b)
10 loops, best of 3: 149 ms per loop
Just to prove the observation stated at the start, that the log part is the bottleneck of the original NumPy approach, here's its timing -
In [44]: %timeit np.log(X.T * b + a)
1 loop, best of 3: 682 ms per loop
The numexpr improvement on that step is significant -
In [120]: XT = X.T
In [121]: %timeit ne.evaluate('log(XT * b + a)')
10 loops, best of 3: 142 ms per loop

It's a bit unclear why you would do np.sum(your_array.T, axis=1) instead of np.sum(your_array, axis=0).
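For anyone unsure, the two really are interchangeable; a tiny check (array contents made up for illustration):
import numpy as np
M = np.arange(6).reshape(2, 3)
# rows of M.T are columns of M, so the two sums agree
print(np.allclose(np.sum(M.T, axis=1), np.sum(M, axis=0)))  # True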
You can use a scipy sparse matrix: use compressed sparse column (CSC) format for X, so that X.T is compressed sparse row, since you multiply by b, which has the shape of one row of X.T:
X_sparse = scipy.sparse.csc_matrix(X)
and replace X.T * b by:
X_sparse.T.multiply(b)
However, since a is not sparse, the addition densifies the result again, so this will not help you as much as it could.
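To make the replacement concrete, here is a rough sketch of the question's full expression with the sparse product dropped in; X, a, b are as in the question's setup, and the toarray() step is exactly where the benefit is lost:
import numpy as np
import scipy.sparse
X_sparse = scipy.sparse.csc_matrix(X)
prod = X_sparse.T.multiply(b)  # sparse elementwise product, shape (1000, 30000)
# adding the dense `a` forces a dense array again before the log
result = np.log(prod.toarray() + a).sum(axis=0)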
These are the speed ups I obtain for this operation:
In [16]: %timeit X_sparse.T.multiply(b)
The slowest run took 10.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 374 µs per loop
In [17]: %timeit X.T * b
10 loops, best of 3: 44.5 ms per loop
with:
import numpy as np
from scipy import sparse
X = np.random.randn(30000, 1000)
a = np.random.randn(1000, 1)
b = np.random.randn(1000, 1)
X[X < 3] = 0  # keep only entries above 3 sigma: ~0.13% of values remain nonzero
print(np.sum(X != 0))
X_sparse = sparse.csc_matrix(X)

Time- and space-efficient array multiplication in NumPy [duplicate]

Given NumPy arrays R and S with shapes (m, d) and (m, n, d) respectively, I would like to compute an array P of shape (m, n) whose (i, j)-th entry is np.dot(R[i, :] , S[i, j, :]).
Doing a double for-loop would not need any extra space (apart from the m * n space for P), but would not be time-efficient.
Using broadcasting, I could do P = np.sum(R[:, np.newaxis, :] * S, axis=2), but this would cost extra m * n * d space.
What is the most time- and space-efficient way to do this?
einsum is another of the usual suspects
>>> m, n, d = 100, 100, 100
>>> R = np.random.random((m, d))
>>> S = np.random.random((m, n, d))
>>> np.einsum('md,mnd->mn', R, S)
>>> np.allclose(np.einsum('md,mnd->mn', R, S), (R[:,None,:]*S).sum(axis=-1))
True
>>> from timeit import repeat
>>> repeat('np.einsum("md,mnd->mn", R, S)', globals=globals(), number=1000)
[0.7004671019967645, 0.6925274690147489, 0.6952172230230644]
>>> repeat('(R[:,None,:]*S).sum(axis=-1)', globals=globals(), number=1000)
[3.0512512560235336, 3.0466731210472062, 3.044075728044845]
Some indirect evidence that einsum isn't too wasteful with the RAM:
>>> m, n, d = 1000, 1001, 1002
>>> # Too much for broadcasting:
>>> np.zeros((m, n, d))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>> R = np.random.random((m, d))
>>> S = np.random.random((n, d))
>>> np.einsum('md,nd->mn', R, S).shape
(1000, 1001)
In these cases, it is always good to consider numba, which can provide the best of both worlds:
import numpy as np
from numba import jit
def vanilla_mult(R, S):
    m, n = R.shape[0], S.shape[1]
    result = np.empty((m, n), dtype=R.dtype)
    for i in range(m):
        for j in range(n):
            result[i, j] = np.dot(R[i, :], S[i, j, :])
    return result

def broadcast_mult(R, S):
    return np.sum(R[:, np.newaxis, :] * S, axis=2)

@jit(nopython=True)
def jit_mult(R, S):
    m, n = R.shape[0], S.shape[1]
    result = np.empty((m, n), dtype=R.dtype)
    for i in range(m):
        for j in range(n):
            result[i, j] = np.dot(R[i, :], S[i, j, :])
    return result
Note that vanilla_mult and jit_mult have exactly the same implementation; the latter is just-in-time compiled. Let's test this out:
In [1]: import test # the above is in test.py
In [2]: import numpy as np
In [3]: m, n, d = 100, 100, 100
In [4]: R = np.random.rand(m, d)
In [5]: S = np.random.rand(m, n, d)
OK...
In [6]: %timeit test.broadcast_mult(R, S)
100 loops, best of 3: 1.95 ms per loop
In [7]: %timeit test.vanilla_mult(R, S)
100 loops, best of 3: 11.7 ms per loop
Ouch, yeah, a roughly 6-fold increase in computation time compared to broadcasting. However...
In [8]: %timeit test.jit_mult(R, S)
The slowest run took 760.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 870 µs per loop
Nice! We can more than halve our runtime simply by JITing! How does this scale?
In [12]: m, n, d = 1000, 1000, 100
In [13]: R = np.random.rand(m, d)
In [14]: S = np.random.rand(m, n, d)
In [15]: %timeit test.vanilla_mult(R, S)
1 loop, best of 3: 1.22 s per loop
In [16]: %timeit test.broadcast_mult(R, S)
1 loop, best of 3: 666 ms per loop
In [17]: %timeit test.jit_mult(R, S)
The slowest run took 7.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 83.6 ms per loop
It scales very well: broadcasting is now held back by having to create a large intermediate array, so it only halves the runtime of the vanilla approach, while taking roughly 8 times as long as the JIT approach!
Edit to Add
And finally, we compare the np.einsum approach:
In [19]: %timeit np.einsum('md,mnd->mn', R, S)
10 loops, best of 3: 59.5 ms per loop
And it is clearly the winner in speed. I am not familiar enough with it to comment on its space requirements, though.
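One rough way to probe the space question (my addition, not part of the original answer): NumPy 1.13+ reports its allocations to tracemalloc, so the traced peak gives a hint of how large the intermediates are:
import numpy as np
import tracemalloc
m, n, d = 200, 300, 400
R = np.random.random((m, d))
S = np.random.random((m, n, d))
for label, fn in [('einsum', lambda: np.einsum('md,mnd->mn', R, S)),
                  ('broadcast', lambda: (R[:, None, :] * S).sum(axis=-1))]:
    tracemalloc.start()
    fn()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # broadcast should show roughly the extra m*n*d*8-byte temporary
    print(label, peak / 1e6, 'MB')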

Speed up angle calculation for each x,y point in a matrix

I have a 3-d Numpy array flow as follows:
flow = np.random.uniform(low=-1.0, high=1.0, size=(720,1280,2))
# Suppose flow[..., 0] are x-coordinates and flow[..., 1] are y-coordinates.
Need to calculate the angle for each x,y point. Here is how I have implemented it:
def calcAngle(a):
    assert len(a) == 2
    (x, y) = a
    angle_deg = np.angle(x + y * 1j, deg=True)
    return angle_deg
fangle = np.apply_along_axis(calcAngle, axis=2, arr=flow)
# The above statement takes 14.0389318466 seconds to execute
The calculation of the angle at each point takes 14.0389318466 seconds to execute on my MacBook Pro.
Is there a way I could speed this up, perhaps by using some matrix operation rather than processing each pixel one at a time?
You can use numpy.arctan2() to get the angle in radians, and then convert to degrees with numpy.rad2deg():
fangle = np.rad2deg(np.arctan2(flow[:,:,1], flow[:,:,0]))
On my computer, this is a little faster than Divakar's version:
In [17]: %timeit np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
10 loops, best of 3: 44.5 ms per loop
In [18]: %timeit np.rad2deg(np.arctan2(flow[:,:,1], flow[:,:,0]))
10 loops, best of 3: 35.4 ms per loop
A more efficient way to use np.angle() is to create a complex view of flow. If flow is an array of type np.float64 with shape (m, n, 2), then flow.view(np.complex128)[:,:,0] will be an array of type np.complex128 with shape (m, n):
fangle = np.angle(flow.view(np.complex128)[:,:,0], deg=True)
This appears to be a smidge faster than using arctan2 followed by rad2deg (but the difference is not far above the measurement noise of timeit):
In [47]: %timeit np.angle(flow.view(np.complex128)[:,:,0], deg=True)
10 loops, best of 3: 35 ms per loop
Note that this might not work if flow was created as the transpose of some other array, or as a slice of another array using steps bigger than 1.
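A defensive variant of the view trick (a sketch of my own, not from the original answer): fall back to a contiguous copy when the layout doesn't allow the complex view:
import numpy as np
def complex_view_angle(flow):
    # .view(np.complex128) requires the (x, y) pairs to be adjacent in memory
    if not flow.flags['C_CONTIGUOUS']:
        flow = np.ascontiguousarray(flow)
    return np.angle(flow.view(np.complex128)[:, :, 0], deg=True)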
numpy.angle supports vectorized operation. So, just feed in the first and second column slices to it for the final output, like so -
fangle = np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
Verification -
In [9]: flow = np.random.uniform(low=-1.0, high=1.0, size=(720,1280,2))
In [17]: out1 = np.apply_along_axis(calcAngle, axis=2, arr=flow)
In [18]: out2 = np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
In [19]: np.allclose(out1, out2)
Out[19]: True
Runtime test -
In [10]: %timeit np.apply_along_axis(calcAngle, axis=2, arr=flow)
1 loop, best of 3: 8.27 s per loop
In [11]: %timeit np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
10 loops, best of 3: 47.6 ms per loop
In [12]: 8270/47.6
Out[12]: 173.73949579831933
173x+ speedup!

Exterior product in NumPy : Vectorizing six nested loops

In a research paper, the author introduces an exterior product between two 3×3 matrices A and B, resulting in C:
C(i, j) = sum(k=1..3, l=1..3, m=1..3, n=1..3) eps(i,k,l)*eps(j,m,n)*A(k,m)*B(l,n)
where eps(a, b, c) is the Levi-Civita symbol.
I am wondering how to vectorize such a mathematical operator in Numpy instead of implementing 6 nested loops (for i, j, k, l, m, n) naively.
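For reference, a minimal way to build the Levi-Civita symbol as a NumPy array, which the answers below assume is available as eps:
import numpy as np
eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1   # even permutations
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1  # odd permutations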
It looks like a purely sum-reduction based problem without the requirement of keeping any axis aligned between the inputs. So, I would suggest a matrix-multiplication-based solution for tensors, using np.tensordot.
Thus, one solution could be implemented in three steps -
# Matrix-multiplication between first eps and A.
# Thus losing second axis from eps and first from A : k
parte1 = np.tensordot(eps,A,axes=((1),(0)))
# Matrix-multiplication between second eps and B.
# Thus losing third axis from eps and second from B : n
parte2 = np.tensordot(eps,B,axes=((2),(1)))
# Finally, we are left with two products: ilm & jml.
# We need to lose lm and ml from these inputs respectively to get ij.
# So, we need to lose the last two dims from the products, but flipped.
out = np.tensordot(parte1,parte2,axes=((1,2),(2,1)))
Runtime test
Approaches -
def einsum_based1(eps, A, B): # @unutbu's soln1
    return np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)

def einsum_based2(eps, A, B): # @unutbu's soln2
    return np.einsum('ilm,jml->ij',
                     np.einsum('ikl,km->ilm', eps, A),
                     np.einsum('jmn,ln->jml', eps, B))

def tensordot_based(eps, A, B):
    parte1 = np.tensordot(eps,A,axes=((1),(0)))
    parte2 = np.tensordot(eps,B,axes=((2),(1)))
    return np.tensordot(parte1,parte2,axes=((1,2),(2,1)))
Timings -
In [5]: # Setup inputs
...: N = 20
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [6]: %timeit einsum_based1(eps, A, B)
1 loops, best of 3: 773 ms per loop
In [7]: %timeit einsum_based2(eps, A, B)
1000 loops, best of 3: 972 µs per loop
In [8]: %timeit tensordot_based(eps, A, B)
1000 loops, best of 3: 214 µs per loop
Bigger dataset -
In [12]: # Setup inputs
...: N = 100
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [13]: %timeit einsum_based2(eps, A, B)
1 loops, best of 3: 856 ms per loop
In [14]: %timeit tensordot_based(eps, A, B)
10 loops, best of 3: 49.2 ms per loop
You could use einsum which implements Einstein summation notation:
C = np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)
or for better performance, apply einsum to two arrays at a time:
C = np.einsum('ilm,jml->ij',
              np.einsum('ikl,km->ilm', eps, A),
              np.einsum('jmn,ln->jml', eps, B))
np.einsum computes a sum of products.
The subscript specifier 'ikl,jmn,km,ln->ij' tells np.einsum that
the first eps has subscripts i,k,l,
the second eps has subscripts j,m,n,
A has subscripts k,m,
B has subscripts l,n,
the output array has subscripts i,j.
Thus, the summation is over products of the form
eps(i,k,l) * eps(j,m,n) * A(k,m) * B(l,n)
All subscripts not in the output array are summed over.
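As a reference point, the same sum written as the literal six nested loops; slow, but handy for verifying either vectorized version on the 3×3 case:
import numpy as np
def exterior_product_loops(eps, A, B):
    # direct transcription of C(i,j) = sum eps(i,k,l)*eps(j,m,n)*A(k,m)*B(l,n)
    C = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for m in range(3):
                        for n in range(3):
                            C[i, j] += eps[i, k, l] * eps[j, m, n] * A[k, m] * B[l, n]
    return C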

Speed up numpy summation between big arrays

I have code running operations on numpy arrays.
While linear algebra operations seem fast, I am now finding a bottleneck in a different place: the summation of two distinct arrays.
In the example below, wE3 and T1 are two 1000×1000×1000 arrays.
First I calculate wE3 using a numpy operation, then I sum the two arrays.
import numpy as np
import scipy as sp
import time
N = 100
n = 1000
X = np.random.uniform(size = (N,n))
wE = np.mean(X,0)
wE3 = np.einsum('i,j,k->ijk', wE, wE, wE) #22 secs
T1 = np.random.uniform(size = (n,n,n))
a = wE3 + T1 #115 secs
The calculation of wE3 takes about 22 seconds, while the addition of wE3 and T1 takes 115 seconds.
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3? They should have more or less the same complexity.
Is there a way to speed up that code?
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3?
The arrays wE3, T1 and a each require 8 gigabytes of memory (1000³ float64 values × 8 bytes). You are probably running out of physical memory, and swap memory access is killing your performance.
Is there a way to speed up that code?
Get more physical memory (i.e. RAM).
If that is not possible, take a look at what you are going to do with these arrays, and see if you can work in batches such that the total memory required when processing a batch remains within the limits of your physical memory.
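A hypothetical batching sketch along those lines; the slab size and the memmap output file are my choices, not from the question:
import numpy as np
n, slab = 1000, 50
wE = np.random.uniform(size=n)
# stream the result to disk so no full (n, n, n) array has to live in RAM
out = np.lib.format.open_memmap('result.npy', mode='w+',
                                dtype=np.float64, shape=(n, n, n))
for start in range(0, n, slab):
    stop = min(start + slab, n)
    wE3_slab = wE[start:stop, None, None] * wE[:, None] * wE
    T1_slab = np.random.uniform(size=(stop - start, n, n))
    out[start:stop] = wE3_slab + T1_slab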
That np.einsum('i,j,k->ijk', wE, wE, wE) part isn't doing any sum-reduction and is essentially just broadcasted elementwise multiplication. So, we can replace that with something like this -
wE[:,None,None] * wE[:,None] * wE
Runtime test -
In [9]: # Setup inputs at 1/5th of original dataset sizes
...: N = 20
...: n = 200
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [10]: %timeit np.einsum('i,j,k->ijk', wE, wE, wE)
10 loops, best of 3: 45.7 ms per loop
In [11]: %timeit wE[:,None,None] * wE[:,None] * wE
10 loops, best of 3: 26.1 ms per loop
Next up, we have wE3 + T1, where T1 = np.random.uniform(size = (n,n,n)). This doesn't look like it can be helped in a big way, as we have to create T1 anyway, and the rest is just element-wise addition. It seems we can use np.add, which lets us write the result back into one of the arrays: wE3 or T1. Let's say we choose T1, if it's okay for it to be modified. This should bring a slight memory efficiency, as we won't be adding another array to the workspace.
Thus, we could do -
np.add(wE3,T1,out=T1)
Runtime test -
In [58]: def func1(wE3):
    ...:     T1 = np.random.uniform(size = (n,n,n))
    ...:     return wE3 + T1
    ...:
    ...: def func2(wE3):
    ...:     T1 = np.random.uniform(size = (n,n,n))
    ...:     np.add(wE3,T1,out=T1)
    ...:     return T1
    ...:
...:
In [59]: # Setup inputs at 1/4th of original dataset sizes
...: N = 25
...: n = 250
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...: wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:
In [60]: %timeit func1(wE3)
1 loops, best of 3: 390 ms per loop
In [61]: %timeit func2(wE3)
1 loops, best of 3: 363 ms per loop
Using @Aaron's suggestion, we can use a loop; assuming that writing the results back into wE3 is okay, we could do -
wE3 = wE[:,None,None] * wE[:,None] * wE
for x in wE3:
    np.add(x, np.random.uniform(size = (n,n)), out=x)
Final results
Thus, putting together all the suggested improvements, the final runtime test results were -
In [97]: def func1(wE):
    ...:     wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
    ...:     T1 = np.random.uniform(size = (n,n,n))
    ...:     return wE3 + T1
    ...:
    ...: def func2(wE):
    ...:     wE3 = wE[:,None,None] * wE[:,None] * wE
    ...:     for x in wE3:
    ...:         np.add(x, np.random.uniform(size = (n,n)), out=x)
    ...:     return wE3
    ...:
...:
In [98]: # Setup inputs at 1/3rd of original dataset sizes
...: N = 33
...: n = 330
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [99]: %timeit func1(wE)
1 loops, best of 3: 1.09 s per loop
In [100]: %timeit func2(wE)
1 loops, best of 3: 879 ms per loop
You should really use Numba's JIT (just-in-time compiler) for this. It is a pure numpy pipeline, which is perfect for Numba.
All you have to do is throw the above code into a function and put a @jit decorator on top. It gets speedups close to Cython.
However, as others have pointed out, it appears you're trying to work with data too large for your local machine, and numba will not solve that problem.
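Purely as an illustration of that suggestion, a minimal sketch with the addition JITed; as in the earlier answer, this assumes T1 may be overwritten:
import numpy as np
from numba import jit
@jit(nopython=True)
def add_into(T1, wE3):
    # explicit loops let numba compile the addition without a temporary
    for i in range(T1.shape[0]):
        for j in range(T1.shape[1]):
            for k in range(T1.shape[2]):
                T1[i, j, k] += wE3[i, j, k]
    return T1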

Optimal numba implementation for matrix multiplication depends significantly on matrix size

This question relates to one I posted a while back:
Python, numpy, einsum multiply a stack of matrices
I am trying to understand why I get the speedups I get with Numba when it is used in a particular manner to multiply a stack of stacks of matrices. As before, I am putting in a (500,201,2,2) array and multiplying the (2x2) matrices at the end along the first axis (so 500 multiplications), to get a (201,2,2) array as the result.
Here is the Python code:
import numpy as np
from numba import jit # numba 0.24, numpy 1.9.3, python 2.7.11

Arr = np.random.rand(500,201,2,2)

def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1,len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

@jit(nopython=True)
def loopMultJit(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = np.dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_2X2(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            x1 = ArrMult[i,0,0] * Arr[j,i,0,0] + ArrMult[i,0,1] * Arr[j,i,1,0]
            y1 = ArrMult[i,0,0] * Arr[j,i,0,1] + ArrMult[i,0,1] * Arr[j,i,1,1]
            x2 = ArrMult[i,1,0] * Arr[j,i,0,0] + ArrMult[i,1,1] * Arr[j,i,1,0]
            y2 = ArrMult[i,1,0] * Arr[j,i,0,1] + ArrMult[i,1,1] * Arr[j,i,1,1]
            ArrMult[i,0,0] = x1
            ArrMult[i,0,1] = y1
            ArrMult[i,1,0] = x2
            ArrMult[i,1,1] = y2
    return ArrMult

A1 = loopMult(Arr)
A2 = loopMultJit(Arr)
A3 = loopMultJit_2X2(Arr)
print np.allclose(A1, A2)
print np.allclose(A1, A3)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
Here is the output:
True
True
10 loops, best of 3: 40.5 ms per loop
10 loops, best of 3: 36 ms per loop
1000 loops, best of 3: 808 µs per loop
In the prior question, the accepted answer showed an 8x speedup with f2py, without detailed optimization. Here, with Numba, I get about a 10% speedup over an einsum loop, but a 45x speedup if, instead of using np.dot in the loop, I simply do the 2x2 matrix multiplication by hand. Why is this? I should mention that I have also implemented both of these jit functions with proper type signatures as guvectorize versions, which give basically the same speedup factors, so I left them out. Also, the speedup from iterating over a (201,500,2,2) array instead is minimal.
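(For completeness, a sketch of what one of the omitted guvectorize versions might look like; the layout string, the transpose, and the function name are my assumptions, not the original code:)
import numpy as np
from numba import guvectorize, float64
# broadcasts over a leading axis; each call chain-multiplies one (s,2,2) stack
@guvectorize([(float64[:, :, :], float64[:, :])], '(s,m,n)->(m,n)', nopython=True)
def loopMultGu(stack, out):
    out[:, :] = stack[0]
    for j in range(1, stack.shape[0]):
        out[:, :] = np.dot(out.copy(), stack[j])  # copy avoids aliasing out
# usage sketch: loopMultGu(Arr.transpose(1, 0, 2, 3)) -> shape (201, 2, 2)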
Two comments responded that the speedup is just due to Python overhead, and I think that's right. The overhead is mostly function calls, but also for-loops, and np.dot has some extra overhead on top of that. I set up a naive dot product function:
@jit(nopython=True)
def dot(mat1, mat2):
    # result is (rows of mat1) x (cols of mat2)
    mat = np.empty(shape=(mat1.shape[0], mat2.shape[1]), dtype=mat1.dtype)
    for r1 in range(mat1.shape[0]):
        for c2 in range(mat2.shape[1]):
            s = 0
            for j in range(mat2.shape[0]):
                s += mat1[r1,j] * mat2[j,c2]
            mat[r1,c2] = s
    return mat
Then I set up two functions to multiply the arrays: one that calls the dot function, and one with the dot product inlined in the loop so that it executes without an extra function call:
@jit(nopython=True)
def loopMultJit_dot(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_dotInternal(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    tmp = np.empty((Arr.shape[2], Arr.shape[3]), dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            # accumulate into tmp so we don't overwrite entries of
            # ArrMult[i] that are still being read
            for r1 in range(ArrMult.shape[1]):
                for c2 in range(Arr.shape[3]):
                    s = 0.0
                    for r2 in range(Arr.shape[2]):
                        s += ArrMult[i,r1,r2] * Arr[j,i,r2,c2]
                    tmp[r1,c2] = s
            ArrMult[i] = tmp
    return ArrMult
Then I can run two comparisons: 2x2 arrays and 10x10 arrays. These give some idea of the penalties paid for function calls in general, for the np.dot function call in particular, and of the gains from BLAS optimizations in np.dot:
print "2x2 Time Test:"
Arr = rand(500,201,2,2)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
print "10x10 Time Test:"
Arr = rand(500,201,10,10)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
which yields:
2x2 Time Test:
10 loops, best of 3: 55.8 ms per loop # einsum
10 loops, best of 3: 48.7 ms per loop # np.dot
1000 loops, best of 3: 1.09 ms per loop # 2x2
10 loops, best of 3: 28.3 ms per loop # naive dot, separate function
100 loops, best of 3: 2.58 ms per loop # naive dot internal
10x10 Time Test:
1 loop, best of 3: 499 ms per loop # einsum
10 loops, best of 3: 91.3 ms per loop # np.dot
10 loops, best of 3: 170 ms per loop # naive dot, separate function
10 loops, best of 3: 161 ms per loop # naive dot internal
I suppose the take-home messages are:
einsum is nice if you're not using numba, or need one-liners, but for matrix multiplication, there are faster options
if you're working with small matrices, it can be faster to do things by hand and not call separate functions
for large matrices, there is a reason BLAS was invented, and in fact, speedups are quite noticeable at sizes as small as 10x10.
