Given NumPy arrays R and S with shapes (m, d) and (m, n, d) respectively, I would like to compute an array P of shape (m, n) whose (i, j)-th entry is np.dot(R[i, :], S[i, j, :]).
Doing a double for-loop would not need any extra space (apart from the m * n space for P), but would not be time-efficient.
Using broadcasting, I could do P = np.sum(R[:, np.newaxis, :] * S, axis=2), but this would cost extra m * n * d space.
What is the most time- and space-efficient way to do this?
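For completeness, one more option (a sketch of my own, not from the answers below): np.matmul broadcasts over the leading m axis, so a batched matrix-vector product gives P directly without materializing the (m, n, d) intermediate that broadcasting plus sum would create:
import numpy as np

m, n, d = 100, 100, 100
R = np.random.random((m, d))
S = np.random.random((m, n, d))

# (m, n, d) @ (m, d, 1) -> (m, n, 1); drop the trailing axis to get (m, n).
P = np.matmul(S, R[:, :, None])[:, :, 0]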
einsum is another of the usual suspects
>>> m, n, d = 100, 100, 100
>>> R = np.random.random((m, d))
>>> S = np.random.random((m, n, d))
>>> np.einsum('md,mnd->mn', R, S)
>>> np.allclose(np.einsum('md,mnd->mn', R, S), (R[:,None,:]*S).sum(axis=-1))
True
>>> from timeit import repeat
>>> repeat('np.einsum("md,mnd->mn", R, S)', globals=globals(), number=1000)
[0.7004671019967645, 0.6925274690147489, 0.6952172230230644]
>>> repeat('(R[:,None,:]*S).sum(axis=-1)', globals=globals(), number=1000)
[3.0512512560235336, 3.0466731210472062, 3.044075728044845]
Some indirect evidence that einsum isn't too wasteful with the RAM:
>>> m, n, d = 1000, 1001, 1002
>>> # Too much for broadcasting:
>>> np.zeros((m, n, d))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>> R = np.random.random((m, d))
>>> S = np.random.random((n, d))
>>> np.einsum('md,nd->mn', R, S).shape
(1000, 1001)
In these cases, it is always good to consider numba, which can provide the best of both worlds:
import numpy as np
from numba import jit

def vanilla_mult(R, S):
    m, n = R.shape[0], S.shape[1]
    result = np.empty((m, n), dtype=R.dtype)
    for i in range(m):
        for j in range(n):
            result[i, j] = np.dot(R[i, :], S[i, j, :])
    return result

def broadcast_mult(R, S):
    return np.sum(R[:, np.newaxis, :] * S, axis=2)

@jit(nopython=True)
def jit_mult(R, S):
    m, n = R.shape[0], S.shape[1]
    result = np.empty((m, n), dtype=R.dtype)
    for i in range(m):
        for j in range(n):
            result[i, j] = np.dot(R[i, :], S[i, j, :])
    return result
Note that vanilla_mult and jit_mult have the exact same implementation; the only difference is that the latter is just-in-time compiled. Let's test this out:
In [1]: import test # the above is in test.py
In [2]: import numpy as np
In [3]: m, n, d = 100, 100, 100
In [4]: R = np.random.rand(m, d)
In [5]: S = np.random.rand(m, n, d)
OK...
In [6]: %timeit test.broadcast_mult(R, S)
100 loops, best of 3: 1.95 ms per loop
In [7]: %timeit test.vanilla_mult(R, S)
100 loops, best of 3: 11.7 ms per loop
Ouch, yeah, an almost 5-fold increase in computation time compared to broadcasting. However...
In [8]: %timeit test.jit_mult(R, S)
The slowest run took 760.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 870 µs per loop
Nice! We can cut our runtime in half by simply JITing! How does this scale?
In [12]: m, n, d = 1000, 1000, 100
In [13]: R = np.random.rand(m, d)
In [14]: S = np.random.rand(m, n, d)
In [15]: %timeit test.vanilla_mult(R, S)
1 loop, best of 3: 1.22 s per loop
In [16]: %timeit test.broadcast_mult(R, S)
1 loop, best of 3: 666 ms per loop
In [17]: %timeit test.jit_mult(R, S)
The slowest run took 7.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 83.6 ms per loop
That scales very well: broadcasting is now held back by having to create large intermediate arrays, so it takes only about half the time of the vanilla approach, but almost 8 times as long as the JIT approach!
Edit to Add
And finally, we compare the np.einsum approach:
In [19]: %timeit np.einsum('md,mnd->mn', R, S)
10 loops, best of 3: 59.5 ms per loop
And it is clearly the winner in speed. I am not familiar enough with it to comment on the space requirements, though.
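If you want a rough handle on the space side (my addition, not from the answer above), np.einsum_path reports the largest intermediate a given contraction order would allocate, which gives some indirect insight into memory use:
import numpy as np

m, n, d = 100, 100, 100
R = np.random.rand(m, d)
S = np.random.rand(m, n, d)

# The returned string report includes a "Largest intermediate" line.
path, info = np.einsum_path('md,mnd->mn', R, S, optimize='optimal')
print(info)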
I have a 3-d Numpy array flow as follows:
flow = np.random.uniform(low=-1.0, high=1.0, size=(720,1280,2))
# Suppose flow[..., 0] are x-coordinates and flow[..., 1] are y-coordinates.
I need to calculate the angle for each (x, y) point. Here is how I have implemented it:
def calcAngle(a):
    assert(len(a) == 2)
    (x, y) = a
    # angle_deg = 0
    angle_deg = np.angle(x + y * 1j, deg=True)
    return angle_deg
fangle = np.apply_along_axis(calcAngle, axis=2, arr=flow)
# The above statement takes 14.0389318466 seconds to execute
The calculation of the angle at each point takes 14.0389318466 seconds to execute on my MacBook Pro.
Is there a way I could speed this up, perhaps by using some matrix operation, rather than processing each pixel one at a time?
You can use numpy.arctan2() to get the angle in radians, and then convert to degrees with numpy.rad2deg():
fangle = np.rad2deg(np.arctan2(flow[:,:,1], flow[:,:,0]))
On my computer, this is a little faster than Divakar's version:
In [17]: %timeit np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
10 loops, best of 3: 44.5 ms per loop
In [18]: %timeit np.rad2deg(np.arctan2(flow[:,:,1], flow[:,:,0]))
10 loops, best of 3: 35.4 ms per loop
A more efficient way to use np.angle() is to create a complex view of flow. If flow is an array of type np.float64 with shape (m, n, 2), then flow.view(np.complex128)[:,:,0] will be an array of type np.complex128 with shape (m, n):
fangle = np.angle(flow.view(np.complex128)[:,:,0], deg=True)
This appears to be a smidge faster than using arctan2 followed by rad2deg (but the difference is not far above the measurement noise of timeit):
In [47]: %timeit np.angle(flow.view(np.complex128)[:,:,0], deg=True)
10 loops, best of 3: 35 ms per loop
Note that this might not work if flow was created as the transpose of some other array, or as a slice of another array using steps bigger than 1.
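If you're not sure whether flow meets that requirement, one defensive option (my own sketch) is to fall back to a contiguous copy before taking the view:
import numpy as np

flow = np.random.uniform(low=-1.0, high=1.0, size=(720, 1280, 2))

# The complex view needs the x/y pairs to be adjacent in memory, so make a
# C-contiguous copy only if the array isn't already laid out that way.
flow_c = flow if flow.flags['C_CONTIGUOUS'] else np.ascontiguousarray(flow)
fangle = np.angle(flow_c.view(np.complex128)[:, :, 0], deg=True)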
numpy.angle supports vectorized operations, so just feed it the first and second slices along the last axis for the final output, like so -
fangle = np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
Verification -
In [9]: flow = np.random.uniform(low=-1.0, high=1.0, size=(720,1280,2))
In [17]: out1 = np.apply_along_axis(calcAngle, axis=2, arr=flow)
In [18]: out2 = np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
In [19]: np.allclose(out1, out2)
Out[19]: True
Runtime test -
In [10]: %timeit np.apply_along_axis(calcAngle, axis=2, arr=flow)
1 loop, best of 3: 8.27 s per loop
In [11]: %timeit np.angle(flow[...,0] + flow[...,1] * 1j, deg=True)
10 loops, best of 3: 47.6 ms per loop
In [12]: 8270/47.6
Out[12]: 173.73949579831933
173x+ speedup!
In a research paper, the author introduces an exterior product between two 3x3 matrices A and B, resulting in C:
C(i, j) = sum(k=1..3, l=1..3, m=1..3, n=1..3) eps(i,k,l)*eps(j,m,n)*A(k,m)*B(l,n)
where eps(a, b, c) is the Levi-Civita symbol.
I am wondering how to vectorize such a mathematical operator in Numpy instead of implementing 6 nested loops (for i, j, k, l, m, n) naively.
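For reference, here is a plain (unvectorized) version you could check any vectorized solution against; the eps construction and the six-loop evaluation below are my own sketch, built directly from the definition above:
import numpy as np

# 3x3x3 Levi-Civita symbol: +1 on even permutations, -1 on odd ones, 0 elsewhere.
eps = np.zeros((3, 3, 3))
for a, b, c in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[a, b, c] = 1.0
    eps[a, c, b] = -1.0

def exterior_product_naive(A, B):
    C = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for m in range(3):
                        for n in range(3):
                            C[i, j] += eps[i, k, l] * eps[j, m, n] * A[k, m] * B[l, n]
    return C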
It looks like a purely sum-reduction based problem without the requirement of keeping any axis aligned between the inputs. So, I would suggest a matrix-multiplication based solution for tensors using np.tensordot.
Thus, one solution could be implemented in three steps -
# Matrix-multiplication between first eps and A.
# Thus losing second axis from eps and first from A : k
parte1 = np.tensordot(eps,A,axes=((1),(0)))
# Matrix-multiplication between second eps and B.
# Thus losing third axis from eps and second from B : n
parte2 = np.tensordot(eps,B,axes=((2),(1)))
# Finally, we are left with two products: ilm & jml.
# We need to lose lm and ml from these inputs respectively to get ij.
# So, we need to lose the last two dims from the products, but flipped.
out = np.tensordot(parte1,parte2,axes=((1,2),(2,1)))
Runtime test
Approaches -
def einsum_based1(eps, A, B):  # @unutbu's soln1
    return np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)

def einsum_based2(eps, A, B):  # @unutbu's soln2
    return np.einsum('ilm,jml->ij',
                     np.einsum('ikl,km->ilm', eps, A),
                     np.einsum('jmn,ln->jml', eps, B))

def tensordot_based(eps, A, B):
    parte1 = np.tensordot(eps, A, axes=((1), (0)))
    parte2 = np.tensordot(eps, B, axes=((2), (1)))
    return np.tensordot(parte1, parte2, axes=((1, 2), (2, 1)))
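A quick sanity check (my addition) that the three approaches agree before timing them:
import numpy as np

eps = np.random.rand(4, 4, 4)
A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
print(np.allclose(einsum_based1(eps, A, B), tensordot_based(eps, A, B)))  # True
print(np.allclose(einsum_based2(eps, A, B), tensordot_based(eps, A, B)))  # True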
Timings -
In [5]: # Setup inputs
...: N = 20
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [6]: %timeit einsum_based1(eps, A, B)
1 loops, best of 3: 773 ms per loop
In [7]: %timeit einsum_based2(eps, A, B)
1000 loops, best of 3: 972 µs per loop
In [8]: %timeit tensordot_based(eps, A, B)
1000 loops, best of 3: 214 µs per loop
Bigger dataset -
In [12]: # Setup inputs
...: N = 100
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [13]: %timeit einsum_based2(eps, A, B)
1 loops, best of 3: 856 ms per loop
In [14]: %timeit tensordot_based(eps, A, B)
10 loops, best of 3: 49.2 ms per loop
You could use einsum which implements Einstein summation notation:
C = np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)
or for better performance, apply einsum to two arrays at a time:
C = np.einsum('ilm,jml->ij',
              np.einsum('ikl,km->ilm', eps, A),
              np.einsum('jmn,ln->jml', eps, B))
np.einsum computes a sum of products.
The subscript specifier 'ikl,jmn,km,ln->ij' tells np.einsum that
the first eps has subscripts i,k,l,
the second eps has subscripts j,m,n,
A has subscripts k,m,
B has subscripts l,n,
the output array has subscripts i,j.
Thus, the summation is over products of the form
eps(i,k,l) * eps(j,m,n) * A(k,m) * B(l,n)
All subscripts not in the output array are summed over.
I have code running operations on numpy arrays.
While linear algebra operations seem fast, I am now finding a bottleneck in a different place: the summation of two distinct arrays.
In the example below, wE3 and T1 are two 1000x1000x1000 arrays.
First I calculate wE3 using a numpy operation, then I sum those arrays.
import numpy as np
import scipy as sp
import time
N = 100
n = 1000
X = np.random.uniform(size = (N,n))
wE = np.mean(X,0)
wE3 = np.einsum('i,j,k->ijk', wE, wE, wE) #22 secs
T1 = np.random.uniform(size = (n,n,n))
a = wE3 + T1 #115 secs
The calculation of wE3 takes about 22 seconds, while the addition of wE3 and T1 takes 115 seconds.
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3? They should have more or less the same complexity.
Is there a way to speed up that code?
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3?
The arrays wE3, T1 and a each require 8 gigabytes of memory. You are probably running out of physical memory, and swap memory access is killing your performance.
Is there a way to speed up that code?
Get more physical memory (i.e. RAM).
If that is not possible, take a look at what you are going to do with these arrays, and see if you can work in batches such that the total memory required when processing a batch remains within the limits of your physical memory.
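A minimal sketch of what batching could look like here (the downstream use of each slab is hypothetical): build and consume one slab of the (n, n, n) result at a time instead of materializing all of it.
import numpy as np

N, n = 100, 1000
X = np.random.uniform(size=(N, n))
wE = np.mean(X, 0)

batch = 50  # slab thickness along the first axis
for start in range(0, n, batch):
    stop = min(start + batch, n)
    # Slab of wE3 with shape (stop - start, n, n), built via broadcasting.
    wE3_slab = wE[start:stop, None, None] * wE[None, :, None] * wE[None, None, :]
    T1_slab = np.random.uniform(size=(stop - start, n, n))
    a_slab = wE3_slab + T1_slab
    # ... process a_slab here, then let it go out of scope ...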
That np.einsum('i,j,k->ijk', wE, wE, wE) part isn't doing any sum-reduction and is essentially just broadcasted elementwise multiplication. So, we can replace that with something like this -
wE[:,None,None] * wE[:,None] * wE
Runtime test -
In [9]: # Setup inputs at 1/5th of original dataset sizes
...: N = 20
...: n = 200
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [10]: %timeit np.einsum('i,j,k->ijk', wE, wE, wE)
10 loops, best of 3: 45.7 ms per loop
In [11]: %timeit wE[:,None,None] * wE[:,None] * wE
10 loops, best of 3: 26.1 ms per loop
Next up, we have wE3 + T1, where T1 = np.random.uniform(size = (n,n,n)). That part doesn't look like it can be helped in a big way, as we have to create T1 anyway and then it's just element-wise addition. It seems we can use np.add, which lets us write the result back into one of the arrays: wE3 or T1. Let's say we choose T1, if it's okay for it to be modified. I guess this would bring a slight memory efficiency, as we won't be adding another variable into the workspace.
Thus, we could do -
np.add(wE3,T1,out=T1)
Runtime test -
In [58]: def func1(wE3):
...:     T1 = np.random.uniform(size = (n,n,n))
...:     return wE3 + T1
...:
...: def func2(wE3):
...:     T1 = np.random.uniform(size = (n,n,n))
...:     np.add(wE3,T1,out=T1)
...:     return T1
...:
In [59]: # Setup inputs at 1/4th of original dataset sizes
...: N = 25
...: n = 250
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...: wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:
In [60]: %timeit func1(wE3)
1 loops, best of 3: 390 ms per loop
In [61]: %timeit func2(wE3)
1 loops, best of 3: 363 ms per loop
Using @Aaron's suggestion, we can use a loop; assuming that writing the results back into wE3 is okay, we could do -
wE3 = wE[:,None,None] * wE[:,None] * wE
for x in wE3:
    np.add(x, np.random.uniform(size = (n,n)), out=x)
Final results
Thus, putting together all the suggested improvements, the final runtime test results were -
In [97]: def func1(wE):
...:     wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:     T1 = np.random.uniform(size = (n,n,n))
...:     return wE3 + T1
...:
...: def func2(wE):
...:     wE3 = wE[:,None,None] * wE[:,None] * wE
...:     for x in wE3:
...:         np.add(x, np.random.uniform(size = (n,n)), out=x)
...:     return wE3
...:
In [98]: # Setup inputs at 1/3rd of original dataset sizes
...: N = 33
...: n = 330
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [99]: %timeit func1(wE)
1 loops, best of 3: 1.09 s per loop
In [100]: %timeit func2(wE)
1 loops, best of 3: 879 ms per loop
You should really use Numba's JIT (just-in-time compiler) for this. It is a purely numpy pipeline, which is perfect for Numba.
All you have to do is throw the above code into a function and put an @jit decorator on top. It gets speedups close to Cython.
However, as others have pointed out, it appears you're trying to work with data too large for your local machine, and numba would not solve that problem.
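For concreteness, a minimal sketch of the @jit route (my own, not from the answer; note that np.einsum is not supported in numba's nopython mode, so the outer product and the addition are written as explicit loops):
import numpy as np
from numba import jit

@jit(nopython=True)
def build_a(wE, T1):
    # Computes wE3 + T1 element by element, without materializing wE3 separately.
    n = wE.shape[0]
    out = np.empty((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i, j, k] = wE[i] * wE[j] * wE[k] + T1[i, j, k]
    return out

# Hypothetical usage:
# n = 1000
# wE = np.mean(np.random.uniform(size=(100, n)), 0)
# T1 = np.random.uniform(size=(n, n, n))
# a = build_a(wE, T1)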
This question relates to one I posted a while back:
Python, numpy, einsum multiply a stack of matrices
I am trying to understand why I get the speedups I do with Numba when it is used in a particular manner to multiply stacks of matrices. As before, I am putting in a (500, 201, 2, 2) array, multiplying the (2x2) matrices at the end along the first axis (so 500 multiplications), to get a (201, 2, 2) array as the result.
Here is the Python code:
import numpy as np
from numpy.random import rand
from numba import jit  # numba 0.24, numpy 1.9.3, python 2.7.11

Arr = rand(500, 201, 2, 2)

def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1, len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

@jit(nopython=True)
def loopMultJit(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = np.dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_2X2(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            x1 = ArrMult[i,0,0] * Arr[j,i,0,0] + ArrMult[i,0,1] * Arr[j,i,1,0]
            y1 = ArrMult[i,0,0] * Arr[j,i,0,1] + ArrMult[i,0,1] * Arr[j,i,1,1]
            x2 = ArrMult[i,1,0] * Arr[j,i,0,0] + ArrMult[i,1,1] * Arr[j,i,1,0]
            y2 = ArrMult[i,1,0] * Arr[j,i,0,1] + ArrMult[i,1,1] * Arr[j,i,1,1]
            ArrMult[i,0,0] = x1
            ArrMult[i,0,1] = y1
            ArrMult[i,1,0] = x2
            ArrMult[i,1,1] = y2
    return ArrMult
A1 = loopMult(Arr)
A2 = loopMultJit(Arr)
A3 = loopMultJit_2X2(Arr)
print np.allclose(A1, A2)
print np.allclose(A1, A3)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
Here is the output:
True
True
10 loops, best of 3: 40.5 ms per loop
10 loops, best of 3: 36 ms per loop
1000 loops, best of 3: 808 µs per loop
In the prior question, the accepted answer showed that with f2py there was a speedup of 8x without detailed optimization. Here, with Numba, I get about a 10% speedup over an einsum loop, but I get a 45x speedup if, instead of using np.dot in the loop, I simply do the 2x2 matrix multiplication by hand. Why is this? I should mention I have implemented both of these jit functions with proper type signatures as guvectorize versions as well, which provide basically the same speedup factors, so I left them out. Also, the speedup from iterating over a (201, 500, 2, 2) array instead is minimal.
Two comments responded that the speedup is just due to Python overhead, and I think that's right. The overhead is mostly function calls, but also for loops, and np.dot has some extra overhead on top of that. I set up a naive dot product function:
@jit(nopython=True)
def dot(mat1, mat2):
    s = 0
    mat = np.empty(shape=(mat1.shape[0], mat2.shape[1]), dtype=mat1.dtype)
    for r1 in range(mat1.shape[0]):
        for c2 in range(mat2.shape[1]):
            s = 0
            for j in range(mat2.shape[0]):
                s += mat1[r1,j] * mat2[j,c2]
            mat[r1,c2] = s
    return mat
Then I set up two functions to multiply the arrays: one which calls the dot function, and one which has the dot function built into the loop so that it is executed without an extra function call:
@jit(nopython=True)
def loopMultJit_dot(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_dotInternal(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            s = 0.0
            for r1 in range(ArrMult.shape[1]):
                for c2 in range(Arr.shape[3]):
                    s = 0.0
                    for r2 in range(Arr.shape[2]):
                        s += ArrMult[i,r1,r2] * Arr[j,i,r2,c2]
                    ArrMult[i,r1,c2] = s
    return ArrMult
Then I can run two comparisons: 2x2 arrays and 10x10 arrays. With these I get some idea of the penalties paid for function calls in general, for the np.dot function call in particular, and of the gains from BLAS optimizations in np.dot:
print "2x2 Time Test:"
Arr = rand(500,201,2,2)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
print "10x10 Time Test:"
Arr = rand(500,201,10,10)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
which yields:
2x2 Time Test:
10 loops, best of 3: 55.8 ms per loop # einsum
10 loops, best of 3: 48.7 ms per loop # np.dot
1000 loops, best of 3: 1.09 ms per loop # 2x2
10 loops, best of 3: 28.3 ms per loop # naive dot, separate function
100 loops, best of 3: 2.58 ms per loop # naive dot internal
10x10 Time Test:
1 loop, best of 3: 499 ms per loop # einsum
10 loops, best of 3: 91.3 ms per loop # np.dot
10 loops, best of 3: 170 ms per loop # naive dot, separate function
10 loops, best of 3: 161 ms per loop # naive dot internal
I suppose the take-home messages are:
einsum is nice if you're not using numba, or need one-liners, but for matrix multiplication, there are faster options
if you're working with small matrices, it can be faster to do things by hand and not call separate functions
for large matrices, there is a reason BLAS was invented, and in fact, speedups are quite noticeable at sizes as small as 10x10.