This question already has an answer here:
Python: Multiplying a list of vectors by a list of matrices as a single matrix operation
(1 answer)
Closed 5 years ago.
Given NumPy arrays R and S with shapes (m, d) and (m, n, d) respectively, I would like to compute an array P of shape (m, n) whose (i, j)-th entry is np.dot(R[i, :] , S[i, j, :]).
Doing a double for-loop would not need any extra space (apart from the m * n space for P), but would not be time-efficient.
Using broadcasting, I could do P = np.sum(R[:, np.newaxis, :] * S, axis=2), but this would cost extra m * n * d space.
What is the most time- and space-efficient way to do this?
einsum is another of the usual suspects
m, n, d = 100, 100, 100
>>> R = np.random.random((m, d))
>>> S = np.random.random((m, n, d))
>>> np.einsum('md,mnd->mn', R, S)
>>> np.allclose(np.einsum('md,mnd->mn', R, S), (R[:,None,:]*S).sum(axis=-1))
True
>>> from timeit import repeat
>>> repeat('np.einsum("md,mnd->mn", R, S)', globals=globals(), number=1000)
[0.7004671019967645, 0.6925274690147489, 0.6952172230230644]
>>> repeat('(R[:,None,:]*S).sum(axis=-1)', globals=globals(), number=1000)
[3.0512512560235336, 3.0466731210472062, 3.044075728044845]
Some indirect evidence that einsum isn't too wasteful with the RAM:
>>> m, n, d = 1000, 1001, 1002
>>> # Too much for broadcasting:
>>> np.zeros((m, n, d))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>> R = np.random.random((m, d))
>>> S = np.random.random((n, d))
>>> np.einsum('md,nd->mn', R, S).shape
(1000, 1001)
In these cases, it is always good to consider numba, which can provide the best of both worlds:
import numpy as np
from numba import jit
def vanilla_mult(R, S):
m, n = R.shape[0], S.shape[1]
result = np.empty((m, n), dtype=R.dtype)
for i in range(m):
for j in range(n):
result[i, j] = np.dot(R[i, :], S[i, j,:])
return result
def broadcast_mult(R, S):
return np.sum(R[:, np.newaxis, :] * S, axis=2)
#jit(nopython=True)
def jit_mult(R, S):
m, n = R.shape[0], S.shape[1]
result = np.empty((m, n), dtype=R.dtype)
for i in range(m):
for j in range(n):
result[i, j] = np.dot(R[i, :], S[i, j,:])
return result
Note, vanilla_mult and jit_mult have the exact-same implementation, however, the latter is just-in-time compiled. Let's test this out:
In [1]: import test # the above is in test.py
In [2]: import numpy as np
In [3]: m, n, d = 100, 100, 100
In [4]: R = np.random.rand(m, d)
In [5]: S = np.random.rand(m, n, d)
OK...
In [6]: %timeit test.broadcast_mult(R, S)
100 loops, best of 3: 1.95 ms per loop
In [7]: %timeit test.vanilla_mult(R, S)
100 loops, best of 3: 11.7 ms per loop
Ouch, yeah, an almost 5-fold increase in compuation time compared to broadcasting. However...
In [8]: %timeit test.jit_mult(R, S)
The slowest run took 760.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 870 µs per loop
Nice! We can cut our runtime in half by simply JITing! How does this scale?
In [12]: m, n, d = 1000, 1000, 100
In [13]: R = np.random.rand(m, d)
In [14]: S = np.random.rand(m, n, d)
In [15]: %timeit test.vanilla_mult(R, S)
1 loop, best of 3: 1.22 s per loop
In [16]: %timeit test.broadcast_mult(R, S)
1 loop, best of 3: 666 ms per loop
In [17]: %timeit test.jit_mult(R, S)
The slowest run took 7.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 83.6 ms per loop
Scales very well, since broadcasting is starting to be held back by having to create large, intermediate arrays, it is only half the time compared to the vanilla approach, but it takes almost 7-times as much as the JIT-approach!
Edit to Add
And finally, we compare the np.einsum approach:
In [19]: %timeit np.einsum('md,mnd->mn', R, S)
10 loops, best of 3: 59.5 ms per loop
And it is clearly the winner in speed. I am not familiar enough with it to comment on the space requirements, though.
Related
My problem is the following. I have two arrays X and Y of shape n, p where p >> n (e.g. n = 50, p = 10000).
I also have a mask mask (1-d array of booleans of size p) with respect to p, of small density (e.g. np.mean(mask) is 0.05).
I try to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape n, n, and is such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think it is a problem that .dot does not know that the mask is only with respect to p but I don't if its possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot !
We can use matrix-multiplication with and without the masked versions as the masked subtraction from the full version yields to us the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the reversed mask, would be slower though for a sparsey mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...: for j in range(X.shape[0]):
...: inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop
I am struggling with a slow numpy operation, using python 3.
I have the following operation:
np.sum(np.log(X.T * b + a).T, 1)
where
(30000,1000) = X.shape
(1000,1) = b.shape
(1000,1) = a.shape
My problem is that this operation is pretty slow (around 1.5 seconds), and it is inside a loop, so it is repeated around 100 times, that makes the running time of my code very long.
I am wondering if there is a faster implementation of this function.
Maybe useful fact: X is extremely sparse (only 0.08% of the entries are nonzero), but is a NumPy array.
We can optimize the logarithm operation which seems to be the bottleneck and that being one of the transcendental functions could be sped up with numexpr module and then sum-reduce with NumPy because NumPy does it much better, thus giving us a hybrid one, like so -
import numexpr as ne
def numexpr_app(X, a, b):
XT = X.T
return ne.evaluate('log(XT * b + a)').sum(0)
Looking closely at the broadcasting operations : XT * b + a, we see that there are two stages of broadcasting, on which we can optimize further. The intention is to see if that could be reduced to one stage and that seems possible here with some division. This gives us a slightly modified version, shown below -
def numexpr_app2(X, a, b):
ab = (a/b)
XT = X.T
return np.log(b).sum() + ne.evaluate('log(ab + XT)').sum(0)
Runtime test and verification
Original approach -
def numpy_app(X, a, b):
return np.sum(np.log(X.T * b + a).T, 1)
Timings -
In [111]: # Setup inputs
...: density = 0.08/100 # 0.08 % sparse
...: m,n = 30000, 1000
...: X = scipy.sparse.rand(m,n,density=density,format="csr").toarray()
...: a = np.random.rand(n,1)
...: b = np.random.rand(n,1)
...:
In [112]: out0 = numpy_app(X, a, b)
...: out1 = numexpr_app(X, a, b)
...: out2 = numexpr_app2(X, a, b)
...: print np.allclose(out0, out1)
...: print np.allclose(out0, out2)
...:
True
True
In [114]: %timeit numpy_app(X, a, b)
1 loop, best of 3: 691 ms per loop
In [115]: %timeit numexpr_app(X, a, b)
10 loops, best of 3: 153 ms per loop
In [116]: %timeit numexpr_app2(X, a, b)
10 loops, best of 3: 149 ms per loop
Just to prove the observation stated at the start that log part is the bottleneck with the original NumPy approach, here's the timing on it -
In [44]: %timeit np.log(X.T * b + a)
1 loop, best of 3: 682 ms per loop
On which the improvement was significant -
In [120]: XT = X.T
In [121]: %timeit ne.evaluate('log(XT * b + a)')
10 loops, best of 3: 142 ms per loop
It's a bit unclear why you would do np.sum(your_array.T, axis=1) instead of np.sum(your_array, axis=0).
You can use a scipy sparse matrix: (use column compressed format for X, so that X.T is row compressed, since you multiply by b which has the shape of one row of X.T)
X_sparse = scipy.sparse.csc_matrx(X)
and replace X.T * b by:
X_sparse.T.multiply(b)
However if a is not sparse it will not help you as much as it could.
These are the speed ups I obtain for this operation:
In [16]: %timeit X_sparse.T.multiply(b)
The slowest run took 10.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 374 µs per loop
In [17]: %timeit X.T * b
10 loops, best of 3: 44.5 ms per loop
with:
import numpy as np
from scipy import sparse
X = np.random.randn(30000, 1000)
a = np.random.randn(1000, 1)
b = np.random.randn(1000, 1)
X[X < 3] = 0
print(np.sum(X != 0))
X_sparse = sparse.csc_matrix(X)
Given two matrices, I want to compute the pairwise differences between all rows. Each matrix has 1000 rows and 100 columns so they are fairly large. I tried using a for loop and pure broadcasting but the for loop seem to be working faster. Am I doing something wrong? Here is the code:
from numpy import *
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
The broadcasting method takes about 1 second longer and it's even longer for large matrices. Any idea how to speed this up purely using numpy?
Here's another way to perform :
(a-b)^2 = a^2 + b^2 - 2ab
with np.einsum for the first two terms and dot-product for the third one -
import numpy as np
np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)
Runtime test
Approaches -
def loopy_app(A,B):
m,n = A.shape[0], B.shape[0]
out = np.empty((m,n))
for i,a in enumerate(A):
out[i] = np.sum((a - B)**2,1)
return out
def broadcasting_app(A,B):
return ((A[:,np.newaxis,:] - B)**2).sum(-1)
# #Paul Panzer's soln
def outer_sum_dot_app(A,B):
return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)
# #Daniel Forsman's soln
def einsum_all_app(A,B):
return np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], \
A[:,None,:] - B[None,:,:])
# Proposed in this post
def outer_einsum_dot_app(A,B):
return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - \
2*np.dot(A,B.T)
Timings -
In [51]: A = np.random.randn(1000,100)
...: B = np.random.randn(1000,100)
...:
In [52]: %timeit loopy_app(A,B)
...: %timeit broadcasting_app(A,B)
...: %timeit outer_sum_dot_app(A,B)
...: %timeit einsum_all_app(A,B)
...: %timeit outer_einsum_dot_app(A,B)
...:
10 loops, best of 3: 136 ms per loop
1 loops, best of 3: 302 ms per loop
100 loops, best of 3: 8.51 ms per loop
1 loops, best of 3: 341 ms per loop
100 loops, best of 3: 8.38 ms per loop
Here is a solution which avoids both the loop and the large intermediates:
from numpy import *
import time
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
#matmul
start = time.time()
add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*dot(A,B.T)
print time.time() - start
Prints:
0.546781778336
0.674743175507
0.10723400116
Another job for np.einsum
np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], A[:,None,:] - B[None,:,:])
Similar to #paul-panzer, a general way to compute pairwise differences of arrays of arbitrary dimension is broadcasting as follows:
Let v be a NumPy array of size (n, d):
import numpy as np
v_tiled_across = np.tile(v[:, np.newaxis, :], (1, v.shape[0], 1))
v_tiled_down = np.tile(v[np.newaxis, :, :], (v.shape[0], 1, 1))
result = v_tiled_across - v_tiled_down
For a better picture what's happening, imagine each d-dimensional row of v being propped up like a flagpole, and copied across and down. Now when you do component-wise subtraction, you're getting each pairwise combination.
__
There's also scipy.spatial.distance.pdist, which computes a metric in a pairwise fashion.
from scipy.spatial.distance import pdist, squareform
pairwise_L2_norms = squareform(pdist(v))
In a research paper, the author introduces an exterior product between two (3*3) matrices A and B, resulting in C:
C(i, j) = sum(k=1..3, l=1..3, m=1..3, n=1..3) eps(i,k,l)*eps(j,m,n)*A(k,m)*B(l,n)
where eps(a, b, c) is the Levi-Civita symbol.
I am wondering how to vectorize such a mathematical operator in Numpy instead of implementing 6 nested loops (for i, j, k, l, m, n) naively.
It looks like a purely sum-reduction based problem without the requirement of keeping any axis aligned between the inputs. So, I would suggest matrix-multiplication based solution for tensors using np.tensordot.
Thus, one solution could be implemented in three steps -
# Matrix-multiplication between first eps and A.
# Thus losing second axis from eps and first from A : k
parte1 = np.tensordot(eps,A,axes=((1),(0)))
# Matrix-multiplication between second eps and B.
# Thus losing third axis from eps and second from B : n
parte2 = np.tensordot(eps,B,axes=((2),(1)))
# Finally, we are left with two products : ilm & jml.
# We need to lose lm and ml from these inputs respectively to get ij.
# So, we need to lose last two dims from the products, but flipped .
out = np.tensordot(parte1,parte2,axes=((1,2),(2,1)))
Runtime test
Approaches -
def einsum_based1(eps, A, B): # #unutbu's soln1
return np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)
def einsum_based2(eps, A, B): # #unutbu's soln2
return np.einsum('ilm,jml->ij',
np.einsum('ikl,km->ilm', eps, A),
np.einsum('jmn,ln->jml', eps, B))
def tensordot_based(eps, A, B):
parte1 = np.tensordot(eps,A,axes=((1),(0)))
parte2 = np.tensordot(eps,B,axes=((2),(1)))
return np.tensordot(parte1,parte2,axes=((1,2),(2,1)))
Timings -
In [5]: # Setup inputs
...: N = 20
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [6]: %timeit einsum_based1(eps, A, B)
1 loops, best of 3: 773 ms per loop
In [7]: %timeit einsum_based2(eps, A, B)
1000 loops, best of 3: 972 µs per loop
In [8]: %timeit tensordot_based(eps, A, B)
1000 loops, best of 3: 214 µs per loop
Bigger dataset -
In [12]: # Setup inputs
...: N = 100
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [13]: %timeit einsum_based2(eps, A, B)
1 loops, best of 3: 856 ms per loop
In [14]: %timeit tensordot_based(eps, A, B)
10 loops, best of 3: 49.2 ms per loop
You could use einsum which implements Einstein summation notation:
C = np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)
or for better performance, apply einsum to two arrays at a time:
C = np.einsum('ilm,jml->ij',
np.einsum('ikl,km->ilm', eps, A),
np.einsum('jmn,ln->jml', eps, B))
np.einsum computes a sum of products.
The subscript specifier 'ikl,jmn,km,ln->ij' tells np.einsum that
the first eps has subcripts i,k,l,
the second eps has subcripts j,m,n,
A has subcripts k,m,
B has subcripts l,n,
the output array has subscripts i,j
Thus, the summation is over products of the form
eps(i,k,l) * eps(j,m,n) * A(k,m) * B(l,n)
All subscripts not in the output array are summed over.
I have to apply some mathematical formula that I've written
in python as:
for s in range(tdim):
sum1 = 0.0
for i in range(dim):
for j in range(dim):
sum1+=0.5*np.cos(theta[s]*(i-j))*
eig1[i]*eig1[j]+eig2[i]+eig2[j])-0.5*np.sin(theta[s]*(i-j))*eig1[j]*eig2[i]-eig1[i]*eig2[j])
PHi2.append(sum1)
Now, this is correct, but clearly inefficient, the other way around is to do:
for i in range(dim):
for j in range(dim):
PHi2 = 0.5*np.cos(theta*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])-0.5*np.sin(theta*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])
However, the second example gives me the same number in all elements of PHi2, so this
is faster but answer is wrong. How can you do this correctly and more efficiently?
NOTE: eig1 and eig2 are of the same dimension d, theta and PHi2 are the same dimension D,
BUT d!=D.
You can use a brute force broadcasting approach, but you are creating an intermediate array of shape (D, d, d), which can get out of hand if your arrays are even moderately large. Furthermore, in using broadcasting with no refinements you are recomputing a lot of calculations from the innermost loop that you only need to do once. If you first compute the necessary parameters for all possible values of i - j and add them together, you can reuse those values on the outer loop, e.g.:
def fast_ops(eig1, eig2, theta):
d = len(eig1)
d_arr = np.arange(d)
i_j = d_arr[:, None] - d_arr[None, :]
reidx = i_j + d - 1
mult1 = eig1[:, None] * eig1[ None, :] + eig2[:, None] + eig2[None, :]
mult2 = eig1[None, :] * eig2[:, None] - eig1[:, None] * eig2[None, :]
mult1_reidx = np.bincount(reidx.ravel(), weights=mult1.ravel())
mult2_reidx = np.bincount(reidx.ravel(), weights=mult2.ravel())
angles = theta[:, None] * np.arange(1 - d, d)
return 0.5 * (np.einsum('ij,j->i', np.cos(angles), mult1_reidx) -
np.einsum('ij,j->i', np.sin(angles), mult2_reidx))
IF we rewrite M4rtini's code as a function for comparison:
def fast_ops1(eig1, eig2, theta):
d = len(eig1)
D = len(theta)
s = np.array(range(D))[:, None, None]
i = np.array(range(d))[:, None]
j = np.array(range(d))
ret = 0.5 * (np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j]) -
np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j]))
return ret.sum(axis=(-1, -2))
And we make up some data:
d, D = 100, 200
eig1 = np.random.rand(d)
eig2 = np.random.rand(d)
theta = np.random.rand(D)
The speed improvement is very noticeable, 80x on top of the 115x over your original code, leading to a whooping 9000x speed-up:
In [22]: np.allclose(fast_ops1(eig1, eig2, theta), fast_ops(eig1, eig2, theta))
Out[22]: True
In [23]: %timeit fast_ops1(eig1, eig2, theta)
10 loops, best of 3: 145 ms per loop
In [24]: %timeit fast_ops(eig1, eig2, theta)
1000 loops, best of 3: 1.85 ms per loop
This works by broadcasting.
For tdim = 200 and dim = 100.
14 seconds with original.
120 ms with the version.
s = np.array(range(tdim))[:, None, None]
i = np.array(range(dim))[:, None]
j = np.array(range(dim))
PHi2 =(0.5*np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])-0.5*np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])).sum(axis=2).sum(axis=1)
In the first bit of code, you have 0.5*np.cos(theta[s]*(i-j))... but in the second it's 0.5*np.cos(theta*(i-j)).... Unless you've got theta defined differently for the second bit of code, this could well be the cause of the trouble.