Translating np.einsum to something more performant - python

Using python/numpy, I have the following np.einsum:
np.einsum('abde,abc->bcde', X, Y)
Y is sparse: for each [a, b], exactly one c is 1 and all others are 0.
To give a sense of the relative axis sizes, X.shape is on the order of (1000, 5, 30, 30), and Y.shape is correspondingly (1000, 5, 300).
This operation is extremely costly, and I want to make it more performant. For one thing, einsum is not parallelized. For another, because Y is sparse, I'm effectively computing 300x the number of multiplications I should be. In fact, when I wrote the equivalent of this einsum using an explicit loop, I got a speed-up of around 3x. But that's clearly not very good.
How should I approach making this more performant? I've tried using np.tensordot, but I could not figure out how to get what I want from it (and I still run into the sparse/dense problem).

If Y only contains 1 and 0 then the einsum basically does this:
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
I, J, K = np.nonzero(Y)
result[J, K] += X[I, J]
But this doesn't give the correct result due to duplicate j, k indices.
I couldn't get numpy.add.at to work, but a loop over just these indices is still pretty fast, at least for the given shapes and sparsity.
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
for i, j, k in zip(*np.nonzero(Y)):
    result[j, k] += X[i, j]
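For reference, the unbuffered form that should handle the duplicate (j, k) pairs in a single call looks like the sketch below. I have not benchmarked it against the loop, and np.add.at is known to be much slower than buffered operations, which may be why it didn't work out above:
# Unbuffered accumulation: duplicate (j, k) pairs are summed correctly,
# unlike plain fancy-index assignment.
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
I, J, K = np.nonzero(Y)
np.add.at(result, (J, K), X[I, J])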
This is the test code that I used:
a, b, c, d, e = 1000, 5, 300, 30, 30
X = np.random.randint(10, size=(a,b,d,e))
R = np.random.rand(a, b, c)
K = np.argmax(R, axis=2)
I, J = np.indices((a, b), sparse=True)
Y = np.zeros((a, b, c), int)
Y[I, J, K] = 1

You can do that pretty easily with Numba:
import numba
import numpy as np

@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute(x, y):
    na, nb, nd, ne = x.shape
    nc = y.shape[2]
    assert y.shape == (na, nb, nc)
    out = np.zeros((nb, nc, nd, ne))
    # Parallelize over b; skip the multiply entirely wherever y is zero.
    for b in numba.prange(nb):
        for a in range(na):
            for c in range(nc):
                yVal = y[a, b, c]
                if np.abs(yVal) != 0:
                    for d in range(nd):
                        for e in range(ne):
                            out[b, c, d, e] += x[a, b, d, e] * yVal
    return out
Note that for sequential code it is faster to iterate over a and then b. That said, to make the code parallel, the loops have been swapped and the parallelization is performed over b (which is a small axis). A parallel reduction over the a axis would be more efficient, but this is unfortunately not easy to do with Numba (one needs to split the matrices into multiple chunks, since there is no simple way to create thread-local matrices).
Note that you can replace values like nd and ne with the actual value (i.e. 30) so that the compiler generates faster code specialized for this matrix size.
Here is the testing code:
np.random.seed(0)
x = np.random.rand(1000, 5, 30, 30)
y = np.random.rand(1000, 5, 300)
y[np.random.rand(*y.shape) > 0.1] = 0.0 # Make it sparse (90% of 0)
%time res = np.einsum('abde,abc->bcde', x, y) # 2.350 s
%time res2 = compute(x, y) # 0.074 s (0.061 s with hand-written sizes)
print(np.allclose(res, res2))
This is about 32 times faster on a 10-core Intel Skylake Xeon processor, and it reaches a 38x speed-up with hand-written sizes. It does not scale very well due to the parallelization over the b axis, but parallelizing over other axes would cause less efficient memory accesses.
If this is not enough, it may be a good idea to transpose x and y first so as to improve data locality (thanks to a more contiguous access pattern along the a axis) and scaling (by parallelizing over both the b and c axes). That being said, transpositions are generally expensive, so one certainly needs to optimize them to get an even better speed-up.
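A minimal sketch of that layout change (the names and the particular axis order are my own, not from the answer above): move the reduced a axis innermost so each output cell accumulates over contiguous memory.
# Hypothetical layout change: put the reduction axis a last and make the
# copies contiguous, so the inner accumulation streams through memory.
xt = np.ascontiguousarray(x.transpose(1, 2, 3, 0))  # (b, d, e, a)
yt = np.ascontiguousarray(y.transpose(1, 2, 0))     # (b, c, a)
# A kernel over (b, c, d, e) would then reduce over the last axis of both.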

Related

How can I accelerate the matrix multiplication XAX^T when A is sparse?

Suppose X has r rows and c columns, so that A is a c by c matrix. If the total count of non-zero elements in A (call it z) is small then the following Python/pseudocode is plenty fast enough. If c is large though, and if z is bigger than c, then I don't have any reasonable ideas for how to accelerate the 3-way product, even approximately.
from itertools import product

def naive_threeway_matmul(X, A):
    """
    X: (r, c) dense nested lists
    A: (c, c) sparse matrix, stored as {(i, j): value}
    returns: (r, r) dense nested lists: X @ A @ X.transpose()
    """
    r = len(X)
    rtn = [[0] * r for _ in range(r)]
    for (u, v) in product(range(r), repeat=2):
        # Only the z stored entries of A contribute, hence O(r^2 z) total.
        rtn[u][v] = sum(
            X[u][i] * value * X[v][j]
            for (i, j), value in A.items()
        )
    return rtn
So we have a reasonable O(r^2 z) solution, and by computing a sparse-dense product followed by a dense-dense product we can get an O(r^2 c) solution (if z ~ c), but both of those are still effectively some cubic function. Is there an algorithm to bring the runtime down closer to O(r^2) when c is large and z ~ c, even approximately?
I don't really care about numpy or jax or other accelerators right now; I can optimize the code later.
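For concreteness, here is a sketch of the two-step O(r^2 c) route mentioned above, using scipy.sparse (the helper name and the dict-to-CSR conversion are my own, added for illustration):
import numpy as np
from scipy.sparse import coo_matrix

def two_step_threeway_matmul(X, A_dict, c):
    # Build a CSR matrix from the {(i, j): value} dict once: O(z).
    rows, cols, vals = zip(*((i, j, v) for (i, j), v in A_dict.items()))
    A = coo_matrix((list(vals), (list(rows), list(cols))), shape=(c, c)).tocsr()
    XA = (A.T @ X.T).T   # sparse-dense product: O(r z)
    return XA @ X.T      # dense-dense product: O(r^2 c)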

how to create a matrix from combinations of elements from two vectors in tensorflow

I have two vectors X = [a,b,c,d] and Y = [m,n,o]. I'd like to construct a matrix M where each element is an operation on each pair from X and Y. i.e.
M[j,i] = f(X[i], Y[j])
# e.g. where f(x,y) = x-y:
M :=
a-m b-m c-m d-m
a-n b-n c-n d-n
a-o b-o c-o d-o
I imagine I could do this with two tf.while_loop()s, but that seems inefficient; I was wondering if there is a more compact and parallel way of doing this.
P.S. There is a slight complication: X and Y are in fact not vectors but rank-2 tensors, i.e. each element of X and Y is itself a fixed-length vector, and f(X, Y) applies f() element-wise. Plus there is a batch dimension too.
I.e.
X.shape => [BATCH, I, K]
Y.shape => [BATCH, J, K]
M[batch, j, i, k] = f( X[batch, i, k], Y[batch, j, k] )
# e.g.:
= X[batch, i, k] - Y[batch, j, k]
this is using the python API btw
I found a way of doing this by increasing rank and using broadcasting. I still don't know if this is the most efficient way of doing it, but it's a heck of a lot better than using tf.while_loop! I'm still open to suggestions / improvements.
X_expand = tf.expand_dims(X, 1)  # [BATCH, 1, I, K]
Y_expand = tf.expand_dims(Y, 2)  # [BATCH, J, 1, K]
# f(X, Y) now broadcasts each tensor up to [BATCH, J, I, K], e.g.:
M = X_expand - Y_expand
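A quick self-contained check of the broadcasting, run eagerly under TF 2 (the concrete shapes here are made up for illustration):
import numpy as np
import tensorflow as tf

# Hypothetical sizes: BATCH=2, I=4, J=3, K=5
X = tf.constant(np.random.randn(2, 4, 5), dtype=tf.float32)
Y = tf.constant(np.random.randn(2, 3, 5), dtype=tf.float32)
M = tf.expand_dims(X, 1) - tf.expand_dims(Y, 2)
print(M.shape)  # (2, 3, 4, 5), i.e. [BATCH, J, I, K]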

Fast inner product of more than two matrices in python

I'm currently writing code where I need to compute, as fast as possible, a kind of inner product between three 2-D arrays.
Let's call them a, b, c. They all have the same size (N x M).
I want to compute the following 3-D array, op, of size (N x N x N), such that op[i, j, k] = sum_m a[i, m] * b[j, m] * c[k, m].
This is basically a version of np.inner extended to 3 inputs rather than 2.
In practice, the dimensions I will run into are something like N = 100 and M = 300 000. The matrices are not going to be sparse at all, so op contains about 1 million nonzero values.
So far, I've attempted two methods.
The first one uses broadcasting:
import numpy as np

N = 100
M = 300000
a = np.random.randn(N, M)
b = np.random.randn(N, M)
c = np.random.randn(N, M)

def method1(a, b, c):
    a_i = a[:, None, None, :]
    b_j = b[None, :, None, :]
    c_k = c[None, None, :, :]
    return np.sum(a_i * b_j * c_k, axis=3)
The problem with this is that it first computes a_i * b_j * c_k which is an N x N x N x M array, so in my case it is simply too much to handle.
I've tried another method using np.einsum, and it is much faster than the previous method:
def method2(a, b, c):
    return np.einsum('im,jm,km', a, b, c)
My problem is that it is still too slow. For N = 100 and M = 30 000, it already takes 95 seconds to run on my computer, so taking M to its actual value of 300 000 is impossible.
My question is: do you know any pythonic way to solve my problem (maybe a magic numpy function?), or do I have to resort to things like cython or numba to actually make this computation feasible?
Thanks in advance for any help!
Very interesting one and related to this other problem.
Approach #1: For decent size arrays
Based on the winning approach there at the above mentioned Q&A, here's one solution -
np.tensordot(a[:,None]*b,c,axes=(2,1))
Explanation:
1) a[:,None]*b: Get a 3D array of shape (N, N, M). For the stated use case that would be (100, 100, 30000), which might be a bit much for regular systems, but may just work out given some extra memory headroom.
2) np.tensordot(..): Sum-reduce that last axis from the previous step against the third array c to get a (100, 100, 100) shaped output array.
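If the (N, N, M) intermediate is too large to hold at once, the same contraction can be processed in chunks along i; a sketch, with the function name and chunk size being my own choices:
def method3(a, b, c, chunk=10):
    # Same result as np.einsum('im,jm,km', a, b, c), but the (chunk, N, M)
    # intermediate replaces the full (N, N, M) one.
    N, M = a.shape
    out = np.empty((N, N, N))
    for s in range(0, N, chunk):
        ab = a[s:s+chunk, None, :] * b[None, :, :]
        out[s:s+chunk] = np.tensordot(ab, c, axes=(2, 1))
    return out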
Approach #2: For very large arrays and with b identical to c
out = np.zeros((N, N, N))
# With b identical to c, out is symmetric in its last two axes,
# so only compute the lower triangle (k <= j) ...
for i in range(N):
    for j in range(N):
        for k in range(j+1):
            out[i, j, k] = np.einsum('i,i,i->', a[i], b[j], b[k])
# ... and mirror it into the upper triangle.
r, c = np.triu_indices(N, 1)
out[np.arange(N)[:, None], r, c] = out[np.arange(N)[:, None], c, r]

Multilinear maps in python using numpy

In Python I have a three-dimensional array M3 with size (n, n, n), and a 2-D array W with size (n, k).
In linear algebra, the multilinear map defined by M3 and applied to W would, in code, be:
X3 = np.zeros((k, k, k))
for i in range(k):
    for j in range(k):
        for t in range(k):
            for l in range(n):
                for m in range(n):
                    for h in range(n):
                        X3[i, j, t] += M3[l, m, h] * W[l, i] * W[m, j] * W[h, t]
See https://en.wikipedia.org/wiki/Multilinear_map for a reference.
This is very slow. I am wondering if there is an alternative or a prebuilt function in numpy that can speed up the operation.
Here's an approach using a series of dot-products -
# Get partial products and thus reach to final output
p1 = np.tensordot(W,M3,axes=(0,0))
p2 = np.tensordot(p1,W,axes=(1,0))
X3out = np.tensordot(p2,W,axes=(1,0))
Einstein summation convention? Use np.einsum!
X3 = np.einsum('lmh,li,mj,ht->ijt', M3, W, W, W)
EDIT:
Out of curiosity, I just ran some benchmarks comparing np.einsum to Divakar's approach. np.einsum loses by a huge margin!
import numpy as np

def approach1(a, b):
    return np.einsum('lmh,li,mj,ht->ijt', a, b, b, b)

def approach2(a, b):
    p1 = np.tensordot(b, a, axes=(0, 0))
    p2 = np.tensordot(p1, b, axes=(1, 0))
    return np.tensordot(p2, b, axes=(1, 0))

n = 100
k = 10
a = np.random.random((n, n, n))
b = np.random.random((n, k))

%timeit approach1(a, b)  # => 1 loop, best of 3: 26 s per loop
%timeit approach2(a, b)  # => 100 loops, best of 3: 4.23 ms per loop
There's some discussion about it in this question. It all seems to come down to the generality that np.einsum tries to achieve, which comes at the cost of being able to offload computations to low-level linear algebra packages.
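Worth noting (my addition, not part of the benchmark above): on NumPy 1.12+ einsum accepts an optimize flag that plans a pairwise contraction order much like the manual tensordot chain, which should close most of this gap:
# Let einsum plan the contraction pairwise instead of one big nested loop.
x = np.einsum('lmh,li,mj,ht->ijt', a, b, b, b, optimize=True)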

Efficient weighted vector distance calculation with numpy

I want to calculate the squared Euclidean distance between two sets of points, inputs and testing. inputs is typically a real array of size ~(200, N), whereas testing is typically ~(1e8, N), and N is around 10. The distances should be scaled in each of the N dimensions, so I'd be summing the expression scale[j]*(inputs[i,j] - testing[ii,j])**2 (where scale is the scaling vector) over the N dimensions. I am trying to make this as fast as possible, particularly as N can be large. My first test is:
def old_version(inputs, testing, x0):
    nn, d1 = testing.shape
    n, d1 = inputs.shape
    b = np.zeros((n, nn))
    for d in range(d1):
        b += x0[d] * ((np.tile(inputs[:, d], (nn, 1)) -
                       np.tile(testing[:, d], (n, 1)).T) ** 2).T
    return b
Nothing too fancy. I then tried using scipy.spatial.distance.cdist, although I still have to loop over the dimensions to get the scaling right:
import scipy.spatial.distance as dist

def new_version(inputs, testing, x0):
    nn, d1 = testing.shape
    n, d1 = inputs.shape
    b = np.zeros((n, nn))
    for d in range(d1):
        b += x0[d] * dist.cdist(inputs[:, d][:, None],
                                testing[:, d][:, None], 'sqeuclidean')
    return b
It would appear that new_version scales better (as N > 1000), but I'm not sure that I've gone as fast as possible here. Any further ideas much appreciated!
This code gave me a factor of 10 over your implementation; give it a try:
x = np.random.randn(200, 10)
y = np.random.randn(100000, 10)
scale = np.abs(np.random.randn(1, 10))
scale_sqrt = np.sqrt(scale)
dist_map = dist.cdist(x * scale_sqrt, y * scale_sqrt, 'sqeuclidean')
This works because scale[j]*(x[j] - y[j])**2 == (sqrt(scale[j])*x[j] - sqrt(scale[j])*y[j])**2, so the scaling can be folded into the points once, up front, and a single unweighted cdist call does the rest.
These are the test results:
In [135]: %timeit suggested_version(inputs, testing, x0)
1 loops, best of 3: 341 ms per loop
In [136]: %timeit op_version(inputs, testing, x00) (NOTICE: x00 is a reshape of x0)
1 loops, best of 3: 3.37 s per loop
Just make sure that when you go for the larger N you don't run low on memory. It can really slow things down.
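For completeness, the weighted squared distance can also be computed without cdist via the usual norm expansion, which turns the core work into one BLAS matmul. A sketch I have not benchmarked against the timings above; note it builds a second testing-sized temporary, so watch memory at 1e8 rows:
def expanded_version(inputs, testing, x0):
    # sum_j s_j*(x_j - y_j)^2
    #   = sum_j s_j*x_j^2 + sum_j s_j*y_j^2 - 2*sum_j s_j*x_j*y_j
    xs = inputs * x0  # fold the weights into one operand
    return ((xs * inputs).sum(axis=1)[:, None]
            + (testing * x0 * testing).sum(axis=1)[None, :]
            - 2.0 * xs @ testing.T)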
