In Python I have a three-dimensional array M3 with shape (n, n, n) and a 2-D array W with shape (n, k).
In linear algebra, the multilinear map defined by M3 and applied to W would, in code, be:
X3 = np.zeros((k, k, k))
for i in xrange(k):
    for j in xrange(k):
        for t in xrange(k):
            for l in xrange(n):
                for m in xrange(n):
                    for h in xrange(n):
                        X3[i, j, t] += M3[l, m, h] * W[l, i] * W[m, j] * W[h, t]
See https://en.wikipedia.org/wiki/Multilinear_map for a reference.
This is very slow. I am wondering whether there exists an alternative, or a pre-built function in NumPy, that can speed up these operations.
Here's an approach using a series of dot-products -
# Build the output through a chain of partial tensordot contractions
p1 = np.tensordot(W, M3, axes=(0,0))        # shape (k, n, n)
p2 = np.tensordot(p1, W, axes=(1,0))        # shape (k, n, k)
X3out = np.tensordot(p2, W, axes=(1,0))     # shape (k, k, k)
Einstein summation convention? Use np.einsum!
X3 = np.einsum('lmh,li,mj,ht->ijt', M3, W, W, W)
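As a quick sanity check (sizes here are arbitrary, just for illustration), the einsum one-liner and the tensordot chain from the other answer should agree:
import numpy as np

# Small arbitrary sizes, only to compare the two approaches
n, k = 20, 5
M3 = np.random.rand(n, n, n)
W = np.random.rand(n, k)

X3_einsum = np.einsum('lmh,li,mj,ht->ijt', M3, W, W, W)

p1 = np.tensordot(W, M3, axes=(0, 0))       # contract l -> shape (k, n, n)
p2 = np.tensordot(p1, W, axes=(1, 0))       # contract m -> shape (k, n, k)
X3_dots = np.tensordot(p2, W, axes=(1, 0))  # contract h -> shape (k, k, k)

print(np.allclose(X3_einsum, X3_dots))      # expected: True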
EDIT:
Out of curiosity, I ran some benchmarks comparing np.einsum to Divakar's tensordot approach. np.einsum turns out to be dramatically slower!
import numpy as np
def approach1(a, b):
    x = np.einsum('lmh,li,mj,ht->ijt', a, b, b, b)

def approach2(a, b):
    p1 = np.tensordot(b, a, axes=(0,0))
    p2 = np.tensordot(p1, b, axes=(1,0))
    x = np.tensordot(p2, b, axes=(1,0))
n = 100
k = 10
a = np.random.random((n, n, n))
b = np.random.random((n, k))
%timeit approach1(a, b) # => 1 loop, best of 3: 26 s per loop
%timeit approach2(a, b) # => 100 loops, best of 3: 4.23 ms per loop
There's some discussion about it in this question. It all seems to come down to the generality that np.einsum tries to achieve — at the cost of being able to offload computations to low-level linear algebra packages.
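Worth noting: on newer NumPy versions (1.12+), np.einsum accepts an optimize argument that lets it choose a pairwise contraction order and dispatch to BLAS-backed tensordot calls internally, which should recover most of that gap. A minimal sketch:
X3 = np.einsum('lmh,li,mj,ht->ijt', M3, W, W, W, optimize=True)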
Related
Using python/numpy, I have the following np.einsum:
np.einsum('abde,abc->bcde', X, Y)
Y is sparse: for each [a, b], exactly one value of c is 1 and all others are 0.
For an example of the relative sizes of the axes, X.shape is on the order of (1000, 5, 30, 30), and Y.shape is correspondingly (1000, 5, 300).
This operation is extremely costly; I want to make it more performant. For one thing, einsum is not parallelized. For another, because Y is sparse, I'm effectively computing 300x the number of multiplication operations I should be doing. In fact, when I wrote the equivalent of this einsum using a loop over n, I got a speed-up of around 3x. But that's clearly not very good.
How should I approach making this more performant? I've tried using np.tensordot, but I could not figure out how to get what I want from it (and I still run into the sparse/dense problem).
If Y only contains 1 and 0 then the einsum basically does this:
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
I, J, K = np.nonzero(Y)
result[J, K] += X[I, J]
But this doesn't give the correct result due to duplicate j, k indices.
I couldn't get numpy.add.at to work, but a loop over just these indices is still pretty fast, at least for the given shapes and sparsity.
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
for i, j, k in zip(*np.nonzero(Y)):
    result[j, k] += X[i, j]
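For reference, an unbuffered-accumulation version with np.add.at would look roughly like this (untested here; np.add.at tends to be slow, which is consistent with the explicit loop being the faster option):
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
I, J, K = np.nonzero(Y)
np.add.at(result, (J, K), X[I, J])  # unbuffered add: repeated (j, k) pairs accumulate correctly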
This is the test code that I used:
a, b, c, d, e = 1000, 5, 300, 30, 30
X = np.random.randint(10, size=(a,b,d,e))
R = np.random.rand(a, b, c)
K = np.argmax(R, axis=2)
I, J = np.indices((a, b), sparse=True)
Y = np.zeros((a, b, c), int)
Y[I, J, K] = 1
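A quick check that this test Y really has exactly one nonzero c per (a, b) pair:
print((Y.sum(axis=2) == 1).all())  # expected: True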
You can do that pretty easily with Numba:
import numba
import numpy as np

@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute(x, y):
    na, nb, nd, ne = x.shape
    nc = y.shape[2]
    assert y.shape == (na, nb, nc)
    out = np.zeros((nb, nc, nd, ne))
    for b in numba.prange(nb):
        for a in range(na):
            for c in range(nc):
                yVal = y[a, b, c]
                if np.abs(yVal) != 0:
                    for d in range(nd):
                        for e in range(ne):
                            out[b, c, d, e] += x[a, b, d, e] * yVal
    return out
Note that it is faster to iterate over a and then b in a sequential code. That being said, for the code to be parallel, the loops have been swapped and the parallelization is performed over b (which is a small axis). A parallel reduction over the axis a would be more efficient, but this is unfortunately not easy to do with Numba (one needs to split the matrices into multiple chunks, since there is no simple way to create thread-local matrices).
Note that you can replace values like nd and ne with the actual value (i.e. 30) so that the compiler generates faster code specialized for this matrix size, as sketched below.
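A hypothetical specialization with the trailing sizes written out (assuming they really are fixed at 30) might look like this; everything else mirrors the kernel above:
@numba.njit(fastmath=True, parallel=True)
def compute_fixed(x, y):
    # Same loop nest as compute(), but with the d/e trip counts hard-coded to 30
    na, nb = x.shape[0], x.shape[1]
    nc = y.shape[2]
    out = np.zeros((nb, nc, 30, 30))
    for b in numba.prange(nb):
        for a in range(na):
            for c in range(nc):
                yVal = y[a, b, c]
                if yVal != 0.0:
                    for d in range(30):
                        for e in range(30):
                            out[b, c, d, e] += x[a, b, d, e] * yVal
    return out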
Here is the testing code:
np.random.seed(0)
x = np.random.rand(1000, 5, 30, 30)
y = np.random.rand(1000, 5, 300)
y[np.random.rand(*y.shape) > 0.1] = 0.0 # Make it sparse (90% of 0)
%time res = np.einsum('abde,abc->bcde', x, y) # 2.350 s
%time res2 = compute(x, y) # 0.074 s (0.061 s with hand-written sizes)
print(np.allclose(res, res2))
This is about 32 times faster on a 10-core Intel Skylake Xeon processor. It reaches a 38x speed-up with hand-written sizes. It does not scale very well due to the parallelization over the b axis, but using another axis would cause less efficient memory accesses.
If this is not enough, it may be a good idea to transpose x and y first so as to improve data locality (thanks to a more contiguous access pattern along the a axis) and scaling (by parallelizing over both the b and c axes). That being said, transpositions are generally expensive, so one would certainly need to optimize them to get an even better speed-up.
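A possible first step (the axis order here is only an assumption, and the kernel above would have to be rewritten to index the transposed arrays):
# Hypothetical layout change only; compute() itself would need matching index changes
xt = np.ascontiguousarray(np.transpose(x, (1, 0, 2, 3)))  # (nb, na, nd, ne): a becomes the second axis
yt = np.ascontiguousarray(np.transpose(y, (1, 2, 0)))     # (nb, nc, na): a is now the contiguous axis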
I have a matrix, say, P of size (X, Y). Also, I have two matrices, say, Kx and Ky, both of size (M, N), a matrix pk of size (M, N), and two vectors u and v of length X and Y respectively. For example, they can be defined as follows:
import numpy as np

M, N = 10, 20   # example sizes
X, Y = 5, 15

P = np.zeros((X, Y))
pk = np.random.rand(M, N)
Kx = np.random.rand(M, N)
Ky = np.random.rand(M, N)
u = np.random.rand(X)
v = np.random.rand(Y)
In the actual code they are not random, of course, but this should not matter for this example. The question is whether there exists a pure NumPy equivalent of the following:
for m in range(0, M):
    for n in range(0, N):
        for i in range(0, X):
            for j in range(0, Y):
                Arg = Kx[m,n]*u[i] + Ky[m,n]*v[j]
                P[i,j] += pk[m,n]*np.cos(Arg)
All M,N,X,Y are different, but X and Y can be the same if the solution does not exist otherwise.
A common strategy for eliminating for-loops in NumPy calculations is to work with higher-dimensional arrays.
Consider for example, the line
Arg = Kx[m,n]*u[i] + Ky[m,n]*v[j]
This line depends on the indices m, n, i and j. So Arg depends on m, n, i and j. This means Arg can be thought of as a 4-dimensional array indexed by m, n, i and j. So we can eliminate the 4 for-loops -- as far as Arg is concerned -- by computing
Kxu = Kx[:,:,np.newaxis]*u
Kyv = Ky[:,:,np.newaxis]*v
Arg = Kxu[:,:,:,np.newaxis] + Kyv[:,:,np.newaxis,:]
Kx[:,:,np.newaxis] has shape (M, N, 1), and u has shape (X,). Multiplying them together uses NumPy broadcasting to create an array of shape (M, N, X). Thus, above, new axes are used somewhat like placeholders, so that Arg ends up with 4 axes indexed by m,n,i,j in that order.
Similarly, P can be defined as
P = (pk[:,:,np.newaxis,np.newaxis]*np.cos(Arg)).sum(axis=0).sum(axis=0)
The sum(axis=0) (called twice) sums along the m and n axes, so that P ends up being a 2-dimensional array indexed by i and j only.
By working with these 4-dimensional arrays, we get to apply NumPy operations on whole NumPy arrays. In contrast, when using the 4 for-loops, we had to do computations value-by-value on scalars. Consider for example what np.cos(Arg) is doing when Arg is a 4-dimensional array. This off-loads the computation of all the cosines in one NumPy function call which does the underlying loop in compiled C code. This is much much faster than calling np.cos once for each scalar. This is the reason why working with the higher-dimensional arrays ends up being so much faster than the for-loop-based code.
import numpy as np
def orig(Kx, Ky, u, v, pk):
    M, N = Kx.shape
    X = u.size
    Y = v.size
    P = np.zeros((X, Y), dtype=pk.dtype)   # zeros, not empty, since we accumulate with +=
    for m in range(0, M):
        for n in range(0, N):
            for i in range(0, X):
                for j in range(0, Y):
                    Arg = Kx[m,n]*u[i] + Ky[m,n]*v[j]
                    P[i,j] += pk[m,n]*np.cos(Arg)
    return P

def alt(Kx, Ky, u, v, pk):
    Kxu = Kx[:,:,np.newaxis]*u
    Kyv = Ky[:,:,np.newaxis]*v
    Arg = Kxu[:,:,:,np.newaxis] + Kyv[:,:,np.newaxis,:]
    P = (pk[:,:,np.newaxis,np.newaxis]*np.cos(Arg)).sum(axis=0).sum(axis=0)
    return P
M, N = 10, 20
X, Y = 5, 15
Kx = np.random.random((M, N))
Ky = np.random.random((M, N))
u = np.random.random(X)
v = np.random.random(Y)
pk = np.random.random((M, N))
Sanity check (showing alt and orig return the same result):
In [57]: P2 = alt(Kx, Ky, u, v, pk)
In [58]: P1 = orig(Kx, Ky, u, v, pk)
In [59]: np.allclose(P1, P2)
Out[59]: True
A benchmark, showing alt is significantly faster than orig:
In [60]: %timeit orig(Kx, Ky, u, v, pk)
10 loops, best of 3: 33.6 ms per loop
In [61]: %timeit alt(Kx, Ky, u, v, pk)
1000 loops, best of 3: 349 µs per loop
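If the pk[:,:,np.newaxis,np.newaxis]*np.cos(Arg) intermediate becomes a memory concern, one variation is to replace the last line of alt with an np.einsum call that performs the weighted sum over the m and n axes directly (np.cos(Arg) is still formed, but the extra product array and the two .sum() passes are avoided); a sketch:
P = np.einsum('mn,mnij->ij', pk, np.cos(Arg))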
I want to calculate the squared Euclidean distance between two sets of points, inputs and testing. inputs is typically a real array of size ~(200, N), whereas testing is typically ~(1e8, N), and N is around 10. The distances should be scaled in each of the N dimensions, so I'd be summing the expression scale[j]*(inputs[i,j] - testing[ii,j])**2 (where scale is the scaling vector) over the N dimensions. I am trying to make this as fast as possible, particularly as N can be large. My first test is:
def old_version(inputs, testing, x0):
    nn, d1 = testing.shape
    n, d1 = inputs.shape
    b = np.zeros((n, nn))
    for d in xrange(d1):
        b += x0[d] * (((np.tile(inputs[:, d], (nn, 1)) -
                        np.tile(testing[:, d], (n, 1)).T))**2).T
    return b
Nothing too fancy. I then tried using scipy.spatial.distance.cdist, although I still have to loop through it to get the scaling right
def new_version(inputs, testing, x0):
    # import scipy.spatial.distance as dist
    nn, d1 = testing.shape
    n, d1 = inputs.shape
    b = np.zeros((n, nn))
    for d in xrange(d1):
        b += x0[d] * dist.cdist(inputs[:, d][:, None],
                                testing[:, d][:, None], 'sqeuclidean')
    return b
It would appear that new_version scales better (as N > 1000), but I'm not sure that I've gone as fast as possible here. Any further ideas much appreciated!
This code gave me a factor of 10 speed-up over your implementation; give it a try:
x = np.random.randn(200, 10)
y = np.random.randn(100000, 10)   # use an int size; randn does not accept 1e5 (a float)
scale = np.abs(np.random.randn(1, 10))
scale_sqrt = np.sqrt(scale)
dist_map = dist.cdist(x*scale_sqrt, y*scale_sqrt, 'sqeuclidean')
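For the benchmark below, suggested_version is assumed to be just this one-liner wrapped as a function taking the question's arguments; a sketch of what was presumably timed:
import scipy.spatial.distance as dist

def suggested_version(inputs, testing, x0):
    scale_sqrt = np.sqrt(x0.reshape(1, -1))  # assumes the scale vector x0 is non-negative
    return dist.cdist(inputs * scale_sqrt, testing * scale_sqrt, 'sqeuclidean')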
These are the test results:
In [135]: %timeit suggested_version(inputs, testing, x0)
1 loops, best of 3: 341 ms per loop
In [136]: %timeit op_version(inputs, testing, x00) (NOTICE: x00 is a reshape of x0)
1 loops, best of 3: 3.37 s per loop
Just make sure that when you go for the larger N you don't run low on memory. It can really slow things down.
I have a system of equations in the form of A*x = B where [A] is a tridiagonal coefficient matrix. Using the Numpy solver numpy.linalg.solve I can solve the system of equations for x.
See the example below of how I develop the tridiagonal [A] matrix and the {B} vector, and solve for x:
# Solve system of equations with a tridiagonal coefficient matrix
# uses numpy.linalg.solve
# use Python 3 print function
from __future__ import print_function
from __future__ import division
# modules
import numpy as np
import time
ti = time.clock()
#---- Build [A] array and {B} column vector
m = 1000 # size of array, make this 8000 to see time benefits
A = np.zeros((m, m)) # pre-allocate [A] array
B = np.zeros((m, 1)) # pre-allocate {B} column vector
A[0, 0] = 1
A[0, 1] = 2
B[0, 0] = 1
for i in range(1, m-1):
    A[i, i-1] = 7   # node-1
    A[i, i] = 8     # node
    A[i, i+1] = 9   # node+1
    B[i, 0] = 2
A[m-1, m-2] = 3
A[m-1, m-1] = 4
B[m-1, 0] = 3
print('A \n', A)
print('B \n', B)
#---- Solve using numpy.linalg.solve
x = np.linalg.solve(A, B) # solve A*x = B for x
print('x \n', x)
#---- Elapsed time for each approach
print('NUMPY time', time.clock()-ti, 'seconds')
So my question relates to two sections of the above example:
Since I am dealing with a tridiagonal matrix for [A], also called a banded matrix, is there a more efficient way to solve the system of equations instead of using numpy.linalg.solve?
Also, is there a better way to create the tridiagonal matrix instead of using a for-loop?
The above example runs on Linux in about 0.08 seconds according to the time.clock() function.
The numpy.linalg.solve function works fine, but I'm trying to find an approach that takes advantage of the tridiagonal form of [A] in hopes of speeding up the solution even further and then apply that approach to a more complicated example.
There are two immediate performance improvements: (1) do not use a loop, (2) use scipy.linalg.solve_banded().
I would write the code more like this:
import numpy as np
import scipy.linalg as la

# Create arrays and set values
ab = np.zeros((3, m))
b = 2*np.ones(m)
ab[0] = 9
ab[1] = 8
ab[2] = 7

# Fix end points
ab[0, 1] = 2
ab[1, 0] = 1
ab[1, -1] = 4
ab[2, -2] = 3
b[0] = 1
b[-1] = 3

x = la.solve_banded((1, 1), ab, b)
There may be more elegant ways to construct the matrix, but this works.
Using %timeit in IPython, the original code took 112 ms for m=1000. This code takes 2.94 ms for m=10,000, an order of magnitude larger problem yet still almost two orders of magnitude faster! I did not have the patience to wait on the original code for m=10,000. Most of the time in the original may be spent constructing the array; I did not test this. Regardless, for large arrays it is much more efficient to only store the non-zero values of the matrix.
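For the second part of the question (building the dense tridiagonal matrix without a Python loop), one loop-free possibility is np.diag; a sketch reproducing the question's [A] and {B}:
A = (np.diag(np.full(m - 1, 7.0), -1)
     + np.diag(np.full(m, 8.0))
     + np.diag(np.full(m - 1, 9.0), 1))
A[0, 0], A[0, 1] = 1, 2
A[-1, -2], A[-1, -1] = 3, 4
B = np.full((m, 1), 2.0)
B[0, 0], B[-1, 0] = 1, 3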
There is a scipy.sparse matrix type called scipy.sparse.dia_matrix which captures the structure of your matrix well (it will store 3 arrays, in "positions" 0 (diagonal), 1 (above) and -1 (below)). Using this type of matrix you can try scipy.sparse.linalg.lsqr for solving. If your problem has an exact solution, it will be found, otherwise it will find the solution in least squares sense.
from scipy import sparse
from scipy.sparse import linalg as splinalg

A_sparse = sparse.dia_matrix(A)
ret_values = splinalg.lsqr(A_sparse, B.ravel())
x = ret_values[0]
However, this may not be completely optimal in terms of exploiting the tridiagonal structure; there may be a theoretical way of making this faster. What this conversion does do for you is cut down the matrix multiplication expenses to the essential: only the 3 bands are used. This, in combination with the iterative solver lsqr, should already yield a speedup.
Note: I am not proposing scipy.sparse.linalg.spsolve, because it converts your matrix to csr format. However, replacing lsqr with spsolve is worth a try, especially because spsolve can bind UMFPACK; see the relevant doc on spsolve. Also, it may be of interest to take a look at this stackoverflow question and answer relating to UMFPACK.
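A sketch of that spsolve variant, using the A and B from the question (csc is the format spsolve prefers):
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

x = spsolve(csc_matrix(A), B.ravel())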
You could use scipy.linalg.solveh_banded.
EDIT: You CANNOT use the above, as your matrix is not symmetric and I thought it was. However, as was mentioned in the comments above, the Thomas algorithm is great for this:
a = [7] * ( m - 2 ) + [3]
b = [1] + [8] * ( m - 2 ) + [4]
c = [2] + [9] * ( m - 2 )
d = [1] + [2] * ( m - 2 ) + [3]
# This is taken directly from the Wikipedia page also cited above
# this overwrites b and d
def TDMASolve(a, b, c, d):
    n = len(d)  # n is the number of rows; a and c have length n-1
    for i in xrange(n-1):
        d[i+1] -= 1. * d[i] * a[i] / b[i]
        b[i+1] -= 1. * c[i] * a[i] / b[i]
    for i in reversed(xrange(n-1)):
        d[i] -= d[i+1] * c[i] / b[i+1]
    return [d[i] / b[i] for i in xrange(n)]
This code is not optimized, nor does it use np, but if I (or any of the other fine folks here) have time, I will edit it so that it does those things. It currently times at ~10 ms for m=10000.
This will probably help.
There is a function create_tridiagonal which will create a tridiagonal matrix. There is another function, diagonal_form, which converts a square matrix into the diagonal ordered form required by SciPy's solve_banded function.
import numpy as np
def lu_decomp3(a):
    """
    c,d,e = lu_decomp3(a).
    LU decomposition of tridiagonal matrix a = [c\d\e]. On output
    {c}, {d} and {e} are the diagonals of the decomposed matrix a.
    """
    n = np.diagonal(a).size
    assert(np.all(a.shape == (n, n)))  # check if square matrix
    d = np.copy(np.diagonal(a))  # without copy, an "assignment destination is read-only" error is raised
    e = np.copy(np.diagonal(a, 1))
    c = np.copy(np.diagonal(a, -1))
    for k in range(1, n):
        lam = c[k-1]/d[k-1]
        d[k] = d[k] - lam*e[k-1]
        c[k-1] = lam
    return c, d, e
def lu_solve3(c, d, e, b):
    """
    x = lu_solve3(c,d,e,b).
    Solves [c\d\e]{x} = {b}, where {c}, {d} and {e} are the
    vectors returned from lu_decomp3.
    """
    n = len(d)
    y = np.zeros_like(b)
    y[0] = b[0]
    for k in range(1, n):
        y[k] = b[k] - c[k-1]*y[k-1]
    x = np.zeros_like(b)
    x[n-1] = y[n-1]/d[n-1]  # there is no x[n]; it would be out of range
    for k in range(n-2, -1, -1):
        x[k] = (y[k] - e[k]*x[k+1])/d[k]
    return x
from scipy.sparse import diags

def create_tridiagonal(size=4):
    diag = np.random.randn(size)*100
    diag_pos1 = np.random.randn(size-1)*10
    diag_neg1 = np.random.randn(size-1)*10
    a = diags([diag_neg1, diag, diag_pos1], offsets=[-1, 0, 1], shape=(size, size)).todense()
    return a
a = create_tridiagonal(4)
b = np.random.randn(4)*10
print('matrix a is\n = {} \n\n and vector b is \n {}'.format(a, b))
c, d, e = lu_decomp3(a)
x = lu_solve3(c, d, e, b)
print("x from our function is {}".format(x))
print("check is answer correct ({})".format(np.allclose(np.dot(a, x), b)))
## Test Scipy
from scipy.linalg import solve_banded
def diagonal_form(a, upper=1, lower=1):
    """
    a is a numpy square matrix.
    This function converts a square matrix to diagonal ordered form.
    The returned matrix ab can be used directly with scipy.linalg.solve_banded.
    """
    n = a.shape[1]
    assert(np.all(a.shape == (n, n)))
    ab = np.zeros((2*n-1, n))
    for i in range(n):
        ab[i, (n-1)-i:] = np.diagonal(a, (n-1)-i)
    for i in range(n-1):
        ab[(2*n-2)-i, :i+1] = np.diagonal(a, i-(n-1))
    mid_row_inx = int(ab.shape[0]/2)
    upper_rows = [mid_row_inx - i for i in range(1, upper+1)]
    upper_rows.reverse()
    upper_rows.append(mid_row_inx)
    lower_rows = [mid_row_inx + i for i in range(1, lower+1)]
    keep_rows = upper_rows + lower_rows
    ab = ab[keep_rows, :]
    return ab
ab = diagonal_form(a, upper=1, lower=1) # for tridiagonal matrix upper and lower = 1
x_sp = solve_banded((1,1), ab, b)
print("is our answer the same as scipy answer ({})".format(np.allclose(x, x_sp)))
I have to apply some mathematical formula that I've written
in python as:
for s in range(tdim):
    sum1 = 0.0
    for i in range(dim):
        for j in range(dim):
            sum1 += (0.5*np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])
                     - 0.5*np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j]))
    PHi2.append(sum1)
Now, this is correct but clearly inefficient. The other way to do it is:
for i in range(dim):
    for j in range(dim):
        PHi2 = 0.5*np.cos(theta*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])-0.5*np.sin(theta*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])
However, the second example gives me the same number in all elements of PHi2, so it is faster but the answer is wrong. How can I do this correctly and more efficiently?
NOTE: eig1 and eig2 have the same dimension d, and theta and PHi2 have the same dimension D, but d != D.
You can use a brute force broadcasting approach, but you are creating an intermediate array of shape (D, d, d), which can get out of hand if your arrays are even moderately large. Furthermore, in using broadcasting with no refinements you are recomputing a lot of calculations from the innermost loop that you only need to do once. If you first compute the necessary parameters for all possible values of i - j and add them together, you can reuse those values on the outer loop, e.g.:
def fast_ops(eig1, eig2, theta):
    d = len(eig1)
    d_arr = np.arange(d)
    i_j = d_arr[:, None] - d_arr[None, :]
    reidx = i_j + d - 1
    mult1 = eig1[:, None] * eig1[None, :] + eig2[:, None] + eig2[None, :]
    mult2 = eig1[None, :] * eig2[:, None] - eig1[:, None] * eig2[None, :]
    mult1_reidx = np.bincount(reidx.ravel(), weights=mult1.ravel())
    mult2_reidx = np.bincount(reidx.ravel(), weights=mult2.ravel())
    angles = theta[:, None] * np.arange(1 - d, d)
    return 0.5 * (np.einsum('ij,j->i', np.cos(angles), mult1_reidx) -
                  np.einsum('ij,j->i', np.sin(angles), mult2_reidx))
If we rewrite M4rtini's code as a function for comparison:
def fast_ops1(eig1, eig2, theta):
    d = len(eig1)
    D = len(theta)
    s = np.array(range(D))[:, None, None]
    i = np.array(range(d))[:, None]
    j = np.array(range(d))
    ret = 0.5 * (np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j]) -
                 np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j]))
    return ret.sum(axis=(-1, -2))
And we make up some data:
d, D = 100, 200
eig1 = np.random.rand(d)
eig2 = np.random.rand(d)
theta = np.random.rand(D)
The speed improvement is very noticeable: 80x on top of the 115x over your original code, leading to a whopping 9000x speed-up:
In [22]: np.allclose(fast_ops1(eig1, eig2, theta), fast_ops(eig1, eig2, theta))
Out[22]: True
In [23]: %timeit fast_ops1(eig1, eig2, theta)
10 loops, best of 3: 145 ms per loop
In [24]: %timeit fast_ops(eig1, eig2, theta)
1000 loops, best of 3: 1.85 ms per loop
This works by broadcasting. For tdim = 200 and dim = 100, the original takes 14 seconds and this version takes 120 ms.
s = np.array(range(tdim))[:, None, None]
i = np.array(range(dim))[:, None]
j = np.array(range(dim))
PHi2 = (0.5*np.cos(theta[s]*(i-j))*(eig1[i]*eig1[j]+eig2[i]+eig2[j])
        - 0.5*np.sin(theta[s]*(i-j))*(eig1[j]*eig2[i]-eig1[i]*eig2[j])).sum(axis=2).sum(axis=1)
In the first bit of code, you have 0.5*np.cos(theta[s]*(i-j))... but in the second it's 0.5*np.cos(theta*(i-j)).... Unless you've got theta defined differently for the second bit of code, this could well be the cause of the trouble.