Combining element-wise and matrix multiplication with multi-dimensional arrays in NumPy

I have two multidimensional NumPy arrays, A and B, with A.shape = (K, d, N) and B.shape = (K, N, d). I would like to perform an element-wise operation over axis 0 (K), with that operation being matrix multiplication over axes 1 and 2 (d, N and N, d). So the result should be a multidimensional array C with C.shape = (K, d, d), so that C[k] = np.dot(A[k], B[k]). A naive implementation would look like this:
C = np.vstack([np.dot(A[k], B[k])[np.newaxis, :, :] for k in range(K)])
but this implementation is slow. A slightly faster approach looks like this:
C = np.dot(A, B)[:, :, 0, :]
which uses the default behaviour of np.dot on multidimensional arrays, giving me an array with shape (K, d, K, d). However, this approach computes the required answer K times (each of the entries along axis 2 is the same). Asymptotically it will be slower than the first approach, but the overhead is much less. I am also aware of the following approach:
from numpy.core.umath_tests import matrix_multiply
C = matrix_multiply(A, B)
but I am not guaranteed that this function will be available. My question is thus, does NumPy provide a standard way of doing this efficiently? An answer which applies to multidimensional arrays in general would be perfect, but an answer specific to only this case would be great too.
Edit: As pointed out by @Juh_, the second approach is incorrect. The correct version is:
C = np.dot(A, B).diagonal(axis1=0, axis2=2).transpose(2, 0, 1)
but the overhead added makes it slower than the first approach, even for small matrices. The last approach is winning by a long shot on all my timing tests, for small and large matrices. I'm now strongly considering using this if no better solution crops up, even if that would mean copying the numpy.core.umath_tests library (written in C) into my project.

A possible solution to your problem is:
C = np.sum(A[:,:,:,np.newaxis]*B[:,np.newaxis,:,:],axis=2)
However:
It is quicker than the vstack approach only if K is much bigger than d and N.
There might be some memory issues: the above solution allocates a K x d x N x d array (i.e. all possible product pairs, before summing). Actually I could not test with big K, d and N as I was running out of memory.
By the way, note that:
C = np.dot(A, B)[:, :, 0, :]
does not give the correct result. It tricked me because I first checked my method by comparing its results to those given by this np.dot command.
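To make the trade-off concrete, here is a minimal sketch (sizes invented for illustration) checking the broadcast-and-sum approach against an explicit loop:
import numpy as np

K, d, N = 10, 4, 6  # illustrative sizes only
A = np.random.rand(K, d, N)
B = np.random.rand(K, N, d)

# Build all pairwise products as a (K, d, N, d) array, then sum out the shared N axis
C = np.sum(A[:, :, :, np.newaxis] * B[:, np.newaxis, :, :], axis=2)

# Reference: one matrix product per slice along axis 0
C_ref = np.array([np.dot(A[k], B[k]) for k in range(K)])
print(np.allclose(C, C_ref))  # True
The (K, d, N, d) temporary is exactly what causes the memory blow-up mentioned above.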

I have this same issue in my project. The best I've been able to come up with is the following, which I think is a little faster (maybe 10%) than using vstack:
K, d, N = A.shape
C = np.empty((K, d, d))
for k in range(K):
    C[k] = np.dot(A[k], B[k])
I'd love to see a better solution, I can't quite see how one would use tensordot to do this.
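For what it's worth, a sketch of why a plain tensordot call doesn't quite fit: contracting over N pairs every k with every other k, so you get the same (K, d, K, d) array as np.dot and still have to extract the diagonal (sizes invented for illustration):
import numpy as np

K, d, N = 10, 4, 6  # illustrative sizes only
A = np.random.rand(K, d, N)
B = np.random.rand(K, N, d)

full = np.tensordot(A, B, axes=([2], [1]))  # shape (K, d, K, d)
C = full.diagonal(axis1=0, axis2=2).transpose(2, 0, 1)  # keep only matching k's
C_ref = np.array([np.dot(A[k], B[k]) for k in range(K)])
print(np.allclose(C, C_ref))  # True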

A very flexible, compact, and fast solution:
C = np.einsum('Kab,Kbc->Kac', A, B, optimize=True)
Confirmation:
import numpy as np
K = 10
d = 5
N = 3
A = np.random.rand(K,d,N)
B = np.random.rand(K,N,d)
C_old = np.dot(A, B).diagonal(axis1=0, axis2=2).transpose(2, 0, 1)
C_new = np.einsum('Kab,Kbc->Kac', A, B)
print(np.max(np.abs(C_old - C_new))) # should be 0 or a very small number
For large multi-dimensional arrays, the optional parameter optimize=True can save you a lot of time.
You can learn about einsum here:
https://ajcr.net/Basic-guide-to-einsum/
https://rockt.github.io/2018/04/30/einsum
https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
Quote:
The Einstein summation convention can be used to compute many multi-dimensional, linear algebraic array operations. einsum provides a succinct way of representing these. A non-exhaustive list of these operations is:
Trace of an array, numpy.trace.
Return a diagonal, numpy.diag.
Array axis summations, numpy.sum.
Transpositions and permutations, numpy.transpose.
Matrix multiplication and dot product, numpy.matmul numpy.dot.
Vector inner and outer products, numpy.inner numpy.outer.
Broadcasting, element-wise and scalar multiplication, numpy.multiply.
Tensor contractions, numpy.tensordot.
Chained array operations, in efficient calculation order, numpy.einsum_path.

You can do
np.matmul(A, B)
Look at https://numpy.org/doc/stable/reference/generated/numpy.matmul.html.
Should be faster than einsum for big enough K.
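As a quick sanity check, np.matmul (or the @ operator in Python 3.5+) agrees with the einsum answer above:
import numpy as np

K, d, N = 10, 5, 3
A = np.random.rand(K, d, N)
B = np.random.rand(K, N, d)

C_matmul = np.matmul(A, B)  # broadcasts the matrix product over axis 0; same as A @ B
C_einsum = np.einsum('Kab,Kbc->Kac', A, B)
print(np.allclose(C_matmul, C_einsum))  # True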

Related

How to make two arrays contiguous so that Numba can speed up np.dot()

I have the following code:
import numpy as np
from numba import jit
Nx = 15
Ny = 1000
v = np.ones((Nx,Ny))
v = np.reshape(v,(Nx*Ny))
A = np.random.rand(Nx*Ny,Nx*Ny,5)
B = np.random.rand(Nx*Ny,Nx*Ny,5)
C = np.random.rand(Nx*Ny,5)
@jit(nopython=True)
def dotplus(B, v, C):
    return np.dot(B, v) + C
k = 2
D = dotplus(B[:,:,k], v, C[:,k])
I get the following warning, which I guess refers to arrays B[:,:,k] and v:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 1d, C))
return np.dot(B, v0) + C
Is there a way to make the two arrays contiguous, so that Numba can speed up the code?
PS in case you're wondering about the meaning of k, note this is just a MRE. In the actual code, dotplus is called multiple times inside a for loop for different values of k (so, different slices of B and C). The for loop updates the values of v, but B and C don't change.
Flawr is correct. B[..., k] returns a view into B, but does not actually copy any data. In memory, two neighbouring elements of the view have a distance of B.strides[1], which evaluates to B.shape[-1]*B.itemsize and is greater than B.itemsize. Consequently, your array is not contiguous.
The best optimization is to vectorize the dotplus loop and write
D = np.tensordot(B, v, axes=(1, 0)) + C
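As a quick check that the tensordot form reproduces the per-k loop (with made-up, much smaller sizes than the question, purely for illustration):
import numpy as np

n, m = 50, 5  # much smaller than the question's sizes, for illustration
B = np.random.rand(n, n, m)
C = np.random.rand(n, m)
v = np.ones(n)

D = np.tensordot(B, v, axes=(1, 0)) + C  # one contraction replaces the loop over k
D_loop = np.stack([np.dot(B[:, :, k], v) + C[:, k] for k in range(m)], axis=1)
print(np.allclose(D, D_loop))  # True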
The second best optimization is to refactor and let the batch dimension be the first dimension of the array. This can be done on top of the above vectorization and is generally advisable. It would look something like
A = np.random.rand(5, Nx*Ny,Nx*Ny)
# rather than
A = np.random.rand(Nx*Ny,Nx*Ny,5)
If you can't refactor the code, you need to start profiling. You can easily swap axes temporarily via
B = np.moveaxis(B, -1, 0)
some_op(B[k, ...], ...)
B = np.moveaxis(B, 0, -1)
Contrary to max9111's comment, this will not net you anything compared to np.ascontiguousarray() because the data has to be copied in both cases. That said, a copy is O(Nx*Ny*k) + buffer allocation. Direct matrix-vector multiplication is O(Nx*Ny), but you have to gather the elements first, which is really expensive. It comes down to your specific architecture and concrete problem, so profiling is the way to go.
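If you do keep the per-k calls, a minimal sketch of the np.ascontiguousarray route (whether the copy pays off is, again, something to profile):
import numpy as np

n, m = 100, 5  # illustrative sizes only
B = np.random.rand(n, n, m)
k = 2

Bk = np.ascontiguousarray(B[:, :, k])  # contiguous copy of the strided slice
print(B[:, :, k].flags['C_CONTIGUOUS'])  # False: the view is strided
print(Bk.flags['C_CONTIGUOUS'])          # True: safe to pass to the jitted function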

Numpy array addition made as simple as np.einsum()?

If I have a.shape = (3,4,5) and b.shape = (3,5), using np.einsum() makes broadcasting then multiplying the two arrays super easy and explicit:
result = np.einsum('abc, ac -> abc', a, b)
But if I want to add the two arrays, so far as I can tell, I need two separate steps so that the broadcasting happens properly, and the code feels less explicit.
b = np.expand_dims(b, 1)
result = a + b
Is there way out there that allows me to do this array addition with the clarity of np.einsum()?
Broadcasting aligns trailing dimensions, so b (shape (3, 5)) will not broadcast against a (shape (3, 4, 5)) until you insert the middle axis explicitly. For adding these two arrays one could expand b in a one-liner as follows:
import numpy as np
a = np.random.rand(3,4,5); b = np.random.rand(3,5);
c = a + b[:, None, :] # c is shape of a, broadcasting occurs along 2nd dimension
Note this is no different than c = a + np.expand_dims(b, 1). In terms of clarity it is a personal style thing: I prefer broadcasting, others prefer einsum.
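For completeness, a one-liner check that the two spellings agree:
import numpy as np

a = np.random.rand(3, 4, 5)
b = np.random.rand(3, 5)
# b[:, None, :] and np.expand_dims(b, 1) both produce the same (3, 1, 5) view
print(np.allclose(a + b[:, None, :], a + np.expand_dims(b, 1)))  # True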

Numpy: function that creates block matrices

Say I have a dimension k. What I'm looking for is a function that takes k as an input and returns the following block matrix.
Let I be the k x k identity matrix and 0 be the k x k matrix of zeros.
That is:
def function(k):
    ...
    return matrix

function(2) -> np.array([[I, 0]])
function(3) -> np.array([[I, 0, 0],
                         [0, I, 0]])
function(4) -> np.array([[I, 0, 0, 0],
                         [0, I, 0, 0],
                         [0, 0, I, 0]])
function(5) -> np.array([[I, 0, 0, 0, 0],
                         [0, I, 0, 0, 0],
                         [0, 0, I, 0, 0],
                         [0, 0, 0, I, 0]])
That is, the output is a (k-1, k) block matrix with identity matrices on the diagonal blocks and zero matrices everywhere else.
What I've tried:
I know how to create any individual row, I just can't think of a way to put it into a function so that it takes a dimension, k, and spits out the matrix I need.
e.g.
np.block([[np.eye(3), np.zeros((3, 3)), np.zeros((3, 3))],
          [np.zeros((3, 3)), np.eye(3), np.zeros((3, 3))]])
Would be the desired output for k=3
scipy.linalg.block_diag seems like it might be on the right track...
IMO, np.eye already has everything you need, as you can set the number of rows and columns separately.
So your function should simply look like
def fct(k):
    return np.eye(k**2 - k, k**2)
If I understand you correctly, this should work:
a = np.concatenate((np.eye((k-1)*k),np.zeros([(k-1)*k,k])), axis=1)
(at least, when I set k=3 and compare with the np.block(...) expression you gave, both results are identical)
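A quick check (for k = 3) that both of the above reproduce the np.block construction from the question:
import numpy as np

k = 3
ref = np.block([[np.eye(k), np.zeros((k, k)), np.zeros((k, k))],
                [np.zeros((k, k)), np.eye(k), np.zeros((k, k))]])

via_eye = np.eye(k**2 - k, k**2)
via_concat = np.concatenate((np.eye((k-1)*k), np.zeros([(k-1)*k, k])), axis=1)
print(np.array_equal(ref, via_eye), np.array_equal(ref, via_concat))  # True True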
IIUC, you can also try np.fill_diagonal such that you create the right shape of matrices and then fill in the diagonal parts.
def make_block(k):
    arr = np.zeros(((k-1)*k, k*k))
    np.fill_diagonal(arr, 1)
    return arr
There are two interpretations of your question. One is where you are basically creating a matrix of the form [[1, 0, 0], [0, 1, 0]], which can be mathematically represented as [I 0], and another where each element contains its own NumPy array entirely (which limits what you can compute with the result, but might be what you want).
The former:
np.append(np.eye(k-1), np.zeros((k-1, 1)), axis=1)
The latter (a bit more complicated):
I = np.eye(m)  # whatever dimensions you want, although probably m == n
Z = np.zeros((m, n))  # the zero block
arr = np.empty((k-1, k), dtype=object)  # object dtype so each cell can hold a matrix
for i in range(k-1):
    for j in range(k):
        if i == j:
            arr[i, j] = np.array(I)
        else:
            arr[i, j] = np.array(Z)
I really have no idea how the second one would be useful, so I think you might be a bit confused about the fundamental structure of a block matrix if that's what you think you want. [A b], for example, with A a matrix and b a vector, is generally thought of as representing a single matrix, with block notation just existing for simplicity's sake. Hope this helps!

Speed up double for loop in numpy

I currently have the following double loop in my Python code:
for i in range(a):
    for j in range(b):
        A[:,i] *= B[j][:,C[i,j]]
(A is a float matrix, B is a list of float matrices, and C is a matrix of integers; by matrices I mean m x n np.arrays.
To be precise, the sizes are: A is m x a; B is b matrices of size m x l, with l different for each matrix; C is a x b. Here m is very large, a is very large, b is small, and the l's are even smaller than b.)
I tried to speed it up by doing
for j in range(b):
    A[:,:] *= B[j][:,C[:,j]]
but surprisingly to me this performed worse.
More precisely, this did improve performance for small values of m and a (the "large" numbers), but from m=7000, a=700 onwards the first approach is roughly twice as fast.
Is there anything else I can do?
Maybe I could parallelize? But I don't really know how.
(I am not committed to either Python 2 or 3)
Here's a vectorized approach, assuming B is a list of arrays that all have the same shape -
# Convert B to a 3D array
B_arr = np.asarray(B)
# Use advanced indexing to index into the last axis of B array with C
# and then do product-reduction along the second axis.
# Finally, we perform elementwise multiplication with A
A *= B_arr[np.arange(B_arr.shape[0]),:,C].prod(1).T
For cases with smaller a, we could run a loop that iterates through the length of a instead. Also, for more performance, it might be a better idea to store those elements into a separate 2D array instead and perform the elementwise multiplication only once after we get out of the loop.
Thus, we would have an alternative implementation like so -
range_arr = np.arange(B_arr.shape[0])
out = np.empty_like(A)
for i in range(a):
    out[:,i] = B_arr[range_arr,:,C[i,:]].prod(0)
A *= out
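A small correctness check of the vectorized form against the original double loop, with invented sizes (and assuming, as above, that all matrices in B share one shape):
import numpy as np

m, a, b, l = 8, 6, 3, 4  # invented test sizes
A = np.random.rand(m, a)
B = [np.random.rand(m, l) for _ in range(b)]
C = np.random.randint(0, l, size=(a, b))

A_loop = A.copy()
for i in range(a):
    for j in range(b):
        A_loop[:, i] *= B[j][:, C[i, j]]

B_arr = np.asarray(B)  # shape (b, m, l)
A_vec = A * B_arr[np.arange(b), :, C].prod(1).T
print(np.allclose(A_loop, A_vec))  # True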

NumPy tensordot MemoryError

I have two matrices -- A is 3033x3033, and X is 3033x20. I am running the following lines (as suggested in the answer to another question I asked):
n, d = X.shape
c = X.reshape(n, -1, d) - X.reshape(-1, n, d)
return np.tensordot(A.reshape(n, n, -1) * c, c, axes=[(0,1),(0,1)])
On the final line, Python simply stops and says "MemoryError". How can I get around this, either by changing some setting in Python or performing this operation in a more memory-efficient way?
Here is a function that does the calculation without any for loops and without any large temporary array. See the related question for a longer answer, complete with a test script.
def fbest(A, X):
    """Loop-free version: four small tensordot contractions, no (n, n, d) temporary."""
    KA_best = np.tensordot(A.sum(1)[:,None] * X, X, axes=[(0,), (0,)])
    KA_best += np.tensordot(A.sum(0)[:,None] * X, X, axes=[(0,), (0,)])
    KA_best -= np.tensordot(np.dot(A, X), X, axes=[(0,), (0,)])
    KA_best -= np.tensordot(X, np.dot(A, X), axes=[(0,), (0,)])
    return KA_best
I profiled the code with arrays of your sizes (profiling plot omitted).
I love sp.einsum by the way. It is a great place to start when speeding up array operations by removing for loops. You can do SOOOO much with one call to sp.einsum.
The advantage of np.tensordot is that it links to whatever fast numerical library you have installed (e.g. MKL). So tensordot will run faster, and in parallel, when you have the right libraries installed.
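A minimal benchmarking sketch (smaller n than the question's 3033, and assuming fbest from above is in scope) that also verifies fbest against the direct contraction:
import numpy as np
from timeit import timeit

n, d = 500, 20  # smaller than the question's 3033x20 case, for a quick test
A = np.random.rand(n, n)
X = np.random.rand(n, d)
c = X.reshape(n, 1, d) - X.reshape(1, n, d)

print(np.allclose(fbest(A, X), np.einsum('ij,ijk,ijl->kl', A, c, c)))  # True
print(timeit(lambda: fbest(A, X), number=3))
print(timeit(lambda: np.einsum('ij,ijk,ijl->kl', A, c, c), number=3))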
If you replace the final line with
return np.einsum('ij,ijk,ijl->kl',A,c,c)
you avoid creating the A.reshape(n, n, -1) * c (3033 by 3033 by 20) intermediate, which I think is your main problem.
My impression is that the version I give is probably slower (for cases where it doesn't run out of memory), but I haven't rigorously timed it.
It's possible you could go further and avoid creating c, but I can't immediately see how to do it. It'd be a case of writing the whole thing in terms of sums over matrix indices and seeing what it simplifies to.
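For the record, that index bookkeeping is exactly what the fbest answer above has already done. With c[i,j,k] = X[i,k] - X[j,k], expanding the product inside sum_ij A[i,j] * c[i,j,k] * c[i,j,l] gives four terms:
sum_i (sum_j A[i,j]) X[i,k] X[i,l]
+ sum_j (sum_i A[i,j]) X[j,k] X[j,l]
- sum_ij A[i,j] X[j,k] X[i,l]
- sum_ij A[i,j] X[i,k] X[j,l]
which correspond one-to-one to the four tensordot calls in fbest, none of which builds a temporary larger than n by d.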
You can employ a doubly-nested loop iterating along the last dimension of X. That last dimension is only 20, so hopefully it would still be efficient enough and, more importantly, leave a minimal memory footprint. Here's the implementation -
n, d = X.shape
c = X.reshape(n, -1, d) - X.reshape(-1, n, d)
out = np.empty((d, d))  # d is a small number: 20
for i in range(d):
    for j in range(d):
        out[i,j] = (A * c[:,:,i] * c[:,:,j]).sum()
return out
You can replace the last line with np.einsum -
out[i,j] = np.einsum('ij->',A*c[:,:,i]*c[:,:,j])
