I have some code here (used for gradient calculation); example values are in the comments:
dE_dx_strided = np.einsum('wxyd,ijkd->wxyijk', dE_dy, f)
# dE_dx_strided.shape = (64, 25, 25, 4, 4, 3)
imax, jmax, di, dj = dE_dx_strided.shape[1:5]
# imax, jmax, di, dj = (25, 25, 4, 4)
dE_dx = np.zeros_like(x)
# dE_dx.shape = (64, 28, 28, 3)
for i in range(imax):
    for j in range(jmax):
        dE_dx[:, i:i+di, j:j+dj, :] += dE_dx_strided[:, i, j, ...]
where dE_dx is the object of interest and dE_dx_strided is a 6-tensor that is, effectively, summed 'piecewise'. The pattern is reminiscent of a convolution along axes 1 and 2:
# Verbose convolution operation (not my actual implementation)
for i in range(imax):
    for j in range(jmax):
        # Vaguely similar, but with filter multiplication, and = instead of +=
        y[i, j] = np.sum(x[i:i+di, j:j+dj] * f)
My original idea was to make all elements of dE_dx_strided that are to be added to a single dE_dx[:, i:i+di, j:j+dj, :] lie along one axis, and then sum over it; but I couldn't get this to work.
Now I know that for loops aren't inherently slow, but is there a numpy-esque way to optimise this further, perhaps by reshaping, summing, strides, etc.?
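One loop-free formulation I'm aware of is a scatter-add with np.add.at over precomputed window indices; a sketch using the shapes above (np.add.at accumulates correctly over the overlapping windows, though it is often not faster than an explicit loop):
rows = np.arange(imax)[:, None, None, None] + np.arange(di)[None, None, :, None]  # (imax, 1, di, 1)
cols = np.arange(jmax)[None, :, None, None] + np.arange(dj)[None, None, None, :]  # (1, jmax, 1, dj)
dE_dx = np.zeros_like(x)
# rows/cols broadcast to (imax, jmax, di, dj); the indexed block matches dE_dx_strided's shape
np.add.at(dE_dx, (slice(None), rows, cols), dE_dx_strided)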
I am attempting a numpy.matmul call using the following variables:
Matrix A of dimensions (p, t, q)
Matrix B of dimensions (r, t).
A categories vector of shape (r,) with p categories, used to take slices of B and to define the index of A to use.
The multiplications are done iteratively using the indices of each category. For each category p_i, I extract from A a submatrix (t, q). Then, I multiply it with a subset of B of shape (x, t), where x is the mask defined by categories == p_i. Finally, the matrix multiplication of (x, t) and (t, q) produces the output (x, q), which is stored at S[x].
I have not been able to figure out a non-iterative version of this algorithm. The first snippet below shows an iterative solution. The second one is an attempt at what I would wish to get, where everything is calculated in a single step and would presumably be faster. However, it is incorrect because matrix A has three dimensions instead of two. Maybe there is no way to do this in NumPy with a single call; in general, I am looking for advice/ideas to try out.
Thanks!
import numpy as np
p, q, r, t = 2, 9, 512, 4
# data initialization (random)
np.random.seed(500)
S = np.random.rand(r, q)
A = np.random.randint(0, 3, size=(p, t, q))
B = np.random.rand(r, t)
categories = np.random.randint(0, p, r)
print('iterative') # iterative
for i in range(p):
    # print(i)
    a = A[i, :, :]
    mask = categories == i
    b = B[mask]
    print(b.shape, a.shape, S[mask].shape,
          np.matmul(b, a).shape)
    S[mask] = np.matmul(b, a)
print(S.shape)
A simple way to write it down:
S = np.random.rand(r, q)
print(A[:p,:,:].shape)
result = np.matmul(B, A[:p,:,:])
# iterative assignment
i = 0
S[categories == i] = result[i, categories == i, :]
i = 1
S[categories == i] = result[i, categories == i, :]
The next snippet will produce an error during the multiplication step.
# attempt to multiply once, indexing all categories only once (not possible)
np.random.seed(500)
S = np.random.rand(r, q)
# attempt to use the categories vector
a = A[categories, :, :]
b = B[categories]
# due to the shapes of the arrays, this multiplication is not possible
print('\nsingle step (error due to shapes of the matrix a')
print(b.shape, a.shape, S[categories].shape)
S[categories] = np.matmul(b, a)
print(S.shape)
iterative
(250, 4) (4, 9) (250, 9) (250, 9)
(262, 4) (4, 9) (262, 9) (262, 9)
(512, 9)
single step (error due to shapes of the 2nd matrix a).
(512, 4) (512, 4, 9) (512, 9)
In [63]: (np.ones((512,4))@np.ones((512,4,9))).shape
Out[63]: (512, 512, 9)
This is because the first array is broadcast to (1,512,4). I think you want instead to do:
In [64]: (np.ones((512,1,4))@np.ones((512,4,9))).shape
Out[64]: (512, 1, 9)
Then remove the middle dimension (e.g. with .squeeze(1)) to get a (512, 9) result.
Another way:
In [72]: np.einsum('ij,ijk->ik', np.ones((512,4)), np.ones((512,4,9))).shape
Out[72]: (512, 9)
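Applied to the question's arrays, either form reproduces the iterative S in a single call (a sketch using the question's variable names):
a = A[categories]                             # (512, 4, 9): each row's category matrix
S_vec = np.matmul(B[:, None, :], a)[:, 0, :]  # (512, 1, 4) @ (512, 4, 9) -> (512, 9)
# or equivalently: S_vec = np.einsum('it,itq->iq', B, A[categories])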
To remove the loop altogether, you can try this
bigmask = np.arange(p)[:, np.newaxis] == categories
C = np.matmul(B, A)
res = C[np.broadcast_to(bigmask[..., np.newaxis], C.shape)].reshape(r, q)
# `res` has the same rows as the iterative `S` but in the wrong order
# so we need to reorder the rows
sort_index = np.argsort(np.broadcast_to(np.arange(r), bigmask.shape)[bigmask])
assert np.allclose(S, res[sort_index])
Though I'm not sure it's much faster than the iterative version.
SQDIFF is defined as in the OpenCV documentation (I believe they omit channels):
R(x, y) = sum_{x', y'} (T(x', y') - I(x + x', y + y'))^2
which in plain NumPy should be
import numpy as np
import cv2 as cv

A = np.arange(27, dtype=np.float32)
A = A.reshape(3, 3, 3)  # The "image"
B = np.ones([2, 2, 3], dtype=np.float32)  # window
rw, rh = A.shape[0] - B.shape[0] + 1, A.shape[1] - B.shape[1] + 1  # End result size
result = np.zeros([rw, rh])
for i in range(rw):
    for j in range(rh):
        w = A[i:i + B.shape[0], j:j + B.shape[1]]
        res = B - w
        result[i, j] = np.sum(res ** 2)
cv_result = cv.matchTemplate(A, B, cv.TM_SQDIFF)  # this result is the same as the simple for loops
assert np.allclose(cv_result, result)
This is a comparatively slow solution. I have read about sliding_window_view but cannot get it to work correctly.
# This will fail with these large arrays but is ok for smaller ones
A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)
sqdiff = np.sum((B - locations) ** 2, axis=(-1,-2, -3, -4)) # This will fail with normal sized images
will fail with a MemoryError even though the result easily fits into memory. How can I produce the same results as the cv2.matchTemplate function in this faster way?
As a last resort, you may perform the computation in tiles, instead of computing "all at once".
np.lib.stride_tricks.sliding_window_view returns a view of the data, so it doesn't consume a lot of RAM.
The expression B - locations can't be a view; it requires RAM for storing an array with shape (781, 984, 1, 248, 249, 3) of float32 elements.
The total RAM for storing B - locations is 781*984*1*248*249*3*4 = 569,479,908,096 bytes (about 530 GiB).
To avoid storing all of B - locations in RAM at once, we may compute sqdiff in tiles, where each "tile" computation requires much less RAM.
A simple division into tiles uses every row as a tile: loop over the rows of sqdiff and compute the output row by row.
Example:
sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32) # Allocate an array for storing the result.
# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))
Executable code sample:
import numpy as np
import cv2
A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)
cv_result = cv2.matchTemplate(A, B, cv2.TM_SQDIFF) # this result is the same as the simple for loops
#sqdiff = np.sum((B - locations) ** 2, axis=(-1, -2, -3, -4)) # This will fail with normal sized images
sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32) # Allocate an array for storing the result.
# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))
assert np.allclose(cv_result, sqdiff)
I know the solution is a bit disappointing... But it is the only generic solution I could find.
The squared-difference map
SQDIFF(k, l) = sum_{m, n} (T(m, n) - I(k + m, l + n))^2
is equivalent to
SQDIFF = I^2 ⋆ 1_[m, n] - 2 (I ⋆ T) + T^2 ⋆ 1_[k, l]
where the 'star' (⋆) operation is a cross-correlation, the 1_[m, n] is a window the size of the template, and 1_[k, l] is a window with the size of the image (the last term is a constant, equal to the sum of the squared template).
You can compute the cross-correlation terms using scipy.signal.correlate and find the matches by looking for local minima in the squared-difference map.
You might want to do some non-minimum suppression too.
This solution will require orders of magnitude less memory to store.
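For illustration, a sketch of that decomposition with scipy.signal.correlate using FFT-based correlation (shapes match the question; the data here is a random placeholder, and FFT results differ from the direct loop only by floating-point error):
import numpy as np
from scipy.signal import correlate

A = np.random.rand(1028, 1232, 3).astype(np.float32)  # image (placeholder data)
B = np.random.rand(248, 249, 3).astype(np.float32)    # template (placeholder data)

win = np.ones_like(B)  # the 1_[m, n] window
sqdiff = (correlate(A ** 2, win, mode='valid', method='fft')
          - 2.0 * correlate(A, B, mode='valid', method='fft')
          + np.sum(B ** 2))[..., 0]  # shape (781, 984)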
For more help, please post a reproducible example with an image and template that are valid for the algorithm. Using noise will result in meaningless outputs.
I have an np.ndarray of 4 dimensions (x, y, z, p) and want to add up the results of applying a function to each matrix (y, z, p) along the x dimension.
What I want to do is something like:
a = np.random.random((4, 12, 10, 100))
collect += np.greater(a, 10)
Thus, collect should have the sum of np.greater(a[0], 10) + np.greater(a[1], 10) + np.greater(a[2], 10) + np.greater(a[3], 10) and shape (12, 10, 100).
Is there a way to do such thing with numpy without an explicit loop traversing all elements inside x dimension?
The simple solution for adding all the numbers along an axis is of course to add the numbers along that axis:
a = np.random.randint(20, size=(4, 12, 10, 100))
np.sum(a > 10, axis=0)
or more concisely:
(a > 10).sum(0)
There are other ways of doing the same thing. Absolutely massive overkill is the suggestion to use np.einsum on a single array. In this case, you do have to explicitly convert the input to an integer, since einsum does not promote booleans to integers, unlike sum:
np.einsum('ijkl->jkl', (a > 10).astype(int))
The condition np.greater(a, 10) is more intuitive as a > 10, and will always be false for np.random.random, since that generates in the range [0.0, 1.0).
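For instance, with a threshold the data can actually exceed, the one-liner behaves as intended:
a = np.random.random((4, 12, 10, 100))
collect = (a > 0.5).sum(axis=0)  # counts across the x axis; shape (12, 10, 100)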
Let's say I have a tensor shaped (1, 64, 128, 128) and I want to create a tensor of shape (1, 64, 255) holding the sums of all diagonals for every (128, 128) matrix (there are 1 main, 127 below, 127 above diagonals so in total 255). What I am currently doing is the following:
x = torch.rand(1, 64, 128, 128)
diag_sums = torch.zeros(1, 64, 255)
j = 0
for k in range(-127, 128):
    diag_sums[j, :, k + 127] = torch.diagonal(x, offset=k, dim1=-2, dim2=-1).sum(dim=2)
This is obviously very slow, since it is using Python loops and is not done in parallel with respect to k.
I don't think this can be done using torch.diagonal since the function explicitly uses a single int for the offset parameter. If I could pass a list there, this would work, but I guess it would be complicated to implement (requiring changes in PyTorch itself).
I think it could be possible to implement this using torch.einsum, but I cannot think of a way to do it.
So this is my question: how do I get the tensor described above?
Have you considered using torch.nn.functional.conv2d?
You can sum the diagonals with a diagonal filter sliding across the tensor with appropriate zero padding.
import torch
import torch.nn.functional as nnf
# construct a diagonal filter using `eye` function, shape it appropriately
f = torch.eye(x.shape[2])[None, None,...].repeat(x.shape[1], 1, 1, 1)
# compute the diagonal sum with appropriate zero padding
conv_diag_sums = nnf.conv2d(x, f, padding=(x.shape[2]-1,0), groups=x.shape[1])[..., 0]
Note that the result has a slightly different order than the one you computed in the loop:
diag_sums = torch.zeros(1, 64, 255)
for k in range(-127, 128):
    diag_sums[j, :, 127-k] = torch.diagonal(x, offset=k, dim1=-2, dim2=-1).sum(dim=2)
# compare
(conv_diag_sums == diag_sums).all()
returns True: they are the same.
Shai's answer works; however, it involves a lot of multiplications, due to the large size of the kernel. I figured out a way to do this for my use case. It is based on this answer for a similar question in NumPy: https://stackoverflow.com/a/35074207/6636290
I am doing the following:
digitized = np.sum(np.indices(a.shape), axis=0).ravel()  # label each element of the input matrix a by its anti-diagonal index i + j
digitized_tensor = torch.Tensor(digitized).int()
a_tensor = torch.Tensor(a)
torch.bincount(digitized_tensor, a_tensor.view(-1))
If I could figure out a way to do this entirely in PyTorch (without Numpy's indices function), this would be great, but this answers the question.
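For what it's worth, the same index grid can be built without NumPy using torch.arange (a sketch assuming a 2-D input, mirroring what np.indices plus the sum produces):
n, m = a_tensor.shape
idx = torch.arange(n)[:, None] + torch.arange(m)[None, :]  # i + j for every element
diag_sums = torch.bincount(idx.reshape(-1), a_tensor.reshape(-1))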
The previous answers work, but there is another faster solution using strides (and that only uses Pytorch).
First I'll explain with a matrix as it is easier to understand.
Given a matrix M of size (n, n), you can change the matrix strides so that the resulting matrix has M's diagonals as columns. Then you can just sum the columns to get your result.
import torch
def sum_all_diagonal_matrix(mat: torch.Tensor):
    n, _ = mat.shape
    zero_mat = torch.zeros((n, n))  # Zero matrix used for padding
    mat_padded = torch.cat((zero_mat, mat, zero_mat), 1)  # pad the matrix on left and right
    mat_strided = mat_padded.as_strided((n, 2 * n), (3 * n + 1, 1))  # Change the strides
    sum_diags = torch.sum(mat_strided, 0)  # Sum the resulting matrix's columns
    return sum_diags[1:]  # drop the leading always-zero entry
X = torch.arange(9).reshape(3,3)
print(X)
# tensor([[0, 1, 2],
# [3, 4, 5],
# [6, 7, 8]])
print(sum_all_diagonal_matrix(X))
# tensor([ 6., 10., 12., 6., 2.])
You can do exactly the same with one more dimension:
def sum_all_diagonal(mat: torch.Tensor):
    k, n, _ = mat.shape
    zero_mat = torch.zeros((k, n, n))
    mat_padded = torch.cat((zero_mat, mat, zero_mat), 2)
    mat_strided = mat_padded.as_strided((k, n, 2 * n), (3 * n * n, 3 * n + 1, 1))
    sum_diags = torch.sum(mat_strided, 1)
    return sum_diags[:, 1:]  # as in the 2-D case, drop the always-zero first column (2n - 1 diagonals remain)
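For the shapes in the question, a usage sketch (folding the batch and channel dimensions into one leading dimension):
x = torch.rand(1, 64, 128, 128)
diag_sums = sum_all_diagonal(x.view(-1, 128, 128)).view(1, 64, 255)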
I have a function evaluation with many parameters, and I want to vectorize the evaluation. Something like this:
import numpy as np

I, J, K = 100, 34, 6
i, j, k = np.arange(I), np.arange(J), np.arange(K)
i, j, k = np.meshgrid(i, j, k)
f = myfun(i, j, k)  # myfun: some vectorized function of the index grids
This is excellent. However, I now also have a parameter that I want to send to myfun, generated with some other function, that is invariant over some of the indices above, thus:
p = my_param_gen()
and let's say
p.shape
will output
(100, 6)
This would correspond to p being invariant over the index J. Now, I would like to expand the shape of p to be
(100, 34, 6)
in a meshgrid-kind of fashion, so that the new dimension simply repeats the existing values. How do I best do this? The approach should also work when adding many new dimensions. I have seen numpy.expand_dims, but it does not do this by itself.
Your i from the default meshgrid call has shape:
In [116]: i.shape
Out[116]: (34, 100, 6)
If p.shape is (100, 6), then p will broadcast with i, j, k without further change. That is, the p[None, :, :] expansion is automatic.
If you'd used i, j, k = np.meshgrid(i, j, k, indexing='ij'),
In [121]: i.shape
Out[121]: (100, 34, 6)
And p[:, None, :] would be needed for broadcasting (equivalently, np.expand_dims(p, 1)).
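If you ever need the expanded array materialized rather than relying on broadcasting, np.broadcast_to performs this meshgrid-style expansion explicitly (a sketch):
p_full = np.broadcast_to(p[:, None, :], (100, 34, 6))  # read-only view; call .copy() for a writable array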