NumPy template matching SQDIFF with `sliding_window_view`

SQDIFF is defined as in the OpenCV documentation (I believe their formula omits the channel index).
In plain NumPy with explicit Python loops, that should be:
import numpy as np
import cv2 as cv
A = np.arange(27, dtype=np.float32)
A = A.reshape(3, 3, 3)  # The "image"
B = np.ones([2, 2, 3], dtype=np.float32)  # window
rw, rh = A.shape[0] - B.shape[0] + 1, A.shape[1] - B.shape[1] + 1  # End result size
result = np.zeros([rw, rh])
for i in range(rw):
    for j in range(rh):
        w = A[i:i + B.shape[0], j:j + B.shape[1]]  # image patch under the window
        res = B - w
        result[i, j] = np.sum(res ** 2)
cv_result = cv.matchTemplate(A, B, cv.TM_SQDIFF)  # this result is the same as the simple for loops
assert np.allclose(cv_result, result)
This is a comparatively slow solution. I have read about sliding_window_view but cannot get it to work correctly.
# This will fail with these large arrays but is ok for smaller ones
A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)
sqdiff = np.sum((B - locations) ** 2, axis=(-1,-2, -3, -4)) # This will fail with normal sized images
This fails with a MemoryError even though the result itself easily fits in memory. How can I produce the same results as cv2.matchTemplate in this faster way?

As a last resort, you may perform the computation in tiles, instead of computing "all at once".
np.lib.stride_tricks.sliding_window_view returns a view of the data, so it doesn't consume a lot of RAM.
The expression B - locations can't use a view, and requires the RAM for storing an array with shape (781, 984, 1, 248, 249, 3) of float elements.
The total RAM for storing B - locations is 781*984*1*248*249*3*4 = 569,479,908,096 bytes.
To avoid holding all of B - locations in RAM at once, we can compute sqdiff in tiles, where each tile needs far less RAM.
The simplest tiling treats every row as a tile: loop over the rows of sqdiff and compute the output row by row.
Example:
sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32) # Allocate an array for storing the result.
# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))
Executable code sample:
import numpy as np
import cv2
A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)
cv_result = cv2.matchTemplate(A, B, cv2.TM_SQDIFF) # this result is the same as the simple for loops
#sqdiff = np.sum((B - locations) ** 2, axis=(-1, -2, -3, -4)) # This will fail with normal sized images
sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32) # Allocate an array for storing the result.
# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))
assert np.allclose(cv_result, sqdiff)
I know the solution is a bit disappointing... But it is the only generic solution I could find.
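The row-by-row idea generalizes to tiles of several rows, sized from a memory budget. The following is only a sketch: `sqdiff_tiled` and its `max_bytes` parameter are hypothetical names that do not appear in the answer above, and the budget covers the `template - windows` temporary only (squaring it allocates a second array of the same size).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sqdiff_tiled(image, template, max_bytes=256 * 2**20):
    """SQDIFF map computed in row tiles (hypothetical helper, not from the answer)."""
    windows = sliding_window_view(image, template.shape)  # a view, no copy
    out_rows, out_cols = windows.shape[:2]
    # Bytes used by the temporary (template - windows[i]) for ONE output row.
    bytes_per_row = out_cols * template.size * image.dtype.itemsize
    rows_per_tile = max(1, max_bytes // bytes_per_row)
    result = np.empty((out_rows, out_cols), dtype=image.dtype)
    for start in range(0, out_rows, rows_per_tile):
        stop = min(start + rows_per_tile, out_rows)
        diff = template - windows[start:stop]  # materializes only this tile
        result[start:stop] = np.sum(diff ** 2, axis=(-1, -2, -3, -4))
    return result
```

Larger tiles mean fewer Python-level iterations, at the cost of a larger temporary per step.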

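The OpenCV TM_SQDIFF definition
$$R(x, y) = \sum_{x', y'} \bigl(T(x', y') - I(x + x', y + y')\bigr)^2$$
is equivalent to
$$R = \bigl(T^2 \star 1_{[k, l]}\bigr) - 2\,(T \star I) + \bigl(1_{[m, n]} \star I^2\bigr)$$
where the 'star' operation is a cross-correlation, the 1_[m, n] is a window the size of the template, and 1_[k, l] is a window with the size of the image.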
You can compute the cross-correlation terms using 'scipy.signal.correlate' and find the matches by looking for local minima in the square difference map.
You might want to do some non-minimum suppression too.
This solution will require orders of magnitude less memory to store.
For more help, please post a reproducible example with an image and template that are valid for the algorithm. Using noise will result in meaningless outputs.
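As a rough sketch of the correlation approach described above: the squared difference expands into the sum of the squared template, minus twice the cross-correlation of image and template, plus the local sum of the squared image. The helper name `sqdiff_correlate` is hypothetical, and the inputs are cast to float64 because the subtraction of large, nearly equal terms can lose precision in float32.

```python
import numpy as np
from scipy.signal import correlate


def sqdiff_correlate(image, template):
    """SQDIFF via sum(T^2) - 2 * (T star I) + (ones star I^2) (hypothetical helper)."""
    image = image.astype(np.float64)
    template = template.astype(np.float64)
    # Cross term: sum over each window (and all channels) of I * T.
    cross = correlate(image, template, mode="valid", method="fft")
    # Local energy of the image under a template-sized window of ones.
    energy = correlate(image ** 2, np.ones_like(template), mode="valid", method="fft")
    result = np.sum(template ** 2) - 2.0 * cross + energy
    # For (H, W, C) inputs the channel axis collapses to length 1 in "valid" mode.
    return result.reshape(result.shape[:2])
```

Only arrays of roughly the image size are allocated, which is why this needs so much less memory than materializing every window.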

Related

Convert / "inflate" unaligned pixel data (bgr4) to a byte-aligned numpy array

I have an image in an esoteric format (BGR4) that I would like to load into numpy. In BGR4 individual pixels are byte aligned (thank god) and are comprised of 3 components (B, G, and R) encoded in a single byte. They are ordered like this: b0000BGGR.
Here is an example image with size (1, 2), aka. 2 pixels:
img_bytes = b"\x0F\x09" # this is how it looks in memory
img = np.array([[1, 3, 1], [1, 0, 1]], dtype=np.uint8) # this is my desired result
Since there are a lot of pixels in each image, what is the most performant way to inflate such an array?
I have the same question for BGR8 (ordered: bBBBGGGRR), but I assume the approach is similar, and I will cross that bridge when I get there :)

Here is a NumPy implementation that follows the suggestion @MichaelButscher made in the comments:
import numpy as np
img_bytes = b"\x0f\x09" # this is how it looks in memory
# b0000BGGR
b = 0b00001000
g = 0b00000110
r = 0b00000001
template = np.array([b, g, r], dtype=np.uint8)[:,None]  # one mask per channel, as a column
shifts = np.array([3, 1, 0], dtype=np.uint8)[:,None]  # right-shift that aligns each channel to bit 0
arr = np.frombuffer(img_bytes, dtype=np.uint8)
res = (arr & template) >> shifts  # shape (3, number_of_pixels)
print(res.T)
[[1 3 1]
 [1 0 1]]
You may want to tune transpose order for better performance.
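The same mask-and-shift idea should carry over to the BGR8 layout mentioned in the question (bBBBGGGRR). The sketch below assumes that bit layout; `unpack_pixels` is a hypothetical helper, and it places the channel axis last so that no transpose is needed.

```python
import numpy as np


def unpack_pixels(img_bytes, masks, shifts, height, width):
    """Split packed one-byte pixels into a (height, width, 3) uint8 array (hypothetical helper)."""
    packed = np.frombuffer(img_bytes, dtype=np.uint8).reshape(height, width, 1)
    masks = np.asarray(masks, dtype=np.uint8)
    shifts = np.asarray(shifts, dtype=np.uint8)
    # Broadcasting (H, W, 1) against (3,) yields one output channel per mask.
    return (packed & masks) >> shifts


# BGR4: b0000BGGR
bgr4 = unpack_pixels(b"\x0f\x09", [0b1000, 0b0110, 0b0001], [3, 1, 0], 1, 2)
# BGR8 (assumed layout bBBBGGGRR)
bgr8 = unpack_pixels(b"\xff\x09", [0b11100000, 0b00011100, 0b00000011], [5, 2, 0], 1, 2)
```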

Numba-compatible implementation of np.tile?

I'm working on some code for dehazing images, based on this paper, and I started with an abandoned Py2.7 implementation. Since then, particularly with Numba, I've made some real performance improvements (important since I'll have to run this on 8K images).
I'm pretty convinced my last significant performance bottleneck is in performing the box filter step (I've already shaved off almost a minute per image, but this last slow step is ~30s/image), and I'm close to getting it to run as nopython in Numba:
@njit  # Row dependencies means can't be parallel
def yCumSum(a):
    """
    Numba based computation of y-direction
    cumulative sum. Can't be parallel!
    """
    out = np.empty_like(a)
    out[0, :] = a[0, :]
    for i in prange(1, a.shape[0]):
        out[i, :] = a[i, :] + out[i - 1, :]
    return out
@njit(parallel=True)
def xCumSum(a):
    """
    Numba-based parallel computation
    of X-direction cumulative sum
    """
    out = np.empty_like(a)
    for i in prange(a.shape[0]):
        out[i, :] = np.cumsum(a[i, :])
    return out
@jit
def _boxFilter(m, r, gpu=hasGPU):
    if gpu:
        m = cp.asnumpy(m)
    out = __boxfilter__(m, r)
    if gpu:
        return cp.asarray(out)
    return out
@jit(fastmath=True)
def __boxfilter__(m, r):
    """
    Fast box filtering implementation, O(1) time.
    Parameters
    ----------
    m: a 2-D matrix data normalized to [0.0, 1.0]
    r: radius of the window considered
    Return
    -----------
    The filtered matrix m'.
    """
    # H: height, W: width
    H, W = m.shape
    # the output matrix m'
    mp = np.empty(m.shape)
    # cumulative sum over y axis
    ySum = yCumSum(m)  # np.cumsum(m, axis=0)
    # copy the accumulated values of the windows in y
    mp[0:r+1, :] = ySum[r:(2*r)+1, :]
    # differences in y axis
    mp[r+1:H-r, :] = ySum[(2*r)+1:, :] - ySum[:H-(2*r)-1, :]
    mp[(-r):, :] = np.tile(ySum[-1, :], (r, 1)) - ySum[H-(2*r)-1:H-r-1, :]
    # cumulative sum over x axis
    xSum = xCumSum(mp)  # np.cumsum(mp, axis=1)
    # copy the accumulated values of the windows in x
    mp[:, 0:r+1] = xSum[:, r:(2*r)+1]
    # difference over x axis
    mp[:, r+1:W-r] = xSum[:, (2*r)+1:] - xSum[:, :W-(2*r)-1]
    mp[:, -r:] = np.tile(xSum[:, -1][:, None], (1, r)) - xSum[:, W-(2*r)-1:W-r-1]
    return mp
There's plenty to do around the edges, but if I can get the tile operation as a nopython call, I can nopython the whole boxfilter step and get a big performance boost. I'm not super inclined to do something really really specific as I'd love to reuse this code elsewhere, but I wouldn't particularly object to it being limited to a 2D scope. For whatever reason I'm just staring at this and not really sure where to start.

np.tile is a bit too complicated to reimplement in full, but unless I'm misreading, it looks like you only need to take a vector and repeat it r times along a different axis.
A Numba-compatible way to do this is to write
y = x.repeat(r).reshape((-1, r))
Then x will be repeated r times along the second dimension, so that y[i, j] == x[i].
Example:
In [2]: x = np.arange(5)
In [3]: x.repeat(3).reshape((-1, 3))
Out[3]:
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])
If you want x to be repeated along the first dimension instead, just take the transpose y.T.
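As a quick sanity check, the two np.tile calls in the question's box filter can be compared against the repeat/reshape form in plain NumPy. This sketch leaves the Numba decorator out so it runs without Numba installed; the variable names `cols` and `rows` are only illustrative.

```python
import numpy as np

x = np.arange(5)
r = 3

# Same layout as np.tile(x[:, None], (1, r)): x repeated along the second axis.
cols = x.repeat(r).reshape((-1, r))
# Same layout as np.tile(x, (r, 1)): x repeated along the first axis.
rows = cols.T

same_cols = np.array_equal(cols, np.tile(x[:, None], (1, r)))
same_rows = np.array_equal(rows, np.tile(x, (r, 1)))
```

Note that `rows` is a transposed view and therefore not C-contiguous, which may matter if a later step expects contiguous input.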

How to choose elements out of a matrix randomly weighted

I am pretty new to Python and have some problems with randomness.
I am looking for something similar to RandomChoice in Mathematica.
I create a matrix of dimension, let's say, 10x3 with random numbers greater than 0. Let us call the total sum of every row s_i for i=0,...,9.
Later I want to choose, for every row, 2 out of 3 elements (no repetition) with weighted probability s_ij/s_i.
So I need something like this, but with weighted probabilities:
import random
import numpy as np
n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
print(aa)
help = [0, 1, 2]
dd = np.zeros((n, 2))
for i in range(n):
    cc = random.sample(help, 2)  # unweighted choice of 2 column indices
    dd[i, 0] = aa[i, cc[0]]
    dd[i, 1] = aa[i, cc[1]]
print(dd)
Speed is also an important factor here, since I will use this in a Monte Carlo approach (that's the reason I switched from Mathematica to Python), and I guess the above code can be improved heavily.
Thanks in advance for any tips/help.
EDIT: I now have the following, which is working but does not look like good code to me:
# pre-defined lists
nn = 3
aa = np.random.uniform(1000, 2500, (nn, 3))
help1 = [0, 1, 2]
help2 = aa.sum(axis=1)
# now I create a weighted probability list and fill it
help3 = np.zeros((nn, 3))
for i in range(nn):
    help3[i, 0] = aa[i, 0] / help2[i]
    help3[i, 1] = aa[i, 1] / help2[i]
    help3[i, 2] = aa[i, 2] / help2[i]
# every timestep when I have to choose 2 out of 3
help5 = np.zeros((nn, 2))
for i in range(nn):
    # cc = random.sample(help1, 2)
    help4 = np.random.choice(help1, 2, replace=False, p=[help3[i, 0], help3[i, 1], help3[i, 2]])
    help5[i, 0] = aa[i, help4[0]]  # index with help4; cc is not assigned in this version
    help5[i, 1] = aa[i, help4[1]]
print(help5)

As pointed out in the comments, np.random.choice accepts a probabilities parameter (p), so you can simply use that in a loop:
import numpy as np
# Make input data
np.random.seed(0)
n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
s = np.random.rand(n, 3)
# Normalize weights
s_norm = s / s.sum(1, keepdims=True)
# Output array
out = np.empty((n, 2), dtype=aa.dtype)
# Sample iteratively
for i in range(n):
    out[i] = aa[i, np.random.choice(3, size=2, replace=False, p=s_norm[i])]
This is not the most efficient way to do things, though, as usually using vectorized operations is much faster than looping. Unfortunately, I don't think there is any way to sample from multiple categorical distributions at the same time (see NumPy issue #15201). However, since you always want to get two elements out of three, you could sample the element that you want to remove (with inverted probabilities) and then keep the other two. This snippet does something like that:
import numpy as np
# Make input data
np.random.seed(0)
n = 10
aa = np.random.uniform(1000, 2500, (n, 3))
s = np.random.rand(n, 3)
print(s)
# [[0.26455561 0.77423369 0.45615033]
# [0.56843395 0.0187898 0.6176355 ]
# [0.61209572 0.616934 0.94374808]
# [0.6818203 0.3595079 0.43703195]
# [0.6976312 0.06022547 0.66676672]
# [0.67063787 0.21038256 0.1289263 ]
# [0.31542835 0.36371077 0.57019677]
# [0.43860151 0.98837384 0.10204481]
# [0.20887676 0.16130952 0.65310833]
# [0.2532916 0.46631077 0.24442559]]
# Invert weights
si = 1 / s
# Normalize
si_norm = si / si.sum(1, keepdims=True)
# Accumulate
si_cum = np.cumsum(si_norm, axis=1)
# Sample according to inverted probabilities
t = np.random.rand(n, 1)
idx = np.argmax(t < si_cum, axis=1)
# Get non-sampled indices
r = np.arange(3)
m = r != idx[:, np.newaxis]
choice = np.broadcast_to(r, m.shape)[m].reshape(n, -1)
print(choice)
# [[1 2]
# [0 2]
# [0 2]
# [1 2]
# [0 2]
# [0 2]
# [0 1]
# [1 2]
# [0 2]
# [1 2]]
# Get corresponding data
out = np.take_along_axis(aa, choice, 1)
One possible drawback of this is that the chosen elements will always be in order (that is, for a given row, you may get the pairs of indices (0, 1), (0, 2) or (1, 2), but not (1, 0), (2, 0) or (2, 1)).
Of course, if you really just need a few samples, then the loop is probably the most convenient and maintainable solution; the second one would only be useful if you need to do this at larger scale.
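If the ordering drawback matters, one alternative that is not part of the answer above is the Gumbel top-k trick: add independent Gumbel noise to the log-weights and keep the k largest keys per row. This draws k items without replacement with the same distribution as sequential weighted sampling, and the order of the picks is random too. The sketch assumes the weights are the row-normalized entries of `aa`, as in the question's edit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 2
aa = rng.uniform(1000, 2500, (n, 3))

# Assumption: each element's weight is its share of the row sum.
weights = aa / aa.sum(axis=1, keepdims=True)

# Gumbel top-k: perturb the log-weights and keep the k largest keys per row.
keys = np.log(weights) + rng.gumbel(size=weights.shape)
choice = np.argsort(-keys, axis=1)[:, :k]

# Gather the chosen elements row by row.
out = np.take_along_axis(aa, choice, axis=1)
```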

Sum all diagonals in feature maps in parallel in PyTorch

Let's say I have a tensor shaped (1, 64, 128, 128) and I want to create a tensor of shape (1, 64, 255) holding the sums of all diagonals for every (128, 128) matrix (there are 1 main, 127 below, 127 above diagonals so in total 255). What I am currently doing is the following:
x = torch.rand(1, 64, 128, 128)
diag_sums = torch.zeros(1, 64, 255)
j = 0
for k in range(-127, 128):
    diag_sums[j, :, k + 127] = torch.diagonal(x, offset=k, dim1=-2, dim2=-1).sum(dim=2)
This is obviously very slow, since it is using Python loops and is not done in parallel with respect to k.
I don't think this can be done using torch.diagonal since the function explicitly uses a single int for the offset parameter. If I could pass a list there, this would work, but I guess it would be complicated to implement (requiring changes in PyTorch itself).
I think it could be possible to implement this using torch.einsum, but I cannot think of a way to do it.
So this is my question: how do I get the tensor described above?

Have you considered using torch.nn.functional.conv2d?
You can sum the diagonals with a diagonal filter sliding across the tensor with appropriate zero padding.
import torch
import torch.nn.functional as nnf
# construct a diagonal filter using `eye` function, shape it appropriately
f = torch.eye(x.shape[2])[None, None,...].repeat(x.shape[1], 1, 1, 1)
# compute the diagonal sum with appropriate zero padding
conv_diag_sums = nnf.conv2d(x, f, padding=(x.shape[2]-1,0), groups=x.shape[1])[..., 0]
Note that the result has a slightly different order than the one you computed in the loop:
diag_sums = torch.zeros(1, 64, 255)
for k in range(-127, 128):
    diag_sums[0, :, 127 - k] = torch.diagonal(x, offset=k, dim1=-2, dim2=-1).sum(dim=2)
# compare
(conv_diag_sums == diag_sums).all()
This results in True: they are the same.

Shai's answer works, however it looks like it has a lot of multiplications, due to the large size of the kernel. I figured out a way to do this for my use case. It is based on this answer for a similar question in Numpy: https://stackoverflow.com/a/35074207/6636290
I am doing the following:
digitized = np.sum(np.indices(a.shape), axis=0).ravel()
digitized_tensor = torch.Tensor(digitized).int()
a_tensor = torch.Tensor(a)
torch.bincount(digitized_tensor, a_tensor.view(-1))
If I could figure out a way to do this entirely in PyTorch (without Numpy's indices function), this would be great, but this answers the question.
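A PyTorch-only variant of this binning idea might look like the sketch below. It is an assumption-laden illustration, not the answer's code: `diagonal_sums` is a hypothetical helper, it bins by `col - row` (so the output follows the offset order of torch.diagonal, from -(n-1) to n-1) rather than by `row + col`, and it uses `index_add_` instead of `bincount` so that leading batch dimensions are handled in one call.

```python
import torch


def diagonal_sums(x: torch.Tensor) -> torch.Tensor:
    """Sum every diagonal of the last two (square) dims (hypothetical helper)."""
    *batch, n, m = x.shape
    assert n == m, "expects square matrices in the last two dimensions"
    rows = torch.arange(n, device=x.device)[:, None]
    cols = torch.arange(n, device=x.device)[None, :]
    # Diagonal offset k = col - row is shifted to the bin index k + (n - 1).
    bins = (cols - rows + (n - 1)).reshape(-1)
    flat = x.reshape(-1, n * n)
    out = torch.zeros(flat.shape[0], 2 * n - 1, dtype=x.dtype, device=x.device)
    # Accumulate each matrix element into the bin of its diagonal.
    out.index_add_(1, bins, flat)
    return out.reshape(*batch, 2 * n - 1)
```

For a tensor shaped (1, 64, 128, 128) this returns a tensor shaped (1, 64, 255).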

The previous answers work, but there is another, faster solution using strides (and it only uses PyTorch).
First I'll explain with a matrix, as it is easier to understand.
Given a matrix M with size (n, n), you can change the matrix strides so that the resulting matrix has M's diagonals as columns. Then you can just sum the columns to get your result.
import torch
def sum_all_diagonal_matrix(mat: torch.Tensor):
    n, _ = mat.shape
    zero_mat = torch.zeros((n, n))  # Zero matrix used for padding
    mat_padded = torch.cat((zero_mat, mat, zero_mat), 1)  # pads the matrix on left and right
    print(mat_padded)
    mat_strided = mat_padded.as_strided((n, 2*n), (3*n + 1, 1))  # Change the strides
    print(mat_strided)
    sum_diags = torch.sum(mat_strided, 0)  # Sums the resulting matrix's columns
    return sum_diags[1:]  # column 0 only ever holds padding
X = torch.arange(9).reshape(3,3)
print(X)
# tensor([[0, 1, 2],
# [3, 4, 5],
# [6, 7, 8]])
print(sum_all_diagonal_matrix(X))
# tensor([ 6., 10., 12., 6., 2.])
You can do exactly the same with one more dimension:
def sum_all_diagonal(mat: torch.Tensor):
    k, n, _ = mat.shape
    zero_mat = torch.zeros((k, n, n))
    mat_padded = torch.cat((zero_mat, mat, zero_mat), 2)
    mat_strided = mat_padded.as_strided((k, n, 2*n), (3*n*n, 3*n + 1, 1))
    sum_diags = torch.sum(mat_strided, 1)
    return sum_diags[:, 1:]  # drop the all-padding column, as in the 2-D version

How to derive with respect to a Matrix element with Sympy

Given the product of a matrix and a vector
A.v
with A of shape (m,n) and v of dim n, where m and n are symbols, I need to calculate the Derivative with respect to the matrix elements.
I haven't found the way to use a proper vector, so I started with 2 MatrixSymbol:
from sympy import symbols, MatrixSymbol, diff, tensor
n, m = symbols('n m')
j = tensor.Idx('j')
i = tensor.Idx('i')
l = tensor.Idx('l')
h = tensor.Idx('h')
A = MatrixSymbol('A', n,m)
B = MatrixSymbol('B', m,1)
C=A*B
Now, if I try to derive with respect to one of A's elements with the indices I get back the unevaluated expression:
diff(C, A[i,j])
>>>> Derivative(A*B, A[i, j])
If I introduce the indices in C also (it won't let me use only one index in the resulting vector) I get back the product expressed as a Sum:
C[l,h]
>>>> Sum(A[l, _k]*B[_k, h], (_k, 0, m - 1))
If I derive this with respect to the matrix element I end up getting 0 instead of an expression with the KroneckerDelta, which is the result that I would like to get:
diff(C[l,h], A[i,j])
>>>> 0
I wonder if maybe I shouldn't be using MatrixSymbols to start with. How should I go about implementing the behaviour that I want to get?

SymPy does not yet know matrix calculus; in particular, one cannot differentiate MatrixSymbol objects. You can do this sort of computation with Matrix objects filled with arrays of symbols; the drawback is that the matrix sizes must be explicit for this to work.
Example:
from sympy import *
A = Matrix(symarray('A', (4, 5)))
B = Matrix(symarray('B', (5, 3)))
C = A*B
print(C.diff(A[1, 2]))
outputs:
Matrix([[0, 0, 0], [B_2_0, B_2_1, B_2_2], [0, 0, 0], [0, 0, 0]])
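Applied to the matrix-vector product from the question, the explicit-size approach can be checked element by element. The sizes below are arbitrary illustrative choices, and the names `product` and `derivatives` are hypothetical; the expected pattern is that the derivative of component l with respect to A[i, j] is v[j] when l equals i and zero otherwise.

```python
from sympy import Matrix, symarray

m, n = 3, 4  # explicit sizes, chosen only for illustration
A = Matrix(symarray("A", (m, n)))
v = Matrix(symarray("v", (n, 1)))
product = A * v  # column vector with m entries

# Derivative of every component of A*v with respect to every element of A.
derivatives = {
    (l, i, j): product[l, 0].diff(A[i, j])
    for l in range(m)
    for i in range(m)
    for j in range(n)
}
```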

The git version of SymPy (and the next version) handles this better:
In [55]: print(diff(C[l,h], A[i,j]))
Sum(KroneckerDelta(_k, j)*KroneckerDelta(i, l)*B[_k, h], (_k, 0, m - 1))
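For reference, the sum in that output collapses by hand as follows, assuming the column index j lies in the summation range 0, ..., m - 1 (which it does for a valid index of A):

```latex
\frac{\partial C_{lh}}{\partial A_{ij}}
  = \sum_{k=0}^{m-1} \frac{\partial A_{lk}}{\partial A_{ij}}\, B_{kh}
  = \sum_{k=0}^{m-1} \delta_{il}\,\delta_{kj}\, B_{kh}
  = \delta_{il}\, B_{jh}
```

This is the Kronecker-delta expression the question was looking for: the derivative is nonzero only in row l = i of the product, where it equals B[j, h].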
