Python CNN im2col function doesn't make sense

While learning about convolution in CNNs, I came across the im2col function below, and I don't understand the code.
def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
    #I DON'T KNOW!
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
Q1. I don't understand why the input data (img) is converted into a six-dimensional array (col). Why is it converted like that?
Q2. I'm not familiar with Python syntax, so I don't understand lines 9-13 of the code.
Could you explain it in C / C++ / Java?
I tried a lot to understand. Help me.
I'm sorry about the weird grammar. :(
Have a nice day and thank you all!
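One way to read the function, offered as a sketch rather than an authoritative answer: the six axes of col are (image, channel, filter row, filter column, output row, output column), and each slice assignment copies one filter offset for all output positions at once. The version below, with the hypothetical name im2col_loops, writes the same computation with one explicit loop per index, as it would be written in C or Java. The vectorized function from the question is repeated only so that the two can be compared.

```python
import numpy as np


def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    """Vectorized version from the question, repeated for comparison."""
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    img = np.pad(input_data, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            # img[..., y:y_max:stride, x:x_max:stride] picks, for every output
            # position (oy, ox), the pixel at (oy*stride + y, ox*stride + x).
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
    # Reorder the axes to (n, oy, ox, c, y, x), then flatten to 2-D.
    return col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)


def im2col_loops(input_data, filter_h, filter_w, stride=1, pad=0):
    """Same result, with one explicit loop per index (C-style)."""
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    img = np.pad(input_data, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N*out_h*out_w, C*filter_h*filter_w))
    for n in range(N):
        for oy in range(out_h):
            for ox in range(out_w):
                row = (n*out_h + oy)*out_w + ox        # one row per output pixel
                for c in range(C):
                    for y in range(filter_h):
                        for x in range(filter_w):
                            column = (c*filter_h + y)*filter_w + x
                            col[row, column] = img[n, c, oy*stride + y, ox*stride + x]
    return col


demo = np.arange(2*3*5*5, dtype=float).reshape(2, 3, 5, 5)
same = np.array_equal(im2col(demo, 3, 3, 1, 1), im2col_loops(demo, 3, 3, 1, 1))
```

Each row of the result holds the pixels of one receptive field, so the convolution becomes a single matrix product between col and the flattened filters.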

Related

How do I speed up this looping of pixels in PIL

I want to treat the r and g channel of a pixel and convert it from 0 <-> 255 to -1 <-> 1, then rotate (r, g) around (0,0) using the angle stored in rotations[i]. This is how I normally do it with regular for loops, but since the images I work with are ~4k*4k in dimensions, this takes a long time, and I would love to speed this up. I have little knowledge about parallelization, etc., but any resources would be helpful. I've tried libraries like joblib and multiprocessing, but I'm feeling as though I've made some fundamental mistake in those implementations usually resulting in some pickle error.
c = math.cos(rotations[i])
s = math.sin(rotations[i])
pixels = texture_.load()
for X in range(width):
    for Y in range(height):
        x = (pixels[X, Y][0]/255 -.5)*2
        y = (pixels[X, Y][1]/255 -.5)*2
        z = pixels[X, Y][2]
        x_ = x*c-y*s
        y_ = x*s+y*c
        x_ = 255*(x_/2+.5)
        y_ = 255*(y_/2+.5)
        pixels[X, Y] = (math.floor(x_), math.floor(y_), z)

Use numpy to vectorize the computation and compute all individual elements at once in a matrix style computation.
Try something like this:
import numpy as np
pixels = np.array(texture_) # Convert the PIL image itself, not the object returned by load(); shape is (height, width, 3) for an RGB image
x = 2 * (pixels[:, :, 0]/255 - 0.5)
y = 2 * (pixels[:, :, 1]/255 - 0.5)
z = pixels[:, :, 2]
x_ = x * c - y * s
y_ = x * s + y * c
x_ = 255 * (x_ / 2 + .5)
y_ = 255 * (y_ / 2 + .5)
pixels[:, :, 0] = np.floor(x_)
pixels[:, :, 1] = np.floor(y_)
pixels[:, :, 2] = z
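One detail worth guarding against, shown here as a sketch with the hypothetical function name rotate_rg: a rotated (r, g) pair can leave the [-1, 1] range (its length can reach sqrt(2)), so the value mapped back to 0..255 may fall outside that range and would not fit in a uint8 array. The np.clip call is an addition that is not part of the answer above. The input is assumed to be an (H, W, 3) uint8 array, for example one obtained from an RGB image.

```python
import math
import numpy as np


def rotate_rg(pixels, angle):
    """Rotate the (r, g) pair of every pixel of an (H, W, 3) uint8 array."""
    c, s = math.cos(angle), math.sin(angle)
    # Map 0..255 to -1..1 for the first two channels.
    x = 2 * (pixels[:, :, 0] / 255 - 0.5)
    y = 2 * (pixels[:, :, 1] / 255 - 0.5)
    # Rotate around (0, 0) and map back to 0..255.
    x_ = 255 * ((x * c - y * s) / 2 + 0.5)
    y_ = 255 * ((x * s + y * c) / 2 + 0.5)
    out = pixels.copy()
    # Clip before casting so out-of-range values saturate instead of wrapping.
    out[:, :, 0] = np.clip(np.floor(x_), 0, 255).astype(np.uint8)
    out[:, :, 1] = np.clip(np.floor(y_), 0, 255).astype(np.uint8)
    return out


demo = np.array([[[255, 255, 7], [0, 128, 9]]], dtype=np.uint8)
rotated = rotate_rg(demo, math.pi / 4)
```

The third channel is copied through unchanged, as in the original loop.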

fast <image,time> linear interpolation

I'm trying to achieve linear interpolation, where the data points are N images of shape: HxWx3 (stored in buf (NxHxWx3)), and the points to interpolate are specified in another (2D) grid (interp_values).
Non-vectorizable approach:
In principle I have made interp_values a HxW grid with values 0..N-1 indicating for each i,j element from which image (in buf) to read it from, including fractional values meaning interpolation.
E.g.: a value of 3.6 means blend 40% (1-0.6) of image 3 with 60% (0.6) of image 4. However with this approach it is quite impossible to vectorize the code, and performance was poor.
One vectorization approach:
So I changed interp_values to be an NxHxWx3 grid with values 0..1. Each column :,i,j,c specifies the blend coefficients for the N images, where only 1 or 2 elements are non-zero, e.g. for 3.6 we have: [0, 0, 0, 0.4, 0.6, 0, 0, ...]. I can convert interp_values from HxW to NxHxWx3 with:
def expand_interp_values(interp_values):
    r = np.zeros((N,) + interp_values.shape + (3,))
    for i in range(interp_values.shape[0]):
        for j in range(interp_values.shape[1]):
            v = interp_values[i, j]
            a, b, x = math.floor(v), math.ceil(v), math.fmod(v, 1)
            if int(a) == int(b):
                r[a, i, j, :] = 3 * [1]
            else:
                r[a, i, j, :] = 3 * [1 - x]
                r[b, i, j, :] = 3 * [x]
    return r
This representation is more sparse (many zeros) but now interpolation can be computed as element-wise multiplication between buf and interp_values (the multiplication part of the linear interpolation) followed by a sum(..., axis=0) (i.e. the addition part of the linear interpolation):
def linear_interp(data, interp_values):
    return np.sum(data * interp_values, axis=0)
With this approach there is some performance improvement; however, the CPU will be busy most of the time computing x1*0, x2*0, ... or 0 + 0 + 0...
Can this be improved further?
Additionally, the creation of the expanded interp_values grid is not vectorized, so performance would probably be bad if that grid has to be updated continuously.
Complete python+opencv code:
import cv2
import numpy as np
import math
vid = cv2.VideoCapture(0)
vid.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
vid.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
# store last N images into a NxHxWx3 grid (circular buffer):
N = 25
buf = None
interp_values = None
DOWNSAMPLING = 6
def linear_interp(data, interp_values):
    return np.sum(data * interp_values / 256, axis=0)
def expand_interp_values(interp_values):
    r = np.zeros((N,) + interp_values.shape + (3,))
    for i in range(interp_values.shape[0]):
        for j in range(interp_values.shape[1]):
            v = interp_values[i, j]
            a, b, x = math.floor(v), math.ceil(v), math.fmod(v, 1)
            if int(a) == int(b):
                r[a, i, j, :] = 3 * [1]
            else:
                r[a, i, j, :] = 3 * [1 - x]
                r[b, i, j, :] = 3 * [x]
    return r
while True:
    ret, frame = vid.read()
    H, W, Ch = frame.shape
    frame = cv2.resize(frame, dsize=(W//DOWNSAMPLING, H//DOWNSAMPLING), interpolation=cv2.INTER_LINEAR)
    # circular buffer:
    if buf is None:
        buf = np.zeros((N,) + frame.shape, dtype=np.uint8)
    # there should be a simpler way to a FIFO-grid...
    for i in reversed(range(1, N)):
        buf[i] = buf[i - 1]
    buf[0] = frame
    if interp_values is None:
        # create a lookup pattern here:
        interp_values = np.zeros(frame.shape[:2])
        for i in range(frame.shape[0]):
            for j in range(frame.shape[1]):
                y = i / (frame.shape[0] - 1) * 2 - 1
                x = j / (frame.shape[1] - 1) * 2 - 1
                #interp_values[i, j] = (N - 1) * min(1, math.hypot(x, y))
                interp_values[i, j] = (N - 1) * (y + 1) / 2
        interp_values = expand_interp_values(interp_values)
    im = linear_interp(buf, interp_values)
    im = cv2.resize(im, dsize=(W, H), interpolation=cv2.INTER_LANCZOS4)
    cv2.imshow('image', im)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
vid.release()
cv2.destroyAllWindows()
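A possible way to avoid both the multiplications by zero and the Python loops in expand_interp_values is to keep interp_values as the original HxW grid and gather only the two frames each pixel needs. The following is a sketch under that assumption, not a benchmarked solution; the function name linear_interp_gather is hypothetical, and it relies only on NumPy integer-array indexing.

```python
import numpy as np


def linear_interp_gather(buf, interp_values):
    """Blend, per pixel, the two frames around a fractional frame index.

    buf:           (N, H, W, 3) array of frames
    interp_values: (H, W) array of floats in [0, N - 1]
    """
    n = buf.shape[0]
    lower = np.floor(interp_values).astype(int)       # frame index a
    upper = np.minimum(lower + 1, n - 1)               # frame index b, kept in range
    frac = (interp_values - lower)[..., np.newaxis]    # (H, W, 1), broadcasts over channels
    rows, cols = np.indices(interp_values.shape)       # pixel coordinates (i, j)
    # buf[lower, rows, cols] picks frame lower[i, j] at pixel (i, j): shape (H, W, 3).
    return (1 - frac) * buf[lower, rows, cols] + frac * buf[upper, rows, cols]


rng = np.random.default_rng(0)
demo_buf = rng.integers(0, 256, size=(5, 4, 6, 3)).astype(np.uint8)
demo_values = rng.uniform(0, 4, size=(4, 6))
demo_out = linear_interp_gather(demo_buf, demo_values)
```

The result is a float array in the 0..255 range, so it would still need the division by 256 used in linear_interp before being shown with cv2.imshow.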

How to optimize/vectorize this python code (homography transform & cropping)?

I have an image INPUT, a painted CANVAS, and a homography matrix H. The code below "crops" the CANVAS (which is bigger than INPUT and was warped using H) to the size of INPUT. My code is so inefficient that the loops slow down the whole process: the example below effectively runs 1280*720 iterations for a 3-channel image (and I will be dealing with images/tensors with more channels). Is there a way to optimize/vectorize this process? Thanks in advance!
import numpy as np
INPUT = np.zeros((720, 1280, 3))  # np.zeros takes the shape as a single tuple
CANVAS = np.zeros((1548, 1104, 3))
H = np.random.uniform(size=(3, 3))
inputIm_shape = INPUT.shape
h, w, c = inputIm_shape
xs, ys, a = [], [], np.zeros((w, h))
# this is where the bottleneck happens
for index, _ in np.ndenumerate(a):
    xs.append(index[0]), ys.append(index[1])
input_coords = np.vstack(
    (np.array(xs), np.array(ys), np.ones(len(xs))))
transformed = np.matmul(H, input_coords)
transformed[0, :] = np.divide(transformed[0, :], transformed[2, :])
transformed[1, :] = np.divide(transformed[1, :], transformed[2, :])
map_image = np.zeros(inputIm_shape)
hc, wc, cc = CANVAS.shape
badcount = 0
# and this too
for k in range(0, input_coords.shape[1]):
    if int(transformed[0, k]) < 0 or int(transformed[0, k]) > wc - 1 or int(transformed[1, k]) < 0 or int(transformed[1, k]) > hc - 1:
        badcount += 1
    x_input = max(min(int(input_coords[0, k]), w - 1), 0)
    y_input = max(min(int(input_coords[1, k]), h - 1), 0)
    x_canvas = max(min(int(transformed[0, k]), wc - 1), 0)
    y_canvas = max(min(int(transformed[1, k]), hc - 1), 0)
    map_image[y_input, x_input] = CANVAS[y_canvas, x_canvas]
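Both loops can in principle be replaced by array operations. The sketch below is one possible vectorization, with the hypothetical function name crop_warped; it assumes that the third homogeneous coordinate is never zero, and it uses np.trunc to mimic the truncation toward zero that int() performs in the loop above.

```python
import numpy as np


def crop_warped(canvas, H, out_shape):
    """Sample canvas at H-transformed pixel coordinates of an (h, w) grid.

    Returns the cropped image and the number of out-of-range coordinates.
    """
    h, w = out_shape[:2]
    hc, wc = canvas.shape[:2]
    # xs[i, j] = j and ys[i, j] = i, both of shape (h, w).
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x (h*w)
    t = H @ coords
    tx = np.trunc(t[0] / t[2])   # truncate toward zero, like int()
    ty = np.trunc(t[1] / t[2])
    bad = (tx < 0) | (tx > wc - 1) | (ty < 0) | (ty > hc - 1)
    # Clamp to the canvas, then use the coordinates as integer indices.
    x_canvas = np.clip(tx, 0, wc - 1).astype(int)
    y_canvas = np.clip(ty, 0, hc - 1).astype(int)
    map_image = canvas[y_canvas, x_canvas].reshape(h, w, -1)
    return map_image, int(bad.sum())


rng = np.random.default_rng(0)
demo_canvas = rng.uniform(size=(10, 8, 3))
demo_H = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]])  # pure translation
demo_image, demo_bad = crop_warped(demo_canvas, demo_H, (6, 7, 3))
```

Because the indexing works on the first two axes only, the same function should also handle a canvas with more than three channels.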

How to find fundamental matrix based on other fundamental matrix and camera movement?

I am trying to speed up a multi-camera system that relies on calculating the fundamental matrix between each pair of cameras.
Please note that the following is pseudocode: @ means matrix multiplication, | means concatenation.
I have code to calculate F for each pair, calculate_f(camera_matrix1_3x4, camera_matrix2_3x4), and the naive solution is
for c1 in cameras:
    for c2 in cameras:
        if c1 != c2:
            f = calculate_f(c1.proj_matrix, c2.proj_matrix)
This is slow, and I would like to speed it up. I have ~5000 cameras.
I have precalculated all rotations and translations (in world coordinates) between every pair of cameras, and the internal parameters k, such that for each camera c it holds that c.matrix = c.k @ (c.rot | c.t)
Can I use the parameters r, t to help speed up the following calculations of F?
In mathematical form, for 3 different cameras c1, c2, c3 I have
f12=(c1.proj_matrix, c2.proj_matrix), and I want f23=(c2.proj_matrix, c3.proj_matrix), f13=(c1.proj_matrix, c3.proj_matrix) with some function f23, f13 = fast_f(f12, c1.r, c1.t, c2.r, c2.t, c3.r, c3.t)?
A working function for calculating the fundamental matrix in numpy:
import numpy as np
def fundamental_3x3_from_projections(p_left_3x4: np.ndarray, p_right__3x4: np.ndarray) -> np.ndarray:
    # The following is based on OpenCv-contrib's c++ implementation.
    # see https://github.com/opencv/opencv_contrib/blob/master/modules/sfm/src/fundamental.cpp#L109
    # see https://sourishghosh.com/2016/fundamental-matrix-from-camera-matrices/
    # see https://answers.opencv.org/question/131017/how-do-i-compute-the-fundamental-matrix-from-2-projection-matrices/
    f_3x3 = np.zeros((3, 3))
    p1, p2 = p_left_3x4, p_right__3x4
    # np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
    x = np.empty((3, 2, 4), dtype=np.float64)
    x[0, :, :] = np.vstack([p1[1, :], p1[2, :]])
    x[1, :, :] = np.vstack([p1[2, :], p1[0, :]])
    x[2, :, :] = np.vstack([p1[0, :], p1[1, :]])
    y = np.empty((3, 2, 4), dtype=np.float64)
    y[0, :, :] = np.vstack([p2[1, :], p2[2, :]])
    y[1, :, :] = np.vstack([p2[2, :], p2[0, :]])
    y[2, :, :] = np.vstack([p2[0, :], p2[1, :]])
    for i in range(3):
        for j in range(3):
            xy = np.vstack([x[j, :], y[i, :]])
            f_3x3[i, j] = np.linalg.det(xy)
    return f_3x3

NumPy is clearly not optimized for working on small matrices. The parsing of CPython input objects, the internal checks, and the function calls introduce a significant overhead that is far bigger than the time needed to perform the actual computation. The creation of many temporary arrays is also expensive. One way to solve this problem is to use Numba or Cython.
Moreover, the computation of the determinant can be optimized a lot, since you know the exact size of the matrix and part of the matrix does not change between iterations. Indeed, using a basic algebraic expression for the 4x4 determinant helps compilers optimize the overall computation, thanks to common sub-expression elimination (not performed by the CPython interpreter) and the removal of the complex loops/conditionals in np.linalg.det.
Here is the resulting code:
import numba as nb
import numpy as np
@nb.njit('float64(float64[:,::1])')
def det_4x4(mat):
    a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
    e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
    i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
    m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
    return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
           b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
           c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
           d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
@nb.njit('float64[:,::1](float64[:,::1], float64[:,::1])')
def fundamental_3x3_from_projections(p_left_3x4, p_right_3x4):
    f_3x3 = np.empty((3, 3))
    p1, p2 = p_left_3x4, p_right_3x4
    x = np.empty((3, 2, 4), dtype=np.float64)
    x[0, 0, :] = p1[1, :]
    x[0, 1, :] = p1[2, :]
    x[1, 0, :] = p1[2, :]
    x[1, 1, :] = p1[0, :]
    x[2, 0, :] = p1[0, :]
    x[2, 1, :] = p1[1, :]
    y = np.empty((3, 2, 4), dtype=np.float64)
    y[0, 0, :] = p2[1, :]
    y[0, 1, :] = p2[2, :]
    y[1, 0, :] = p2[2, :]
    y[1, 1, :] = p2[0, :]
    y[2, 0, :] = p2[0, :]
    y[2, 1, :] = p2[1, :]
    # The 4x4 buffer is allocated once; the y rows are refreshed per i, the x rows per j
    xy = np.empty((4, 4), dtype=np.float64)
    for i in range(3):
        xy[2:4, :] = y[i, :, :]
        for j in range(3):
            xy[0:2, :] = x[j, :, :]
            f_3x3[i, j] = det_4x4(xy)
    return f_3x3
This is 130 times faster on my machine (85.6 us VS 0.66 us).
You can speed up the process by a further factor of two if the applied function is commutative (i.e. f(c1, c2) == f(c2, c1)), because you then only need to compute the upper part. It turns out that your function has an interesting property: f(c1, c2) == f(c2, c1).T appears to always be true, so the lower part can be obtained by transposing the upper part. Another possible optimization is to run the loop in parallel.
With all these optimizations, the resulting program should be about three orders of magnitude faster.
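As a sketch of the "compute only the upper part" idea, the code below evaluates each unordered camera pair once and fills in the mirrored entry by transposition. The names fundamental_from_projections and all_pairs are hypothetical, and the determinant formula is written in plain NumPy (without Numba) only to keep the example self-contained.

```python
import itertools
import numpy as np

# Row pairs used to build the 2x4 blocks, as in the function above.
ROW_PAIRS = [(1, 2), (2, 0), (0, 1)]


def fundamental_from_projections(p1, p2):
    """F[i, j] = det of the 4x4 matrix stacking two rows of p1 and two rows of p2."""
    f = np.empty((3, 3))
    for i, rows_p2 in enumerate(ROW_PAIRS):
        for j, rows_p1 in enumerate(ROW_PAIRS):
            xy = np.vstack([p1[list(rows_p1)], p2[list(rows_p2)]])
            f[i, j] = np.linalg.det(xy)
    return f


def all_pairs(projections):
    """Return {(i, j): F_ij} for all i != j, computing only the pairs with i < j."""
    result = {}
    for i, j in itertools.combinations(range(len(projections)), 2):
        f = fundamental_from_projections(projections[i], projections[j])
        result[i, j] = f
        result[j, i] = f.T   # relies on f(c1, c2) == f(c2, c1).T
    return result


rng = np.random.default_rng(0)
demo_projections = [rng.uniform(-1, 1, size=(3, 4)) for _ in range(4)]
demo_pairs = all_pairs(demo_projections)
```

For thousands of cameras a dictionary of 3x3 arrays would be memory-hungry; a preallocated array indexed by camera pair may be a better fit, but that choice is outside the scope of this sketch.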
Analysis of the accuracy of the approach
The precision appears to be similar to that of the original. Depending on the input matrix, results are sometimes more accurate and sometimes less accurate than with the NumPy method. This is specifically due to the computation of the determinant. With 24-digit decimals, there is no visible error compared to the reliable result from Wolfram Alpha. This shows that the method is correct; the results are not identical because of numerical stability details. Here is the code used to test the accuracy of the methods:
# Imports
from decimal import Decimal
import numba as nb
import numpy as np
# Definitions
def det_4x4(mat):
    a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
    e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
    i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
    m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
    return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
           b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
           c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
           d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
@nb.njit('float64(float64[:,::1])')
def det_4x4_numba(mat):
    a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
    e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
    i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
    m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
    return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
           b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
           c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
           d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
# Example matrix
precise_xy = np.array(
    [[Decimal('42'),Decimal('-6248'),Decimal('4060'),Decimal('845')],
     [Decimal('-0.00992'),Decimal('-0.704'),Decimal('-0.71173298417'),Decimal('300.532')],
     [Decimal('-8.94274'),Decimal('-7554.39'),Decimal('604.57'),Decimal('706282')],
     [Decimal('-0.0132'),Decimal('-0.2757'),Decimal('-0.961'),Decimal('247.65')]]
)
xy = precise_xy.astype(np.float64)
res_numpy = Decimal(np.linalg.det(xy))
res_numba = Decimal(det_4x4_numba(xy))
res_precise = det_4x4(precise_xy)
# The Wolfram Alpha expression used is:
# det({{42,-6248,4060,845},
#      {-0.00992,-0.704,-0.71173298417,300.532},
#      {-8.94274,-7554.39,604.57,706282},
#      {-0.0132,-0.2757,-0.961,247.65}})
res_wolframalpha = Decimal('-323312.2164828991329828243')
# The result from Wolfram Alpha has 25-digit precision
# and is exactly the same as the one from det_4x4 using 24-digit decimals.
assert res_precise == res_wolframalpha
print(abs((res_numpy-res_precise)/res_precise)) # 1.7E-14
print(abs((res_numba-res_precise)/res_precise)) # 3.1E-14
# => Similar relative error (Numba slightly less accurate
# but both are not close to the 1e-16 relative epsilon)

Gradient not defined Tensorflow

I had asked a question and was implementing the solution when I found out that the operation tf.math.count_nonzero does not have a gradient defined. So I tried the following roundabout method:
eps = 1e-6
a = tf.ones((4, 4, 2, 2), tf.float32)
h = tf.linalg.svd(a, full_matrices=False, compute_uv=False)
cond = tf.less(h, eps)
h = tf.where(cond, tf.zeros(tf.shape(h)), h)
i = tf.reduce_sum(h, axis=-1)
j = h[:, :, 0]
rank_mat = tf.multiply(2., tf.ones((4, 4)))
cond = tf.not_equal(i, j)
rank_mat = tf.where(cond, rank_mat, tf.ones(tf.shape(rank_mat)))
cond = tf.equal(i, tf.zeros(shape=tf.shape(i), dtype=tf.float32))
rank_mat = tf.where(cond, tf.zeros(tf.shape(rank_mat)), rank_mat)
min_rank = tf.reduce_min(rank_mat)
Still the same error persists. I partly understand why this is happening, but is there a differentiable way of implementing this? Thanks.
