I want to treat the r and g channel of a pixel and convert it from 0 <-> 255 to -1 <-> 1, then rotate (r, g) around (0,0) using the angle stored in rotations[i]. This is how I normally do it with regular for loops, but since the images I work with are ~4k*4k in dimensions, this takes a long time, and I would love to speed this up. I have little knowledge about parallelization, etc., but any resources would be helpful. I've tried libraries like joblib and multiprocessing, but I'm feeling as though I've made some fundamental mistake in those implementations usually resulting in some pickle error.
c = math.cos(rotations[i])
s = math.sin(rotations[i])
pixels = texture_.load()
for X in range(width):
for Y in range(height):
x = (pixels[X, Y][0]/255 -.5)*2
y = (pixels[X, Y][1]/255 -.5)*2
z = pixels[X, Y][2]
x_ = x*c-y*s
y_ = x*s+y*c
x_ = 255*(x_/2+.5)
y_ = 255*(y_/2+.5)
pixels[X, Y] = (math.floor(x_), math.floor(y_), z)
Use numpy to vectorize the computation and compute all individual elements at once in a matrix style computation.
Try something like this:
import numpy as np
pixels = np.array(pixels) # Assuming shape of (width, length, 3)
x = 2 * (pixels[:, :, 0]/255 - 0.5)
y = 2 * (pixels[:, :, 1]/255 - 0.5)
z = pixels[:, :, 2]
x_ = x * c - y * s
y_ = x * s + y * c
x_ = 255 * (x_ / 2 + .5)
y_ = 255 * (y_ / 2 + .5)
pixels[:, :, 0] = np.floor(x_)
pixels[:, :, 1] = np.floor(y_)
pixels[:, :, 2] = z
I'm trying to achieve linear interpolation, where the data points are N images of shape: HxWx3 (stored in buf (NxHxWx3)), and the points to interpolate are specified in another (2D) grid (interp_values).
Non-vectorizable approach:
In principle I have made interp_values a HxW grid with values 0..N-1 indicating for each i,j element from which image (in buf) to read it from, including fractional values meaning interpolation.
E.g.: a value of 3.6 means blend 40% (1-0.6) of image 3 with 60% (0.6) of image 4. However with this approach it is quite impossible to vectorize the code, and performance was poor.
One vectorization approach:
So I changed interp_values to be a NxHxWx3 grid with values 0..1. Each column :,i,j,c would specify blend coefficients for the N images, where only 1 or 2 elements are non-zero, e.g. for 3.6 we have: [0, 0, 0, 0.6, 0.4, 0, 0, ...]. I can convert interp_values from HxW to NxHxWx3 with:
def expand_interp_values(interp_values):
r = np.zeros((N,) + interp_values.shape + (3,))
for i in range(interp_values.shape[0]):
for j in range(interp_values.shape[1]):
v = interp_values[i, j]
a, b, x = math.floor(v), math.ceil(v), math.fmod(v, 1)
if int(a) == int(b):
r[a, i, j, :] = 3 * [1]
else:
r[a, i, j, :] = 3 * [1 - x]
r[b, i, j, :] = 3 * [x]
return r
This representation is more sparse (many zeros) but now interpolation can be computed as element-wise multiplication between buf and interp_values (the multiplication part of the linear interpolation) followed by a sum(..., axis=0) (i.e. the addition part of the linear interpolation):
def linear_interp(data, interp_values):
return np.sum(data * interp_values, axis=0)
With this approach, there is some performance improvement, however it seems with this approach the CPU will be most of the times busy computing x1*0, x2*0, ... or 0 + 0 + 0...
Can this be improved any better?
Additionally, the creation of the expanded interp_values grid is not vectorized, so perhaps performance would be bad if that grid has to be updated continuously.
Complete python+opencv code:
import cv2
import numpy as np
import math
vid = cv2.VideoCapture(0)
vid.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
vid.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
# store last N images into a NxHxWx3 grid (circular buffer):
N = 25
buf = None
interp_values = None
DOWNSAMPLING = 6
def linear_interp(data, interp_values):
return np.sum(data * interp_values / 256, axis=0)
def expand_interp_values(interp_values):
r = np.zeros((N,) + interp_values.shape + (3,))
for i in range(interp_values.shape[0]):
for j in range(interp_values.shape[1]):
v = interp_values[i, j]
a, b, x = math.floor(v), math.ceil(v), math.fmod(v, 1)
if int(a) == int(b):
r[a, i, j, :] = 3 * [1]
else:
r[a, i, j, :] = 3 * [1 - x]
r[b, i, j, :] = 3 * [x]
return r
while True:
ret, frame = vid.read()
H, W, Ch = frame.shape
frame = cv2.resize(frame, dsize=(W//DOWNSAMPLING, H//DOWNSAMPLING), interpolation=cv2.INTER_LINEAR)
# circular buffer:
if buf is None:
buf = np.zeros((N,) + frame.shape, dtype=np.uint8)
# there should be a simpler way to a FIFO-grid...
for i in reversed(range(1, N)):
buf[i] = buf[i - 1]
buf[0] = frame
if interp_values is None:
# create a lookup pattern here:
interp_values = np.zeros(frame.shape[:2])
for i in range(frame.shape[0]):
for j in range(frame.shape[1]):
y = i / (frame.shape[0] - 1) * 2 - 1
x = j / (frame.shape[1] - 1) * 2 - 1
#interp_values[i, j] = (N - 1) * min(1, math.hypot(x, y))
interp_values[i, j] = (N - 1) * (y + 1) / 2
interp_values = expand_interp_values(interp_values)
im = linear_interp(buf, interp_values)
im = cv2.resize(im, dsize=(W, H), interpolation=cv2.INTER_LANCZOS4)
cv2.imshow('image', im)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
vid.release()
cv2.destroyAllWindows()
I have an image INPUT, a painted CANVAS, and a homography matrix H, this code below will "crop" the CANVAS (bigger than INPUT) which was warped using the homography matrix H to the size of the INPUT. But so far my code is so inefficient that the loop starts to slow down the whole process. The example code below will effectively produce 1280*720 of loops for 3 channels image (I will be dealing with larger channels image/tensor). Is there a way to optimize/vectorize this process? Thanks in advance!
INPUT = np.zeros(720, 1280, 3)
CANVAS = np.zeros(1548, 1104, 3)
H = np.random.uniform(size=(3, 3))
inputIm_shape = INPUT.shape
h, w, c = inputIm_shape
xs, ys, a = [], [], np.zeros((w, h))
# this is where the bottleneck happens
for index, _ in np.ndenumerate(a):
xs.append(index[0]), ys.append(index[1])
input_coords = np.vstack(
(np.array(xs), np.array(ys), np.ones(len(xs))))
transformed = np.matmul(H, input_coords)
transformed[0, :] = np.divide(transformed[0, :], transformed[2, :])
transformed[1, :] = np.divide(transformed[1, :], transformed[2, :])
map_image = np.zeros(inputIm_shape)
hc, wc, cc = CANVAS.shape
badcount = 0
# and this too
for k in range(0, input_coords.shape[1]):
if int(transformed[0, k]) < 0 or int(transformed[0, k]) > wc - 1 or int(transformed[1, k]) < 0 or int(transformed[1, k]) > hc - 1:
badcount += 1
x_input = max(min(int(input_coords[0, k]), w - 1), 0)
y_input = max(min(int(input_coords[1, k]), h - 1), 0)
x_canvas = max(min(int(transformed[0, k]), wc - 1), 0)
y_canvas = max(min(int(transformed[1, k]), hc - 1), 0)
map_image[y_input, x_input] = CANVAS[y_canvas, x_canvas]
I am trying to speed up some multi-camera system that relies on calculation of fundamental matrices between each camera pair.
Please notice the following is pseudocode. # means matrix multiplication, | means concatenation.
I have code to calculate F for each pair calculate_f(camera_matrix1_3x4, camera_matrix1_3x4), and the naiive solution is
for c1 in cameras:
for c2 in cameras:
if c1 != c2:
f = calculate_f(c1.proj_matrix, c2.proj_matrix)
This is slow, and I would like to speed it up. I have ~5000 cameras.
I have pre calculated all rotations and translations (in world coordinates) between every pair of cameras, and internal parameters k, such that for each camera c, it holds that c.matrix = c.k # (c.rot | c.t)
Can I use the parameters r, t to help speed up following calculations for F?
In mathematical form, for 3 different cameras c1, c2, c3 I have
f12=(c1.proj_matrix, c2.proj_matrix), and I want f23=(c2.proj_matrix, c3.proj_matrix), f13=(c1.proj_matrix, c3.proj_matrix) with some function f23, f13 = fast_f(f12, c1.r, c1.t, c2.r, c2.t, c3.r, c3.t)?
A working function for calculating the fundamental matrix in numpy:
def fundamental_3x3_from_projections(p_left_3x4: np.array, p_right__3x4: np.array) -> np.array:
# The following is based on OpenCv-contrib's c++ implementation.
# see https://github.com/opencv/opencv_contrib/blob/master/modules/sfm/src/fundamental.cpp#L109
# see https://sourishghosh.com/2016/fundamental-matrix-from-camera-matrices/
# see https://answers.opencv.org/question/131017/how-do-i-compute-the-fundamental-matrix-from-2-projection-matrices/
f_3x3 = np.zeros((3, 3))
p1, p2 = p_left_3x4, p_right__3x4
x = np.empty((3, 2, 4), dtype=np.float)
x[0, :, :] = np.vstack([p1[1, :], p1[2, :]])
x[1, :, :] = np.vstack([p1[2, :], p1[0, :]])
x[2, :, :] = np.vstack([p1[0, :], p1[1, :]])
y = np.empty((3, 2, 4), dtype=np.float)
y[0, :, :] = np.vstack([p2[1, :], p2[2, :]])
y[1, :, :] = np.vstack([p2[2, :], p2[0, :]])
y[2, :, :] = np.vstack([p2[0, :], p2[1, :]])
for i in range(3):
for j in range(3):
xy = np.vstack([x[j, :], y[i, :]])
f_3x3[i, j] = np.linalg.det(xy)
return f_3x3
Numpy is clearly not optimized for working on small matrices. The parsing of CPython input objects, internal checks and function calls introduce a significant overhead which is far bigger than the execution time need to perform the actual computation. Not to mention the creation of many temporary arrays is also expensive. One solution to solve this problem is to use Numba or Cython.
Moreover, the computation of the determinant can be optimized a lot since you know the exact size of the matrix and a part of the matrix does not always change. Indeed, using a basic algebraic expression for the 4x4 determinant help compilers to optimize a lot the overall computation thanks to the common sub-expression elimination (not performed by the CPython interpreter) and the removal of complex loops/conditionals in np.linalg.det.
Here is the resulting code:
import numba as nb
#nb.njit('float64(float64[:,::1])')
def det_4x4(mat):
a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
#nb.njit('float64[:,::1](float64[:,::1], float64[:,::1])')
def fundamental_3x3_from_projections(p_left_3x4, p_right_3x4):
f_3x3 = np.empty((3, 3))
p1, p2 = p_left_3x4, p_right_3x4
x = np.empty((3, 2, 4), dtype=np.float64)
x[0, 0, :] = p1[1, :]
x[0, 1, :] = p1[2, :]
x[1, 0, :] = p1[2, :]
x[1, 1, :] = p1[0, :]
x[2, 0, :] = p1[0, :]
x[2, 1, :] = p1[1, :]
y = np.empty((3, 2, 4), dtype=np.float64)
y[0, 0, :] = p2[1, :]
y[0, 1, :] = p2[2, :]
y[1, 0, :] = p2[2, :]
y[1, 1, :] = p2[0, :]
y[2, 0, :] = p2[0, :]
y[2, 1, :] = p2[1, :]
xy = np.empty((4, 4), dtype=np.float64)
for i in range(3):
xy[2:4, :] = y[i, :, :]
for j in range(3):
xy[0:2, :] = x[j, :, :]
f_3x3[i, j] = det_4x4(xy)
return f_3x3
This is 130 times faster on my machine (85.6 us VS 0.66 us).
You can speed up the process even more by a factor of two if the applied function is commutative (ie. f(c1, c2) == f(c2, c1)). If so, you could compute only the upper part. It turns out that your function have some interesting property since f(c1, c2) == f(c2, c1).T appear to be always true. Another possible optimization is to run the loop in parallel.
With all these optimizations, the resulting program should be about 3 order of magnitude faster.
Analysis of the accuracy of the approach
The precision provided appear to be similar than the original one. Regarding the input matrix, results are sometime more accurate and sometimes less accurate than the Numpy method. This is specifically due to the computation of the determinant. With 24-digit decimals, there is no visible error compared to the reliable result of Wolphram Alpha. This show that the method is correct, results as not the same due to numerical stability details. Here is the code used to test the accuracy of the methods:
# Imports
from decimal import Decimal
import numba as nb
# Definitions
def det_4x4(mat):
a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
#nb.njit('float64(float64[:,::1])')
def det_4x4_numba(mat):
a, b, c, d = mat[0,0], mat[0,1], mat[0,2], mat[0,3]
e, f, g, h = mat[1,0], mat[1,1], mat[1,2], mat[1,3]
i, j, k, l = mat[2,0], mat[2,1], mat[2,2], mat[2,3]
m, n, o, p = mat[3,0], mat[3,1], mat[3,2], mat[3,3]
return a * (f * (k*p - l*o) + g * (l*n - j*p) + h * (j*o - k*n)) + \
b * (e * (l*o - k*p) + g * (i*p - l*m) + h * (k*m - i*o)) + \
c * (e * (j*p - l*n) + f * (l*m - i*p) + h * (i*n - j*m)) + \
d * (e * (k*n - j*o) + f * (i*o - k*m) + g * (j*m - i*n))
# Example matrix
precise_xy = np.array(
[[Decimal('42'),Decimal('-6248'),Decimal('4060'),Decimal('845')],
[Decimal('-0.00992'),Decimal('-0.704'),Decimal('-0.71173298417'),Decimal('300.532')],
[Decimal('-8.94274'),Decimal('-7554.39'),Decimal('604.57'),Decimal('706282')],
[Decimal('-0.0132'),Decimal('-0.2757'),Decimal('-0.961'),Decimal('247.65')]]
)
xy = precise_xy.astype(np.float64)
res_numpy = Decimal(np.linalg.det(xy))
res_numba = Decimal(det_4x4_numba(xy))
res_precise = det_4x4(precise_xy)
# The Wolphram Alpha expression used is:
# det({{42,-6248,4060,845},
# {-0.00992,-0.704,-0.71173298417,300.532},
# {-8.94274,-7554.39,604.57,706282},
# {-0.0132,-0.2757,-0.961,247.65}})
res_wolframalpha = Decimal('-323312.2164828991329828243')
# The result got from Wolfram-Alpha have a 25-digit precision
# and is exactly the same than the one of det_4x4 using 24-digit decimals.
assert res_precise == res_wolframalpha
print(abs((res_numpy-res_precise)/res_precise)) # 1.7E-14
print(abs((res_numba-res_precise)/res_precise)) # 3.1E-14
# => Similar relative error (Numba slightly less accurate
# but both are not close to the 1e-16 relative epsilon)