How to optimize this loop

How to optimize this loop - python

input is a list of observations, every observation is fixed size set of elipses (every elipse is represented by 7 parameters).
output is a list of images, one image for one observation, we are basically putting elipses from observation to completely white image. If few elipses overlap then we are putting mean value of rgb values.
n, m = size of image in pixels, image is represented as (n, m, 3) numpy array (3 because of RGB coding)
N = number of elipses in every individual observation
xx, yy = np.mgrid[:n, :m]
def elipses_population_to_img_population(elipses_population):
population_size = elipses_population.shape[0]
img_population = np.empty((population_size, n, m, 3))
for j in range(population_size):
imarray = np.empty((N, n, m, 3))
imarray.fill(np.nan)
for i in range(N):
x = elipses_population[j, i, 0]
y = elipses_population[j, i, 1]
R = elipses_population[j, i, 2]
G = elipses_population[j, i, 3]
B = elipses_population[j, i, 4]
a = elipses_population[j, i, 5]
b = elipses_population[j, i, 6]
xx_centered = xx - x
yy_centered = yy - y
elipse = (xx_centered / a)**2 + (yy_centered / b)**2 < 1
imarray[i, elipse, :] = np.array([R, G, B])
means_img = np.nanmean(imarray, axis=0)
means_img = np.nan_to_num(means_img, nan=255)
img_population[j, :, :, :] = means_img
return img_population
Code is working correctly, but i am looking for optimization advices. I am running it many times in my code so every small improve would be helpful.

Related

torch gather using two index arrays

The goal is to extract a random 2x5 patch from a 5x10 image, and do so randomly for all images in a batch. Looking to write a faster implementation that avoids for loops. Haven't been able to figure out how to use the torch .gather operation with two index arrays (idx_h and idx_w in code example).
Naive for loop:
import torch
b = 3 # batch size
h = 5 # height
w = 10 # width
crop_border = (3, 5) # number of pixels (height, width) to crop
x = torch.arange(b * h * w).reshape(b, h, w)
print(x)
dh_ = torch.randint(0, crop_border[0], size=(b,))
dw_ = torch.randint(0, crop_border[1], size=(b,))
_dh = h - (crop_border[0] - dh_)
_dw = w - (crop_border[1] - dw_)
idx_h = torch.stack([torch.arange(d_, _d) for d_, _d in zip(dh_, _dh)])
idx_w = torch.stack([torch.arange(d_, _d) for d_, _d in zip(dw_, _dw)])
print(idx_h, idx_w)
new_shape = (b, idx_h.shape[1], idx_w.shape[1])
cropped_x = torch.empty(new_shape)
for batch in range(b):
for height in range(idx_h.shape[1]):
for width in range(idx_w.shape[1]):
cropped_x[batch, height, width] = x[
batch, idx_h[batch, height], idx_w[batch, width]
]
print(cropped_x)

Index arrays needed to be repeated and reshaped to work with gather operation. Fast_crop code based pytorch discussion: https://discuss.pytorch.org/t/similar-to-torch-gather-over-two-dimensions/118827
def fast_crop(x, idx1, idx2):
"""
Compute
x: N x B x V
idx1: N x K matrix where idx1[i, j] is between [0, B)
idx2: N x K matrix where idx2[i, j] is between [0, V)
Return:
cropped: N x K matrix where y[i, j] = x[i, idx1[i,j], idx2[i,j]]
"""
x = x.contiguous()
assert idx1.shape == idx2.shape
lin_idx = idx2 + x.size(-1) * idx1
x = x.view(-1, x.size(1) * x.size(2))
lin_idx = lin_idx.view(-1, lin_idx.shape[1] * lin_idx.shape[2])
cropped = x.gather(-1, lin_idx)
return cropped.reshape(idx1.shape)
idx1 = torch.repeat_interleave(idx_h, idx_w.shape[1]).reshape(new_shape)
idx2 = torch.repeat_interleave(idx_w, idx_h.shape[1], dim=0).reshape(new_shape)
cropped = fast_crop(x, idx1, idx2)
(cropped == cropped_x).all()
Using realistic numbers for b = 100, h = 100, w = 130 and crop_border = (40, 95), a 10 trial run takes the for loop 32s while fast_crop only 0.043s.

fast <image,time> linear interpolation

I'm trying to achieve linear interpolation, where the data points are N images of shape: HxWx3 (stored in buf (NxHxWx3)), and the points to interpolate are specified in another (2D) grid (interp_values).
Non-vectorizable approach:
In principle I have made interp_values a HxW grid with values 0..N-1 indicating for each i,j element from which image (in buf) to read it from, including fractional values meaning interpolation.
E.g.: a value of 3.6 means blend 40% (1-0.6) of image 3 with 60% (0.6) of image 4. However with this approach it is quite impossible to vectorize the code, and performance was poor.
One vectorization approach:
So I changed interp_values to be a NxHxWx3 grid with values 0..1. Each column :,i,j,c would specify blend coefficients for the N images, where only 1 or 2 elements are non-zero, e.g. for 3.6 we have: [0, 0, 0, 0.6, 0.4, 0, 0, ...]. I can convert interp_values from HxW to NxHxWx3 with:
def expand_interp_values(interp_values):
r = np.zeros((N,) + interp_values.shape + (3,))
for i in range(interp_values.shape[0]):
for j in range(interp_values.shape[1]):
v = interp_values[i, j]
a, b, x = math.floor(v), math.ceil(v), math.fmod(v, 1)
if int(a) == int(b):
r[a, i, j, :] = 3 * [1]
else:
r[a, i, j, :] = 3 * [1 - x]
r[b, i, j, :] = 3 * [x]
return r
This representation is more sparse (many zeros) but now interpolation can be computed as element-wise multiplication between buf and interp_values (the multiplication part of the linear interpolation) followed by a sum(..., axis=0) (i.e. the addition part of the linear interpolation):
def linear_interp(data, interp_values):
return np.sum(data * interp_values, axis=0)
With this approach, there is some performance improvement, however it seems with this approach the CPU will be most of the times busy computing x1*0, x2*0, ... or 0 + 0 + 0...
Can this be improved any better?
Additionally, the creation of the expanded interp_values grid is not vectorized, so perhaps performance would be bad if that grid has to be updated continuously.
Complete python+opencv code:
import cv2
import numpy as np
import math
vid = cv2.VideoCapture(0)
vid.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
vid.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
# store last N images into a NxHxWx3 grid (circular buffer):
N = 25
buf = None
interp_values = None
DOWNSAMPLING = 6
def linear_interp(data, interp_values):
return np.sum(data * interp_values / 256, axis=0)
def expand_interp_values(interp_values):
r = np.zeros((N,) + interp_values.shape + (3,))
for i in range(interp_values.shape[0]):
for j in range(interp_values.shape[1]):
v = interp_values[i, j]
a, b, x = math.floor(v), math.ceil(v), math.fmod(v, 1)
if int(a) == int(b):
r[a, i, j, :] = 3 * [1]
else:
r[a, i, j, :] = 3 * [1 - x]
r[b, i, j, :] = 3 * [x]
return r
while True:
ret, frame = vid.read()
H, W, Ch = frame.shape
frame = cv2.resize(frame, dsize=(W//DOWNSAMPLING, H//DOWNSAMPLING), interpolation=cv2.INTER_LINEAR)
# circular buffer:
if buf is None:
buf = np.zeros((N,) + frame.shape, dtype=np.uint8)
# there should be a simpler way to a FIFO-grid...
for i in reversed(range(1, N)):
buf[i] = buf[i - 1]
buf[0] = frame
if interp_values is None:
# create a lookup pattern here:
interp_values = np.zeros(frame.shape[:2])
for i in range(frame.shape[0]):
for j in range(frame.shape[1]):
y = i / (frame.shape[0] - 1) * 2 - 1
x = j / (frame.shape[1] - 1) * 2 - 1
#interp_values[i, j] = (N - 1) * min(1, math.hypot(x, y))
interp_values[i, j] = (N - 1) * (y + 1) / 2
interp_values = expand_interp_values(interp_values)
im = linear_interp(buf, interp_values)
im = cv2.resize(im, dsize=(W, H), interpolation=cv2.INTER_LANCZOS4)
cv2.imshow('image', im)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
vid.release()
cv2.destroyAllWindows()

Cumulative sum over slices of array: Two approaches

I have a hard time understanding why the following two code samples produce different results:
Code 1:
for h in range(n_H):
for w in range(n_W):
# Find indices
vert_start = h * stride # Starting row-index for current slice
vert_end = vert_start + f # Final row-index (+1) for current slice
horiz_start = w * stride # Starting column-index for current slice
horiz_end = horiz_start + f # Final column-index (+1) for current slice
for c in range(n_C):
Aux = (W[:, :, :, c] * Z[:, h, w, c, np.newaxis, np.newaxis, np.newaxis])
A[:, vert_start:vert_end, horiz_start:horiz_end, :] += Aux
Code 2:
for h in range(n_H):
for w in range(n_W):
# Find indices
vert_start = h * stride # Starting row-index for current slice
vert_end = vert_start + f # Final row-index (+1) for current slice
horiz_start = w * stride # Starting column-index for current slice
horiz_end = horiz_start + f # Final column-index (+1) for current slice
Aux = np.zeros((m, f, f, n_CP))
for c in range(n_C):
Aux += (W[:, :, :, c] * Z[:, h, w, c, np.newaxis, np.newaxis, np.newaxis])
A[:, vert_start:vert_end, horiz_start:horiz_end, :] += Aux
In both cases
n_H, n_W, n_C, n_HP, n_WP, n_CP, m, stride and f are scalars
W is an array of shape (f, f, n_CP, n_C)
Z is an array of shape (m, n_H, n_W, n_C)
A is an array of shape (m, n_HP, n_WP, n_CP)
I noticed that the two approaches yield the same result when the "index ranges"(vert_start:vert_end and horiz_start:horiz_end) are scalars instead, i.e. f=1. However, I cannot figure out why it does not work for ranges too.
Below you can find one example for which the code samples result in different ouputs:
np.random.seed(1)
m = 2
f = 2
stride = 1
n_C = 3
n_CP = 1
n_H = 2
n_W = 2
n_HP = 3
n_WP = 3
W = np.random.randn(f, f, n_CP, n_C)
Z = np.random.rand(m, n_H, n_W, n_C)
A = np.zeros((m, n_HP, n_WP, n_CP))
A2 = np.zeros((m, n_HP, n_WP, n_CP))
for h in range(n_H):
for w in range(n_W):
# Find indices
vert_start = h * stride # Starting row-index for current slice
vert_end = vert_start + f # Final row-index (+1) for current slice
horiz_start = w * stride # Starting column-index for current slice
horiz_end = horiz_start + f # Final column-index (+1) for current slice
for c in range(n_C):
Aux = (W[:, :, :, c] * Z[:, h, w, c, np.newaxis, np.newaxis, np.newaxis])
A[:, vert_start:vert_end, horiz_start:horiz_end, :] += Aux
Aux = np.zeros((m, f, f, n_CP))
for c in range(n_C):
Aux += (W[:, :, :, c] * Z[:, h, w, c, np.newaxis, np.newaxis, np.newaxis])
A2[:, vert_start:vert_end, horiz_start:horiz_end, :] += Aux
print(A == A2)

While it appears that there is no difference when printing A and A2, this is only due to the way Python displays the results. Outputting (A - A2) shows that there is indeed a small difference at the positions labeled "False". However, the difference is of dimension e-16. So it simply is a rounding error.

How can I code a matrix within a matrix using a loop?

So I have this 3x3 G matrix (not shown here, it's irrelevant to my problem) that I created using the two variables u (a vector, x - y) and the scalar k. x_j = (x_1 (j), x_2 (j), x_3 (j)) and y_j = (y_1 (j), y_2 (j), y_3 (j)). alpha_j is a 3x3 matrix. The A matrix is block diagonal matrix of size 3nx3n. I am having trouble with the W matrix. How do I code a matrix of size 3nx3n, where the (i,j)th block is the 3x3 matrix given by alpha_i*G_[ij]*alpha_j?? I am lost.
My alpha_j matrix also seems to be having some trouble. The loop keeps throwing me the error, "only length-1 arrays can be converted to Python scalars." pls help :/
def W(x, y, k, alpha, A):
u = x - y
n = x.shape[0]
W = np.zeros((3*n, 3*n))
for i in range(0, n-1):
for j in range(0, n-1):
#u = -np.array([[x[i,0] - x[j,0]], [x[i,1] - x[j,1]], [0]]) ??
W[i][j] = (alpha_j(alpha, A) * G(u, k) * alpha_j(alpha, A))
W[i][i] = np.zeros((n, n))
return W
def alpha_j(a, A):
alph = np.array([[0,0,0],[0,0,0],[0,0,0]],complex)
rho = np.random.rand(3,1)
for i in range(0, 2):
for j in range(0, 2):
alph[i][j] = (rho[i] * a * A[i][j])
return alph
#-------------------------------------------------------------------
x1 = np.array([[1], [2], [0]])
y1 = np.array([[4], [5], [0]])
# SYSTEM PARAMETERS
# incoming Wave angle
theta = 0 # can range from [0, 2pi)
# susceptibility
chi = 10 + 1j
# wavelength
lam = 0.5 # microns (values between .4-.7)
# frequency
k = (2 * np.pi)/lam # 1/microns
# volume
V_0 = (0.05)**3 # microns^3
# incoming wave vector
K = k * np.array([[0], [np.sin(theta)], [np.cos(theta)]])
# polarization vector
vecinc = np.array([[1], [0], [0]]) # (can choose any vector perpendicular to K)
# for the fixed alpha case
alpha = (V_0 * 3 * chi)/(chi + 3)
# 3 x 3 matrix
A = np.matlib.identity(3) # could be any symmetric matrix,
#-------------------------------------------------------------------
# TEST FUNCTIONS
test = G((x1-y1), k)
print(test)
w = W(x1, y1, k, alpha, A)
print(w)
Sometimes my W loops throws me the error, "can't set an array element with a sequence." But I need to set each array element in this arbitrary matrix W to the 3x3 matrix created by multiplying alpha by G...

To your question of how to create a new array with a block for each element, the following should do the trick:
G = np.random.random([3,3])
result = np.zeros([9,9])
num_blocks = 3
a = np.random.random([3,3])
b = np.random.random([3,3])
for i in range(G.shape[0]):
for j in range(G.shape[1]):
block_result = a*G[i,j]*b
for k in range(num_blocks):
for l in range(num_blocks):
result[3*i + k, 3*j + l] = block_result[i, j]
You should be able to generalize from there. I hope I've understood correctly.
EDIT: It looks like I haven't understood correctly. I'm leaving it in hopes it spurs you to an answer. The general idea is to generate ranges of indices to operate on, and then just operate on them directly. Slicing might be helpful, too.
Ah, you asked how to create a diagonal filled with blocks. In that case:
num_diagonal_blocks = 3 # for example
for block_dim in range(num_diagonal_blocks)
# do your block calculation...
for k in range(G.shape[0]):
for l in range(G.shape[1]):
result[3*block_dim + k, 3*block_dim + l] = # assign to element of block
I think that's nearly it.

evaluate many monomials at many points

The following problem concerns evaluating many monomials (x**k * y**l * z**m) at many points.
I would like to compute the "inner power" of two numpy arrays, i.e.,
import numpy
a = numpy.random.rand(10, 3)
b = numpy.random.rand(3, 5)
out = numpy.ones((10, 5))
for i in range(10):
for j in range(5):
for k in range(3):
out[i, j] *= a[i, k]**b[k, j]
print(out.shape)
If instead the line would read
out[i, j] += a[i, k]*b[j, k]
this would be a a number of inner products, computable with a simple dot or einsum.
Is it possible to perform the above loop in just one numpy line?

What about thinking of it in terms of logarithms:
import numpy
a = numpy.random.rand(10, 3)
b = numpy.random.rand(3, 5)
out = np.exp(np.matmul(np.log(a), b))
Since c_ij = prod(a_ik ** b_kj, k=1..K), then log(c_ij) = sum(log(a_ik) * b_ik, k=1..K).
Note: Having zeros in a may mess up the result (also negatives, but then the result wouldn't be well defined anyway). I have given it a try and it doesn't seem to actually break somehow; I don't know if that behavior is guaranteed by NumPy but, to be safe, you can add something at the end like:
out[np.logical_or.reduce(a < eps, axis=1)] = 0

You can use broadcasting after extending those arrays to 3D versions -
(a[:,:,None]**b[None,:,:]).prod(axis=1)
Simply put -
(a[...,None]**b[None]).prod(1)
Basically, we are keeping the last axis and first axis from the two arrays aligned, while performing element-wise powers between the first and last axes from the two inputs. Schematically put using the given sample on shapes -
10 x 3 x 1
1 x 3 x 5

Two more solutions:
Inlining
numpy.array([
numpy.prod([a[:, i]**bb[i] for i in range(len(bb))], axis=0)
for bb in b.T
]).T
and using power.outer:
numpy.prod([numpy.power.outer(a[:, k], b[k]) for k in range(len(b))], axis=0)
Both are a bit slower than the broadcasting solution.
Even with some logic to accommodate for zero and negative values, the exp-log solution takes the cake.
Code to reproduce the plot:
import numpy
import perfplot
def loop(data):
a, b = data
m = a.shape[0]
n = b.shape[1]
out = numpy.ones((m, n))
for i in range(m):
for j in range(n):
for k in range(3):
out[i, j] *= a[i, k]**b[k, j]
return out
def broadcasting(data):
a, b = data
return (a[..., None]**b[None]).prod(1)
def log_exp(data):
a, b = data
neg_a = numpy.zeros(a.shape, dtype=int)
neg_a[a < 0.0] = 1
odd_b = numpy.zeros(b.shape, dtype=int)
odd_b[b % 2 == 1] = 1
negative_count = numpy.dot(neg_a, odd_b)
out = (-1)**negative_count * numpy.exp(
numpy.matmul(
numpy.log(abs(a), where=abs(a) > 0.0),
b
))
zero_a = numpy.zeros(a.shape, dtype=int)
zero_a[a == 0.0] = 1
pos_b = numpy.zeros(b.shape, dtype=int)
pos_b[b > 0] = 1
zero_count = numpy.dot(zero_a, pos_b)
out[zero_count > 0] = 0.0
return out
def inline(data):
a, b = data
return numpy.array([
numpy.prod([a[:, i]**bb[i] for i in range(len(bb))], axis=0)
for bb in b.T
]).T
def outer_power(data):
a, b = data
return numpy.prod([
numpy.power.outer(a[:, k], b[k]) for k in range(len(b))
], axis=0)
perfplot.show(
setup=lambda n: (
numpy.random.rand(n, 3) - 0.5,
numpy.random.randint(0, 10, (3, n))
),
n_range=[2**k for k in range(11)],
repeat=10,
kernels=[
loop,
broadcasting,
inline,
log_exp,
outer_power
],
logx=True,
logy=True,
xlabel='len(a)',
)

import numpy
a = numpy.random.rand(10, 3)
b = numpy.random.rand(3, 5)
out = [[numpy.prod([a[i, k]**b[k, j] for k in range(3)]) for j in range(5)] for i in range(10)]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to optimize this loop - python

Related

torch gather using two index arrays

fast <image,time> linear interpolation

Cumulative sum over slices of array: Two approaches

How can I code a matrix within a matrix using a loop?

evaluate many monomials at many points

Categories

Resources