Explain pitch, width, height, depth in memory for 3D arrays - python

I am working with CUDA and 3D textures in Python (using pycuda). There is a function called Memcpy3D which has the same members as Memcpy2D plus a few extras. It requires you to describe things such as width_in_bytes, src_pitch, src_height, height and copy_depth. This is what I am struggling with in 3D, and how it relates to C- or F-style indexing. For instance, if I simply change the ordering from F to C in the working example below, it stops working, and I don't know why.
First of all, I understand pitch to be the number of bytes in memory it takes to move one index across in threadIdx.x (or the x direction, or a column). So for a float32 array of C shape (3,2,4), to move one value in x I expect to move 4 values in memory (as the indexing goes down the z axis first?). Therefore my pitch would be 4*32 bits.
I understand height to be the number of rows. (In this example, 3)
I understand width to be the number of cols. (In this example, 2)
I understand depth to be the number of z slices. (In this example, 4)
I understand width_in_bytes to be the width of a row in x inclusive of the z elements behind it, i.e. a row slice (0,:,:). This would be how many addresses in memory it takes to traverse one element in the y-direction.
So when I change the ordering from F to C in the code below, and adapt the code to change the height/width values accordingly, it still doesn't work. It just produces wrong values, which makes me think I'm not understanding the concept of pitch, width, height, depth correctly.
Please educate me.
Below is a full working script that copies an array to the GPU as a texture and copies the contents back.
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
w = 2
h = 3
d = 4
shape = (w, h, d)
a = np.arange(24).reshape(*shape,order='F').astype('float32')
print(a.shape,a.strides)
print(a)
descr = drv.ArrayDescriptor3D()
descr.width = w
descr.height = h
descr.depth = d
descr.format = drv.dtype_to_array_format(a.dtype)
descr.num_channels = 1
descr.flags = 0
ary = drv.Array(descr)
copy = drv.Memcpy3D()
copy.set_src_host(a)
copy.set_dst_array(ary)
copy.width_in_bytes = copy.src_pitch = a.strides[1]
copy.src_height = copy.height = h
copy.depth = d
copy()
mod = SourceModule("""
texture<float, 3, cudaReadModeElementType> mtx_tex;
__global__ void copy_texture(float *dest)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    int z = threadIdx.z;
    int dx = blockDim.x;
    int dy = blockDim.y;
    int i = (z*dy + y)*dx + x;
    dest[i] = tex3D(mtx_tex, x, y, z);
}
""")
copy_texture = mod.get_function("copy_texture")
mtx_tex = mod.get_texref("mtx_tex")
mtx_tex.set_array(ary)
dest = np.zeros(shape, dtype=np.float32, order="F")
copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
print(dest)

Not sure I fully understand the problem in your code, but I'll attempt to clarify.
In CUDA, width (x) refers to the fastest-changing dimension, height (y) is the middle dimension, and depth (z) is the slowest-changing dimension. The pitch refers to the stride in bytes required to step between values along the y dimension.
In Numpy, an array defined as np.empty(shape=(3,2,4), dtype=np.float32, order="C") has strides=(32, 16, 4), and corresponds to width=4, height=2, depth=3, pitch=16.
Using "F" ordering in Numpy means the order of dimensions is reversed in memory.
Your code appears to work if I make the following changes:
#shape = (w, h, d)
shape = (d, h, w)
#a = np.arange(24).reshape(*shape,order='F').astype('float32')
a = np.arange(24).reshape(*shape,order='C').astype('float32')
...
#dest = np.zeros(shape, dtype=np.float32, order="F")
dest = np.zeros(shape, dtype=np.float32, order="C")
#copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
copy_texture(drv.Out(dest), block=(w,h,d), texrefs=[mtx_tex])
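With those changes, the transfer parameters can be read straight off the strides of the C-ordered array. A sketch of how the pieces line up (same variable names as above):

# a C-ordered float32 array of shape (d, h, w) has strides (h*w*4, w*4, 4)
a = np.arange(24).reshape(d, h, w).astype('float32')  # order='C' is the default

copy = drv.Memcpy3D()
copy.set_src_host(a)
copy.set_dst_array(ary)
copy.width_in_bytes = copy.src_pitch = a.strides[1]  # bytes per row = w * 4
copy.src_height = copy.height = h                    # rows per 2D slice
copy.depth = d                                       # number of slices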

Related

Python/Numpy - Vectorized implementation of this for loop?

This is a lethargic implementation of a cloud mask based on interpolation across temporal channels of a satellite image. The image array is of shape (n_samples x n_months x width x height x channels). The channels are not just RGB, but also from the invisible spectrum such as SWIR, NIR, etc. One of the channels (or bands, in the satellite image world) is a cloud mask that tells me 0 means "no cloud" and 1024 or 2048 means "cloud" in that pixel.
I'm using this pixel-wise cloud mask channel to change the values on all remaining channels by interpolation between the previous/next month. This implementation is super slow and I'm having a hard time coming up with vectorized implementation.
Is it possible to vectorize this implementation? If so, what would it look like?
Any suggestion on how to deduce the logic of vectorized implementation of complex array operations? In other words, how do I learn the art of vectorization?
I'm a novice, so please excuse my ignorance.
n_samples = 1055
n_months = 12
width = 40
height = 40
channels = 14 # the last channel (index 13) is the cloud mask, based on which the first 12 channels are interpolated
# This function fills nan values in a 1-D array by interpolation
def fill_nan(y):
    nans = np.isnan(y)
    x = lambda z: z.nonzero()[0]
    y[nans] = np.interp(x(nans), x(~nans), y[~nans])
    return y

# first loop: mark cloudy pixels by setting the data channels to nan
for sample in range(1055):
    for temp in range(12):
        for w in range(40):
            for h in range(40):
                if Xtest[sample, temp, w, h, 13] > 0:
                    Xtest[sample, temp, w, h, :12] = np.nan

# second loop: fill nan with values interpolated along the months axis
for sample in range(1055):
    for w in range(40):
        for h in range(40):
            for ch in range(12):
                Xtest[sample, :, w, h, ch] = fill_nan(Xtest[sample, :, w, h, ch])
For the first loop,
import numpy as np

Xtest = np.random.rand(10, 3, 2, 4, 14)
Xtest_v = Xtest.copy()

# loop version
for sample in range(10):
    for temp in range(3):
        for w in range(2):
            for h in range(4):
                if Xtest[sample, temp, w, h, 13] > 0:
                    Xtest[sample, temp, w, h, :12] = np.nan

# vectorized version: one boolean mask applied across all leading axes at once
Xtest_v[..., :12][Xtest_v[..., 13] > 0] = np.nan

print(np.nansum(Xtest))
print(np.nansum(Xtest_v))
You can verify that both arrays are the same by printing the sums, ignoring nans.
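For the second loop, np.interp only handles one 1-D series at a time, so a full vectorization is harder; np.apply_along_axis at least pushes the explicit looping into NumPy. A sketch only: it still calls fill_nan once per pixel and channel internally, so expect a modest speedup.

# Fill nans along the months axis (axis 1) for the 12 data channels.
# Assumes every (sample, pixel, channel) series has at least one non-nan
# value, otherwise np.interp has nothing to interpolate from.
Xtest[..., :12] = np.apply_along_axis(fill_nan, 1, Xtest[..., :12])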

Most efficient way to transform a 2d array to a different coordinate system using a function, then interpolate the resultant holes

To start, I'm basically trying to go from the original (polar) image to the dewarped (cartesian) image.
Each coordinate [x,y] corresponds to a given point in the second image after a function is applied to x and y: f(x,y) = coords of the second image for the value of [x,y]. The way I'm handling this part as of now is to make "map" arrays for x and y, then look up in those arrays to find the new point: mapArrayX[x] gives the new x value and mapArrayY[y] gives the new y value. The issue with this is that I have to iterate over the entire image (256,000 points), and that takes roughly 0.4 seconds. Is there a better way to do this?
The second issue is that after transforming the coordinates I get an image with holes in it.
I currently fill the holes by doing this:
import time
import numpy as np
from scipy import interpolate

dewarpedImage[dewarpedImage == 0] = np.nan
x = np.arange(0, dewarpedImage.shape[1])
y = np.arange(0, dewarpedImage.shape[0])
# mask invalid values
dewarpedImage = np.ma.masked_invalid(dewarpedImage)
xx, yy = np.meshgrid(x, y)
# get only the valid values
x1 = xx[~dewarpedImage.mask]
y1 = yy[~dewarpedImage.mask]
newarr = dewarpedImage[~dewarpedImage.mask]
startTime = time.time()
dewarpedImage = interpolate.griddata((x1, y1), newarr.ravel(),
                                     (xx, yy),
                                     method='linear')
This takes roughly 3 seconds to perform. Is there a faster way to do this? Ideally I need to get this whole process from 3+ seconds down to less than 1 second.
Here is my conversion function/how I generate my mapping:
import math
import numpy as np
from os import path

RANGE_BIN_SIZE = .39

def rangeBinToRange(rangeBin):
    return rangeBin * RANGE_BIN_SIZE

def azToDegree(azBin):
    degree = math.degrees(math.asin((azBin - 127.5) * 0.3771 / (0.19812 * 255)))
    return degree

def makeWarpMap():
    print("making warp maps")
    xMap = np.zeros((1024, 256))
    yMap = np.zeros((1024, 256))
    for az in range(256):
        for rang in range(1024):
            azDegree = azToDegree(az)
            dist = rangeBinToRange(rang)
            x = round(dist * math.sin(math.radians(azDegree)) + 381)
            y = round(dist * math.cos(math.radians(azDegree)))
            xMap[rang][az] = x
            yMap[rang][az] = y
    np.save("warpmapX", xMap)
    np.save("warpmapY", yMap)

print(azToDegree(0))
if not path.exists("warpmapX.npy") or not path.exists("warpmapY.npy"):
    makeWarpMap()
data = np.load(filename)  # filename: path to the polar data, defined elsewhere
xMap = np.load("warpmapX.npy")
yMap = np.load("warpmapY.npy")
dewarpedImage = np.zeros((400, 762))
print(data.shape)
for az in range(256):
    azslice = data[:, az]
    for rang in range(1024):
        intensity = azslice[rang]
        x = xMap[rang][az]
        y = yMap[rang][az]
        dewarpedImage[int(y)][int(x)] = intensity
You have holes in your converted image because your forward conversion does not cover every pixel of the output image. I would recommend doing the reverse conversion instead. In other words, for each (x, y) in the dewarped (cartesian) image, find the corresponding point in the polar image and take that value. That way you won't need to deal with holes at all, and it will give you a full image (it also gets rid of the 3 second interpolation step). Since the conversion function is given above, the reverse mapping can be written down directly.
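For the record, the forward mapping in the question (x = dist*sin(az) + 381, y = dist*cos(az)) is easy to invert, so the reverse conversion can be fully vectorized. A sketch only, using nearest-neighbour sampling and the same constants as in the question, and assuming data has shape (1024, 256) as above; the valid mask guards against bins that fall outside the polar data:

import numpy as np

RANGE_BIN_SIZE = 0.39
AZ_SCALE = 0.19812 * 255 / 0.3771   # inverse of the asin() scaling in azToDegree

# grid of output (cartesian) pixel coordinates
y, x = np.mgrid[0:400, 0:762]
dx = x - 381
dist = np.sqrt(dx**2 + y**2)
az_deg = np.degrees(np.arctan2(dx, y))

# invert rangeBinToRange and azToDegree
range_bin = np.round(dist / RANGE_BIN_SIZE).astype(int)
az_bin = np.round(127.5 + np.sin(np.radians(az_deg)) * AZ_SCALE).astype(int)

# sample the polar data wherever the inverted bins are in range
valid = (range_bin < 1024) & (az_bin >= 0) & (az_bin < 256)
dewarpedImage = np.zeros((400, 762))
dewarpedImage[valid] = data[range_bin[valid], az_bin[valid]]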

Compute a least square only on n best values of an array

I have:
print(self.L.T.shape)
print(self.M.T.shape)
(8, 3)
(8, 9082318)
self.N = np.linalg.lstsq(self.L.T, self.M.T, rcond=None)[0].T
which works fine and returns
(9082318, 3)
But I want to perform a kind of sort on M and compute the solution only on the best 8 - n values of M, or ignore values of M below and/or above a certain threshold.
Any pointer on how to do that would be extremely appreciated.
Thank you.
I tried to copy this solution exactly, but it returns an error.
Here is the original working function; basically it's just one line.
M is a stack of 8 grayscale images, reshaped.
L is a stack of 8 light direction vectors.
M contains shadows, but not always at the same location in the image.
So I need to remove those pixels from the computation, but L must retain its dimensions.
def _solve_l2(self):
    """
    Lambertian photometric stereo based on least squares
    Woodham 1980
    :return: None
    Computes surface normals: numpy array of surface normals (p \times 3)
    """
    self.N = np.linalg.lstsq(self.L.T, self.M.T, rcond=None)[0].T
    print(self.N.shape)
    self.N = normalize(self.N, axis=1)  # normalize to account for diffuse reflectance
Here is the code borrowed from the link to try to resolve this:
# L and M as previously used
Ma = self.M.copy()
thresh = 300
Ma[self.M <= thresh] = 0   # censored (shadowed) observations
Ma[self.M > thresh] = 1    # valid observations
Ma = Ma.T
self.M = self.M.T
self.L = self.L.T
print(self.L.shape)
print(self.M.shape)
print(Ma.shape)
A = self.L
B = self.M
M = Ma  # http://alexhwilliams.info/itsneuronalblog/2018/02/26/censored-lstsq/
# solve via tensor representation
rhs = np.dot(A.T, M * B).T[:, :, None]                           # n x r x 1 tensor
T = np.matmul(A.T[None, :, :], M.T[:, :, None] * A[None, :, :])  # n x r x r tensor
self.N = np.squeeze(np.linalg.solve(T, rhs)).T                   # transpose to get r x n
return
This fails with:
numpy.linalg.LinAlgError: Singular matrix
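A plausible cause of this error: with thresh = 300, any pixel where fewer than three of the eight images pass the threshold gives a rank-deficient 3x3 matrix T, and np.linalg.solve rejects the whole batch if even one per-pixel system is singular. A possible workaround (a sketch, not from the linked post) is the batched pseudo-inverse:

# np.linalg.pinv broadcasts over the leading (pixel) axis, so rank-deficient
# pixels no longer abort the whole batch; their normals are just least-norm
# solutions and should be treated as unreliable.
self.N = np.squeeze(np.matmul(np.linalg.pinv(T), rhs)).T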

How to best optimize calculations iterated over NxM grid in Python

Working in Python, I am doing some physics calculations over an NxM grid of values, where N goes from 1 to 3108 and M goes from 1 to 2304 (this corresponds to a large image). I need to calculate a value at each and every point in this space, which totals ~7 million calculations. My current approach is painfully slow, and I am wondering if there is a way to complete this task so it doesn't take hours...
My first approach was just to use nested for loops, but this seemed like the least efficient way to solve my problem. I have tried using NumPy's nditer and iterating over each axis individually, but I've read that it doesn't actually speed up my computations. Rather than looping through each axis individually, I also tried making a 3-D array and looping through the outer axis, as shown in Brian's answer to "How can I, in python, iterate over multiple 2d lists at once, cleanly?". Here is the current state of my code:
import numpy as np
x,y = np.linspace(1,3108,num=3108),np.linspace(1,2304,num=2304) # x&y dimensions of image
X,Y = np.meshgrid(x,y,indexing='ij')
all_coords = np.dstack((X,Y)) # moves to 3-D
all_coords = all_coords.astype(int) # sets coords to int
For reference, all_coords looks like this:
array([[[1.000e+00, 1.000e+00],
        [1.000e+00, 2.000e+00],
        [1.000e+00, 3.000e+00],
        ...,
        [1.000e+00, 2.302e+03],
        [1.000e+00, 2.303e+03],
        [1.000e+00, 2.304e+03]],

       [[2.000e+00, 1.000e+00],
        [2.000e+00, 2.000e+00],
        [2.000e+00, 3.000e+00],
        ...,
        [2.000e+00, 2.302e+03],
        [2.000e+00, 2.303e+03],
        [2.000e+00, 2.304e+03]],
and so on. Back to my code...
'''
- below is a function that does a calculation on the full grid using the distance between x0,y0 and each point on the grid.
- the function takes x0,y0 and returns the calculated values across the grid
'''
def do_calc(x0, y0):
    del_x, del_y = X - x0, Y - y0
    np.seterr(divide='ignore', invalid='ignore')
    dmx_ij = (del_x/((del_x**2)+(del_y**2)))  # x component
    dmy_ij = (del_y/((del_x**2)+(del_y**2)))  # y component
    return dmx_ij, dmy_ij

# now the actual loop
def do_loop():
    dmx, dmy = 0, 0
    for pair in all_coords:
        for xi, yi in pair:
            DM = do_calc(xi, yi)
            dmx, dmy = dmx + DM[0], dmy + DM[1]
    return dmx, dmy
As you might see, this code takes an incredibly long time to run... If there is any way to modify my code such that it doesn't take hours to complete, I would be extremely interested in knowing how to do that. Thanks in advance for the help.
Here is a method that gives a 10,000x speedup at N=310, M=230. As the method scales better than the original code, I'd expect a factor of more than a million at the full problem size.
The method exploits the shift invariance of the problem: for example, del_x**2 is essentially the same, up to a shift, at each call of do_calc, so we compute it only once.
If the output of do_calc is weighted before summation, the problem is no longer fully translation invariant, and this method doesn't work anymore. The result, however, can then be expressed in terms of a linear convolution. At N=310, M=230 this still leaves us with a more than 1,000x speedup, and, again, this will be more at the full problem size.
Code for original problem
import numpy as np
#N, M = 3108, 2304
N, M = 310, 230
### OP's code
x,y = np.linspace(1,N,num=N),np.linspace(1,M,num=M) # x&y dimensions of image
X,Y = np.meshgrid(x,y,indexing='ij')
all_coords = np.dstack((X,Y)) # moves to 3-D
all_coords = all_coords.astype(int) # sets coords to int
'''
- below is a function that does a calculation on the full grid using the distance between x0,y0 and each point on the grid.
- the function takes x0,y0 and returns the calculated values across the grid
'''
def do_calc(x0, y0):
    del_x, del_y = X - x0, Y - y0
    np.seterr(divide='ignore', invalid='ignore')
    dmx_ij = (del_x/((del_x**2)+(del_y**2)))  # x component
    dmy_ij = (del_y/((del_x**2)+(del_y**2)))  # y component
    return np.nan_to_num(dmx_ij), np.nan_to_num(dmy_ij)

# now the actual loop
def do_loop():
    dmx, dmy = 0, 0
    for pair in all_coords:
        for xi, yi in pair:
            DM = do_calc(xi, yi)
            dmx, dmy = dmx + DM[0], dmy + DM[1]
    return dmx, dmy
from time import time
t = [time()]
### pp's code
# build the shift-invariant "template" on a (2N-1) x (2M-1) grid
x, y = np.ogrid[-N+1:N-1:2j*N - 1j, -M+1:M-1:2j*M - 1j]
den = x*x + y*y
den[N-1, M-1] = 1  # avoid 0/0 at the center (do_calc maps that point to 0 anyway)
xx = x / den
yy = y / den
# difference, then double cumulative sum: this adds up all N*M shifted copies
# of the template in O(N*M) operations
for zz in xx, yy:
    zz[N:] -= zz[:N-1]
    zz[:, M:] -= zz[:, :M-1]
XX = xx.cumsum(0)[N-1:].cumsum(1)[:, M-1:]
YY = yy.cumsum(0)[N-1:].cumsum(1)[:, M-1:]
t.append(time())
### call OP's code for reference
X_OP, Y_OP = do_loop()
t.append(time())
# make sure results are equal
assert np.allclose(XX, X_OP)
assert np.allclose(YY, Y_OP)
print('pp {}\nOP {}'.format(*np.diff(t)))
Sample run:
pp 0.015251636505126953
OP 149.1642508506775
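As an aside, the differencing/cumsum construction above is just a fast O(N*M) way of summing all N*M shifted copies of the template. The same result can be written as a single convolution with an all-ones kernel, which is much slower but perhaps easier to see. A sketch (rebuilding the template, since the code above modifies xx and yy in place):

from scipy import signal

x, y = np.ogrid[-N+1:N-1:2j*N - 1j, -M+1:M-1:2j*M - 1j]
den = x*x + y*y
den[N-1, M-1] = 1  # avoid division by zero at the center
# summing do_calc over every (x0, y0) == convolving the template with ones
XX_conv = signal.fftconvolve(x / den, np.ones((N, M)), 'valid')
YY_conv = signal.fftconvolve(y / den, np.ones((N, M)), 'valid')
assert np.allclose(XX_conv, XX) and np.allclose(YY_conv, YY)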
Code for weighted problem:
import numpy as np
#N, M = 3108, 2304
N, M = 310, 230
values = np.random.random((N, M))
x,y = np.linspace(1,N,num=N),np.linspace(1,M,num=M) # x&y dimensions of image
X,Y = np.meshgrid(x,y,indexing='ij')
all_coords = np.dstack((X,Y)) # moves to 3-D
all_coords = all_coords.astype(int) # sets coords to int
'''
- below is a function that does a calculation on the full grid using the distance between x0,y0 and each point on the grid.
- the function takes x0,y0 and returns the calculated values across the grid
'''
def do_calc(x0, y0, v):
    del_x, del_y = X - x0, Y - y0
    np.seterr(divide='ignore', invalid='ignore')
    dmx_ij = (del_x/((del_x**2)+(del_y**2)))  # x component
    dmy_ij = (del_y/((del_x**2)+(del_y**2)))  # y component
    return v*np.nan_to_num(dmx_ij), v*np.nan_to_num(dmy_ij)

# now the actual loop
def do_loop():
    dmx, dmy = 0, 0
    for pair, vv in zip(all_coords, values):
        for (xi, yi), v in zip(pair, vv):
            DM = do_calc(xi, yi, v)
            dmx, dmy = dmx + DM[0], dmy + DM[1]
    return dmx, dmy
from time import time
from scipy import signal
t = [time()]
x, y = np.ogrid[-N+1:N-1:2j*N - 1j, -M+1:M-1:2j*M - 1j]
den = x*x + y*y
den[N-1, M-1] = 1
xx = x / den
yy = y / den
# weighted sum over all shifts == linear convolution with the weight image
XX, YY = (signal.fftconvolve(zz, values, 'valid') for zz in (xx, yy))
t.append(time())
X_OP, Y_OP = do_loop()
t.append(time())
assert np.allclose(XX, X_OP)
assert np.allclose(YY, Y_OP)
print('pp {}\nOP {}'.format(*np.diff(t)))
Sample run:
pp 0.12683939933776855
OP 158.35225439071655

High performance variable blurring in very big images using Python

I have a large collection of large images (e.g. 15000x15000 pixels) that I would like to blur. I need to blur the images using a distance function, so the further away I move from certain areas in the image, the heavier the blurring should be. I have a distance map describing how far a given pixel is from those areas.
Due to the large number of images I have to consider performance. I have looked at NumPy/SciPy; they have some great functions, but they seem to use a fixed kernel size, and I need to reduce or increase the kernel size depending on the distance to the previously mentioned areas.
How can I solve this problem in python?
UPDATE: My solution so far based on the answer by rth:
# cython: boundscheck=False
# cython: cdivision=True
# cython: wraparound=False
import numpy as np
cimport numpy as np

def variable_average(int [:, ::1] data, int [:, ::1] kernel_size):
    cdef int width, height, i, j, ii, jj
    width = data.shape[0]   # axis 0, so non-square images index correctly below
    height = data.shape[1]
    cdef double [:, ::1] data_blurred = np.empty([width, height])
    cdef double res
    cdef int sigma, weight
    for i in range(width):
        for j in range(height):
            weight = 0
            res = 0
            sigma = kernel_size[i, j]
            for ii in range(i - sigma, i + sigma + 1):
                for jj in range(j - sigma, j + sigma + 1):
                    if ii < 0 or ii >= width or jj < 0 or jj >= height:
                        continue
                    res += data[ii, jj]
                    weight += 1
            data_blurred[i, j] = res / weight
    return data_blurred
Test:
data = np.random.randint(256, size=(1024,1024))
kernel = np.random.randint(256, size=(1024,1024)) + 1
result = np.asarray(variable_average(data, kernel))
The method using the above settings takes around 186 seconds to run. Is that what I can expect to ultimately squeeze out of the method, or are there optimizations that I can use to further increase the performance (still using Python)?
As you have noted, the related scipy functions do not support variable-size blurring. You could implement this in pure python with for loops, then use Cython, Numba or PyPy to get C-like performance.
Here is a low-level python implementation that uses numpy only for data storage,
import numpy as np

def variable_blur(data, kernel_size):
    """ Blur with a variable window size
    Parameters:
      - data: 2D ndarray of floats or integers
      - kernel_size: 2D ndarray of integers, same shape as data
    Returns:
      2D ndarray
    """
    data_blurred = np.empty(data.shape)
    Ni, Nj = data.shape
    for i in range(Ni):
        for j in range(Nj):
            res = 0.0
            weight = 0
            sigma = kernel_size[i, j]
            for ii in range(i - sigma, i + sigma + 1):
                for jj in range(j - sigma, j + sigma + 1):
                    if ii < 0 or ii >= Ni or jj < 0 or jj >= Nj:
                        continue
                    res += data[ii, jj]
                    weight += 1
            data_blurred[i, j] = res / weight
    return data_blurred

data = np.random.rand(50, 20)
kernel_size = 3*np.ones((50, 20), dtype=int)
variable_blur(data, kernel_size)
that calculates an arithmetic average of the pixels within a variable kernel size. It is a bad implementation with respect to numpy, in the sense that it is not vectorized. However, this makes it convenient to port to other high-performance solutions:
Cython: simply statically typing the variables and compiling should give you C-like performance,

def variable_blur(double [:, ::1] data, long [:, ::1] kernel_size):
    cdef double [:, ::1] data_blurred = np.empty(data.shape)
    cdef Py_ssize_t Ni, Nj
    Ni = data.shape[0]
    Nj = data.shape[1]
    for i in range(Ni):
        # [...] etc.

see this post for a complete example, as well as the compilation notes.
Numba: wrapping the above function with the @jit decorator should be mostly sufficient (a sketch follows after this list).
PyPy: installing PyPy + the experimental numpy branch, could be another alternative worth trying. Although, then you would have to use PyPy for all your code, which might not be possible at present.
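For example, the Numba version can be as small as the sketch below (the function name is mine; the bounds are folded into the range limits so the inner boundary check disappears):

import numpy as np
from numba import njit

@njit
def variable_blur_numba(data, kernel_size):
    # same logic as the pure-python version above, compiled by Numba
    Ni, Nj = data.shape
    data_blurred = np.empty(data.shape)
    for i in range(Ni):
        for j in range(Nj):
            res = 0.0
            weight = 0
            sigma = kernel_size[i, j]
            for ii in range(max(i - sigma, 0), min(i + sigma + 1, Ni)):
                for jj in range(max(j - sigma, 0), min(j + sigma + 1, Nj)):
                    res += data[ii, jj]
                    weight += 1
            data_blurred[i, j] = res / weight
    return data_blurred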
Once you have a fast implementation, you can then use multiprocessing, etc. to process different images in parallel, if need be. Or even parallelize the outer for loop with OpenMP in Cython.
I came across this while googling and thought I would share my own solution, which is mostly vectorized and doesn't include any for loops over pixels. You can approximate a Gaussian blur by running a box blur multiple times in a row, so the approach I decided on is to iteratively box blur the image, but to vary the number of iterations per pixel using a weighting function.
If you need a large blur radius, the number of iterations grows quadratically, so consider increasing ksize.
Here is the implementation
import cv2
import numpy as np

def variable_blur(im, sigma, ksize=3):
    """Blur an image with a variable Gaussian kernel.
    Parameters
    ----------
    im: numpy array, (h, w)
    sigma: numpy array, (h, w)
    ksize: int
        The box blur kernel size. Should be an odd number >= 3.
    Returns
    -------
    im_blurred: numpy array, (h, w)
    """
    variance = box_blur_variance(ksize)
    # Number of times to blur per-pixel
    num_box_blurs = 2 * sigma**2 / variance
    # Number of rounds of blurring
    max_blurs = int(np.ceil(np.max(num_box_blurs))) * 3
    # Approximate blurring a variable number of times
    blur_weight = num_box_blurs / max_blurs
    current_im = im
    for i in range(max_blurs):
        next_im = cv2.blur(current_im, (ksize, ksize))
        current_im = next_im * blur_weight + current_im * (1 - blur_weight)
    return current_im

def box_blur_variance(ksize):
    x = np.arange(ksize) - ksize // 2
    x, y = np.meshgrid(x, x)
    return np.mean(x**2 + y**2)
And here is an example
im = np.random.rand(300, 300)
sigma = 3
# Variable
x = np.linspace(0, 1, im.shape[1])
y = np.linspace(0, 1, im.shape[0])
x, y = np.meshgrid(x, y)
sigma_arr = sigma * (x + y)
im_variable = variable_blur(im, sigma_arr)
# Gaussian
ksize = sigma * 8 + 1
im_gauss = cv2.GaussianBlur(im, (ksize, ksize), sigma)
# Gaussian replica
sigma_arr = np.full_like(im, sigma)
im_approx = variable_blur(im, sigma_arr)
Blurring results. The plot (not reproduced here) shows:
Top left: Source image
Top right: Variable blurring
Bottom left: Gaussian blurring
Bottom right: Approximated Gaussian blurring
