First off, I have never asked a question on Stack Overflow before, and I will do my best to follow the site guidelines, but let me know if I should change something about my post.
I am attempting to write a function that can quickly extract the pore size distribution from a binary 3D image. I do this by computing the local thickness of the image, similar to the approach implemented in ImageJ's local thickness plugin. Ideally, the function needs to run in under 1 second, as I am calling it ~200000 times in a simulated annealing process. It is performed partially on the CPU (12th Gen Intel(R) Core(TM) i7-12700KF, 20 cores, 16GB RAM) and partially on the GPU (GeForce RTX 3050, 8GB).
The function works, but something is happening, I think on the backend, that is slowing it down artificially. This may have to do with threading, GPU-to-CPU transfer overhead, or some kind of 'cool down' period.
The function has three parts:
1. Euclidean distance transform: performed on the CPU, in parallel, using the edt package. Currently takes ~0.25 seconds on a 250^3 binary image.
2. 3D skeletonization: performed on the CPU using skimage.morphology.skeletonize_3d, with the image split into chunks via dask (this implementation is provided by porespy.filters.chunked_func). The skeleton is then multiplied by the distance transform, so each skeleton voxel holds the distance to the nearest background voxel. This step takes 0.45 to 0.5 seconds.
3. Dilation of each skeleton voxel with a spherical structuring element whose radius equals the value of that voxel. This is done in a for loop, starting from the largest structuring element and working downwards, so larger spheres are not overwritten by smaller ones. Each dilation is an FFT convolution on the GPU using cupyx.scipy.signal.signaltools.convolve, which takes ~0.005 seconds.
However, much less code is needed to reproduce the effect I am seeing; the essential part is performing many FFT convolutions in sequence.
A minimal reproducible example is as follows:
import skimage
import time
import cupy as cp
from cupyx.scipy.signal.signaltools import convolve

# Generate a binary image
im = cp.random.random((250, 250, 250)) > 0.4

# Generate spherical structuring kernels for input to convolution
structuring_kernels = {}
for r in range(1, 21):
    structuring_kernels.update({r: cp.array(skimage.morphology.ball(r))})

# Run the dilation process in a loop
for i in range(10):
    s = time.perf_counter()
    for j in range(20, 0, -1):
        convolve(im, structuring_kernels[j], mode='same', method='fft')
    e = time.perf_counter()
    # time.sleep(2)
    print(e - s)
When run as is, after the first couple of loops, each dilation loop takes ~1.8 seconds on my computer. If I uncomment the time.sleep(2) line (i.e. pause for 2 seconds between loops), each loop takes only ~0.05 seconds. I suspect this has to do with threading or GPU use: it takes a couple of loops to reach the 1.8 seconds, after which it stays steady at that value. When I monitor my GPU usage, the 3D utilization graph quickly spikes to 100% and stays close to there.
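One guess I have (and it is only a guess) is that CuPy launches GPU kernels asynchronously, so the timer may mostly be measuring how long it takes to queue the work until some internal queue fills up. A variant of the timing loop that forces the GPU to finish before stopping the timer, using the same setup as the example above, would look like this:
for i in range(10):
    s = time.perf_counter()
    for j in range(20, 0, -1):
        convolve(im, structuring_kernels[j], mode='same', method='fft')
    # Wait for all queued GPU work to finish so the timing reflects actual
    # execution rather than just (asynchronous) kernel launches. This is a
    # guess at the cause of the discrepancy, not a fix.
    cp.cuda.Device().synchronize()
    e = time.perf_counter()
    print(e - s)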
If I am just being limited by the capacity of my GPU, why do the first couple of loops run faster? Could a memory leak be happening? Does anyone know why this is happening, and if there is a way to prevent it, possibly using backend controls in cupy?
I'm not sure if this is necessary, but my local thickness function in its entirety is as follows:
import porespy as ps
from skimage.morphology import skeletonize_3d
import time
import numpy as np
import cupy as cp
from edt import edt
from cupyx.scipy.signal.signaltools import convolve

def local_thickness_cp(im, masks=None, method='fft'):
    """
    Parameters
    ----------
    im : 3D voxelized image for which the local thickness map is desired
    masks : (optional) A dictionary of the structuring elements to be used
    method : 'fft' or 'direct'

    Returns
    -------
    The local thickness map
    """
    s = time.perf_counter()
    # Calculate the euclidean distance transform using the edt package
    dt = cp.array(edt(im, parallel=15))
    e = time.perf_counter()
    # print(f'EDT took {e - s}')
    s = time.perf_counter()
    # Calculate the skeleton of the image and multiply by dt
    skel = cp.array(ps.filters.chunked_func(skeletonize_3d,
                                            overlap=17,
                                            divs=[2, 3, 3],
                                            cores=20,
                                            image=im).astype(bool)) * dt
    e = time.perf_counter()
    # print(f'skeletonization took {e - s} seconds')
    r_max = int(cp.max(skel))
    s = time.perf_counter()
    if not masks:
        masks = {}
        for r in range(int(r_max), 0, -1):
            masks.update({r: cp.array(ps.tools.ps_ball(r))})
    e = time.perf_counter()
    # print(f'mask creation took {e - s} seconds')
    # Initialize the local thickness image
    final = cp.zeros(cp.shape(skel))
    time_in_loop = 0
    s = time.perf_counter()
    for r in range(r_max, 0, -1):
        # Get a mask of where the skeleton has values between r-1 and r
        skel_selected = ((skel > r - 1) * (skel <= r)).astype(int)
        # Perform dilation on the mask using the fft convolve method, and multiply by the pore radius
        dilation = (convolve(skel_selected, masks[r], mode='same', method=method) > 0.1) * r
        # Add dilation to the local thickness image where it is still zero
        # (i.e. don't overwrite previously inserted values)
        final = final + (final == 0) * dilation
    e = time.perf_counter()
    # print(f'Dilation loop took {e - s} seconds')
    return final
Now, in theory, the function should take ~0.80 seconds to compute. When called in a loop on separate images, however, it takes ~1.5 seconds; if I add a time.sleep(1) after each function call, it does take approximately 0.8 seconds.
Related
I am fairly new to coding in general and to Python in particular. I am trying to apply a weighted average scheme to a big dataset. At the moment it takes hours to complete, and I would love to speed it up, also because this has to be repeated several times.
The weighted average is a method used in marine biogeochemistry that includes the history of gas transfer velocities (k) between sampling dates, where k is weighted according to the fraction of the water column (f) ventilated by the atmosphere as a function of the history of k, giving more importance to values closer to the sampling time (so the weight at the sampling time step is 1 and decreases moving back in time):
Weighted average equation extracted from https://doi.org/10.1029/2017GB005874, p. 1168
In my attempt I used a nested for loop where at each time step t I calculated the weighted average:
import numpy as np

def kw_omega(k, depth, window, samples_day):
    """
    Calculate the scheme weights for gas transfer velocity of oxygen
    over the previous window of time, where the most recent gas transfer velocity
    has a weight of 1, and the weighting decreases going back in time. The rate of
    decrease depends on the wind history and MLD.

    Parameters
    ----------
    k: ndarray
        instantaneous O2 gas transfer velocity
    depth: ndarray
        water depth
    window: integer
        weighting period in days, which equals the residence time of oxygen at sampling day
    samples_day: integer
        number of samples in each day composing the window

    Returns
    ---------
    weighted_kw: ndarray

    Notes
    ---------
    n = the weighting period / the time resolution of the wind data
    samples_day = the time resolution of the wind data
    omega = the weighting coefficient at each time step within the weighting window
    f = the fraction of the water column (mixed layer, photic zone or full water column) ventilated at each time
    """
    Dt = 1. / samples_day
    f = (k * Dt) / depth
    f = np.flip(f)
    k = np.flip(k)
    n = window * samples_day
    weighted_kw = np.zeros(len(k))
    for t in np.arange(len(k) - n):
        omega = np.zeros((n))
        omega[0] = 1.
        for i in np.arange(1, len(omega)):
            omega[i] = omega[i-1] * (1 - f[t+(i-1)])
        weighted_kw[t] = sum(k[t:t+n] * omega) / sum(omega)
        print(f"t = {t}")
    return np.flip(weighted_kw)
This should be used on model simulation data which was set to run for almost 2 years, where the model time step is 60 seconds and sampling is done at intervals of 7 days. Therefore k has shape (927360,), and n, the number of samples (minutes) in 7 days, is 10080. At the moment it is taking several hours to run. Is there a way to make this calculation faster?
I would recommend using the numba package to speed up your calculation.
import numpy as np
from numba import njit
from numpy.lib.stride_tricks import sliding_window_view

@njit
def k_omega(k_win, f_win):
    delta_t = len(k_win)
    omega_sum = omega = 1.0
    k_omega_sum = k_win[0]
    for t in range(1, delta_t):
        omega *= (1 - f_win[t])
        omega_sum += omega
        k_omega_sum += k_win[t] * omega
    return k_omega_sum / omega_sum

@njit
def windows_k_omega(k_wins, f_wins):
    size = len(k_wins)
    result = np.empty(size)
    for i in range(size):
        result[i] = k_omega(k_wins[i], f_wins[i])
    return result

def kw_omega(k, depth, window, samples_day):
    n = window * samples_day  # delta_t
    f = k / depth / samples_day
    k_wins = sliding_window_view(k, n)
    f_wins = sliding_window_view(f, n)
    k_omegas = windows_k_omega(k_wins, f_wins)
    weighted_kw = np.pad(k_omegas, (len(k) - len(k_omegas), 0))
    return weighted_kw
Here, I have split the function up into three in order to make it more comprehensible. The function k_omega basically applies your weighted mean to a single k and f window. The function windows_k_omega just speeds up the loop that applies k_omega element-wise over the windows. Finally, the outer function kw_omega implements your original function interface. It uses the numpy function sliding_window_view to create the moving windows (note that this is a strided view under the hood, so it does not create copies of the original array), performs the calculation with the helper functions, and takes care of padding the result array (initial zeros).
A short test against your original function showed somewhat different results, which is likely because your np.flip calls reverse the arrays before indexing. I just implemented the formula you provided without checking your indexing in depth, so I leave that task to you. You should maybe call both versions with some dummy inputs that you can check manually.
An additional note on your code: if you want to loop over indexes, use the built-in range instead of np.arange. range is a lazy sequence, so Python does not have to create the whole array of indexes up front just to iterate over them one by one. Furthermore, try to reduce the number of arrays you create and re-use them instead; e.g. the omega = np.zeros(n) could be created once outside the outer for loop as omega = np.empty(n) and only re-initialized inside each iteration with omega[:] = 0.0. Note that memory management, which is typically the speed penalty besides element access by index, is something you need to take care of yourself with plain numpy, because there is no compiler to help you; that is why I recommend numba, which compiles your Python code and helps in many ways to make the number crunching faster.
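As a minimal sketch, here is that buffer re-use applied to the loop from your question (assuming the same k, f, n and weighted_kw as in your kw_omega):
# Re-use one omega buffer instead of allocating a new array on every iteration;
# the rest mirrors the original loop.
omega = np.empty(n)
for t in range(len(k) - n):
    omega[:] = 0.0
    omega[0] = 1.0
    for i in range(1, n):
        omega[i] = omega[i-1] * (1 - f[t + (i - 1)])
    weighted_kw[t] = np.sum(k[t:t+n] * omega) / np.sum(omega)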
I am calculating the sum of the absolute difference of two matrices (representing two images, in RGBA with uint8 type) to see how similar they are to each other. I've stumbled upon a weird efficiency problem, where calculating the distance on submatrices (specifically, considering only the RGB channels) takes roughly 5 times as long as doing it on the whole matrices.
Here's my code:
Firstly, I load two images using OpenCV and convert them to RGBA (since they are loaded as BGR) like this (the paths are specified before):
template_image = cv2.cvtColor(cv2.imread(template_path, flags=cv2.IMREAD_UNCHANGED), cv2.COLOR_BGR2RGBA)
big_image = cv2.cvtColor(cv2.imread(big_image_path, flags=cv2.IMREAD_UNCHANGED), cv2.COLOR_BGR2RGBA)
Then I calculate the similarity this way (the 1000 loops are only there to exaggerate the time it takes; I do 1000 loops in all the tests, so it shouldn't make a difference):
first = template_image[:, :, 0:3]
second = big_image[0:200, 0:400, 0:3]

t = time.time()
for i in range(1000):
    numpy.sum(cv2.absdiff(first, second))
print(f'elapsed time: {(time.time() - t):.3f} s')
Running this multiple times, it gives elapsed times that swing between 1s and 1.2s
This weirded me out because before restricting it to the RGB channels it used to take much less. In fact, with this code:
first = template_image[:, :]
second = big_image[0:200, 0:400]

t = time.time()
for i in range(1000):
    numpy.sum(cv2.absdiff(first, second))
print(f'elapsed time: {(time.time() - t):.3f} s')
the average elapsed time is roughly 0.25s
This alone is enough to puzzle me: why should it take 5 times as long to do technically fewer calculations?
I've also tried a different piece of code that checks the similarity only on the RGB channels, but copies them into new arrays first, and this puzzles me even more:
first = numpy.zeros((200, 400, 3), dtype=numpy.uint8)
first[:, :, :] = template_image[:, :, 0:3]
second = numpy.zeros((200, 400, 3), dtype=numpy.uint8)
second[:, :, :] = big_image[0:200, 0:400, 0:3]

t = time.time()
for i in range(1000):
    numpy.sum(cv2.absdiff(first, second))
print(f'elapsed time: {(time.time() - t):.3f} s')
Now this code takes 0.19s on average. It makes sense that it is faster than the full RGBA version, since there are fewer calculations, but why is it so much faster than the other version that also only uses the RGB channels?
I've specified dtype=numpy.uint8 because that's the type of the images as I read them with OpenCV; otherwise numpy would create float64 matrices, which (understandably) take much more time.
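(For reference, the zero-init-then-assign copies above can also be written with numpy.ascontiguousarray; this is just a more compact way of making the same contiguous copies, not an explanation of the behaviour:)
# Contiguous uint8 copies of the RGB channels, equivalent to the explicit
# numpy.zeros(...) plus assignment above.
first = numpy.ascontiguousarray(template_image[:, :, 0:3])
second = numpy.ascontiguousarray(big_image[0:200, 0:400, 0:3])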
What's the reason behind this weird behaviour? Thanks!
I am using Python to generate input data for engineering simulations. I need to define how a certain physical quantity (let's call it force) varies with time.
The force depends on a known time-varying quantity (let's call it angle). The dependency between force and angle is different for each time step. The force-angle dependency for each time step is defined as data points, not as a function. For each time step I need to interpolate the value for force from the force-angle dependency of that time step. I am using scipy.interpolate.interp1d inside a list comprehension.
However, I am unhappy with the performance. The interpolation loop takes almost 20 seconds, which is unacceptably slow. The number of time steps is approximately 250k. The number of data points in a force-angle dependency is approximately 2k. I tried using a for loop instead, but this was even slower.
How can I improve performance? The execution time needs to be less than a second, if possible. The code here does not contain the actual data I'm using, but is similar enough.
import numpy as np
import random
from scipy.interpolate import interp1d
import time

nTimeSteps = 250000
nPoints_forceAngleDependency = 2000

# Generating an example (bogus data)
angle = np.linspace(0., 2*np.pi, nPoints_forceAngleDependency)
forceAngleDependency_forEachTimeStep = [random.random() * np.sin(angle) for i in range(nTimeSteps)]
angleHistory = [random.random() * 2 * np.pi for i in range(nTimeSteps)]

# Interpolation
start = time.time()
forceHistory = [interp1d(angle, forceAngleDependency_forEachTimeStep[i])(angleHistory[i])
                for i in range(nTimeSteps)]
end = time.time()
print('interpolation duration: %s s' % (end - start))
Use np.random.random(size=nTimeSteps) to generate the samples.
For linear interpolation, use numpy.interp instead of interp1d. For higher order spline interpolation use either CubicSpline or make_interp_spline, and use vectorized evaluation.
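A minimal sketch of the numpy.interp replacement, assuming the same angle, forceAngleDependency_forEachTimeStep and angleHistory arrays as in the question (np.interp expects the x-coordinates to be increasing, which np.linspace already guarantees here):
# Linear interpolation directly with np.interp, avoiding the cost of building
# an interp1d object for every time step.
forceHistory = [np.interp(angleHistory[i], angle, forceAngleDependency_forEachTimeStep[i])
                for i in range(nTimeSteps)]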
I have a big data set representing 1.2M points in a 220-dimensional periodic space (each coordinate ranges over (-pi, pi)); the matrix is 1.2M x 220.
I would like to calculate a histogram of the distances between these points, taking the periodicity into account. I have written some Python code, but it still runs quite slowly for my test case (I am not even trying to run it on the whole set...).
Can you maybe take a look and help me with some tweaking?
Any suggestions and comments much appreciated.
import numpy as np

# 1000x220 test set (-pi, pi)
d = np.random.random((1000, 220)) * 2 * np.pi - np.pi

# Calculating the theoretical limit on the histogram range; the max distance
# between two points can be pi in each dimension
m = np.zeros(np.shape(d)[1]) + np.pi
m_ = np.sqrt(np.sum(m**2))
# Hist range is from 0 to mm
mm = np.floor(m_)
bins = int(round(mm / 0.01))
m = np.zeros(bins)

# Proper calculations
import time
start_time = time.time()
for i in range(np.shape(d)[0]):
    diff = d[:-(i+1), :] - d[i+1:, :]
    diff = np.absolute(diff)
    adiff = diff - np.pi
    diff = np.pi - np.absolute(adiff)
    s = np.sqrt(np.einsum('ij,ij->i', diff, diff))
    m += np.histogram(s, range=(0, mm), bins=bins)[0]
print(time.time() - start_time)
I think you will see the most improvement from breaking the main loop into smaller parts: divide range(...) into a couple of smaller ranges and use the threading module to have a couple of threads run the loop concurrently.
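A rough sketch of that idea, assuming the same d, mm and bins as in the question; each thread accumulates its own partial histogram and the partials are summed at the end (many numpy operations release the GIL, so the threads can overlap at least partially):
import threading
import numpy as np

def partial_hist(rows, out):
    # Same per-row periodic distance calculation as in the question, but only
    # for the row indexes assigned to this thread, accumulated into `out`.
    for i in rows:
        diff = np.absolute(d[:-(i+1), :] - d[i+1:, :])
        diff = np.pi - np.absolute(diff - np.pi)
        s = np.sqrt(np.einsum('ij,ij->i', diff, diff))
        out += np.histogram(s, range=(0, mm), bins=bins)[0]

n_threads = 4
chunks = np.array_split(np.arange(np.shape(d)[0]), n_threads)
partials = [np.zeros(bins) for _ in range(n_threads)]
threads = [threading.Thread(target=partial_hist, args=(c, p))
           for c, p in zip(chunks, partials)]
for t in threads:
    t.start()
for t in threads:
    t.join()
m = sum(partials)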
I have a range image of a scene. I traverse the image and calculate the average change in depth under the detection window. The detection window changes size based on the average depth of the pixels surrounding the current location. I accumulate the average change to produce a simple response image.
Most of the time is spent in the for loop, it is taking about 40+s for a 512x52 image on my machine. I was hoping for some speed up. Is there a more efficient/faster way to traverse the image? Is there a better pythonic/numpy/scipy way to visit each pixel? Or shall I go learn cython?
EDIT: I have reduced running time to about 18s by using scipy.misc.imread() instead of skimage.io.imread(). Not sure what the difference is, I will try to investigate.
Here is a simplified version of the code:
import matplotlib.pylab as plt
import numpy as np
from skimage.io import imread
from skimage.transform import integral_image, integrate
import time

def intersect(a, b):
    '''Determine the intersection of two rectangles'''
    rect = (0, 0, 0, 0)
    r0 = max(a[0], b[0])
    c0 = max(a[1], b[1])
    r1 = min(a[2], b[2])
    c1 = min(a[3], b[3])
    # Do we have a valid intersection?
    if r1 > r0 and c1 > c0:
        rect = (r0, c0, r1, c1)
    return rect

# Setup data
depth_src = imread("test.jpg", as_grey=True)
depth_intg = integral_image(depth_src)     # integrate to find sum of depth in a region
depth_pts = integral_image(depth_src > 0)  # integrate to find number of points which have depth
boundary = (0, 0, depth_src.shape[0]-1, depth_src.shape[1]-1)  # rectangle to intersect with

# Image to accumulate response
out_img = np.zeros(depth_src.shape)

# Average dimensions of bbox/detection window per unit length of depth
model = (0.602, 2.044)  # width, height

start_time = time.time()
for (r, c), junk in np.ndenumerate(depth_src):
    # Find points around current pixel
    r0, c0, r1, c1 = intersect((r-1, c-1, r+1, c+1), boundary)
    # Calculate average depth of points around current pixel
    scale = integrate(depth_intg, r0, c0, r1, c1) * 255 / 9.0
    # Based on average depth, create the detection window
    r0 = r - (model[0] * scale / 2)
    c0 = c - (model[1] * scale / 2)
    r1 = r + (model[0] * scale / 2)
    c1 = c + (model[1] * scale / 2)
    # Use the scale-optimised detection window to extract features
    r0, c0, r1, c1 = intersect((r0, c0, r1, c1), boundary)
    depth_count = integrate(depth_pts, r0, c0, r1, c1)
    if depth_count:
        depth_sum = integrate(depth_intg, r0, c0, r1, c1)
        avg_change = depth_sum / depth_count
        # Accumulate response
        out_img[r0:r1, c0:c1] += avg_change
print(time.time() - start_time, "seconds")

plt.imshow(out_img)
plt.gray()
plt.show()
Michael, interesting question. It seems that the main performance problem is that each pixel in the image has two integrate() calls computed on it, one over a 3x3 region and one over a region whose size is not known in advance. Calculating individual integrals in this way is extremely inefficient, regardless of which numpy functions you use; it's an algorithmic issue, not an implementation issue. Consider an image of size NxN. You can calculate all integrals of any size KxK in that image using only approximately 4*N*N operations, not (as one might naively expect) N*N*K*K. The way to do that is to first calculate an image of sliding sums over a window of K in each row, and then sliding sums over that result in each column. Updating each sliding sum to move to the next pixel requires only adding the newest pixel of the current window and subtracting the oldest pixel of the previous window, thus two operations per pixel regardless of window size. We have to do that twice (for rows and columns), hence 4 operations per pixel.
I am not sure if there is a sliding window sum built into numpy, but this answer suggests a couple of ways to do it, using stride tricks: https://stackoverflow.com/a/12713297/1828289. You can certainly accomplish the same with one loop over columns and one loop over rows (taking slices to extract a row/column).
Example:
import numpy

# img is a 2D ndarray
# K is the size of the sums to calculate using a sliding window
row_sums = numpy.zeros_like(img)
for i in range(img.shape[0]):
    if i >= K:
        row_sums[i, :] = row_sums[i-1, :] - img[i-K, :] + img[i, :]
    elif i > 0:
        row_sums[i, :] = row_sums[i-1, :] + img[i, :]
    else:  # i == 0
        row_sums[i, :] = img[i, :]

col_sums = numpy.zeros_like(img)
for j in range(img.shape[1]):
    if j >= K:
        col_sums[:, j] = col_sums[:, j-1] - row_sums[:, j-K] + row_sums[:, j]
    elif j > 0:
        col_sums[:, j] = col_sums[:, j-1] + row_sums[:, j]
    else:  # j == 0
        col_sums[:, j] = row_sums[:, j]

# Here col_sums[i, j] equals numpy.sum(img[i-K+1:i+1, j-K+1:j+1]) for i >= K-1 and j >= K-1,
# i.e. the KxK sum whose bottom-right corner is (i, j).
# The first K-1 rows and columns of col_sums contain partial sums and can be ignored.
How do you best apply that to your case? I think you might want to pre-compute the integrals for 3x3 (average depth) and also for several larger sizes, and use the value of the 3x3 sum to select one of the larger sizes for the detection window (assuming I understand the intent of your algorithm). The range of larger sizes you need might be limited, or artificially limiting it might still work acceptably well; just pick the nearest size. Calculating all the integrals together using sliding sums is so much more efficient that I am almost certain it is worth calculating them for many sizes you would never use at a particular pixel, especially if some of the sizes are large.
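As a rough sketch of that pre-computation idea (the cumulative-sum formulation below is equivalent to the sliding sums above; the list of candidate sizes and the selection rule are placeholders rather than values tuned to your data, and depth_src is the image from your code):
import numpy

def box_sums(img, K):
    # Sum of img over every KxK window, via two cumulative sums.
    c = img.cumsum(axis=0).cumsum(axis=1)
    c = numpy.pad(c, ((1, 0), (1, 0)))  # prepend a zero row and column
    # result[i, j] = numpy.sum(img[i:i+K, j:j+K])
    return c[K:, K:] - c[:-K, K:] - c[K:, :-K] + c[:-K, :-K]

# Pre-compute sums for a small set of candidate window sizes (placeholder values).
sizes = [3, 9, 17, 33, 65]
precomputed = {K: box_sums(depth_src, K) for K in sizes}
# Per pixel, pick the candidate size closest to the scale suggested by the 3x3 average,
# e.g. K = min(sizes, key=lambda K: abs(K - desired_window_size)).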
P.S. This is a minor addition, but you may want to avoid calling intersect() for every pixel: either (a) only process pixels which are farther from the edge than the max integral size, or (b) add margins to the image of the max integral size on all sides, filling the margins with either zeros or nans, or (c) (best approach) use slices to take care of this automatically: a slice index outside the boundary of an ndarray is automatically limited to the boundary, except of course negative indexes are wrapped around.
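For example, option (c) for the 3x3 neighbourhood would look like this (a sketch; the upper bounds are clipped automatically, only the lower bounds need an explicit max() because negative indexes wrap around):
# 3x3 neighbourhood around (r, c), clipped to the image without calling intersect()
neighbourhood = depth_src[max(r-1, 0):r+2, max(c-1, 0):c+2]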
EDIT: added example of sliding window sums