Python absdiff performance on submatrices

I am calculating the sum of the absolute difference of two matrices (representing two images in RGBA, with uint8 type) to see how similar they are to each other. I've stumbled upon a weird efficiency problem, where calculating the distance on submatrices (specifically, I'm only considering the RGB channels) takes roughly five times as long as doing it on the whole matrices.
Here's my code:
Firstly, I load two images using OpenCV and convert them to RGBA (since they are loaded as BGR) like this (the paths are specified before):
template_image = cv2.cvtColor(cv2.imread(template_path, flags=cv2.IMREAD_UNCHANGED), cv2.COLOR_BGR2RGBA)
big_image = cv2.cvtColor(cv2.imread(big_image_path, flags=cv2.IMREAD_UNCHANGED), cv2.COLOR_BGR2RGBA)
Then I calculate the similarity this way (the 1000 loops are only there to exaggerate the time it takes; I do 1000 loops in all the tests, so it shouldn't make a difference):
first = template_image[:, :, 0:3]
second = big_image[0:200, 0:400, 0:3]
t = time.time()
for i in range(1000):
    numpy.sum(cv2.absdiff(first, second))
print(f'elapsed time: {(time.time() - t):.3f} s')
Running this multiple times gives elapsed times that swing between 1 s and 1.2 s.
This surprised me, because before restricting the calculation to the RGB channels it took much less time. In fact, with this code:
first = template_image[:, :]
second = big_image[0:200, 0:400]
t = time.time()
for i in range(1000):
    numpy.sum(cv2.absdiff(first, second))
print(f'elapsed time: {(time.time() - t):.3f} s')
the average elapsed time is roughly 0.25 s.
This alone confuses me: why should it take five times as long to perform technically fewer calculations?
I've also tried a different piece of code that checks the similarity only on the RGB channels, and this confuses me even more:
first = numpy.zeros((200, 400, 3), dtype=numpy.uint8)
first[:, :, :] = template_image[:, :, 0:3]
second = numpy.zeros((200, 400, 3), dtype=numpy.uint8)
second[:, :, :] = big_image[0:200, 0:400, 0:3]
t = time.time()
for i in range(1000):
    numpy.sum(cv2.absdiff(first, second))
print(f'elapsed time: {(time.time() - t):.3f} s')
Now this code takes on average 0.19 s. It makes sense that it is faster than the full RGBA version, since there are fewer calculations, but why would it be so much faster than the other version that also operates only on the RGB channels?
I've specified dtype=numpy.uint8 because that's the type of the images as OpenCV reads them; otherwise numpy would create float64 matrices, which (understandably) take much more time.
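One detail worth noting (my own check, with names of my own choosing, not part of the timings above): the channel slice is a non-contiguous view of the RGBA array, while the zeros-then-assign copies are freshly allocated contiguous arrays. The difference can be seen with the numpy array flags and numpy.ascontiguousarray:
first_view = template_image[:, :, 0:3]
print(first_view.flags['C_CONTIGUOUS'])             # False: the slice skips over the alpha channel
first_copy = numpy.ascontiguousarray(first_view)    # forces a contiguous copy, like the zeros-then-assign version
print(first_copy.flags['C_CONTIGUOUS'])             # True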
What's the reason behind this weird behaviour? Thanks!

Related

Cupyx 3D FFT convolution on GPU slows down after multiple sequential calls

First off, I have never asked a question on stackoverflow before, and I will do my best to follow the site guidelines, but let me know if I should change something about my post.
I am attempting to write a function that can quickly extract the pore size distribution from a binary 3D image. I do this by computing the local thickness of the image, similar to what is implemented in ImageJ's local thickness plugin. Ideally this function needs to run in under 1 second, as I am calling it ~200000 times in a simulated annealing process. It runs partially on the CPU (12th Gen Intel(R) Core(TM) i7-12700KF, 20 cores, 16GB RAM) and partially on the GPU (GeForce RTX 3050, 8GB).
The function works, but something, I think on the backend, is slowing it down artificially. This may have to do with threading, GPU-to-CPU overhead, or some kind of 'cool down' period.
There are three parts to the function:
Euclidean distance transform - performed on the CPU, in parallel, using the edt package. Currently takes ~0.25 seconds on a 250^3 binary image.
3D skeletonization - performed on the CPU using skimage.morphology.skeletonize_3d, but with the image split into chunks using dask. This implementation is provided by porespy.filters.chunked_func. The skeleton is then multiplied by the distance transform to get a skeleton whose values equal the minimum distance to the nearest background voxel. This process takes 0.45 to 0.5 seconds.
Dilation of each voxel on the skeleton using a spherical structuring element with radius equal to the value of that skeleton voxel. This is done in a for loop, starting from the maximum structuring element size and going in decreasing order, so larger spheres do not get overwritten by smaller spheres. The dilations are performed with FFT convolution on the GPU using cupyx.scipy.signal.signaltools.convolve, which takes ~0.005 seconds.
Less code is required to reproduce the effect I am seeing, however. The essential part is performing many fft convolutions in sequence.
A minimum reproducible example is as follows:
import skimage
import time
import cupy as cp
from cupyx.scipy.signal.signaltools import convolve

# Generate a binary image
im = cp.random.random((250, 250, 250)) > 0.4

# Generate spherical structuring kernels for input to convolution
structuring_kernels = {}
for r in range(1, 21):
    structuring_kernels.update({r: cp.array(skimage.morphology.ball(r))})

# run dilation process in loop
for i in range(10):
    s = time.perf_counter()
    for j in range(20, 0, -1):
        convolve(im, structuring_kernels[j], mode='same', method='fft')
    e = time.perf_counter()
    # time.sleep(2)
    print(e - s)
When run as is, after the first couple of loops each dilation loop takes ~1.8 seconds on my computer. If I uncomment the time.sleep(2) line (i.e. pause for 2 seconds between loops), each loop only takes 0.05 seconds. I suspect this has to do with threading or GPU use, as it takes a couple of loops to reach the 1.8 seconds, and then it remains steady at that value. When I monitor my GPU usage, the 3D usage graph quickly spikes to 100% and stays close to there.
If I am just being limited by the capacity of my GPU, why do the first couple of loops run faster? Could a memory leak be happening? Does anyone know why this is happening, and if there is a way to prevent it, possibly using backend controls in cupy?
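As an aside (my own timing variant, not from the original post): CuPy launches GPU kernels asynchronously, so the perf_counter readings above largely measure how quickly calls are queued rather than how long the GPU actually takes. Synchronizing the device before reading the clock gives the true per-loop cost:
for i in range(10):
    cp.cuda.Device().synchronize()   # wait for any previously queued GPU work
    s = time.perf_counter()
    for j in range(20, 0, -1):
        convolve(im, structuring_kernels[j], mode='same', method='fft')
    cp.cuda.Device().synchronize()   # wait for these convolutions to actually finish
    e = time.perf_counter()
    print(e - s)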
I'm not sure if this is necessary, but my local thickness function in its entirety is as follows:
import porespy as ps
from skimage.morphology import skeletonize_3d
import time
import numpy as np
import cupy as cp
from edt import edt
from cupyx.scipy.signal.signaltools import convolve
def local_thickness_cp(im, masks=None, method='fft'):
    """
    Parameters
    ----------
    im: 3D voxelized image for which the local thickness map is desired
    masks: (optional) A dictionary of the structuring elements to be used
    method: 'fft' or 'direct'

    Returns
    -------
    The local thickness map
    """
    s = time.perf_counter()
    # Calculate the euclidean distance transform using the edt package
    dt = cp.array(edt(im, parallel=15))
    e = time.perf_counter()
    # print(f'EDT took {e - s}')

    s = time.perf_counter()
    # Calculate the skeleton of the image and multiply by dt
    skel = cp.array(ps.filters.chunked_func(skeletonize_3d,
                                            overlap=17,
                                            divs=[2, 3, 3],
                                            cores=20,
                                            image=im).astype(bool)) * dt
    e = time.perf_counter()
    # print(f'skeletonization took {e - s} seconds')

    r_max = int(cp.max(skel))
    s = time.perf_counter()
    if not masks:
        masks = {}
        for r in range(int(r_max), 0, -1):
            masks.update({r: cp.array(ps.tools.ps_ball(r))})
    e = time.perf_counter()
    # print(f'mask creation took {e - s} seconds')

    # Initialize the local thickness image
    final = cp.zeros(cp.shape(skel))
    time_in_loop = 0
    s = time.perf_counter()
    for r in range(r_max, 0, -1):
        # Get a mask of where the skeleton has values between r-1 and r
        skel_selected = ((skel > r - 1) * (skel <= r)).astype(int)
        # Perform dilation on the mask using the fft convolve method, and multiply by the radius of the pore size
        dilation = (convolve(skel_selected, masks[r], mode='same', method=method) > 0.1) * r
        # Add dilation to the local thickness image where it is still zero (i.e. don't overwrite previously inserted values)
        final = final + (final == 0) * dilation
    e = time.perf_counter()
    # print(f'Dilation loop took {e - s} seconds')
    return final
Now, in theory, the function should take ~0.80 seconds to compute. However, when called in a loop on separate images, it takes ~1.5 seconds. If I add a time.sleep(1) after each function call, though, the function does take approximately 0.8 seconds.
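As a further aside on the memory-leak suspicion (again my own addition, not part of the original post): CuPy keeps freed GPU allocations in a memory pool, which can be inspected and emptied to rule a leak in or out:
mempool = cp.get_default_memory_pool()
print(mempool.used_bytes(), mempool.total_bytes())   # bytes currently in use / held by the pool
mempool.free_all_blocks()                            # return cached blocks to the device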

How to parallelize writing to the same cell in a numpy array?

Background: I have millions of points in 2D space with (x_position, y_position, value) associated with each point. I am trying to summarize these points by creating an image, where each pixel can contain multiple points. To summarize, each pixel stores the sum of values at that (x_pixel, y_pixel) location in the image.
Question: How can I do this efficiently? Currently, my code does something like this:
image = np.zeros((4096, 4096))
for point in data:
    x_pixel, y_pixel = convertPointPos2PixelPos(point)
    image[x_pixel, y_pixel] += point.getValue()
but the ETA for this code to complete is 450 hours, which is unacceptable. Is there a way to parallelize this? The code writes to the same image[x,y] index multiple times. I found StackOverflow posts that suggest using multiprocessing, but I think the locking needed to prevent race conditions would make it take just as long as it does without parallelizing.
Assuming you want something on a regular grid, you can use simple division to bin your data. Here is an example:
size = (4096, 4096)
data = np.random.rand(100000000, 3)
image = np.zeros(size)
coords = data[:, :2]
min = coords.min(0)
max = coords.max(0)
index = np.floor_divide(coords - min, (max - min) / np.subtract(size, 1), out=np.empty(coords.shape, dtype=int), casting='unsafe')
index is now an array of indices into image where you want to add the corresponding values. You can do an unbuffered add using np.add.at:
np.add.at(image, tuple(index.T), data[:, -1])
If your data range is better defined than just the bounding box of the coordinates, you can save a little time by not computing coords.max() and coords.min().
The result is a dense 4096x4096 image of the summed values (the plot itself is not reproduced here). The entire operation takes 6.4 sec on my very moderately powered machine for 10M points, including the calls to plt.imshow and plt.colorbar and garbage collection before runs.
Timings were collected using the %%timeit cell magic in IPython.
Either way, you're well under 450 hours. Even if your coordinate transformation is not linear binning, I expect that you can run in reasonable time as long as you vectorize it properly. Also, multiprocessing is not likely to give you a huge boost since it requires copying data around.
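As a follow-up sketch (not part of the original answer): under the same regular-grid assumption, np.histogram2d with weights performs the binning and the weighted sum in a single call. The bin edges differ slightly from the floor_divide version above, but the idea is the same:
data = np.random.rand(10_000_000, 3)
# bin on the first two columns, summing the third column per bin
image, _, _ = np.histogram2d(data[:, 0], data[:, 1],
                             bins=(4096, 4096),
                             weights=data[:, 2])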

TensorFlow gradients for every item of a tensor

I have a network that takes an Nx3 matrix as input and produces an N-dimensional vector. Let's say the batch size is 1 and N=1024, so the output has shape (1,1024). I want to compute the gradient of every dimension of the output with respect to the input, that is, dy/dx for every y. However, TensorFlow's tf.gradients computes the aggregate d sum(y)/dx. I know there's no straightforward way to compute the gradients for every output dimension, so I finally decided to run tf.gradients 1024 times, because I only have to do this once in the project and never again.
So I do this:
start = datetime.datetime.now()
output_code_split = tf.split(output_code, 1024)
# output shape = (1024,)
grad_ops = []
for i in range(1024):
    gr = tf.gradients(output_code_split[i], input)
    # output shape = (1024,1,16,1024,3), where 16 = batch size
    gr = tf.reduce_mean(gr, [0, 1, 2, 3])
    # output shape = (1024,)
    grad_ops.append(gr)
    present = datetime.datetime.now()
    print(i, (present - start).seconds, flush=True)
    # prints time taken to finish the previous computation
    start = datetime.datetime.now()
When the code started running, the time between two iterations was 4 seconds, so I figured it would run for roughly 4096 seconds. However, as the number of iterations increases, the time taken by subsequent runs keeps increasing. The gap, which was 4 seconds when the code started, eventually got to 30 seconds after about 500 iterations, which is too much.
Is the list holding the gradient ops, grad_ops, growing bigger and occupying more memory? I'm unfortunately not in a position to do a detailed memory profiling of this code. Any ideas about what causes the iteration time to blow up as time goes on?
(Note that in the code, I'm only creating the gradient ops and not actually evaluating them. That part comes later, but my code doesn't reach there on account of the extreme slowdown mentioned above)
Thanks.
What blows up your execution time is that you define a new operation on the graph in every iteration of your for loop. Every call to tf.gradients and tf.reduce_mean pushes a new node onto the graph, which then needs to be recompiled before it can run. What should actually work for you is to use tf.gather with an int32 placeholder that supplies the index to your gradient operation. So something like this:
idx_placeholder = tf.placeholder(tf.int32, shape=(None,))
# gradient of the selected output entry with respect to the network input
grad_operation = tf.gradients(tf.gather(output_code_split, idx_placeholder), input)
for i in range(1024):
    sess.run(grad_operation, {idx_placeholder: np.array([i])})

Counting pixels using TensorFlow and GPU

I have mask images with size (N, 256, 256), where N is a value between 1000-10000.
Each pixel has an integer value between 0-2 (0 is just background).
Unfortunately, the mask image is not encoded as (N,256,256,2).
I have a few thousand of these masks. My goal is to find the quickest method of counting the pixels per frame for each label (1 and 2).
Running the code below on one mask image with roughly 6000 frames using numpy takes < 2 s.
np.sum(ma==1,axis=(1,2))
np.sum(ma==2,axis=(1,2))
I expect it will take a few hours to run on the entire data set with a single process, and maybe less than an hour with multiprocessing (CPU).
I'm curious whether I can make it even faster using the GPU. Summing a tensor over axes seems easy to implement, but I can't figure out how to implement the ma==1 part in TensorFlow.
I thought about first encoding the input to shape (N,256,256,2) and passing that to the tensor placeholder, but realized that building an array of that shape would take even longer than the above.
Or is there a better way to implement the pixel count on this mask data using TensorFlow?
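For what it's worth, here is a minimal sketch (my own, not from the original post) of how the ma==1 part maps onto TensorFlow 1.x: tf.equal builds the boolean mask and tf.reduce_sum counts it per frame, analogous to np.sum(ma == 1, axis=(1, 2)):
import numpy as np
import tensorflow as tf

ma_ph = tf.placeholder(tf.uint8, shape=(None, 256, 256))
# per-frame counts of pixels equal to 1 and 2
count_1 = tf.reduce_sum(tf.cast(tf.equal(ma_ph, 1), tf.int32), axis=(1, 2))
count_2 = tf.reduce_sum(tf.cast(tf.equal(ma_ph, 2), tf.int32), axis=(1, 2))

with tf.Session() as sess:
    ma = np.random.randint(0, 3, size=(1000, 256, 256), dtype=np.uint8)
    c1, c2 = sess.run([count_1, count_2], feed_dict={ma_ph: ma})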
Think about what's going on in the background
Roughly the following steps are done twice in your original implementation:
Loading the whole array from memory and checking whether each value equals the desired value
Writing the results back to memory (the temporary array is as large as your input array, np.uint8 assumed)
Loading the whole array into memory again and summing up the results
Writing the results back to memory
It should be clear that this is a quite suboptimal implementation, parallelized or not. I could not do it any better in a purely vectorized numpy way, but there are tools available (Numba, Cython) with which you can implement this task in a more direct and parallelized way.
Example
import numpy as np
import numba as nb
import time

# Create some data
N = 10000
images = np.random.randint(0, high=3, size=(N, 256, 256), dtype=np.uint8)

def sum_orig(ma):
    A = np.sum(ma == 1, axis=(1, 2))
    B = np.sum(ma == 2, axis=(1, 2))
    return A, B

@nb.njit(fastmath=True, parallel=True)
def sum_mod(ma):
    A = np.zeros(ma.shape[0], dtype=np.uint32)
    B = np.zeros(ma.shape[0], dtype=np.uint32)
    # parallel loop over the frames
    for i in nb.prange(ma.shape[0]):
        AT = 0
        BT = 0
        for j in range(ma.shape[1]):
            for k in range(ma.shape[2]):
                if ma[i, j, k] == 1:
                    AT += 1
                if ma[i, j, k] == 2:
                    BT += 1
        A[i] = AT
        B[i] = BT
    return A, B

# Warm up
# The function is compiled at the first call
[A, B] = sum_mod(images)

t1 = time.time()
[A, B] = sum_mod(images)
print(time.time() - t1)

t1 = time.time()
[A_, B_] = sum_orig(images)
print(time.time() - t1)

# check if it works correctly
print(np.allclose(A, A_))
print(np.allclose(B, B_))
Performance
improved_version: 0.06s
original_version: 2.07s
speedup: 33x

Calculate histogram of distances between points in big data set

I have a big data set representing 1.2M points in a 220-dimensional periodic space (each coordinate varies from -pi to pi), i.e. a 1.2M x 220 matrix.
I would like to calculate a histogram of the distances between these points, taking the periodicity into account. I have written some code in Python, but it still runs quite slowly on my test case (I am not even trying to run it on the whole set...).
Can you maybe take a look and help me with some tweaking?
Any suggestions and comments much appreciated.
import numpy as np
# 1000x220 test set (-pi, pi)
d = np.random.random((1000, 220)) * 2 * np.pi - np.pi
# calculating the theoretical limit on the histogram range: the maximum
# distance between two points can be pi in each dimension
m = np.zeros(np.shape(d)[1]) + np.pi
m_ = np.sqrt(np.sum(m**2))
# hist range is from 0 to mm
mm = np.floor(m_)
bins = int(mm / 0.01)
m = np.zeros(bins)
# proper calculations
import time
start_time = time.time()
for i in range(np.shape(d)[0]):
    diff = d[:-(i + 1), :] - d[i + 1:, :]
    diff = np.absolute(diff)
    adiff = diff - np.pi
    diff = np.pi - np.absolute(adiff)
    s = np.sqrt(np.einsum('ij,ij->i', diff, diff))
    m += np.histogram(s, range=(0, mm), bins=bins)[0]
print(time.time() - start_time)
I think you will see the most improvement from breaking the main loop into smaller parts, by dividing range(...) into a couple of smaller ranges and using the threading module to have a couple of threads run the loop concurrently, as sketched below.
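A minimal sketch of that suggestion (the helper name, chunk count and use of ThreadPoolExecutor are my own choices), reusing d, mm and bins from the question; numpy releases the GIL for most of the heavy array work, so the threads can genuinely overlap:
from concurrent.futures import ThreadPoolExecutor

def partial_hist(rows, d, mm, bins):
    # accumulate a private histogram for a subset of the outer-loop indices
    h = np.zeros(bins)
    for i in rows:
        diff = np.absolute(d[:-(i + 1), :] - d[i + 1:, :])
        diff = np.pi - np.absolute(diff - np.pi)   # periodic distance per dimension
        s = np.sqrt(np.einsum('ij,ij->i', diff, diff))
        h += np.histogram(s, range=(0, mm), bins=bins)[0]
    return h

# split the outer loop across 4 threads (early indices do more work, so this split is not perfectly balanced)
chunks = np.array_split(np.arange(d.shape[0]), 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    m = sum(pool.map(lambda rows: partial_hist(rows, d, mm, bins), chunks))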
