I have mask images with size (N, 256, 256), where N is a value between 1000-10000.
Each pixel has an integer value between 0 and 2 (0 is just background).
Unfortunately, the mask image is not encoded as (N, 256, 256, 2).
I have a few thousand of these masks. My goal is to find the quickest method of counting the pixels per frame for each label (1 and 2).
Running the code below on one mask image with roughly 6000 frames using NumPy takes < 2 s.
np.sum(ma==1,axis=(1,2))
np.sum(ma==2,axis=(1,2))
I expect it will take a few hours to run on the entire dataset with a single process, and maybe less than an hour with multiprocessing (CPU).
I'm curious whether I can make it even faster on a GPU. Summing a tensor over axes seems easy to implement, but I can't find how to implement the ma == 1 part in TensorFlow.
I thought about encoding the input to shape (N, 256, 256, 2) first and passing that to the tensor placeholder, but realized that building an array of that shape would take even longer than the code above.
Or is there a better way to implement the pixel count on this mask data using TensorFlow?
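For reference, the ma == 1 comparison maps onto tf.equal followed by a cast and tf.reduce_sum; a minimal sketch in TensorFlow 2.x eager style (the eager style and the array size are assumptions, since the question mentions placeholders):

import numpy as np
import tensorflow as tf

ma = np.random.randint(0, 3, size=(6000, 256, 256)).astype(np.uint8)
t = tf.constant(ma)

# elementwise comparison (like ma == 1), cast to integers, then sum per frame
count1 = tf.reduce_sum(tf.cast(tf.equal(t, 1), tf.int32), axis=[1, 2])
count2 = tf.reduce_sum(tf.cast(tf.equal(t, 2), tf.int32), axis=[1, 2])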
Think about what's going on in the background
Roughly the following steps are done twice in your original implementation:
Loading the whole array from memory and checking whether each value equals the desired value
Writing the results back to memory (the temporary array is as large as your input array, np.uint8 assumed)
Loading the whole array from memory again and summing up the results
Writing the results back to memory
It should be clear that this is quite a suboptimal implementation, parallelized or not. I could not do it any better in a pure vectorized NumPy way, but there are tools available (Numba, Cython) with which you can implement this task in a more direct and parallelized way.
Example
import numpy as np
import numba as nb
import time

# Create some data
N = 10000
images = np.random.randint(0, high=3, size=(N, 256, 256), dtype=np.uint8)

def sum_orig(ma):
    A = np.sum(ma == 1, axis=(1, 2))
    B = np.sum(ma == 2, axis=(1, 2))
    return A, B

@nb.njit(fastmath=True, parallel=True)
def sum_mod(ma):
    A = np.zeros(ma.shape[0], dtype=np.uint32)
    B = np.zeros(ma.shape[0], dtype=np.uint32)
    # parallel loop over the frames
    for i in nb.prange(ma.shape[0]):
        AT = 0
        BT = 0
        for j in range(ma.shape[1]):
            for k in range(ma.shape[2]):
                if ma[i, j, k] == 1:
                    AT += 1
                if ma[i, j, k] == 2:
                    BT += 1
        A[i] = AT
        B[i] = BT
    return A, B

# Warm up
# The function is compiled at the first call
[A, B] = sum_mod(images)

t1 = time.time()
[A, B] = sum_mod(images)
print(time.time() - t1)

t1 = time.time()
[A_, B_] = sum_orig(images)
print(time.time() - t1)

# check that it works correctly
print(np.allclose(A, A_))
print(np.allclose(B, B_))
Performance
improved_version: 0.06s
original_version: 2.07s
speedup: 33x
Related
In digital image processing, many filters are non-linear, such as the harmonic mean filter.
I know NumPy provides many vectorized functions that can speed up computation tremendously, but I don't know of any that work well with non-linear masks.
Specifically, I want to speed up my implementation of the filter above by removing its two ugly, snail-paced Python for loops:
import math as m
import cv2 as cv
import numpy as np

def harmonic(im, ksize):
    # Make a copy of the original image
    result = im.copy().astype(np.float32)
    # Calculate the padding size, and pad the original image
    psize = m.floor(ksize / 2)  # padding size
    im = cv.copyMakeBorder(im, psize, psize, psize, psize, cv.BORDER_REFLECT)
    # Perform the non-linear operation
    for i in range(0, result.shape[0]):
        for j in range(0, result.shape[1]):
            # Get the neighborhood, same size as the kernel
            neighbor = im[i:(i + 2 * psize + 1), j:(j + 2 * psize + 1)].astype(np.float32)
            # Calculate the reciprocal sum
            recp_sum = np.sum(np.reciprocal(neighbor, where=neighbor != 0).astype(np.float32))
            # Harmonic mean for that neighborhood
            if recp_sum != 0:
                result[i][j] = float((ksize * ksize) / recp_sum)
    return result.astype(np.uint8)
In general, can we use NumPy to create arbitrary custom vectorized operations on an array, or only a limited number of operations, and what types are they? If yes, what specifically could I do to optimize the code above?
I have tried to explore NumPy vectorization recently, and np.vectorize really caught my attention. However, the examples provided in the documentation seem (as far as I can tell) irrelevant to the problem I am trying to solve. (English is not my native language, so I may have missed something; I'd be happy to have it elaborated!)
Related to np.vectorize, I do not really understand the pyfunc parameter. Does it really eliminate the traditional Python loop wrapped in that pyfunc, or is it there just to define a mapping applied at each pixel of the array?
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. That is,
tmp = 1 / im.astype(np.float32)
tmp = cv2.blur(tmp, (ksize, ksize))
out = 1 / tmp
You might want to add a bit of code there to avoid division by zero. The simplest way is to replace zeros with very small values.
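For instance, a minimal sketch of that zero handling, where the function name and the epsilon value are arbitrary choices, not from the original answer:

import cv2
import numpy as np

def harmonic_blur(im, ksize, eps=1e-6):
    # replace zeros with a very small value so the reciprocal stays finite
    tmp = 1.0 / np.maximum(im.astype(np.float32), eps)
    # arithmetic mean of the reciprocals over each ksize x ksize window
    tmp = cv2.blur(tmp, (ksize, ksize))
    # reciprocal of that mean = harmonic mean
    return (1.0 / tmp).astype(np.uint8)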
Background: I have millions of points in 2D space with (x_position, y_position, value) associated with each point. I am trying to summarize these points by creating an image, where each pixel can contain multiple points. To summarize, each pixel stores the sum of values at that (x_pixel, y_pixel) location in the image.
Question: How can I do this efficiently? Currently, my code does something like this:
image = np.zeros((4096,4096))
for point in data:
    x_pixel, y_pixel = convertPointPos2PixelPos(point)
    image[x_pixel, y_pixel] += point.getValue()
but the ETA for this code to complete is 450 hours, which is unacceptable. Is there a way to parallelize this? The code writes to the same image[x, y] index multiple times. I found StackOverflow posts that suggest using multiprocessing, but I think the locking needed to prevent race conditions would make it just as slow as the serial version.
Assuming you want something on a regular grid, you can use simple division to bin your data. Here is an example:
size = (4096, 4096)
data = np.random.rand(100000000, 3)

image = np.zeros(size)
coords = data[:, :2]
cmin = coords.min(0)
cmax = coords.max(0)
# map each coordinate onto the pixel grid with a single floor division
index = np.floor_divide(coords - cmin, (cmax - cmin) / np.subtract(size, 1),
                        out=np.empty(coords.shape, dtype=int), casting='unsafe')
index is now an array of indices into image where you want to add the corresponding values. You can do an unbuffered add using np.add.at:
np.add.at(image, tuple(index.T), data[:, -1])
If your data range is better defined than just the bounding box of the coordinates, you can save a little time by not computing coords.max() and coords.min().
This entire operation takes 6.4 s on my very moderately powered machine for 10M points, including the calls to plt.imshow and plt.colorbar and garbage collection before runs.
Timing collected using the %%timeit cell magic in IPython.
Either way, you're well under 450 hours. Even if your coordinate transformation is not simple linear binning, I expect you can get a reasonable runtime as long as you vectorize it properly. Also, multiprocessing is not likely to give you a huge boost, since it requires copying data around.
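As a concrete illustration of the remark about a better-defined data range, here is a minimal sketch assuming the coordinates are already known to lie in [0, 1):

import numpy as np

size = (4096, 4096)
data = np.random.rand(10000000, 3)          # columns: x, y in [0, 1), plus a value

image = np.zeros(size)
# with a known range, binning is a single scale-and-truncate; no min()/max() passes
index = (data[:, :2] * size).astype(int)
np.add.at(image, (index[:, 0], index[:, 1]), data[:, 2])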
I'm developing a small piece of software for a college project and I'm having a problem: the code's performance is way too low.
It's an image-editing program, and the image is a large 3D list (the main list is the whole image, each list inside it is a horizontal line, and each list inside that one is a pixel containing three elements).
I need to make pixel-by-pixel adjustments, like multiplying all of them by a constant, so it goes like this:
for y in range(0, len(image)):
    for x in range(0, len(image[0])):
        for c in range(0, 3):
            image[y][x][c] = image[y][x][c] * a
where image is the 3D list,
len(image) is the number of horizontal lines in the image (vertical size),
len(image[0]) is the number of pixels in a horizontal line (horizontal size),
and c is the component of the pixel (going from 0 to 2).
This loop takes several minutes to go through a single 12 MP image, and the number of images I have to process is in the hundreds, so this is just impossible.
What can I do to get better performance? Even editing software takes a few seconds, because it can be a pretty large operation, but this code is just too slow.
Thank you!
I also (as in the comments) suggest using Numpy.
Sample code would be something like this:
import numpy as np

im = np.array(image, dtype="float16")

# Define your custom function
def myFunc(x, a):
    x = x * a
    return x

# Vectorise the function
vfunc = np.vectorize(myFunc)

# Apply the function to the array with the parameter a = 5
im = vfunc(im, 5)
I compared timings for a vectorized NumPy function and nested loops for an array of a size roughly equivalent to a 12 MP image: 4242 x 2828 x 3.
Nested loops took 99 seconds, while NumPy took about 6.5 s.
For your reference, here is a question about the efficiency of NumPy functions: Most efficient way to map function over numpy array
For simple functions like multiplication, using NumPy's native operations is the fastest.
# Multiply each element by 5
im = im * 5
This code took only 0.5 sec on my machine.
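That gap is expected: np.vectorize is essentially a convenience wrapper that still calls the Python function once per element, while im * 5 runs as a single compiled operation over the whole array. A minimal sketch of the native route for the original use case, with a dummy nested-list image and an assumed clamp back to 8-bit:

import numpy as np

# dummy nested-list "image" standing in for the original 3D list
image = [[[x, x, x] for x in range(100)] for _ in range(100)]

a = 1.5
im = np.asarray(image, dtype=np.float32)        # convert the nested lists once
out = np.clip(im * a, 0, 255).astype(np.uint8)  # scale, clamp, and go back to 8-bit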
I have two arrays of size (n, m, m) (n images of size (m, m) each). I want to perform a cross-correlation between each corresponding pair of images in the two arrays.
Example: n=1 -> corr2d([m,m]_1, [m,m]_2)
My current approach involves a bunch of Python for loops:
for i in range(len(X)):
    X_co = X[i, 0, :, :] / np.max(X[i, 0, :, :])
    X_x = X[i, 1, :, :] / np.max(X[i, 1, :, :])
    autocorr[i, 0, :, :] = correlate2d(X_co, X_x, mode='same', boundary='fill', fillvalue=0)
Obviously this is very slow when the input contains many images, and it becomes a substantial part of the total run time when (m, m) << n.
The obvious optimization is to skip the loop and feed everything directly to the compiled correlation function. Currently I'm using scipy's correlate2d.
I've looked around but haven't found any function that allows correlation along some axis or multiple inputs.
Any tips on how to make scipy's correlate2d work or alternatives?
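One library-level option worth noting: recent SciPy versions let scipy.signal.fftconvolve restrict the transform to chosen axes, and flipping the second stack turns the convolution into a correlation for real-valued inputs. A rough sketch under those assumptions (array shapes are placeholders):

import numpy as np
from scipy.signal import fftconvolve

# two stacks of images, shape (n, m, m)
A = np.random.rand(500, 30, 30)
B = np.random.rand(500, 30, 30)

# correlation of real inputs = convolution with the second input flipped
corr = fftconvolve(A, B[:, ::-1, ::-1], mode='same', axes=(-2, -1))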
I decided to implement it via the FFT instead.
def fft_xcorr2D(x):
    # FFT over axes (-2, -1) (the default in fft2)
    # Pad because of the cyclic (circular) behaviour of the FFT
    x = np.fft.fft2(np.pad(x, ([0, 0], [0, 0], [0, 34], [0, 34]), mode='constant'))
    # Conjugate for correlation, not convolution (convolution theorem)
    x[:, 1, :, :] = np.conj(x[:, 1, :, :])
    # Multiply elementwise over the 2nd axis (2 image bands for me),
    # inverse FFT over axes (-2, -1), then fftshift over the rows and columns of each image
    corr = np.fft.fftshift(np.fft.ifft2(np.prod(x, axis=1)), axes=(-2, -1))
    # Return after removing the padding
    return np.abs(corr)[:, 3:-2, 3:-2]
Call via:
ts=fft_xcorr2D(X)
If anybody wants to use it:
My input is a 4D array: (N, 2, #Rows, #Cols)
E.g. (500, 2, 30, 30): 500 images, 2 bands (polarizations, for example), of 30x30 pixels
If your input is different, adjust the padding to your liking.
Check that your input ordering is the same as mine; otherwise change the axes arguments in the fft2 and ifft2 calls, in np.prod, and in fftshift. I use fftshift to get the maximum value in the middle (otherwise it ends up in the corners), so be wary of that if that's not what you want.
Why is it the maximum value? Technically it doesn't have to be, but for my purpose it is. fftshift is used to get a correlation that looks the way you're used to; otherwise the quadrants are turned "inside out". If you wonder what I mean, remove the fftshift (just the fftshift call, not its arguments), call the function as before, and plot it.
Afterwards, it should be ready to use.
Possibly x.prod(axis=1) is faster than np.prod(x, axis=1), but that tip comes from an old post; it showed no improvement when I tried it.
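To connect this back to two (n, m, m) stacks like the question's, a minimal sketch of how the 4D input might be assembled; the per-image max normalization mirrors the original loop, and the array names are placeholders:

import numpy as np

n, m = 500, 30
X_co = np.random.rand(n, m, m)   # first stack of images
X_x = np.random.rand(n, m, m)    # second stack of images

# normalize each image by its own maximum, as in the original loop
X_co_n = X_co / X_co.max(axis=(1, 2), keepdims=True)
X_x_n = X_x / X_x.max(axis=(1, 2), keepdims=True)

# stack into the (N, 2, #Rows, #Cols) layout expected by fft_xcorr2D
X = np.stack([X_co_n, X_x_n], axis=1)
ts = fft_xcorr2D(X)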
I'm trying to implement the minimum distance algorithm for image classification using GDAL and Python. After calculating the mean pixel value of the sample areas and storing those means in a list of arrays ("sample_array"), I read the image into an array called "values". With the following code I loop through this array:
values = valBD.ReadAsArray()

# loop through pixel columns
for X in range(0, XSize):
    # loop through pixel lines
    for Y in range(0, YSize):
        # initialize variables
        minDist = 9999
        # get minimum distance
        for iSample in range(0, sample_count):
            # dist = calc_distance(values[jPixel, iPixel], sample_array[iSample])
            # computing minimum distance
            iPixelVal = values[Y, X]
            mean = sample_array[iSample]
            dist = math.sqrt((iPixelVal - mean) * (iPixelVal - mean))  # only for testing
            if dist < minDist:
                minDist = dist
                values[Y, X] = iSample

classBD.WriteArray(values, xoff=0, yoff=0)
This procedure takes very long for big images, which is why I want to ask whether somebody knows a faster method. I don't know much about the access speed of different variables in Python. Or maybe someone knows a library I could use.
Thanks in advance,
Mario
You should definitely be using NumPy. I work with some pretty large raster datasets and NumPy burns through them. On my machine, with the code below there's no noticeable delay for a 1000 x 1000 array. An explanation of how this works follows the code.
import numpy as np
from scipy.spatial.distance import cdist
# some starter data
dim = (1000,1000)
values = np.random.randint(0, 10, dim)
# cdist will want 'samples' as a 2-d array
samples = np.array([1, 2, 3]).reshape(-1, 1)
# this could be a one-liner
# 'values' must have the same number of columns as 'samples'
mins = cdist(values.reshape(-1, 1), samples)
outvalues = mins.argmin(axis=1).reshape(dim)
cdist() calculates the "distance" from each element in values to each of the elements in samples. This generates a 1,000,000 x 3 array, where each row n has the distance from pixel n in the original array to each of the sample values [1, 2, 3]. argmin(axis=1) gives you the index of the minimum value along each row, which is what you want. A quick reshape gives you the rectangular format you'd expect for an image.
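The same pattern extends to multi-band imagery, which is the usual setting for minimum-distance classification; a minimal sketch assuming a 3-band raster and five class means (all shapes here are placeholders, not from the original answer):

import numpy as np
from scipy.spatial.distance import cdist

bands, rows, cols = 3, 1000, 1000
values = np.random.randint(0, 255, (bands, rows, cols))   # e.g. from ReadAsArray()
sample_array = np.random.randint(0, 255, (5, bands))      # 5 class means, one row per class

# one pixel per row, one band per column, to match cdist's expectations
pixels = values.reshape(bands, -1).T                      # shape (rows*cols, bands)
classified = cdist(pixels, sample_array).argmin(axis=1).reshape(rows, cols)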
Agree with Thomas K: use PIL, or else write a C function and wrap it using e.g. ctypes, or at the very least use some NumPy matrix operations.
Or else use PyPy on your existing code (JIT-compiled code can be 100x faster on image code). Try PyPy and tell us what speedup you got.
Bottom line: never do stuff pixel-wise like this natively in CPython; the interpretation and memory-management overhead will kill you.
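As a small illustration of the "NumPy matrix operations" suggestion, a broadcasting variant of the same minimum-distance idea, with placeholder sample values:

import numpy as np

values = np.random.randint(0, 10, (1000, 1000))
samples = np.array([1, 2, 3])    # placeholder class means

# distance from every pixel to every sample value, then pick the closest class
labels = np.abs(values[..., None] - samples).argmin(axis=-1)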