Optimizing windowed feature generation from input image in Python


I have written a Python script that takes an image as input and produces a new image in which each pixel is a feature calculated from a window of pixels around the corresponding pixel in the input image.
At the borders we can either insert NaN into the output image or use only the pixels that are available inside the window. What would be an optimized way to achieve this in Python or some other programming language? At the moment, my script simply uses a bunch of for-loops to get the job done. Here is the code:
import math
import numpy as np

# This function returns the statistical features
#
# INPUTS:
# 'data' the data from which statistical features are to be calculated
# 'winSize' the window size, must be odd and > 1
#
# OUTPUT:
# 'meanData, stdData' statistical feature matrices (numpy ndarrays)
def get_stat_feats(data, winSize):
    rows = data.shape[0]
    cols = data.shape[1]
    dist = int(math.floor(float(winSize)/2.0))
    neigh = range(-dist, dist+1)
    temp = np.zeros((int(winSize)**2, 1))
    meanData = np.zeros(data.shape)
    stdData = np.zeros(data.shape)
    for row in range(0, rows):
        for col in range(0, cols):
            index = 0
            makeNaN = 0
            for y in neigh:
                for x in neigh:
                    indY = row + y
                    indX = col + x
                    # Check that we are inside the image
                    if indY >= 0 and indY <= rows-1 and indX >= 0 and indX <= cols-1:
                        temp[index] = data[indY, indX]
                        index += 1
                    else:
                        makeNaN = 1
            # Any window that reaches outside the image yields NaN
            if makeNaN == 1:
                meanData[row, col] = np.nan
                stdData[row, col] = np.nan
            else:
                meanData[row, col] = np.mean(temp)
                stdData[row, col] = np.std(temp)
    return meanData, stdData
Thanks for any help! If any more information is needed, please ask.

generic_filter from scipy.ndimage should be a decent solution for this. There are probably faster solutions, but I think this is the simplest.
It takes a mode parameter that defines how the edges are handled. For example, you could treat elements outside the border as a constant equal to NaN like this:
generic_filter(a, f, size=winSize, mode='constant', cval=np.nan)
def get_stat_feats(data, winSize):
    from scipy.ndimage import generic_filter
    import numpy as np
    mean = lambda x: x.mean()
    std = lambda x: x.std()
    meanData = generic_filter(data, mean, size=winSize)
    stdData = generic_filter(data, std, size=winSize)
    return meanData, stdData
To force float and round the return values:
import numpy as np
def get_stat_feats(data, winSize):
    from scipy.ndimage import generic_filter
    # Cast to float so the filter output is not truncated to an integer dtype
    data = data.astype(float)
    mean = lambda x: x.mean()
    std = lambda x: x.std()
    meanData = generic_filter(data, mean, size=winSize)
    stdData = generic_filter(data, std, size=winSize)
    return np.round(meanData, 2), np.round(stdData, 2)
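If the NaN border behaviour from the question is wanted and speed matters, one possible alternative is a sketch built on numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later). This is not part of the answer above: the function name get_stat_feats_windows is made up for illustration, and the sketch assumes a 2-D image that is at least winSize pixels in each dimension and an odd winSize greater than 1.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def get_stat_feats_windows(data, winSize):
    # Hypothetical helper: windowed mean/std with NaN wherever the window leaves the image
    data = np.asarray(data, dtype=float)
    dist = winSize // 2
    # View of shape (rows - winSize + 1, cols - winSize + 1, winSize, winSize); no data is copied
    windows = sliding_window_view(data, (winSize, winSize))
    meanData = np.full(data.shape, np.nan)
    stdData = np.full(data.shape, np.nan)
    # Only pixels whose full window fits inside the image receive a value
    meanData[dist:-dist, dist:-dist] = windows.mean(axis=(-2, -1))
    stdData[dist:-dist, dist:-dist] = windows.std(axis=(-2, -1))
    return meanData, stdData

image = np.arange(25).reshape(5, 5)
meanData, stdData = get_stat_feats_windows(image, 3)
```

The reductions run inside NumPy rather than calling a Python function per pixel, which is where generic_filter spends most of its time.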

Related

rasterio data - python - execution time - preprocessing

I'm using Rasterio to work with a satellite image, and I need to iterate through the entire file, applying a formula to each pixel. This takes a long time, which makes it difficult to try out different modifications because I have to wait so long to see the results each time.
Any suggestions for improving the execution time?
And is it better to work on this project locally or via Jupyter, Google Colab, or other tools?
import math
from math import cos
import numpy as np

def dn_to_radiance(data_array, band_number):
    # getting the G value
    channel_gain = float(Landsat8_mlt_dict['RADIANCE_MULT_BAND_' + str(band_number) + ' '])
    # getting the B value
    channel_offset = float(Landsat8_mlt_dict['RADIANCE_ADD_BAND_' + str(band_number) + ' '])
    # creating a temp array to store the radiance value
    # np.empty_like returns a new array with the same shape and type as a given array
    new_data_array = np.empty_like(data_array)
    # looping through the image
    for i, row in enumerate(data_array):
        for j, col in enumerate(row):
            # checking if the pixel value is not nan, to avoid background correction
            # (note: any comparison `x != np.nan` is always True, so no pixel is skipped here)
            if data_array[i][j].all() != np.nan:
                new_data_array[i][j] = data_array[i][j] * channel_gain + channel_offset
    print(f'Radiance calculated for band {band_number}')
    return new_data_array

Landsat8_mlt_dict = {}
with open('LC08_L2SP_190037_20190619_20200827_02_T1_MTL.txt', 'r') as _:
    # print(type(_))
    for line in _:
        line = line.strip()
        if line != 'END':
            key, value = line.split('=')
            Landsat8_mlt_dict[key] = value
# print(Landsat8_mlt_dict)

def radiance_to_reflectance(arr, ESUN):
    # getting the d value
    d = float(Landsat8_mlt_dict['EARTH_SUN_DISTANCE '])
    # calculating the phi value from theta
    phi = 90 - float(Landsat8_mlt_dict['SUN_ELEVATION '])
    # creating the temp array
    new_data_array = np.empty_like(arr)
    # loop to find the reflectance
    for i, row in enumerate(arr):
        for j, col in enumerate(row):
            if arr[i][j].all() != np.nan:
                new_data_array[i][j] = np.pi * arr[i][j] * d ** 2 / (ESUN * cos(phi * math.pi / 180))
    print("Reflectance of Band calculated")
    return new_data_array

You could use third-party libraries such as EOReader to convert Landsat bands to reflectance for you.
from eoreader.reader import Reader
from eoreader.bands import RED, GREEN
prod = Reader().open(r"LC08_L1TP_200030_20201220_20210310_02_T1.tar")
# Load those bands as a dict of xarray.DataArray
bands = prod.load([RED, GREEN])
green = bands[GREEN]
red = bands[RED]
Disclaimer: I am the maintainer of EOReader.
If you want to do that yourself, you should do some tutorials on how to handle arrays in Python.
Never ever loop over them! You should instead vectorize your computations: it will go way faster!
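As a rough illustration of what "vectorize" means here, the two per-pixel loops from the question can be written as whole-array expressions. This is only a sketch: the function names and the explicit gain, offset, esun, earth_sun_distance and sun_elevation_deg parameters are assumptions made for the example (the question reads those values from its metadata dictionary), and the sample numbers are invented.

```python
import numpy as np

def dn_to_radiance_vectorized(data_array, gain, offset):
    # Cast to float first so the scaled values are not truncated to an integer dtype
    return np.asarray(data_array, dtype=np.float64) * gain + offset

def radiance_to_reflectance_vectorized(radiance, esun, earth_sun_distance, sun_elevation_deg):
    # Solar zenith angle in radians, derived from the sun elevation in degrees
    zenith = np.deg2rad(90.0 - sun_elevation_deg)
    return np.pi * radiance * earth_sun_distance ** 2 / (esun * np.cos(zenith))

# Made-up digital numbers and coefficients, for illustration only
dn = np.array([[0, 10], [20, 30]], dtype=np.uint16)
radiance = dn_to_radiance_vectorized(dn, gain=0.5, offset=-1.0)
reflectance = radiance_to_reflectance_vectorized(
    radiance, esun=np.pi, earth_sun_distance=1.0, sun_elevation_deg=90.0
)
```

NaN pixels need no special case: arithmetic on NaN yields NaN, so background pixels stay NaN in the output.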

np.where() to eliminate data, where coordinates are too close to each other

I'm doing aperture photometry on a cluster of stars, and to get easier detection of background signal, I want to only look at stars further apart than n pixels (n=16 in my case).
I have 2 arrays, xs and ys, with the x- and y-values of all the stars' coordinates:
Using np.where I'm supposed to find the indexes of all stars, where the distance to all other stars is >= n
So far, my method has been a for-loop
import numpy as np
# Lists of coordinates w. values between 0 and 2000 for 5000 stars
xs = np.random.rand(5000)*2000
ys = np.random.rand(5000)*2000
# for-loop, wherein the np.where statement in question is situated
n = 16
for i in range(len(xs)):
    index = np.where( np.sqrt( pow(xs[i] - xs,2) + pow(ys[i] - ys,2)) >= n)
Due to the stars being clustered pretty closely together, I expected a severe reduction in data, though even when I tried n=1000 I still had around 4000 datapoints left

Using just numpy (and part of the answer here)
X = np.random.rand(5000,2) * 2000
XX = np.einsum('ij, ij ->i', X, X)
D_squared = XX[:, None] + XX - 2 * X.dot(X.T)
np.fill_diagonal(D_squared, np.inf)  # ignore each star's zero distance to itself
out = np.where(D_squared.min(axis = 0) > n**2)
Using scipy.spatial.pdist
from scipy.spatial.distance import pdist, squareform
D_squared = squareform(pdist(X, metric = 'sqeuclidean'))
np.fill_diagonal(D_squared, np.inf)  # ignore each star's zero distance to itself
out = np.where(D_squared.min(axis = 0) > n**2)
Using a KDTree for maximum speed:
from scipy.spatial import KDTree
X_tree = KDTree(X)
# query_pairs returns every pair of points that lie within distance n of each other
in_radius = np.array(list(X_tree.query_pairs(n))).flatten()
out = np.where(~np.isin(np.arange(X.shape[0]), in_radius))
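A variation on the KDTree idea, offered only as a sketch: ask the tree for each star's two nearest neighbours, so that the second hit is the closest other star. The helper name isolated_indices and the three sample coordinates are invented for the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def isolated_indices(xs, ys, min_dist):
    # Hypothetical helper: indices of points whose nearest other point is at least min_dist away
    points = np.column_stack([xs, ys])
    tree = cKDTree(points)
    # k=2: the first hit is the point itself (distance 0), the second is its nearest neighbour
    dists, _ = tree.query(points, k=2)
    return np.where(dists[:, 1] >= min_dist)[0]

# Two stars one pixel apart and one far away from both
xs = np.array([0.0, 1.0, 100.0])
ys = np.array([0.0, 0.0, 0.0])
isolated = isolated_indices(xs, ys, min_dist=16)
```

This avoids building the full pairwise distance matrix, which for 5000 stars already holds 25 million entries.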

import numpy as np
np.random.seed(seed=1)
xs = np.random.rand(5000,1)*2000
ys = np.random.rand(5000,1)*2000
n = 16
mask = (xs>=0)
for i in range(len(xs)):
    if mask[i]:
        # drop every star within n pixels of star i, then keep star i itself
        index = np.where( np.sqrt( pow(xs[i] - xs,2) + pow(ys[i] - ys,2)) <= n)
        mask[index] = False
        mask[i] = True
x = xs[mask]
y = ys[mask]
print(len(x))
4220

You can use np.subtract.outer to create the pairwise comparisons. Then you check, for each row, whether the distance is below 16 for exactly one item (which is the comparison of the particular star with itself):
distances = np.sqrt(
    np.subtract.outer(xs, xs)**2
    + np.subtract.outer(ys, ys)**2
)
indices = np.nonzero(np.sum(distances < 16, axis=1) == 1)

Why is minimize_scalar not minimizing correctly?

I am a new Python user, so bear with me if this question is obvious.
I am trying to find the value of lmbda that minimizes the following function, given a fixed vector Z and scalar sigma:
def sure_sft(z,lmbda, sigma):
    indicator = np.abs(z) <= lmbda
    minimum = np.minimum(z**2,lmbda**2)
    return -sigma**2*np.sum(indicator) + np.sum(minimum)
When I pass in values of lmbda manually, I find that the function produces the correct value of sure_sft. However, when I try to use the following code to find the value of lmbda that minimizes sure_sft:
minimize_scalar(lambda lmbda: sure_sft(Z, lmbda, sigma))
it gives me an incorrect value for sure_sft (-8.6731 for lmbda = 0.4916). If I pass 0.4916 manually to sure_sft, I obtain -7.99809 instead. What am I doing incorrectly? I would appreciate any advice!
EDIT: I've pasted my code below. The data is from: https://web.stanford.edu/~chadj/HallJones400.asc
import pandas as pd
import numpy as np
from scipy.optimize import minimize_scalar
# FUNCTIONS
# Calculate orthogonal projection of g onto f
def proj(f, g):
    return ( np.dot(f,g) / np.dot(f,f) ) * f
def gs(X):
    # Copy of X -- will be used to store orthogonalization
    F = np.copy(X)
    # Orthogonalize design matrix
    for i in range(1, X.shape[1]): # Iterate over columns of X
        for j in range(i): # Iterate over columns less than current one
            F[:,i] -= proj(F[:,j], X[:,i]) # Subtract projection of x_i onto f_j for all j<i from F_i
    # normalize each column to have unit length
    norm_F=( (F**2).mean(axis=0) ) ** 0.5 # Row vector with sqrt root of average of the squares of each column
    W = F/norm_F # Normalize
    return W
# SURE for soft-thresholding
def sure_sft(z,lmbda, sigma):
    indicator = np.abs(z) <= lmbda
    minimum = np.minimum(z**2,lmbda**2)
    return -sigma**2*np.sum(indicator) + np.sum(minimum)
# Import data.
data_raw = pd.read_csv("hall_jones1999.csv")
# Drop missing observations.
data = data_raw.dropna(subset=['logYL', 'Latitude'])
Y = data['logYL']
Y = np.array(Y)
N = Y.size
# Create design matrix.
design = np.empty([data['Latitude'].size,15])
design[:,0] = 1
for j in range(1, 15):
    design[:,j] = data['Latitude']**j
K = design.shape[1]
# Use Gram-Schmidt on design matrix.
W = gs(design)
Z = np.dot(W.T, Y)/N
# MLE
mu_mle = np.dot(W, Z)
# Soft-thresholding
# Use MLE residuals to calculate sigma for SURE calculation
sigma = np.sqrt(np.sum((Y - mu_mle)**2)/(N-K))
# Write SURE as a function of lmbda
sure = lambda lmbda: sure_sft(Z, lmbda, sigma)
# Find SURE-minimizing lmbda
lmbda = minimize_scalar(sure).x
min_sure = minimize_scalar(sure).fun #-8.673172212265738
# Compare to manually inputting minimized lambda into sure_sft
# I'm s
act_sure1 = sure_sft(Z, 0.49167598, sigma) #-7.998060514873529
act_sure2 = sure_sft(Z, 0.491675989, sigma) #-8.673172212306728

You're actually not doing anything wrong. I just tested out the code and confirmed that lmbda has a value of 0.4916759890416824 at the end of the script. You can confirm this for yourself by adding the following lines to the bottom of your script:
print(lmbda)
print(sure_sft(Z, lmbda, sigma))
When you run your script, you should then see:
0.4916759890416824
-8.673158394698172
The only thing I can figure is that somehow the routine you were using to print out lmbda was set up to only print a fixed number of digits of floating point numbers, or somehow the printout was otherwise truncated.
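One way to see why a few extra digits can matter so much: the indicator term in sure_sft makes the function jump by sigma**2 whenever lmbda crosses one of the |z| values. The sketch below uses a single made-up z value placed between the two lmbda values the question compares by hand; the real Z from the script is not reproduced here, and sigma is set to 1.0 purely for illustration.

```python
import numpy as np

def sure_sft(z, lmbda, sigma):
    indicator = np.abs(z) <= lmbda
    minimum = np.minimum(z**2, lmbda**2)
    return -sigma**2 * np.sum(indicator) + np.sum(minimum)

# Hypothetical data: one coefficient lying between the two lmbda values below
z = np.array([0.4916759885])
sigma = 1.0

lmbda_short = 0.49167598    # eight digits, as typed in by hand in the question
lmbda_long = 0.491675989    # one more digit

value_short = sure_sft(z, lmbda_short, sigma)  # indicator is False for this z
value_long = sure_sft(z, lmbda_long, sigma)    # indicator is True, so sigma**2 is subtracted
jump = value_short - value_long                # roughly sigma**2

# repr() keeps enough digits to reproduce a float exactly
lmbda = 0.4916759890416824
round_trip_ok = float(repr(lmbda)) == lmbda
```

Passing the variable lmbda itself to sure_sft, rather than retyping its printed digits, sidesteps the problem entirely.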

Rolling Gradient for Pandas Dataframe column

How can I create a column in a pandas dataframe which is the gradient of another column?
I want the gradient to be run over a rolling window, so only 4 data points are assessed at one time.
I am assuming it is something like:
df['Gradient'] = np.gradient(df['Yvalues'].rolling(center=False,window=4))
However this gives error:
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
Any ideas?
Thank you!!

I think I found a solution, though it's probably not the most efficient.
class lines(object):
    def __init__(self):
        pass
    def date_index_to_integer_axis(self, dateindex):
        d = [d.date() for d in dateindex]
        days = [(d[x] - d[x-1]).days for x in range(0,len(d))]
        axis = np.cumsum(days)
        axis = [x - days[0] for x in axis]
        return axis
    def roll(self, Xvalues, Yvalues, w): # Rolling generator function # https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python
        for i in range(len(Xvalues) + 1 - w):
            yield Xvalues[i:i + w], Yvalues[i:i + w]
    def gradient(self,Xvalues,Yvalues):
        # Uses the least squares method.
        # Returns the gradient and intercept of two array vectors (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.lstsq.html)
        A = np.vstack([Xvalues, np.ones(len(Xvalues))]).T
        m, c = np.linalg.lstsq(A, Yvalues, rcond=None)[0]
        return m,c
    def gradient_column(self, data, window):
        """ Takes in a single COLUMN EXTRACT from a DATAFRAME (with associated "DATE" index) """
        #get "X" values
        Xvalues = self.date_index_to_integer_axis(data.index)
        Xvalues = np.asarray(Xvalues,dtype=float)
        #get "Y" values
        Yvalues = np.asarray([val for val in data],dtype=float)
        #calculate rolling window "Gradient" ("m" in Y = mx + c)
        Gradient_Col = [self.gradient(sx,sy)[0] for sx,sy in self.roll(Xvalues,Yvalues, int(window))]
        Gradient_Col = np.asarray(Gradient_Col,dtype=float)
        nan_array = np.empty([int(window)-1])
        nan_array[:] = np.nan
        #fill blanks at the start of the "Gradient_Col" so it is the same length as the original Dataframe (it's shorter due to WINDOW)
        Gradient_Col = np.insert(Gradient_Col, 0, nan_array)
        return Gradient_Col
df['Gradient'] = lines().gradient_column(df['Operating Revenue'],window=4)

From the given information, it can be seen that you haven't provided an aggregation function to your rolling window.
df['Gradient'] = np.gradient(
    df['Yvalues']
    .rolling(center=False, window=4)
    .mean()
)
or
df['Gradient'] = np.gradient(
    df['Yvalues']
    .rolling(center=False, window=4)
    .sum()
)
You can read more about rolling functions at this website.
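Note that np.gradient applied to a rolling mean gives the gradient of the smoothed series, which is not the same thing as a slope fitted inside each window. If a per-window least-squares slope is what is wanted, one possible sketch uses rolling(...).apply with np.polyfit. The helper name rolling_slope and the sample data are invented for the example, and the sketch assumes evenly spaced samples, so the x values are simply 0, 1, ..., window - 1.

```python
import numpy as np
import pandas as pd

def rolling_slope(series, window):
    # Hypothetical helper: slope of a straight-line fit over each trailing window
    x = np.arange(window, dtype=float)
    # raw=True passes each window to the function as a plain NumPy array
    return series.rolling(window=window).apply(
        lambda y: np.polyfit(x, y, 1)[0], raw=True
    )

# Made-up data with a constant slope of 2 per step
df = pd.DataFrame({"Yvalues": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]})
df["Gradient"] = rolling_slope(df["Yvalues"], window=4)
```

The first window - 1 rows come out as NaN because no full window is available there, so the result always has the same length as the dataframe index.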

Python, how to optimize this code

I tried to optimize the code below, but I cannot figure out how to improve the computation speed. I tried Cython, but the performance is the same as in plain Python.
Is it possible to improve the performance without rewriting everything in C/C++?
Thanks for any help
import numpy as np
heightSequence = 400
widthSequence = 400
nHeights = 80
DOF = np.zeros((heightSequence, widthSequence), dtype = np.float64)
contrast = np.float64(np.random.rand(heightSequence, widthSequence, nHeights))
initDOF = np.zeros([heightSequence, widthSequence], dtype = np.float64)
initContrast = np.zeros([heightSequence, widthSequence, nHeights], dtype = np.float64)
initHeight = np.float64(np.r_[0:nHeights:1.0])
initPixelContrast = np.array(([0 for ii in range(nHeights)]), dtype = np.float64)
# for each row
for row in range(heightSequence):
    # for each col
    for col in range(widthSequence):
        # initialize variables
        height = initHeight # array ndim = 1
        c = initPixelContrast # array ndim = 1
        # for each height
        for indexHeight in range(0, nHeights):
            # get contrast profile for current pixel
            tempC = contrast[:, :, indexHeight]
            c[indexHeight] = tempC[row, col]
        # save original contrast
        # originalC = c
        # originalHeight = height
        # remove profile before maximum and after minimum contrast
        idxMaxContrast = np.argmax(c)
        c = c[idxMaxContrast:]
        height = height[idxMaxContrast:]
        idxMinContrast = np.argmin(c) + 1
        c = c[0:idxMinContrast]
        height = height[0:idxMinContrast]
        # remove some refraction
        if (len(c) <= 1) | (np.max(c) <= 0):
            DOF[row, col] = 0
        else:
            # linear fitting of profile contrast
            P = np.polyfit(height, c, 1)
            m = P[0]
            q = P[1]
            # remove some refraction
            if m >= 0:
                DOF[row, col] = 0
            else:
                DOF[row, col] = -q / m
    print('row=%i/%i' % (row, heightSequence))
# set range of DOF
DOF[DOF < 0] = 0
DOF[DOF > nHeights] = 0

By looking at the code it seems that you can get rid of the two outer loops completely, converting the code to a vectorised form. However, the np.polyfit call must then be replaced by some other expression, but the coefficients for a linear fit are easy to find, also in vectorised form. The last if-else can then be turned into a np.where call.
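A sketch of what that could look like, under the assumption that the loop logic in the question is to be reproduced as closely as possible: a boolean mask selects, per pixel, the samples from the maximum contrast up to the following minimum, the closed-form least-squares formulas replace np.polyfit, and np.where replaces the if-else. The function name depth_from_contrast and the tiny test profile are invented for illustration.

```python
import numpy as np

def depth_from_contrast(contrast, heights):
    # contrast: array of shape (rows, cols, n_heights); heights: array of shape (n_heights,)
    n_heights = contrast.shape[2]
    k = np.arange(n_heights)
    # Index of the maximum contrast for every pixel
    idx_max = np.argmax(contrast, axis=2)
    # Samples at or after the maximum; earlier ones are ignored when locating the minimum
    after_max = k >= idx_max[..., None]
    idx_min = np.argmin(np.where(after_max, contrast, np.inf), axis=2)
    # Samples used for the fit: from the maximum up to and including the minimum
    mask = after_max & (k <= idx_min[..., None])
    n = mask.sum(axis=2)
    x = np.where(mask, heights, 0.0)
    y = np.where(mask, contrast, 0.0)
    sx, sy = x.sum(axis=2), y.sum(axis=2)
    sxx, sxy = (x * x).sum(axis=2), (x * y).sum(axis=2)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Closed-form least-squares slope m and intercept q of c = m * height + q
        m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
        q = (sy - m * sx) / n
        dof = -q / m
        # Same rejection rules as the loop: too few samples, non-positive peak, non-negative slope
        valid = (n > 1) & (contrast.max(axis=2) > 0) & (m < 0)
    dof = np.where(valid, dof, 0.0)
    # Keep the result inside the range used in the question
    dof[(dof < 0) | (dof > n_heights)] = 0.0
    return dof

# Two made-up pixels: a falling profile (10 - 2h) and a rising one (h + 1)
heights = np.arange(6.0)
contrast = np.empty((1, 2, 6))
contrast[0, 0] = 10.0 - 2.0 * heights
contrast[0, 1] = heights + 1.0
dof = depth_from_contrast(contrast, heights)
```

For the falling profile the fitted line crosses zero at height 5; the rising profile keeps only one sample after its maximum and is therefore set to 0, as in the loop version.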
