In numpy, I have a 3D array. Along the 0 axis, it stores multiple 2D planes. I need to get the gradient of each of these planes, select the median gradient magnitude at each point across these planes, and hence isolate the corresponding x and y gradient components. But I'm having difficulty carrying this out properly.
So far, to get the gradient and median, I have:
img_l = #My 3D array of 2D planes
grad = np.gradient(img_l,axis=[1,2]) #Get gradient of each image. This is a list with 2 elements.
mag_grad = np.sqrt(grad[0]**2 + grad[1]**2) #Get magnitude of gradient in each case
med = np.median(mag_grad, axis=0) #Get median value at each point in the planes
Then to select the correct x & y components of the gradient, I use:
pos=(mag_grad==med).argmax(axis=0) #Index of the first plane along axis=0 whose magnitude equals the median
G = np.stack([np.zeros(med.shape),np.zeros(med.shape)], axis=0) #Will store y and x median components of the gradient, respectively.
for i in range(med.shape[0]):
    for j in range(med.shape[1]):
        G[0,i,j], G[1,i,j] = grad[0][pos[i,j],i,j], grad[1][pos[i,j],i,j] #Manually select the median y and x components of the gradient, and save to G.
I believe the 2nd code block works correctly. However, it is very inelegant, and because I couldn't find a way to do this in NumPy, I had to use a Python loop which adds a large amount of overhead. In addition, since this operation occurs frequently in NumPy, I suspect there should be an in-built way to do this.
How can I implement this code more effectively and elegantly?
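You can build every (i, j) index pair with itertools and let NumPy's integer-array indexing do the selection in a single assignment, which removes the nested loop. Building the list of pairs still happens in Python, so the gain is mostly in readability.
import itertools
idxs = np.array(list(itertools.product(range(med.shape[0]), range(med.shape[1]))))
i, j = idxs[:,0], idxs[:,1] #Row and column index of every pixel
G[0,i,j], G[1,i,j] = grad[0][pos[i,j],i,j], grad[1][pos[i,j],i,j]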
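The selection can also be done without building any index pairs, using `np.take_along_axis`. The following is a minimal sketch, not a drop-in replacement. It assumes random stand-in data for `img_l` and an odd number of planes, so that the median is an actual sample. With an even number of planes `np.median` averages the two middle values, `mag_grad == med` may never be true, and `argmax` then silently returns 0. Picking the plane closest to the median with `argmin` is a deliberate change from the question's `==` test.

```python
import numpy as np

rng = np.random.default_rng(0)
img_l = rng.random((5, 6, 7))            # stand-in data: 5 planes of 6x7 pixels

gy, gx = np.gradient(img_l, axis=(1, 2)) # y and x gradient of every plane
mag_grad = np.hypot(gy, gx)              # gradient magnitude per plane
med = np.median(mag_grad, axis=0)        # median magnitude at each pixel

# Plane index whose magnitude is closest to the median at each pixel
pos = np.abs(mag_grad - med).argmin(axis=0)

# take_along_axis needs an index array with the same number of dimensions
idx = pos[np.newaxis]                    # shape (1, rows, cols)
G = np.stack([
    np.take_along_axis(gy, idx, axis=0)[0],  # y component at the median plane
    np.take_along_axis(gx, idx, axis=0)[0],  # x component at the median plane
])
```

`G` has shape `(2, rows, cols)`, the same layout as in the question.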
I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i]-c[y[i][0]]))
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X- np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword. Without it, numpy computed the norm of the entire flattened matrix rather than along the rows (axis 1). I've updated the answer, and it now returns one distance per point. The einsum call was suggested by @Kris for their specific application.
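If the labels are kept as a 1-D array, plain integer-array indexing is enough and the einsum step is not needed. This is a small sketch with random stand-in data, compared against the loop from the question. For labels of shape `(n, 1)` as in the question, `y.ravel()` gives the 1-D form assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
clusters = 3
X = rng.random((10, 3))                    # 10 points in 3D
y = rng.integers(0, clusters, size=10)     # 1-D labels, one per point
c = rng.random((clusters, 3))              # centroid positions

# c[y] has shape (10, 3): row i is the centroid assigned to point i
distances = np.linalg.norm(X - c[y], axis=1)

# Reference: the explicit loop from the question
loop_distances = np.array([np.linalg.norm(X[i] - c[y[i]]) for i in range(len(X))])
```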
I am trying to calculate the divergence of a 3D velocity field in a multi-phase flow setting (with solids immersed in a fluid). If we assume u,v,w to be the three velocity components (each an n x n x n 3D numpy array), here is the function I have for calculating divergence:
def calc_divergence_velocity(df,h=0.025,dim=80):
    """
    @param df: A dataframe with the entire vector field with columns [x,y,z,u,v,w] with
               x,y,z indicating the 3D coordinates of each point in the field and u,v,w
               the velocities in the x,y,z directions respectively.
    @param h: This is the dimension of a single side of the 3D (uniform) grid. Used
              as input to numpy.gradient() function.
    @param dim: Number of grid points per axis (80 here; made a parameter so the
                function runs on its own).
    """
    """
    Reshape dataframe columns to get 3D numpy arrays (dim = 80) so each u,v,w is a
    80x80x80 ndarray.
    """
    u = df['u'].values.reshape((dim,dim,dim))
    v = df['v'].values.reshape((dim,dim,dim))
    w = df['w'].values.reshape((dim,dim,dim))
    #Supply x,y,z coordinates appropriately.
    #Note: Only a scalar `h` has been supplied to np.gradient because
    #the type of grid we are dealing with is a uniform grid with each
    #grid cell having the same dimensions in x,y,z directions.
    u_grad = np.gradient(u,h,axis=0) #central diff. du_dx
    v_grad = np.gradient(v,h,axis=1) #central diff. dv_dy
    w_grad = np.gradient(w,h,axis=2) #central diff. dw_dz
    """
    The `mask` column in the dataframe is a binary column indicating the locations
    in the field where we are interested in measuring divergence.
    The problem I am looking at is multi-phase flow with solid particles and a fluid
    hence we are only interested in the fluid locations.
    """
    sdf = df['mask'].values.reshape((dim,dim,dim))
    div = (u_grad*sdf) + (v_grad*sdf) + (w_grad*sdf)
    return div
The problem I'm having is that the divergence values I am seeing are far too high.
For example, the distribution I plotted has values in [-350, 350], whereas most values should be close to zero, somewhere in [-20, 20] in my case. This tells me I'm calculating the divergence incorrectly, and I would like some pointers on how to correct the above function. As far as I can tell (please correct me if I'm wrong), I think I have done something similar to this upvoted SO response. Thanks in advance!
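One way to narrow this down is to run the same gradient calls on an analytic field whose divergence is known. The sketch below is only a sanity check, not a diagnosis of the function above. The helper name `divergence` and the test fields are made up for illustration, and the grid is built with `indexing="ij"` so that axis 0 is x, axis 1 is y and axis 2 is z. The last lines show what happens when the axes of an array do not match that assumption, which a reshape can cause if the dataframe rows are ordered with x varying fastest.

```python
import numpy as np

def divergence(u, v, w, h):
    """du/dx + dv/dy + dw/dz on a uniform grid with spacing h (axis 0 = x)."""
    return (np.gradient(u, h, axis=0)
            + np.gradient(v, h, axis=1)
            + np.gradient(w, h, axis=2))

n, h = 8, 0.025
coords = np.arange(n) * h
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")

# Linear fields, so finite differences are exact up to rounding
div_zero = divergence(x, 2 * y, -3 * z, h)   # analytic value: 1 + 2 - 3 = 0
div_three = divergence(x, y, z, h)           # analytic value: 1 + 1 + 1 = 3

# Same u data with the x and z axes swapped, as a mis-ordered reshape would give
u_swapped = x.transpose(2, 1, 0)
du_wrong = np.gradient(u_swapped, h, axis=0) # 0 everywhere instead of du/dx = 1
```

If the real data passes this kind of check, the remaining suspects are the spacing `h` and the differences taken across fluid/solid interfaces, since the mask is applied only after the gradients are computed.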
I have two arrays of size (n, m, m) (n number of images of size (m,m)). I want to perform a cross correlation between each corresponding n of the two arrays.
Example: n=1 -> corr2d([m,m]_1, [m,m]_2)
My current approach relies on Python for loops:
for i in range(len(X)):
    X_co = X[i,0,:,:]/(np.max(X[i,0,:,:]))
    X_x = X[i,1,:,:]/(np.max(X[i,1,:,:]))
    autocorr[i,0,:,:]=correlate2d(X_co, X_x, mode='same', boundary='fill', fillvalue=0)
Obviously this is very slow when the input contains many images, and becomes a substantial part of the total run time if (m,m) << n.
The obvious optimization is to skip the loop and feed everything directly to the compiled correlation function. Currently I'm using scipy's correlate2d.
I've looked around but haven't found any function that allows correlation along some axis or multiple inputs.
Any tips on how to make scipy's correlate2d work or alternatives?
I decided to implement it via the FFT instead.
def fft_xcorr2D(x):
    # Over axes (-2,-1) (default in the fft2 function)
    ## Pad because of cyclic (circular?) behavior of the FFT
    x = np.fft.fft2(np.pad(x,([0,0],[0,0],[0,34],[0,34]),mode='constant'))
    # Conjugate for correlation, not convolution (Conv. Theorem)
    x[:,1,:,:] = np.conj(x[:,1,:,:])
    # Over axes (-2,-1) (default in the ifft2 function)
    ## Multiply elementwise over 2:nd axis (2 image bands for me)
    ### fftshift over rows and column over images
    corr = np.fft.fftshift(np.fft.ifft2(np.prod(x,axis=1)),axes=(-2,-1))
    # Return after removing padding
    return np.abs(corr)[:,3:-2,3:-2]
Call via:
ts=fft_xcorr2D(X)
If anybody wants to use it:
My input is a 4D array: (N, 2, #Rows, #Cols)
E.g. (500, 2, 30, 30): 500 images, 2 bands (polarizations, for example), of 30x30 pixels
If your input is different, adjust the padding to your liking; the final crop ([3:-2]) is tied to the 30x30 image size and the padding of 34, so adjust it as well.
Check that your axis order is the same as mine; otherwise change the axes arguments in the fft2 and ifft2 functions, np.prod, and fftshift. I use fftshift to get the maximum value in the middle (otherwise it ends up in the corners), so be wary of that if that's not what you want.
Why is it the maximum value? Technically, it doesn't have to be, but for my purpose it is. fftshift is used to get a correlation that looks like you're used to. Otherwise, the quadrants are turned "inside out". If you wonder what I mean, remove fftshift (just the fftshift part, not its arguments), call the function as before, and plot it.
Afterwards, it should be ready to use.
x.prod(axis=1) might be faster than np.prod(x,axis=1), as an old post suggests, but I saw no improvement when I tried it.
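For inputs of other sizes, the padding can be derived from the image shape instead of being hard-coded. The sketch below is a variant under stated assumptions, not the function above. It takes the two image stacks as separate `(n, rows, cols)` arrays, uses the real-input FFT routines, and returns the full correlation with zero lag at index `(rows - 1, cols - 1)`. The name `batched_xcorr2d` is made up for illustration. The output layout should correspond to `correlate2d(..., mode='full')`, but that has not been checked here.

```python
import numpy as np

def batched_xcorr2d(a, b):
    """Full 2D cross-correlation of a[i] with b[i] for every i, via the FFT."""
    rows, cols = a.shape[-2:]
    # Zero-pad to 2*m - 1 so the circular correlation does not wrap around
    shape = (2 * rows - 1, 2 * cols - 1)
    fa = np.fft.rfft2(a, s=shape, axes=(-2, -1))
    fb = np.fft.rfft2(b, s=shape, axes=(-2, -1))
    # Conjugating the second spectrum gives correlation instead of convolution
    corr = np.fft.irfft2(fa * np.conj(fb), s=shape, axes=(-2, -1))
    # Move zero lag from the corner to the centre of each image
    return np.fft.fftshift(corr, axes=(-2, -1))

rng = np.random.default_rng(0)
a = rng.random((3, 4, 5))   # 3 images of 4x5 pixels
b = rng.random((3, 4, 5))
corr = batched_xcorr2d(a, b)   # shape (3, 7, 9)
```

The zero-lag entry `corr[:, rows - 1, cols - 1]` equals the elementwise product of the two images summed over all pixels, which is a quick way to confirm the orientation.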
I have two datasets of a specific region: the first is the rainfall and the second a vegetation measure (npp) of that region. The first two dimensions (x,y) represent the geographical location, and the third dimension is time (8 time steps). For each location, I want to perform a linear regression of the 8 rainfall values against the 8 vegetation values. The result should be either several two-dimensional arrays holding, for each location, the p-value, the r², the slope and ideally the residuals, or all values together in a 3D array.
import glob
import numpy as np
from osgeo import gdal
from scipy import stats

nppList = glob.glob(nppPath+"*.img")
rainList = glob.glob(rainPath+"*.img")
nppImg = [gdal.Open(i) for i in nppList]
rainImg = [gdal.Open(i) for i in rainList]
nppFiles = [i.ReadAsArray() for i in nppImg]
rainFiles = [i.ReadAsArray() for i in rainImg]
# stack the time steps along the third axis
# (assumed: this step is missing from the snippet as posted)
nppStack = np.dstack(nppFiles)
rainStack = np.dstack(rainFiles)
# get nodata
nppNodata = nppImg[1].GetRasterBand(1).GetNoDataValue()
rainNodata = rainImg[1].GetRasterBand(1).GetNoDataValue()
# convert to float and set no data
nppStack = nppStack.astype(float)
nppStack[nppStack == nppNodata] = np.nan
rainStack = rainStack.astype(float)
rainStack[rainStack == rainNodata] = np.nan
# instead of range(0,8) there should be the rainfall variable, but on a pixel base
def linReg(a):
    return stats.linregress(a, range(0, 8))
lm = np.apply_along_axis(linReg, axis=2, arr=nppStack)
I know about numpy.apply_along_axis(), but it applies a function to only one array. I am looking for a way to apply a function to two arrays along an axis, preferably without looping through the arrays.
The source for scipy.stats.linregress indicates that only arrays with dimension greater than 2 are not supported (and only then for the case that your x and y data happen to be in the same data structure).
Honestly, in your case I would use a Python loop -- it is unlikely that the slowest part of the code is looping over the data points; rather, the regression itself will be determining the speed.
In that case, you could flatten your positional axes, use a single loop, and then reshape the regression results back to 3D. Something like:
nx, ny = rainStack.shape[:2]
n = nx * ny
frain = rainStack.reshape((n, 8))
fnpp = nppStack.reshape((n, 8))
reg_results = np.empty((n,5))
for i in range(n):
    reg_results[i] = stats.linregress(frain[i], fnpp[i])
reg_results = reg_results.reshape((nx,ny,5)) # back to 3D: slope, intercept, r, p, stderr per location
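If the loop does turn out to be the bottleneck, the closed-form least-squares formulas can be evaluated for all locations at once with plain NumPy. The sketch below is one possible approach, not what `scipy.stats.linregress` does internally. The function name `pixelwise_linregress` and the random stand-in data are assumptions. It has no special NaN handling, so any location with a NaN in either series comes out as NaN. It also leaves out the p-value, which `linregress` derives from r with a t-test on n - 2 degrees of freedom.

```python
import numpy as np

def pixelwise_linregress(x, y):
    """Fit y = intercept + slope * x along the last axis of two equal-shape arrays."""
    xm = x.mean(axis=-1, keepdims=True)
    ym = y.mean(axis=-1, keepdims=True)
    dx, dy = x - xm, y - ym
    sxx = (dx * dx).sum(axis=-1)          # sum of squares of x deviations
    syy = (dy * dy).sum(axis=-1)          # sum of squares of y deviations
    sxy = (dx * dy).sum(axis=-1)          # sum of cross products
    slope = sxy / sxx
    intercept = ym[..., 0] - slope * xm[..., 0]
    r2 = sxy ** 2 / (sxx * syy)
    residuals = y - (intercept[..., None] + slope[..., None] * x)
    return slope, intercept, r2, residuals

rng = np.random.default_rng(0)
rain = rng.random((4, 5, 8))                                   # stand-in for rainStack
npp = 2.0 * rain + 1.0 + 0.1 * rng.normal(size=rain.shape)     # stand-in for nppStack
slope, intercept, r2, residuals = pixelwise_linregress(rain, npp)
```

`slope`, `intercept` and `r2` have shape `(nx, ny)`, and `residuals` keeps the full `(nx, ny, 8)` shape.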
I wrote the following function to calculate a row by row correlation of a matrix with respect to a selected row (specified by the index parameter):
# returns a 1D array of correlation coefficients whose length matches
# the row count of the given np_arr_2d
def ma_correlate_vs_index(np_arr_2d, index):
    def corr_upper(x, y):
        # just take the upper right corner of the correlation matrix
        return numpy.ma.corrcoef(x, y)[0, 1]
    return numpy.ma.apply_along_axis(corr_upper, 1, np_arr_2d, np_arr_2d[index, :])
The problem is that the code is very, very slow and I'm not sure how to improve the performance. I believe that the use of apply_along_axis as well as the fact that corrcoef is creating a 2D array are both contributing to the poor performance. Is there any more direct way to calculate that may give better performance?
In case it matters I'm using the ma version of the functions to mask out some nan values that are found in the data. Also, the shape of np_arr_2d for my data is (623065, 72).
I think you are right that there is a lot of overhead in corrcoef. Essentially you just want the dot product of each row with the index row, normalized to a maximum of 1.0.
Something like this will work and will be much faster:
# Demean
demeaned = np_arr_2d - np_arr_2d.mean(axis=1)[:, None]
# Dot product of each row with index
res = np.ma.dot(demeaned, demeaned[index])
# Norm of each row
row_norms = np.ma.sqrt((demeaned ** 2).sum(axis=1))
# Normalize
res = res / row_norms / row_norms[index]
This runs much more quickly than your original code. I've used the masked array methods and so I think this will work with your data containing NaN.
There may be a tiny difference in the norms, controlled by ddof in corrcoef, in which case you can calculate the row_norms using np.ma.std and specify the ddof you want.
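The steps above can be wrapped in a function and checked against `np.corrcoef` on data without missing values. This is a minimal sketch: the function name `correlate_vs_index` and the random test data are made up for illustration. One caveat, as far as I understand `np.ma.corrcoef`: for a pair of rows it masks an entry in both rows if it is masked in either, while the code here uses each row's own mean and norm. The two can therefore disagree on rows whose masks differ, so the comparison below only covers the unmasked case.

```python
import numpy as np

def correlate_vs_index(np_arr_2d, index):
    """Correlation of every row of a (masked) 2D array with the row at `index`."""
    # Demean each row
    demeaned = np_arr_2d - np_arr_2d.mean(axis=1)[:, None]
    # Dot product of each row with the index row
    res = np.ma.dot(demeaned, demeaned[index])
    # Norm of each row
    row_norms = np.ma.sqrt((demeaned ** 2).sum(axis=1))
    # Normalize
    return res / row_norms / row_norms[index]

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 72))
masked = np.ma.masked_invalid(data)       # no NaN here, so nothing gets masked
fast = correlate_vs_index(masked, index=3)
reference = np.corrcoef(data)[3]          # row 3 of the full correlation matrix
```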