Optimize 4D Numpy array construction - python

I have a 4D array data of shape (50,8,2048,256), which holds 50 groups of 8 images, each 2048x256 pixels. times is an array of shape (50,8) giving the time at which each image was taken.
I calculate a 1st order polynomial fit at each pixel for all images in each group, giving me an array of shape (50,2048,256,2). This is essentially a vector plot for each of the 50 groups. The code I use to store the polynomials is:
fits = np.ones((50,2048,256,2))
times = times.reshape(50,8,1).repeat(2048,2).reshape(50,8,2048,1).repeat(256,3)
for group in range(50):
    for xpos in range(2048):
        for ypos in range(256):
            fits[group,xpos,ypos,:] = np.polyfit(times[group,:,xpos,ypos],data[group,:,xpos,ypos],1)
Now the challenge is that I want to generate an array new_data of shape (50,12,2048,256), where I use the polynomial coefficients from fits and the times from new_times to generate 50 groups of 12 images.
I figure I can use something like np.polyval(fits, new_times) to generate the images, but I'm very confused about how to phrase it. It should be something like:
new_data = np.ones((50,12,2048,256))
for i,(times,fit) in enumerate(zip(new_times,fits)):
    new_data[i] = np.polyval(fit,times)
But I'm getting broadcasting errors. Any assistance would be greatly appreciated!
Ok, so I changed the code a bit so that it does work and do exactly what I want, but it is terribly slow with all these loops (~1 minute per group meaning this would take me almost an hour to run!). Can anyone suggest a way to optimize this to speed it up?
# Generate the polynomials for each pixel in each group
fits = np.ones((50,2048,256,2))
times = np.arange(0,50*8*grptme,grptme).reshape(50,8)
times = times.reshape(50,8,1).repeat(2048,2).reshape(50,8,2048,1).repeat(256,3)
for group in range(50):
    for xpos in range(2048):
        for ypos in range(256):
            fits[group,xpos,ypos] = np.polyfit(times[group,:,xpos,ypos],data[group,:,xpos,ypos],1)
# Create new array of 12 images per group using the polynomials for each pixel
new_data = np.ones((50,12,2048,256))
times = np.arange(0,50*12*grptme,grptme).reshape(50,12)
times = times.reshape(50,12,1).repeat(2048,2).reshape(50,12,2048,1).repeat(256,3)
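for group in range(50):
    for img in range(12):
        for xpos in range(2048):
            for ypos in range(256):
                # np.polyfit stores the highest degree first; polynomial.polyval expects the lowest first, hence [::-1]
                new_data[group,img,xpos,ypos] = np.polynomial.polynomial.polyval(times[group,img,xpos,ypos],fits[group,xpos,ypos][::-1])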

Regarding speed: the code has many explicit loops, which NumPy usually lets you avoid. If I understand your problem correctly, you want to fit a first-order polynomial to 8 data points, once for each of the 50 groups and each of the 2048 * 256 pixels. The shape of the image plays no role in the fit, so my suggestion is to flatten the images, because np.polyfit can fit several sets of y-values against the same x-values in a single call.
From the doc string
x : array_like, shape (M,)
x-coordinates of the M sample points ``(x[i], y[i])``.
y : array_like, shape (M,) or (M, K)
y-coordinates of the sample points. Several data sets of sample
points sharing the same x-coordinates can be fitted at once by
passing in a 2D-array that contains one dataset per column.
So I would go for
# Generate the polynomials for each pixel in each group
fits = np.ones((50,2048*256,2))
times = np.arange(0,50*8*grptme,grptme).reshape(50,8)
data_fit = data.reshape((50,8,2048*256))
for group in range(50):
    fits[group] = np.polyfit(times[group],data_fit[group],1).T
fits_original_shape = fits.reshape((50,2048,256,2))
The transpose is necessary because you want the coefficients in the last index, whereas np.polyfit returns them along the first axis, followed by the different data sets.
And then to evaluate it it is basically the same trick again:
# Create new array of 12 images per group using the polynomials for each pixel
new_data = np.zeros((50,12,2048*256))
times = np.arange(0,50*12*grptme,grptme).reshape(50,12)
#times = times.reshape(50,12,1).repeat(2048,2).reshape(50,12,2048,1).repeat(256,3)
for group in range(50):
    # np.polyfit returns the highest degree first, while polynomial.polyval expects the lowest first, hence [::-1]
    new_data[group] = np.polynomial.polynomial.polyval(times[group],fits[group].T[::-1]).T
new_data_original_shape = new_data.reshape((50,12,2048,256))
The two transposes are again needed because of the ordering of the parameters versus the different data sets, so that the result matches the shapes of your arrays.
The loop over the groups could probably be removed as well with some more advanced NumPy, but the code already runs much faster this way.
I hope it helps!
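For what it's worth, here is a rough sketch of how the remaining loop over the groups might be dropped. It does not use np.polyfit at all: it swaps in the closed-form least-squares solution for a straight line (slope = sum(dt*y) / sum(dt**2), with dt the centered times), so treat it as an alternative under that assumption rather than the method above. The function name fit_and_extrapolate, the small array sizes, and the grptme value are made up for the illustration.

```python
import numpy as np

def fit_and_extrapolate(data, times, new_times):
    """Straight-line fit per pixel and group, evaluated at new_times.

    data:      (G, N, H, W) images
    times:     (G, N) acquisition times
    new_times: (G, M) times at which to synthesize images
    returns:   (G, M, H, W)
    """
    t_mean = times.mean(axis=1, keepdims=True)        # (G, 1)
    dt = times - t_mean                               # (G, N), sums to zero per group
    y_mean = data.mean(axis=1)                        # (G, H, W)
    # Closed-form slope; the y_mean term drops out because sum(dt) == 0.
    slope = np.einsum('gn,gnhw->ghw', dt, data) / (dt ** 2).sum(axis=1)[:, None, None]
    intercept = y_mean - slope * t_mean[:, :, None]   # (G, H, W)
    # Broadcast (G, 1, H, W) against (G, M, 1, 1) to get (G, M, H, W).
    return slope[:, None] * new_times[:, :, None, None] + intercept[:, None]

# Small stand-in sizes instead of (50, 8, 2048, 256); grptme is a placeholder value.
rng = np.random.default_rng(0)
G, N, M, H, W = 3, 8, 12, 4, 5
grptme = 2.5
times = np.arange(0, G * N * grptme, grptme).reshape(G, N)
new_times = np.arange(0, G * M * grptme, grptme).reshape(G, M)
data = rng.normal(size=(G, N, H, W))

new_data = fit_and_extrapolate(data, times, new_times)
print(new_data.shape)  # (3, 12, 4, 5)
```

Note that on the full-size array the einsum still has to touch every pixel, so memory bandwidth rather than Python overhead becomes the limit.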


Interpolating a function over a grid with different input sizes

I have a function f(u,v,w) which I would like to interpolate using a scipy function (with linear interpolation). This is easy enough.
When I run the interpolation step, I simply do the following (interpolating over a u,v,w grid):
u = np.linspace(-1,1,100)
v = np.linspace(-2,2,50)
w = np.linspace(3,8,30)
values_grid = np.zeros((len(u),len(v),len(w)))
for i in range(len(u)):
    for j in range(len(v)):
        for k in range(len(w)):
            values_grid[i,j,k] = f(u[i],v[j],w[k])
from scipy.interpolate import RegularGridInterpolator
my_interpolating_function = RegularGridInterpolator((u, v, w), values_grid, method='linear',bounds_error=False,fill_value=-999)
This is fine for many cases. However, when I want to evaluate this interpolation function it seems like I am required to use inputs which have shape [(Number of input samples) x (Dimension of Samples)]. E.g:
func_input = np.vstack([u_samps,v_samps,w_samps]).T # E.g. shape is (500,3)
output = my_interpolating_function(func_input) # Has output shape (500,)
This works fine. The issue is that I would like to evaluate this function over a grid where the samples have the following shape
shape(u_samps) = 500
shape(v_samps) = (100,100)
shape(w_samps) = (100,100)
Meaning I would like to evaluate
my_interpolating_function([u_samps, v_samps, w_samps])
and get out an array which has shape (500,100,100) (so the interpolation is evaluated for all 500 u_samps over the v_samps and w_samps grids). I can flatten the v_samps and w_samps arrays, but then I have to make hundreds of copies of u_samps to get the inputs into the correct format. So is there any way to have an interpolation function that can take the inputs above (u_samps, v_samps, w_samps with the specified shapes) and return an array with shape (500,100,100) efficiently?
Any help greatly appreciated, I have been stuck on this problem and it's really holding up my progress! The end goal is to use this function in a statistical likelihood which needs to be sampled with MCMC, so speed is pretty important (and making hundreds of copies of massive arrays is very slow)
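One possible direction, sketched here with NumPy only: RegularGridInterpolator documents its evaluation points as an array of shape (..., ndim), so the three sample arrays can be broadcast against each other and stacked along a new last axis instead of being flattened and tiled by hand. The sizes below are small stand-ins for (500,), (100,100), (100,100), and the final call is left as a comment because it assumes the my_interpolating_function object from the question. The stacked array is still allocated in full, so this removes the manual copying rather than the memory cost.

```python
import numpy as np

# Small stand-ins for the (500,), (100, 100) and (100, 100) samples in the question.
u_samps = np.linspace(-1, 1, 5)                              # shape (5,)
v_samps, w_samps = np.meshgrid(np.linspace(-2, 2, 4),
                               np.linspace(3, 8, 4),
                               indexing="ij")                # each shape (4, 4)

# Broadcast all three to the common shape (5, 4, 4); these are views, not copies.
U, V, W = np.broadcast_arrays(u_samps[:, None, None],
                              v_samps[None, :, :],
                              w_samps[None, :, :])

# Stack along a new last axis: one (u, v, w) triple per output element.
points = np.stack([U, V, W], axis=-1)                        # shape (5, 4, 4, 3)

# Assumed usage with the interpolator from the question:
# output = my_interpolating_function(points)                 # shape (5, 4, 4)
```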

Manually find the distance between centroid and labelled data points

I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
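for i in range(len(X)):
    # distance from point i to its assigned centroid
    distances.append(np.linalg.norm(X[i] - c[y[i, 0]]))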
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to NumPy's array indexing, you can turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X- np.einsum('ijk->ik', c[y]), axis=1)
This does the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword. Without it, NumPy computes the norm of the entire flattened matrix rather than along the rows (axis 1). I've updated the answer, and it now returns an array with one distance per point. The einsum was also suggested by @Kris for their specific application.
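As a quick sanity check, the sketch below compares an explicit loop with the einsum one-liner and with a plain fancy-indexing variant, c[y.ravel()], on random data. It assumes y has shape (n, 1) as in the question; the random generator and the variable names loop, vec and einsum_vec are only there for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 3))                        # 10 points in 3D
clusters = 3
y = rng.integers(0, clusters, size=(10, 1))    # one label per point, shape (10, 1)
c = rng.random((clusters, 3))                  # centroid positions

# Explicit loop: distance from each point to its own centroid.
loop = np.array([np.linalg.norm(X[i] - c[y[i, 0]]) for i in range(len(X))])

# c[y] has shape (10, 1, 3); the einsum sums away the length-1 middle axis.
einsum_vec = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)

# Flattening the labels first gives (10, 3) directly.
vec = np.linalg.norm(X - c[y.ravel()], axis=1)
```

All three arrays should agree to floating-point precision.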

Generating large simulations and inserting the same array multiple times into another array at different locations

I am working on a Monte Carlo simulation for oil wells. The end goal is a smoothed probabilistic production curve for every well. I have optimized what I can, but each of the 3 apply statements I list below takes hours when I use my full dataset and the number of simulations I want. The code I included runs 10 iterations. If you crank it up to 10,000, which is the goal, it really starts to drag.
I have generated a pandas DataFrame that holds all the future wells I want to model, each with the probability of that well being chosen next to be drilled.
I then created a second DataFrame in which I grouped everything into the categories I want to use to figure out the order in which the model will choose the wells. So my "timing" DataFrame contains my categories, an array of the indexes of all wells in each category, and an array of those wells' probabilities.
This all is done in a few seconds. The next part works, but gets very slow.
Next I use a NumPy Generator's choice with probabilities to randomly generate the order of the wells for i simulations. As other posts have noted, @njit does not work with the probability array. In the result, one dimension of the array is the order in which the wells will be chosen within each category, and the other dimension is the simulation. There are about 150 categories, and tens of thousands of wells in each category. I am hoping to run 10,000 simulations.
a is an array of indexes of wells that can be chosen
size is the length of that array
per is the probability that each well will get chosen
Next I link my timing DataFrame to my DataFrame with all of the wells in it. This attaches the previous array to the wells array. Then I search this array for the well index to figure out, for each simulation, when that specific well is going to get run. This generates a 1D array giving the position at which that well is going to be drilled in each simulation.
This function gets called on 100,000s of wells and as I increase the number of simulations it really slows down.
order is an array of the order each well is drilled per simulation
index is the index of that well
The final difficulty I am having is averaging out the production curve for the wells. I know how much oil each well will produce per month. I need to insert that curve into the array at the point where the well is drilled in each simulation, then average all of those values together to get the average production of the well across all the simulations.
I have also tried creating an np.zeros array and then using np.insert, but I could not figure out how to insert an array multiple times without a loop, and generating the initial array of 0's took longer than my current method. (I overcame inserting the array multiple times by converting everything to a string, inserting the type curve as a string, then converting back to an array of numbers, but this did not seem efficient.) I need to have the number of leading 0's
order is the time in months that each well will get drilled
curve is the production curve passed as a list
m is the highest value of the months that the well is drilled in all simulations
import numpy as np
from numba import njit
import datetime
import math
def TimingGenerator(a, size, p):
    i = 10
    g = np.random.Generator(np.random.PCG64())
    order = np.concatenate([g.choice(a=a, size=size, replace=False, p=p) for z in range(i)]).reshape(i, size)
    return order
def OrderGenerator(order, index):
    result = np.where(order == index)[1]
    return result
def CurveAverager(order, curve, m):
    matrix = np.array([[0] * math.ceil(i) + curve + [0] * int((m - math.ceil(i))) for i in order])
    result = np.mean(matrix, axis=0)
    return result
begin_time = datetime.datetime.now()
size = 8000
g = np.random.Generator(np.random.PCG64())
a = g.choice(20_000, size=size, replace=False)
p = np.random.randint(1,100, size=size)
p = p/np.sum(p)
for i in range(150):
    q = TimingGenerator(a,size,p)
print(datetime.datetime.now() - begin_time)
index = np.amin(q)
for i in range(100000):
    order = OrderGenerator(q, index)
print(datetime.datetime.now() - begin_time)
order = order / 15
curve = list(range(600, 0, -1))
for i in range(20000):
    avgcurve = CurveAverager(order, curve, size)
print(datetime.datetime.now() - begin_time)
Thanks for any help you can offer. I am willing to greatly alter my code if you can think of anything to help speed it up. Not sure if there is a better way to apply probabilities and smooth out the production curve which is really the end goal.
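For the averaging step specifically, one idea that may be worth trying (a sketch, not a drop-in replacement): the mean of many shifted copies of the same curve equals the convolution of the curve with a histogram of the shift offsets, divided by the number of simulations. That avoids building the large zero-padded matrix. The function names curve_averager_loop and curve_averager_conv and the toy inputs are hypothetical, and the sketch assumes every rounded-up offset lies between 0 and m.

```python
import math
import numpy as np

def curve_averager_loop(order, curve, m):
    # Reference: same zero-padded matrix construction as CurveAverager in the question.
    matrix = np.array([[0] * math.ceil(i) + curve + [0] * int(m - math.ceil(i)) for i in order])
    return np.mean(matrix, axis=0)

def curve_averager_conv(order, curve, m):
    # Month (rounded up) in which the well starts producing in each simulation.
    offsets = np.ceil(order).astype(int)
    # counts[k] = number of simulations in which the curve starts at month k.
    counts = np.bincount(offsets, minlength=m + 1)
    # Summing shifted copies of the curve is a convolution of counts with the curve.
    return np.convolve(counts, curve) / len(order)

# Toy inputs: 4 simulations, a 3-month curve, latest possible start month m = 4.
order = np.array([0.0, 1.2, 3.0, 3.0])
curve = [3, 2, 1]
m = 4

avg_loop = curve_averager_loop(order, curve, m)
avg_conv = curve_averager_conv(order, curve, m)
```

Both functions return an array of length m + len(curve), so the result can be compared directly against the existing CurveAverager output before switching over.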

Numpy- How to create ROI iteratively in OpenCV python?

I am trying to split an image in a grid of smaller images so that I can process each small image separately. For that I realized that I'll have to define each small image as an ROI and I can use it easily from there.
Now, my grid size is not fixed. I.e, if user inputs 5, I have to make a grid of 5x5.
Iterating over the image pixel by pixel would be slow, so I decided to use Numpy to create ROI by using this construct :
#Assuming user entered grid size =5
This would be my first slice. h and w are height and width of the image respectively. For the next slice I'd have to do:
While my last roi will be:
roi25=img[4*roiheight+1:5*roiheight, 4*roiwidth+1:5*roiwidth]
But I need to do it iteratively, and cannot figure out the correct way to do that. I don't want to iterate over the image pixel by pixel and need it to be dynamic
EDIT: I am iterating like this now:
import cv2
import numpy
for i in range(0, 5):
    for j in range(0, 5):
But I don't know whether this is the most efficient way of doing this.
If you want to split your image using Numpy functions, take a look at numpy.array_split.
In your case you would write something like this:
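z = {}
count = 0
split1 = np.array_split(img, rh)  # array_split treats rh as the number of sections along the height
for sub in split1:
    split2 = np.array_split(sub, rw, 1)  # rw sections along the width (axis 1)
    for sub2 in split2:
        z[count] = sub2
        count += 1  # without this, every patch would overwrite z[0]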
For efficiency purposes, listed below is a vectorized approach using reshaping and permuting dimensions.
1) Let's define the input parameters and setup inputs :
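M = 5 # Number of patches along height and width
rh, rw = img.shape[0]//M, img.shape[1]//M # Patch height and width (assumed definition, not shown in the original)
img_slice = img[:rh*M,:rw*M] # Slice out valid image data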
2) The main processing part comes here. Reshape the sliced image so that each of its first two axes is split in two: one axis of length M (the patch index) and one of length rh or rw (the position inside the patch). We want the two window axes (rh, rw) adjacent to each other, which also puts the two axes of length M next to each other, so we permute dimensions with np.transpose. After permuting, we reshape to merge the two axes of length M into a single axis of length M^2, each element of which represents one window of the image.
So, finally we would have :
z = img_slice.reshape(M,rh,M,rw,-1).transpose(0,2,1,3,4).reshape(M**2,rh,rw,-1)
This gives us a NumPy array with M^2 elements along the first axis. Each slice along that axis would correspond to each window/patch. So, z[0] would be the top left corner patch and so on.
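To make the axis bookkeeping concrete, here is a small self-contained check of the reshape/transpose approach. The 23 x 17 x 3 test image is made up, and rh and rw are taken to be the patch height and width obtained by integer division, which is an assumption about the intended definitions.

```python
import numpy as np

M = 5                                               # patches along height and width
img = np.arange(23 * 17 * 3).reshape(23, 17, 3)     # made-up "image" with 3 channels
rh, rw = img.shape[0] // M, img.shape[1] // M       # patch height 4, patch width 3
img_slice = img[:rh * M, :rw * M]                   # drop rows/cols that do not fit

# (M*rh, M*rw, C) -> (M, rh, M, rw, C) -> (M, M, rh, rw, C) -> (M*M, rh, rw, C)
z = (img_slice.reshape(M, rh, M, rw, -1)
              .transpose(0, 2, 1, 3, 4)
              .reshape(M ** 2, rh, rw, -1))

# Patch k = i * M + j is the window at block row i, block column j.
i, j = 1, 2
patch = z[i * M + j]
```

Here patch should equal img[i*rh:(i+1)*rh, j*rw:(j+1)*rw], and z[0] the top-left window.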

Python: How to perform linear regression of two numpy 3D datasets along axis?

I have two datasets of a specific region: the first is the rainfall and the second a vegetation measure (npp) of that region. The first two dimensions (x,y) represent the geographical location. The third dimension is time (8 time steps). What I want to do is perform a linear regression at each location of the 8 rainfall values versus the 8 vegetation values. The result should be either several two-dimensional arrays holding, for each location, the p-value, the r², the slope and ideally the residuals, or all values together in a 3D array.
nppList = glob.glob(nppPath+"*.img")
rainList = glob.glob(rainPath+"*.img")
nppImg = [gdal.Open(i) for i in nppList]
rainImg = [gdal.Open(i) for i in rainList]
nppFiles = [i.ReadAsArray() for i in nppImg]
rainFiles = [i.ReadAsArray() for i in rainImg]
# get nodata
nppNodata = nppImg[1].GetRasterBand(1).GetNoDataValue()
rainNodata = rainImg[1].GetRasterBand(1).GetNoDataValue()
# convert to float and set no data
nppStack = nppStack.astype(float)
nppStack[nppStack == nppNodata] = np.nan
rainStack = rainStack.astype(float)
rainStack[rainStack == rainNodata] = np.nan
# instead of range(0,8) there should be the rainfall variable, but on a pixel base
def linReg(a):
    return stats.linregress(a, range(0, 8))
lm = np.apply_along_axis(linReg, axis=2, arr=nppStack)
I know the function numpy.apply_along_axis(), but it can only apply a function to one array. I am looking for a way to apply a function to two arrays along an axis, preferably without looping through the arrays.
The source of scipy.stats.linregress indicates that arrays with more than two dimensions are not supported (and two-dimensional input is only handled for the case where your x and y data happen to be in the same data structure).
Honestly, in your case I would use a Python loop -- it is unlikely that the slowest part of the code is looping over the data points; rather, the regression itself will be determining the speed.
In that case, you could flatten your positional axes, use a single loop, and then reshape the regression results back to 3D. Something like:
nx, ny = rainStack.shape[:2]
n = nx * ny
frain = rainStack.reshape((n, 8))
fnpp = nppStack.reshape((n, 8))
reg_results = np.empty((n,5))
for i in range(n):
    reg_results[i] = stats.linregress(frain[i], fnpp[i])
reg_results = reg_results.reshape((nx,ny,5)) # back to 3D: slope, intercept, r, p, stderr
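If the loop does turn out to be the bottleneck, the slope, intercept and correlation coefficient can also be computed for all locations at once from the closed-form simple-regression formulas. This is a different technique from calling stats.linregress, sketched with NumPy only. It does not produce p-values or standard errors, it does not skip NaNs, and the function name linregress_along_last_axis and the random test arrays are made up for the illustration.

```python
import numpy as np

def linregress_along_last_axis(x, y):
    """Regress y on x along the last axis; x and y have the same shape."""
    x_mean = x.mean(axis=-1, keepdims=True)
    y_mean = y.mean(axis=-1, keepdims=True)
    xm, ym = x - x_mean, y - y_mean
    sxx = (xm * xm).sum(axis=-1)          # sum of squares of x about its mean
    syy = (ym * ym).sum(axis=-1)          # sum of squares of y about its mean
    sxy = (xm * ym).sum(axis=-1)          # cross term
    slope = sxy / sxx
    intercept = y_mean[..., 0] - slope * x_mean[..., 0]
    r = sxy / np.sqrt(sxx * syy)          # Pearson correlation coefficient
    return slope, intercept, r

# Made-up stand-ins for rainStack and nppStack: 3 x 4 locations, 8 time steps.
rng = np.random.default_rng(0)
rain = rng.random((3, 4, 8))
npp = 2.0 * rain + rng.normal(scale=0.1, size=(3, 4, 8))

slope, intercept, r = linregress_along_last_axis(rain, npp)
```

Each returned array has the shape of the spatial grid, here (3, 4); r squared gives the r² asked for in the question.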
