Average a Data Set while maintaining its variables? - python

I am currently trying to plot some data with cartopy, but I am having some issues.
I have a dataset with a shape of (180, 180, 360), corresponding to time, lat, and lon respectively.
I would like to get an annual mean of this data. I have been using this code:
def global_mean_3D(var, weights):
    # make sure masking is correct, otherwise we get nans
    var = np.ma.masked_invalid(var)
    # resulting variable should have dimensions of depth and time (x)
    ave = np.zeros([var.shape[0], var.shape[1]])
    # loop over time
    for t in np.arange(var.shape[0]):
        # loop over each depth slice
        for d in np.arange(var.shape[1]):
            ave[t,d] = np.ma.average(var[t,d,:], weights = weights)
    return ave
which I then use to plot
ax=plt.axes(projection=ccrs.Robinson())
ax.coastlines()
ax.contourf(x,y, ann_total_5tg)
But this code gives me a one-dimensional array over time, which I can't plot in cartopy using pcolormesh.
I am left with the error
TypeError: Input z must be a 2D array.
Would it be possible to get an annual mean whilst maintaining the variables within the dataset?

I suspect that you have to reshape your numpy array before passing it to the contourf method.
Using your variable name, it can be done like this:
ann_total_5tg = ann_total_5tg.reshape((180, 180))
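Alternatively, if the goal is a map, the average can be taken over the time axis only, so that the lat and lon axes survive. A minimal sketch, assuming the array is ordered (time, lat, lon) as described in the question. The random `var` is a stand-in for the real data, and grouping into 12 steps per year is an assumption that only holds for monthly data:

```python
import numpy as np

# Stand-in for the real data: (time, lat, lon) = (180, 180, 360)
rng = np.random.default_rng(0)
var = rng.random((180, 180, 360))
var[0, 0, 0] = np.nan  # pretend one value is missing

# Mask invalid values so they do not turn the mean into NaN
masked = np.ma.masked_invalid(var)

# Averaging over axis 0 (time) keeps the (lat, lon) axes
time_mean = masked.mean(axis=0)  # shape (180, 360)

# Assumption: monthly data, so 180 steps = 15 years of 12 months
n_years = masked.shape[0] // 12
annual = masked.reshape(n_years, 12, *masked.shape[1:]).mean(axis=1)  # (15, 180, 360)
```

`time_mean` (or a single year such as `annual[0]`) is 2D, so it can be passed to `ax.contourf(x, y, time_mean, transform=ccrs.PlateCarree())`. This is an unweighted mean over time; area weights only matter when a spatial axis is collapsed.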

Related

interpolate / downsample 2D array in Python

I have 2 separate arrays with different sizes:
len(range_data) = 4320
len(az1) = 385
len(az2) = 347
data1.shape = (385,4320)
data2.shape = (347,4320)
I would like for the dimensions of data2 to equal that of data1, such that data2.shape should be (385,4320). I have tried scipy interpolate such as:
f = interpolate.interp2d(az1,range_data,data1,kind='cubic')
znew = f(az2,range_data)
print(znew.shape)
(347,4320)
znew.shape should be (385,4320). Any ideas why this is happening and/or what might need to be done to fix this?
I don't think that interp2d actually generates more points for you; it defines an interpolation function over a grid. That means that what you've created is a way to interpolate points within the grid defined by your first set of data points. Calling f returns an interpolated grid with the same number of values as the x and y passed to it.
See the source code.
Returns
-------
z : 2-D array with shape (len(y), len(x))
The interpolated values.
If you want to add extra data points, I would suggest deriving a regression function (or whatever ML technique you want, NNs if you're so inclined) on the second data set and use that function to produce the extra 38 datapoints you need.
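Since both arrays share range_data and differ only along the azimuth axis, another option is to interpolate data2 from az2 onto az1 along that one axis. A minimal sketch using plain linear interpolation in NumPy, assuming az2 is sorted in increasing order and that holding the edge values outside its range is acceptable. `resample_rows` is a hypothetical helper name and the arrays are synthetic stand-ins:

```python
import numpy as np

def resample_rows(src_coord, data, dst_coord):
    """Linearly interpolate the rows of a 2D `data` from src_coord onto dst_coord."""
    src_coord = np.asarray(src_coord, dtype=float)
    dst_coord = np.asarray(dst_coord, dtype=float)
    # Index of the right-hand neighbour for each target coordinate
    hi = np.clip(np.searchsorted(src_coord, dst_coord), 1, len(src_coord) - 1)
    lo = hi - 1
    w = (dst_coord - src_coord[lo]) / (src_coord[hi] - src_coord[lo])
    w = np.clip(w, 0.0, 1.0)  # hold edge values instead of extrapolating
    return (1.0 - w)[:, None] * data[lo] + w[:, None] * data[hi]

# Synthetic stand-ins with the shapes from the question
az1 = np.linspace(0, 360, 385)
az2 = np.linspace(0, 360, 347)
data2 = np.sin(np.radians(az2))[:, None] * np.ones((1, 4320))

data2_on_az1 = resample_rows(az2, data2, az1)
print(data2_on_az1.shape)
```

Here the interpolation is built from the grid that data2 lives on (az2) and evaluated at the grid you want (az1), which is the reverse of the call in the question.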

Interpolating a function over a grid with different input sizes

I have a function f(u,v,w) which I would like to interpolate using a scipy function (with linear interpolation). This is easy enough.
When I run the interpolation step, I simply do the following (interpolating over a u,v,w grid):
u = np.linspace(-1,1,100)
v = np.linspace(-2,2,50)
w = np.linspace(3,8,30)
values_grid = np.zeros((len(u),len(v),len(w)))
count = 0
for i in range(len(u)):
    for j in range(len(v)):
        for k in range(len(w)):
            values_grid[i,j,k] = f(u[i],v[j],w[k])
from scipy.interpolate import RegularGridInterpolator
my_interpolating_function = RegularGridInterpolator((u, v, w), values_grid, method='linear',bounds_error=False,fill_value=-999)
This is fine for many cases. However, when I want to evaluate this interpolation function it seems like I am required to use inputs which have shape [(Number of input samples) x (Dimension of Samples)]. E.g:
func_input = np.vstack([u_samps,v_samps,w_samps]).T # E.g. shape is 500,3
output = my_interpolating_function(func_input) # Has output shape 500
This works fine. The issue is that I would like to evaluate this function over a grid where the samples have the following shape
shape(u_samps) = (500,)
shape(v_samps) = (100,100)
shape(w_samps) = (100,100)
Meaning I would like to evaluate
my_interpolating_function([u_samps, v_samps, w_samps])
and get out an array which has shape (500,100,100) (so the interpolation is evaluated for all 500 u_samps over the v_samps and w_samps grids). I can flatten the v_samps and w_samps array, but then I have to make several (hundreds) copies of u_samps to get the inputs into the correct format. So is there any way to have an interpolation function that can take the inputs above (u_samps, v_samps, w_samps with the specified shapes) and get out an array with shape (500,100,100) efficiently?
Any help greatly appreciated, I have been stuck on this problem and it's really holding up my progress! The end goal is to use this function in a statistical likelihood which needs to be sampled with MCMC, so speed is pretty important (and making hundreds of copies of massive arrays is very slow)
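One possible approach, sketched under assumptions: RegularGridInterpolator accepts an array of points with shape (..., ndim) and returns an array with the leading shape, so the samples can be broadcast to a common shape and stacked along a last axis. The function `f` below is a hypothetical linear stand-in, and the sample sizes are shrunk from (500, 100, 100) to keep the example small:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical stand-in for f(u, v, w); linear, so interpolation is exact
def f(u, v, w):
    return u + 2.0 * v + 3.0 * w

u = np.linspace(-1, 1, 100)
v = np.linspace(-2, 2, 50)
w = np.linspace(3, 8, 30)
U, V, W = np.meshgrid(u, v, w, indexing="ij")
interp = RegularGridInterpolator((u, v, w), f(U, V, W), method="linear",
                                 bounds_error=False, fill_value=-999)

# Samples shaped like those in the question, but smaller
u_samps = np.linspace(-1, 1, 5)                       # (5,)
v_samps, w_samps = np.meshgrid(np.linspace(-2, 2, 4),
                               np.linspace(3, 8, 4))  # (4, 4) each

# broadcast_arrays returns views, so u_samps is not copied at this step
ub, vb, wb = np.broadcast_arrays(u_samps[:, None, None], v_samps, w_samps)
points = np.stack([ub, vb, wb], axis=-1)              # (5, 4, 4, 3)
out = interp(points)                                  # (5, 4, 4)
```

The stacked `points` array is still materialised once, so this avoids the manual tiling but not the memory for the full set of evaluation points.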

Incoherent handling of coordinates in basemap addcyclic

I have used all mpl_toolkits.basemap functions successfully on several global GCM netcdf datasets. Until I met this grid, with longitudes starting at 0.9375 (instead of 0 as I have always seen) and ending at 359.062.
To prepare a plot, I need to:
make the plot continuous with:
# input_var is a 2D numpy array
var_cyclicDUMMY, lons_cyclicDUMMY = addcyclic(input_var, lons)
I thus obtain a 2D array var_cyclicDUMMY with an extra column (one extra longitude), and a 1D array lons_cyclicDUMMY with one extra element at the end, i.e. one extra longitude, but at 0.9375 instead of the 360 that is needed.
Indeed in the next step, where I
shift the grid, so longitudes go from -180 to 180 instead of 0 to 360, with:
var_cyclic, lons_cyclic = shiftgrid(180., var_cyclicDUMMY,
                                    lons_cyclicDUMMY, start=False)
I get a ValueError: lon0 outside of range of lonsin
Any suggestions how to get around this with basemap or another solution?
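One workaround that does not depend on addcyclic or shiftgrid is to do both steps by hand in NumPy: wrap the longitudes into -180..180, reorder the columns, then append the cyclic column 360 degrees after the first one. A minimal sketch, assuming a regular grid of 192 longitudes with 1.875 degree spacing (which matches the 0.9375 and 359.062 end points in the question) and a synthetic `input_var`:

```python
import numpy as np

# Assumed stand-in for the grid in the question: 0.9375 .. 359.0625
lons = 0.9375 + 1.875 * np.arange(192)
lats = np.linspace(-89, 89, 90)
input_var = np.cos(np.radians(lons))[None, :] * np.ones((lats.size, 1))

# Step 1: wrap longitudes into -180..180 and reorder the columns to match
lons_180 = ((lons + 180.0) % 360.0) - 180.0
order = np.argsort(lons_180)
lons_shifted = lons_180[order]
var_shifted = input_var[:, order]

# Step 2: append the cyclic column, placed 360 degrees after the first one
lons_cyclic = np.append(lons_shifted, lons_shifted[0] + 360.0)
var_cyclic = np.concatenate([var_shifted, var_shifted[:, :1]], axis=1)
```

With this grid the result runs from -179.0625 to 180.9375, stays strictly increasing, and its last column repeats the first, so a contour plot closes across the seam.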

Apply scipy.stats.kde gaussian_kde function to each grid cell

I am new to this so I apologize if I am missing something. I am trying to get a probability range of a dataset with three dimensions (time, lat, lon). For 1 "cell" (single lat/lon combination), I have done the following:
# create some data
mu, sigma = 0, 0.1
s = np.random.normal(mu, sigma,900)
# get 90th - 100th percentiles
t_90x_ref= np.percentile(s, 90,interpolation="nearest")
t_100x_ref=np.percentile(s,100,interpolation="nearest")
# apply gaussian_kde function
AbnomRef_pdf= gaussian_kde(s)
# get probability range
Prob_range_90_100_Ref=AbnomRef_pdf.integrate_box_1d(t_90x_ref, t_100x_ref)*100
I would now like to repeat this exact process for each grid cell (lat/lon combination) along the time axis (with 900 timesteps, like above).
lat= np.linspace(-38.28,34.76, 167)
lon = np.linspace(143.92,207.72, 146)
# 3dim data
Anomalies_ref = np.random.rand(900, 167,146)
# get percentiles for 3 dim data
t_90x_ref= np.percentile(Anomalies_ref, 90,interpolation="nearest", axis=0)
Here is where I get stuck with the gaussian_kde function (a for-loop did not work, and I was not able to flatten the gaussian_kde results either). I have seen the question "Using scipy.stats.gaussian_kde with 2 dimensional data" but can't really apply it to my problem. Ultimately, my goal is to get a Prob_range_90_100_Ref with shape (167,146).
Any help would be very much appreciated!
Thanks!
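Because gaussian_kde fits one estimator per data set, one straightforward approach is an explicit loop over the grid cells that stores each result in a (lat, lon) array. A minimal sketch, assuming a small synthetic grid in place of the (900, 167, 146) array and NumPy's default percentile interpolation rather than "nearest"; the lower-case variable names are illustrative only:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Small synthetic stand-in for Anomalies_ref: (time, lat, lon)
anomalies = rng.normal(0.0, 0.1, size=(900, 4, 3))

# Percentile thresholds per cell, computed along the time axis
t90 = np.percentile(anomalies, 90, axis=0)
t100 = np.percentile(anomalies, 100, axis=0)

# One KDE per cell; cells that cannot be processed stay NaN
prob_range = np.full(anomalies.shape[1:], np.nan)
for i, j in np.ndindex(*anomalies.shape[1:]):
    series = anomalies[:, i, j]
    if np.isnan(series).any():
        continue  # skip cells with missing data
    kde = gaussian_kde(series)
    prob_range[i, j] = kde.integrate_box_1d(t90[i, j], t100[i, j]) * 100

print(prob_range.shape)
```

On the full grid this runs about 24,000 KDE fits of 900 points each, which is slow but should be manageable; the result has the (lat, lon) shape asked for.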

Python: How to perform linear regression of two numpy 3D datasets along axis?

I have two datasets of a specific region: the first is the rainfall and the second a vegetation measure (npp) of that region. The first two dimensions (x,y) represent the geographical location. The third dimension is the time (8 time steps). What I want to do is to perform a linear regression for each location of the 8 values of rainfall versus the 8 values of the vegetation. The result should be either several two-dimensional arrays holding, for each location, the p-value, the r², the slope and ideally the residuals, or all values together in a 3D array.
nppList = glob.glob(nppPath+"*.img")
rainList = glob.glob(rainPath+"*.img")
nppImg = [gdal.Open(i) for i in nppList]
rainImg = [gdal.Open(i) for i in rainList]
nppFiles = [i.ReadAsArray() for i in nppImg]
rainFiles = [i.ReadAsArray() for i in rainImg]
# get nodata
nppNodata = nppImg[1].GetRasterBand(1).GetNoDataValue()
rainNodata = rainImg[1].GetRasterBand(1).GetNoDataValue()
# convert to float and set no data
nppStack = nppStack.astype(float)
nppStack[nppStack == nppNodata] = np.nan
rainStack = rainStack.astype(float)
rainStack[rainStack == rainNodata] = np.nan
# instead of range(0,8) there should be the rainfall variable, but on a pixel base
def linReg(a):
    return stats.linregress(a, range(0, 8))
lm = np.apply_along_axis(linReg, axis=2, arr=nppStack)
I know the function numpy.apply_along_axis(), but it can apply a function to only one array. I am looking for a way to apply a function to two arrays along an axis, preferably without looping through the arrays.
The source for scipy.stats.linregress indicates that arrays with more than two dimensions are not supported (and a two-dimensional array is accepted only when your x and y data happen to be in the same data structure).
Honestly, in your case I would use a Python loop -- it is unlikely that the slowest part of the code is looping over the data points; rather, the regression itself will be determining the speed.
In that case, you could flatten your positional axes, use a single loop, and then reshape the regression results back to 3D. Something like:
nx, ny = rainStack.shape[:2]
n = nx * ny
frain = rainStack.reshape((n, 8))
fnpp = nppStack.reshape((n, 8))
reg_results = np.empty((n,5))
for i in range(n):
    # slope, intercept, r-value, p-value, stderr
    reg_results[i] = stats.linregress(frain[i], fnpp[i])
reg_results = reg_results.reshape((nx,ny,5)) # back to 3D
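If the p-value is not needed, the slope, intercept, r² and residuals of a simple least-squares fit can also be computed for every pixel at once from sums along the time axis. A minimal sketch in plain NumPy, assuming (x, y, time) arrays with no NaN values; `rain` and `npp` are synthetic stand-ins for rainStack and nppStack:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins shaped (x, y, time)
rain = rng.random((4, 5, 8))
npp = 2.0 * rain + 1.0 + 0.01 * rng.standard_normal((4, 5, 8))

# Deviations from the per-pixel time mean
dx = rain - rain.mean(axis=2, keepdims=True)
dy = npp - npp.mean(axis=2, keepdims=True)

# Sums of squares and cross products along the time axis
sxx = (dx * dx).sum(axis=2)
syy = (dy * dy).sum(axis=2)
sxy = (dx * dy).sum(axis=2)

slope = sxy / sxx                                         # (4, 5)
intercept = npp.mean(axis=2) - slope * rain.mean(axis=2)  # (4, 5)
r_squared = sxy**2 / (sxx * syy)                          # (4, 5)
residuals = npp - (slope[..., None] * rain + intercept[..., None])  # (4, 5, 8)
```

Pixels containing NaN would need masking first (for example by skipping them or switching to the nan-aware sums), and a p-value would still require the t distribution from scipy.stats.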
