Apply scipy.stats.kde gaussian_kde function to each grid cell - python

I am new to this so I apologize if I am missing something. I am trying to get a probability range of a dataset with three dimensions (time, lat, lon). For 1 "cell" (single lat/lon combination), I have done the following:
# imports needed for the snippet below
import numpy as np
from scipy.stats import gaussian_kde
# create some data
mu, sigma = 0, 0.1
s = np.random.normal(mu, sigma, 900)
# get 90th - 100th percentiles
t_90x_ref = np.percentile(s, 90, interpolation="nearest")
t_100x_ref = np.percentile(s, 100, interpolation="nearest")
# apply gaussian_kde function
AbnomRef_pdf = gaussian_kde(s)
# get probability range
Prob_range_90_100_Ref = AbnomRef_pdf.integrate_box_1d(t_90x_ref, t_100x_ref) * 100
I would now like to repeat this exact process for each grid cell (lat/lon combination) along the time axis (with 900 timesteps, like above).
lat = np.linspace(-38.28, 34.76, 167)
lon = np.linspace(143.92, 207.72, 146)
# 3-dim data
Anomalies_ref = np.random.rand(900, 167, 146)
# get percentiles for 3-dim data
t_90x_ref = np.percentile(Anomalies_ref, 90, interpolation="nearest", axis=0)
Here is where I get stuck with the gaussian_kde function (neither a for-loop worked, nor was I able to flatten the gaussian_kde results). I have seen this case, Using scipy.stats.gaussian_kde with 2 dimensional data, but can't really apply it to my problem. Ultimately, my goal is to get a Prob_range_90_100_Ref with shape (167, 146).
Any help would be very much appreciated!
Thanks!
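A straightforward, if slow, way to do this is a plain double loop that fits a separate gaussian_kde to each cell's 900-value time series. A minimal sketch along those lines (variable names follow the question; with 167 x 146 cells this will take a while):

import numpy as np
from scipy.stats import gaussian_kde

Anomalies_ref = np.random.rand(900, 167, 146)

# percentile maps, shape (167, 146)
t_90x_ref = np.percentile(Anomalies_ref, 90, interpolation="nearest", axis=0)
t_100x_ref = np.percentile(Anomalies_ref, 100, interpolation="nearest", axis=0)

Prob_range_90_100_Ref = np.empty((167, 146))
for i in range(167):
    for j in range(146):
        cell_series = Anomalies_ref[:, i, j]      # 900 timesteps for this cell
        cell_pdf = gaussian_kde(cell_series)      # KDE fitted to this cell only
        Prob_range_90_100_Ref[i, j] = cell_pdf.integrate_box_1d(
            t_90x_ref[i, j], t_100x_ref[i, j]) * 100

print(Prob_range_90_100_Ref.shape)  # (167, 146)

If the double loop turns out to be too slow, the same per-cell logic could be wrapped with np.apply_along_axis over axis 0 or parallelised, but the KDE fit itself has to be done cell by cell.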

Related

interpolate / downsample 2D array in Python

I have 2 separate arrays with different sizes:
len(range_data) = 4320
len(az1) = 385
len(az2) = 347
data1.shape = (385,4320)
data2.shape = (347,4320)
I would like the dimensions of data2 to equal those of data1, such that data2.shape becomes (385, 4320). I have tried scipy.interpolate, such as:
from scipy import interpolate
f = interpolate.interp2d(az1, range_data, data1, kind='cubic')
znew = f(az2, range_data)
print(znew.shape)
(347,4320)
znew.shape should be (385, 4320); any ideas why this is happening and/or what might need to be done to fix it?
I don't think interp2d actually generates more points for you; it defines an interpolation function over a grid. That means what you've created is a way to interpolate points within the grid defined by your first set of data points. znew will return an interpolated grid with the same number of values as the x and y passed to it.
See the source code.
Returns
-------
z : 2-D array with shape (len(y), len(x))
    The interpolated values.
If you want to add extra data points, I would suggest deriving a regression function (or whatever ML technique you want, NNs if you're so inclined) on the second data set and using that function to produce the extra 38 data points you need.
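Separately from the regression suggestion above: if the goal is simply to resample data2 onto the az1 grid, one option is to build the interpolator on data2's own (az2, range_data) grid and evaluate it at az1. A minimal sketch with synthetic arrays (interp2d is deprecated in recent SciPy, so this uses RegularGridInterpolator; it assumes az1 and az2 cover roughly the same azimuth range):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Synthetic stand-ins with the shapes from the question
range_data = np.linspace(0.0, 250.0, 4320)
az2 = np.linspace(0.0, 359.0, 347)           # grid that data2 lives on
az1 = np.linspace(0.0, 359.0, 385)           # grid we want data2 on
data2 = np.random.rand(347, 4320)            # shape (len(az2), len(range_data))

# Interpolator defined on data2's own grid; 'linear' is the default method,
# and method='cubic' is available in recent SciPy versions
f = RegularGridInterpolator((az2, range_data), data2,
                            bounds_error=False, fill_value=None)

# Evaluate on the target (az1, range_data) grid
aa, rr = np.meshgrid(az1, range_data, indexing='ij')
znew = f(np.stack([aa, rr], axis=-1))
print(znew.shape)   # (385, 4320)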

For loops to iterate through columns of a csv

I'm very new to python and programming in general (this is my first programming language; I started about a month ago).
I have a CSV file with data ordered like this (CSV file data at the bottom). There are 31 columns of data. The first column (wavelength) must be read in as the independent variable (x) and for the first iteration, it must read in the second column (i.e. the first column labelled as "observation") as the dependent variable (y). I am then trying to fit a Gaussian+line model to the data and extracting the value of the mean of the Gaussian (mu) from the data which should be stored in an array for further analysis. This process should be repeated for each set of observations, whilst the x values read in must stay the same (i.e. from the Wavelength column)
Here is the code for how I am currently reading in the data:
import numpy as np #importing necessary packages
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from scipy.optimize import curve_fit
e=np.exp
spectral_data=np.loadtxt(r'C:/Users/Sidharth/Documents/Computing Labs/Project 1/Halpha_spectral_data.csv', delimiter=',', skiprows=2) #importing data file
print(spectral_data)
x=spectral_data[:,0] #selecting column 0 to be x-axis data
y=spectral_data[:,1] #selecting column 1 to be y-axis data
So I need to automate that process so that instead of having to change y=spectral_data[:,1] to y=spectral_data[:,2] manually all the way up to y=spectral_data[:,30] for each iteration, it can simply be automated.
My code for producing the Gaussian fit is as follows:
plt.scatter(x,y) #produce scatter plot
plt.title('Observation 1')
plt.ylabel('Intensity (arbitrary units)')
plt.xlabel('Wavelength (m)')
plt.plot(x,y,'*')
plt.plot(x,c+m*x,'-') #plots the fit
print('The slope and intercept of the regression is,', m,c)
m_best=m
c_best=c
def fit_gauss(x, a, mu, sig, m, c):
    gaus = a*sp.exp(-(x-mu)**2/(2*sig**2))
    line = m*x + c
    return gaus + line
initial_guess=[160,7.1*10**-7,0.2*10**-7,m_best,c_best]
po,po_cov=sp.optimize.curve_fit(fit_gauss,x,y,initial_guess)
The Gaussian seems to fit fine (as shown in the image of the plot), so the mean value of this Gaussian (i.e. the x-coordinate of its peak) is the value I must extract from it. The value of the mean is given in the console (denoted by mu):
The slope and intercept of the regression is, -731442221.6844947 616.0099144830941
The signal parameters are
Gaussian amplitude = 19.7 +/- 0.8
mu = 7.1e-07 +/- 2.1e-10
Gaussian width (sigma) = -0.0 +/- 0.0
and the background estimate is
m = 132654859.04 +/- 6439349.49
c = 40 +/- 5
So my questions are, how can I iterate the process of reading in data from the csv so that I don't have to manually change the column y takes data from, and then how do I store the value of mu from each iteration of the read-in so that I can do further analysis/calculations with that mean later?
My thoughts are I should use a for-loop but I'm not sure how to do it.
The orange line shown in the plot is a result of some code I tried earlier. I think it's irrelevant, which is why it isn't in the main part of the question, but if necessary, this is all it is:
x=spectral_data[:,0] #selecting column 0 to be x-axis data
y=spectral_data[:,1] #selecting column 1 to be y-axis data
plt.scatter(x,y) #produce scatter plot
plt.title('Observation 1')
plt.ylabel('Intensity (arbitrary units)')
plt.xlabel('Wavelength (m)')
plt.plot(x,y,'*')
plt.plot(x,c+m*x,'-') #plots the fit
Usually when you encounter a problem like this, try to break it down into what has to stay unchanged (in your example, the x data and the analysis code), what does have to change (the y data, or more specifically the index that tells the rest of the code which column holds the y data), and how to keep the values you want further down the road.
Once you figure this out, we need to formalise the right loop and how to store the values we want. For the latter, an easy way is to use a list: we initialise an empty list, and at the end of each loop iteration we append the value to it.
mu_list = []  # will store our mu's in this list
for i in range(1, 31):  # i takes each value from 1 to 30 (and not 31)
    x = spectral_data[:, 0]
    y = spectral_data[:, i]
    # Your analysis and plot code here #
    mu = po[1]  # Not sure po[1] is the right place where your mu is, please change it appropriately...
    mu_list.append(mu)  # store mu at the end of our growing mu_list
And you will have a list of 30 mu's under mu_list.
Now, notice we don't have to do everything inside the loop; for example, x is the same regardless of i (loading x only once improves performance), and the analysis code is basically the same except for a different input (the y data), so we can define a function for it (good practice that makes bigger code much more readable). Most of it can therefore be taken out of the loop: we can write x = spectral_data[:, 0] before the loop, and define a function which analyzes the data and returns mu:
def analyze(x, y):
    # Your analysis and plot code here #
    mu = po[1]
    return mu

x = spectral_data[:, 0]
mu_list = []  # will store our mu's in this list
for i in range(1, 31):
    y = spectral_data[:, i]
    mu_list.append(analyze(x, y))  # calculate mu with our function and store it at the end of our growing mu_list
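To make the skeleton above concrete, here is one possible version of analyze based on the fit code from the question. It assumes spectral_data, m_best and c_best already exist as defined earlier in the question, and the initial guess is copied from there, so it may need tuning per observation:

import numpy as np
from scipy.optimize import curve_fit

def fit_gauss(x, a, mu, sig, m, c):
    gaus = a * np.exp(-(x - mu)**2 / (2 * sig**2))
    line = m * x + c
    return gaus + line

def analyze(x, y, m_best, c_best):
    # initial guess copied from the question; adjust if a fit fails to converge
    initial_guess = [160, 7.1e-7, 0.2e-7, m_best, c_best]
    po, po_cov = curve_fit(fit_gauss, x, y, p0=initial_guess)
    return po[1]          # po = [a, mu, sig, m, c], so po[1] is mu

x = spectral_data[:, 0]            # wavelength column, read once
mu_list = []
for i in range(1, 31):             # columns 1..30 are the observations
    y = spectral_data[:, i]
    mu_list.append(analyze(x, y, m_best, c_best))

mu_array = np.array(mu_list)       # 30 mu values, ready for further analysis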

Calculate Divergence of Velocity Field (3D) in Python

I am trying to calculate the divergence of a 3D velocity field in a multi-phase flow setting (with solids immersed in a fluid). If we assume u, v, w to be the three velocity components (each an n x n x n 3D numpy array), here is the function I have for calculating divergence:
def calc_divergence_velocity(df, h=0.025):
    """
    :param df: A dataframe with the entire vector field, with columns [x,y,z,u,v,w];
               x,y,z are the 3D coordinates of each point in the field and u,v,w
               the velocities in the x,y,z directions respectively.
    :param h: The dimension of a single side of the (uniform) 3D grid. Used
              as input to the numpy.gradient() function.
    """
    # Reshape dataframe columns to get 3D numpy arrays, so each of u, v, w
    # is an 80x80x80 ndarray.
    dim = 80
    u = df['u'].values.reshape((dim, dim, dim))
    v = df['v'].values.reshape((dim, dim, dim))
    w = df['w'].values.reshape((dim, dim, dim))

    # Note: only a scalar `h` is supplied to np.gradient because the grid is
    # uniform, with each grid cell having the same dimensions in the x, y, z directions.
    u_grad = np.gradient(u, h, axis=0)  # central diff. du_dx
    v_grad = np.gradient(v, h, axis=1)  # central diff. dv_dy
    w_grad = np.gradient(w, h, axis=2)  # central diff. dw_dz

    # The `mask` column in the dataframe is a binary column indicating the locations
    # in the field where we are interested in measuring divergence.
    # The problem is multi-phase flow with solid particles and a fluid,
    # hence we are only interested in the fluid locations.
    sdf = df['mask'].values.reshape((dim, dim, dim))
    div = (u_grad * sdf) + (v_grad * sdf) + (w_grad * sdf)
    return div
The problem I'm having is that the divergence values I am seeing are far too high.
For example, the image below shows a distribution with values between [-350, 350], whereas most values should technically be close to zero, somewhere between [-20, 20] in my case. This tells me I'm calculating the divergence incorrectly, and I would like some pointers as to how to correct the above function so it calculates the divergence appropriately. As far as I can tell (please correct me if I'm wrong), I think I have done something similar to this upvoted SO response. Thanks in advance!
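One quick sanity check (a suggestion, not from the thread): run the same gradient-based divergence on an analytic field whose divergence is known. For u = x, v = y, w = z the divergence is exactly 3 everywhere, so if the axis ordering and spacing are handled correctly the result below should be uniformly 3:

import numpy as np

h = 0.025
dim = 80
coords = np.arange(dim) * h
x, y, z = np.meshgrid(coords, coords, coords, indexing='ij')

# Analytic test field: u = x, v = y, w = z  =>  div(u, v, w) = 3 everywhere
u, v, w = x, y, z

div = (np.gradient(u, h, axis=0)
       + np.gradient(v, h, axis=1)
       + np.gradient(w, h, axis=2))

print(div.min(), div.max())   # both should be (numerically) 3.0

If the reshape from the dataframe does not actually put x on axis 0, y on axis 1 and z on axis 2 (for example because the dataframe is sorted with a different coordinate varying fastest), the gradients are taken along the wrong axes, which can easily inflate the divergence values; checking the sort order of the dataframe before reshaping is worth doing.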

Average a Data Set while maintaining its variables?

I am currently trying to plot some data into cartopy, but I am having some issues.
I have a dataset with a shape of (180, 180, 360): time, lat, and lon respectively.
I would like to get an annual mean of this data. I had been using the code
def global_mean_3D(var, weights):
    # make sure masking is correct, otherwise we get nans
    var = np.ma.masked_invalid(var)
    # resulting variable should have dimensions of time and lat
    ave = np.zeros([var.shape[0], var.shape[1]])
    # loop over time
    for t in np.arange(var.shape[0]):
        # loop over each lat slice
        for d in np.arange(var.shape[1]):
            ave[t, d] = np.ma.average(var[t, d, :], weights=weights)
    return ave
which I then use to plot
ax=plt.axes(projection=ccrs.Robinson())
ax.coastlines()
ax.contourf(x,y, ann_total_5tg)
But this code gives me a one-dimensional shape over time, which I can't plot in cartopy using pcolormesh.
I am left with the error
TypeError: Input z must be a 2D array.
Would it be possible to get an annual mean whilst maintaining the variables within the dataset?
I suspect that you have to reshape your numpy array to use it with the contour method.
Using your variable name, it can be done like this:
ann_total_5tg = ann_total_5tg.reshape((180, 180))
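If the goal is an annual mean at every grid point, another option (this goes beyond the reshape suggestion above and assumes the first axis of var is time) is to average over the time axis directly, which leaves a (lat, lon) field that contourf or pcolormesh can plot:

import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# var, x and y are assumed to exist as in the question:
# var has shape (time, lat, lon) = (180, 180, 360)
var = np.ma.masked_invalid(var)
ann_mean = var.mean(axis=0)        # shape (180, 360): one value per lat/lon cell

ax = plt.axes(projection=ccrs.Robinson())
ax.coastlines()
ax.contourf(x, y, ann_mean, transform=ccrs.PlateCarree())
plt.show()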

Distance between two group of values in a numpy array

I have a very basic question which in theory would be easy to solve (with fewer points and a lot of manual labour in ArcGIS), but I am not able to get started at all with the code for this problem (I am also new to complicated python coding).
I have 2 variables, 'Root zone' aka RZS and 'Tree cover' aka TC; both are arrays of 250x186 values (basically grids, with each grid cell holding a specific value). The values in TC vary from 0 to 100. Each grid cell is 0.25 degrees in size (this might be helpful in understanding the distances).
My problem is: "I want to calculate, for each TC value between 50-100 (i.e. each TC value greater than 50 at each lat and lon), the distance to the nearest point where TC is between 0-30 (less than 30)."
Just take into consideration that we are not looking at the np.nan part of the TC. So the white part in TC is also white in RZS.
What I want to do is create a 2-dimensional scatter plot with X-axis denoting the 'distance of 50-100 TC from 0-30 values', Y-axis denoting 'RZS of those 50-100 TC points'. The above figure might make things more clear.
I wish I could have provided some code for this, but I am not even able to start on the distance part.
Please provide any suggestion on how should I proceed with this.
Let's consider an example:
If you look at x: 70 and y: 70, you can see a lot of points with tree-cover values from 0-30 all across the dataset. But I only want the distance from my point to the nearest value that falls between 0-30.
The following code might work, with random example data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Create some completely random data, and include an area of NaNs as well
rzs = np.random.uniform(0, 100, size=(250, 168))
tc = np.random.lognormal(3.0, size=(250, 168))
tc = np.clip(tc, 0, 100)
rzs[60:80,:] = np.nan
tc[60:80,:] = np.nan
plt.subplot(2,2,1)
plt.imshow(rzs)
plt.colorbar()
plt.subplot(2,2,2)
plt.imshow(tc)
plt.colorbar()
Now do the real work:
# Select the indices of the low- and high-valued points
# This will result in warnings here because of NaNs;
# the NaNs should be filtered out in the indices, since they will
# compare to False in all the comparisons, and thus not be
# indexed by 'low' and 'high'
low = (tc >= 0) & (tc <= 30)
high = (tc >= 50) & (tc <= 100)
# Get the coordinates for the low- and high-valued points,
# combine and transpose them to be in the correct format
y, x = np.where(low)
low_coords = np.array([x, y]).T
y, x = np.where(high)
high_coords = np.array([x, y]).T
# We now calculate the distances between *all* low-valued points, and *all* high-valued points.
# This calculation (and the memory for its output) scales as O(N*M) for N low- and M high-valued points,
# so be wary when using it with large input sizes.
from scipy.spatial.distance import cdist, pdist
distances = cdist(low_coords, high_coords)
# Now find the minimum distance along the axis of the high-valued coords,
# which here is the second axis.
# Since we also want to find values corresponding to those minimum distances,
# we should use the `argmin` function instead of a normal `min` function.
indices = distances.argmin(axis=1)
mindistances = distances[np.arange(distances.shape[0]), indices]
# rzs[high] is ordered the same way as high_coords, so `indices` picks the RZS
# value at the matched (nearest) high-valued point for each low-valued point
minrzs = rzs[high][indices]
plt.scatter(mindistances, minrzs)
The resulting plot looks a bit weird, since the distances are rather discrete because of the grid (1, sqrt(1^2+1^2), 2, sqrt(1^2+2^2), sqrt(2^2+2^2), 3, sqrt(1^2+3^2), ...); this is because the TC values are randomly distributed, so low values may end up directly adjacent to high values (and because we're looking for minimum distances, most plotted points correspond to these cases). The vertical distribution is because the RZS values were uniformly distributed between 0 and 100.
This is simply a result of the input example data, which is not too representative of the real data.
