How to find specific values along different dimensions of 3D numpy array - python

I am currently working with a 3D array of global ozone data with dimensions (42, 361, 576), corresponding to (years, latitude, longitude). This array contains the ozone data on January 1st of every year of my dataset (which is 42 years long).
I am now trying to make a time series line/scatter plot of ozone at specific locations around the globe. However, I can't seem to figure out how to specify the latitude and longitude of a location within the array, and put the values from that location over the 42 years into an array which I can then use to make my plot.
Within the data, longitude ranges from -180 to 180, with steps of 0.625. Latitude ranges from -90 to 90, with steps of 0.5.
My end goal is to have a plot with time on the x-axis, and the actual ozone values for the y-axis. I have done a lot of research into finding solutions for this, and I have yet to find anything that applies to what I am trying to do.
Any help is appreciated, as I am still fairly new to Python and how to work with arrays.
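A minimal sketch of one way to do this, assuming the coordinate axes can be reconstructed from the grid spacing described above (the axes, the variable name ozone, and the plotted year range are assumptions):

import numpy as np
import matplotlib.pyplot as plt

# Reconstruct the coordinate axes from the stated grid spacing (assumed).
lats = np.arange(-90, 90 + 0.5, 0.5)     # 361 latitudes
lons = np.arange(-180, 180, 0.625)       # 576 longitudes

def ozone_series(ozone, target_lat, target_lon):
    """Return the 42-year ozone series at the grid point nearest (target_lat, target_lon)."""
    i_lat = np.abs(lats - target_lat).argmin()
    i_lon = np.abs(lons - target_lon).argmin()
    return ozone[:, i_lat, i_lon]         # shape (42,)

# Example usage, with ozone being the (42, 361, 576) array:
# series = ozone_series(ozone, 40.5, -74.0)
# plt.plot(np.arange(1980, 1980 + 42), series, marker='o')   # year range assumed for illustration
# plt.xlabel('Year'); plt.ylabel('Ozone')
# plt.show()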

Related

Produce bin averaged latitude and longitude grids

I have a set of data with each row representing a series of observations taken at a particular point. Each column represents a different observation, as well as information about where the data was collected (longitude and latitude in columns 3 and 4, respectively). The data needs to be gridded into 5-degree latitude by 5-degree longitude bin-averaged grids. How would I go about doing this in Python?
I don't know how to solve this; I hope someone can help. Thanks so much.
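One possible approach is scipy.stats.binned_statistic_2d; a sketch follows, in which the input file name, the column holding the observation to be averaged, and the exact column layout are assumptions:

import numpy as np
from scipy.stats import binned_statistic_2d

data = np.loadtxt('observations.txt')     # hypothetical input: one row per observation
lon, lat = data[:, 2], data[:, 3]         # columns 3 and 4 (0-based indices 2 and 3)
obs = data[:, 0]                          # assumed: the quantity to bin-average

lon_edges = np.arange(-180, 185, 5)       # 5-degree longitude bins
lat_edges = np.arange(-90, 95, 5)         # 5-degree latitude bins

mean_grid, _, _, _ = binned_statistic_2d(lon, lat, obs, statistic='mean',
                                         bins=[lon_edges, lat_edges])
# mean_grid[i, j] is the average of all observations that fall in
# longitude bin i and latitude bin j; empty bins come out as NaN.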

Efficiently find closest points to track in space & time on gridded data

Summary/simplified version
Given a list of track points defined by three 1-dimensional arrays (lats, lons and dtime, all with the same length) and a gridded 3-dimensional array rr (defined by 2-D lat_radar, lon_radar coordinate arrays and a 1-dimensional time array time_radar), I want to extract all the grid values in rr whose coordinates (latitude, longitude AND time) are closest to the three 1-dimensional arrays.
I've managed to use cKDTree to select points in space but I don't know how to generalize the solution to space & time together. Right now I have to do the selection on time separately and it makes the code quite bulky and hard to read.
For more details about this problem, see the extended version below.
Extended version
I'm trying to develop an app that uses precipitation data obtained from weather radar composites to predict the precipitation along a track. Most apps predict the precipitation at a point without considering that the point moves in time.
The idea is, given points identifying a track in space and time, find the closest grid points from radar data to obtain a precipitation estimate over the track (see plot). The final goal would be to shift the start time to identify the best time to leave to avoid rain.
I just optimized my previous algorithm, that was using plain loops, to use cKDTree from scipy. Execution time went down from 30s to 380ms :). However I think the code can still be optimized. Here is my attempt.
As input we have
lons, lats: coordinates of the track, as 1-dimensional arrays of length N
dtime: 1-dimensional array of length T containing the timedeltas of the time elapsed on the track
lon_radar, lat_radar: M x P matrices containing the coordinates of the radar data
dtime_radar: 1-dimensional array of length Q containing the timedeltas of the radar forecast
rr: M x P x Q array containing the radar forecast at every time step
First find the grid points closest to the trajectory using cKDTree:
import numpy as np
from scipy.spatial import cKDTree

combined_x_y_arrays = np.dstack([lon_radar.ravel(),
                                 lat_radar.ravel()])[0]
points_list = list(np.vstack([lons, lats]).T)

def do_kdtree(combined_x_y_arrays, points):
    mytree = cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

results = do_kdtree(combined_x_y_arrays, points_list)

# As we have many duplicates, since the itinerary has a much higher resolution
# than the radar, we only select the unique points
inds_itinerary = np.unique(results)
lon_lat_itinerary = combined_x_y_arrays[inds_itinerary]
Then we find the points on the track closest to these grid points, to subset the track: it doesn't make sense to have a track resolution of 10 m if the radar only has grid points every km.
combined_x_y_arrays = np.vstack([lons, lats]).T
points_list = list(lon_lat_itinerary)
results = do_kdtree(combined_x_y_arrays, points_list)
Now we can use these positions to get the elapsed time on the trajectory and the corresponding time steps in the radar data:
dtime_itinerary = dtime[results]
# find indices of these dtimes in radar dtime
inds_dtime_radar = np.abs(np.subtract.outer(dtime_radar, dtime_itinerary)).argmin(0)
Now we have everything we need to find the precipitation, so we only need one last loop. I also loop over shifts to obtain predictions with different start times.
shifts = (1, 3, 5, 7, 9)
rain = np.empty(shape=(len(shifts), len(inds_itinerary)))
for i, shift in enumerate(shifts):
    temp = []
    for i_time, i_space in zip(inds_dtime_radar, inds_itinerary):
        temp.append(rr[i_time + shift].ravel()[i_space])
    rain[i, :] = temp
In particular I would like to find a way to combine the time search with the lat-lon search for the closest points.
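One possible way to fold time into the same cKDTree query is to build the tree over (lon, lat, scaled time) triplets. The sketch below is not from the question: the time-to-degrees scale factor is an assumption that has to be tuned, dtime entries are assumed to be datetime.timedelta objects, and rr is assumed to be indexed time-first as in the loop above.

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical scale factor: how many degrees one second of time difference "costs".
time_scale = 0.01 / 60.0

t_radar = np.array([td.total_seconds() for td in dtime_radar])
t_track = np.array([td.total_seconds() for td in dtime])

# One row per (grid cell, forecast step): shape (M*P*Q, 3).
# The spatial index varies slowest, the time index fastest.
grid_pts = np.column_stack([
    np.repeat(lon_radar.ravel(), t_radar.size),
    np.repeat(lat_radar.ravel(), t_radar.size),
    np.tile(t_radar * time_scale, lon_radar.size),
])
track_pts = np.column_stack([lons, lats, t_track * time_scale])

tree = cKDTree(grid_pts)
_, idx = tree.query(track_pts)

# Map the flat indices back to (space, time) and read off the precipitation,
# assuming rr can be viewed with shape (Q, M*P), i.e. time-first as above.
i_space, i_time = np.divmod(idx, t_radar.size)
rain_along_track = rr.reshape(t_radar.size, -1)[i_time, i_space]

Building the tree over all M*P*Q combinations can get memory-hungry, so this mainly pays off when the grid and the number of forecast steps are modest.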

python interp2d strange checkers

I am interpolating an array (values_2d) of size 105 by 109 with scipy.interpolate.interp2d.
from scipy.interpolate import interp2d

f = interp2d(x_coords_2d, y_coords_2d, values_2d)
interpolated = f(x_finer_coords, y_finer_coords)
I am having an issue where interpolated comes out with proper values in most locations, but in certain areas there are checkerboards and stripes of huge positive and negative numbers, even though the data should be between 0 and ~300.
values_2d is a relatively continuous field with a few places with large jumps between adjacent coordinates. The coordinates come from a map projection of latitude and longitude, so they are not a regular grid but are distorted, the way a flat map is.
In the comparison picture, the plot on the right is values_2d and the plot on the left is interpolated; the code above is what was used to produce it.
Using griddata instead yielded great results with no issues.
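For reference, a minimal sketch of the griddata approach, reusing the (assumed) variable names from the snippet above; x_finer_coords and y_finer_coords are assumed to be 2-D meshgrid arrays of the target coordinates:

import numpy as np
from scipy.interpolate import griddata

# Treat the distorted-grid samples as scattered points.
points = np.column_stack([x_coords_2d.ravel(), y_coords_2d.ravel()])
interpolated = griddata(points, values_2d.ravel(),
                        (x_finer_coords, y_finer_coords),
                        method='linear')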

Interpolating to get rid of NANs and contour plot

I have these arrays that I need to interpolate and make the smoothest possible interpolation:
x = time
y = height
z = latitude
print(np.shape(x))
print(np.shape(y))
print(np.shape(z))
Result:
(99, 25)
(99, 25)
(99, 25)
y is altitude and it's not uniform; it has a bunch of NaNs, even though all the profiles have the same size (a variable n_alt holds the number of altitudes, which for this example is 99).
x is time and it's uniform all the way through (all the values in one column of that array are the same).
z is latitude; it's the actual 'z' values, an array with as many rows as there are time points and as many columns as there are altitude points.
I want to interpolate in 2D (the data set has series of NaNs in both the x and y directions) to fill the gaps in the data, since some files cover a certain altitude range and others do not.
My questions are:
1) Is there a good way to fill the gaps in the two directions while making the grid uniform (the idea is to plot the result and also save the interpolated data (x, y and z) to a new file)?
2) What's a good way to contour-plot data with the shape I mentioned earlier (I tried plt.contour, but plotting it straight up doesn't give a satisfactory result)?
Thanks y'all
Edit:
I believe this will illustrate the question better:
X: Time, Y: Altitude, Z: Latitude or Longitude
I essentially want to fill up the white space (I understand the consequences of extrapolation and all, but at this point I just want an algorithm that works). The blue dots are my grid and the color plot is just a normal plt.contour with no interpolation done. I want to end up with blue dots all over the plot area.
Rafael! With respect to your interpolation question, I can explain the math if you want to manually come up with an interpolation function, but there is an existing resource you might want to look into: scipy.interpolate.RegularGridInterpolator
(see https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.interpolate.RegularGridInterpolator.html)
If I have misunderstood your issue, another interpolation method from the module might be appropriate: see scipy.interpolate.
For plotting the 3d surface, https://matplotlib.org/examples/mplot3d/surface3d_demo.html might help guide you! Let me know if this helps! Just comment if you would like me to expand! Hopefully those are the resources you were looking for!
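A minimal sketch of RegularGridInterpolator, using hypothetical axes and data; it assumes the values have first been placed on regular 1-D time and altitude coordinates, with NaNs handled separately:

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical regular axes and data, standing in for the (99, 25) arrays.
time = np.linspace(0, 24, 99)        # e.g. hours
alt = np.linspace(0, 50, 25)         # e.g. km
z = np.random.rand(99, 25)           # stand-in for the latitude values

interp = RegularGridInterpolator((time, alt), z,
                                  bounds_error=False, fill_value=np.nan)

# Evaluate on a finer, uniform grid, e.g. for a smoother contour plot.
tt, aa = np.meshgrid(np.linspace(0, 24, 300), np.linspace(0, 50, 100), indexing='ij')
z_fine = interp(np.column_stack([tt.ravel(), aa.ravel()])).reshape(tt.shape)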

Python- np.mean() giving wrong means?

The issue
So I have 50 netCDF4 data files that contain decades of monthly temperature predictions on a global grid. I'm using np.mean() to make an ensemble average of all 50 data files together while preserving time length & spatial scale, but np.mean() gives me two different answers. The first time I run its block of code, it gives me a number that, when averaged over latitude & longitude & plotted against the individual runs, is slightly lower than what the ensemble mean should be. If I re-run the block, it gives me a different mean which looks correct.
The code
I can't copy every line here since it's long, but here's what I do for each run.
#Historical (1950-2020) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_195001-202012.nc") #Import data file
tash1 = ncin_1.variables['tas'][:] #extract tas (temperature) variable
ncin_1.close() #close to save memory
#Repeat for future (2021-2100) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_202101-210012.nc")
tasr1 = ncin_1.variables['tas'][:]
ncin_1.close()
#Concatenate historical & future files together to make one time series array
tas11 = np.concatenate((tash1,tasr1),axis=0)
#Subtract the 1950-1979 mean to obtain anomalies
tas11 = tas11 - np.mean(tas11[0:360],axis=0,dtype=np.float64)
And I repeat that 49 more times for the other datasets. Each array (tas11, tas12, etc.) has the shape (1812, 64, 128), corresponding to time length in months, latitude, and longitude.
To get the ensemble mean, I do the following.
#Move all tas data to one array
alltas = np.zeros((1812,64,128,51)) #time in months, lat, lon, members (no ensemble mean value yet)
alltas[:,:,:,0] = tas11
(...)
alltas[:,:,:,49] = tas50
#Calculate ensemble mean & fill into 51st slot in axis 3
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
When I check a coordinate & month, the ensemble mean is off from what it should be. Here's what a plot of globally averaged temperatures from 1950-2100 looks like with the first mean (with monthly values averaged into annual values). The black line is the ensemble mean & the colored lines are individual runs.
Obviously that deviated below the real ensemble mean. Here's what the plot looks like when I run alltas[:,:,:,50]=np.mean(alltas,axis=3,dtype=np.float64) a second time & keep everything else the same.
Much better.
The question
Why does np.mean() calculate the wrong value the first time? I tried specifying the data type as a float when using np.mean() like in this question- Wrong numpy mean value?
But it didn't work. Any way I can fix it so it works correctly the first time? I don't want this problem to occur on a calculation where it's not so easy to notice a math error.
In the line
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
the argument to mean should be alltas[:,:,:,:50]:
alltas[:,:,:,50] = np.mean(alltas[:,:,:,:50], axis=3, dtype=np.float64)
Otherwise you are including those final zeros in the calculation of the ensemble means.
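A way to sidestep the problem entirely (a sketch, not from the answer): keep the ensemble members and the mean in separate arrays, so there is never a placeholder slot of zeros that can leak into the average.

import numpy as np

# Stack only the actual ensemble members along a new last axis
# (extend the list with all 50 member arrays, through tas50).
members = np.stack([tas11, tas12, tas13], axis=-1)
ens_mean = members.mean(axis=-1, dtype=np.float64)   # shape (1812, 64, 128)

This also gives the same result no matter how many times the block is re-run.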
