Read value over time at certain coordinate from netCDF with python

I have a netCDF file with a grid (each step 0.25°).
What I want is the value of the variable, let's say tempMax, at a certain grid point over the last 50 years.
I am aware that you read the data into Python like this:
lon = numpy.array(file.variables['longitude'][:])
lat = numpy.array(file.variables['latitude'][:])
temp = numpy.array(file.variables['tempMax'][:])
time = numpy.array(file.variables['time'][:])
That leaves me with arrays, and I do not know how to "untangle" them.
How do I get the value at a certain coordinate (stored in temp) over the whole time span (stored in time)?
So what I want to display is the value over time at that certain coordinate.
Any ideas how I could achieve that?
Thanks!

I'm guessing that tempMax is 3D (time x lat x lon) and should then be read in as
temp = ncfile.variables['tempMax'][:,:,:]
(Note two things: (1) it's best to avoid the name file for the variable, since it shadows a built-in in Python 2; use something like ncfile as shown above. (2) temp is automatically returned as a numpy.ndarray by the call above, so you don't need numpy.array() when reading the variables in.)
Now you can extract temperatures for all times at a certain location with
temp_crd = temp[:,lat_idx,lon_idx]
where lat_idx and lon_idx are integers corresponding to the index of the latitude and longitude coordinates. If you know these indices beforehand, great, just plug them in, e.g. temp_crd = temp[:,25,30]. (You can use the tool ncdump to view the contents of a netCDF file, https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/ncdump.html)
The more likely case is that you know the coordinates, but not their indices beforehand. Let's say you want temperatures at 50N and 270E. You can use the numpy.where function to extract the indices of the coordinates given the lat and lon arrays that you've already read in.
lat_idx = numpy.where(lat==50)[0][0]
lon_idx = numpy.where(lon==270)[0][0]
temp_crd = temp[:,lat_idx,lon_idx]
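Putting the pieces together, a minimal end-to-end sketch might look like this (the file name data.nc and the target point 50N/270E are placeholders; taking the argmin of the absolute difference finds the nearest grid point even when the target doesn't fall exactly on a grid value):
import numpy as np
from netCDF4 import Dataset

ncfile = Dataset('data.nc')                 # placeholder file name
lat  = ncfile.variables['latitude'][:]
lon  = ncfile.variables['longitude'][:]
time = ncfile.variables['time'][:]
temp = ncfile.variables['tempMax'][:]       # assumed shape: (time, lat, lon)

# index of the grid point nearest to 50N, 270E
lat_idx = np.abs(lat - 50.0).argmin()
lon_idx = np.abs(lon - 270.0).argmin()

temp_series = temp[:, lat_idx, lon_idx]     # one tempMax value per time step
print(temp_series.shape)                    # same length as the time array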

Related

How to select data from netcdf file by specific variable value?

I am searching for a way to select data from a NetCDF file at a specific variable value. The dataset contains time, lat, and lon coordinates and a range of variables. One of these variables is a mask with specific values for land / open ocean / sea-ice / lake. Since the open ocean is represented by ds.mask = 1, I want to extract only the sea surface temperature values located at the coordinates (in time and space) where mask = 1. However, I do not want the sea surface temperature values at other coordinates to be set to NaN; I want to keep only those coordinates and variable values where ds.mask = 1. I know how to select data with xarray.sel/isel, but that only works for selecting by coordinates, not by variable values as I am trying to do. Any help would be very much appreciated.
lati = stormtrack_lat.values
loni = stormtrack_lon.values
timei = stormtrack_datetime.values
tmax = timei.max() + np.timedelta64(10,'D')
tmin = timei.min() - np.timedelta64(10,'D')
SSTskin_subfile = SSTskin_file.sel(time=slice(tmin, tmax))
#HERE I NEED HELP:
#extract data where mask = ocean (1) and use only these data points and keep these only!
SSTskin_subfile_masked = SSTskin_subfile.sel(SSTskin_subfile.mask == 1) #does not work yet (Thrown error: ValueError: the first argument to .isel must be a dictionary)
You can apply the ocean mask with .where:
SSTskin_subfile_masked = SSTskin_subfile.where(SSTskin_subfile.mask == 1)
It is not possible to drop every masked point, because the data are gridded: if even one value along a given latitude is defined, you have to keep the whole row at that latitude. You can, however, drop the coordinates where all values are NaN, one dimension at a time:
SSTskin_subfile_masked.dropna(dim='lat', how='all').dropna(dim='lon', how='all')
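For illustration, here is a small self-contained sketch of the same where/dropna pattern on synthetic data (every name and value below is made up; only the pattern matters):
import numpy as np
import xarray as xr

# toy dataset: 3x4 grid, mask values 0/1/2 standing in for land/ocean/sea-ice
ds = xr.Dataset(
    {
        'sst':  (('lat', 'lon'), np.arange(12, dtype=float).reshape(3, 4)),
        'mask': (('lat', 'lon'), np.array([[1, 1, 0, 0],
                                           [1, 2, 0, 0],
                                           [0, 0, 0, 0]])),
    },
    coords={'lat': [10.0, 10.25, 10.5], 'lon': [20.0, 20.25, 20.5, 20.75]},
)

masked = ds['sst'].where(ds['mask'] == 1)                                   # non-ocean cells become NaN
trimmed = masked.dropna(dim='lat', how='all').dropna(dim='lon', how='all')  # drop all-NaN rows/columns
print(trimmed)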

matrix vs. list - switching from Matlab to Python

Coming from a Matlab background, where everything is a matrix/vector, it was very easy to loop through a given data set and build a matrix successively. Since the final object was a matrix, it was also very easy to extract specific elements of the matrix. I'm finding it rather problematic in Python. I've reproduced the code here to explain where I am getting stuck.
The original data is just a time series with a month and a price. The goal is to simulate select subsets of these prices. The loop starts by collecting all months into one set, and then drops one month in each successive loop. For 12 months, I will have (n^2 - n)/2 + n columns, 78 in total in this example. To be clear, n is the total number of time periods, 12 in this data set. The rows of the matrix will be the Z scores sampled from the standard normal variable; the goal is to simulate all 78 prices in one go in a matrix. The number of Z scores is determined by the variable num_terminal_values, currently set to 5 just to keep things simple and easy to visualize at this point.
Here's a link to a Google Sheet with the original matrix: google sheet with corr mat. The code below may not work from the Google Sheet; the sheet is intended to show what the original data is. My steps (and Python code) are as follows:
#1 read the data
import pandas as pd

# 'xl' refers to the Excel workbook handle, defined before this snippet
dfCrv = pd.read_excel(xl, sheet_name='upload', usecols=range(0, 2)).dropna(axis=0)
#2 create the looper variables and then loop through the data to build a matrix. The rows in the matrix are Z values sampled from the standard normal (this is the variable num_terminal_values). The columns refers to each individual simulation month.
import datetime as dt
import numpy as np
import numpy.random as npr

lst_zUCorr = []
num_terminal_values = 5
as_of = dt.datetime(2020, 12, 1)
max_months = dfCrv.shape[0]
sim_months = pd.date_range(dfCrv['term'].iloc[0], dfCrv['term'].iloc[-1], freq='MS')
end_month = dfCrv['term'].iloc[-1]
dfCrv = dfCrv.set_index('term', drop=False)

for runNum in range(max_months):
    sim_month = dfCrv['term'].iloc[runNum]
    ttm = ((sim_month - as_of).days) / 365
    num_months = (end_month.year - sim_month.year) * 12 + (end_month.month - sim_month.month) + 1
    zUCorr = npr.standard_normal(size=(num_terminal_values, num_months))
    lst_zUCorr.append(zUCorr)

#3 investigate the objects
lst_zUCorr
z = np.hstack(lst_zUCorr)
z
So far, everything works fine. However, I don't know how to transform the object lst_zUCorr into a simple matrix. I've tried hstack etc., but the result still doesn't look like a matrix to me.
The next set of operations requires the data in simple matrix form, but what I'm getting here isn't a matrix.
Key point/question - the final 5x78 matrix in Matlab can be used to do more operations. Is there a way to convert the equivalent Python object into a 5x78 matrix, or will I now need to do more coding to access specific subsets of the Python objects?
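For what it's worth, a quick sanity check on synthetic data with the shapes described above suggests that np.hstack already returns a plain 2-D numpy.ndarray that slices much like a Matlab matrix (this is only a sketch; the block widths 12 down to 1 are made up so that they sum to 78):
import numpy as np
import numpy.random as npr

num_terminal_values = 5
# widths 12, 11, ..., 1 sum to (12^2 - 12)/2 + 12 = 78 columns
blocks = [npr.standard_normal(size=(num_terminal_values, k)) for k in range(12, 0, -1)]

z = np.hstack(blocks)
print(type(z), z.shape)     # <class 'numpy.ndarray'> (5, 78)
print(z[:, 0:12].shape)     # the first simulation month's block, (5, 12)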

How to iterate through multiple data frames in a dictionary

I've created a dict that appears to consist of multiple data frames parsed by location. However, when I try to iterate through the dict to run correlations by location, it appears to run the correlation on the entire set.
I have split the data frame by location (Store_ID), and the loop will print each Store_ID, but because the correlations are exactly the same in each iteration, I suspect it's just using the entire dataset and not iterating through the data frames in the dict.
I started with:
stores = df.Store_ID.unique()
storedict = {elem : pd.DataFrame() for elem in stores}
for key in storedict.keys():
    storedict[key] = df[:][df.Store_ID == key]
np.array(storedict) prints the array, grouped by each location.
But this loop (below), though it iterates through stores when it prints, seems to return the same correlation coefficients as though it's just repeating the Pearson correlation on the entire set of locations (stores).
What I'm trying to do is have it show, e.g. the Store ID and the correlation matrix for the data associated with that Store ID, then the next Store ID and its correlation matrix, and so on...
I must be missing something idiotically obvious here. What is it?
EDIT:
So when I run:
for store in stores:
    print("\r")
    print(store)
    pd.set_option('display.width', 100)
    pd.set_option('precision', 3)
    correlations = data.corr(method='pearson')
    print(correlations)
I get the same list of correlations. I wonder if it's because data is defined globally as:
data = df.drop(['datestring'], axis=1)
data.index = df.datestring
values = data.values
I think data.corr is ignoring the loop variable and looking at the original dataframe. How do I define correlations so that it runs iteratively on the "data" from each store, not all stores? Again, what I wanted to do here was iteratively split the one data frame into many and run the correlation on each store as a separate data frame (or however else is easiest to get it to work without multiplying the volume of code to tackle something that could be looped).
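One way the loop might use the per-store frames from storedict instead of the global data frame is sketched below (untested against the real data; it assumes the columns other than Store_ID and datestring are numeric):
# sketch: one correlation matrix per store, computed from that store's frame only
for store, store_df in storedict.items():
    store_data = store_df.drop(['datestring', 'Store_ID'], axis=1)  # keep only the numeric columns
    print(store)
    print(store_data.corr(method='pearson'))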

Data extraction from WRF output file

I have a WRF output netCDF file. The file has the variables temp and prec. The dimension keys are time, south-north and west-east. How do I select different lat/long values in a region? The problem is that south-north and west-east are not variables, so I have to find the index values of four lat/long values.
1) Change your Registry files (I think it is Registry.EM_COMMON) so that you print latitude and longitude in your wrfout_d01_time.nc files.
2) Go to your WRFV3 directory.
3) Clean, configure and recompile.
4) Run your model again the way you are used to.
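As a side note, separate from the recompilation steps above: once 2-D latitude/longitude variables such as XLAT and XLONG are present in the wrfout file, a nearest-grid-cell lookup could look like the sketch below (the file name, the variable name temp and the target coordinates are placeholders):
import numpy as np
from netCDF4 import Dataset

nc = Dataset('wrfout_d01_2020-01-01_00:00:00')   # placeholder file name
xlat  = nc.variables['XLAT'][0, :, :]            # (south_north, west_east) at the first time
xlong = nc.variables['XLONG'][0, :, :]

target_lat, target_lon = 28.6, 77.2              # placeholder point
dist2 = (xlat - target_lat) ** 2 + (xlong - target_lon) ** 2
j, i = np.unravel_index(dist2.argmin(), dist2.shape)

temp_series = nc.variables['temp'][:, j, i]      # values at that cell for all times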

Pandas - How do I look for a set of values in a column and if it is present return a value in another column

I am new to pandas. I have a csv file which has latitude and longitude columns and also a tile ID column; the file has around 1 million rows. I have a list of around a hundred tile IDs and want to get the latitude and longitude coordinates for these tile IDs. Currently I have:
good_tiles_str = [str(q) for q in good_tiles]  # setting list elements to string data type
file['tile'] = file.tile.astype(str)           # setting tile column to string data type

for i in range(len(good_tiles_str)):
    x = good_tiles_str[i]
    lat = file.loc[file['tile'].str.contains(x), 'BL_Latitude']    # finding lat coordinates
    long = file.loc[file['tile'].str.contains(x), 'BL_Longitude']  # finding long coordinates
    print(lat)
    print(long)
This method is very slow, and I know it is not the correct way, as I've heard you should not use for loops like this with pandas. Also, it does not work: it doesn't find all the latitude and longitude points for the tile IDs.
Any help would be very gladly appreciated.
There is no need to iterate over rows explicitly, as far as I understood your question.
If you want a particular assignment given a condition, you can do so directly. Here's one way using numpy.where; ~ negates a condition.
import numpy as np

rule1 = file['tile'].str.contains(x)   # first condition
rule2 = file['tile'].str.contains(x)   # second condition (as written, identical to rule1)
file['flag'] = np.where(rule1, 'BL_Latitude', ' ')
file['flag'] = np.where(rule2 & ~rule1, 'BL_Longitude', file['flag'])
Try this:
search_for = '|'.join(good_tiles_str)
good = file[file.tile.str.contains(search_for)]
good = good[['BL_Latitude', 'BL_Longitude']].drop_duplicates()
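One caveat: str.contains does substring matching, so a tile ID like '12' would also match '123'. If the IDs should match exactly, isin may be the safer choice; a possible variant:
good = file[file['tile'].isin(good_tiles_str)]   # exact matches only
good = good[['BL_Latitude', 'BL_Longitude']].drop_duplicates()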
