How to identify time, lon, and lat coordinates in xarray? - python

What is the best way to determine which coordinates of an xarray DataArray or Dataset contain longitude, latitude, and time?
A typical Dataset might look like this:
<xarray.Dataset>
Dimensions: (ensemble: 9, lat: 224, lon: 464, time: 12054)
Coordinates:
* lat (lat) float64 25.06 25.19 25.31 25.44 ... 52.56 52.69 52.81 52.94
* lon (lon) float64 -124.9 -124.8 -124.7 ... -67.31 -67.19 -67.06
* time (time) datetime64[ns] 1980-01-01 1980-01-02 ... 2012-12-31
Dimensions without coordinates: ensemble
Data variables:
elevation (lat, lon) float64 dask.array<shape=(224, 464), chunksize=(224, 464)>
temp (ensemble, time, lat, lon) float64 dask.array<shape=(9, 12054, 224, 464), chunksize=(1, 287, 224, 464)>
One approach could be to loop through the variables listed in the coords attribute (e.g. temp.coords), looking for standard_name attributes of time, longitude, and latitude. But many datasets don't include standard_name attributes for all variables.
I guess another approach would be to search over the units attributes and check for appropriate values (e.g. degrees_east or degrees_west for longitude, etc.).
Is there a better way?

The MetPy package includes some helpers for systematic coordinate identification like this. You can see the basics of how this works in the xarray with MetPy tutorial. For example, if you want the time coordinate of a DataArray called temp (assuming it came from a dataset that has been parsed by MetPy), you would simply call:
temp.metpy.time
This is done internally by parsing the coordinate metadata according to the CF conventions.
Here's a short example:
import xarray as xr
import metpy.calc as mpcalc  # importing MetPy registers the .metpy accessor on xarray objects

ds = xr.tutorial.load_dataset('air_temperature')
ds = ds.metpy.parse_cf()
x, y, t = ds['air'].metpy.coordinates('x', 'y', 'time')
print([coord.name for coord in (x, y, t)])
which produces:
['lon', 'lat', 'time']

You can probably do something similar to the code below (written against netCDF4-python's get_variables_by_attributes) with xarray's filter_by_attrs:
def x_axis(nc):
    xnames = ['longitude', 'grid_longitude', 'projection_x_coordinate']
    xunits = [
        'degrees_east',
        'degree_east',
        'degree_E',
        'degrees_E',
        'degreeE',
        'degreesE',
    ]
    xvars = list(set(
        nc.get_variables_by_attributes(
            axis=lambda x: x and str(x).lower() == 'x'
        ) +
        nc.get_variables_by_attributes(
            standard_name=lambda x: x and str(x).lower() in xnames
        ) +
        nc.get_variables_by_attributes(
            units=lambda x: x and str(x).lower() in xunits
        )
    ))
    return xvars
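A rough xarray translation of the same idea (a sketch; it loops over ds.coords directly rather than relying on a particular filter_by_attrs version, and the attribute values mirror the CF lists above):

def x_coords(ds):
    """Return names of coordinates that look like an x/longitude axis."""
    xnames = {'longitude', 'grid_longitude', 'projection_x_coordinate'}
    xunits = {'degrees_east', 'degree_east', 'degree_e',
              'degrees_e', 'degreee', 'degreese'}
    found = []
    for name, coord in ds.coords.items():
        attrs = coord.attrs
        if (str(attrs.get('axis', '')).lower() == 'x'
                or str(attrs.get('standard_name', '')).lower() in xnames
                or str(attrs.get('units', '')).lower() in xunits):
            found.append(name)
    return found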

I think we should lean heavily on CF conventions; they exist precisely for this reason. So I would recommend separating this problem into two parts:
1. Fixing non-CF-compliant datasets (perhaps a small library for this purpose would make sense; it could contain the logic to translate common variable names into appropriate standard_name attributes, as in the sketch below).
2. Parsing CF-compliant datasets (which can leverage standard_name attributes).
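For the first part, a minimal sketch of such a translation layer (the name mapping below is illustrative, not exhaustive):

# Hypothetical mapping from common coordinate names to CF standard_name values.
CF_NAME_MAP = {'lat': 'latitude', 'lon': 'longitude', 'long': 'longitude',
               't': 'time', 'time': 'time'}

def fix_cf(ds):
    """Add missing standard_name attributes based on common coordinate names."""
    for name in ds.coords:
        sn = CF_NAME_MAP.get(str(name).lower())
        if sn and 'standard_name' not in ds[name].attrs:
            ds[name].attrs['standard_name'] = sn
    return ds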

If you are looking for just the special coords that act as indexes, you can iterate over ds.indexes and do some string parsing on their names. Something like:
ds = xr.tutorial.load_dataset('air_temperature')
ds.lat.attrs.pop('standard_name')  # drop the attribute to demonstrate the fallback

for k in ds.indexes.keys():
    v = ds[k]
    sn = v.attrs.get('standard_name')
    if not sn:
        if 'lon' in k:
            v.attrs.update(standard_name='longitude')
            continue
        if 'lat' in k:
            v.attrs.update(standard_name='latitude')
            continue
        if 'time' in k or k in ['day', 't', 'month', 'year']:
            v.attrs.update(standard_name='time')
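After the loop, coordinates can be located by standard_name regardless of their original names, for example:

time_name = next(k for k in ds.indexes
                 if ds[k].attrs.get('standard_name') == 'time')
print(time_name)  # -> 'time'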

Related

python mask netcdf file with a shape file and dealing with LAT LONG in degree vs LAT LONG in meters

I am struggling to mask my netcdf dataset. I have managed to do something but not in the proper way.
Basically, I have a shape file and a netcdf dataset.
I read the shapefile as follow:
import geopandas as gpd
shp_noce = gpd.read_file(shapefile_path)
which reads as:
DN geometry
0 1 POLYGON ((660074.143 5155942.267, 660172.884 5...
Then, I read the file as
rain = xarray.open_dataset(ncfile_path)
and here is the results:
<xarray.Dataset>
Dimensions: (DATE: 14245, x: 641, y: 643)
Coordinates:
* DATE (DATE) datetime64[ns] 1980-01-01T12:00:00 ... 2018-1...
* x (x) float64 6.058e+05 6.061e+05 ... 7.656e+05 7.658e+05
* y (y) float64 5.06e+06 5.06e+06 ... 5.22e+06 5.22e+06
Data variables:
transverse_mercator |S1 ...
precipitation (DATE, y, x) float32 ...
Attributes:
CDI: Climate Data Interface version 1.9.9 (https://mpimet.mpg.de...
Conventions: CF-1.5
Title: Daily total precipitation Trentino-South Tyrol 250-meter re...
Created on: Fri Feb 26 21:30:51 2021
history: Fri Feb 26 23:31:30 2021: cdo -z zip -mergetime DAILYPCP_19...
CDO: Climate Data Operators version 1.9.9 (https://mpimet.mpg.de..
I have tried to follow some suggestions coming from other posts. First of all, I have tried this approach, which is based on rioxarray:
rain.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=True)
This is the outcome:
raise MissingSpatialDimensionError(
MissingSpatialDimensionError: x dimension (lon) not found.
As far as I have understood, there could be a problem linking the shapefile and the netcdf dataset due to the projection units.
So, following what is reported here, I have done the following:
shp_noce.to_crs("epsg:3395")
However, I get the same error, I suppose because the fields in the netcdf dataset are named x and y.
What are your suggestions? Should I rename the fields, or should I "set_spatial_dims" with x and y?
If your data and shapefile are in the same CRS (here a transverse Mercator projection), all you need to do is tell rioxarray that your spatial dims are x and y.
rain.rio.set_spatial_dims(x_dim="x", y_dim="y", inplace=True)
See the rioxarray API documentation for ds.rio.set_spatial_dims:
set_spatial_dims(x_dim: str, y_dim: str, inplace: bool = True) -> Union[xarray.Dataset, xarray.DataArray]
This sets the spatial dimensions of the dataset.
Parameters:
x_dim (str) – The name of the x dimension.
y_dim (str) – The name of the y dimension.
inplace (bool, optional) – If True, it will modify the dataframe in place. Otherwise it will return a modified copy.
Returns:
Dataset or DataArray with spatial dimensions set.
You told it to look for a dimension named "lon", and it's telling you that lon isn't found in the dataset. That's because the x dimension is named "x" :)
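Once the spatial dims are set, a possible follow-up for masking with the shapefile could look like the sketch below. The EPSG code is only an assumption for illustration; read the real CRS from the dataset's transverse_mercator grid-mapping variable.

import rioxarray  # noqa: F401  (registers the .rio accessor)

rain.rio.set_spatial_dims(x_dim="x", y_dim="y", inplace=True)
# Assumed CRS for illustration; take the real one from the grid mapping.
rain.rio.write_crs("EPSG:25832", inplace=True)

# Reproject the shapefile to the raster's CRS; note to_crs returns a copy.
shp_noce = shp_noce.to_crs(rain.rio.crs)

masked = rain.rio.clip(shp_noce.geometry, shp_noce.crs, drop=False)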

How to create a variable from xarray dataset coordinates?

Based on an xarray dataset containing latitude and longitude coordinates and several variables, I would like to create a new variable containing objects based on the latitude and longitude coordinates.
For example, from the following dataset:
<xarray.Dataset>
Dimensions: (time: 100, x: 1000, y: 840)
Coordinates:
* x (x) float64 2.452e+06 2.458e+06 2.462e+06 ... 7.442e+06 7.448e+06
* y (y) float64 1.352e+06 1.358e+06 1.362e+06 ... 5.542e+06 5.548e+06
* time (time) datetime64[ns] 2015-01-01 ... 2015-01-05T03:00:00
... I would like to simply create a point object for each grid cell based on the respective latitude and longitude coordinates.
Pseudocode:
ds['points'] = (('y', 'x'), point_creation_function(ds.y, ds.x))
(How) Can I apply a function that requires the coordinate values as inputs, such that the result can be directly added as a new variable?
A horrible implementation, after an initialization of ds.points, would be:
for x_value in ds.x:
    for y_value in ds.y:
        ds.points.loc[dict(x=x_value, y=y_value)] = (x_value, y_value)
I assume there is an elegant and computation-efficient solution available, but searching the documentation I did not understand how to use apply, reduce or other functions to achieve it.
If I understand your question correctly, I think this is the answer:
import numpy as np
import xarray as xr
# Create some example data
data = np.random.rand(10,5,6)
# Make the dataset.
ds = xr.Dataset({"my_var": (["time", "x", "y"], data)})
# Create a MultiIndex
ds = ds.stack(points=("x", "y"))
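After stacking, points is a MultiIndex over (x, y), so each element carries its coordinate pair; a quick check continuing the example above:

# Each entry of the stacked coordinate is an (x, y) index tuple.
print(ds.points.values[:3])
# Data can now be indexed along the single "points" dimension.
print(ds.my_var.isel(time=0, points=0).values)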

Select data by latitude and longitude

I am using a dataset from DWD (Deutscher Wetterdienst) and want to select data by latitude and longitude. The import works so far, so no problem there. Selecting data with sel works when I use x and y, but not with lat and long. I tried all the answers I could find, like:
ds.sel(latitude=50, longitude=14, method='nearest')
but I am getting the error
ValueError: dimensions or multi-index levels ['latitude', 'longitude'] do not exist
That's my code:
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
import xarray as xr

ds = xr.open_dataset(
    'cosmo-d2_germany_rotated-lat-lon_single-level_2019061721_012_ASWDIFD_S.grib2',
    engine='cfgrib',
    backend_kwargs={'filter_by_keys': {'stepUnits': 1}}
)

print(ds)
Output:
<xarray.Dataset>
Dimensions: (x: 651, y: 716)
Coordinates:
time datetime64[ns] ...
step timedelta64[ns] ...
surface int32 ...
latitude (y, x) float64 ...
longitude (y, x) float64 ...
valid_time datetime64[ns] ...
Dimensions without coordinates: x, y
Data variables:
ASWDIFD_S (y, x) float32 ...
Attributes:
GRIB_edition: 2
GRIB_centre: edzw
GRIB_centreDescription: Offenbach
GRIB_subCentre: 255
Conventions: CF-1.7
institution: Offenbach
history: 2019-07-22T13:35:33 GRIB to CDM+CF via cfgrib-
In your file latitude and longitude are not dimensions but rather helper 2D variables containing coordinate data. In xarray parlance they are called non-dimension coordinates, and you cannot slice on them. See also Working with Multidimensional Coordinates.
It would be better to regrid the data to a regular grid inside Python so that you have latitudes and longitudes as 1D vectors; you would have to make a grid and then interpolate the data over that grid.
Also check https://www.ecmwf.int/sites/default/files/elibrary/2018/18727-cfgrib-easy-and-efficient-grib-file-access-xarray.pdf to see how to access grib files in xarray. If you don't want to use xarray for this purpose, pygrib is another option.
I can't test the solution as I don't have the cfgrib engine installed, but could you try a find_nearest helper (note this is not a numpy built-in; it is defined in the linked answer):
find_nearest(lonarray, lonvalue)
to find the lon and lat indexes near your point, as per this solution:
Find nearest value in numpy array
And then select the point using the indexes directly on the x, y coordinates?
http://xarray.pydata.org/en/stable/indexing.html
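Since latitude and longitude here are 2-D arrays, a hand-rolled nearest-point lookup could be a sketch like this (assuming the ds opened above; for a regional grid, the flat squared-degree distance is a reasonable approximation):

import numpy as np

def nearest_ij(lat2d, lon2d, lat0, lon0):
    """Index of the grid point minimizing the squared coordinate distance."""
    dist2 = (lat2d - lat0) ** 2 + (lon2d - lon0) ** 2
    return np.unravel_index(dist2.argmin(), dist2.shape)

j, i = nearest_ij(ds.latitude.values, ds.longitude.values, 50.0, 14.0)
print(ds.ASWDIFD_S.isel(y=j, x=i))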
I wrote a function for the files from the DWD:
import pygrib  # https://jswhit.github.io/pygrib/docs/
import numpy as np

def get_grib_data_nearest_point(grib_file, inp_lat, inp_lon):
    """
    Gets the value corresponding to a latitude-longitude pair of
    coordinates in a grib file.
    :param grib_file: path to the grib file on disk
    :param inp_lat: latitude
    :param inp_lon: longitude
    :return: scalar
    """
    # open the grib file, get the coordinates and values
    grbs = pygrib.open(grib_file)
    grb = grbs[1]
    lats, lons = grb.latlons()
    values = grb.values
    # check if the user coords are valid
    if inp_lat > max(grb.distinctLatitudes): return np.nan
    if inp_lat < min(grb.distinctLatitudes): return np.nan
    if inp_lon > max(grb.distinctLongitudes): return np.nan
    if inp_lon < min(grb.distinctLongitudes): return np.nan
    # close the file only after all reads from the message
    grbs.close()
    # find the index of the closest lat (x); assumes lats are monotonic
    diff_save = 999
    for x in range(len(lats)):
        diff = abs(lats[x][0] - inp_lat)
        if diff < diff_save:
            diff_save = diff
        else:
            x -= 1  # the previous row was closer
            break
    # find the index of the closest lon (y) within that row
    diff_save = 999
    for y in range(len(lons[x])):
        diff = abs(lons[x][y] - inp_lon)
        if diff < diff_save:
            diff_save = diff
        else:
            y -= 1  # the previous column was closer
            break
    # index the array to return the corresponding value
    return values[x][y]
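Hypothetical usage with the file from the question:

val = get_grib_data_nearest_point(
    'cosmo-d2_germany_rotated-lat-lon_single-level_2019061721_012_ASWDIFD_S.grib2',
    50.0, 14.0)
print(val)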
As noted above, you can regrid your data (probably given on a curvilinear grid, i.e. lat and lon as 2D arrays) to your desired resolution of 1-D lat/lon axes, after which you can use .sel directly on the lat/lon coords to slice the data.
Check out xESMF (https://xesmf.readthedocs.io/en/latest/notebooks/Curvilinear_grid.html): easy, fast interpolation and regridding of xarray fields, with good examples and documentation.
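A minimal regridding sketch with xESMF (the target grid extent and 0.02-degree spacing are assumptions; adapt them to your data):

import numpy as np
import xarray as xr
import xesmf as xe

# Target grid with 1-D lat/lon axes.
target = xr.Dataset({
    'lat': (['lat'], np.arange(47.0, 55.0, 0.02)),
    'lon': (['lon'], np.arange(5.0, 16.0, 0.02)),
})

# xESMF expects coordinates named lat/lon; the 2-D curvilinear coords work.
regridder = xe.Regridder(ds.rename({'latitude': 'lat', 'longitude': 'lon'}),
                         target, 'bilinear')
regridded = regridder(ds['ASWDIFD_S'])

# Now .sel works on the 1-D coordinates:
print(regridded.sel(lat=50, lon=14, method='nearest'))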

python mask netcdf data using shapefile

I am using the following packages:
import pandas as pd
import numpy as np
import xarray as xr
import geopandas as gpd
I have the following objects storing data:
print(precip_da)
Out[]:
<xarray.DataArray 'precip' (time: 13665, latitude: 200, longitude: 220)>
[601260000 values with dtype=float32]
Coordinates:
* longitude (longitude) float32 35.024994 35.074997 35.125 35.175003 ...
* latitude (latitude) float32 5.0249977 5.074997 5.125 5.174999 ...
* time (time) datetime64[ns] 1981-01-01 1981-01-02 1981-01-03 ...
Attributes:
standard_name: convective precipitation rate
long_name: Climate Hazards group InfraRed Precipitation with St...
units: mm/day
time_step: day
geostatial_lat_min: -50.0
geostatial_lat_max: 50.0
geostatial_lon_min: -180.0
geostatial_lon_max: 180.0
This looks as follows:
precip_da.mean(dim="time").plot()
I have my shapefile as a geopandas.GeoDataFrame which represents a polygon.
awash = gpd.read_file(shp_dir)
awash
Out[]:
OID_ Name FolderPath SymbolID AltMode Base Clamped Extruded Snippet PopupInfo Shape_Leng Shape_Area geometry
0 0 Awash_Basin Awash_Basin.kml 0 0 0.0 -1 0 None None 30.180944 9.411263 POLYGON Z ((41.78939511000004 11.5539922500000...
Which looks as follows:
awash.plot()
Plotted one on top of the other they look like this:
ax = awash.plot(alpha=0.2, color='black')
precip_da.mean(dim="time").plot(ax=ax,zorder=-1)
My question is: how do I mask the xarray.DataArray by checking if the lat-lon points lie INSIDE the shapefile stored as a geopandas.GeoDataFrame? I want ONLY the precipitation values (mm/day) which fall INSIDE that shapefile.
I want to do something like the following:
masked_precip = precip_da.within(awash)
OR
masked_precip = precip_da.loc[precip_da.isin(awash)]
EDIT 1
I have thought about using the rasterio.mask module but I don't know what format the input data needs to be. It sounds as if it does exactly the right thing:
"Creates a masked or filled array using input shapes. Pixels are masked or set to nodata outside the input shapes"
Reposted from GIS Stack Exchange here
This is the current working solution that I have taken from this gist; it is Stephan Hoyer's answer to a GitHub issue for the xarray project.
On top of the other packages above, both affine and rasterio are required:
from rasterio import features
from affine import Affine

def transform_from_latlon(lat, lon):
    """ input 1D array of lat / lon and output an Affine transformation """
    lat = np.asarray(lat)
    lon = np.asarray(lon)
    trans = Affine.translation(lon[0], lat[0])
    scale = Affine.scale(lon[1] - lon[0], lat[1] - lat[0])
    return trans * scale

def rasterize(shapes, coords, latitude='latitude', longitude='longitude',
              fill=np.nan, **kwargs):
    """Rasterize a list of (geometry, fill_value) tuples onto the given
    xray coordinates. This only works for 1d latitude and longitude
    arrays.

    usage:
    -----
    1. read shapefile to geopandas.GeoDataFrame
       `states = gpd.read_file(shp_dir+shp_file)`
    2. encode the different shapefiles that capture those lat-lons as different
       numbers i.e. 0.0, 1.0 ... and otherwise np.nan
       `shapes = (zip(states.geometry, range(len(states))))`
    3. Assign this to a new coord in your original xarray.DataArray
       `ds['states'] = rasterize(shapes, ds.coords, longitude='X', latitude='Y')`

    arguments:
    ---------
    : **kwargs (dict): passed to `rasterio.rasterize` function

    attrs:
    -----
    :transform (affine.Affine): how to translate from latlon to ...?
    :raster (numpy.ndarray): use rasterio.features.rasterize fill the values
      outside the .shp file with np.nan
    :spatial_coords (dict): dictionary of {"X":xr.DataArray, "Y":xr.DataArray()}
      with "X", "Y" as keys, and xr.DataArray as values

    returns:
    -------
    :(xr.DataArray): DataArray with `values` of nan for points outside shapefile
      and coords `Y` = latitude, 'X' = longitude.
    """
    transform = transform_from_latlon(coords[latitude], coords[longitude])
    out_shape = (len(coords[latitude]), len(coords[longitude]))
    raster = features.rasterize(shapes, out_shape=out_shape,
                                fill=fill, transform=transform,
                                dtype=float, **kwargs)
    spatial_coords = {latitude: coords[latitude], longitude: coords[longitude]}
    return xr.DataArray(raster, coords=spatial_coords, dims=(latitude, longitude))
def add_shape_coord_from_data_array(xr_da, shp_path, coord_name):
    """ Create a new coord for the xr_da indicating whether or not it
    is inside the shapefile

    Creates a new coord - "coord_name" which will have integer values
    used to subset xr_da for plotting / analysis

    Usage:
    -----
    precip_da = add_shape_coord_from_data_array(precip_da, "awash.shp", "awash")
    awash_da = precip_da.where(precip_da.awash==0, other=np.nan)
    """
    # 1. read in the shapefile
    shp_gpd = gpd.read_file(shp_path)

    # 2. create a list of tuples (shapely.geometry, id)
    #    this allows for many different polygons within a .shp file
    #    (e.g. States of the US)
    shapes = [(shape, n) for n, shape in enumerate(shp_gpd.geometry)]

    # 3. create a new coord in the xr_da which will be set to the id in `shapes`
    xr_da[coord_name] = rasterize(shapes, xr_da.coords,
                                  longitude='longitude', latitude='latitude')

    return xr_da
It can be implemented as follows:
precip_da = add_shape_coord_from_data_array(precip_da, shp_dir, "awash")
awash_da = precip_da.where(precip_da.awash==0, other=np.nan)
awash_da.mean(dim="time").plot()
You should have a look at the following packages:
salem and the region of interest example
regionmask (a short sketch follows below)
Both may get you to what you want.
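A minimal regionmask sketch, assuming the awash GeoDataFrame and precip_da from the question (mask_geopandas is available in regionmask >= 0.5):

import regionmask

# Cells inside the polygon get the region's index (0 here, since there is a
# single polygon); cells outside get NaN.
mask = regionmask.mask_geopandas(awash, precip_da.longitude, precip_da.latitude)

# Keep only the values inside the shapefile.
masked_precip = precip_da.where(mask == 0)
masked_precip.mean(dim="time").plot()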

Reading netCDF data

I am trying to read data from a nc file, which has the following variables:
['latitude',
'longitude',
'latitude_bnds',
'longitude_bnds',
'time',
'minimum',
'maximum',
'average',
'stddev',
'AirTemperature']
What I am trying to achieve is to extract the AirTemperature data for any given (time, latitude and longitude):
And for that, I am doing something like this:
from netCDF4 import Dataset
import numpy as np

df = Dataset('data_file.nc', 'r')
lat = df.variables['latitude'][:]
lon = df.variables['longitude'][:]
temp = df.variables['AirTemperature'][:, :, :]
#(lat, lon) for Coffee, TN
test_lat = 35.45
test_lon = -86.05
#getting the indices for the (lat, lon) using numpy.where
lat_idx = np.where(lat==test_lat)[0][0]
lon_idx = np.where(lon==test_lon)[0][0]
#extracting data for all the times for given indices
tmp_crd = temp[:,lat_idx,lon_idx]
Up to this point, it all goes fine. However, when I print the data, I see identical values for any (lat, lon) I have been testing:
print(tmp_crd.data)
>>> [-9999. -9999. -9999. ..., -9999. -9999. -9999.]
I don't understand why the air temperature is always shown as -9999.0. I have tested a lot of other (lat, lon) points, and for every location the air temperature is -9999.0. How can I extract the real data from this file?
Please help :-(
Thank you
Okay, I think I figured it out. Here is what was happening:
The nc file stores latitude and longitude at a higher precision than the rounded (lat, lon) values I was passing, so the exact-equality test in np.where never matched. Once I used the right precision, it works fine. The -9999.0 value was the fill value (_FillValue) of numpy's masked array, returned where no record matched the given lat and lon.
Thanks everyone.
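A more robust alternative to exact equality is to pick the nearest grid index, which sidesteps the precision issue entirely (a sketch using the variables above):

# Nearest-neighbour index lookup: no exact match required.
lat_idx = np.abs(lat - test_lat).argmin()
lon_idx = np.abs(lon - test_lon).argmin()
tmp_crd = temp[:, lat_idx, lon_idx]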
