How to create a variable from xarray dataset coordinates? - python

Based on a xarray dataset containing latitude and longitude coordinates and several variables I would like to create a new variable containing objects based on the latitude and longitude coordinates.
For example, from the following dataset:
<xarray.Dataset>
Dimensions: (time: 100, x: 1000, y: 840)
Coordinates:
* x (x) float64 2.452e+06 2.458e+06 2.462e+06 ... 7.442e+06 7.448e+06
* y (y) float64 1.352e+06 1.358e+06 1.362e+06 ... 5.542e+06 5.548e+06
* time (time) datetime64[ns] 2015-01-01 ... 2015-01-05T03:00:00
... I would like to simply create a point object for each grid cell based on the respective latitude and longitude coordinates.
Pseudocode:
ds['points'] = (('y', 'x'), point_creation_function(ds.y, ds.x))
(How) can I apply a function that takes the coordinate values as inputs, such that the result can be added directly as a new variable?
A horrible implementation after an initialization of ds.points would be:
for x_value in ds.x:
    for y_value in ds.y:
        ds.points.loc[dict(x=x_value, y=y_value)] = (x_value, y_value)
I assume there is an elegant and computation-efficient solution available, but searching the documentation I did not understand how to use apply, reduce or other functions to achieve it.

If I understand your question correctly, I think this is the answer:
import numpy as np
import xarray as xr
# Create some example data
data = np.random.rand(10,5,6)
# Make the dataset.
ds = xr.Dataset({"my_var": (["time", "x", "y"], data)})
# Create a MultiIndex
ds = ds.stack(points=("x", "y"))
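If the goal is an actual (y, x) variable holding one object per grid cell, rather than a stacked MultiIndex, a minimal sketch could look like the following. It assumes a small stand-in grid, and make_point is a hypothetical placeholder for whatever point constructor is needed (for example one that returns a shapely Point); neither comes from the question.

```python
import numpy as np
import xarray as xr

# Small stand-in grid; the real dataset would supply its own x and y.
ds = xr.Dataset(coords={"x": [10.0, 20.0, 30.0], "y": [1.0, 2.0]})

# Broadcast the 1D coordinates to matching 2D arrays with dims (y, x).
yy, xx = xr.broadcast(ds["y"], ds["x"])


def make_point(x, y):
    """Hypothetical point constructor; replace with your own."""
    return (x, y)


# np.frompyfunc applies the function element-wise and returns an object array.
points = np.frompyfunc(make_point, 2, 1)(xx.values, yy.values)

# Attach the result as a new variable on the (y, x) grid.
ds["points"] = (("y", "x"), points)
```

This avoids label-based assignment inside a nested loop, although the function is still called once per cell, so it will not be fast for very large grids.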

Related

How to add time dimension and create an xarray dataset/data array from a stack of rasters?

I have approximately 75 2D raster maps (tifs) of elevation over the exact same area, each acquired at a different time. I would like to stack these using xarray. I can read in each raster (see below) but currently, there is no time coords as I need to extract the time from the title of each file (2017-02-15T06:13:38Z in file below).
da = xr.open_rasterio('tifs/DTSLOS_20170122_20190828_D79H_2017-02-15T06:13:38Z.tif')
da
<xarray.DataArray (y: 12284, x: 17633)>
[216603772 values with dtype=float64]
Coordinates:
band int64 1
* y (y) float64 59.62 59.62 59.62 59.62 59.62 ... 49.8 49.8 49.8 49.8
* x (x) float64 -12.17 -12.17 -12.17 -12.17 ... 1.931 1.932 1.932 1.933
Attributes:
transform: (0.0008, 0.0, -12.172852, 0.0, -0.0008, 59.623425)
crs: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,2...
res: (0.0008, 0.0008)
is_tiled: 1
nodatavals: (-9999.0,)
I'm assuming the way I should approach this is to add the time to each data array and then stack/concatenate them but I am new to xarray and am struggling to figure out how to do this.
You would need to convert the respective string into a datetime object using datetime.strptime and assign it as the time coordinate along which you want to combine the datasets. You also need to expand this dimension, so that xr.combine_by_coords can combine the DataArrays along it. One way to do this would be:
import xarray as xr
from datetime import datetime
import pandas as pd

# collecting datasets when looping over your files
list_da = []
for path in ...:
    # path = "tifs/DTSLOS_20170122_20190828_D79H_2017-02-15T06:13:38Z.tif"
    da = xr.open_rasterio(path)
    time = path.split("_")[-1].split("Z")[0]
    dt = datetime.strptime(time, "%Y-%m-%dT%H:%M:%S")
    dt = pd.to_datetime(dt)
    da = da.assign_coords(time=dt)
    da = da.expand_dims(dim="time")
    list_da.append(da)

# stack dataarrays in list
ds = xr.combine_by_coords(list_da)
That's the way I approached this for my data. Not sure whether it's the most elegant solution, but it worked for me.
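The same idea can be tried without any raster files. The sketch below uses small synthetic arrays in place of the rasters, the second file name is made up for illustration, and xr.concat is swapped in for xr.combine_by_coords as a simple alternative when every array shares the same grid.

```python
from datetime import datetime

import numpy as np
import pandas as pd
import xarray as xr


def time_from_filename(path):
    """Parse the trailing timestamp (e.g. 2017-02-15T06:13:38Z) from a file name."""
    stamp = path.split("_")[-1].split("Z")[0]
    return pd.Timestamp(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"))


paths = [
    # Hypothetical second acquisition, listed out of order on purpose.
    "tifs/DTSLOS_20170122_20190828_D79H_2017-03-01T06:13:38Z.tif",
    "tifs/DTSLOS_20170122_20190828_D79H_2017-02-15T06:13:38Z.tif",
]

arrays = []
for i, path in enumerate(paths):
    # Synthetic stand-in for the raster that would be read from `path`.
    da = xr.DataArray(
        np.full((2, 3), float(i)),
        dims=("y", "x"),
        coords={"y": [59.0, 58.0], "x": [-12.0, -11.0, -10.0]},
    )
    # Attach the time as a scalar coordinate, then promote it to a dimension.
    da = da.assign_coords(time=time_from_filename(path)).expand_dims("time")
    arrays.append(da)

# Concatenate along time and put the slices in chronological order.
stacked = xr.concat(arrays, dim="time").sortby("time")
```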

Expanding coordinates of a variable using other two variables of xarray in Python

Let's say I have data of 16 grid cells (4 * 4) which has a corresponding index (0~15) as dimension & coordinates and variables (a, longitude and latitude) for each cell. Here is the code to create this data.
import xarray as xr
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(16, 3),
                  columns=['a', 'longitude', 'latitude'],
                  index=range(16))
ds = df.to_xarray()
ds
What I want to do is:
Expand the coordinates of variable a from (index) to (longitude, latitude), using the longitude and latitude variables of each cell.
So, the resulting Dataset will have longitude and latitude as its dimensions and coordinates, with variable a indexed by (longitude, latitude).
How can I do this within xarray functionality?
Thanks!
Here is a way to solve that:
# convert the index into a MultiIndex containing longitude and latitude
dat_2d = ds.set_index({'index': ['longitude', 'latitude']})
# unstack the MultiIndex
unstacked = dat_2d.unstack('index')
# plot
unstacked['a'].plot()
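With random longitude and latitude values, as in the question, every cell has a unique coordinate pair, so the unstacked result is a 16 x 16 array that is mostly NaN. A small sketch with an assumed regular 4 x 4 grid (the coordinate values below are made up) shows the dense result one would usually expect:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Assumed regular grid: 4 longitudes x 4 latitudes, flattened to 16 cells.
lon, lat = np.meshgrid([0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0])
df = pd.DataFrame(
    {"a": np.arange(16.0), "longitude": lon.ravel(), "latitude": lat.ravel()}
)
ds = df.to_xarray()  # one dimension named "index"

# Replace the integer index by a (longitude, latitude) MultiIndex, then unstack.
grid = ds.set_index(index=["longitude", "latitude"]).unstack("index")
```

Here grid["a"] has the two dimensions longitude and latitude, each of length 4, with no missing cells.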

Select data by latitude and longitude

I am using a dataset from DWD (Deutscher Wetterdienst) and want to select data by latitude and longitude. The import works fine. Selecting data with sel works when I use x and y, but not with lat and lon. I tried all the answers I could find, like:
ds.sel(latitude=50, longitude=14, method='nearest')
but I am getting the error
ValueError: dimensions or multi-index levels ['latitude', 'longitude'] do not exist
That's my code:
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
import xarray as xr

ds = xr.open_dataset(
    'cosmo-d2_germany_rotated-lat-lon_single-level_2019061721_012_ASWDIFD_S.grib2',
    engine='cfgrib',
    backend_kwargs={'filter_by_keys': {'stepUnits': 1}}
)

print(ds)
Output:
<xarray.Dataset>
Dimensions: (x: 651, y: 716)
Coordinates:
time datetime64[ns] ...
step timedelta64[ns] ...
surface int32 ...
latitude (y, x) float64 ...
longitude (y, x) float64 ...
valid_time datetime64[ns] ...
Dimensions without coordinates: x, y
Data variables:
ASWDIFD_S (y, x) float32 ...
Attributes:
GRIB_edition: 2
GRIB_centre: edzw
GRIB_centreDescription: Offenbach
GRIB_subCentre: 255
Conventions: CF-1.7
institution: Offenbach
history: 2019-07-22T13:35:33 GRIB to CDM+CF via cfgrib-
In your file latitude and longitude are not dimensions but rather helper 2D variables containing coordinate data. In xarray parlance they are called non-dimension coordinates and you cannot slice on them. See also Working with Multidimensional Coordinates.
It would be better to regrid the data to a regular grid inside Python so that latitude and longitude become 1D vectors: build a target grid, then interpolate the data onto it.
Also check https://www.ecmwf.int/sites/default/files/elibrary/2018/18727-cfgrib-easy-and-efficient-grib-file-access-xarray.pdf to see how to access GRIB files in xarray. If you don't want to use xarray for this purpose, pygrib is another option.
I can't test the solution as I don't have the cfgrib engine installed, but could you try to use
find_nearest(lonarray, lonvalue)
to find the lon and lat indexes near your point, as per this solution (note that find_nearest is a helper function defined in that answer, not a NumPy function):
Find nearest value in numpy array
And then select the point using the index directly on the x,y coordinates?
http://xarray.pydata.org/en/stable/indexing.html
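For 2D latitude and longitude arrays, the index lookup can be sketched with plain NumPy. The helper name nearest_grid_index and the synthetic grid below are assumptions for illustration, and the squared difference in degrees is only a rough stand-in for a proper great-circle distance.

```python
import numpy as np


def nearest_grid_index(lat2d, lon2d, lat, lon):
    """Return (iy, ix) of the grid cell closest to (lat, lon) in degree space."""
    dist2 = (lat2d - lat) ** 2 + (lon2d - lon) ** 2
    return np.unravel_index(np.argmin(dist2), dist2.shape)


# Synthetic 2D coordinate arrays standing in for ds.latitude and ds.longitude.
lon2d, lat2d = np.meshgrid(np.linspace(5.0, 15.0, 11), np.linspace(45.0, 55.0, 11))

iy, ix = nearest_grid_index(lat2d, lon2d, lat=50.2, lon=13.9)
```

With the dataset from the question, the inputs would be ds.latitude.values and ds.longitude.values, and the point could then be selected by position with ds.isel(y=iy, x=ix).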
I wrote a function for the files from the DWD:
import pygrib  # https://jswhit.github.io/pygrib/docs/
import numpy as np

def get_grib_data_nearest_point(grib_file, inp_lat, inp_lon):
    """
    Gets the value corresponding to a latitude-longitude pair of coordinates
    in a grib file.
    :param grib_file: path to the grib file on disk
    :param inp_lat: latitude
    :param inp_lon: longitude
    :return: scalar
    """
    # open the grib file, get the coordinates and values
    grbs = pygrib.open(grib_file)
    grb = grbs[1]
    lats, lons = grb.latlons()
    values = grb.values
    grbs.close()
    # check that the user coordinates are valid
    if inp_lat > max(grb.distinctLatitudes): return np.nan
    if inp_lat < min(grb.distinctLatitudes): return np.nan
    if inp_lon > max(grb.distinctLongitudes): return np.nan
    if inp_lon < min(grb.distinctLongitudes): return np.nan
    # find index for closest lat (x); remember the best index, because the
    # loop variable is already one step past it when the loop breaks
    best_x, diff_save = 0, float("inf")
    for x in range(len(lats)):
        diff = abs(lats[x][0] - inp_lat)
        if diff < diff_save:
            diff_save, best_x = diff, x
        else:
            break
    # find index for closest lon (y) within that row
    best_y, diff_save = 0, float("inf")
    for y in range(len(lons[best_x])):
        diff = abs(lons[best_x][y] - inp_lon)
        if diff < diff_save:
            diff_save, best_y = diff, y
        else:
            break
    # index the array to return the corresponding value
    return values[best_x][best_y]
As noted above, you can regrid your data (probably given on a curvilinear grid, i.e., lat and lon in 2D arrays) to 1D lat/lon arrays at your desired resolution, after which you can use .sel directly on the lat/lon coords to slice the data.
Check out xESMF (https://xesmf.readthedocs.io/en/latest/notebooks/Curvilinear_grid.html).
Easy, fast interpolation and regridding of Xarray fields with good examples and documentation.

How to identify time, lon, and lat coordinates in xarray?

What is the best way to determine which coordinates of an xarray DataArray object contain longitude, latitude and time?
A typical dataset might look like this:
<xarray.Dataset>
Dimensions: (ensemble: 9, lat: 224, lon: 464, time: 12054)
Coordinates:
* lat (lat) float64 25.06 25.19 25.31 25.44 ... 52.56 52.69 52.81 52.94
* lon (lon) float64 -124.9 -124.8 -124.7 ... -67.31 -67.19 -67.06
* time (time) datetime64[ns] 1980-01-01 1980-01-02 ... 2012-12-31
Dimensions without coordinates: ensemble
Data variables:
elevation (lat, lon) float64 dask.array<shape=(224, 464), chunksize=(224, 464)>
temp (ensemble, time, lat, lon) float64 dask.array<shape=(9, 12054, 224, 464), chunksize=(1, 287, 224, 464)>
One approach could be to loop through the variables identified by the variable coords, like temp.coords, looking for the standard_name attributes of time, longitude, and latitude. But many datasets don't seem to include standard_name attributes for all variables.
I guess another approach would be to search over the units attributes and check whether they hold appropriate values (e.g. degrees_east or degrees_west for longitude, etc).
Is there a better way?
The MetPy package includes some helpers for systematic coordinate identification like this. You can see the basics of how this works in the xarray with MetPy tutorial. For example, if you want the time coordinate of a DataArray called temp (assuming it came from a dataset that has been parsed by MetPy), you would simply call:
temp.metpy.time
This is done internally by parsing the coordinate metadata according to the CF conventions.
Here's a short example:
import xarray as xr
import metpy.calc as mpcalc
ds = xr.tutorial.load_dataset('air_temperature')
ds = ds.metpy.parse_cf()
x,y,t = ds['air'].metpy.coordinates('x','y','time')
print([coord.name for coord in (x, y, t)])
which produces:
['lon', 'lat', 'time']
You can probably do something similar to the code below (which uses netCDF4's get_variables_by_attributes) with xarray's filter_by_attrs:
def x_axis(nc):
    xnames = ['longitude', 'grid_longitude', 'projection_x_coordinate']
    xunits = [
        'degrees_east',
        'degree_east',
        'degree_E',
        'degrees_E',
        'degreeE',
        'degreesE',
    ]
    # lower-case the units once, so the case-insensitive test below can
    # also match entries such as 'degree_E'
    xunits_lower = [u.lower() for u in xunits]
    xvars = list(set(
        nc.get_variables_by_attributes(
            axis=lambda x: x and str(x).lower() == 'x'
        ) +
        nc.get_variables_by_attributes(
            standard_name=lambda x: x and str(x).lower() in xnames
        ) +
        nc.get_variables_by_attributes(
            units=lambda x: x and str(x).lower() in xunits_lower
        )
    ))
    return xvars
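A rough xarray-only counterpart of that attribute check could look like the sketch below. The helper name find_x_coords and the tiny example dataset are assumptions for illustration; the function simply loops over ds.coords and inspects the axis, standard_name and units attributes.

```python
import numpy as np
import xarray as xr

LON_NAMES = {"longitude", "grid_longitude", "projection_x_coordinate"}
# Lower-cased forms of the CF longitude unit spellings.
LON_UNITS = {"degrees_east", "degree_east", "degree_e", "degrees_e", "degreee", "degreese"}


def find_x_coords(ds):
    """Return names of coordinates whose attributes mark them as an x axis."""
    found = []
    for name, coord in ds.coords.items():
        attrs = coord.attrs
        if (
            str(attrs.get("axis", "")).lower() == "x"
            or str(attrs.get("standard_name", "")).lower() in LON_NAMES
            or str(attrs.get("units", "")).lower() in LON_UNITS
        ):
            found.append(name)
    return found


# Tiny example dataset where only the units attribute identifies longitude.
ds = xr.Dataset(
    {"temp": (("lat", "lon"), np.zeros((2, 3)))},
    coords={
        "lat": ("lat", [10.0, 20.0], {"units": "degrees_north"}),
        "lon": ("lon", [0.0, 1.0, 2.0], {"units": "degrees_E"}),
    },
)
x_coords = find_x_coords(ds)
```

The same pattern extends to latitude and time by swapping the name and unit sets.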
I think we should lean heavily on CF conventions. They exist precisely for this reason. So I would recommend separating this problem into two parts:
Fixing non-CF-compliant datasets (perhaps a small library for this purpose would make sense; it could contain the logic to translate common variable names into appropriate standard_name attributes)
Parsing CF-compliant datasets (can leverage standard_name attributes)
If you are looking for just the special coords that act as indexes, then you can iterate over the ds.indexes and do some string parsing on their names. Something like:
ds = xr.tutorial.load_dataset('air_temperature')
ds.lat.attrs.pop('standard_name')
for k in ds.indexes.keys():
    v = ds[k]
    sn = v.attrs.get('standard_name')
    if not sn:
        if 'lon' in k:
            v.attrs.update(standard_name='longitude')
            continue
        if 'lat' in k:
            v.attrs.update(standard_name='latitude')
            continue
        if 'time' in k or k in ['day', 't', 'month', 'year']:
            v.attrs.update(standard_name='time')

python mask netcdf data using shapefile

I am using the following packages:
import pandas as pd
import numpy as np
import xarray as xr
import geopandas as gpd
I have the following objects storing data:
print(precip_da)
Out[]:
<xarray.DataArray 'precip' (time: 13665, latitude: 200, longitude: 220)>
[601260000 values with dtype=float32]
Coordinates:
* longitude (longitude) float32 35.024994 35.074997 35.125 35.175003 ...
* latitude (latitude) float32 5.0249977 5.074997 5.125 5.174999 ...
* time (time) datetime64[ns] 1981-01-01 1981-01-02 1981-01-03 ...
Attributes:
standard_name: convective precipitation rate
long_name: Climate Hazards group InfraRed Precipitation with St...
units: mm/day
time_step: day
geostatial_lat_min: -50.0
geostatial_lat_max: 50.0
geostatial_lon_min: -180.0
geostatial_lon_max: 180.0
This looks as follows:
precip_da.mean(dim="time").plot()
I have my shapefile as a geopandas.GeoDataFrame which represents a polygon.
awash = gpd.read_file(shp_dir)
awash
Out[]:
OID_ Name FolderPath SymbolID AltMode Base Clamped Extruded Snippet PopupInfo Shape_Leng Shape_Area geometry
0 0 Awash_Basin Awash_Basin.kml 0 0 0.0 -1 0 None None 30.180944 9.411263 POLYGON Z ((41.78939511000004 11.5539922500000...
Which looks as follows:
awash.plot()
Plotted one on top of the other they look like this:
ax = awash.plot(alpha=0.2, color='black')
precip_da.mean(dim="time").plot(ax=ax,zorder=-1)
My question is, how do I mask the xarray.DataArray by checking if the lat-lon points lie INSIDE the shapefile stored as a geopandas.GeoDataFrame?
 So I want ONLY the precipitation values (mm/day) which fall INSIDE that shapefile.
I want to do something like the following:
masked_precip = precip_da.within(awash)
OR
masked_precip = precip_da.loc[precip_da.isin(awash)]
EDIT 1
I have thought about using the rasterio.mask module but I don't know what format the input data needs to be. It sounds as if it does exactly the right thing:
"Creates a masked or filled array using input shapes. Pixels are masked or set to nodata outside the input shapes"
Reposted from GIS Stack Exchange here
This is the current working solution, which I have taken from this gist. It is Stephan Hoyer's answer to a GitHub issue for the xarray project.
On top of the packages above, both affine and rasterio are required:
from rasterio import features
from affine import Affine

def transform_from_latlon(lat, lon):
    """ input 1D array of lat / lon and output an Affine transformation
    """
    lat = np.asarray(lat)
    lon = np.asarray(lon)
    trans = Affine.translation(lon[0], lat[0])
    scale = Affine.scale(lon[1] - lon[0], lat[1] - lat[0])
    return trans * scale

def rasterize(shapes, coords, latitude='latitude', longitude='longitude',
              fill=np.nan, **kwargs):
    """Rasterize a list of (geometry, fill_value) tuples onto the given
    xray coordinates. This only works for 1d latitude and longitude
    arrays.

    usage:
    -----
    1. read shapefile to geopandas.GeoDataFrame
          `states = gpd.read_file(shp_dir+shp_file)`
    2. encode the different shapefiles that capture those lat-lons as different
        numbers i.e. 0.0, 1.0 ... and otherwise np.nan
          `shapes = (zip(states.geometry, range(len(states))))`
    3. Assign this to a new coord in your original xarray.DataArray
          `ds['states'] = rasterize(shapes, ds.coords, longitude='X', latitude='Y')`

    arguments:
    ---------
    : **kwargs (dict): passed to `rasterio.rasterize` function

    attrs:
    -----
    :transform (affine.Affine): how to translate from latlon to ...?
    :raster (numpy.ndarray): use rasterio.features.rasterize fill the values
      outside the .shp file with np.nan
    :spatial_coords (dict): dictionary of {"X":xr.DataArray, "Y":xr.DataArray()}
      with "X", "Y" as keys, and xr.DataArray as values

    returns:
    -------
    :(xr.DataArray): DataArray with `values` of nan for points outside shapefile
      and coords `Y` = latitude, 'X' = longitude.
    """
    transform = transform_from_latlon(coords[latitude], coords[longitude])
    out_shape = (len(coords[latitude]), len(coords[longitude]))
    raster = features.rasterize(shapes, out_shape=out_shape,
                                fill=fill, transform=transform,
                                dtype=float, **kwargs)
    spatial_coords = {latitude: coords[latitude], longitude: coords[longitude]}
    return xr.DataArray(raster, coords=spatial_coords, dims=(latitude, longitude))

def add_shape_coord_from_data_array(xr_da, shp_path, coord_name):
    """ Create a new coord for the xr_da indicating whether or not it
    is inside the shapefile

    Creates a new coord - "coord_name" which will have integer values
    used to subset xr_da for plotting / analysis/

    Usage:
    -----
    precip_da = add_shape_coord_from_data_array(precip_da, "awash.shp", "awash")
    awash_da = precip_da.where(precip_da.awash==0, other=np.nan)
    """
    # 1. read in shapefile
    shp_gpd = gpd.read_file(shp_path)

    # 2. create a list of tuples (shapely.geometry, id)
    #    this allows for many different polygons within a .shp file (e.g. States of US)
    shapes = [(shape, n) for n, shape in enumerate(shp_gpd.geometry)]

    # 3. create a new coord in the xr_da which will be set to the id in `shapes`
    xr_da[coord_name] = rasterize(shapes, xr_da.coords,
                                  longitude='longitude', latitude='latitude')

    return xr_da
It can be implemented as follows:
precip_da = add_shape_coord_from_data_array(precip_da, shp_dir, "awash")
awash_da = precip_da.where(precip_da.awash==0, other=np.nan)
awash_da.mean(dim="time").plot()
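The final where step can be tried in isolation. The sketch below replaces the rasterized shapefile with a hand-built region array (0 inside, NaN outside) on a tiny synthetic grid; all names and values in it are made up for illustration.

```python
import numpy as np
import xarray as xr

# Tiny synthetic field on a regular latitude/longitude grid.
da = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("latitude", "longitude"),
    coords={"latitude": [0.0, 1.0, 2.0], "longitude": [10.0, 11.0, 12.0, 13.0]},
)

# Stand-in for the rasterized shape: 0 inside the region, NaN elsewhere.
region = np.full((3, 4), np.nan)
region[1:, :2] = 0.0

# Attach the region ids as a 2D coordinate, then keep only cells with id 0.
da = da.assign_coords(region=(("latitude", "longitude"), region))
inside = da.where(da["region"] == 0)
```

Cells outside the region become NaN, so reductions such as inside.mean() only see the values inside it.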
You should have a look at the following packages:
salem and the region of interest example
regionmask
Both may get you to what you want.
