xarray.Dataset conditionally indexing variables - python

Starting with an HRRR file downloaded from NCEP, read into an xarray.Dataset like...
ds: xr.Dataset = xr.open_dataset(file, engine="pynio")
<xarray.Dataset>
Dimensions: (ygrid_0: 1059, xgrid_0: 1799, lv_HYBL0: 50,
lv_HTGL1: 2, lv_HTGL2: 2, lv_TMPL3: 2,
lv_SPDL4: 3, lv_HTGL5: 2, lv_HTGL6: 2,
lv_DBLL7: 2, lv_HTGL8: 2, lv_HTGL9: 3)
Coordinates:
* lv_HTGL6 (lv_HTGL6) float32 1e+03 4e+03
* lv_TMPL3 (lv_TMPL3) float32 253.0 263.0
* lv_HTGL1 (lv_HTGL1) float32 10.0 80.0
* lv_HYBL0 (lv_HYBL0) float32 1.0 2.0 3.0 ... 49.0 50.0
gridlat_0 (ygrid_0, xgrid_0) float32 ...
gridlon_0 (ygrid_0, xgrid_0) float32 ...
Dimensions without coordinates: ygrid_0, xgrid_0, lv_HTGL2, lv_SPDL4, lv_HTGL5,
lv_DBLL7, lv_HTGL8, lv_HTGL9
Data variables: (12/149)
TMP_P0_L1_GLC0 (ygrid_0, xgrid_0) float32 ...
TMP_P0_L103_GLC0 (ygrid_0, xgrid_0) float32 ...
TMP_P0_L105_GLC0 (lv_HYBL0, ygrid_0, xgrid_0) float32 ...
POT_P0_L103_GLC0 (ygrid_0, xgrid_0) float32 ...
DPT_P0_L103_GLC0 (ygrid_0, xgrid_0) float32 ...
LHTFL_P0_L1_GLC0 (ygrid_0, xgrid_0) float32 ...
... ...
lv_HTGL5_l0 (lv_HTGL5) float32 ...
lv_SPDL4_l1 (lv_SPDL4) float32 ...
lv_SPDL4_l0 (lv_SPDL4) float32 ...
lv_HTGL2_l1 (lv_HTGL2) float32 ...
lv_HTGL2_l0 (lv_HTGL2) float32 ...
gridrot_0 (ygrid_0, xgrid_0) float32 ...
For the time being I am only concerned with variables that contain these three common coordinates: [lv_HYBL0, gridlat_0, gridlon_0].
I can manually select/index the variables that have the coordinates I want, like...
ds[["TMP_P0_L105_GLC0", ...]]
but I would prefer a more abstract method. In pandas I would do some sort of boolean indexing along the lines of ds[ds.variables[ds.coords.isin(["gridlat_0", "gridlon_0", "lv_HYBL0"])]].
Unfortunately, this does not work.
How can I select Variables based on a condition where the Variable is tied to a Coordinate?

You can still do something similar: filter the dataset's variables with a list of keys, and test each array's dims attribute, which is a tuple. One wrinkle: gridlat_0 and gridlon_0 are two-dimensional coordinates defined on the ygrid_0 and xgrid_0 dimensions, so those are the dimension names to match against.
In this case:
required_dims = ["lv_HYBL0", "ygrid_0", "xgrid_0"]
# sorted tuple, so the comparison is order-independent
required_dims = tuple(sorted(required_dims))

subset = ds[[
    k for k, v in ds.data_vars.items()
    if tuple(sorted(v.dims)) == required_dims
]]
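A minimal runnable sketch of that filter, on a toy dataset with made-up variable names and sizes (TMP_3d and TMP_2d are placeholders, not variables from the HRRR file):

```python
import numpy as np
import xarray as xr

# Toy dataset: one 3-D variable on the wanted dims, one 2-D variable.
ds = xr.Dataset(
    {
        "TMP_3d": (("lv_HYBL0", "ygrid_0", "xgrid_0"), np.zeros((2, 3, 4))),
        "TMP_2d": (("ygrid_0", "xgrid_0"), np.zeros((3, 4))),
    }
)

required_dims = tuple(sorted(["lv_HYBL0", "ygrid_0", "xgrid_0"]))
subset = ds[[
    k for k, v in ds.data_vars.items()
    if tuple(sorted(v.dims)) == required_dims
]]
print(list(subset.data_vars))  # ['TMP_3d']
```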

I found that the drop_dims method worked sufficiently. Note that the mask needs np.any, not np.all, so that a dimension is kept when it matches any of the given names:
from typing import Mapping

import numpy as np

def dont_drop(dims: Mapping, *args: str):
    # Return the dimension names that are NOT listed in args.
    a = np.array(tuple(dims.keys()))
    mask = np.any(a == np.array(args)[:, np.newaxis], axis=0)
    return a[~mask]

ds.drop_dims(dont_drop(ds.dims, "lv_HYBL0", "ygrid_0", "xgrid_0"))
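The broadcasting trick can also be written with np.isin, which expresses the membership test directly. A self-contained sketch with made-up dimension sizes:

```python
from typing import Mapping

import numpy as np

def dont_drop(dims: Mapping, *keep: str):
    # Dimension names present in `dims` but not in `keep`,
    # i.e. the ones to pass to ds.drop_dims().
    names = np.array(tuple(dims))
    return names[~np.isin(names, keep)]

# Hypothetical dims mapping mimicking the dataset above.
dims = {"lv_HYBL0": 50, "ygrid_0": 1059, "xgrid_0": 1799, "lv_HTGL1": 2}
print(dont_drop(dims, "lv_HYBL0", "ygrid_0", "xgrid_0"))  # ['lv_HTGL1']
```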

Related

OSError when extracting values from xarray DataArray

I have a dataset containing windspeeds at multiple pressure levels for 3 consecutive months:
import xarray as xr
da = xr.open_dataset('autumn_data.grib', engine='cfgrib')
In[1]: da
Out[1]:
<xarray.Dataset>
Dimensions: (time: 2208, isobaricInhPa: 11, latitude: 161, longitude: 401)
Coordinates:
number int32 ...
* time (time) datetime64[ns] 2020-08-01 ... 2020-10-31T23:00:00
step timedelta64[ns] ...
* isobaricInhPa (isobaricInhPa) float64 1e+03 950.0 900.0 ... 550.0 500.0
* latitude (latitude) float64 70.0 69.75 69.5 69.25 ... 30.5 30.25 30.0
* longitude (longitude) float64 -90.0 -89.75 -89.5 ... 9.5 9.75 10.0
valid_time (time) datetime64[ns] ...
Data variables:
u (time, isobaricInhPa, latitude, longitude) float32 ...
v (time, isobaricInhPa, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2022-06-19T13:42 GRIB to CDM+CF via cfgrib-0.9.1...
I loop over this dataset to make a numpy array with windspeeds in both directions for 40 consecutive hours.
import numpy as np
RUNTIME = 40
TIMES = da.time.values
for t in range(len(TIMES) - RUNTIME + 1):
    WIND = np.stack((
        da['u'].sel(time=xr.DataArray(TIMES[t:t + RUNTIME])).values,
        da['v'].sel(time=xr.DataArray(TIMES[t:t + RUNTIME])).values,
    ))
This worked fine, until some point where I got an error.
In[2]: da['u'].sel(time = xr.DataArray(TIMES[716:716+RUNTIME])).values
Traceback (most recent call last):
Input In [2] in <cell line: 1>
da['u'].sel(time = xr.DataArray(TIMES[716:716+RUNTIME])).values
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\dataarray.py:646 in values
return self.variable.values
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\variable.py:519 in values
return _as_array_or_item(self._data)
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\variable.py:259 in _as_array_or_item
data = np.asarray(data)
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:551 in __array__
self._ensure_cached()
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:548 in _ensure_cached
self.array = NumpyIndexingAdapter(np.asarray(self.array))
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:521 in __array__
return np.asarray(self.array, dtype=dtype)
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:422 in __array__
return np.asarray(array[self.key], dtype=None)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\xarray_plugin.py:144 in __getitem__
return xr.core.indexing.explicit_indexing_adapter(
File ~\anaconda3\envs\thesis\lib\site-packages\xarray\core\indexing.py:711 in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\xarray_plugin.py:150 in _getitem
return self.array[key]
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\dataset.py:342 in __getitem__
message = self.index.get_field(message_ids[0]) # type: ignore
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:472 in get_field
return ComputedKeysAdapter(self.fieldset[message_id], self.computed_keys)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:332 in __getitem__
return self.message_from_file(file, offset=item)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:328 in message_from_file
return Message.from_file(file, offset, **kwargs)
File ~\anaconda3\envs\thesis\lib\site-packages\cfgrib\messages.py:91 in from_file
file.seek(offset)
OSError: [Errno 22] Invalid argument
This seems very strange, because I didn't get this error in the previous iterations, and because I can create the corresponding DataArray:
In[3]: da['u'].sel(time = xr.DataArray(TIMES[716:716+RUNTIME]))
Out[3]:
<xarray.DataArray 'u' (dim_0: 40, isobaricInhPa: 11, latitude: 161, longitude: 401)>
[28406840 values with dtype=float32]
Coordinates:
number int32 0
time (dim_0) datetime64[ns] 2020-08-30T20:00:00 ... 2020-09-01T...
step timedelta64[ns] 00:00:00
* isobaricInhPa (isobaricInhPa) float64 1e+03 950.0 900.0 ... 550.0 500.0
* latitude (latitude) float64 70.0 69.75 69.5 69.25 ... 30.5 30.25 30.0
* longitude (longitude) float64 -90.0 -89.75 -89.5 ... 9.5 9.75 10.0
valid_time (dim_0) datetime64[ns] 2020-08-30T20:00:00 ... 2020-09-01T...
Dimensions without coordinates: dim_0
Attributes:
GRIB_paramId: 131
GRIB_dataType: an
GRIB_numberOfPoints: 64561
GRIB_typeOfLevel: isobaricInhPa
GRIB_stepUnits: 1
GRIB_stepType: instant
GRIB_gridType: regular_ll
GRIB_NV: 0
GRIB_Nx: 401
GRIB_Ny: 161
GRIB_cfName: eastward_wind
GRIB_cfVarName: u
GRIB_gridDefinitionDescription: Latitude/Longitude Grid
GRIB_iDirectionIncrementInDegrees: 0.25
GRIB_iScansNegatively: 0
GRIB_jDirectionIncrementInDegrees: 0.25
GRIB_jPointsAreConsecutive: 0
GRIB_jScansPositively: 0
GRIB_latitudeOfFirstGridPointInDegrees: 70.0
GRIB_latitudeOfLastGridPointInDegrees: 30.0
GRIB_longitudeOfFirstGridPointInDegrees: -90.0
GRIB_longitudeOfLastGridPointInDegrees: 10.0
GRIB_missingValue: 9999
GRIB_name: U component of wind
GRIB_shortName: u
GRIB_totalNumber: 0
GRIB_units: m s**-1
long_name: U component of wind
units: m s**-1
standard_name: eastward_wind
Why does this occur? And how can I fix this?

Is there another alternative to DataArray.sel(time = slice(x, y)) in xarray?

For some reason, DataArray.sel(time = slice(x, y)) works for me without any problem for the months of January to June, where x and y both take values ranging from 1 for January to 6 for June. However, this method does not work for July to December. I have checked the input data, which is a netCDF4 file, and it is not corrupted. Therefore, I am looking for an alternative to DataArray.sel(time = slice(x, y)) in xarray to extract the data for the months of July to December.
The code is as follows:
import xarray as xr
td = xr.open_dataset(r'C:\Users\abc\Desktop\misc\netcdf_to_geotiff\ECLIPSEv5_monthly_patterns.nc')
td_agr = td.agr
td_agrtime = td_agr.sel(time = slice('1', '1'))
which gives the output:
In [7]: td_agrtime
Out[7]:
<xarray.DataArray 'agr' (time: 1, lat: 360, lon: 720)>
[259200 values with dtype=float64]
Coordinates:
* lat (lat) float64 -89.75 -89.25 -88.75 -88.25 ... 88.75 89.25
89.75
* lon (lon) float64 -179.8 -179.2 -178.8 -178.2 ... 178.8 179.2
179.8
* time (time) int32 1
Attributes:
long_name: Monthly weights - Agriculture (animals, rice, soil)
sector: Agriculture (animals, rice, soil)
If the 1 is changed to 7 in the code as follows:
td_agrtime = td_agr.sel(time = slice('7', '7'))
the output is:
In [9]: td_agrtime
Out[9]:
<xarray.DataArray 'agr' (time: 6, lat: 360, lon: 720)>
[1555200 values with dtype=float64]
Coordinates:
* lat (lat) float64 -89.75 -89.25 -88.75 -88.25 ... 88.75 89.25
89.75
* lon (lon) float64 -179.8 -179.2 -178.8 -178.2 ... 178.8 179.2
179.8
* time (time) int32 7 8 9 10 11 12
Attributes:
long_name: Monthly weights - Agriculture (animals, rice, soil)
sector: Agriculture (animals, rice, soil)
Thanks to Robert Davy for his comment. The answer is to use .isel() instead of .sel().
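Since time here is an integer coordinate running 1 through 12, positional indexing sidesteps the string-slicing ambiguity entirely: month 7 sits at position 6. A minimal sketch on a made-up array with the same kind of time axis (the data values are placeholders):

```python
import numpy as np
import xarray as xr

# Toy array with an int32 month coordinate, like the file above.
agr = xr.DataArray(
    np.zeros((12, 4, 5)),
    coords={"time": np.arange(1, 13, dtype="int32")},
    dims=("time", "lat", "lon"),
)

# Positional: month 7 (July) is at index 6.
july = agr.isel(time=slice(6, 7))
print(july.time.values)  # [7]

# Label-based selection with integers (not strings '7') also works
# on an integer index; label slices are inclusive of both endpoints.
july_too = agr.sel(time=slice(7, 7))
```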

Python Change Dimension and Coordinates Xarray Dataset

I have an xarray Dataset that looks like the one below. I need to be able to plot any of the three data variables (si10, si10_u, avg) by latitude and longitude. However, I cannot figure out how to change the dimensions from index_id to latitude and longitude, or how to delete "index_id" from the coordinates; when I've tried that, 'latitude' and 'longitude' disappear from the coordinates too. Thank you for suggestions.
Here is my xarray Dataset:
<xarray.Dataset>
Dimensions: (index: 2448, index_id: 2448)
Coordinates:
* index_id (index_id) MultiIndex
- latitude (index_id) float64 58.0 58.0 58.0 58.0 ... 23.0 23.0 23.0 23.0
- longitude (index_id) float64 -130.0 -129.0 -128.0 ... -65.0 -64.0 -63.0
Dimensions without coordinates: index
Data variables:
si10 (index) float32 1.7636629 1.899161 ... 5.9699616 5.9121003
si10_u (index) float32 1.6784391 1.7533684 ... 6.13361 6.139127
avg (index) float32 1.721051 1.8262646 ... 6.0517855 6.025614
You have two issues. First, you need to replace 'index' with 'index_id' so your data is indexed consistently. Second, to unstack 'index_id', you're looking for xr.Dataset.unstack:
ds = ds.unstack('index_id')
As an example... here's a dataset like yours
In [16]: y = np.arange(58, 23, -1)
...: x = np.arange(-130, -63, 1)
In [17]: ds = xr.Dataset(
...: data_vars={
...: v: (("index",), np.random.random(len(x) * len(y)))
...: for v in ["si10", "si10_u", "avg"]
...: },
...: coords={
...: "index_id": pd.MultiIndex.from_product(
...: [y, x], names=["latitude", "longitude"],
...: ),
...: },
...: )
In [18]: ds
Out[18]:
<xarray.Dataset>
Dimensions: (index: 2345, index_id: 2345)
Coordinates:
* index_id (index_id) MultiIndex
- latitude (index_id) int64 58 58 58 58 58 58 58 58 ... 24 24 24 24 24 24 24
- longitude (index_id) int64 -130 -129 -128 -127 -126 ... -68 -67 -66 -65 -64
Dimensions without coordinates: index
Data variables:
si10 (index) float64 0.9412 0.7395 0.6843 ... 0.03979 0.4259 0.09203
si10_u (index) float64 0.7359 0.1984 0.5919 ... 0.5535 0.2867 0.4093
avg (index) float64 0.04257 0.1442 0.008705 ... 0.1911 0.2669 0.1498
First, reorganize your data to have consistent dims:
In [19]: index_id = ds['index_id']
In [20]: ds = (
...: ds.drop("index_id")
...: .rename({"index": "index_id"})
...: .assign_coords(index_id=index_id)
...: )
Then, ds.unstack reorganizes the data to be the combinatorial product of all dimensions in the MultiIndex:
In [21]: ds.unstack("index_id")
Out[21]:
<xarray.Dataset>
Dimensions: (latitude: 35, longitude: 67)
Coordinates:
* latitude (latitude) int64 24 25 26 27 28 29 30 31 ... 52 53 54 55 56 57 58
* longitude (longitude) int64 -130 -129 -128 -127 -126 ... -67 -66 -65 -64
Data variables:
si10 (latitude, longitude) float64 0.9855 0.1467 ... 0.6569 0.9479
si10_u (latitude, longitude) float64 0.4672 0.2664 ... 0.4894 0.128
avg (latitude, longitude) float64 0.3738 0.01793 ... 0.1264 0.21

Multiply xarray datasets with different dimensions

I have two NetCDF files: one covers the continental US (dataset2) and the other only the northeast (dataset1). I'm trying to multiply the two together to create one dataset; however, I get a ValueError when doing the multiplication.
import xarray
dataset1=xarray.open_dataset('../data/precip.nc')
print(dataset1)
Output:
<xarray.Dataset>
Dimensions: (time: 24, x: 180, y: 235)
Coordinates:
* time (time) datetime64[ns] 2019-02-14 ... 2019-02-14T23:00:00
* y (y) float64 -4.791e+06 -4.786e+06 ... -3.681e+06 -3.677e+06
* x (x) float64 2.234e+06 2.238e+06 2.243e+06 ... 3.081e+06 3.086e+06
lat (y, x) float64 ...
lon (y, x) float64 ...
Data variables:
z (y, x) float64 ...
crs int32 ...
PRECIP (time, y, x) float32 ...
dataset2=xarray.open_dataset('../data/ratio.nc')
print(dataset2)
Output:
<xarray.Dataset>
Dimensions: (lat: 272, lon: 480, nv: 2)
Coordinates:
* lat (lat) float64 21.06 21.19 21.31 21.44 ... 54.69 54.81 54.94
* lon (lon) float64 -125.9 -125.8 -125.7 ... -66.31 -66.19 -66.06
Dimensions without coordinates: nv
Data variables:
lat_bounds (lat, nv) float64 ...
lon_bounds (lon, nv) float64 ...
crs int16 ...
Data (lat, lon) float32 ...
# Merge datasets
data=xarray.merge([dataset1, dataset2], compat='override')
print(data)
Output:
<xarray.Dataset>
Dimensions: (lat: 272, lon: 480, nv: 2, time: 24, x: 180, y: 235)
Coordinates:
* time (time) datetime64[ns] 2019-02-14 ... 2019-02-14T23:00:00
* y (y) float64 -4.791e+06 -4.786e+06 ... -3.681e+06 -3.677e+06
* x (x) float64 2.234e+06 2.238e+06 ... 3.081e+06 3.086e+06
* lat (lat) float64 21.06 21.19 21.31 21.44 ... 54.69 54.81 54.94
* lon (lon) float64 -125.9 -125.8 -125.7 ... -66.31 -66.19 -66.06
Dimensions without coordinates: nv
Data variables:
z (y, x) float64 ...
crs int32 ...
PRECIP (time, y, x) float32 ...
lat_bounds (lat, nv) float64 ...
lon_bounds (lon, nv) float64 ...
Data (lat, lon) float32 ...
# Get first hour of precip data
precip=data.PRECIP[0, :, :]
# Get ratio data
slr=data.Data
# Multiply to get snowfall
snow=slr*precip
That last line then give me this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-b9ed8e05f451> in <module>
----> 1 snow=slr*precip
~/.local/lib/python3.7/site-packages/xarray/core/dataarray.py in func(self, other)
2597 variable = (
2598 f(self.variable, other_variable)
-> 2599 if not reflexive
2600 else f(other_variable, self.variable)
2601 )
~/.local/lib/python3.7/site-packages/xarray/core/variable.py in func(self, other)
2034 new_data = (
2035 f(self_data, other_data)
-> 2036 if not reflexive
2037 else f(other_data, self_data)
2038 )
ValueError: iterator is too large
Solved following https://gis.stackexchange.com/questions/339463/using-xarray-to-resample-and-merge-two-datasets: interpolate the ratio grid onto the precipitation grid first, then multiply:
slr_interpolate = slr.interp(lat=precip["lat"], lon=precip["lon"])
mpe_snowfall = slr_interpolate * precip
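The interp-then-multiply pattern can be sketched on toy data. Note this is simplified relative to the question: both fields below share 1-D lat/lon dimensions, whereas the real precipitation file sits on a projected y/x grid, which would need regridding onto lat/lon first. All grids and values here are made up:

```python
import numpy as np
import xarray as xr

# Coarse "ratio" field on a 1-D lat/lon grid (stand-in for dataset2).
ratio = xr.DataArray(
    np.full((4, 6), 10.0),
    coords={"lat": np.linspace(20, 50, 4), "lon": np.linspace(-120, -70, 6)},
    dims=("lat", "lon"),
)

# Finer "precip" field on its own grid, fully inside the coarse one.
precip = xr.DataArray(
    np.full((7, 11), 2.0),
    coords={"lat": np.linspace(25, 45, 7), "lon": np.linspace(-110, -80, 11)},
    dims=("lat", "lon"),
)

# Interpolate the coarse field onto the fine grid; the two arrays then
# share identical coordinates, so the product keeps the fine grid's shape
# instead of broadcasting into a huge outer product.
ratio_on_fine = ratio.interp(lat=precip["lat"], lon=precip["lon"])
snow = ratio_on_fine * precip
print(snow.shape)  # (7, 11)
```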

Slider labels do not correspond to the title

I have this xarray dataset defined as ds:
<xarray.Dataset>
Dimensions: (bnds: 2, lag: 61, plev: 63)
Coordinates:
* plev (plev) float64 1e+03 925.0 850.0 800.0 780.0 750.0 700.0 ...
* lag (lag) int64 -30 -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 -19 ...
* bnds (bnds) int64 0 1
Data variables:
time_bnds (lag, plev, bnds) float64 -3.468e+04 -3.468e+04 -3.468e+04 ...
lat_bnds (lag, plev, bnds) float64 -48.24 -51.95 -48.24 -51.95 -48.24 ...
lon_bnds (lag, plev, bnds) float64 -318.8 -322.5 -318.7 -322.5 -318.7 ...
plev_bnds (lag, plev, bnds) float64 -1e+05 -9.25e+04 -9.25e+04 -8.5e+04 ...
accelogw (lag, plev) float64 -0.001869 0.05221 0.04774 0.02534 0.02233 ...
where the plev coordinate (pressure level) is decreasing from 1000 to 0.0007 hPa.
I define geoviews dataset like this:
import geoviews as gv
kdims = ['plev', 'lag']
vdims = ['accelogw']
dataset = gv.Dataset(ds, kdims=kdims, vdims = vdims)
I create the holoviews Curve object in this way:
%%opts Curve [xrotation=25] NdOverlay [fig_size=300 aspect=1.2]
dataset.to(hv.Curve, 'lag')
resulting in a series of frames. In the first frame, the slider label shows a pressure level of 0.0007 hPa, contrary to the title, which shows a pressure level of 1000 hPa. Is this a bug or the default behavior of holoviews/geoviews for dimensions?
Thanks for your time.
EDIT: I have holoviews on v1.6.2, geoviews on v1.1.0 and xarray on v0.8.2.
