Simple conversion of netCDF4.Dataset to xarray Dataset - python

I know how to convert a netCDF4.Dataset to an xarray DataArray manually. However, I would like to know whether there is a simple and elegant way, e.g. using an xarray backend, to convert the following netCDF4.Dataset object to an xarray DataArray:
<type 'netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
Originating_or_generating_Center: US National Weather Service, National Centres for Environmental Prediction (NCEP)
Originating_or_generating_Subcenter: NCEP Ensemble Products
GRIB_table_version: 2,1
Type_of_generating_process: Ensemble forecast
Analysis_or_forecast_generating_process_identifier_defined_by_originating_centre: Global Ensemble Forecast System (GEFS)
Conventions: CF-1.6
history: Read using CDM IOSP GribCollection v3
featureType: GRID
History: Translated to CF-1.0 Conventions by Netcdf-Java CDM (CFGridWriter2)
Original Dataset = /data/ldm/pub/native/grid/NCEP/GEFS/Global_1p0deg_Ensemble/member/GEFS_Global_1p0deg_Ensemble_20170926_0600.grib2.ncx3#LatLon_181X360-p5S-180p0E; Translation Date = 2017-09-26T17:50:23.259Z
geospatial_lat_min: 0.0
geospatial_lat_max: 90.0
geospatial_lon_min: 0.0
geospatial_lon_max: 359.0
dimensions(sizes): time2(2), ens(21), isobaric1(12), lat(91), lon(360)
variables(dimensions): float32 u-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon), float64 time2(time2), int32 ens(ens), float32 isobaric1(isobaric1), float32 lat(lat), float32 lon(lon), float32 v-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon)
groups:
I've got this using siphon.ncss.

The next release of xarray (0.10) has support for this very thing, or at least for getting an xarray Dataset from a netCDF4 one, which is exactly what you're trying to do:
import netCDF4 as nc4
import xarray as xr
nc = nc4.Dataset('filename.nc', mode='r')  # Or from siphon.ncss
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
Or with siphon.ncss, this would look like:
from datetime import datetime
from siphon.catalog import TDSCatalog
import xarray as xr
gfs_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog'
'/grib/NCEP/GFS/Global_0p5deg/catalog.xml')
latest = gfs_cat.latest
ncss = latest.subset()
query = ncss.query().variables('Temperature_isobaric')
query.time(datetime.utcnow()).accept('netCDF4')
nc = ncss.get_data(query)
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
Until it's released, you could install xarray from master. Otherwise, the only other solution is to do everything manually.
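For reference, here is a rough sketch of what the manual route could look like, assuming a flat file with no groups and skipping any decoding of times or scale/offset (the filename is a placeholder):
import netCDF4 as nc4
import xarray as xr

nc = nc4.Dataset('filename.nc', mode='r')

# Copy every netCDF variable into a DataArray, keeping dims and attributes
variables = {}
for name, var in nc.variables.items():
    variables[name] = xr.DataArray(
        var[:], dims=var.dimensions,
        attrs={key: var.getncattr(key) for key in var.ncattrs()})

manual_ds = xr.Dataset(
    variables, attrs={key: nc.getncattr(key) for key in nc.ncattrs()})

# Ensure 1-D variables named after a dimension end up as coordinates
manual_ds = manual_ds.set_coords(
    [name for name in manual_ds.data_vars if name in manual_ds.dims])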

Related

python mask netcdf file with a shape file and dealing with LAT LONG in degree vs LAT LONG in meters

I am struggling to mask my netCDF dataset. I have managed to do something, but not in the proper way.
Basically, I have a shapefile and a netCDF dataset.
I read the shapefile as follows:
import geopandas as gpd
shp_noce = gpd.read_file(shapefile_path)
which reads as:
   DN                                           geometry
0   1  POLYGON ((660074.143 5155942.267, 660172.884 5...
Then, I read the file as
rain = xarray.open_dataset(ncfile_path)
and here is the result:
<xarray.Dataset>
Dimensions: (DATE: 14245, x: 641, y: 643)
Coordinates:
* DATE (DATE) datetime64[ns] 1980-01-01T12:00:00 ... 2018-1...
* x (x) float64 6.058e+05 6.061e+05 ... 7.656e+05 7.658e+05
* y (y) float64 5.06e+06 5.06e+06 ... 5.22e+06 5.22e+06
Data variables:
transverse_mercator |S1 ...
precipitation (DATE, y, x) float32 ...
Attributes:
CDI: Climate Data Interface version 1.9.9 (https://mpimet.mpg.de...
Conventions: CF-1.5
Title: Daily total precipitation Trentino-South Tyrol 250-meter re...
Created on: Fri Feb 26 21:30:51 2021
history: Fri Feb 26 23:31:30 2021: cdo -z zip -mergetime DAILYPCP_19...
CDO: Climate Data Operators version 1.9.9 (https://mpimet.mpg.de..
I have tried to follow some suggestions coming from other posts. First of all, I have tried this, which is based on rioxarray. This reads as:
rain.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=True)
This is the outcome:
raise MissingSpatialDimensionError(
MissingSpatialDimensionError: x dimension (lon) not found.
As far as I have understood, there could be a problem linking the shapefile and the netCDF dataset due to the projection units.
So, following what is reported here, I have done the following:
shp_noce.to_crs("epsg:3395")
However, I get the same error. I suppose this is because the fields in the netCDF dataset are named x and y.
What are your suggestions? Should I rename the fields? Should I "set_spatial_dims" as x and y?
If your data and shapefile are in the same CRS (mercator), all you need to do is to tell rioxarray that your spatial dims are x and y.
rain.rio.set_spatial_dims(x_dim="x", y_dim="y", inplace=True)
See the rioxarray API documentation for ds.rio.set_spatial_dims:
set_spatial_dims (x_dim: str, y_dim: str, inplace: bool = True) → Union[xarray.core.dataset.Dataset, xarray.core.dataarray.DataArray]
This sets the spatial dimensions of the dataset.
Parameters
x_dim (str) – The name of the x dimension.
y_dim (str) – The name of the y dimension.
inplace (bool, optional) – If True, it will modify the dataframe in place. Otherwise it will return a modified copy.
Returns
Dataset with spatial dimensions set.
Return type
xarray.Dataset | xarray.DataArray
You told it to look for a dimension named "lon", and it's telling you that lon isn't found in the dataset. That's because the x dimension is named "x" :)
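For context, a minimal sketch of what the masking step could then look like once the spatial dims are set (the EPSG code below is a placeholder; substitute the CRS the netCDF grid actually uses):
import geopandas as gpd
import rioxarray  # noqa: F401  (registers the .rio accessor)
import xarray as xr

CRS = "EPSG:32632"  # placeholder; replace with the grid's actual CRS

rain = xr.open_dataset(ncfile_path)
shp_noce = gpd.read_file(shapefile_path).to_crs(CRS)  # bring the polygons into the grid's CRS

# Declare spatial dims and CRS on the precipitation variable, then clip to the polygons
pr = rain["precipitation"].rio.set_spatial_dims(x_dim="x", y_dim="y")
pr = pr.rio.write_crs(CRS)
rain_masked = pr.rio.clip(shp_noce.geometry, shp_noce.crs, drop=False)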

mask NetCDF using shapefile and calculate average and anomaly for all polygons within the shapefile

There are several tutorials (example 1, example 2, example 3) about masking NetCDF data using a shapefile and calculating average measures. However, I was confused by those workflows for masking NetCDF and extracting measures such as the average, and those tutorials did not cover extracting anomalies (for example, the difference between the temperature in 2019 and a baseline average temperature).
I give an example here. I have downloaded monthly temperatures (download temperature file) from 2000 to 2019 and the state-level US shapefile (download shapefile). I want to get the state-level average temperature based on the monthly averages from 2000 to 2019, and the temperature anomaly of 2019 relative to the 2000-2010 baseline. Specifically, the final dataframe looks as follows:
state   avg_temp   anom_temp2019
AL      xx         xx
AR      xx         xx
...     ...        ...
WY      xx         xx
# Load libraries
%matplotlib inline
import regionmask
import numpy as np
import xarray as xr
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Read shapefile
us = gpd.read_file('./shp/state_cus.shp')
# Read gridded data
ds = xr.open_mfdataset('./temp/monthly_mean_t2m_*.nc')
......
I would really appreciate help providing an explicit workflow that could do the above task. Thanks a lot.
This can be achieved using regionmask. I don't use your files but rather the xarray tutorial data and Natural Earth data for the US states.
import numpy as np
import regionmask
import xarray as xr
# load polygons of US states
us_states_50 = regionmask.defined_regions.natural_earth.us_states_50
# load an example dataset
air = xr.tutorial.load_dataset("air_temperature")
# turn into monthly time resolution
air = air.resample(time="M").mean()
# create a mask
mask3D = us_states_50.mask_3D(air)
# latitude weights
wgt = np.cos(np.deg2rad(air.lat))
# calculate regional averages
reg_ave = air.weighted(mask3D * wgt).mean(("lat", "lon"))
# calculate the average temperature (over 2013-2014)
avg_temp = reg_ave.sel(time=slice("2013", "2014")).mean("time")
# calculate the anomaly (w.r.t. 2013-2014)
reg_ave_anom = reg_ave - avg_temp
# select a single timestep (January 2013)
reg_ave_anom_ts = reg_ave_anom.sel(time="2013-01")
# remove the time dimension
reg_ave_anom_ts = reg_ave_anom_ts.squeeze(drop=True)
# convert to a pandas dataframe so it's in tabular form
df = reg_ave_anom_ts.air.to_dataframe()
# set the state codes as index
df = df.set_index("abbrevs")
# remove other columns
df = df.drop(columns="names")
You can find info on how to use your own shapefile in the regionmask docs (Working with geopandas).
disclaimer: I am the main author of regionmask.
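For the shapefile in the question, building the regions could look roughly like this (the column names passed to from_geopandas are assumptions; check the shapefile's attribute table):
import geopandas as gpd
import regionmask
import xarray as xr

us = gpd.read_file('./shp/state_cus.shp')
ds = xr.open_mfdataset('./temp/monthly_mean_t2m_*.nc')

# "NAME" and "STUSPS" are assumed column names; adjust to the actual attribute table
states = regionmask.from_geopandas(us, names="NAME", abbrevs="STUSPS", name="US states")
mask3D = states.mask_3D(ds)  # then proceed with the weighted averages as above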

Pearson correlation matrix in Python

I'm characterizing biological samples with infrared spectroscopy (mid-infrared region) and the resulting data is used in predictive models for disease prediction. I'm now working on the spectrum data using supervised learning, and the first step is to prepare the data (smoothing, peak finding, peak filtering, etc.). I now have a 93x1 matrix (the dependent variable, i.e. the disease/no-disease label), where 93 is the number of samples, and a 93x210 matrix, where 210 is the number of wavelengths at which I can find the pre-filtered absorption peaks. From these 210 wavelengths I need to extract the features (absorption peaks) that I'll feed into my model. For this, I'm using a Pearson correlation matrix in Python, where the header is the 210 xi wavelengths. I want to find correlations between absorption peaks at wavelength xi and samples. The issue is that the resulting matrix gives me '1' everywhere.
Disclaimer: I'm a newbie in python
import io
import numpy as np
import pandas as pd
from google.colab import files
uploaded = files.upload()
df2 = pd.read_excel(io.BytesIO(uploaded['20191201-Peaks.xlsx']), header=None)
df2.columns=['Label_0_1','Sample',1700.105,...,1500.49]
print (df2.dtypes)
Label_0_1 object
Sample object
1700.105 float64
1699.141 float64
1698.177 float64
...
1504.35 float64
1503.38 float64
1502.42 float64
1501.45 float64
1500.49 float64
Length: 210, dtype: object
df2.shape
(93, 210)
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(150,150))
cor = df2.corr(method='pearson')
cor
(screenshot: the resulting correlation matrix, with 1 in every cell)
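For reference, a minimal sketch with synthetic data of what DataFrame.corr(method='pearson') returns: the pairwise Pearson correlation between columns, computed across the rows (samples), so a value of 1 means two columns are perfectly linearly related:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'w1': rng.normal(size=10),
    'w2': rng.normal(size=10),
})
toy['w3'] = 2.0 * toy['w1'] + 0.1  # perfectly (linearly) correlated with w1

print(toy.corr(method='pearson'))
# w1-w3 correlation is exactly 1.0; w1-w2 stays near 0 for independent noise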

How turn this monthly xarray dataset into an annual mean without resampling?

I have an xarray Dataset of monthly average surface temperatures, read in from a server using open_dataset with decode_times=False because the calendar type is not understood by xarray.
After some manipulation, I am left with a dataset my_dataset of surface temperatures ('ts') and times ('T'):
<xarray.Dataset>
Dimensions: (T: 1800)
Coordinates:
* T (T) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 ...
Data variables:
ts (T) float64 246.6 247.9 250.7 260.1 271.9 281.1 283.3 280.5 ...
'T' has the following attributes:
Attributes:
pointwidth: 1.0
calendar: 360
gridtype: 0
units: months since 0300-01-01
I would like to take this monthly data and calculate annual averages, but because the T coordinates aren't datetimes, I'm unable to use xarray.Dataset.resample. Right now, I am simply converting to a numpy array, but I would like a way to do this while preserving the xarray dataset.
My current, rudimentary way:
temps = np.mean(np.array(my_dataset['ts']).reshape(-1,12),axis=1)
years = np.array(my_dataset['T'])/12
I appreciate any help, even if the best way is redefining the time coordinate to use resampling.
Edit:
As requested, here is how the xarray Dataset was created:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
filename = 'http://strega.ldeo.columbia.edu:81/CMIP5/.byScenario/.abrupt4xCO2/.atmos/.mon/.ts/ACCESS1-0/r1i1p1/.ts/dods'
ds = xr.open_dataset(filename,decode_times=False)
zonal_mean = ds.mean(dim='lon')
arctic_only = zonal_mean.where(zonal_mean['lat'] >= 60).dropna('lat')
weights = np.cos(np.deg2rad(arctic_only['lat']))/np.sum(np.cos(np.deg2rad(arctic_only['lat'])))
my_dataset = (arctic_only * weights).sum(dim='lat')
This is a very common problem, especially with datasets from INGRID. The reason xarray can't decode dates whose units are "months since ..." is that the underlying netcdf4-python library refuses to parse such dates. This is discussed in a netcdf4-python GitHub issue:
The problem with time units such as "months" is that they are not well defined. In contrast to days, hours, etc. the length of a month depends on the calendar used and even varies between different months.
INGRID unfortunately refuses to accept this fact and continues to use "months" as its default unit, despite the ambiguity. So right now there is this frustrating incompatibility between INGRID and xarray / python-netcdf4.
Anyway, here is a hack to accomplish what you want without leaving xarray:
# create new coordinates for month and year
ds.coords['month'] = np.ceil(ds['T'] % 12).astype('int')
ds.coords['year'] = (ds['T'] // 12).astype('int')
# calculate monthly climatology
ds_clim = ds.groupby('month').mean(dim='T')
# calculate annual mean
ds_am = ds.groupby('year').mean(dim='T')
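Alternatively, as the question hints, the time coordinate can be rebuilt so that ordinary date-based grouping works. A hedged sketch using cftime's 360-day calendar, assuming T runs 0.5, 1.5, ... months since 0300-01-01:
import cftime

# Convert "months since 0300-01-01" offsets into mid-month dates on a 360-day calendar
month_index = ds['T'].values.astype(int)            # 0, 1, 2, ...
dates = [cftime.Datetime360Day(300 + m // 12, m % 12 + 1, 16) for m in month_index]

ds = ds.assign_coords(T=dates).rename({'T': 'time'})
annual_mean = ds.groupby('time.year').mean('time')  # annual averages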

Convert raster time series of multiple GeoTIFF images to NetCDF

I have a raster time series stored in multiple GeoTIFF files (*.tif) that I'd like to convert to a single NetCDF file. The data is uint16.
I could probably use gdal_translate to convert each image to netcdf using:
gdal_translate -of netcdf -co FORMAT=NC4 20150520_0164.tif foo.nc
and then some scripting with NCO to extract dates from filenames and then concatenate, but I was wondering whether I might do this more effectively in Python using xarray and its new rasterio backend.
I can read a file easily:
import glob
import xarray as xr
f = glob.glob('*.tif')
da = xr.open_rasterio(f[0])
da
which returns
<xarray.DataArray (band: 1, y: 5490, x: 5490)>
[30140100 values with dtype=uint16]
Coordinates:
* band (band) int64 1
* y (y) float64 5e+05 5e+05 5e+05 5e+05 5e+05 4.999e+05 4.999e+05 ...
* x (x) float64 8e+05 8e+05 8e+05 8e+05 8.001e+05 8.001e+05 ...
Attributes:
crs: +init=epsg:32620
and I can write one of these to NetCDF:
da.to_netcdf('foo.nc')
but ideally I would be able to use something like xr.open_mfdataset, write the time values (extracted from the filenames), and then write the entire aggregation to netCDF, with dask handling the out-of-core memory issues. :-)
Can something like this be done with xarray and dask?
Xarray should be able to do the concat step for you. I have adapted your example a bit below. It will be up to you to parse the filenames into something useful.
import glob
import pandas as pd
import xarray as xr
def time_index_from_filenames(filenames):
    '''helper function to create a pandas DatetimeIndex
    Filename example: 20150520_0164.tif'''
    return pd.DatetimeIndex([pd.Timestamp(f[:8]) for f in filenames])
filenames = glob.glob('*.tif')
time = xr.Variable('time', time_index_from_filenames(filenames))
chunks = {'x': 5490, 'y': 5490, 'band': 1}
da = xr.concat([xr.open_rasterio(f, chunks=chunks) for f in filenames], dim=time)
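From there, a rough sketch of finishing the job (the variable name and output path are placeholders); dask keeps everything lazy until to_netcdf triggers the computation:
da = da.rename('reflectance')      # placeholder variable name
da = da.sortby('time')             # make sure the stack is in chronological order
da.to_dataset().to_netcdf('stack.nc', format='NETCDF4')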
