How to turn this monthly xarray dataset into an annual mean without resampling? - python

I have an xarray Dataset of monthly average surface temperatures, read in from a server using open_dataset with decode_times=False because the calendar type is not understood by xarray.
After some manipulation, I am left with a dataset my_dataset of surface temperatures ('ts') and times ('T'):
<xarray.Dataset>
Dimensions: (T: 1800)
Coordinates:
* T (T) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 ...
Data variables:
ts (T) float64 246.6 247.9 250.7 260.1 271.9 281.1 283.3 280.5 ...
'T' has the following attributes:
Attributes:
pointwidth: 1.0
calendar: 360
gridtype: 0
units: months since 0300-01-01
I would like to take this monthly data and calculate annual averages, but because the T coordinate values aren't datetimes, I'm unable to use xarray.Dataset.resample. Right now I am simply converting to a numpy array, but I would like a way to do this that preserves the xarray dataset.
My current, rudimentary way:
# average each consecutive block of 12 months
temps = np.mean(np.array(my_dataset['ts']).reshape(-1,12), axis=1)
# convert "months since 0300-01-01" to fractional years
years = np.array(my_dataset['T'])/12
I appreciate any help, even if the best way is redefining the time coordinate to use resampling.
Edit:
Requested how xarray was created, it was done via the following:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
filename = 'http://strega.ldeo.columbia.edu:81/CMIP5/.byScenario/.abrupt4xCO2/.atmos/.mon/.ts/ACCESS1-0/r1i1p1/.ts/dods'
ds = xr.open_dataset(filename,decode_times=False)
zonal_mean = ds.mean(dim='lon')
arctic_only = zonal_mean.where(zonal_mean['lat'] >= 60).dropna('lat')
weights = np.cos(np.deg2rad(arctic_only['lat']))/np.sum(np.cos(np.deg2rad(arctic_only['lat'])))
my_dataset = (arctic_only * weights).sum(dim='lat')
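
As an aside, the block average in the question's numpy snippet can also be done without leaving xarray using coarsen. A minimal sketch, assuming a reasonably recent xarray version (coarsen was added in 0.12):
# average non-overlapping blocks of 12 consecutive months
annual = my_dataset.coarsen(T=12).mean()
# relabel the averaged T coordinate with integer years
annual['T'] = (my_dataset['T'].values[::12] // 12).astype(int)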

This is a very common problem, especially with datasets from INGRID. The reason xarray can't decode dates whose units are "months since ..." is that the underlying netcdf4-python library refuses to parse such dates. This is discussed in a netcdf4-python GitHub issue:
The problem with time units such as "months" is that they are not well defined. In contrast to days, hours, etc. the length of a month depends on the calendar used and even varies between different months.
INGRID unfortunately refuses to accept this fact and continues to use "months" as its default unit, despite the ambiguity. So right now there is this frustrating incompatibility between INGRID and xarray / python-netcdf4.
Anyway, here is a hack to accomplish what you want without leaving xarray:
# create new coordinates for month and year
ds.coords['month'] = np.ceil(ds['T'] % 12).astype('int')
ds.coords['year'] = (ds['T'] // 12).astype('int')
# calculate monthly climatology
ds_clim = ds.groupby('month').mean(dim='T')
# calculate annual mean
ds_am = ds.groupby('year').mean(dim='T')
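
If you would rather have real datetimes so that resample works, another option is to build the 360-day-calendar dates that "months since 0300-01-01" implies by hand with cftime. A sketch, assuming an xarray version recent enough to resample on a CFTimeIndex:
import xarray as xr
# 1800 monthly timestamps on a 360_day calendar, starting 0300-01-01
times = xr.cftime_range(start='0300-01-01', periods=1800,
                        freq='MS', calendar='360_day')
ds = ds.assign_coords(T=times).rename({'T': 'time'})
# annual means via ordinary resampling ('AS' = year start)
ds_annual = ds.resample(time='AS').mean('time')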

Related

How to convert time coordinate from "months since 2005-01-01 00:00:00" data type to normal time in Python?

I was working with the OHC NetCDF dataset. Its time coordinate has units of "months since 2005-01-01 00:00:00" and the data looks like array([ 0.5, 1.5, 2.5, ..., 207.5, 208.5, 209.5], dtype=float32). Here 0.5 presumably means 2005-01-15 00:00:00.
To summarize, I want the time coordinate of my dataset in datetime64[ns] instead of float32.
You can use the netCDF4 library to convert numbers to dates.
Here is an example:
import xarray as xr
from netCDF4 import num2date
import numpy as np
# Dummy data
ds = xr.Dataset()
ds.coords["time"] = np.arange(10)
ds.time.attrs['units'] = "days since 2005-01-01 00:00:00"
# Now let's convert it to cftime.DatetimeGregorian format
times_new = num2date(times=ds.time, units=ds.time.units)
# Now you can add it to the dataset
ds.coords["rtime"] = times_new
I solved this problem by manually adding the offsets to the starting date. Thanks, everyone, for your ideas and comments.
import pandas as pd

dates = []
for i in range(len(ohc.time.values)):
    # approximate a month as 30 days, so e.g. 0.5 months -> 15 days
    date = pd.to_datetime('2005-01-01') + pd.DateOffset(days=30*ohc['time'].values[i])
    dates.append(date)
ohc.coords["rtime"] = dates

Obtain mean value of specific area in netCDF

I am trying to plot a time series of the sea surface temperature (SST) for a specific region from a .nc file. The SST is a three-dimensional variable (lat, lon, time) that has mean daily values for a specific region from 1982 to 2016. I want my plot to reflect the seasonal SST variability of the entire period. I assume that what I need to do first is to obtain a mean SST value for my lat/lon region for each of the days, which I can then work with later on. So far, I assume that I need to read the .nc file and the variables:
import netCDF4 as nc
f = nc.Dataset('cmems_SST_MED_SST_L4_REP_OBSERVATIONS_010_021_1639073212518.nc')
sst = f.variables['analysed_sst'][:]
lon = f.variables['longitude'][:]
lat = f.variables['latitude'][:]
Next, following the code suggested here, I tried to reshape and obtain the mean, but an error pops up:
global_average= np.nanmean(sst[:,:,:],axis=(1,2))
annual_temp = np.nanmean(np.reshape(global_average, (34,12)), axis = 1)
#34 years between 1982 and 2016, and 12 months per year.
ERROR cannot reshape array of size 14008 into shape (34,12)
From here I found different approaches, like using cdo or nco (which didn't work due to installation problems) among others, which were not suitable for my case. I used nanmean because I know that in MATLAB this is done using the nanmean function. I am quite new to this topic and I would like to ask for some hints, like where I should focus or which path is most suitable for this case. Thank you!!
Handling daily data in pure Python is difficult, because you have to account for leap years, and subsetting a region requires tedious index striding.
As steTATO mentioned, since the data you are working with has daily temporal resolution, you need to consider the following.
You would need to reshape global_average into the shape (34,365) or (34,366) depending on the year (the leap years being 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012 and 2016). So your code above would look something like
annual_temp = np.nanmean(np.reshape(global_average, (34,365)), axis = 1)
But, like I said, because of the leap years you can't do what you want by simply reshaping global_average.
If I had no choice but to use Python only, I'd do the following:
import numpy as np

def days_in_year(in_year):
    leap_years = [1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016]
    if in_year in leap_years:
        out_days = 366
    else:
        out_days = 365
    return out_days

# some of your code, importing netcdf data
year = np.arange(1982, 2017)
global_avg = np.nanmean(sst[:,:,:], axis=(1,2))
annual_avgs = []
i = 0
for yr in range(35):
    i = i + days_in_year(year[yr])   # end index of this year's block
    f = i - days_in_year(year[yr])   # start index of this year's block
    annual_avg = np.nanmean(global_avg[f:i])
    annual_avgs.append(annual_avg)
The code above averages global_avg in year-sized strides, accounting for leap years, and saves the results in annual_avgs.
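
If you are open to using xarray instead of raw numpy, grouping by 'time.year' handles leap years for you. A short sketch against the same file, assuming its time coordinate decodes to datetimes:
import xarray as xr
ds = xr.open_dataset('cmems_SST_MED_SST_L4_REP_OBSERVATIONS_010_021_1639073212518.nc')
# spatial mean first, then group the daily values by calendar year
regional_mean = ds['analysed_sst'].mean(dim=['latitude', 'longitude'])
annual_means = regional_mean.groupby('time.year').mean('time')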

mask NetCDF using shapefile and calculate average and anomaly for all polygons within the shapefile

There are several tutorials (example 1, example 2, example 3) about masking NetCDF data using a shapefile and calculating average measures. However, I was confused by those workflows for masking NetCDF and extracting measures such as the average, and those tutorials did not cover extracting anomalies (for example, the difference between the temperature in 2019 and a baseline average temperature).
I make an example here. I have downloaded monthly temperature (download temperature file) from 2000 to 2019 and the state-level US shapefile (download shapefile). I want to get the state-level average temperature based on the monthly average temperature from 2000 to 2019, and the temperature anomaly of the year 2019 relative to the baseline temperature from 2000 to 2010. Specifically, the final dataframe looks as follows:
state  avg_temp  anom_temp2019
AL     xx        xx
AR     xx        xx
...    ...       ...
WY     xx        xx
# Load libraries
%matplotlib inline
import regionmask
import numpy as np
import xarray as xr
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Read shapefile
us = gpd.read_file('./shp/state_cus.shp')
# Read gridded data
ds = xr.open_mfdataset('./temp/monthly_mean_t2m_*.nc')
......
I would really appreciate an explicit workflow for doing the above task. Thanks a lot.
This can be achieved using regionmask. I don't use your files, but rather the xarray tutorial data and Natural Earth data for the US states.
import numpy as np
import regionmask
import xarray as xr
# load polygons of US states
us_states_50 = regionmask.defined_regions.natural_earth.us_states_50
# load an example dataset
air = xr.tutorial.load_dataset("air_temperature")
# turn into monthly time resolution
air = air.resample(time="M").mean()
# create a mask
mask3D = us_states_50.mask_3D(air)
# latitude weights
wgt = np.cos(np.deg2rad(air.lat))
# calculate regional averages
reg_ave = air.weighted(mask3D * wgt).mean(("lat", "lon"))
# calculate the average temperature (over 2013-2014)
avg_temp = reg_ave.sel(time=slice("2013", "2014")).mean("time")
# calculate the anomaly (w.r.t. 2013-2014)
reg_ave_anom = reg_ave - avg_temp
# select a single timestep (January 2013)
reg_ave_anom_ts = reg_ave_anom.sel(time="2013-01")
# remove the time dimension
reg_ave_anom_ts = reg_ave_anom_ts.squeeze(drop=True)
# convert to a pandas dataframe so it's in tabular form
df = reg_ave_anom_ts.air.to_dataframe()
# set the state codes as index
df = df.set_index("abbrevs")
# remove other columns
df = df.drop(columns="names")
You can find info on how to use your own shapefile in the regionmask docs (Working with geopandas).
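For completeness, a minimal sketch of that route (the column names 'NAME' and 'STUSPS' are assumptions; adjust them to whatever your shapefile actually contains):
import geopandas as gpd
import regionmask
us = gpd.read_file('./shp/state_cus.shp')
# build a Regions object from the GeoDataFrame (regionmask >= 0.6)
us_states = regionmask.from_geopandas(us, names='NAME', abbrevs='STUSPS',
                                      name='us_states')
mask3D = us_states.mask_3D(ds)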
disclaimer: I am the main author of regionmask.

compute and plot monthly mean SST anomalies and plot with xarray multindex (pangeo tutorial gallery)

I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of the xarray tutorial.
You'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches the one shown at the bottom of the tutorial. So far so good, but I'd like to compute and plot ONI as well. Warm or cold phases of the Oceanic Niño Index are defined by five consecutive 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above (below) the threshold of +0.5°C (-0.5°C). This is known as the Oceanic Niño Index (ONI).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni, with positive values in red and negative values in blue, for example with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead, oni.sst.plot() gives me a plot that is indexed by month as well as time.
Resetting the index enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you drop_dims('month') then the sst data goes away.
I also tried converting to a pandas DataFrame with oni.to_dataframe(), but you end up with 5040 rows, which is 12 months × the 420 months I subsetted. According to the docs, "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)", so I guess that makes sense, but it's not useful. Even if you reset_index on oni before converting to a DataFrame you get the same 5040 rows. Q2. Since the dataframe must be repeating itself, I can probably figure out where, but is there a way to do this more cleanly, with each date not repeated for all 12 months?
Your code results in a DataArray with the dimensions time and month, because subtracting the climatology broadcasts over the month coordinate. This is the reason why you end up with such a plot.
There is a trick (found here) to calculate anomalies. Besides this, I would select 1986-2015 as the reference period (see the NOAA definition of the ONI index).
Combining both, I ended up with this short piece of code (without the bar plot):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2016-01-01')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()
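
For the red/blue bar chart the question asked about, a minimal matplotlib sketch, continuing from the code above (assumes oni is one-dimensional in time):
import matplotlib.pyplot as plt
oni_valid = oni.dropna('time')  # the first two rolling means are NaN
colors = ['b' if v < 0 else 'r' for v in oni_valid.values]
plt.bar(oni_valid['time'].values, oni_valid.values, width=20, color=colors)
plt.axhline(0.5, color='k', linestyle='--', linewidth=0.5)
plt.axhline(-0.5, color='k', linestyle='--', linewidth=0.5)
plt.show()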

Simple conversion of netCDF4.Dataset to xarray Dataset

I know how to convert a netCDF4.Dataset to an xarray DataArray manually. However, I would like to know whether there is a simple and elegant way, e.g. using an xarray backend, to convert the following netCDF4.Dataset object to an xarray Dataset object:
<type 'netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
Originating_or_generating_Center: US National Weather Service, National Centres for Environmental Prediction (NCEP)
Originating_or_generating_Subcenter: NCEP Ensemble Products
GRIB_table_version: 2,1
Type_of_generating_process: Ensemble forecast
Analysis_or_forecast_generating_process_identifier_defined_by_originating_centre: Global Ensemble Forecast System (GEFS)
Conventions: CF-1.6
history: Read using CDM IOSP GribCollection v3
featureType: GRID
History: Translated to CF-1.0 Conventions by Netcdf-Java CDM (CFGridWriter2)
Original Dataset = /data/ldm/pub/native/grid/NCEP/GEFS/Global_1p0deg_Ensemble/member/GEFS_Global_1p0deg_Ensemble_20170926_0600.grib2.ncx3#LatLon_181X360-p5S-180p0E; Translation Date = 2017-09-26T17:50:23.259Z
geospatial_lat_min: 0.0
geospatial_lat_max: 90.0
geospatial_lon_min: 0.0
geospatial_lon_max: 359.0
dimensions(sizes): time2(2), ens(21), isobaric1(12), lat(91), lon(360)
variables(dimensions): float32 u-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon), float64 time2(time2), int32 ens(ens), float32 isobaric1(isobaric1), float32 lat(lat), float32 lon(lon), float32 v-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon)
groups:
I've got this using siphon.ncss.
The next release of xarray (0.10) has support for this very thing, or at least getting an xarray dataset from a netCDF4 one, for exactly the reason you're trying to use it:
import netCDF4 as nc4
import xarray as xr

nc = nc4.Dataset('filename.nc', mode='r')  # Or from siphon.ncss
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
Or with siphon.ncss, this would look like:
from datetime import datetime
from siphon.catalog import TDSCatalog
import xarray as xr
gfs_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog'
                     '/grib/NCEP/GFS/Global_0p5deg/catalog.xml')
latest = gfs_cat.latest
ncss = latest.subset()
query = ncss.query().variables('Temperature_isobaric')
query.time(datetime.utcnow()).accept('netCDF4')
nc = ncss.get_data(query)
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
Until it's released, you could install xarray from master. Otherwise, the only other solution is to do everything manually.
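
For reference, the manual route would look roughly like this (a sketch built from the dimension names in the printout above; only some coordinates are attached):
import xarray as xr
# build a DataArray by hand from the netCDF4 variables
u_wind = xr.DataArray(
    nc.variables['u-component_of_wind_isobaric_ens'][:],
    dims=('time2', 'ens', 'isobaric1', 'lat', 'lon'),
    coords={'ens': nc.variables['ens'][:],
            'isobaric1': nc.variables['isobaric1'][:],
            'lat': nc.variables['lat'][:],
            'lon': nc.variables['lon'][:]},
    name='u-component_of_wind_isobaric_ens')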
