Python netCDF - Monthly median over the whole time period of daily data

I have a NetCDF file input.nc. This file represents nearly 18 years of data sampled every 4 days. From this file, I would like to calculate the monthly median value over the whole time period, so the output file should only contain 12 time steps.
I am using the following code:
import os
import xarray as xr

os.chdir(inbasedir)  # inbasedir is defined elsewhere in the script
data = xr.open_dataset('input.nc')
# this produces one median per calendar month in the record, not per month of the year
monthly_data = data.resample(freq='m', dim='time', how='median')
monthly_data.to_netcdf("test.nc")
Unfortunately, when I look at the output file, my code has computed the median for every individual month of the whole time series, and I end up with more than 200 values. How can I change my code so that it calculates the 12 monthly medians over the whole time period?

You want to use the groupby method:
monthly_data = data.groupby('time.month').median()
There are some good examples of how to use xarray with timeseries data here: http://xarray.pydata.org/en/stable/time-series.html
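Put together with the file names from the question, a minimal sketch (assuming the file has a time coordinate named 'time') would be:
import xarray as xr

data = xr.open_dataset('input.nc')
# group every time step by its calendar month and reduce over the whole record;
# the result has a 'month' dimension of length 12
monthly_data = data.groupby('time.month').median(dim='time')
monthly_data.to_netcdf('test.nc')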

Related

Calculating mean total seasonal precipitation using python

I am new to Python and am using it to analyse climate data in NetCDF files. I want to calculate the total precipitation for each season in each year and then average these seasonal totals across the time period (i.e. an average for DJF over all years in the file, an average for MAM, etc.).
Here is what I thought to do:
import xarray as xr
import matplotlib.pyplot as plt

fn1 = 'cru_fixed.nc'
ds1 = xr.open_dataset(fn1)
ds1_season = ds1['pre'].groupby('time.season').mean('time')
#Then plot each season
ds1_season.plot(col='season')
plt.show()
The original file contains monthly totals of precipitation. This is calculating an average for each season, but I need the sum of Dec, Jan and Feb, the sum of Mar, Apr, May, etc. for each season in each year. How do I sum and then average over the years?
If I'm not mistaken, you need to first resample your data to get the sum for each season in a DataArray, and then average these sums over multiple years.
To resample:
sum_of_seasons = ds1['pre'].resample(time='Q').sum(dim="time")
resample is an operator for upsampling or downsampling time series; it uses pandas time offsets.
However, be careful to choose the right offset, as it defines which months are included in each season. Depending on your needs, you may want to use "Q", "QS", or an anchored offset like "QS-DEC".
To get the same splitting as "time.season", the offset should be "QS-DEC", I believe.
Then to group over multiple years, same as you did above:
result = sum_of_seasons.groupby('time.season').mean('time')
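Combined, a minimal sketch using the file and variable names from the question could look like this (note that the first and last quarters of the record may be incomplete, so you may want to drop them before averaging):
import xarray as xr

ds1 = xr.open_dataset('cru_fixed.nc')

# sum the monthly totals within each season; 'QS-DEC' starts the quarters
# in December so they line up with DJF, MAM, JJA and SON
sum_of_seasons = ds1['pre'].resample(time='QS-DEC').sum(dim='time')

# average the seasonal sums across all years
result = sum_of_seasons.groupby('time.season').mean('time')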

Produce daily forecasts from monthly averages using Python Pandas

I have daily data going back several years. I first wanted to see the monthly average of these values, and then project this monthly average forward as a forecast for the next few years, so I have written the following code.
For example, my forecast for the next few Januaries will be the average of the last few Januaries, and the same for February, March, etc. Over the past few years my January number is 51.8111, so for the Januaries in my forecast period I want every day in every January to be this 51.8111 number (i.e. moving from monthly to daily granularity).
My question is: my code seems a bit long-winded and, with the loop, could potentially be a little slow. For my own learning, what is a better way of taking daily data, averaging it by a time period, and then projecting out this time period? I was looking at the map and apply functions within Pandas, but couldn't quite work it out.
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(0)
# create random dataframe of daily values
df = pd.DataFrame(np.random.randint(low=0, high=100, size=2317),
                  columns=['value'],
                  index=pd.date_range(start='2014-01-01', end=dt.date.today()-dt.timedelta(days=1), freq='D'))
# gain average by month over entire date range
df_by_month = df.groupby(df.index.month).mean()
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = 0
# project forward the monthly average to each day
for val in df_forecast.index:
    df_forecast.loc[val]['value'] = df_by_month.loc[val.month]
# create new dataframe joining together the historical value and forecast
df_complete = df.append(df_forecast)
I think you need Index.map, mapping the month of each forecast date to the corresponding value from df_by_month:
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = df_forecast.index.month.map(df_by_month['value'])
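To finish the example from the question, the mapped forecast can then be joined onto the history; a small sketch (pd.concat is used here because DataFrame.append has since been deprecated):
# map each forecast day's month to the historical monthly mean, then join with history
df_forecast['value'] = df_forecast.index.month.map(df_by_month['value'])
df_complete = pd.concat([df, df_forecast])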

maximum difference between two time series of different resolution

I have two time series that give the electricity demand at one-hour resolution and at five-minute resolution. I am trying to find the maximum difference between these two time series. The one-hour resolution data has 8760 rows (hourly for a year) and the 5-minute resolution data has 104,722 rows (5-minutely for a year).
The only method I can think of is to expand the hourly data to 5-minute resolution, repeating each hourly value 12 times, and then find the maximum of the difference between the two data sets.
If this technique is the way to go, is there an easy way to convert my hourly data into 5-minute resolution by repeating each hourly value 12 times?
For your reference, I have posted a plot of this data for one day.
P.S. I am using Python to do this task.
NumPy's .repeat() function
You can change your hourly data into 5-minute data by using NumPy's repeat function:
import numpy as np
np.repeat(hourly_data, 12)
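If you go this route, the maximum difference can then be taken directly. A sketch, assuming hourly_data and five_min_data are NumPy arrays (or Series values) covering exactly the same period, so the lengths match after repeating (8760 × 12 = 105,120 values):
import numpy as np

# put both series on the same 5-minute grid by repeating each hourly value 12 times
hourly_expanded = np.repeat(hourly_data, 12)

# largest absolute gap between the two series
max_diff = np.max(np.abs(hourly_expanded - five_min_data))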
I would strongly recommend against converting the hourly data into five-minute data. If the data in both cases refers to the mean load of those time ranges, you'll be looking at more accurate data if you group the five-minute intervals into hourly datasets. You'd get more granularity the way you're talking about, but the granularity is not based on accurate data, so you're not actually getting more value from it. If you aggregate the five-minute chunks into hourly chunks and compare the series that way, you can be more confident in the trustworthiness of your results.
In order to group them together to get that result, you can define a function like the following and use the apply method like so:
from datetime import datetime

def to_hour(date):
    # zero out minutes and seconds so each timestamp maps to the top of its hour
    date = date.strftime("%Y-%m-%d %H:00:00")
    date = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
    return date

df['Aggregated_Datetime'] = df['Original_Datetime'].apply(lambda x: to_hour(x))
df.groupby('Aggregated_Datetime').agg('Real-Time Lo
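Equivalently, if the 5-minute data sits in a DataFrame with a DatetimeIndex, pandas' resample can do the hourly aggregation in one step. A sketch with assumed names (df_5min and df_hourly, each with a 'load' column, and the hourly series stamped at the start of each hour):
# average the 5-minute readings within each hour
hourly_from_5min = df_5min['load'].resample('H').mean()

# compare against the native hourly series and take the largest absolute gap
max_diff = (hourly_from_5min - df_hourly['load']).abs().max()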

Resample a Pandas DataFrame to Hourly using Hour as mid-point

I have a data frame with temperature measurements at a frequency of 5 minutes. I would like to resample this dataset to find the mean temperature per hour.
This is typically done using df['temps'].resample('H', how='mean') but this averages all values that fall within the hour - using all times where '12' is the hour, for example. I want something that gets all values from 30 minutes either side of the hour (or times nearest to the actual hour) and finds the mean that way. In other words, for the resampled time step of 1200, use all temperature values from 1130 to 1230 to calculate the mean.
Example code below to create a test data frame:
import pandas as pd

index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)
df = pd.DataFrame(index=index)
df['temps'] = temps
Can this be done using the built-in resample method? I'm sure I've done it before using pandas but cannot find any reference to it.
It seems you need:
print (df['temps'].shift(freq='30Min').resample('H').mean())
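The shift moves every timestamp forward by 30 minutes before the hourly binning, so the bin labelled 12:00 collects the samples originally stamped from 11:30 up to (but not including) 12:30. A quick comparison using the test frame above:
plain = df['temps'].resample('H').mean()                         # 12:00 bin uses 12:00-12:55
centred = df['temps'].shift(freq='30Min').resample('H').mean()   # 12:00 bin uses 11:30-12:25
print(pd.concat({'plain': plain, 'centred': centred}, axis=1).head())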

How can I extract a subset of months from all NetCDF files in one directory

I need to calculate the 90th percentile based on temperature data for 1961-1990. I have 30 NetCDF files, and every file contains daily data for one year. I need to calculate the 90th percentile for a specific lat/lon, considering only the summer days out of all 30 years of daily data. I also need to account for the years in which February has 29 days. When I run my code, it only considers the first summer (summer 1961) and does not combine all summer days together.
import numpy as np
import xarray as xr

data = xr.open_mfdataset('/Tmax-2m/Control/*.nc')
time = data.variables['time']
lon = data.variables['lon'][:]
lat = data.variables['lat'][:]
tmax = data.variables['tmax'][:]
df = data.sel(lat=39.18, lon=-95.57, method='nearest')
time2 = df.variables['time'][151:243]
dg = df.sel(time=time2, method='nearest')
print(np.percentile(dg.tmax, 90))
I tried this way, but it calculates the percentile for every summer of every year separately:
splits = [151,516,881,1247,1612,1977,2342,2708,3073,3438,3803,4169,4534,4899,5264,5630,5995,6360,6725,7091,7456,7821,8186,8552,8917,9282,9647,10013,10378,10743]
t0 = 92
result = []
for i in splits:
    time3 = df.variables['time'][i:(i+t0)]
    dg = df.sel(time=time3, method='nearest')
    result.append(np.percentile(dg.tmax, 90))
np.savetxt("percentile1.csv", result, fmt="%s")
Have you considered using CDO for this task? (If you are running under Linux this is easy; if you are on Windows, you probably need to install it under Cygwin.)
You can merge the 30 files into one timeseries like this:
cdo mergetime file_y*.nc timeseries.nc
Here the * is a wildcard for the year (1961, 1962, etc.) in the filenames, which I assume are file_y1961.nc, file_y1962.nc, and so on; adapt as appropriate. timeseries.nc is the output file.
You can then calculate the seasonal percentiles like this:
cdo yseaspctl,90 timeseries.nc -yseasmin timeseries.nc -yseasmax timeseries.nc percen.nc
percen.nc will contain the seasonal percentiles, and you can extract the one for summer.
further details here: https://code.mpimet.mpg.de/projects/cdo/
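If you would rather stay in Python, the same summer selection can be done in xarray by filtering on the calendar month before taking the percentile. A sketch, assuming summer means June to August (as the index range 151:243 in the question suggests):
import numpy as np
import xarray as xr

data = xr.open_mfdataset('/Tmax-2m/Control/*.nc')

# nearest grid cell to the point of interest (coordinates from the question)
point = data['tmax'].sel(lat=39.18, lon=-95.57, method='nearest')

# keep June, July and August from every year; selecting by calendar month
# means leap years need no special index bookkeeping
summer = point.where(point['time'].dt.month.isin([6, 7, 8]), drop=True)

print(np.percentile(summer.values, 90))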
