I have a dataset which consists of daily x,y gridded meteorological data for several years. I am interested in calculating annual means of winter data only, i.e. excluding the summer data.
I think that I need to use the resample command with, e.g. a frequency of AS-OCT to resample the time series to annual frequency with winter beginning in October each year (it's northern latitudes).
What I can't work out is how to specify that I only want to use data from months October through April/May, ignoring June, July and August.
As the resample function works with ndarray objects I came up with a fairly unportable way of doing this for a sum:
def winter(x, axis):
    # Only use data from 1 October to end of April (day 211)
    return np.sum(x[0:211, :, :], axis=0)
win_sum = all_data.resample('AS-OCT',how=winter,dim='TIME')
but I feel like there should be a more elegant solution. Any ideas?
The trick is to create a mask for the dates you wish to exclude. You can do this by using groupby to extract the month.
import xarray as xr
import pandas as pd
import numpy as np
# create some example data at daily resolution that also has a space dimension
time = pd.date_range('01-01-2000','01-01-2020')
space = np.arange(0,100)
data = np.random.rand(len(time), len(space))
da = xr.DataArray(data, dims=['time','space'], coords={'time': time, 'space': space})
# this is the trick -- use groupby to extract the month number of each day
month = da.groupby('time.month').apply(lambda x: x).month
# create a boolean DataArray that is True only during winter
winter = (month <= 4) | (month >= 10)
# mask the values not in winter and resample annually starting in October
da_winter_annmean = da.where(winter).resample('AS-Oct', 'time')
Hopefully this works for you. It is slightly more elegant, but the groupby trick still feels kind of hackish. Maybe there is still a better way.
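In more recent xarray releases the month number can be read straight from the time coordinate through the `.dt` accessor, so the groupby step may not be needed. Below is a minimal sketch of the same mask-then-resample idea. It assumes an xarray version that accepts the keyword form `resample(time=...)` and a pandas version that understands the `YS-OCT` alias (year start anchored on October). The toy values are made up so that the result is easy to check.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three complete "winter years", each running from 1 October to 30 September
time = pd.date_range("2000-10-01", "2003-09-30", freq="D")
space = np.arange(5)

# Toy data (an assumption for illustration): 1.0 on winter days, 100.0 otherwise
is_winter_np = (time.month >= 10) | (time.month <= 4)
values = np.where(is_winter_np, 1.0, 100.0)[:, None] * np.ones(len(space))
da = xr.DataArray(values, dims=["time", "space"],
                  coords={"time": time, "space": space})

# Month number of every time step, taken from the datetime accessor
month = da["time"].dt.month
winter = (month <= 4) | (month >= 10)

# Non-winter days become NaN; mean() skips NaN by default for float data
winter_mean = da.where(winter).resample(time="YS-OCT").mean()
```

Each label on the resulting time axis is the 1 October that starts the corresponding winter year.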
Related
If there was a variable in an xarray dataset with a time dimension with daily values over some multiyear time span
2017-01-01 ... 2018-12-31, then it is possible to group the data by month, or by the day of the year, using
.groupby("time.month") or .groupby("time.dayofyear")
Is there a way to efficiently group the data by the day of the month, for example if I wanted to calculate the mean value on the 21st of each month?
See the xarray docs on the DateTimeAccessor helper object. For more info, you can also check out the xarray docs on Working with Time Series Data: Datetime Components, which in turn refers to the pandas docs on date/time components.
You're looking for day. Unfortunately, both pandas and xarray simply describe .dt.day as referring to "the days of the datetime", which isn't particularly helpful. But if you take a look at Python's native datetime.date.day definition, you'll see the more specific:
date.day
Between 1 and the number of days in the given month of the given year.
So, simply
da.groupby("time.day")
should do the trick!
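As a quick check of the idea, here is a small sketch on made-up data in which every value equals its own day of the month, so the mean for the 21st has to be 21. The array below is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2017-01-01", "2018-12-31", freq="D")
# Toy data: each value equals its own day of the month
da = xr.DataArray(time.day.values.astype(float), dims=["time"],
                  coords={"time": time})

# One mean per day of the month (1..31), taken over all months and years
by_day = da.groupby("time.day").mean("time")
mean_on_21st = by_day.sel(day=21)

# If only the 21st is of interest, the other days can be dropped up front
only_21st = da.where(da["time"].dt.day == 21, drop=True)
```

Note that days 29 to 31 are averaged over fewer months than the rest, since not every month has them.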
I'm not sure, but maybe you can do something like this:
import datetime
x = datetime.datetime.now()
day = x.strftime("%d")
month = x.strftime("%m")
year = x.strftime("%Y")
.groupby(month) or .groupby(year)
Could someone give me a tip on how to use pandas groupby to find similar "days" in a time series dataset?
For example, my data is a building's electrical power and weather data (averaged to daily values). I am trying to see if pandas groupby can be used to find "days" that are similar, in both electrical power usage and weather, to one specific date in the time stamp: July 25th 2019.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bbartling/Data/master/stackoverflow_groupby_question.csv', parse_dates=True)
df['Date']=pd.to_datetime(df['Date'], utc=True)
df.set_index('Date', inplace=True)
df_daily_avg = df.resample('D').mean()
What I am trying to find is like the top 10 or 15 most similar days in this dataset to the averaged temperature on that day of July 25th which is:
july_25_temp_avg = df_daily_avg.loc['2019-07-25'].Temperature_C
22.047916666666676
And averaged building power which is:
july_25_power_avg = df_daily_avg.loc['2019-07-25'].kW
52.658333333333324
If I use groupby, as in the example below, it strips away the time stamp index.
july25_most_similar = df_daily_avg.groupby(['kW','Temperature_C'],as_index=False).Temperature_C.mean()
returns where it seems like most similar days are on the bottom:
            kW  Temperature_C
0     9.316667      17.256250
1     9.433333      14.979167
2     9.616667      13.933333
3     9.683333      19.822917
4    10.116667      24.606250
..         ...            ...
360  58.741667      21.816667
361  61.250000      23.839583
362  61.633333      25.204167
363  62.483333      25.970833
364  63.808333      25.300000
Any tips greatly appreciated to return the timestamp/days that are most similar to July 25th Temperature & Power.
Also if it is possible to use more criteria than just Temperature_C is it possible to post an additional answer to use more weather data? For example the averaged power on July 25th and more weather data (beyond just Temperature_C) like Wind_Speed_m_s Relative_Humidity Temperature_C Pressure_mbar DHI_DNI?
I think I would take this approach:
indx = df_daily_avg.sub(df_daily_avg.loc['2019-07-25']).abs()\
.sort_values(['Temperature_C', 'kW']).head(10).index.normalize()
df[df.index.normalize().isin(indx)]
Subtract the July 25th row from every day and take the absolute value, then get the top ten days sorted on 'Temperature_C' and 'kW', or on some other metric that ranks the two.
Then take that index, normalize it to dates, and determine which rows in the original dataframe match the retrieved index.
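For the follow-up about using more than one weather variable, one possible metric is the Euclidean distance in standardized (z-scored) units, so that a column with large numbers does not dominate the ranking. This is only a sketch under assumptions: the frame below is random synthetic data, the column names are borrowed from the question, and `most_similar_days` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

def most_similar_days(daily, target, columns, n=10):
    """Rank days by Euclidean distance to `target` in z-scored feature space."""
    features = daily[columns].dropna()
    # Standardize each column so that units do not dominate the distance
    z = (features - features.mean()) / features.std()
    # Row-wise distance between every day and the target day
    distance = np.sqrt(((z - z.loc[target]) ** 2).sum(axis=1))
    # Drop the target itself (distance 0) and keep the n closest days
    return distance.drop(target).nsmallest(n)

# Synthetic daily averages (assumption: stands in for df_daily_avg)
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", "2019-12-31", freq="D")
daily = pd.DataFrame(
    {
        "Temperature_C": rng.normal(15, 8, len(idx)),
        "kW": rng.normal(40, 12, len(idx)),
        "Relative_Humidity": rng.uniform(20, 100, len(idx)),
    },
    index=idx,
)

target = pd.Timestamp("2019-07-25")
similar = most_similar_days(daily, target,
                            ["Temperature_C", "kW", "Relative_Humidity"])
```

The result is a Series indexed by date, so the time stamps of the closest days are kept. Adding further columns to the list extends the comparison to more weather variables.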
I have a DataArray with date, x, and y coordinate dims.
date: 265, y: 1458, x: 1159
For each year, over the course of 35 years, there are about 10 arrays with a date and x and y dims.
I'd like to groupby a custom annual season to integrate values for each raster over that custom annual season (maybe October to April for each year). The xarray docs show that I can do something like this:
arr.groupby("date.season")
which results in
DataArrayGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
But this groups by season over the entire 35 year record and I can't control the start and ending months that are used to group.
Similarly this does not quite get at what I want:
all_eeflux_arr.groupby("date.year")
DataArrayGroupBy, grouped over 'year'
36 groups with labels 1985, 1986, 1987, ..., 2019, 2020.
The start and the ending date for each year is automatically January/December.
I'd really like to be able to groupby an arbitrary period with a start time and an end time in units of day of year.
If this grouping can also discard dates that fall outside the grouping window, even better, since I may want to group by a day of year range that doesn't span all months.
The DataArray.resample method seems to be able to select a custom offset (see this SO post) for the start of the year, but I can't figure out how to access this with groupby. I can't use resample because it does not return a DataArray and I need to call the .integrate xarray method on each group DataArray (across the date dim, to get custom annual totals).
Data to reproduce (2Gb, but can be tested on a subset):
https://ucsb.box.com/s/zhjxkqje18m61rivv1reig2ixiigohk0
Code to reproduce
import rioxarray as rio
import xarray as xr
import numpy as np
from pathlib import Path
from datetime import datetime
all_scenes_f_et = Path('/home/serdp/rhone/rhone-ecostress/rasters/eeflux/PDR')
all_pdr_et_paths = list(all_scenes_f_et.glob("*.tif"))
def eeflux_path_date(path):
    # the file name is split into year, month, day and one trailing part
    year, month, day, _ = path.name.split("_")
    return datetime(int(year), int(month), int(day))
def open_eeflux(path, da_for_match):
    data_array = rio.open_rasterio(path)  # passing chunks here would make it lazily executed
    # reproject_match returns a new array, so the result has to be assigned
    data_array = data_array.rio.reproject_match(da_for_match)
    data_array = data_array.sel(band=1).drop("band")  # gets rid of old coordinate dimension since we need bands to have unique coord ids
    data_array["date"] = eeflux_path_date(path)  # makes a new coordinate
    return data_array.expand_dims({"date": 1})  # makes this coordinate a dimension
da_for_match = rio.open_rasterio(all_pdr_et_paths[0])
daily_eeflux_arrs = [open_eeflux(path, da_for_match) for path in all_pdr_et_paths]
all_eeflux_arr = xr.concat(daily_eeflux_arrs, dim="date")
all_eeflux_arr = all_eeflux_arr.sortby("date")
### not sure what should go here
all_eeflux_arr.groupby(????????).integrate(dim="date", datetime_unit="D")
Advice is much appreciated!
I ended up writing a function that works well enough. Since my dataset isn't that large the integration doesn't take very long to run in a for loop that iterates over each group.
def group_by_custom_doy(all_eeflux_arr, doy_start, doy_end):
    ey = max(all_eeflux_arr['date.year'].values)
    sy = min(all_eeflux_arr['date.year'].values)
    start_years = range(sy, ey)
    end_years = range(sy+1, ey+1)
    # pair each start year with the following year
    start_end_years = list(zip(start_years, end_years))
    water_year_arrs = []
    for water_year in start_end_years:
        # dates after doy_start in the first year of the pair
        start_mask = ((all_eeflux_arr['date.dayofyear'].values > doy_start) & (all_eeflux_arr['date.year'].values == water_year[0]))
        # dates before doy_end in the second year of the pair
        end_mask = ((all_eeflux_arr['date.dayofyear'].values < doy_end) & (all_eeflux_arr['date.year'].values == water_year[1]))
        water_year_arrs.append(all_eeflux_arr[start_mask | end_mask])
    return water_year_arrs
water_year_arrs = group_by_custom_doy(all_eeflux_arr, 125, 300)
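An alternative that avoids the explicit loop is to attach a season label to every date and group on that label. The sketch below is not the method above; it assumes an October to April season, labels each date with the year in which its season ends, and uses a small array of ones in place of the real rasters. The names `season_year` and `in_season` are made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the raster stack (assumption): daily ones on a tiny 2 x 3 grid
dates = pd.date_range("2000-01-01", "2003-12-31", freq="D")
arr = xr.DataArray(
    np.ones((len(dates), 2, 3)),
    dims=["date", "y", "x"],
    coords={"date": dates},
)

month = arr["date"].dt.month
year = arr["date"].dt.year

# Label each date with the year in which its October-April season ends
season_year = (year + (month >= 10)).values
arr = arr.assign_coords(season_year=("date", season_year))

# Discard dates outside the season before grouping
in_season = (month >= 10) | (month <= 4)
seasonal = arr.where(in_season, drop=True)

# Integrate every season separately along the date dimension (units: days)
totals = seasonal.groupby("season_year").map(
    lambda g: g.integrate("date", datetime_unit="D")
)
```

`totals` has one raster per season, indexed by `season_year`. The first and last seasons are incomplete whenever the record does not start and end on a season boundary, so they may need to be dropped.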
I have daily data going back years. I want to first compute the monthly average of these values and then project that monthly average forward as a forecast for the next few years. I have written the following code to do this.
For example, my forecast for the next few Januaries will be the average of the last few Januaries, and the same for Feb, Mar etc. Over the past few years my January number is 51.8111, so for the Januaries in my forecast period I want every day in every January to be this 51.8111 number (i.e. moving the monthly value to daily granularity).
My question is: my code seems a bit long-winded, and with the loop it could potentially be a little slow. For my own learning, what is a better way of taking daily data, averaging it by a time period, then projecting out this time period? I was looking at the map and apply functions within pandas, but couldn't quite work it out.
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(0)
# create random dataframe of daily values
# build the index first so the number of random values always matches its length
idx = pd.date_range(start='2014-01-01', end=dt.date.today()-dt.timedelta(days=1), freq='D')
df = pd.DataFrame(np.random.randint(low=0, high=100, size=len(idx)),
                  columns=['value'],
                  index=idx)
# gain average by month over entire date range
df_by_month = df.groupby(df.index.month).mean()
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = 0.0
# project forward the monthly average to each day
for val in df_forecast.index:
    df_forecast.loc[val, 'value'] = df_by_month.loc[val.month, 'value']
# create new dataframe joining together the historical value and forecast
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_complete = pd.concat([df, df_forecast])
I think you need Index.map, mapping the month of each forecast date to the value column of df_by_month:
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = df_forecast.index.month.map(df_by_month['value'])
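To see the mapping end to end, here is a self-contained sketch on made-up history where every value equals its month number, so the expected forecast is obvious. The fixed date ranges are assumptions chosen to keep the example reproducible, and `pd.concat` is used for the final join.

```python
import numpy as np
import pandas as pd

# Toy history (assumption): each value equals its month number
hist_index = pd.date_range("2014-01-01", "2016-12-31", freq="D")
df = pd.DataFrame({"value": hist_index.month.to_numpy(dtype=float)},
                  index=hist_index)

# Average by calendar month over the whole history (index 1..12)
monthly_mean = df.groupby(df.index.month)["value"].mean()

# Look up the monthly average for every forecast day in one vectorized step
forecast_index = pd.date_range("2017-01-01", periods=1095, freq="D")
df_forecast = pd.DataFrame(index=forecast_index)
df_forecast["value"] = np.asarray(forecast_index.month.map(monthly_mean))

# Join history and forecast into one frame
df_complete = pd.concat([df, df_forecast])
```

The loop over every forecast day is replaced by a single lookup keyed on the month number, which is why no per-row assignment is needed.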
I would like to do some annual statistics (cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible and the time series contains leap years.
I tried e.g. the following:
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so I get an offset of one day per leap year and
(2) the beginning of the first year (until end of June) is appended to the end of the rolled time series, which creates some "fake year" where the cumulative sums don't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use an xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()