How to groupby a custom time range in xarray? - python

I have a DataArray with date, x, and y coordinate dims.
date: 265, y: 1458, x: 1159
For each year, over the course of 35 years, there are about 10 arrays with a date and x and y dims.
I'd like to groupby a custom annual season to integrate values for each raster over that custom annual season (maybe October to April for each year). The xarray docs show that I can do something like this:
arr.groupby("date.season")
which results in
DataArrayGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
But this groups by season over the entire 35 year record and I can't control the start and ending months that are used to group.
Similarly this does not quite get at what I want:
all_eeflux_arr.groupby("date.year")
DataArrayGroupBy, grouped over 'year'
36 groups with labels 1985, 1986, 1987, ..., 2019, 2020.
The start and end dates for each year are automatically January 1 and December 31.
I'd really like to be able to groupby an arbitrary period with a start time and an end time in units of day of year.
If this grouping can also discard dates that fall outside the grouping window, even better, since I may want to group by a day of year range that doesn't span all months.
The DataArray.resample method seems to be able to select a custom offset (see this SO post) for the start of the year, but I can't figure out how to access this with groupby. I can't use resample because it does not return a DataArray and I need to call the .integrate xarray method on each group DataArray (across the date dim, to get custom annual totals).
Data to reproduce (2Gb, but can be tested on a subset):
https://ucsb.box.com/s/zhjxkqje18m61rivv1reig2ixiigohk0
Code to reproduce
import rioxarray as rio
import xarray as xr
import numpy as np
from pathlib import Path
from datetime import datetime
all_scenes_f_et = Path('/home/serdp/rhone/rhone-ecostress/rasters/eeflux/PDR')
all_pdr_et_paths = list(all_scenes_f_et.glob("*.tif"))
def eeflux_path_date(path):
    year, month, day, _ = path.name.split("_")
    return datetime(int(year), int(month), int(day))

def open_eeflux(path, da_for_match):
    data_array = rio.open_rasterio(path)  # chunks makes it lazily executed
    data_array = data_array.rio.reproject_match(da_for_match)  # reproject_match returns a new array, so assign it
    data_array = data_array.sel(band=1).drop("band")  # gets rid of the old coordinate dimension since we need bands to have unique coord ids
    data_array["date"] = eeflux_path_date(path)  # makes a new coordinate
    return data_array.expand_dims({"date": 1})  # makes this coordinate a dimension
da_for_match = rio.open_rasterio(all_pdr_et_paths[0])
daily_eeflux_arrs = [open_eeflux(path, da_for_match) for path in all_pdr_et_paths]
all_eeflux_arr = xr.concat(daily_eeflux_arrs, dim="date")
all_eeflux_arr = all_eeflux_arr.sortby("date")
### not sure what should go here
all_eeflux_arr.groupby(????????).integrate(dim="date", datetime_unit="D")
Advice is much appreciated!

I ended up writing a function that works well enough. Since my dataset isn't that large the integration doesn't take very long to run in a for loop that iterates over each group.
def group_by_custom_doy(all_eeflux_arr, doy_start, doy_end):
    ey = max(all_eeflux_arr['date.year'].values)
    sy = min(all_eeflux_arr['date.year'].values)
    start_years = range(sy, ey)
    end_years = range(sy + 1, ey + 1)
    start_end_years = list(zip(start_years, end_years))
    water_year_arrs = []
    for water_year in start_end_years:
        # days after doy_start in the first year of the pair ...
        start_mask = ((all_eeflux_arr['date.dayofyear'].values > doy_start) & (all_eeflux_arr['date.year'].values == water_year[0]))
        # ... combined with days before doy_end in the following year
        end_mask = ((all_eeflux_arr['date.dayofyear'].values < doy_end) & (all_eeflux_arr['date.year'].values == water_year[1]))
        water_year_arrs.append(all_eeflux_arr[start_mask | end_mask])
    return water_year_arrs
water_year_arrs = group_by_custom_doy(all_eeflux_arr, 125, 300)
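Since each group is a plain DataArray, the per-group integration mentioned above can then be run with a simple loop or comprehension; a minimal sketch, reusing the .integrate call from the question:
# Hypothetical follow-up: integrate each water-year group along "date"
# to get one total per custom season (trapezoidal rule, in day units).
water_year_totals = [wy.integrate("date", datetime_unit="D") for wy in water_year_arrs]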

Related

change axis in time series for a custom year

I have a database with a number of events for a given date. I want to display a graph showing the number of events each day, displaying the week number on the x-axis and creating a curve for each year.
I have no problem doing this simply.
My concern is that I should not display the "calendar" years (from January 1st to December 31st, in other words isoweeks 1 to 53), but the so-called "winter" years covering the 12-month periods from August to August.
I wrote the code below to do this, and I get a correctly indexed table with the week numbers in order (from 35 to 53, then from 1 to 34), the number of cases ("count"), and a "winter_year" column which lets me group my curves.
Despite all my attempts, the x-axis of the plot is still displayed from 1 to 53 ...
I have recreated an example with random numbers below
Can you help me to get the graph I want?
I am also open to any suggestions for improving my code which, I know, is probably not very optimal... I'm still learning Python.
#%%
import pandas
import numpy as np
from datetime import date
def winter_year(date):
    if date.month > 8:
        x = date.year
    else:
        x = date.year - 1
    return "Winter " + str(x) + "-" + str(x + 1)
#%%
np.random.seed(10)
data = pandas.DataFrame()
data["dates"] = pandas.date_range("2017-07-12","2022-08-10")
data["count"] = pandas.Series(np.random.randint(150, size = len(data["dates"])))
data = data.set_index("dates")
print(data)
#%%
data["week"] = data.index.isocalendar().week
data["year"] = data.index.year
data["date"] = data.index.date
data["winter_year"] = data["date"].apply(winter_year)
datapiv = pandas.pivot_table(data,values = ["count"],index = ["week"], columns = ["winter_year"],aggfunc=np.sum)
order_weeks = [i for i in range(35, 54)]
for i in range(1, 35):
    order_weeks.append(i)
datapiv = datapiv.reindex(order_weeks)
datapiv.plot(use_index=True)
Add this line before the plot; converting the index to strings makes matplotlib treat it as categorical, so the weeks are plotted in the order you built rather than sorted numerically:
datapiv.index = [str(x) for x in order_weeks]
it would be something like:
...
order_weeks = [i for i in range(35, 54)]
for i in range(1, 35):
    order_weeks.append(i)
datapiv = datapiv.reindex(order_weeks)
# this is the new line
datapiv.index = [str(x) for x in order_weeks]
datapiv.plot(use_index=True)
Output: (plot with the x-axis labels in the custom order 35-53, then 1-34, one curve per winter year)

Python function that mimics the distribution of my dataset

I have a food order dataset that looks like this, with a few thousand orders over the span of a few months:
Date                   Item Name       Price
2021-10-09 07:10:00    Water Bottle    1.5
2021-10-09 12:30:60    Pizza           12
2021-10-09 17:07:56    Chocolate bar   3
Those orders are time-dependent. Nobody will eat a pizza at midnight, usually. There will be more 3PM Sunday orders than there will be 3PM Monday orders (because people are at work). I want to extract the daily order distribution for each weekday (Monday till Sunday) from those few thousand orders so I can generate new orders later that fits this distribution. I do not want to fill in the gaps in my dataset.
How can I do so?
I want to create a generate_order_date() function that would generate a random hours:minutes:seconds depending on the day. I can already identify which weekday a date corresponds to. I just need to extract the 7 daily distributions so I can call my function like this:
generate_order_date(day=Monday, nb_orders=1)
[12:30:00]
generate_order_date(day=Friday, nb_orders=5)
[12:30:00, 07:23:32, 13:12:09, 19:15:23, 11:44:59]
generated timestamps do not have to be in chronological order. Just like if I was calling
np.random.normal(mu, sigma, 1000)
Try np.histogram(data)
https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
The first element of the returned tuple gives you the counts (pass density=True for a normalised density), which is your distribution. You can visualise it with
plt.plot(np.histogram(data)[0])
Here data would be the time of day at which each item was ordered. For this approach I would suggest rounding your times to 5-minute intervals or more, depending on the frequency; for example, round 12:34pm to 12:30pm and 12:36pm to 12:35pm. Choose a suitable bin width.
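A minimal sketch of this approach, assuming a pandas Series of order datetimes named order_times (a hypothetical name, not from the question), binned into 5-minute intervals:
import numpy as np
import matplotlib.pyplot as plt

# order_times: hypothetical pandas Series of order datetimes for a single weekday
minutes = order_times.dt.hour * 60 + order_times.dt.minute  # time of day in minutes
minutes = (minutes // 5) * 5                                 # round down to 5-minute bins
counts, edges = np.histogram(minutes, bins=288, range=(0, 1440), density=True)
plt.plot(edges[:-1], counts)  # the daily order-time distribution
plt.show()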
Another method would be scipy.stats.gaussian_kde, which uses a Gaussian kernel. Below is an implementation I have used previously:
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def get_kde(df: pd.DataFrame) -> list:
    # evaluate the density on a fixed grid and keep only points where it exceeds 0.1
    xs = np.round(np.linspace(-1, 1, 3000), 3)
    kde = gaussian_kde(df.values)
    kde_vals = np.round(kde(xs), 3)
    data = [[xs[i], kde_vals[i]] for i in range(len(xs)) if kde_vals[i] > 0.1]
    return data
where df.values is your data. There are plenty more kernels which you can use to get a density estimate; the most suitable one depends on the nature of your data.
Also see https://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation
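For the time-of-day use case, the same estimator can also generate new samples directly; a rough sketch (not from the answer above), assuming a DataFrame for a single weekday with a datetime column named Date and times expressed as fractions of a day:
from scipy.stats import gaussian_kde

# df: hypothetical DataFrame of one weekday's orders with a datetime column "Date"
frac_of_day = (df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second) / 86400
kde = gaussian_kde(frac_of_day.values)  # fit the kernel density estimate
new_times = kde.resample(5)[0]          # draw 5 new order times as fractions of a day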
Here's a rough sketch of what you could do:
Assumption: Column Date of the DataFrame df contains datetimes. If not do df.Date = pd.to_datetime(df.Date) first.
First 2 steps:
import pickle
import numpy as np
from datetime import datetime, timedelta
from sklearn.neighbors import KernelDensity

# STEP 1: Data preparation
df["Seconds"] = (
    df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
) / 86400
data = {
    day: sdf.Seconds.to_numpy()[:, np.newaxis]
    for day, sdf in df.groupby(df.Date.dt.weekday)
}

# STEP 2: Kernel density estimation
kdes = {
    day: KernelDensity(bandwidth=0.1).fit(X) for day, X in data.items()
}
with open("kdes.pkl", "wb") as file:
    pickle.dump(kdes, file)
STEP 1: Build a normalised column Seconds (scaled between 0 and 1). Then group over the weekdays (numbered 0, ..., 6) and prepare, for every day of the week, the data for kernel density estimation.
STEP 2: Estimate the kernel densities for every day of the week with KernelDensity from Scikit-learn and pickle the results.
Based on these estimates build the desired sample function:
# STEP 3: Sampling
with open("kdes.pkl", "rb") as file:
    kdes = pickle.load(file)

def generate_order_date(day, orders):
    fmt = "%H:%M:%S"
    base = datetime(year=2022, month=1, day=1)
    kde = kdes[day]
    return [
        # each sample is an array of shape (1,); s[0] is the fraction of the day
        (base + timedelta(seconds=int(s[0] * 86399))).time().strftime(fmt)
        for s in kde.sample(orders)
    ]
I won't pretend that this is anywhere near perfect. But maybe you could use it as a starting point.
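A hypothetical call (pandas numbers weekdays 0 to 6, so Friday is 4) might then look like:
generate_order_date(day=4, orders=3)
# e.g. ['12:41:03', '18:05:47', '07:59:12']  (illustrative only; actual output depends on the fitted densities)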

Dataset statistics with custom begin of the year

I would like to do some annual statistics (cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible, and the time series contains leap years.
I tried e.g. the following:
import numpy as np
import pandas as pd
import xarray as xr

rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so we get an offset of one day per leap year, and
(2) the beginning of the first year (until the end of June) is appended to the end of the rolled time series, which creates a "fake year" where the cumulative sums no longer make sense.
I also tried first cutting off the ends of the time series, but then the rolling no longer works. Resampling did not seem to be an option either, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use an xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()
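A vectorized variant of the same idea (a sketch, not part of the original answer, assuming the same 15 September cutoff) builds the group labels from the datetime accessor instead of a Python list comprehension:
# Same group labels as my_years above, computed without a loop.
month = foo.time.dt.month
day = foo.time.dt.day
my_years_vec = (foo.time.dt.year + ((month > 9) | ((month == 9) & (day >= 15)))).rename('my_years')
foo_cumsum = foo.groupby(my_years_vec).apply(lambda x: x.cumsum(dim='time', skipna=True))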

Resample xarray Dataset to annual frequency using only winter data

I have a dataset which consists of daily x,y gridded meteorological data for several years. I am interested in calculating annual means of winter data only, ie. not including the summer data as well.
I think that I need to use the resample command with, e.g. a frequency of AS-OCT to resample the time series to annual frequency with winter beginning in October each year (it's northern latitudes).
What I can't work out is how to specify that I only want to use data from months October through April/May, ignoring June, July and August.
As the resample function works with ndarray objects I came up with a fairly unportable way of doing this for a sum:
def winter(x, axis):
    # Only use data from 1 October to end of April (day 211)
    return np.sum(x[0:211, :, :], axis=0)

win_sum = all_data.resample('AS-OCT', how=winter, dim='TIME')
but I feel like there should be a more elegant solution. Any ideas?
The trick is to create a mask for the dates you wish to exclude. You can do this by using groupby to extract the month.
import xarray as xr
import pandas as pd
import numpy as np
# create some example data at daily resolution that also has a space dimension
time = pd.date_range('01-01-2000','01-01-2020')
space = np.arange(0,100)
data = np.random.rand(len(time), len(space))
da = xr.DataArray(data, dims=['time','space'], coords={'time': time, 'space': space})
# this is the trick -- use groupby to extract the month number of each day
month = da.groupby('time.month').apply(lambda x: x).month
# create a boolean DataArray that is True only during winter
winter = (month <= 4) | (month >= 10)
# mask the values not in winter and resample annually starting in October
da_winter_annmean = da.where(winter).resample('AS-OCT', 'time')
Hopefully this works for you. It is slightly more elegant, but the groupby trick still feels kind of hackish. Maybe there is still a better way.
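With more recent xarray versions (an assumption about your install, not part of the original answer), the month can be read straight from the datetime accessor and resample takes keyword syntax, which avoids the groupby trick:
# Sketch using the newer accessor and resample API.
month = da['time'].dt.month
winter = (month <= 4) | (month >= 10)
da_winter_annmean = da.where(winter).resample(time='AS-OCT').mean()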

Compute daily climatology using pandas python

I am trying to use pandas to compute daily climatology. My code is:
import random
import pandas as pd

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data = [int(1000 * random.random()) for i in range(len(dates))]
cum_data = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")
cum_data is the Series indexed by daily dates from 1st Jan 1950 to 31st Dec 1953. I want to create a new vector of length 365 with the first element containing the average of rand_data for January 1st over 1950, 1951, 1952 and 1953, and so on for the second element...
Any suggestions how I can do this using pandas?
You can group by the day of the year, and then calculate the mean of these groups:
cum_data.groupby(cum_data.index.dayofyear).mean()
However, you have to be aware of leap years, which will cause problems with this approach. As an alternative, you can also group by the month and the day:
In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1 1 462.25
2 631.00
3 615.50
4 496.00
...
12 28 378.25
29 427.75
30 528.50
31 678.50
Length: 366, dtype: float64
Hoping it can be of some help, I want to post my solution to get a climatology series with the same index and length as the original time series.
I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series taking values from this model climatology and time index from my original time series.
This way, things like leap years are automatically taken care of.
#I start with my time series named 'serData'.
#I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()
#Now I build the climatology series, taking values from serClimModel depending on the index of serData.
serClimatology = serClimModel[list(zip(serData.index.month, serData.index.day))]
#Now serClimatology has a time index like this: [1,1] ... [12,31].
#So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index
#joris. Thanks. Your answer was just what I needed to use pandas to calculate daily climatologies, but you stopped short of the final step. Re-mapping the month,day index back to an index of day of the year for all years, including leap years, i.e. 1 thru 366. So I thought I'd share my solution for other users. 1950 thru 1953 is 4 years with one leap year, 1952. Note since random values are used each run will give different results.
...
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
    (mo, dy) = name
    # Note: can use any leap year here.
    yrday = (date(1952, mo, dy)).timetuple().tm_yday
    doy.append(yrday)
    doy_mean.append(group.mean())
    doy_size.append(group.count())
    # Note: useful climatology stats are also available via group.describe() returned as dict
    # desc = group.describe()
    # desc["mean"], desc["min"], desc["max"], std, quartiles, etc.
    # we lose the counts here.
new_cum_data = pd.Series(doy_mean, index=doy)
print(new_cum_data.loc[366])
>> 634.5
pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)
print(cum_data_df.loc[366])
>> mean 634.5
>> size 4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print(cum_data_df.loc[60])
>> mean 343
>> size 1
>> Name: 60, dtype: float64
Grouping by month and day is a good solution. However, a clean groupby(dayofyear) is still possible if you use an xarray.CFTimeIndex instead of a pandas.DatetimeIndex, i.e.:
Delete Feb 29 by using
rand_data = rand_data[~((rand_data.index.month == 2) & (rand_data.index.day == 29))]
Replace the index of the above data with an xarray.CFTimeIndex, i.e.:
index = xarray.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar='noleap')
index = index[~((index.month == 2) & (index.day == 29))]
rand_data.index = index
Now, for both non-leap and leap years, the 60th day of year is March 1st and the total number of days in the year is 365, so grouping by dayofyear gives a correct climatological daily mean.
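Putting those steps together, a minimal self-contained sketch (using xarray for the groupby; the variable names are illustrative, not from the answer above) could look like:
import random
import pandas as pd
import xarray as xr

# Rebuild the example series from the question as an xarray DataArray.
dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
values = [int(1000 * random.random()) for _ in range(len(dates))]
da = xr.DataArray(values, dims='time', coords={'time': dates})

# Drop 29 February, then switch to a no-leap CFTimeIndex of the same length.
mask = ((da['time'].dt.month == 2) & (da['time'].dt.day == 29)).values
da = da[~mask]
noleap = xr.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar='noleap')
da = da.assign_coords(time=noleap)

# Day-of-year groups now line up across years (365 groups, no leap-day shift).
daily_clim = da.groupby('time.dayofyear').mean()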
