Change axis in time series for a custom year - Python

I have a database with a number of events for each date. I want to display a graph showing the number of events per day, with the week number on the x-axis and one curve per year.
I have no problem doing this in the simple case.
My concern is that I should not display "calendar" years (January 1st to December 31st, i.e. ISO weeks 1 to 53), but so-called "winter" years: the 12-month periods running from August to August.
I wrote the code below to do this, and I get a correctly indexed table with the week numbers in order (from 35 to 53, then from 1 to 34), the number of cases ("count"), and a "winter_year" column which lets me group my curves.
In spite of all my attempts, the x-axis of the diagram is still displayed from 1 to 53...
I have recreated an example with random numbers below.
Can you help me get the graph I want?
I am also open to any suggestions for improving my code which, I know, is probably not very optimal... I'm still learning Python.
#%%
import pandas
import numpy as np
from datetime import date

def winter_year(date):
    if date.month > 8:
        x = date.year
    else:
        x = date.year - 1
    return "Winter " + str(x) + "-" + str(x + 1)
#%%
np.random.seed(10)
data = pandas.DataFrame()
data["dates"] = pandas.date_range("2017-07-12","2022-08-10")
data["count"] = pandas.Series(np.random.randint(150, size = len(data["dates"])))
data = data.set_index("dates")
print(data)
#%%
data["week"] = data.index.isocalendar().week
data["year"] = data.index.year
data["date"] = data.index.date
data["winter_year"] = data["date"].apply(winter_year)
datapiv = pandas.pivot_table(data, values=["count"], index=["week"], columns=["winter_year"], aggfunc=np.sum)
order_weeks = [i for i in range(35, 54)]
for i in range(1, 35):
    order_weeks.append(i)
datapiv = datapiv.reindex(order_weeks)
datapiv.plot(use_index=True)

Add this line before the plot:
datapiv.index = [str(x) for x in order_weeks]
It would look something like this:
...
order_weeks = [i for i in range(35, 54)]
for i in range(1, 35):
    order_weeks.append(i)
datapiv = datapiv.reindex(order_weeks)
# this is the new line
datapiv.index = [str(x) for x in order_weeks]
datapiv.plot(use_index=True)
Output: the x-axis now runs from week 35 to 53, then 1 to 34.
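Why this works: with an integer index, pandas hands the week numbers to matplotlib as x coordinates, so the axis is laid out numerically (1 to 53) regardless of row order; with a string index, the points are simply drawn in row order. A minimal standalone sketch of the difference (toy data, non-interactive Agg backend):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=[52, 53, 1, 2])

# Integer index: the index values become x coordinates,
# so weeks 1 and 2 land left of 52 and 53 despite the row order.
fig1, ax1 = plt.subplots()
s.plot(ax=ax1)

# String index: the points are drawn at positions 0..3 in row order.
s2 = s.copy()
s2.index = [str(i) for i in s2.index]
fig2, ax2 = plt.subplots()
s2.plot(ax=ax2)
```

On `ax1` the line's x data is 52, 53, 1, 2 on a numeric axis; on `ax2` it is the positions 0 through 3 with "52", "53", "1", "2" as tick labels.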

Related

Calculate Monthly Attrition Rate

I have the following dataframe df
import pandas as pd
import random
dates = pd.date_range(start = "2015-06-02", end = "2022-05-02", freq = "3D")
boolean = [random.randint(0, 1) for i in range(len(dates))]
boolean = [bool(x) for x in boolean]
df = pd.DataFrame(
    {"Dates": dates,
     "Boolean": boolean}
)
I then add the following attributes and group the data:
df["Year"] = df["Dates"].dt.year
df["Month"] = df["Dates"].dt.month
df.groupby(by = ["Year", "Month", "Boolean"]).size().unstack()
Which gets me a Year-by-Month table of True/False counts.
What I need to do is the following:
Calculate the attrition rate for the most recent complete month (say 30 days). To do this I need to count the occurrences where Boolean == False at the beginning of this 1-month period, then count the occurrences where Boolean == True within the period. I then use these two numbers to get the attrition rate, which I think would be sum(True occurrences within the 1-month period) / sum(False occurrences at the beginning of the 1-month period).
I would use the same approach to calculate the attrition rate for the entire historical period (that is, all months between 2015-06-02 and 2022-05-02).
My Current Thinking
I'm wondering if I also need to derive a day attribute (that is, df["Day"] = df["Dates"].dt.day). Once I have this, do I just need to perform the necessary arithmetic over the days in each month of each year?
Please help, I am struggling with this quite a bit.
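Not a full answer, but a hedged sketch of the counting step, assuming the attrition rate for a month can be approximated as (True count in the month) / (False count in the month); the "False at the beginning of the period" part would need a shifted denominator if that is your exact definition:

```python
import random
import pandas as pd

random.seed(0)
dates = pd.date_range(start="2015-06-02", end="2022-05-02", freq="3D")
boolean = [bool(random.randint(0, 1)) for _ in range(len(dates))]
df = pd.DataFrame({"Dates": dates, "Boolean": boolean})

# True/False counts per calendar month
monthly = (
    df.groupby([df["Dates"].dt.year.rename("Year"),
                df["Dates"].dt.month.rename("Month"),
                "Boolean"])
      .size()
      .unstack(fill_value=0)
)

# hypothetical attrition rate per month; swap the denominator for a
# shifted count if "False at the beginning of the period" is required
monthly["attrition"] = monthly[True] / monthly[False]
```

If the denominator should come from the start of each period, `monthly[False].shift(1)` would give the previous month's False count instead.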

Resampling a time series

I have a 40-year time series in the format stn;yyyymmddhh;rainfall, where yyyy = year, mm = month, dd = day, hh = hour. The series is at hourly resolution. I extracted the maximum value for each year with the following groupby method:
import pandas as pd

df = pd.read_csv('data.txt', delimiter=";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now I am trying to extract the maximum values for a 3-hour duration in each year. I tried the sliding-maxima approach below, but it is not working (k is the duration I am interested in). In simple words, I need the maximum precipitation sum for multiple durations (e.g. 3 h, 6 h) in every year.
class AMS:
    def sliding_max(self, k, data):
        tp = data.values
        period = 24 * 365
        agg_values = []
        start_j = 1
        end_j = k * int(np.floor(period / k))
        for j in range(start_j, end_j + 1):
            start_i = j - 1
            end_i = j + k + 1
            agg_values.append(np.nansum(tp[start_i:end_i]))
        self.sliding_max = max(agg_values)
        return self.sliding_max
Any suggestions or improvements to my code, or is there a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert the column containing the datetimes to a Series of dtype datetime, by providing the format of your datetimes (note the lowercase %m for month; an uppercase %M would mean minutes):
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%m%d%H")
After you have the correct data type, set that column as the index; you can then use pandas' time-series functionality (resampling, in your case).
First resample the data to 3-hour windows and sum the values within each window. From that, resample to yearly data and take the maximum of all the 3-hour windows in each year.
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
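As a self-contained check of the two-step resample, here is a toy version mimicking the first day of the sample data above (the real file spans 40 years):

```python
import numpy as np
import pandas as pd

# 48 hourly values, zeros except the 0.4/0.6/0.1 run at hours 09-11
idx = pd.date_range("1981-01-01 00:00", periods=48, freq="H")
df = pd.DataFrame({"rainfall": np.zeros(48)}, index=idx)
df.iloc[9:12, 0] = [0.4, 0.6, 0.1]

# 3-hour sums, then the yearly maximum of those sums
result = df.resample("3H").sum().resample("Y").max()
print(result)
```

The 09:00-11:00 window sums to 1.1, matching the output above.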

Taking percentile in Python along 3rd dimension

I've been struggling with this one for a bit now. I have a matrix that is 55115 x 34, where each entry along the first dimension is one day, for 151 years, totaling 55115 points.
I am trying to get monthly percentiles of the values along the first dimension. I have added a date column and grouped by month, but I cannot figure out the best way to take the 95th percentile over both the days and the third dimension (size 34 here). After grouping by month the array should be 151 x 12 x 34, and I want the 95th percentile along the third dimension, so my final matrix would be 151 x 12, in theory. Below is what I have so far to add the dates to the array:
dates = pd.date_range(start='1950-01-01', end='2100-12-31', freq='D')  # daily date range from 1950 to 2100
leap = []
for each in dates:
    if each.month == 2 and each.day == 29:  # find each leap day (Feb 29)
        leap.append(each)
dates = dates.drop(leap)  # get rid of leap days
dates = pd.to_datetime(dates)  # convert to datetime format
data = {'wind': winddata, 'time': dates}  # table with both dates and data
df = pd.DataFrame(data)  # create dataframe
df.set_index('time')  # index time
df.groupby(df['time'].dt.strftime('%b'))['wind'].sort_values()
And this is what I have to take the percentile:
months = df.groupby(pd.Grouper(key='time',freq = "M")) #group each month
monthly_percentile = months.aggregate(lambda x: np.percentile(x, q = 95)) #percentile across each month
This does not appear to work, though. I'm open to other methods: I'm hoping to (a) rearrange the 55115 x 34 data set into months, so that it is 151 (years) x 365 (days) x 34 (ensembles), and then (b) take the percentile across the months and the third dimension, ending up with 151 x 12. I'm happy to clarify anything if I did not specify it well enough. Any detailed response would be really helpful. Thank you so much in advance!
If I understand your question correctly, the most straightforward solution I can think of is to add year and month columns, group by them, and compute the required percentile:
import pandas as pd
import numpy as np
dates = pd.date_range(start='1950-01-01', end='2100-12-31', freq='D')
dates_months = [date.month for date in dates]
dates_years = [date.year for date in dates]
values = np.random.rand(34, len(dates))
df = pd.DataFrame()
df['date'] = dates
df['year'] = dates_years
df['month'] = dates_months
for i in range(34):
    df[f'values_{i}'] = values[i]
df = df.melt(id_vars=['date', 'year', 'month'], value_vars=[f'values_{i}' for i in range(34)])
sub = df.groupby(['year', 'month']).value.apply(lambda x: np.quantile(x, .95)).reset_index()
Finally, if you really need a 151 x 12 array instead of a year-month-percentile table of length 1812 (= 151 * 12), you could use something like this (note the melted column is named 'value', and each (year, month) cell holds exactly one value):
crosstab = pd.crosstab(index=sub['year'], columns=sub['month'], values=sub['value'], aggfunc='first')
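A smaller end-to-end sketch of the same idea, with 3 ensemble members and two years instead of 34 and 151 (column names hypothetical), using pivot for the final year x month grid:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range("1950-01-01", "1951-12-31", freq="D")
values = np.random.rand(3, len(dates))  # 3 ensemble members

df = pd.DataFrame({"date": dates, "year": dates.year, "month": dates.month})
for i in range(3):
    df[f"values_{i}"] = values[i]

# long format: one row per (day, ensemble member)
long = df.melt(id_vars=["date", "year", "month"],
               value_vars=[f"values_{i}" for i in range(3)])

# 95th percentile over all days and members in each (year, month)
sub = long.groupby(["year", "month"])["value"].quantile(0.95).reset_index()

# one row per year, one column per month
grid = sub.pivot(index="year", columns="month", values="value")
print(grid.shape)  # (2, 12)
```

`pivot` works here because the groupby leaves exactly one value per (year, month) pair, so no aggregation is needed.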

How to groupby a custom time range in xarray?

I have a DataArray with date, x, and y coordinate dims.
date: 265, y: 1458, x: 1159
For each year, over the course of 35 years, there are about 10 arrays with a date and x and y dims.
I'd like to groupby a custom annual season to integrate values for each raster over that custom annual season (maybe October to April for each year). The xarray docs show that I can do something like this:
arr.groupby("date.season")
which results in
DataArrayGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
But this groups by season over the entire 35 year record and I can't control the start and ending months that are used to group.
Similarly this does not quite get at what I want:
all_eeflux_arr.groupby("date.year")
DataArrayGroupBy, grouped over 'year'
36 groups with labels 1985, 1986, 1987, ..., 2019, 2020.
The start and the ending date for each year is automatically January/December.
I'd really like to be able to groupby an arbitrary period with a start time and an end time in units of day of year.
If this grouping can also discard dates that fall outside the grouping window, even better, since I may want to group by a day of year range that doesn't span all months.
The DataArray.resample method seems to be able to select a custom offset (see this SO post) for the start of the year, but I can't figure out how to do the same with groupby. I can't use resample because it does not return a DataArray, and I need to call the .integrate xarray method on each group's DataArray (across the date dim) to get custom annual totals.
Data to reproduce (2Gb, but can be tested on a subset):
https://ucsb.box.com/s/zhjxkqje18m61rivv1reig2ixiigohk0
Code to reproduce
import rioxarray as rio
import xarray as xr
import numpy as np
from pathlib import Path
from datetime import datetime
all_scenes_f_et = Path('/home/serdp/rhone/rhone-ecostress/rasters/eeflux/PDR')
all_pdr_et_paths = list(all_scenes_f_et.glob("*.tif"))
def eeflux_path_date(path):
    year, month, day, _ = path.name.split("_")
    return datetime(int(year), int(month), int(day))

def open_eeflux(path, da_for_match):
    data_array = rio.open_rasterio(path)  # chunks makes it lazily executed
    data_array.rio.reproject_match(da_for_match)
    data_array = data_array.sel(band=1).drop("band")  # drop the old coordinate dimension since bands need unique coord ids
    data_array["date"] = eeflux_path_date(path)  # make a new coordinate
    return data_array.expand_dims({"date": 1})  # make this coordinate a dimension
da_for_match = rio.open_rasterio(all_pdr_et_paths[0])
daily_eeflux_arrs = [open_eeflux(path, da_for_match) for path in all_pdr_et_paths]
all_eeflux_arr = xr.concat(daily_eeflux_arrs, dim="date")
all_eeflux_arr = all_eeflux_arr.sortby("date")
### not sure what should go here
all_eeflux_arr.groupby(????????).integrate(dim="date", datetime_unit="D")
Advice is much appreciated!
I ended up writing a function that works well enough. Since my dataset isn't that large the integration doesn't take very long to run in a for loop that iterates over each group.
def group_by_custom_doy(all_eeflux_arr, doy_start, doy_end):
    ey = max(all_eeflux_arr['date.year'].values)
    sy = min(all_eeflux_arr['date.year'].values)
    start_years = range(sy, ey)
    end_years = range(sy + 1, ey + 1)
    start_end_years = list(zip(start_years, end_years))
    water_year_arrs = []
    for water_year in start_end_years:
        start_mask = ((all_eeflux_arr['date.dayofyear'].values > doy_start) & (all_eeflux_arr['date.year'].values == water_year[0]))
        end_mask = ((all_eeflux_arr['date.dayofyear'].values < doy_end) & (all_eeflux_arr['date.year'].values == water_year[1]))
        water_year_arrs.append(all_eeflux_arr[start_mask | end_mask])
    return water_year_arrs
water_year_arrs = group_by_custom_doy(all_eeflux_arr, 125, 300)
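To then get the custom annual totals, each returned group can be integrated along the date dim. A standalone sketch with synthetic data (dimension names and sizes are made up for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy raster stack: one 2x2 "raster" every 10 days for three years
dates = pd.date_range("2000-01-01", "2002-12-31", freq="10D")
arr = xr.DataArray(np.ones((len(dates), 2, 2)),
                   coords={"date": dates}, dims=["date", "y", "x"])

def group_by_custom_doy(arr, doy_start, doy_end):
    sy = int(arr["date.year"].min())
    ey = int(arr["date.year"].max())
    water_year_arrs = []
    for y0, y1 in zip(range(sy, ey), range(sy + 1, ey + 1)):
        start_mask = (arr["date.dayofyear"].values > doy_start) & (arr["date.year"].values == y0)
        end_mask = (arr["date.dayofyear"].values < doy_end) & (arr["date.year"].values == y1)
        water_year_arrs.append(arr[start_mask | end_mask])
    return water_year_arrs

groups = group_by_custom_doy(arr, 125, 300)
# one integrated (y, x) raster per custom year
totals = [g.integrate("date", datetime_unit="D") for g in groups]
```

Each element of `totals` is a (y, x) DataArray holding that custom year's time-integrated values.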

Dataset statistics with custom begin of the year

I would like to do some annual statistics (cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible, and the time series contains leap years.
I tried e.g. the following:
import numpy as np
import pandas as pd
import xarray as xr

rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so the get an offset of one day per leap year and
(2) the beginning of the first year (until end of June) is appended to the end of the rolled time series, which creates some "fake year" where the cumulative sums doesn't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use a xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()
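The list comprehension can also be vectorized. A small sketch on the same toy data (the 15 September flip date is just the example's choice) that checks the cumulative sum really restarts at the custom year boundary:

```python
import numpy as np
import pandas as pd
import xarray as xr

dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})

# vectorized group ids: the custom year flips on 15 September
t = foo.indexes['time']
flip = (t.month > 9) | ((t.month == 9) & (t.day >= 15))
my_years = xr.DataArray(t.year + flip.astype(int), dims='time',
                        name='my_years', coords={'time': dr})

foo_cumsum = foo.groupby(my_years).map(lambda x: x.cumsum(dim='time', skipna=True))
print(float(foo_cumsum['data'].max()))  # 366.0 -- the 2015/16 custom year spans a leap year
```

Because the data is all ones, the maximum cumulative sum equals the longest custom year's length in days, which confirms the grouping handles the leap year correctly.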
