datetime groupby/resample month and day across different years (drop the year) - python

I have looked at the resample/TimeGrouper functionality in Pandas, but I'm trying to figure out how to use it for this specific case. I want to do a seasonal analysis of a financial asset - let's say the S&P 500 - and find out how the asset performs, on average, between any two custom dates across many years.
Example: If I have a 10 year history of daily changes of S&P 500 and I pick the date range between March 13th and March 23rd, then I want to know the average change for each date in my range across the last 10 years - i.e. average change on 3/13 each year for the last 10 years, and then for 3/14, 3/15 and so on until 3/23. This means I need to groupby month and day and do an average of values across different years.
I can probably do this by creating 3 different columns for year, month, and day and then grouping by two of them, but I wonder if there are more elegant ways of doing this.

I figured it out. It turned out to be pretty simple and I was just being dumb.
x.groupby([x.index.month, x.index.day], as_index=True).mean()
where x is a pandas Series in my case (though I suppose it could also be a DataFrame). This returns a MultiIndex series, which is fine in my case; if it isn't in yours, you can manipulate it to drop a level or turn the index levels into new columns.
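As a sketch of that manipulation (the series here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series spanning ten years.
idx = pd.date_range("2011-01-01", "2020-12-31", freq="D")
x = pd.Series(np.random.default_rng(0).normal(size=len(idx)), index=idx)

# Average per (month, day) across all years -> MultiIndex series.
seasonal = x.groupby([x.index.month, x.index.day]).mean()

# Turn the two index levels into ordinary columns.
flat = seasonal.rename_axis(["month", "day"]).reset_index(name="avg_change")
```

Note that the result has 366 rows here, because the ten-year range contains leap years and so includes Feb 29.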

Related

Pandas - Distribute values for one day equally across the next week's days?

I have a data frame with a date column and a sales volume column.
On certain days I need to set the sales volume to zero and distribute that volume equally across the next five non-weekend days.
So if I have a volume of 100 on a Monday, each of the next five non-weekend days gets its own volume + (100/5).
I've tried some workarounds using date_range but haven't had any success. I'm very lost on how to do this since I'm not so good with datetime methods.
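One possible sketch (the frame, the dates, and the list of days to redistribute are all made up here): zero out the flagged day and add an equal share to each of the next five business days, found with pd.bdate_range.

```python
import pandas as pd

# Hypothetical frame: daily volumes indexed by date.
df = pd.DataFrame(
    {"volume": [100.0] * 10},
    index=pd.date_range("2021-03-01", periods=10, freq="D"),  # starts on a Monday
)
to_spread = [pd.Timestamp("2021-03-01")]  # days whose volume gets redistributed

for day in to_spread:
    share = df.loc[day, "volume"] / 5
    df.loc[day, "volume"] = 0.0
    # The next five non-weekend days strictly after `day`.
    targets = pd.bdate_range(day + pd.Timedelta(days=1), periods=5)
    df.loc[df.index.isin(targets), "volume"] += share
```

Here the Monday's 100 is spread over Tue-Fri and the following Monday; the weekend rows are untouched.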

calculating daily average across 30 years for different variables

I'm working with a dataframe that has daily information (measured data) across 30 years for different variables. I am trying to group by day of the year and then find a mean across the 30 years. How do I go about this? This is what the dataframe looks like.
I tried to group by day after checking the type of YYYYMMDD (it's int64). Now the dataframe looks like this; it has just added new columns for Day, Month, and Year.
I'm a bit stuck on how to calculate means from here. I would need to somehow group all Jan 1sts, Jan 2nds, etc. over the 30 years and average them.
You can groupby with month and day:
df.index = pd.to_datetime(df.index)
(df.groupby([df.index.month, df.index.day]).mean()
   .reset_index()
   .rename({'level_0': 'month', 'level_1': 'day'}, axis=1))
or, if you want a single day-of-year key, i.e. 1, 2, ..., 365, group by the dayofyear attribute instead:
df.groupby(df.index.dayofyear).mean()

AutoArima - Selecting correct value for m

So for argument's sake, here is an example of auto_arima for daily data:
auto_arima(df['orders'],seasonal=True,m=7)
Now, in that example, after running a seasonal decomposition that has shown weekly seasonality, I "think" you select 7 for m? Is this correct, since the seasonality is shown to be weekly?
My first question is as follows: if seasonality is monthly, do you use 12? If it is annual, do you use 1? And is there ever a reason to select 365 for daily?
Secondly, if the data you are given is already weekly, e.g.
date weekly tot
2021/01/01 - 10,000
2021/01/07 - 15,000
2021/01/14 - 9,000
and so on......
and you do the seasonal decomposition - would m=1 be used for weekly, m=4 for monthly, and m=52 for annual?
Finally, if it's monthly, like so:
date monthly tot
2020/01/01- 10,000
2020/02/01- 15,000
2020/03/01- 9,000
and so on......
and you do the seasonal decomposition - would m=1 be used for monthly and m=12 for annual?
Any help would be greatly appreciated, I just want to be able to confidently select the right criteria.
A season is a recurring pattern in your data, and m is the length of that season. m is not a code of any kind - it is simply the length:
Imagine the weather: if you had the weekly average temperature, it would rise in the summer and fall in the winter. Since the length of one "season" is a year, or 52 weeks, you set m to 52.
If you had a repeating pattern every quarter, then m would be 13, since a quarter is about 13 weeks. It always depends on your data and your use case.
To your questions:
If seasonality is Monthly do you use 12?
If the pattern you are looking for repeats every 12 months, yes; if it repeats every 3 months, it would be 3, and so on.
If it is Annually do you use 1?
A seasonality of 1 does not really make sense, since it would mean the pattern repeats in every single data point.
And is there ever a reason to select 365 for daily?
If your data is daily and the pattern repeats every 365 days (meaning every year) then yes (you need to remember that every fourth year has 366 days though).
I hope you get the concept behind seasonality and m so you can answer the rest.
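One way to sanity-check a candidate m before passing it to auto_arima (a rough sketch on synthetic data - the series and the candidate lags are made up): the lag with the strongest autocorrelation is a reasonable season length.

```python
import numpy as np
import pandas as pd

# Synthetic daily orders with a 7-day pattern plus noise.
rng = np.random.default_rng(42)
t = np.arange(200)
orders = pd.Series(100 + 20 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 2, 200))

# Autocorrelation at candidate season lengths; the peak suggests m.
candidates = [5, 6, 7, 8, 9]
acf = {m: orders.autocorr(lag=m) for m in candidates}
best_m = max(acf, key=acf.get)  # 7 for this series
```

Here best_m comes out as 7, matching the built-in weekly pattern, and that is the value you would hand to auto_arima as m.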

Problem with solely year by year prediction using FBProphet

I have a set of yearly cumulative data - specifically, the number of deaths per year over the last half-century in one region. The problem is that I do not know how to set up FBProphet to make a forecast on a yearly basis. For example, I have data like this (the number of deaths is the second column):
ds y
1950-01-01 1000
1951-01-01 1010
1952-01-01 1005
... ...
2009-01-01 2101
2010-01-01 2038
Assume that the pandas data frame is assigned to a variable mortality. Next, to simplify the problem, assume the data increases linearly and I want a forecast for the next 10 years. Now, I am following the book, so to speak (I am using Python).
model = Prophet(growth="linear")
model.fit(mortality)
periods = 10
future = model.make_future_dataframe(periods=periods)
future_data = model.predict(future)
The issue is that I get a forecast for the next ten days: 2010-01-02 through 2010-01-11. I do not know how to tell FBProphet to ignore days and focus only on years. As far as I know, dates in the 'ds' column must be in YYYY-MM-DD format.
One solution, but a really bad one, is to make a forecast for 10 * 365 days and then extract the dates falling on Jan 1st. But that is really cumbersome: it takes a lot of time and a lot of memory. Furthermore, if the growth is logistic, or there are other special things to take into account, I don't think this approach will work. The other solution is to transform the yearly data into daily data - say, 60 consecutive days with no seasonality to consider - make the forecast, and then transform the result back into the appropriate form. But this is also cumbersome, and what if there is some seasonality? What if there is a rise and fall every ten years? I hope you understand the issue.
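For what it's worth, Prophet's make_future_dataframe accepts a freq argument that it forwards to pandas, so the 10 * 365-day workaround shouldn't be necessary: something like model.make_future_dataframe(periods=10, freq=pd.DateOffset(years=1)) asks for ten yearly steps. A pure-pandas sketch of the dates such a call would generate (the last observed date is taken from the example data above):

```python
import pandas as pd

# Sketch of the future dates a yearly-frequency call would produce,
# starting from the last observed year in the example data.
last_observed = pd.Timestamp("2010-01-01")
future_dates = pd.date_range(
    start=last_observed, periods=11, freq=pd.DateOffset(years=1)
)[1:]  # drop the last observed point itself, keeping 10 future years
```

This yields 2011-01-01 through 2020-01-01, one row per year, which Prophet can then predict on directly.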

groupby nighttime with varying hours

I am trying to calculate night-time averages of a dataframe, except that what I need is a mix between a daily average and an hour-range average.
More specifically, I have a dataframe storing day and night hours, and I want to use it as a boolean key to calculate night-time averages of another dataframe.
I cannot use daily averages because each night spreads over two calendar days, and I cannot use an hour range either because the hours change by season.
Thanks for your help!
Dariush.
Based on comments received, here is what I am looking for. I need to calculate the average of 'Value' during nighttime using the Nighttime flag, and then repeat that average for all timestamps until the following night, at which point the average is updated and repeated until the next nighttime flag.
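A sketch of that logic with a made-up frame (the column names here are assumptions): give each night a group id that increments when the flag flips on, average within each night, and forward-fill the result over the following daytime rows.

```python
import pandas as pd

# Hypothetical data: a value column and a boolean nighttime flag.
df = pd.DataFrame({
    "value": [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0],
    "night": [True, True, False, False, True, True, False, False],
})

# A new night starts where the flag flips from False to True.
night_starts = df["night"] & ~df["night"].shift(fill_value=False)
night_id = night_starts.cumsum().where(df["night"])  # NaN outside nighttime

# Mean of 'value' within each night, broadcast back to its rows,
# then forward-filled until the next night's average replaces it.
df["night_avg"] = df.groupby(night_id)["value"].transform("mean").ffill()
```

Because the group key is built from the flag transitions rather than the calendar day, a night spanning midnight (or shifting with the season) still forms a single group.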
