Problem with solely year by year prediction using FBProphet


I have a set of yearly totals: the number of deaths per year over the last half century in one region. The problem is that I do not know how to set up FBProphet to make a forecast on a yearly basis. For example, I have data like this (the second column is the number of deaths):
ds y
1950-01-01 1000
1951-01-01 1010
1952-01-01 1005
... ...
2009-01-01 2101
2010-01-01 2038
Assume that the pandas data frame is assigned to a variable mortality. To simplify the problem, I assume that the data increase linearly, and I want a forecast for the next 10 years. I am doing this by the book, so to speak (I am using Python):
model = Prophet(growth="linear")
model.fit(mortality)
periods = 10
future = model.make_future_dataframe(periods=periods)
future_data = model.predict(future)
The issue is that I get a forecast for the next ten days, from 2010-01-02 to 2010-01-11, instead of the next ten years. I do not know how to tell FBProphet to ignore days and work only with years. As far as I know, the dates in the 'ds' column must be in the format YYYY-MM-DD.
One solution, but a really bad one, is to make a forecast for 10 * 365 days and then extract the rows for Jan 1st. That is cumbersome: it takes a lot of time and uses a lot of memory. Furthermore, if the growth is logistic, or there are other special things to take into account, I think this approach will not work. The other solution is to treat the yearly data as daily data, i.e., as some 60 consecutive days without any seasonality, make the forecast, and then transform the result back into the appropriate form. But this is also cumbersome, and what if there is some seasonality? What if there is a rise and fall every ten years? I hope the issue is clear.
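For what it is worth, Prophet's make_future_dataframe takes a freq argument (it defaults to daily), so passing a year-start frequency such as freq="YS" should give one future row per year instead of one per day. The sketch below reproduces only that date logic with pandas, so it runs without Prophet installed. yearly_future_frame is a hypothetical helper, not part of Prophet, and the sample frame holds just the last two rows shown in the question.

```python
import pandas as pd


def yearly_future_frame(history, periods, include_history=True):
    """Build a 'ds' frame with one row per year start after the last observed date.

    Hypothetical helper: it mimics what
    model.make_future_dataframe(periods=periods, freq="YS") is expected to return.
    """
    last = history["ds"].max()
    # First year start strictly after the last observation.
    start = last + pd.offsets.YearBegin(1)
    future = pd.DataFrame(
        {"ds": pd.date_range(start=start, periods=periods, freq="YS")}
    )
    if include_history:
        future = pd.concat([history[["ds"]], future], ignore_index=True)
    return future


# The last two rows from the question, used only as sample data.
mortality = pd.DataFrame({
    "ds": pd.to_datetime(["2009-01-01", "2010-01-01"]),
    "y": [2101, 2038],
})

future = yearly_future_frame(mortality, periods=10, include_history=False)
print(future)

# With Prophet installed, the equivalent calls would be (not executed here):
#   future = model.make_future_dataframe(periods=10, freq="YS")
#   forecast = model.predict(future)
```

Since the history has only one point per year, Prophet's default weekly and daily seasonalities cannot be estimated from it, so it is probably worth checking which seasonal components the fitted model actually contains.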

Related

How to convert to dates after removing the seasonality from the time series in python?

The question can be reframed as "How to remove daily seasonality from a dataset in Python?"
I have a time series and have used seasonal_decompose() from statsmodels to remove seasonality from it. Because I ran seasonal_decompose() on monthly data, I only get the seasonality per month. How do I convert these monthly values into days/dates? Can I use seasonal_decompose() to remove daily seasonality? I tried setting the period to 365, but that raises the following error:
x must have 2 complete cycles requires 730 observations. x only has 24 observation(s)
Snippet of the code:
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
grp_month = pd.concat([train, test]).groupby(data['Month']).sum()['Var1']
season_result = seasonal_decompose(grp_month, model='additive', period=12)
This gives me the output:
Month Out
2018-01-01 -17707.340278
2018-02-01 -49501.548611
2018-03-01 -28172.590278
.. ..
2019-12-01 -13296.173611
As you can see in the table, seasonal_decompose() gives me the monthly seasonality. Is there any way to get daily data from this, or to convert it into a date-wise series?
Edit:
I tried to remove daily seasonality as follows, but I am not sure this is the way to go.
period = 365
season_mean = data.groupby(data.index % period).transform('mean')
data -= season_mean
print(data.head())

If you want to subtract these values from a daily DataFrame, you should upsample the seasonal component in season_result to daily frequency using pandas.DataFrame.resample. You can then subtract the monthly seasonality from your original daily data.
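A minimal sketch of that upsampling step, under stated assumptions: the three monthly values are the ones shown in the question (rounded), while the daily series is made-up placeholder data. Each monthly value is simply repeated for every day of its month; whether it should instead be divided across the days depends on how the monthly figures were aggregated (sum or mean), which the question does not say.

```python
import pandas as pd

# Monthly seasonal component, indexed by month start (values from the question, rounded).
monthly_seasonal = pd.Series(
    [-17707.34, -49501.55, -28172.59],
    index=pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01"]),
)

# Placeholder daily series covering the same three months (assumption, not real data).
daily_index = pd.date_range("2018-01-01", "2018-03-31", freq="D")
daily = pd.Series(1000.0, index=daily_index)

# Upsample to daily frequency, repeating each month's value forward.
# resample() stops at the last monthly timestamp, so reindex to cover the rest of March.
seasonal_daily = (
    monthly_seasonal.resample("D").ffill().reindex(daily.index, method="ffill")
)

# Remove the (repeated) monthly seasonality from the daily data.
deseasonalized = daily - seasonal_daily
print(deseasonalized.head())
```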

AutoArima - Selecting correct value for m

For argument's sake, here is an example of auto_arima for daily data:
auto_arima(df['orders'],seasonal=True,m=7)
In that example, a seasonal decomposition has shown weekly seasonality, so I "think" you select 7 for m. Is this correct, given that the seasonality is weekly?
My first question is as follows: if the seasonality is monthly, do you use 12? If it is annual, do you use 1? And is there ever a reason to select 365 for daily data?
Secondly, if the data you are given is already weekly, e.g.
date weekly tot
2021/01/01 - 10,000
2021/01/07 - 15,000
2021/01/14 - 9,000
and so on......
and you do the seasonal decomposition, would you use m=1 for weekly, m=4 for monthly and m=52 for annual seasonality?
Finally, if it is monthly, like so:
date monthly tot
2020/01/01- 10,000
2020/02/01- 15,000
2020/03/01- 9,000
and so on......
and you do the seasonal decomposition, would you use m=1 for monthly and m=12 for annual seasonality?
Any help would be greatly appreciated. I just want to be able to select the right value confidently.

A season is a recurring pattern in your data, and m is the length of that season, counted in observations. m is not a code or anything like that; it is simply a length.
Imagine the weather: if you had the weekly average temperature, it would rise in the summer and fall in the winter. Since the length of one "season" is a year, or 52 weeks, you set m to 52.
If you had a pattern that repeats every quarter, then m would be 13, since a quarter is roughly 13 weeks. It always depends on your data and your use case.
To your questions:
If seasonality is Monthly do you use 12?
If your data is monthly and the pattern you are looking for repeats every 12 months, yes. If it repeats every 3 months, m would be 3, and so on.
If it is Annually do you use 1?
A seasonality of 1 does not really make sense, since it would mean that the pattern repeats at every single data point, i.e., there is no seasonal cycle at all.
And is there ever a reason to select 365 for daily?
If your data is daily and the pattern repeats every 365 days (meaning every year), then yes (keep in mind, though, that every fourth year has 366 days).
I hope this explains the concept behind seasonality and m, so that you can answer the rest yourself.
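The rule of thumb above (m is the number of observations in one seasonal cycle) can be sketched as a small lookup. seasonal_period and its table are hypothetical names for illustration only; they are not part of pmdarima, and the values are the usual approximations (for example 52 weeks or 365 days per year).

```python
# Number of observations per seasonal cycle, keyed by
# (sampling frequency of the data, length of the repeating pattern).
OBS_PER_CYCLE = {
    ("daily", "weekly"): 7,
    ("daily", "yearly"): 365,
    ("weekly", "quarterly"): 13,
    ("weekly", "yearly"): 52,
    ("monthly", "quarterly"): 3,
    ("monthly", "yearly"): 12,
}


def seasonal_period(data_freq, cycle):
    """Return a candidate m for the given data frequency and seasonal cycle."""
    if data_freq == cycle:
        # One observation per cycle: there is no within-cycle pattern to model.
        return 1
    try:
        return OBS_PER_CYCLE[(data_freq, cycle)]
    except KeyError:
        raise ValueError(f"no rule for {data_freq} data with a {cycle} cycle")


for freq, cycle in [("daily", "weekly"), ("weekly", "yearly"), ("monthly", "yearly")]:
    print(f"{freq} data, {cycle} pattern: m = {seasonal_period(freq, cycle)}")
```

The result would then be passed as m, for example auto_arima(df['orders'], seasonal=True, m=seasonal_period("daily", "weekly")). A value of 1 means there is nothing seasonal to fit, in which case a non-seasonal model is the natural choice.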

How to compute SMA for months based on weeks data?

I have a dataframe with 100 keys (column 1) and 6 months of data, from Jan to June, stored as columns named like 2019_Jan_Week1, 2019_Jan_Week2, etc. The goal is to forecast the next 3 months (July to Sep) using a simple moving average of the last 6 months. For instance, the July Week1 forecast should be the moving average of 2019_Jan_Week1, 2019_Feb_Week1, 2019_Mar_Week1, 2019_Apr_Week1, 2019_May_Week1 and 2019_Jun_Week1.
The question is how to compute this efficiently.
Currently I am using a for loop, but it takes a huge amount of time:
counter=1
for keyIndex in range(0,len(finalForecastingData)):
    print(keyIndex)
    for forcastingMonthsIndex in range(31,columns):
        finalForecastingData.iloc[keyIndex,forcastingMonthsIndex] = finalForecastingData.iloc[keyIndex,counter]+finalForecastingData.iloc[keyIndex,counter+5]+finalForecastingData.iloc[keyIndex,counter+10]+finalForecastingData.iloc[keyIndex,counter+15]+finalForecastingData.iloc[keyIndex,counter+20]
        counter = counter+1
    counter=1

Welcome to Stack Overflow.
You can get a rolling mean very easily with pandas: .rolling('60d').mean(). For that, you need to convert the time data into pandas datetime format with pd.to_datetime() and set it as the index with set_index().
You should also check out https://stackoverflow.com/help/minimal-reproducible-example. It really helps to include example data. With a sample DataFrame, it is much easier to give you concrete code rather than just a direction in which to look.
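A minimal sketch of the rolling-mean suggestion, assuming the data has first been reshaped into long format with one row per date. The column names and values below are made up for illustration, since the question gives no sample data.

```python
import pandas as pd

# Made-up long-format data: one observation per row (assumption, not the asker's data).
df = pd.DataFrame({
    "date": ["2019-01-07", "2019-02-04", "2019-03-04", "2019-04-01"],
    "value": [10.0, 20.0, 30.0, 40.0],
})

# Convert the strings to datetimes and use them as a sorted index,
# which is what a time-based rolling window requires.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").sort_index()

# Mean over the 60 days up to and including each row's date.
df["sma_60d"] = df["value"].rolling("60D").mean()
print(df)
```

With several keys in the same long table, the same idea would normally be applied per key, for example through a groupby on the key column before the rolling call.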

How I can extract a subset of months from all NetCDF files in one directory

I need to calculate the 90th percentile of temperature data for 1961-1990. I have 30 NetCDF files, and every file contains daily data for one year. I need to calculate the 90th percentile for a specific lat/lon point, using only the summer days pooled across all 30 years of daily data. I also need to account for the years in which February has 29 days. When I run my code, it only considers the first summer (summer 1961) and does not combine the summer days of all years.
import numpy as np
import xarray as xr

data = xr.open_mfdataset('/Tmax-2m/Control/*.nc')
time = data.variables['time']
lon = data.variables['lon'][:]
lat = data.variables['lat'][:]
tmax = data.variables['tmax'][:]
df = data.sel(lat=39.18,lon=-95.57, method='nearest')
# indices 151:243 cover 1 June to 31 August of the first year only
time2=df.variables['time'][151:243]
dg=df.sel(time=time2, method='nearest')
print(np.percentile(dg.tmax, 90))
I also tried the following, but it calculates a separate percentile for the summer of each year:
splits=[151,516,881,1247,1612,1977,2342,2708,3073,3438,3803,4169,4534,4899,5264,5630,5995,6360,6725,7091,7456,7821,8186,8552,8917,9282,9647,10013,10378,10743]
t0=92
result=[]
for i in splits:
    time3=df.variables['time'][i:(i+t0)]
    dg=df.sel(time=time3, method='nearest')
    result.append(np.percentile(dg.tmax, 90))
np.savetxt("percentile1.csv", result, fmt="%s")

Have you considered using CDO for this task? (If you are running Linux this is easy; on Windows, you will probably need to install it under Cygwin.)
You can merge the 30 files into one time series like this:
cdo mergetime file_y*.nc timeseries.nc
Here the * is a wildcard for the year (1961, 1962, etc.) in the filename, which I am assuming looks like file_y1961.nc, file_y1962.nc, etc. Adapt as appropriate. timeseries.nc is the output file.
Then calculate the seasonal percentiles like this:
cdo yseaspctl,90 timeseries.nc -yseasmin timeseries.nc -yseasmax timeseries.nc percen.nc
percen.nc will contain the seasonal percentiles, and you can extract the one for summer.
Further details here: https://code.mpimet.mpg.de/projects/cdo/
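If you would rather stay in Python, the same pooling can be done by selecting on the calendar month instead of on positional indices, which also takes care of leap years. The sketch below uses pandas with synthetic temperatures so that it is self-contained; the data and the choice of June to August as "summer" are assumptions. With xarray, the analogous selection would be along the lines of ds.sel(time=ds.time.dt.month.isin([6, 7, 8])).

```python
import numpy as np
import pandas as pd

# Synthetic daily tmax for 1961-1990 at one grid point (assumption: stands in
# for the values read from the 30 NetCDF files).
dates = pd.date_range("1961-01-01", "1990-12-31", freq="D")
rng = np.random.default_rng(0)
seasonal_cycle = 10 * np.sin(2 * np.pi * (dates.dayofyear.to_numpy() - 110) / 365.25)
tmax = pd.Series(20 + seasonal_cycle + rng.normal(0, 3, len(dates)), index=dates)

# Keep every June, July and August day from all 30 years. Selecting by month
# means leap years need no special handling.
summer = tmax[tmax.index.month.isin([6, 7, 8])]

# One 90th percentile over the pooled summer days.
p90 = np.percentile(summer.to_numpy(), 90)
print(len(summer), round(float(p90), 2))
```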

datetime groupby/resample month and day across different years (drop the year)

I have looked at the resample/Timegrouper functionality in Pandas. However, I'm trying to figure out how to use it for this specific case. I want to do a seasonal analysis across a financial asset - let's say S&P 500. I want to know how the asset performs between any two custom dates on average across many years.
Example: If I have a 10 year history of daily changes of S&P 500 and I pick the date range between March 13th and March 23rd, then I want to know the average change for each date in my range across the last 10 years - i.e. average change on 3/13 each year for the last 10 years, and then for 3/14, 3/15 and so on until 3/23. This means I need to groupby month and day and do an average of values across different years.
I can probably do this by creating 3 different columns for year, month, and day and then grouping by two of them, but I wonder if there are more elegant ways of doing this.

I figured it out. It turned out to be pretty simple and I was just being dumb.
x.groupby([x.index.month, x.index.day], as_index=True).mean()
where x is a pandas Series in my case (but I suppose it could also be a DataFrame). This returns a Series with a MultiIndex, which is fine in my case; if it is not in yours, you can drop a level or turn the index levels into new columns.
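A small self-contained illustration of that groupby, using made-up data in which each day's value is simply its year, so the averages are easy to check by hand. The level names month and day are added here for readability only.

```python
import pandas as pd

# Ten years of daily data; each value is just the year (made-up data).
idx = pd.date_range("2010-01-01", "2019-12-31", freq="D")
x = pd.Series(idx.year.to_numpy(dtype=float), index=idx)

# Average across years for every (month, day) pair.
by_day = x.groupby([x.index.month, x.index.day]).mean()
by_day.index.names = ["month", "day"]

# Pick a custom date range, here March 13th to March 23rd inclusive.
window = by_day.loc[(3, 13):(3, 23)]
print(window)

# Optional: flatten the MultiIndex into ordinary columns.
flat = window.reset_index(name="mean_change")
```

Note that February 29th is averaged over leap years only, so it rests on fewer observations than the other dates.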
