I have a dataframe with 100 keys (column 1) and 6 months of weekly data (from Jan to June, in columns named like 2019_Jan_Week1, 2019_Jan_Week2, etc., through June). The goal is to forecast the next 3 months (July to September) using a simple moving average of the last 6 months. For instance, the July Week1 forecast should be the moving average of 2019_Jan_Week1, 2019_Feb_Week1, 2019_Mar_Week1, 2019_Apr_Week1, 2019_May_Week1, and 2019_Jun_Week1.
The question is how to compute this operation efficiently. Currently I am using a for loop, which takes a huge amount of time:
counter = 1
for keyIndex in range(0, len(finalForecastingData)):
    print(keyIndex)
    for forcastingMonthsIndex in range(31, columns):
        finalForecastingData.iloc[keyIndex, forcastingMonthsIndex] = (
            finalForecastingData.iloc[keyIndex, counter]
            + finalForecastingData.iloc[keyIndex, counter + 5]
            + finalForecastingData.iloc[keyIndex, counter + 10]
            + finalForecastingData.iloc[keyIndex, counter + 15]
            + finalForecastingData.iloc[keyIndex, counter + 20]
        )
        counter = counter + 1
    counter = 1
Welcome to Stack Overflow.
You can very easily get a rolling mean with pandas' .rolling('60D').mean(). For that you need to convert the time data to pandas datetime format with pd.to_datetime() and set it as the index with set_index().
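A minimal sketch of that approach on made-up long-format data (one row per date and value; the column names here are only for illustration):

```python
import pandas as pd

# Hypothetical weekly observations for a single key, in long format.
df = pd.DataFrame({
    "date": pd.date_range("2019-01-07", periods=26, freq="7D"),
    "value": range(26),
})

# Convert the time column to datetime and use it as the index, then
# take a time-based rolling mean over the trailing 60 days.
df["date"] = pd.to_datetime(df["date"])
rolled = df.set_index("date")["value"].rolling("60D").mean()
print(rolled.tail())
```

Because the window is given as a time offset rather than a row count, the rolling mean automatically handles irregular spacing and partial windows at the start of the series.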
You should also check out https://stackoverflow.com/help/minimal-reproducible-example. It really helps to give example data; with a sample DataFrame it becomes much easier to give you concrete code rather than just a pointer in the right direction.
Related
The question can be reframed as "How do I remove daily seasonality from a dataset in Python?"
Please read the following:
I have a time series and have used seasonal_decompose() from statsmodels to remove seasonality from the series. As I applied seasonal_decompose() to monthly data, I get the seasonality only in months. How do I convert these months into days/dates? Can I use seasonal_decompose() to remove daily seasonality? I tried setting frequency=365, but it raises the following error:
x must have 2 complete cycles requires 730 observations. x only has 24 observation(s)
Snippet of the code:
grp_month = train.append(test).groupby(data['Month']).sum()['Var1']
season_result = seasonal_decompose(grp_month, model='additive', period=12)
This gives me the output:
Month
2018-01-01   -17707.340278
2018-02-01   -49501.548611
2018-03-01   -28172.590278
...
2019-12-01   -13296.173611
As you can see in the table, seasonal_decompose() gives me the monthly seasonality. Is there any way I can get the daily data from this? Or can I convert this into a date-wise series?
Edit:
I tried to remove daily seasonality as follows but I'm not really sure if this is the way to go.
period = 365
season_mean = data.groupby(data.index % period).transform('mean')
data -= season_mean
print(data.head())
If you want to subtract these values from a daily DataFrame, you should upsample season_result using pandas.DataFrame.resample; this way you will be able to subtract the monthly seasonality from your original series.
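A sketch of that upsampling, using a made-up monthly seasonal component (in practice it would come from the decomposition's seasonal output):

```python
import pandas as pd

# Hypothetical monthly seasonal component, one value per month start,
# as seasonal_decompose would produce it.
seasonal = pd.Series(
    [-17707.34, -49501.55, -28172.59],
    index=pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01"]),
)

# Upsample to daily frequency, forward-filling so each month's value
# is repeated for every day of that month. The result can then be
# subtracted from a daily series with a matching index.
daily_seasonal = seasonal.resample("D").ffill()
print(daily_seasonal.head())
```

Note that ffill() repeats each monthly value as a step function; if a smoother daily profile is wanted, interpolate() could be used on the resampled series instead.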
Using numpy.arange to create an array of dates which increase in 1-day intervals is straightforward and can be achieved with the code
np.arange(datetime(1985,7,1), datetime(2015,7,1), relativedelta(days=1)).astype(datetime)
However, I require an array of dates which increase in 1 year intervals. To do this, I cannot use
np.arange(datetime(1985,7,1), datetime(2015,7,1), relativedelta(days=365)).astype(datetime)
since this does not account for leap years, and I need the day and month of my dates to remain the same throughout.
Is there a way to achieve this using np.arange?
I wish to use numpy.arange since I am hoping to use @Mustafa Aydın's answer to my earlier question (https://stackoverflow.com/a/68032151/10346788) but with dates rather than with integers.
Specify only the year and the month in the datetime64, and set the interval to 1 year. For example, to generate all dates of March 10, from 1985 to 2015:
np.arange(np.datetime64("1985-03"), np.datetime64("2015-03"), np.timedelta64(1, "Y")) + np.timedelta64(9, "D")
array(['1985-03-10', '1986-03-10', '1987-03-10', '1988-03-10',
'1989-03-10', '1990-03-10', '1991-03-10', '1992-03-10',
'1993-03-10', '1994-03-10', '1995-03-10', '1996-03-10',
'1997-03-10', '1998-03-10', '1999-03-10', '2000-03-10',
'2001-03-10', '2002-03-10', '2003-03-10', '2004-03-10',
'2005-03-10', '2006-03-10', '2007-03-10', '2008-03-10',
'2009-03-10', '2010-03-10', '2011-03-10', '2012-03-10',
       '2013-03-10', '2014-03-10'], dtype='datetime64[D]')
Try
np.array([datetime(i,7,1) for i in range(1985,2015+1)])
EDIT: or just as a normal list - in case it does not have to be a numpy array:
[datetime(i,7,1) for i in range(1985,2015+1)]
I have a set of yearly cumulative data; specifically, the number of deaths per year over the last half-century in one region. The problem is that I do not know how to set up FBProphet to make a forecast on a yearly basis. For example, I have data like this (the number of deaths is the second column):
ds y
1950-01-01 1000
1951-01-01 1010
1952-01-01 1005
... ...
2009-01-01 2101
2010-01-01 2038
Assume that the pandas data frame is assigned to a variable mortality. Next, in order to simplify my problem, I take that the data is increasing linearly and I want a forecast for the next 10 years. Now, I am following the book, so to speak (I am using Python).
model = Prophet(growth="linear")
model.fit(mortality)
periods = 10
future = model.make_future_dataframe(periods=periods)
future_data = model.predict(future)
The issue is that I get a forecast for the next ten days: from 2010-01-02 to 2010-01-11. I do not know how to tell FBProphet to ignore days, i.e., to work only with years. As far as I know, dates in the 'ds' column must be in YYYY-MM-DD format.
One solution, but a really bad one, is to forecast 10 * 365 days ahead and then extract the rows that fall on Jan 1st. But that is really cumbersome: it takes a long time and uses a lot of memory. Furthermore, if the growth is logistic, or there are other special features to account for, I do not think this approach will work. The other solution is to transform the yearly data into daily data, i.e., into some 60 consecutive days without any seasonality to consider, then make the forecast and transform the result back into the appropriate form. But this is also cumbersome, and what if there is some seasonality? What if there is a rise and fall every ten years? I hope you understand what the issue is.
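One possible workaround worth sketching (an assumption to verify against your Prophet version, though predict generally accepts any dataframe with a 'ds' column): skip make_future_dataframe entirely and build a yearly future frame by hand. Only the frame construction is shown here; the Prophet call is left as a comment.

```python
import pandas as pd

# Build the future frame by hand: one row per Jan 1st, 2011-2020,
# instead of relying on make_future_dataframe's daily default.
future = pd.DataFrame(
    {"ds": pd.to_datetime([f"{year}-01-01" for year in range(2011, 2021)])}
)
print(future)

# With the fitted model from the question, this frame could then be
# passed directly:
# future_data = model.predict(future)
```

This keeps the 'ds' column in the usual YYYY-MM-DD form while restricting predictions to exactly the ten yearly dates of interest.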
I am working with different data sources and storing each one's data in a different dataframe. I want to unify those dataframes into one big one, but first I need to unify their indexes. Some of the dataframes' indexes follow the format YYYY-MM-DD, others the format YYYYTN with N = 1, 2, 3, 4, and the last the format YYYYMNN with NN from 01 to 12.
They represent, respectively, a date, a three-month period of the year, and a one-month period of the year. Mathematically it is quite easy to transform all of them to the first format, but I was wondering if there is some way to write this in Python so that I do not have to change all my data indexes manually. The indexes are just pieces of text, so I do not know how I could read YYYYTN and detect the digit after the T, for example.
Thank you in advance.
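A sketch of one way to do that detection with plain string handling (the helper name and the exact three label formats assumed here are taken from the description above):

```python
import pandas as pd

def to_timestamp(label: str) -> pd.Timestamp:
    """Convert a 'YYYY-MM-DD', 'YYYYTN' or 'YYYYMNN' label to a date."""
    if "-" in label:                      # already YYYY-MM-DD
        return pd.Timestamp(label)
    year = int(label[:4])
    if label[4] == "T":                   # quarter: map N=1..4 to its first month
        month = 3 * (int(label[5:]) - 1) + 1
    else:                                 # 'M' format: month number follows the M
        month = int(label[5:])
    return pd.Timestamp(year=year, month=month, day=1)

print(to_timestamp("2020T3"))   # first day of the third quarter
```

Applied with df.index.map(to_timestamp), this would give every dataframe a common datetime index before concatenating them.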
I finally managed to solve my issues. In case someone is interested, this is what I did:
For the indexes representing a period of 3 months, I used datetime and dateutil's relativedelta to create a list of dates from where my data began to where it ended, taking only the 1st of January, April, July and October. The data for the first 3 months of the year was stored under the 1st of January, and the rest of the data the same way. After that I added the list as a new column to the dataframe in order to define it as the index. Here is the code.
from datetime import date
from dateutil.relativedelta import relativedelta

def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta
I did not write this code; I got it from another question on this site, but I can no longer find it.
Trim20052020 = []
for result in perdelta(date(2005, 1, 1), date(2020, 12, 1), relativedelta(months=3)):
    Trim20052020.append(result)
This creates the set of dates, one every three months, from 2005 to the end of 2020.
df['index']=Trim20052020
df.set_index('index',inplace=True)
And finally I defined it as the index.
I have looked at the resample/Timegrouper functionality in Pandas. However, I'm trying to figure out how to use it for this specific case. I want to do a seasonal analysis across a financial asset - let's say S&P 500. I want to know how the asset performs between any two custom dates on average across many years.
Example: If I have a 10 year history of daily changes of S&P 500 and I pick the date range between March 13th and March 23rd, then I want to know the average change for each date in my range across the last 10 years - i.e. average change on 3/13 each year for the last 10 years, and then for 3/14, 3/15 and so on until 3/23. This means I need to groupby month and day and do an average of values across different years.
I can probably do this by creating 3 different columns for year, month, and day and then grouping by two of them, but I wonder if there are more elegant ways of doing this.
I figured it out. It turned out to be pretty simple and I was just being dumb.
x.groupby([x.index.month, x.index.day], as_index=True).mean()
where x is a pandas Series in my case (but I suppose it could also be a DataFrame?). This will return a multi-index Series, which is fine in my case; if it is not in yours, you can manipulate it to drop a level or turn the index into new columns.
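A sketch of that call on toy data, including the flattening step mentioned above (the series and column names here are made up):

```python
import pandas as pd

# Two years of daily values for a toy asset; same-calendar-day values
# across years get averaged together.
idx = pd.date_range("2019-01-01", "2020-12-31", freq="D")
x = pd.Series(range(len(idx)), index=idx)

# Average across years for each (month, day) pair.
seasonal = x.groupby([x.index.month, x.index.day], as_index=True).mean()

# If the multi-index is unwanted, turn it into ordinary columns.
flat = seasonal.reset_index()
flat.columns = ["month", "day", "mean_change"]
print(flat.head())
```

Note that Feb 29 appears in the result whenever the sample spans a leap year, with an average taken over fewer observations than the other days.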