I wanted to forecast stock prices using the ARIMA model (Autoregressive Integrated Moving Average) and plot the forecasted data over the actual and training data. I'm following this tutorial and have browsed others too, but they all use the same code. Here is the link to the tutorial for your reference: (https://www.analyticsvidhya.com/blog/2021/07/stock-market-forecasting-using-time-series-analysis-with-arima-model/)
# Forecast
fc, se, conf = fitted.forecast(216, alpha=0.05)  # 95% conf
I was expecting a graph that looks like this
Instead, an error message shows up: ValueError: too many values to unpack (expected 3)
please help :')
Edit: I tried doing that before and it produces an error message in the code that follows. My next lines of code are the following:
result = fitted.forecast(216, alpha=0.05)
# Make as pandas series
fc_series = pd.Series(result, index=test_data.index)
lower_series = pd.Series(result[:, 0], index=test_data.index)
upper_series = pd.Series(result[:, 1], index=test_data.index)
The error message: KeyError: 'key of type tuple not found and not a MultiIndex'
It seems that the forecast function no longer returns three values. This can happen if you are not using the same statsmodels version as the tutorial.
Please try something like:
result = fitted.forecast(216, alpha=0.05)
Then inspect result to check whether it contains all the data you need.
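For example, a quick way to see what the newer API returns (a sketch; fitted is assumed to be the fitted ARIMA results object from the tutorial):
result = fitted.forecast(216)
print(type(result))    # in recent statsmodels versions this is a pandas Series of point forecasts
print(result.head())
# the confidence intervals are obtained separately
conf_int = fitted.get_forecast(216).conf_int(alpha=0.05)
print(conf_int.head())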
Import the library:
import statsmodels.api as sm
Use the ARIMA model from sm.tsa:
model = sm.tsa.ARIMA(train_data, order=(1, 1, 1))
fitted = model.fit()
print(fitted.summary())
Use get_forecast and its summary_frame method to get the forecast together with the lower and upper confidence interval:
result = fitted.get_forecast(216).summary_frame(alpha=0.05)
print(result)
Make the pandas Series; don't forget .values, otherwise the mismatched indexes give you Series full of NaN.
fc_series = pd.Series(result['mean'].values, index=test_data.index)
lower_series = pd.Series(result['mean_ci_lower'].values, index=test_data.index)
upper_series = pd.Series(result['mean_ci_upper'].values, index=test_data.index)
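For completeness, these series can then be plotted over the actuals (a sketch; train_data and test_data are assumed from the question/tutorial):
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.plot(train_data, label='training')
plt.plot(test_data, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=0.15)
plt.legend(loc='upper left')
plt.show()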
I hope this helps you.
I did a time series forecasting analysis with ExponentialSmoothing in Python, using statsmodels.tsa.holtwinters.
model = ExponentialSmoothing(df, seasonal='mul', seasonal_periods=12).fit()
pred = model.predict(start=df.index[0], end=122)
plt.plot(df_fc.index, df_fc, label='Train')
plt.plot(pred.index, pred, label='Holt-Winters')
plt.legend(loc='best')
I want to get the confidence intervals of the model results, but I couldn't find any function for this in statsmodels.tsa.holtwinters.ExponentialSmoothing. How do I do that?
From this answer from a GitHub issue, it is clear that you should be using the new ETSModel class, and not the old (but still present for compatibility) ExponentialSmoothing.
ETSModel includes more parameters and more functionality than ExponentialSmoothing.
To calculate confidence intervals, I suggest you use the simulate method of ETSResults:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
import pandas as pd
# Build model.
ets_model = ETSModel(
endog=y, # y should be a pd.Series
seasonal='mul',
seasonal_periods=12,
)
ets_result = ets_model.fit()
# Simulate predictions.
n_steps_prediction = y.shape[0]
n_repetitions = 500
df_simul = ets_result.simulate(
nsimulations=n_steps_prediction,
repetitions=n_repetitions,
anchor='start',
)
# Calculate confidence intervals.
upper_ci = df_simul.quantile(q=0.9, axis='columns')
lower_ci = df_simul.quantile(q=0.1, axis='columns')
Basically, by calling the simulate method you get a DataFrame with n_repetitions columns and n_steps_prediction steps (in this case, the same number of items as in your training data set y).
Then, you calculate the confidence intervals with DataFrame quantile method (remember the axis='columns' option).
You could also calculate other statistics from the df_simul.
I also checked the source code: simulate is internally called by the forecast method to predict steps in the future. So, you could also predict steps in the future and their confidence intervals with the same approach: just use anchor='end', so that the simulations will start from the last step in y.
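For instance, a minimal sketch of forecasting future steps and their intervals this way (it reuses ets_result and n_repetitions from above; the 24-step horizon is just an example):
n_forecast = 24
simulated = ets_result.simulate(
    nsimulations=n_forecast,
    repetitions=n_repetitions,
    anchor='end',  # start the simulations right after the last observation in y
)
forecast_mean = ets_result.forecast(n_forecast)
forecast_upper = simulated.quantile(q=0.975, axis='columns')
forecast_lower = simulated.quantile(q=0.025, axis='columns')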
To be fair, there is also a more direct approach to calculate the confidence intervals: the get_prediction method (which uses simulate internally). But I do not really like its interface; it is not flexible enough for me, and I did not find a way to specify the desired confidence intervals. The approach with the simulate method is pretty easy to understand and very flexible, in my opinion.
If you want further details on how this kind of simulations are performed, read this chapter from the excellent Forecasting: Principles and Practice online book.
Complementing the answer from @Enrico, we can use get_prediction in the following way:
ci = model.get_prediction(start=forecast_data.index[0], end=forecast_data.index[-1])
preds = ci.pred_int(alpha=0.05)  # confidence interval
limits = ci.predicted_mean
preds = pd.concat([limits, preds], axis=1)
preds.columns = ['yhat', 'yhat_lower', 'yhat_upper']
preds
Implemented answer (by myself), complementing @Enrico's answer and using get_prediction in the following way:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
#--- sales: pd.Series, time series data (the index should be in datetime format)
#--- new, more advanced Holt-Winters time series model implementation
HWTES_Model = ETSModel(endog=sales, trend= 'mul', seasonal='mul', seasonal_periods=4).fit()
point_forecast = HWTES_Model.forecast(16)
#------- Confidence interval forecast calculation start -------
alpha_1 = 0.05  # significance level; not defined in the original snippet, assumed here
ci = HWTES_Model.get_prediction(start=point_forecast.index[0],
                                end=point_forecast.index[-1])
lower_conf_forecast = ci.pred_int(alpha=alpha_1).iloc[:, 0]
upper_conf_forecast = ci.pred_int(alpha=alpha_1).iloc[:, 1]
#------- Confidence interval forecast calculation end -------
To complement the previous answers, here is a function to plot the CI on top of the forecast.
def ets_forecast(model, h=8):
    # Simulate predictions.
    n_steps_prediction = h
    n_repetitions = 1000
    yhat = model.forecast(h)
    df_simul = model.simulate(
        nsimulations=n_steps_prediction,
        repetitions=n_repetitions,
        anchor='end',
    )
    # Calculate confidence intervals.
    upper_ci = df_simul.quantile(q=0.975, axis='columns')
    lower_ci = df_simul.quantile(q=0.025, axis='columns')
    plt.plot(yhat.index, yhat.values)
    plt.fill_between(yhat.index, lower_ci, upper_ci, color='blue', alpha=0.1)
    return yhat
plt.plot(y)
ets_forecast(model2, h=8)
plt.show()
#Initialize the model here
model = GFS(resolution='half', set_type='latest')
#the location I want to forecast the irradiance for, and also the timezone
latitude, longitude, tz = 15.134677754177943, 120.63806622424912, 'Asia/Manila'
start = pd.Timestamp(datetime.date.today(), tz=tz)
end = start + pd.Timedelta(days=7)
#pulling the data from the GFS
raw_data = model.get_processed_data(latitude, longitude, start, end)
raw_data = pd.DataFrame(raw_data)
data = raw_data
#Description of the PV system we are using
system = PVSystem(surface_tilt=10, surface_azimuth=180, albedo=0.2,
module_type = 'glass_polymer',
module=module, module_parameters=module,
temperature_model_parameters=temperature_model_parameters,
modules_per_string=24, strings_per_inverter=32,
inverter=inverter, inverter_parameters=inverter,
racking_model='insulated_back')
#Using the ModelChain
mc = ModelChain(system, model.location, orientation_strategy=None,
aoi_model='no_loss', spectral_model='no_loss',
temp_model='sapm', losses_model='no_loss')
mc.run_model(data);
mc.total_irrad.plot()
plt.ylabel('Plane of array irradiance ($W/m^2$)')
plt.legend(loc='best')
Here is the picture of it
I have actually been getting the same irradiance values for days now, so I believe something is wrong. I think there should be at least somewhat different values for each day.
I think the reason the days all look the same is that the forecast data predicts those days to be consistently overcast, so there's not necessarily anything "wrong" with the values being very similar across days -- it's just several cloudy days in a row. Take a look at raw_data['total_clouds'] and see how little variation there is for this forecast (nearly always 100% cloud cover). Also note that if you print the actual values of mc.total_irrad, you'll see that there is some minor variation day-to-day that is too small to appear on the plot.
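A quick way to check this yourself (a sketch; it reuses raw_data and mc from the question and assumes the standard poa_global column of mc.total_irrad):
# how much does the forecast cloud cover actually vary?
print(raw_data['total_clouds'].describe())
# daily plane-of-array totals: the day-to-day differences are real but small
print(mc.total_irrad['poa_global'].resample('1D').sum())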
I have a set of data from January 2012 to December 2014 that show some trend and seasonality. I want to make a prediction for the next 2 years (from January 2015 to December 2017), by using the Holt-Winters method from statsmodels.
The data set is the following one:
date,Data
Jan-12,153046
Feb-12,161874
Mar-12,226134
Apr-12,171871
May-12,191416
Jun-12,230926
Jul-12,147518
Aug-12,107449
Sep-12,170645
Oct-12,176492
Nov-12,180005
Dec-12,193372
Jan-13,156846
Feb-13,168893
Mar-13,231103
Apr-13,187390
May-13,191702
Jun-13,252216
Jul-13,175392
Aug-13,150390
Sep-13,148750
Oct-13,173798
Nov-13,171611
Dec-13,165390
Jan-14,155079
Feb-14,172438
Mar-14,225818
Apr-14,188195
May-14,193948
Jun-14,230964
Jul-14,172225
Aug-14,129257
Sep-14,173443
Oct-14,188987
Nov-14,172731
Dec-14,211194
Which looks as follows:
I'm trying to build the Holt-Winters model in order to improve the prediction performance on the past data (that is, a new graph where I can see whether my parameters give a good prediction of the past) and later on forecast the next years. I made the prediction with the following code, but I'm not able to do the forecast.
# Data loading
data = pd.read_csv('setpoints.csv', parse_dates=['date'], index_col=['date'])
df_data = pd.DataFrame(data, columns=['Data'])
df_data['Data'].index.freq = 'MS'
train, test = df_data['Data'], df_data['Data']
model = ExponentialSmoothing(train, trend='add', seasonal='add', seasonal_periods=12).fit()
period = ['Jan-12', 'Dec-14']
pred = model.predict(start=period[0], end=period[1])
df_data['Data'].plot(label='Train')
test.plot(label='Test')
pred.plot(label='Holt-Winters')
plt.legend(loc='best')
plt.show()
Which looks like:
Does anyone know how to forecast it?
I think you are making a misconception here. You shouldn't use the same data for train and test. The test data are data points which your model "has not seen yet"; this way you can test how well your model is performing. So I used the last three months of your data as test.
As for the prediction, we can use different start and end points.
Also notice I used mul as the seasonal component, which performs better on your data:
# read in data and convert date column to MS frequency
df = pd.read_csv(data)
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index('date').asfreq('MS')
# split data in train, test
train = df.loc[:'2014-09-01']
test = df.loc['2014-10-01':]
# train model and predict
model = ExponentialSmoothing(train, seasonal='mul', seasonal_periods=12).fit()
#model = ExponentialSmoothing(train, trend='add', seasonal='add', seasonal_periods=12).fit()
pred_test = model.predict(start='2014-10-01', end='2014-12-01')
pred_forecast = model.predict(start='2015-01-01', end='2017-12-01')
# plot data and prediction
df.plot(figsize=(15,9), label='Train')
pred_test.plot(label='Test')
pred_forecast.plot(label='Forecast')
plt.legend()
plt.savefig('figure.png')
plt.show()
The data that I have is recorded hourly over the past 4 months. I am building a time series model and I've tried several methods so far (ARIMA, LSTMs, Prophet), but they can be quite slow for my task since I have to run the model on thousands of time series in different locations. So then I thought it might be interesting to transform it into a supervised problem and use regression.
I extracted 4 features from the univariate time series and its time index, namely: dayofweek, hour, daily average and hourly average. So at the moment I am using these 4 predictors but could possibly extract more (like beginning of the day, noon, etc.; also, if you have any other suggestions here, they are very welcome).
I've used XGBoost for the regression and here are parts of the code:
# XGB
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Functions needed
def convert_dates(x):
    x['date'] = pd.to_datetime(x['date'])
    #x['month'] = x['date'].dt.month
    #x['year'] = x['date'].dt.year
    x['dayofweek'] = x['date'].dt.dayofweek
    x['hour'] = x['date'].dt.hour
    #x['week_no'] = pd.to_numeric(x['date'].index.strftime("%V"))
    x.pop('date')
    return x
def add_avg(x):
    x['daily_avg'] = x.groupby(['dayofweek'])['y'].transform('mean')
    x['hourly_avg'] = x.groupby(['dayofweek', 'hour'])['y'].transform('mean')
    #x['monthly_avg'] = x.groupby(['month'])['y'].transform('mean')
    #x['weekly_avg'] = x.groupby(['week_no'])['y'].transform('mean')
    return x
xgb_mape_r2_dict = {}
I then run a for loop in which I select a location and build the model for it. Here I split the data into a train and test part. I knew there might be problems due to the Easter holidays in my country last week because those are rare events so that is why I split the training and test data in that manner. So I actually consider the data from the beginning of the year up to two weeks ago as training data and the very next week after that as test data.
for j in range(10,20):
    data = df_all.loc[df_all['Cell_Id'] == top_cells[j]]
    data.drop(['Cell_Id', 'WDay'], axis=1, inplace=True)
    data['date'] = data.index
    period = 168
    data_train = data.iloc[:-2*period, :]
    data_test = data.iloc[-2*period:-period, :]
    data_train = convert_dates(data_train)
    data_test = convert_dates(data_test)
    data_train.columns = ['y', 'dayofweek', 'hour']
    data_test.columns = ['y', 'dayofweek', 'hour']
    data_train = add_avg(data_train)
    daily_avg = data_train.groupby(['dayofweek'])['y'].mean().reset_index()
    hourly_avg = data_train.groupby(['dayofweek', 'hour'])['y'].mean().reset_index()
Now, for the test data I add the past averages, namely the 7 daily averages and the 168 hourly averages computed from the training data. This is actually the part that takes the longest to run, and I would like to improve its efficiency.
    value_dict = {}
    for k in range(168):
        value_dict[tuple(hourly_avg.iloc[k])[:2]] = tuple(hourly_avg.iloc[k])[2]
    data_test['daily_avg'] = 0
    data_test['hourly_avg'] = 0
    for i in range(len(data_test)):
        data_test['daily_avg'][i] = daily_avg['y'][data_test['dayofweek'][i]]
        data_test['hourly_avg'][i] = value_dict[(data_test['dayofweek'][i], data_test['hour'][i])]
My current run time is about 30 seconds for every iteration of the for loop, which is way too slow because of the inefficient way I add the averages to the test data. I would really appreciate it if anyone could point out how I could implement this faster.
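For reference, the per-row loop above can be replaced with plain dictionary lookups (a sketch only, indented to sit inside the same for loop; the update at the end of this question solves the same problem with pandas merge):
    daily_lookup = daily_avg.set_index('dayofweek')['y'].to_dict()
    hourly_lookup = hourly_avg.set_index(['dayofweek', 'hour'])['y'].to_dict()
    data_test['daily_avg'] = data_test['dayofweek'].map(daily_lookup)
    data_test['hourly_avg'] = [hourly_lookup[(d, h)]
                               for d, h in zip(data_test['dayofweek'], data_test['hour'])]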
I will also add the rest of my code and make some other observations as well:
    x_train = data_train.drop('y', axis=1)
    x_test = data_test.drop('y', axis=1)
    y_train = data_train['y']
    y_test = data_test['y']
    def XGBmodel(x_train, x_test, y_train, y_test):
        matrix_train = xgb.DMatrix(x_train, label=y_train)
        matrix_test = xgb.DMatrix(x_test, label=y_test)
        model = xgb.train(params={'objective': 'reg:linear', 'eval_metric': 'mae'},
                          dtrain=matrix_train, num_boost_round=500,
                          early_stopping_rounds=20, evals=[(matrix_test, 'test')])
        return model
    model = XGBmodel(x_train, x_test, y_train, y_test)
    #submission = pd.DataFrame(x_pred.pop('id'))
    y_pred = model.predict(xgb.DMatrix(x_test), ntree_limit=model.best_ntree_limit)
    #submission['sales'] = y_pred
    y_pred = pd.DataFrame(y_pred)
    y_test = pd.DataFrame(y_test)
    y_test.reset_index(inplace=True, drop=True)
    compare_df = pd.concat([y_test, y_pred], axis=1)
    compare_df.columns = ['Real', 'Predicted']
    compare_df.plot()
    mape = (np.abs((y_test['y'] - y_pred[0]) / y_test['y']).mean()) * 100
    r2 = r2_score(y_test['y'], y_pred[0])
    xgb_mape_r2_dict[top_cells[j]] = [mape, r2]
I've used both R-squared and MAPE as accuracy measures, although I don't think MAPE is appropriate anymore since I've transformed the time series problem into a regression problem. Any thoughts on this?
Thank you very much for your time and consideration. Any help is very much appreciated.
Update: I have managed to fix the issue using pandas' merge. I first created two dataframes containing the daily averages and hourly averages from the training data, and then merged these dataframes with the test data:
data_test = merge(data_test, daily_avg, ['dayofweek'], 'daily_avg')
data_test = merge(data_test, hourly_avg, ['dayofweek', 'hour'], 'hourly_avg')
data_test.columns = ['y', 'dayofweek', 'hour', 'daily_avg', 'hourly_avg']
where we used the merge function defined as:
def merge(x, y, col, col_name):
    x = pd.merge(x, y, how='left', on=None, left_on=col, right_on=col,
                 left_index=False, right_index=False, sort=True,
                 copy=True, indicator=False, validate=None)
    x = x.rename(columns={'sales': col_name})
    return x
I can now run the model for 2000 locations per hour on a laptop with decent results but I will try to improve it while keeping it fast. Thank you very much once again.