How to generate "lower" and "upper" predictions, not just "yhat"?
import statsmodels
from statsmodels.tsa.arima.model import ARIMA
assert statsmodels.__version__ == '0.12.0'
arima = ARIMA(df['value'], order=order)
model = arima.fit()
Now I can generate "yhat" predictions
yhat = model.forecast(123)
and get confidence intervals for model parameters (but not for predictions):
model.conf_int()
but how to generate yhat_lower and yhat_upper predictions?
In general, the forecast and predict methods only produce point predictions, while the get_forecast and get_prediction methods produce full results including prediction intervals.
In your example, you can do:
forecast = model.get_forecast(123)
yhat = forecast.predicted_mean
yhat_conf_int = forecast.conf_int(alpha=0.05)
If your data is a Pandas Series, then yhat_conf_int will be a DataFrame with two columns, lower <name> and upper <name>, where <name> is the name of the Pandas Series.
If your data is a numpy array (or Python list), then yhat_conf_int will be an (n_forecasts, 2) array, where the first column is the lower part of the interval and the second column is the upper part.
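For intuition about what alpha controls: as far as I understand it, the interval for these forecasts is a Gaussian one built from the forecast mean and its standard error, so alpha=0.05 corresponds to a 95% interval. A minimal sketch with made-up numbers (the helper name, the mean and the standard error below are hypothetical and do not come from a fitted model):

```python
from statistics import NormalDist


def gaussian_interval(mean, se, alpha=0.05):
    """Return (lower, upper) of a two-sided (1 - alpha) Gaussian interval."""
    # Two-sided critical value, about 1.96 for alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mean - z * se, mean + z * se


# Hypothetical forecast mean and standard error, for illustration only
lower, upper = gaussian_interval(mean=100.0, se=5.0, alpha=0.05)
lower_90, upper_90 = gaussian_interval(mean=100.0, se=5.0, alpha=0.10)
```

A larger alpha gives a narrower interval, which is why alpha=0.10 yields a 90% band inside the 95% one.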
To generate prediction intervals rather than confidence intervals (a distinction you have rightly drawn, and which Hyndman's blog post on the difference between prediction intervals and confidence intervals also explains), you can follow the guidance in this answer.
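To make the distinction concrete, here is a small sketch for the simplest possible case, an i.i.d. sample rather than a time series. The confidence interval describes uncertainty about the mean, while the prediction interval describes where a new observation may fall. The sample values are made up, and the normal quantile is used as a large-sample stand-in for the exact t quantile:

```python
import math
from statistics import NormalDist, mean, stdev


def mean_ci_and_pi(sample, alpha=0.05):
    """Return ((ci_lower, ci_upper), (pi_lower, pi_upper)) for an i.i.d. sample."""
    n = len(sample)
    m = mean(sample)
    s = stdev(sample)  # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)  # approximation to the t quantile
    ci_half = z * s / math.sqrt(n)          # uncertainty about the mean only
    pi_half = z * s * math.sqrt(1 + 1 / n)  # adds the noise of a new observation
    return (m - ci_half, m + ci_half), (m - pi_half, m + pi_half)


sample = [9.0, 11.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0]  # made-up observations
ci, pi = mean_ci_and_pi(sample)
```

The prediction interval is always the wider of the two, and it does not shrink to zero as the sample grows.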
You could also try to compute bootstrapped prediction intervals, as laid out in this answer.
Below is my attempt at implementing this (I'll update it when I get the chance to check it in more detail):
from typing import Optional, Union

import numpy as np
import pandas as pd


def bootstrap_prediction_interval(y_train: Union[list, pd.Series],
                                  y_fit: Union[list, pd.Series],
                                  y_pred_value: float,
                                  alpha: float = 0.05,
                                  nbootstrap: Optional[int] = None,
                                  seed: Optional[int] = None):
    """
    Bootstraps a prediction interval around an ARIMA model's predictions.
    Method presented clearly here:
    - https://stats.stackexchange.com/a/254321
    Also found through here, though less clearly:
    - https://otexts.com/fpp3/prediction-intervals.html
    Can consider this to be a time-series version of the following generalisation:
    - https://saattrupdan.github.io/2020-03-01-bootstrap-prediction/
    :param y_train: List or Series of training univariate time-series data.
    :param y_fit: List or Series of model fitted univariate time-series data.
    :param y_pred_value: Float of the model predicted univariate time-series you want to compute P.I. for.
    :param alpha: float = 0.05, the prediction uncertainty.
    :param nbootstrap: Integer, the number of bootstrap samples of the residual forecast error.
        Defaults to int(sqrt(len(y_train))) when None.
        Rules of thumb provided here:
        - https://stats.stackexchange.com/questions/86040/rule-of-thumb-for-number-of-bootstrap-samples
    :param seed: Integer to specify if you want deterministic sampling.
    :return: A list [`lower`, `pred`, `upper`] with `pred` being the prediction
        of the model and `lower` and `upper` constituting the lower- and upper
        bounds for the prediction interval around `pred`, respectively.
    """
    # get number of samples
    n = len(y_train)
    # compute the forecast errors/resid; converting to arrays lets lists and
    # Series both work and makes the indexing below positional
    fe = np.asarray(y_train, dtype=float) - np.asarray(y_fit, dtype=float)
    # get percentile bounds
    percentile_lower = (alpha * 100) / 2
    percentile_higher = 100 - percentile_lower
    if nbootstrap is None:
        nbootstrap = int(np.sqrt(n))
    # default_rng(None) draws fresh entropy, so seed=None needs no special case
    rng = np.random.default_rng(seed)
    # bootstrap sample from errors (with replacement)
    error_bootstrap = []
    for _ in range(nbootstrap):
        idx = rng.integers(low=n)
        error_bootstrap.append(fe[idx])
    # get lower and higher percentiles of sampled forecast errors
    fe_lower = np.percentile(a=error_bootstrap, q=percentile_lower)
    fe_higher = np.percentile(a=error_bootstrap, q=percentile_higher)
    # compute P.I.
    pi = [y_pred_value + fe_lower, y_pred_value, y_pred_value + fe_higher]
    return pi
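For comparison, the same idea can be written without the explicit loop. This is only a sketch of a one-step residual bootstrap under the assumption that the in-sample errors are exchangeable; the function name, the residuals and the point prediction are all made up for illustration:

```python
import numpy as np


def residual_bootstrap_interval(residuals, y_pred, alpha=0.05, n_boot=1000, seed=None):
    """Return (lower, pred, upper) from resampled in-sample forecast errors."""
    rng = np.random.default_rng(seed)
    residuals = np.asarray(residuals, dtype=float)
    # Resample the errors with replacement in a single vectorised call
    draws = rng.choice(residuals, size=n_boot, replace=True)
    # Empirical percentiles of the resampled errors
    err_lower, err_upper = np.percentile(
        draws, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return y_pred + err_lower, y_pred, y_pred + err_upper


# Made-up in-sample errors and point prediction, for illustration only
residuals = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
lower, pred, upper = residual_bootstrap_interval(residuals, y_pred=50.0, seed=0)
```

The bounds can never extend beyond the smallest and largest observed residual, which is one limitation of this approach when the training set is short.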
With ARIMA you need to include seasonality and exogenous variables in the model yourself, whereas the SARIMA (seasonal ARIMA) and SARIMAX (which also handles exogenous factors) implementations report confidence intervals through summary_frame:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
dta = sm.datasets.sunspots.load_pandas().data[['SUNACTIVITY']]
dta.index = pd.Index(pd.date_range("1700", end="2009", freq="A"))
print("init data:\n")
print(dta)
dta.plot(figsize=(12,4));
plt.show()
##print("SARIMAX(dta, order=(2,0,0), trend='c'):\n")
result = sm.tsa.SARIMAX(dta, order=(2,0,0), trend='c').fit(disp=False)
print(">>> result.params:\n", result.params, "\n")
##print("SARIMA_model.plot_diagnostics:\n")
result.plot_diagnostics(figsize=(15,12))
plt.show()
# summary stats of residuals
print(">>> residuals.describe:\n", result.resid.describe(), "\n")
# Out-of-sample forecasts are produced using the forecast or get_forecast methods from the results object
# The get_forecast method is more general, and also allows constructing confidence intervals.
fcast_res1 = result.get_forecast()
# specify that we want a confidence level of 90% (alpha=0.10)
print(">>> forecast summary at alpha=0.10:\n", fcast_res1.summary_frame(alpha=0.10), "\n")
# plot forecast
fig, ax = plt.subplots(figsize=(15, 5))
# Construct the forecasts
fcast = result.get_forecast('2010Q4').summary_frame()
print(fcast)
fcast['mean'].plot(ax=ax, style='k--')
ax.fill_between(fcast.index, fcast['mean_ci_lower'], fcast['mean_ci_upper'], color='k', alpha=0.1);
fig.tight_layout()
plt.show()
docs: "The forecast above may not look very impressive, as it is almost a straight line. This is because this is a very simple, univariate forecasting model. Nonetheless, keep in mind that these simple forecasting models can be extremely competitive"
P.S. As noted here, "you can use it in a non-seasonal way by setting the seasonal terms to zero."
I did time series forecasting analysis with ExponentialSmoothing in python. I used statsmodels.tsa.holtwinters.
model = ExponentialSmoothing(df, seasonal='mul', seasonal_periods=12).fit()
pred = model.predict(start=df.index[0], end=122)
plt.plot(df_fc.index, df_fc, label='Train')
plt.plot(pred.index, pred, label='Holt-Winters')
plt.legend(loc='best')
I want to get a confidence interval for the model's predictions, but I couldn't find any function for this in statsmodels.tsa.holtwinters.ExponentialSmoothing. How do I do that?
From this answer from a GitHub issue, it is clear that you should be using the new ETSModel class, and not the old (but still present for compatibility) ExponentialSmoothing.
ETSModel includes more parameters and more functionality than ExponentialSmoothing.
To calculate confidence intervals, I suggest using the simulate method of ETSResults:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
import pandas as pd
# Build model.
ets_model = ETSModel(
endog=y, # y should be a pd.Series
seasonal='mul',
seasonal_periods=12,
)
ets_result = ets_model.fit()
# Simulate predictions.
n_steps_prediction = y.shape[0]
n_repetitions = 500
df_simul = ets_result.simulate(
nsimulations=n_steps_prediction,
repetitions=n_repetitions,
anchor='start',
)
# Calculate confidence intervals.
upper_ci = df_simul.quantile(q=0.9, axis='columns')
lower_ci = df_simul.quantile(q=0.1, axis='columns')
Basically, calling the simulate method you get a DataFrame with n_repetitions columns, and with n_steps_prediction steps (in this case, the same number of items in your training data-set y).
Then, you calculate the confidence intervals with DataFrame quantile method (remember the axis='columns' option).
You could also calculate other statistics from the df_simul.
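To show the quantile step in isolation, here is a sketch that uses random-walk paths as a stand-in for the output of simulate (rows are steps, columns are repetitions). The paths are synthetic and the variable names are my own; nothing here comes from a fitted ETS model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_repetitions = 12, 500

# Stand-in for the simulated paths: one random walk per column
steps = rng.normal(loc=0.0, scale=1.0, size=(n_steps, n_repetitions))
paths = 100.0 + np.cumsum(steps, axis=0)

# Quantiles across repetitions for every step; axis=1 here plays the role
# of axis='columns' in the pandas call
lower = np.quantile(paths, 0.1, axis=1)
median = np.quantile(paths, 0.5, axis=1)
upper = np.quantile(paths, 0.9, axis=1)
```

Note that the 0.1 and 0.9 quantiles give an 80% band; for a 95% band you would use 0.025 and 0.975 instead.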
I also checked the source code: simulate is internally called by the forecast method to predict steps in the future. So, you could also predict steps in the future and their confidence intervals with the same approach: just use anchor='end', so that the simulations will start from the last step in y.
To be fair, there is also a more direct approach to calculate the confidence intervals: the get_prediction method (which uses simulate internally). But I do not really like its interface, it is not flexible enough for me, I did not find a way to specify the desired confidence intervals. The approach with the simulate method is pretty easy to understand, and very flexible, in my opinion.
If you want further details on how this kind of simulations are performed, read this chapter from the excellent Forecasting: Principles and Practice online book.
Complementing the answer from @Enrico, we can use get_prediction in the following way:
pred = model.get_prediction(start=forecast_data.index[0], end=forecast_data.index[-1])
intervals = pred.pred_int(alpha=.05)  # 95% prediction interval (lower, upper)
yhat = pred.predicted_mean  # point predictions
preds = pd.concat([yhat, intervals], axis=1)
preds.columns = ['yhat', 'yhat_lower', 'yhat_upper']
preds
My own implementation, following @Enrico's answer and the get_prediction approach:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
# sales: pd.Series of time series data (the index should be a datetime index)
# Holt-Winters model via the newer ETSModel implementation
HWTES_Model = ETSModel(endog=sales, trend='mul', seasonal='mul', seasonal_periods=4).fit()
point_forecast = HWTES_Model.forecast(16)
# ------- Prediction interval calculation: start -------
# alpha_1 is the significance level and must be defined beforehand
# (for example, 0.05 would give a 95% interval)
ci = HWTES_Model.get_prediction(start=point_forecast.index[0],
                                end=point_forecast.index[-1])
pred_int = ci.pred_int(alpha=alpha_1)  # compute once, then split the columns
lower_conf_forecast = pred_int.iloc[:, 0]
upper_conf_forecast = pred_int.iloc[:, 1]
# ------- Prediction interval calculation: end -------
To complement the previous answers, I provide the function to plot the CI on top of the forecast.
def ets_forecast(model, h=8):
    # Simulate predictions.
    n_steps_prediction = h
    n_repetitions = 1000
    yhat = model.forecast(h)
    df_simul = model.simulate(
        nsimulations=n_steps_prediction,
        repetitions=n_repetitions,
        anchor='end',
    )
    # Calculate confidence intervals.
    upper_ci = df_simul.quantile(q=0.975, axis='columns')
    lower_ci = df_simul.quantile(q=0.025, axis='columns')
    plt.plot(yhat.index, yhat.values)
    plt.fill_between(yhat.index, lower_ci, upper_ci, color='blue', alpha=0.1)
    return yhat
plt.plot(y)
ets_forecast(model2, h=8)
plt.show()
I have a dataset with 4 years of sales and am trying to forecast sales for the next five years. I've split the dataset into 36 months as the training set and 12 months as the test set. I have chosen the Holt-Winters method and written the following code to test the model.
from statsmodels.tsa.api import ExponentialSmoothing
holt_winter = ExponentialSmoothing(np.asarray(train_data['Sales']), seasonal_periods=12, trend='add', seasonal='add')
hw_fit = holt_winter.fit()
hw_forecast = hw_fit.forecast(len(test_data))
plt.figure(figsize=(16,8))
plt.plot(train_data.index, train_data['Sales'], "b.-", label='Train Data')
plt.plot(test_data.index, test_data['Sales'], "ro-", label='Original Test Data')
plt.plot(test_data.index, hw_forecast, "gx-", label='Holt_Winter Forecast Data')
plt.ylabel('Score', fontsize=16)
plt.xlabel('Time', fontsize=16)
plt.legend(loc='best')
plt.title('Holt Winters Forecast', fontsize=20)
plt.show()
The code seems to work fine and probably predicts the test set correctly. However, I'm struggling to figure out how to write the code to predict sales for the next five years.
You could also try an ARIMA model, which often performs better. The code below tries every combination of the ARIMA parameters p, d and q (the autoregressive order, the degree of differencing and the moving-average order, respectively) and finds the best one by minimising the Akaike information criterion (AIC), which penalises the maximised likelihood by the number of parameters (i.e. it favours the best likelihood achieved with the fewest parameters):
# statsmodels.tsa.arima_model was deprecated in 0.12 and later removed
from statsmodels.tsa.arima.model import ARIMA
import itertools

# Grid Search
p = d = q = range(0, 3)  # p, d, and q can be either 0, 1, or 2
pdq = list(itertools.product(p, d, q))  # gets all possible combinations of p, d, and q
combs = {}  # stores aic and order pairs
aics = []  # stores aics
# Grid Search continued
for combination in pdq:
    try:
        model = ARIMA(train_data['Sales'], order=combination)  # create all possible models
        model = model.fit()
        combs.update({model.aic: combination})  # store combinations
        aics.append(model.aic)
    except Exception:  # skip orders that fail to fit
        continue
best_aic = min(aics)
best_order = combs[best_aic]  # order with the lowest AIC
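The criterion itself is simple: AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximised likelihood. A small sketch of the comparison the grid search performs, with made-up log-likelihoods and parameter counts (none of these numbers come from a real fit):

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (order, log-likelihood, parameter count) triples
candidates = [
    ((1, 0, 0), -120.0, 2),
    ((2, 1, 1), -118.5, 4),
    ((2, 1, 2), -118.4, 5),
]

# Score every candidate and keep the order with the lowest AIC
scores = {order: aic(ll, k) for order, ll, k in candidates}
best_order = min(scores, key=scores.get)
```

Here the richer models fit slightly better, but not by enough to pay for their extra parameters, so the simplest order wins.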
hw_fit.predict(start, end)
makes predictions from step start to step end, with step 0 being the first value of the training data.
forecast makes out-of-sample predictions. So these two are equivalent:
hw_fit.forecast(steps)
hw_fit.predict(len(train_data), len(train_data)+steps-1)
So, since your model was trained with a monthly step, if you want to forecast n months after the training data, you can call the methods above with steps=n
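For the numbers in the question (36 training months, five years ahead), the index arithmetic works out as follows. The helper name is my own, not part of statsmodels:

```python
def out_of_sample_range(n_train, steps):
    """Return the (start, end) positions for predict() that match forecast(steps)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Position n_train is the first step after the training data
    return n_train, n_train + steps - 1


# 36 training months, forecasting 5 years of monthly values
start, end = out_of_sample_range(n_train=36, steps=5 * 12)
```

Keep in mind that the forecast starts right after the training data, so with a 36-month training set the first 12 forecast months overlap the test set. To forecast five years beyond all 48 months, refit on the full dataset and use n_train=48.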
I have this plot
Now I want to add a trend line to it, how do I do that?
The data looks like this:
I wanted to just plot how the median listing price in California has gone up over the years so I did this:
# Get California data
state_ca = []
state_median_price = []
state_ca_month = []
for state, price, date in zip(data['ZipName'], data['Median Listing Price'], data['Month']):
    if ", CA" not in state:
        continue
    state_ca.append(state)
    state_median_price.append(price)
    state_ca_month.append(date)
Then I converted the string state_ca_month to datetime:
# Convert state_ca_month to datetime
state_ca_month = [datetime.strptime(x, '%m/%d/%Y %H:%M') for x in state_ca_month]
Then plotted it
# Plot trends
figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(state_ca_month, state_median_price)
plt.show()
I thought of adding a trendline or some type of line but I am new to visualization. If anyone has any other suggestions I would appreciate it.
Following the advice in the comments I get this scatter plot
I am wondering if I should further format the data to make a clearer plot to examine.
If by "trend line" you mean a literal line, then you probably want to fit a linear regression to your data. sklearn provides this functionality in python.
From the example hyperlinked above:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
To clarify, "the overall trend" is not a well-defined thing. Many times, by "trend", people mean a literal line that "fits" the data well. By "fits the data", in turn, we mean "predicts the data." Thus, the most common way to get a trend line is to pick a line that best predicts the data that you have observed. As it turns out, we even need to be clear about what we mean by "predicts". One way to do this (and a very common one) is by defining "best predicts" in such a way as to minimize the sum of the squares of all of the errors between the "trend line" and the observed data. This is called ordinary least squares linear regression, and is one of the simplest ways to obtain a "trend line". This is the algorithm implemented in sklearn.linear_model.LinearRegression.
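If you only need the line itself, an ordinary least squares fit of a straight line can also be obtained with numpy alone. A minimal sketch on synthetic data (the helper name and the data are made up; for dates on the x-axis you would first convert them to numbers, for example with matplotlib.dates.date2num):

```python
import numpy as np


def fit_trend_line(x, y):
    """Least-squares straight line through (x, y); returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


# Synthetic data lying exactly on y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

slope, intercept = fit_trend_line(x, y)
trend = slope * x + intercept  # values to pass to plt.plot(x, trend)
```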
Here is the link for the LMFIT implementation of the confidence intervals of parameters: http://lmfit.github.io/lmfit-py/confidence.html
Here is the code I am using:
import lmfit
import numpy as np
# x = np.linspace(1, 10, 250)
# np.random.seed(0)
# y = 1. -np.exp(-(x)/10.) + 0.1*np.random.randn(len(x))
pars = lmfit.Parameters()
pars.add_many(('n', 1.), ('tau', 3.))
# def residual(pars,data=None):
def residual(pars):
    v = pars.valuesdict()
    # if data is None:
    #     return 1.0 - np.exp(-(x**v['n'])/v['tau'])
    return 1.0 - np.exp(-(x**v['n'])/v['tau']) - y
# create Minimizer
mini = lmfit.Minimizer(residual, pars)
# first solve with Nelder-Mead
out1 = mini.minimize(method='Nelder')
out2 = mini.minimize(method='leastsq', params=out1.params)
lmfit.report_fit(out2.params, min_correl=0.5)
ci, trace = lmfit.conf_interval(mini, out2, sigmas=[0.95],
trace=True, verbose=False)
lmfit.printfuncs.report_ci(ci)
It is a bit difficult to understand the title, "Confidence interval for the data itself using lmfit in python" (there is no data), or the first sentence, "I am doing curve fitting using lmfit package" (you need data to fit).
I think what you are asking for is a way to get extreme values for the model function that best matches your data. If so, would it work to evaluate your function with all combinations of parameter values best +/- delta (where delta could be any uncertainty level you like), and take the extreme values of the model function? That's not very automated, but shouldn't be too hard.
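A rough sketch of that brute-force idea, using the model function from the question. The "best" values below are just the starting values from the question used as stand-ins, and the deltas are hypothetical; in practice both would come from the fit report:

```python
import itertools

import numpy as np


def model(x, n, tau):
    """Model function from the question."""
    return 1.0 - np.exp(-(x ** n) / tau)


def model_envelope(x, best, delta):
    """Evaluate the model at every combination of best - delta, best and
    best + delta, and return the pointwise minimum and maximum."""
    names = list(best)
    curves = []
    for signs in itertools.product((-1.0, 0.0, 1.0), repeat=len(names)):
        params = {name: best[name] + s * delta[name] for name, s in zip(names, signs)}
        curves.append(model(x, **params))
    curves = np.array(curves)
    return curves.min(axis=0), curves.max(axis=0)


x = np.linspace(1, 10, 50)
best = {"n": 1.0, "tau": 3.0}    # stand-ins for the fitted values
delta = {"n": 0.1, "tau": 0.3}   # hypothetical uncertainties
lower, upper = model_envelope(x, best, delta)
```

This ignores correlations between the parameters, so the resulting band is a crude envelope rather than a proper confidence band.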
How to perform stepwise regression in python? There are methods for OLS in SCIPY but I am not able to do stepwise. Any help in this regard would be a great help. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. The following link explains the objective:
https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk
Thanks again.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
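To give an idea of what forward selection looks like in code, here is a self-contained sketch that scores candidates by adjusted R-squared using numpy only. It is not the function from the link; the names and the synthetic data are my own assumptions:

```python
import numpy as np


def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def forward_select(X, y):
    """Greedily add the column that most improves adjusted R-squared."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:  # no candidate improves the model; stop
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score


# Synthetic data: only columns 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
selected, score = forward_select(X, y)
```

Swapping adjusted_r2 for another criterion (AIC, BIC, or a p-value rule) only changes the scoring function; the greedy loop stays the same.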
You could try mlxtend, which offers various selection methods.
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

clf = LinearRegression()
# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)
# Perform sequential forward selection (floating=False, so this is SFS rather than SFFS)
sfs1 = sfs1.fit(X_train, y_train)
You can make forward-backward selection based on statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use them with Pipeline and GridSearchCV.
The essential part of my code is as follows:
################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq, "predictors_index": feature_index}

################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # (the commented-out line selects by lowest bic instead)
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with model's other information
    return best_model

def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []
    M = min(fK, X_est.shape[1])
    for i in range(1, M + 1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']
    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
"""Importing the api class from statsmodels"""
# OLS is exposed by statsmodels.api; recent versions of
# statsmodels.formula.api only provide the lowercase formula interface (ols)
import statsmodels.api as sm
# X_opt has all the columns of independent variables of matrix X;
# in this case we have 5 independent variables
X_opt = X[:, [0, 1, 2, 3, 4]]
# Run OLS on X_opt and store the results in regressor_OLS
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check in your kernel the p values of your
variables written as 'P>|t|'. Then check for the variable with the highest p
value. Suppose x3 has the highest value e.g 0.956. Then remove this column
from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these steps until you have removed every column whose p-value is higher than the significance level (e.g. 0.05). In the end, X_opt will contain only the variables with p-values below the significance level.
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm: a fitted statsmodels OLS model of the response y on X, where X is an array of n ones (n being the number of data points) and y is the response in the training data
curr_preds: a list containing ['const']
potential_preds: a list of all potential predictors
tol (optional): the maximum p-value, .05 if not specified
There also needs to be a pandas DataFrame X_mix that holds all of the data, including 'const' and the columns corresponding to the potential predictors.
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    while len(potential_preds) > 0:
        index_best = -1  # this will record the index of the best predictor
        curr = -1  # this will record current index
        best_r_squared = lm.rsquared_adj  # record the r squared of the current model
        # loop to determine if any of the predictors can better the r-squared
        for pred in potential_preds:
            curr += 1  # increment current
            preds = curr_preds.copy()  # grab the current predictors
            preds.append(pred)
            lm_new = sm.OLS(y, X_mix[preds]).fit()  # create a model with the current predictors plus an additional potential predictor
            new_r_sq = lm_new.rsquared_adj  # record r squared for new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:  # a potential predictor improved the r-squared; remove it from potential_preds and add it to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:  # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # fit a new lm using the new predictors, look at the p-values
        pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
        pval_too_big = []
        # make a list of all the p-values that are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':  # if the pvalue is too large, add it to the list of big pvalues
                pval_too_big.append(feat)
        # now remove all the features from curr_preds that have a p-value that is too large
        for feat in pval_too_big:
            pop_index = curr_preds.index(feat)
            curr_preds.pop(pop_index)