Applying a Quantile Loss Function to Optimize an Exponential Smoothing Model - python

I have tried posting elsewhere without luck, so I am going to try here as well.
I am trying to build a model that generates prediction intervals from a traditional Holt-Winters forecasting model. There is a built-in function in statsmodels that does this, but a few missing features prevent me from fully depending on it. I am looking for a critical eye and don't know where else to post.
What I did is use a quantile (pinball) loss function to optimize the smoothing parameters (level, trend and seasonality) of the model, generating a different forecast for each quantile value that I pass to the loss function.
First, is it good practice to optimize the parameters for a quantile?
Second, is the described process a viable way to get prediction intervals?
In the results, I observe no difference between predictions made with a quantile of 0.05 and a quantile of 0.95; worse, there are even time series where the 0.95 quantile gives lower values.
I am confused and wonder what could be wrong with the method, given that the model works just fine for regular point forecasts using a mean squared error loss function.
here is my quantile loss function:
def quantile_loss(q, y_p, y):
    a = np.where((y > y_p), q * (y - y_p), (y_p - y) * (1 - q))
    return a
and here is the code for the parameter optimization:
def HoltWinterLowHightimeseriesCVscore(params, quantile_values, data, loss_function=quantile_loss, slen=12):
    """
    Returns error on CV
    params - vector of parameters for optimization
    series - dataset with timeseries
    slen - season length for Holt-Winters model
    """
    # errors array
    errors = []
    values = data
    alpha, beta, gamma = params
    # set the number of folds for cross-validation
    tscv = TimeSeriesSplit(n_splits=3)
    # iterating over folds, train model on each, forecast and calculate error
    for train, test in tscv.split(values):
        model = HoltWintersLowHigh(series=values, slen=slen,
                                   alpha=alpha, beta=beta, gamma=gamma, n_preds=12)
        model.triple_exponential_smoothing()
        predictions = model.result[-len(test):]
        actual = values[test]
        error = loss_function(quantile, predictions, actual)
        errors.append(error)
    return np.mean(np.array(errors))
and finally here is the final part where the functions are called to make the predictions:
forecast = {}
for i in seasonal_profile_df.index:
    quantile_values = [0.92]
    if seasonal_profile_df['trend'].loc[i] == 'trending' and seasonal_profile_df['seasonality'].loc[i] == 'seasonal' and seasonal_profile_df['demand_level'].loc[i] == 'low' or seasonal_profile_df['variability'].loc[i] == 'high':
        index = pd.DatetimeIndex(new_df.index)
        series = pd.Series(data=new_df.iloc[:, i], index=index)
        print(len(series))
        data = series[:-10] #leave some data for testing
        x = [0, 0, 0]
        #for i in quantile:
        for j in quantile_values:
            quantile_values = j
            # Minimizing the loss function
            opt = minimize(HoltWinterLowHightimeseriesCVscore, x0=x,
                           args=(quantile_values, data, quantile_loss,),
                           method="TNC", bounds=((0, 1), (0, 1), (0, 1))
                           )
            alpha_final, beta_final, gamma_final = opt.x
            print(opt.x)
            ##and then predict according to the quantile specified #####
            model = HoltWintersLowHigh(series, slen=12,
                                       alpha=alpha_final,
                                       beta=beta_final,
                                       gamma=gamma_final,
                                       n_preds=12, scaling_factor=1.96)
            model.triple_exponential_smoothing()
            plotHoltWintersLowHigh(series, quantile_values)
            result = {"Id": seasonal_profile_df['Id'].loc[i]}
            result['results'] = model.result[-12:]
I am really hoping to get a fresh or more experienced eye on this. I feel like the process makes sense, but I cannot find a solid example in Python. I would like to know whether the approach itself is flawed or whether there is a problem in the code. Any feedback would be greatly appreciated!
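A few things in the code above look worth checking before judging the method itself. Inside HoltWinterLowHightimeseriesCVscore the loss is called with the name `quantile`, while the function's parameter is `quantile_values`; if a global `quantile` happens to exist, every run would optimize for that same value regardless of what is passed in, which would be consistent with the 0.05 and 0.95 results looking identical. The model in each fold is also built from `values` rather than `values[train]`, so the test points are seen during fitting. Separately, the loss itself can be sanity-checked in isolation: minimizing the mean pinball loss over a constant forecast should recover the empirical quantile. The following is only a minimal sketch of that check using NumPy; `pinball_loss` and `best_constant` are illustrative names, not part of the original code.

```python
import numpy as np

def pinball_loss(q, y_pred, y_true):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # q * diff when under-predicting, (1 - q) * |diff| when over-predicting
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Synthetic data, used only for the check
rng = np.random.default_rng(0)
y = rng.normal(loc=100.0, scale=10.0, size=2000)
candidates = np.linspace(60.0, 140.0, 801)

def best_constant(q):
    """Constant forecast (from a grid) that minimizes the pinball loss."""
    losses = [pinball_loss(q, c, y) for c in candidates]
    return candidates[int(np.argmin(losses))]

low, high = best_constant(0.05), best_constant(0.95)
print(low, np.quantile(y, 0.05))
print(high, np.quantile(y, 0.95))
```

If the same check gives clearly different values for 0.05 and 0.95 but the Holt-Winters optimization does not, the quantile is probably not reaching the loss function.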

Related

Get parameter estimates from logistic regression model using pycaret

I am training and tuning a model in pycaret such as:
from pycaret.classification import *
clf1 = setup(data = train, target = 'target', feature_selection = True, test_data = test, remove_multicollinearity = True, multicollinearity_threshold = 0.4)
# create model
lr = create_model('lr')
# tune model
tuned_lr = tune_model(lr)
# optimize threshold
optimized_lr = optimize_threshold(tuned_lr)
I would like to get the parameters estimated for the features in the logistic regression, so that I can understand the effect size of each feature on the target. The object optimized_lr has a method optimized_lr.get_params(), but it returns the hyperparameters of the model. I am not interested in my tuning decisions; I am interested in the actual parameters of the model, the ones estimated by the logistic regression.
How can I get them using pycaret? (I could easily get them using other packages such as statsmodels, but I want to know how to do it in pycaret.)
how about
for f, c in zip(optimized_lr.feature_names_in_, tuned_lr.coef_[0]):
    print(f, c)
To get the coefficients, use this code:
tuned_lr.coef_[0] # the fitted coefficients (scikit-learn's LogisticRegression exposes coef_, not feature_importances_)
get_config('X_train').columns # the names of the columns
Now we can create a dataframe to see clearly how each feature relates to the target.
Coeff = pd.DataFrame({"Feature": get_config('X_train').columns.tolist(), "Coefficients": tuned_lr.coef_[0]})
print(Coeff)
# This gives the coefficients alongside the names of the respective columns. Hope it helps.

prediction interval for arma-garch models in python

Is there a way to measure the accuracy of an ARMA-GARCH model in Python using a prediction interval (alpha=0.05)? I fitted an ARMA-GARCH model on log returns and used some classical metrics such as RMSE and MSE (out-of-sample), AIC (in-sample), residual checks and so on. I would like to add a prediction interval as another measure of accuracy for my ARMA-GARCH model predictions. I used the armagarch library (https://github.com/iankhr/armagarch).
I have already looked into how prediction intervals are used in general, but I am not sure how to apply them to ARMA-GARCH.
I found this formula online: Estimator +- 1.96 (for 95%) * Standard Error.
That part I understand, but my model output has several standard errors, one for each parameter in the ARMA and GARCH parts. Which one do I have to use? Is there a single standard error for the model as a whole?
I would be really happy if anyone could help.
ARMA-GARCH model output
So far I created an ARMA(2,2)-GARCH(1,1) model:
#final test of function
import armagarch as ag
#definitions framework
data = pd.DataFrame(data)
meanMdl = ag.ARMA(order = {'AR':2,'MA':2})
volMdl = ag.garch(order = {'p':1,'q':1})
distMdl = ag.normalDist()
model = ag.empModel(data, meanMdl, volMdl, distMdl)
model_fit = model.fit()
After fitting the model, I defined the prediction length and received two arrays as output (mean and variance), which I reshaped to the correct length:
#first array is mean, second is variance
pred = model.predict(nsteps=len(df_test))
#correct the shapes!
df_pred_mean = pd.DataFrame(np.reshape(pred[0], (len(df_test), 1)))
df_pred_variance = pd.DataFrame(np.reshape(pred[1], (len(df_test), 1)))
So far so good. Now I would like to implement a prediction interval.
As I understand it, one has to take the ARMA prediction +- 1.96 (for 95%) * the GARCH prediction for each step. I implemented this for the upper and lower bounds. Only the upper bound is shown here; the lower bound is the same but multiplies by -1.96 at the end of the formula.
#upper bound
df_all["upper bound"] =df_all["pred_Mean"]+df_all["pred_Variance"]*1.96
Comparing this with the actual log returns I trained the model on, the result is completely wrong. Now I am unsure whether my general approach is wrong or whether the problem lies in the package I used.
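One thing that stands out in the bound above: 1.96 is multiplied by the predicted variance, whereas the "Estimator +- 1.96 * Standard Error" formula needs a standard deviation, i.e. the square root of the variance. The standard errors attached to the individual ARMA and GARCH parameters describe uncertainty in those estimates and are not the forecast's standard error. Below is a minimal sketch of the interval and of an empirical coverage check, assuming the second array returned by predict really is the conditional variance (as the comment in the question states) and that the errors are normal (as with normalDist). The function names and the sample numbers are made up for illustration.

```python
import numpy as np

def normal_prediction_interval(pred_mean, pred_variance, z=1.96):
    """Return (lower, upper) bounds: mean +/- z * sqrt(variance)."""
    mean = np.asarray(pred_mean, dtype=float)
    std = np.sqrt(np.asarray(pred_variance, dtype=float))
    return mean - z * std, mean + z * std

def empirical_coverage(actual, lower, upper):
    """Share of actual values that fall inside [lower, upper]."""
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Made-up numbers on the scale of daily log returns
pred_mean = np.array([0.001, 0.000, -0.002])
pred_variance = np.array([0.0004, 0.0001, 0.0009])
actual = np.array([0.01, 0.05, -0.03])

lower, upper = normal_prediction_interval(pred_mean, pred_variance)
coverage = empirical_coverage(actual, lower, upper)
print(lower, upper, coverage)
```

For a well-calibrated 95% interval, the coverage on the test set should be close to 0.95, which gives the accuracy measure the question asks about.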
prediction interval vs. actual log return

Tensorflow model architecture for sparse dataset

I have a regression dataset where approximately 95% of the target values are zeros (the other 5% are between 1 and 30), and I am trying to design a Tensorflow model for that data. I am thinking of implementing a model that combines a classifier and a regressor (check the output of the classifier submodel, and depending on whether it crosses a threshold, pass the sample to the regression submodel). My intuition is that this should be built using the functional API, but I couldn't find helpful resources on that. Any ideas?
Here is the code that generates the data that I am using to replicate the problem:
import numpy as np
n = 10000
zero_percentage = 0.95
zeros = np.zeros(round(n * zero_percentage))
non_zeros = np.random.randint(1,30,size=round(n * (1- zero_percentage)))
y = np.concatenate((zeros,non_zeros))
np.random.shuffle(y)
a = 50
b = 10
x = np.array([np.random.randint(31,60) if element == 0 else (element - b) / a for element in y])
y_classification = np.array([0 if element == 0 else 1 for element in y])
Note: I experimented with probabilistic models (Poisson regression and regression with a discretized logistic mixture distribution), and they provided good results but the training was unstable (loss diverges very often).
Instead of trying to find some heuristic to balance the training between the zero values and the others, you might want to try some input preprocessing method that can handle imbalanced training sets better (usually by mapping to another space before running the model, then doing the inverse with the results); for example, an embedding layer. Alternatively, normalize the values to a small range (like [-1, 1]) and apply an activation function before evaluating the model on the data.
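For what it's worth, the two-stage idea from the question can be prototyped independently of the network architecture: take the classifier's probability that the target is non-zero and only use the regressor's output when that probability crosses a threshold. This is just a minimal NumPy sketch of the combination step, not a Tensorflow model; `combine_two_stage`, `p_nonzero` and `regression_pred` are hypothetical names, and the arrays stand in for the outputs of the two submodels.

```python
import numpy as np

def combine_two_stage(p_nonzero, regression_pred, threshold=0.5):
    """Use the regressor's output only where the classifier predicts non-zero."""
    p = np.asarray(p_nonzero, dtype=float)
    r = np.asarray(regression_pred, dtype=float)
    return np.where(p >= threshold, r, 0.0)

# Stand-ins for the outputs of the classifier and regressor submodels
p_nonzero = np.array([0.02, 0.91, 0.40, 0.75])
regression_pred = np.array([3.0, 12.5, 7.0, 20.0])

combined = combine_two_stage(p_nonzero, regression_pred)
print(combined)
```

The threshold is a tuning choice; with 95% zeros it is usually worth picking it on a validation set rather than leaving it at 0.5.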

How to forecast time series using AutoReg in python

I'm trying to build an old-school model using only an autoregression algorithm. I found that there is an implementation of it in the statsmodels package. I've read the documentation, and as I understand it, it should work like ARIMA. So, here's my code:
import statsmodels.api as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
And when I want to predict new values, I'm trying to follow the documentation:
y_pred = model.predict(start=df_test.index.min(), end=df_test.index.max())
# or
y_pred = model.predict(start=100, end=1000)
Both return a list of NaNs.
Also, when I call model.predict(0, df_train.size - 1) it predicts real values, but model.predict(0, df_train.size) returns a list of NaNs.
Am I doing something wrong?
P.S. I know there's ARIMA, ARMA or SARIMAX algorithms, that can be used as basic auto regression. But I need exactly AutoReg.
We can do the forecasting in a couple of ways:
by directly using the predict() function and
by using the definition of AR(p) process and the parameters learnt with AutoReg(): this will be helpful for short-term predictions, as we shall see.
Let's start with a sample dataset from statsmodels, the data looks like the following:
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
plt.plot(range(len(data)), data)
Let's fit an AR(p) process to model the time series and use partial autocorrelation plot to find the order p, as shown below
As seen from above, the first few PACF values remain significant, let's use p=10 for the AR(p).
Let's divide the data into training and validation (test) datasets and fit auto-regressive model of order 10 using the training data:
from statsmodels.tsa.ar_model import AutoReg
n = len(data)
ntrain = int(n*0.9)
ntest = n - ntrain
lag = 10
res = AutoReg(data[:ntrain], lags = lag).fit()
Now, use the predict() function for forecasting all values corresponding to the held-out dataset:
preds = res.model.predict(res.params, start=n-ntest, end=n-1) # end is inclusive, so this yields exactly ntest forecasts
Notice that we can get the exactly same predictions using the parameters from the trained model, as shown below:
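import numpy as np
# copy the last `lag` training values so the original data is not modified
x = data[ntrain-lag:ntrain].values.copy()
preds1 = []
for t in range(ntrain, n):
    # AR(p) recursion: intercept + coefficients times the most recent values (newest first)
    pred = res.params[0] + np.sum(res.params[1:]*x[::-1])
    # shift the window left by one and append the new forecast
    x[:lag-1], x[lag-1] = x[-(lag-1):], pred
    preds1.append(pred)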
Note that the forecast values generated this way are the same as the ones obtained using the predict() function above.
np.allclose(preds.values, np.array(preds1))
# True
Now, let's plot the forecast values for the test data:
As can be seen, for long term prediction, quality of forecasting is not that good (since the forecasted values are used for long term prediction).
Let's instead go for short-term predictions now and use the last lag points from the dataset to forecast the next value, as shown in the next code snippet.
preds = []
for t in range(ntrain, n):
    pred = res.params[0] + np.sum(res.params[1:]*data[t-lag:t].values[::-1])
    preds.append(pred)
As can be seen from the next plot, short term forecasting works way better:
You can use this code for forecasting
import statsmodels.api as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
y_pred = model.model.predict(model.params, start=df_test.index.min(), end=df_test.index.max())
from statsmodels.tsa.ar_model import AutoReg
model=AutoReg(dataset[''],lags=1)
ARFit=model.fit()
forecasted=ARFit.predict(start=len(dataset),end=len(dataset)+12)
# visualization
dataset[''].plot(figsize=(12,8),legend=True)
forecasted.plot(legend=True)

Custom Error Metric not changing predictions XGBoost R

I have created a custom error metric which prints as I run XGBoost's xgb.train but does not actually have any effect on the output. From what I can tell, it is simply printing the custom error metric for each round but not using it to determine the accuracy.
I think this because the prediction output is exactly the same as when I use the default error metric. I have also tried hard-coding the error output to a static 1 so that the output should be random, but the result was exactly the same.
Do I need to create a custom objective function for the custom error metric to work?
Thanks!
My code:
# xgboost fitting with arbitrary parameters
xgb_params_1 = list(
  objective = "reg:linear",
  eta = 0.2,
  max.depth = 6,
  booster = "gbtree"
)
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  score <- as.numeric((sum(preds[1:1000]) - sum(labels[1:1000] )) / sum(labels[1:1000]) )
  return(list(metric="custom_error",value=1))
}
myWatch <- list(val=dvalid,train=dtrain)
# fit the model with the arbitrary parameters specified above
xgb_1 = xgb.train(data = dtrain,
                  params = xgb_params_1,
                  nrounds = 150,
                  nthread = 6,
                  verbose = T,
                  print_every_n = 50,
                  watchlist = myWatch,
                  early_stop_round = 1000,
                  eval_metric = evalerror,
                  disable_default_eval_metric = 1
)
# Perform a prediction
pred <- predict(xgb_1, dvalid)
results <- cbind(as.data.table(pred), as.data.table(data[year > trainEndDate,"total_installs"]))
#Compute test RMSE
sqrt(mean((results$pred - results$total_installs)**2))
Printed error metrics:
A custom eval_metric is for evaluation purposes only. It is displayed at every round (when using a watchlist) and is useful for tuning the number of boosting rounds; you can also use it during cross-validation to tune your parameters to maximise/minimise the metric. I use it in particular to tune my learning rate so the model converges faster in fewer rounds.
A custom objective function is a completely different beast and is not the same as an evaluation metric. It is closer to the type of model (classification, regression, etc.), and it is what drives the convergence of the model. If you still want one, here is an example of an xgboost regression objective.
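To make the distinction concrete: an evaluation metric maps the predictions to a single number that is only reported, while a custom objective has to supply the per-row gradient and second derivative (hessian) of the loss with respect to the prediction, which is what boosting actually follows. The sketch below illustrates this in Python with NumPy only; it does not call xgboost, the function names are made up, and plain squared error is used simply because it is the easiest objective to write down. The metric is the relative total error from the question, without the 1:1000 slice.

```python
import numpy as np

def squared_error_objective(preds, labels):
    """Gradient and hessian of 0.5 * (pred - label)**2 for each row."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    grad = preds - labels        # first derivative w.r.t. the prediction
    hess = np.ones_like(preds)   # second derivative is constant
    return grad, hess

def relative_total_error(preds, labels):
    """Single-number metric: (sum(preds) - sum(labels)) / sum(labels)."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float((preds.sum() - labels.sum()) / labels.sum())

preds = np.array([2.0, 5.0])
labels = np.array([1.0, 7.0])

grad, hess = squared_error_objective(preds, labels)
metric = relative_total_error(preds, labels)
print(grad, hess, metric)
```

Changing only the metric leaves the gradients untouched, which is why the predictions in the question are identical whatever value the metric returns.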
