Get parameter estimates from logistic regression model using pycaret - python

I am training and tuning a model in pycaret such as:
from pycaret.classification import *
clf1 = setup(data = train, target = 'target', feature_selection = True, test_data = test, remove_multicollinearity = True, multicollinearity_threshold = 0.4)
# create model
lr = create_model('lr')
# tune model
tuned_lr = tune_model(lr)
# optimize threshold
optimized_lr = optimize_threshold(tuned_lr)
I would like to get the parameters estimated for the features in the Logistic Regression, so I could proceed with understanding the effect size of each feature on the target. However, the object optimized_lr has a function optimized_lr.get_params() which returns the hyperparameters of the model, however, I am not quite interested in my tuning decisions, instead, I am very interested in the real parameters of the model, the ones estimated in Logistic Regression.
How could I get them to use pycaret? (I could easily get those using other packages such as statsmodels, but I want to know in pycaret)

how about
for f, c in zip (optimized_lr.feature_names_in_,tuned.coef_[0]):
print(f, c)

To get the coefficients, use this code:
tuned_lr.feature_importances_ #this will give you the coefficients
get_config('X_train').columns #this code will give you the names of the columns.
Now we can create a dataframe so that we could see clearly how it correlates with the independent variable.
Coeff=pd.DataFrame({"Feature":get_config('X_train').columns.tolist(),"Coefficients":tuned_lr.feature_importances_})
print(Coeff)
# It would give me the Coefficient with the names of the respective columns. Hope it helps.

Related

prediction interval for arma-garch models in python

Is there a way to measure the accuracy of an ARMA-GARCH model in Python using a prediction interval (alpha=0.05)? I fitted an ARMA-GARCH model on log returns and used some classical metrics such as RMSE, MSE (out-of-sample), AIC (in-sample), check on residuals and so on. I would like to add a prediction interval as another measurement of accuracy based on my ARMA-GARCH model predictions. I used the armagarch library (https://github.com/iankhr/armagarch).
I already checked on how to use prediction intervals but not sure how to use it with ARMA-GARCH.
I found these formula searching online: Estimator +- 1.96 (for 95%) * Standard Error.
So far i got it, but i have several Standard Errors in my model output for each parameter in the ARMA and GARCH part, which one i have to use? Is there one Standard Error for the whole model itself?
I would be really happy if anyone could help.
ARMA-GARCH model output
So far I created an ARMA(2,2)-GARCH(1,1) model:
#final test of function
import armagarch as ag
#definitions framework
data = pd.DataFrame(data)
meanMdl = ag.ARMA(order = {'AR':2,'MA':2})
volMdl = ag.garch(order = {'p':1,'q':1})
distMdl = ag.normalDist()
model = ag.empModel(data, meanMdl, volMdl, distMdl)
model_fit = model.fit()
After the model fit defining prediction length and
Recieved two arrays as an output (mean + variance) put them into the correct length:
#first array is mean, second is variance
pred = model.predict(nsteps=len(df_test))
#correct the shapes!
df_pred_mean = pd.DataFrame(np.reshape(pred[0], (len(df_test),
1)))
df_pred_variance = pd.DataFrame(np.reshape(pred[1],
(len(df_test), 1)))
So far so good, now i would like to implement a prediction interval.
I got that one has to use the ARMA part +- 1.96 (95%)* GARCH prediction for each prediction. I implemented it for the upper and lower bound. It just shows the upper bound lower bound is same but using * (-1.96) at the end of the formula.
#upper bound
df_all["upper bound"] =df_all["pred_Mean"]+df_all["pred_Variance"]*1.96
Using it on the actual log returns i trained the model with fails in the way its completely wrong. Now I'm unsure if the main approach i used is wrong or the model I used means the package.
prediction interval vs. actual log return

Applying a Quantile Loss Function to Optimize an Exponential Smoothing Model

I have tried posting on others places without luck so I am going to try here as well.
I am trying to build a model that generate prediction intervals from a traditional winters-holt forecast model. There is a built-in function that does it in statmodels but there are a few missing features preventing me from fully depend on it. I am looking for a critic eye and don't know anywhere else to post:
What I did is use a quantile loss pinball function to optimize the parameters (level, trend and seasonality) of the model and generate different forecast corresponding to different quantile values that I input in the quantile loss function.
first of all: is it good practice to optimize the parameters for a quantile
second, is the described process viable to get prediction intervals?
Now, in the results dataset, I observe none difference between predictions made with a quantile of 0.05 and a 0.95 or worst, there are even timeseries where the results give lower values for the 0.95 quantile value.
I am confused and wondered what could be wrong in the method knowing that the model works just fine with finding regular point series using mean square error loss function
here is my quantile loss function:
def quantile_loss(q,y_p, y):
a = np.where((y > y_p), q *(y-y_p), (y_p - y)*(1-q))
return a
and here is the code for the parameter optimization:
def HoltWinterLowHightimeseriesCVscore(params,quantile_values, data, loss_function=quantile_loss,slen=12):
"""
Returns error on CV
params - vector of parameters for optimization
series - dataset with timeseries
slen - season length for Holt-Winters model
"""
# errors array
errors = []
values = data
alpha, beta, gamma = params
# set the number of folds for cross-validation
tscv = TimeSeriesSplit(n_splits=3)
# iterating over folds, train model on each, forecast and calculate error
for train, test in tscv.split(values):
model = HoltWintersLowHigh(series=values, slen=slen,
alpha=alpha, beta=beta, gamma=gamma, n_preds=12)
model.triple_exponential_smoothing()
predictions = model.result[-len(test):]
actual = values[test]
error = loss_function(quantile, predictions, actual)
errors.append(error)
return np.mean(np.array(errors))
and finally here is the final part where the functions are called to make the predictions:
forecast = {}
for i in seasonal_profile_df.index:
quantile_values = [0.92]
if seasonal_profile_df['trend'].loc[i] == 'trending' and seasonal_profile_df['seasonality'].loc[i] == 'seasonal' and seasonal_profile_df['demand_level'].loc[i] == 'low' or seasonal_profile_df['variability'].loc[i] == 'high':
index = pd.DatetimeIndex(new_df.index)
series = pd.Series(data=new_df.iloc[:, i], index=index)
print(len(series))
data = series[:-10] #leave some data for testing
x = [0, 0, 0]
#for i in quantile:
for j in quantile_values:
quantile_values = j
# Minimizing the loss function
opt = minimize(HoltWinterLowHightimeseriesCVscore, x0=x,
args=( quantile_values,data, quantile_loss,),
method="TNC", bounds=((0, 1), (0, 1), (0, 1))
)
alpha_final, beta_final, gamma_final = opt.x
print(opt.x)
##and then predict according to the quantile specified #####
model = HoltWintersLowHigh(series, slen=12,
alpha=alpha_final,
beta=beta_final,
gamma=gamma_final,
n_preds=12, scaling_factor=1.96)
model.triple_exponential_smoothing()
plotHoltWintersLowHigh(series, quantile_values)
result= {"Id": seasonal_profile_df['Id'].loc[i]}
result['results'] = model.result[-12:]
I am really hoping to get a fresh or more experienced eye on this. I feel like the process make sense but I cannot find a solid example in python, and I would like if someone could tell me whether the approach does not make sense or if there is a problematic thing with the code. Any feedback would be greatly appreciated!

How to forecast time series using AutoReg in python

I'm trying to build old school model using only auto regression algorithm. I found out that there's an implementation of it in statsmodel package. I've read the documentation, and as I understand it should work as ARIMA. So, here's my code:
import statsmodels.api as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
And when I want to predict new values, I'm trying to follow the documentation:
y_pred = model.predict(start=df_test.index.min(), end=df_test.index.max())
# or
y_pred = model.predict(start=100, end=1000)
Both returns a list of NaNs.
Also, when I type model.predict(0, df_train.size - 1) it predicts real values, but model.predict(0, df_train.size) predicts NaNs list.
Am I doing something wrong?
P.S. I know there's ARIMA, ARMA or SARIMAX algorithms, that can be used as basic auto regression. But I need exactly AutoReg.
We can do the forecasting in couple of ways:
by directly using the predict() function and
by using the definition of AR(p) process and the parameters learnt with AutoReg(): this will be helpful for short-term predictions, as we shall see.
Let's start with a sample dataset from statsmodels, the data looks like the following:
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
plt.plot(range(len(data)), data)
Let's fit an AR(p) process to model the time series and use partial autocorrelation plot to find the order p, as shown below
As seen from above, the first few PACF values remain significant, let's use p=10 for the AR(p).
Let's divide the data into training and validation (test) datasets and fit auto-regressive model of order 10 using the training data:
from statsmodels.tsa.ar_model import AutoReg
n = len(data)
ntrain = int(n*0.9)
ntest = n - ntrain
lag = 10
res = AutoReg(data[:ntrain], lags = lag).fit()
Now, use the predict() function for forecasting all values corresponding to the held-out dataset:
preds = res.model.predict(res.params, start=n-ntest, end=n)
Notice that we can get the exactly same predictions using the parameters from the trained model, as shown below:
x = data[ntrain-lag:ntrain].values
preds1 = []
for t in range(ntrain, n):
pred = res.params[0] + np.sum(res.params[1:]*x[::-1])
x[:lag-1], x[lag-1] = x[-(lag-1):], pred
preds1.append(pred)
Note that the forecast values generated this way is same as the ones obtained using the predict() function above.
np.allclose(preds.values, np.array(preds1))
# True
Now, let's plot the forecast values for the test data:
As can be seen, for long term prediction, quality of forecasting is not that good (since the forecasted values are used for long term prediction).
Let's instead go for short-term predictions now and use the last lag points from the dataset to forecast the next value, as shown in the next code snippet.
preds = []
for t in range(ntrain, n):
pred = res.params[0] + np.sum(res.params[1:]*data[t-lag:t].values[::-1])
preds.append(pred)
As can be seen from the next plot, short term forecasting works way better:
You can use this code for forecasting
import statsmodels as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
y_pred = model.model.predict(model.params, start=df_test.index.min(), end=df_test.index.max())
from statsmodels.tsa.ar_model import AutoReg
model=AutoReg(dataset[''],lags=1)
ARFit=model.fit()
forecasted=ARFit.predict(start=len(dataset),end=len(dataset)+12)
#visualizacion
dataset[''].plot(figsize=(12,8),legend=True)
forecasted.plot(legend=True)

Custom Error Metric not changing predictions XGBoost R

I have created a custom error metric which prints as I run the XGBoost xgb.train but does not actually have any affect on the output. From what I can tell it is simply printing the custom error metric for the round but not using that to determine the accuracy.
I think this because the prediction outputs is exactly the same to when I use the default error metric. I have also tried hard coding the error output to a static 1 so that the output should be random but the result was exactly the same.
Do I need to create a custome objective function for the custom error metric to work?
Thanks!
My code:
# xgboost fitting with arbitrary parameters
xgb_params_1 = list(
objective = "reg:linear",
eta = 0.2,
max.depth = 6,
booster = "gbtree"
)
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
score <- as.numeric((sum(preds[1:1000]) - sum(labels[1:1000] )) / sum(labels[1:1000]) )
return(list(metric="custom_error",value=1))
}
myWatch <- list(val=dvalid,train=dtrain)
# fit the model with the arbitrary parameters specified above
xgb_1 = xgb.train(data = dtrain,
params = xgb_params_1,
nrounds = 150,
nthread = 6,
verbose = T,
print_every_n = 50,
watchlist = myWatch,
early_stop_round = 1000,
eval_metric = evalerror,
disable_default_eval_metric = 1
)
# Perform a prediction
pred <- predict(xgb_1, dvalid)
results <- cbind(as.data.table(pred), as.data.table(data[year > trainEndDate,"total_installs"]))
#Compute test RMSE
sqrt(mean((results$pred - results$total_installs)**2))
Printed error metrics:
Custom eval_metric is just for evaluation purposes. It is displayed at every round (when using watches) and it is useful to tune number of boosting rounds, and you can use it when you do cross-validation to tune your parameters to maximise/minimise your metric. I use it in particular to tune my learning rate to make the model converge faster with less rounds.
Custom objective function is a completely different beast and it is not the same as evaluation metric. It is more of a type of model like classification, regression etc. It drives the convergence of the model. If you still want it here is an example of xgboost regression objective.

Generating predictions from inferred parameters in pymc3

I run into a common problem I'm wondering if someone can help with. I often would like to use pymc3 in two modes: training (i.e. actually running inference on parameters) and evaluation (i.e. using inferred parameters to generate predictions).
In general, I'd like a posterior over predictions, not just point-wise estimates (that's part of the benefit of the Bayesian framework, no?). When your training data is fixed, this is typically accomplished by adding simulated variable of a similar form to the observed variable. For example,
from pymc3 import *
with basic_model:
# Priors for unknown model parameters
alpha = Normal('alpha', mu=0, sd=10)
beta = Normal('beta', mu=0, sd=10, shape=2)
sigma = HalfNormal('sigma', sd=1)
# Expected value of outcome
mu = alpha + beta[0]*X1 + beta[1]*X2
# Likelihood (sampling distribution) of observations
Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
Y_sim = Normal('Y_sim', mu=mu, sd=sigma, shape=len(X1))
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start)
But what if my data changes? Say I want to generate predictions based on new data, but without running inference all over again. Ideally, I'd have a function like predict_posterior(X1_new, X2_new, 'Y_sim', trace=trace) or even predict_point(X1_new, X2_new, 'Y_sim', vals=trace[-1]) that would simply run the new data through the theano computation graph.
I suppose part of my question relates to how pymc3 implements the theano computation graph. I've noticed that the function model.Y_sim.eval seems similar to what I want, but it requires Y_sim as an input and seems just to return whatever you give it.
I imagine this process is extremely common, but I can't seem to find any way to do it. Any help is greatly appreciated. (Note also that I have a hack to do this in pymc2; it's more difficult in pymc3 because of theano.)
Note: This functionality is now incorporated in the core code as the pymc.sample_ppc method. Check out the docs for more info.
Based on this link (dead as of July 2017) sent to me by twiecki, there are a couple tricks to solve my issue. The first is to put the training data into a shared theano variable. This allows us to change the data later without screwing up the theano computation graph.
X1_shared = theano.shared(X1)
X2_shared = theano.shared(X2)
Next, build the model and run the inference as usual, but using the shared variables.
with basic_model:
# Priors for unknown model parameters
alpha = Normal('alpha', mu=0, sd=10)
beta = Normal('beta', mu=0, sd=10, shape=2)
sigma = HalfNormal('sigma', sd=1)
# Expected value of outcome
mu = alpha + beta[0]*X1_shared + beta[1]*X2_shared
# Likelihood (sampling distribution) of observations
Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start)
Finally, there's a function under development (will likely eventually get added to pymc3) that will allow to predict posteriors for new data.
from collections import defaultdict
def run_ppc(trace, samples=100, model=None):
"""Generate Posterior Predictive samples from a model given a trace.
"""
if model is None:
model = pm.modelcontext(model)
ppc = defaultdict(list)
for idx in np.random.randint(0, len(trace), samples):
param = trace[idx]
for obs in model.observed_RVs:
ppc[obs.name].append(obs.distribution.random(point=param))
return ppc
Next, pass in the new data that you want to run predictions on:
X1_shared.set_value(X1_new)
X2_shared.set_value(X2_new)
Finally, you can generate posterior predictive samples for the new data.
ppc = run_ppc(trace, model=model, samples=200)
The variable ppc is a dictionary with keys for each observed variable in the model. So, in this case ppc['Y_obs'] would contain a list of arrays, each of which is generated using a single set of parameters from trace.
Note that you can even modify the parameters extracted from the trace. For example, I had a model using a GaussianRandomWalk variable and I wanted to generate predictions into the future. While you could allow pymc3 to sample into the future (i.e. allow the random walk variable to diverge), I just wanted to use a fixed value of the coefficient corresponding to the last inferred value. This logic can implemented in the run_ppc function.
It's also worth mentioning that the run_ppc function is extremely slow. It takes about as much time as running the actual inference. I suspect this has to do with some inefficiency related to how theano is used.
EDIT: The link originally included seems to be dead.
Above answer from #santon is correct. I am just adding to that.
Now you don't need to write your own method run_ppc. pymc3 provides sample_posterior_predictive method which does the same.

Categories