How can I add a constant to my statsmodels regression?
As of now, the model is like this:
model = sm.OLS(y,x).fit()
From the documentation for OLS:
exog: A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
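For completeness, here is a minimal runnable sketch of the whole workflow; the toy x and y below are made up purely for illustration, and with your own data only the last three lines matter:
import numpy as np
import statsmodels.api as sm

# toy data (illustrative only)
x = np.arange(10)
y = 2.5 * x + 1.0 + np.random.normal(scale=0.1, size=10)

X = sm.add_constant(x)   # prepends a column of ones named 'const'
model = sm.OLS(y, X).fit()
print(model.params)      # the first entry is the fitted intercept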
I am training and tuning a model in pycaret such as:
from pycaret.classification import *
clf1 = setup(data = train, target = 'target', feature_selection = True, test_data = test, remove_multicollinearity = True, multicollinearity_threshold = 0.4)
# create model
lr = create_model('lr')
# tune model
tuned_lr = tune_model(lr)
# optimize threshold
optimized_lr = optimize_threshold(tuned_lr)
I would like to get the parameters estimated for the features in the logistic regression, so I can understand the effect size of each feature on the target. The object optimized_lr has a method optimized_lr.get_params(), but that returns the hyperparameters of the model; I am not interested in my tuning decisions, I am interested in the actual parameters of the model, the coefficients estimated by the logistic regression.
How can I get them in pycaret? (I could easily get them using other packages such as statsmodels, but I want to know how to do it in pycaret.)
How about:
for f, c in zip(optimized_lr.feature_names_in_, optimized_lr.coef_[0]):
    print(f, c)
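If the goal is effect sizes, it may also help to exponentiate the coefficients so they read as odds ratios; this is standard logistic-regression interpretation rather than anything pycaret-specific, and it assumes optimized_lr is the fitted sklearn LogisticRegression as in the loop above:
import numpy as np

for f, c in zip(optimized_lr.feature_names_in_, optimized_lr.coef_[0]):
    # np.exp(c) is the multiplicative change in the odds of the target per unit increase in f
    print(f, c, np.exp(c))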
To get the coefficients, use this code:
tuned_lr.feature_importances_  # this will give you the coefficients
get_config('X_train').columns  # this will give you the names of the columns
Now we can create a DataFrame so that we can clearly see how each feature relates to the target:
Coeff = pd.DataFrame({"Feature": get_config('X_train').columns.tolist(),
                      "Coefficients": tuned_lr.feature_importances_})
print(Coeff)
# This gives the coefficients with the names of the respective columns. Hope it helps.
I'm trying to build an old-school model using only an autoregression algorithm. I found that there's an implementation of it in the statsmodels package. I've read the documentation, and as I understand it, it should work like ARIMA. So, here's my code:
import statsmodels.api as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
And when I want to predict new values, I'm trying to follow the documentation:
y_pred = model.predict(start=df_test.index.min(), end=df_test.index.max())
# or
y_pred = model.predict(start=100, end=1000)
Both return a list of NaNs.
Also, when I type model.predict(0, df_train.size - 1) it predicts real values, but model.predict(0, df_train.size) returns a list of NaNs.
Am I doing something wrong?
P.S. I know there are the ARIMA, ARMA and SARIMAX algorithms, which can be used for basic autoregression. But I need AutoReg specifically.
We can do the forecasting in a couple of ways:
by directly using the predict() function, and
by using the definition of the AR(p) process and the parameters learnt with AutoReg(); this will be helpful for short-term predictions, as we shall see.
Let's start with a sample dataset from statsmodels; the data looks like the following:
import matplotlib.pyplot as plt
import statsmodels.api as sm

data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
plt.plot(range(len(data)), data)
Let's fit an AR(p) process to model the time series and use the partial autocorrelation (PACF) plot to find the order p, as shown below.
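A minimal sketch of how that PACF plot can be produced, using plot_pacf from statsmodels (the 25 lags shown are just an illustrative choice):
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(data, lags=25)   # lags with spikes outside the confidence band suggest the AR order p
plt.show()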
As seen above, the first few PACF values remain significant; let's use p=10 for the AR(p) model.
Let's divide the data into training and validation (test) datasets and fit auto-regressive model of order 10 using the training data:
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

n = len(data)
ntrain = int(n * 0.9)
ntest = n - ntrain
lag = 10
res = AutoReg(data[:ntrain], lags=lag).fit()
Now, use the predict() function for forecasting all values corresponding to the held-out dataset:
preds = res.model.predict(res.params, start=n-ntest, end=n-1)
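As a side note, the fitted results object exposes predict directly as well, so the same forecasts can be obtained without passing the parameters by hand (this relies only on the res object fitted above):
preds = res.predict(start=n-ntest, end=n-1)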
Notice that we can get exactly the same predictions using the parameters from the trained model, as shown below:
x = data[ntrain-lag:ntrain].values   # the last `lag` observations from the training set
preds1 = []
for t in range(ntrain, n):
    # AR(p) recursion: intercept plus lag coefficients times the most recent values
    pred = res.params[0] + np.sum(res.params[1:] * x[::-1])
    # shift the window: drop the oldest value and append the new prediction
    x[:lag-1], x[lag-1] = x[-(lag-1):], pred
    preds1.append(pred)
Note that the forecast values generated this way are the same as the ones obtained using the predict() function above.
np.allclose(preds.values, np.array(preds1))
# True
Now, let's plot the forecast values for the test data:
As can be seen, for long-term prediction the quality of the forecast is not that good, since the forecasted values themselves are fed back in to produce the later forecasts.
Let's instead go for short-term predictions now and use the last lag observed points from the dataset to forecast the next value, as shown in the next code snippet.
preds = []
for t in range(ntrain, n):
    # one-step-ahead forecast using the actual last `lag` observations up to time t
    pred = res.params[0] + np.sum(res.params[1:] * data[t-lag:t].values[::-1])
    preds.append(pred)
As can be seen from the next plot, short term forecasting works way better:
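A minimal sketch of that comparison plot, assuming the preds list from the short-term loop and the data, ntrain and n defined earlier:
plt.figure(figsize=(10, 4))
plt.plot(range(ntrain, n), data[ntrain:].values, label='actual')
plt.plot(range(ntrain, n), preds, label='one-step-ahead forecast')
plt.legend()
plt.show()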
You can use this code for forecasting:
import statsmodels.api as sm

model = sm.tsa.AutoReg(df_train.beer, 12).fit()
y_pred = model.model.predict(model.params, start=df_test.index.min(), end=df_test.index.max())
from statsmodels.tsa.ar_model import AutoReg

model = AutoReg(dataset[''], lags=1)
ARFit = model.fit()
forecasted = ARFit.predict(start=len(dataset), end=len(dataset)+12)

# visualization
dataset[''].plot(figsize=(12, 8), legend=True)
forecasted.plot(legend=True)
If I have a model like the one below, how do I access the underlying theano function in order to compute the value(s) of the quantities in the model I'm fitting?
This is quite a basic model and so I could just calculate with the raw function for my variables. However, I intend to generate pymc3 models dynamically where some variables are reused/fixed/bounded etc.
I know I can access the theano function from model.makefn([expected]) but this will rely on transformed arguments like sigma_log_ instead of sigma.
Ideally, I'm looking for something like model.evaluate([expected], alpha=1, beta=2)
Is there such a method?
Thanks
def function(a, b):
    # do something
    ...

basic_model = Model()

with basic_model:
    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    expected = Deterministic('expected', function(alpha, beta))

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=expected, sd=sigma, observed=Y)
The typical approach here would be to first sample from the model's posterior distribution with something like
with basic_model:
    trace = pm.sample(N_SAMPLES)
then use the samples to approximate the posterior expected value of your function.
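Concretely, a minimal sketch of that last step, assuming the model from the question (where the quantity of interest is wrapped in a Deterministic named 'expected') and the trace from the sampling call above:
import numpy as np

# every posterior draw records a value of the Deterministic node 'expected',
# so its posterior expected value is approximated by the sample mean
posterior_expected = np.mean(trace['expected'], axis=0)
print(posterior_expected)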
I tried to compare the logistic regression result from statsmodels with the sklearn LogisticRegression result; actually, I tried to compare with the R result as well.
I set the option C=1e6 (effectively no penalty), but I got almost the same coefficients except for the intercept.
model = sm.Logit(Y, X).fit()
print(model.summary())
==> intercept = 5.4020
model = LogisticRegression(C=1e6,fit_intercept=False)
model = model.fit(X, Y)
===> intercept = 2.4508
So I read the user guide, which says about fit_intercept: "Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function."
What does this mean? Is this why sklearn's LogisticRegression gave a different intercept value?
Please help me.
LogisticRegression is in some aspects similar to the Perceptron Model and LinearRegression.
You multiply your weights by the data points and compare the result to a threshold value b:
w_1 * x_1 + ... + w_n*x_n > b
This can be rewritten as:
-b + w_1 * x_1 + ... + w_n*x_n > 0
or
w_0 * 1 + w_1 * x_1 + ... + w_n*x_n > 0
For linear regression we keep this expression; for the perceptron we feed it to a step function; and here, for logistic regression, we pass it to the logistic (sigmoid) function.
Instead of learning n parameters, n+1 are now learned. For the perceptron the extra parameter is called the bias; for regression it is called the intercept.
For linear regression it's easy to understand geometrically. In the 2D case you can think of this as shifting the line by w_0 in the y direction,
or y = m*x vs y = m*x + c
So now the line does not have to go through (0, 0) anymore.
For logistic regression it is similar: the intercept shifts the decision boundary away from the origin.
Implementation-wise, what happens is that you add one more weight and a constant column of 1s to the X values, and then proceed as normal:
if fit_intercept:
    # prepend a column of ones so that the first learned weight acts as the intercept
    intercept = np.ones((X_train.shape[0], 1))
    X_train = np.hstack((intercept, X_train))

weights = np.zeros(X_train.shape[1])
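To see the two libraries line up once the intercept is handled the same way, here is a minimal sketch; it assumes X is a feature matrix without a constant column and y is a binary target, and the huge C simply switches off sklearn's regularization:
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# statsmodels: you add the constant column yourself
sm_model = sm.Logit(y, sm.add_constant(X)).fit()
print(sm_model.params)                       # the first entry is the intercept

# sklearn: fit_intercept=True adds it for you internally
sk_model = LogisticRegression(C=1e6, fit_intercept=True).fit(X, y)
print(sk_model.intercept_, sk_model.coef_)   # these should be close to the statsmodels values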
I have the following OLS model from StatsModels:
X = df['Grade']
y = df['Results']
X = statsmodels.tools.tools.add_constant(X)
mod = sm.OLS(y,X)
results = mod.fit()
When trying to predict a new Y value for an X value of 4, I have to pass the following:
results.predict([1,4])
I don't understand why an array with the first value being '1' needs to be passed in order for the predict function to work correctly. Why do I need to include a 1 instead of just saying:
results.predict([4])
I'm not clear on the concept at work here. Does anybody know what's going on?
You are adding a constant to the regression equation with X = statsmodels.tools.tools.add_constant(X). So your regressor X has two columns, where the first column is an array of ones.
You need to do the same with the regressor that is used in prediction. So, the 1 means include the constant in the prediction. If you use zero instead, then the contribution of the constant (0 * params[0]) is zero and the prediction is only the slope effect.
The formula interface adds the constant automatically both for the regressor in the model and for the regressor in the prediction. However, with the pandas DataFrame or numpy ndarray interface, the constant needs to be added by the user both for the model and for predict.
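A minimal sketch of both interfaces side by side; the column names Grade and Results come from the question, and the formula API is the one that adds the constant automatically:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# array/DataFrame interface: the constant must be added explicitly, in the model and in predict
X = sm.add_constant(df['Grade'])
res = sm.OLS(df['Results'], X).fit()
print(res.predict([1, 4]))                       # [constant term, Grade=4]

# formula interface: the intercept is handled automatically on both sides
res_f = smf.ols('Results ~ Grade', data=df).fit()
print(res_f.predict(pd.DataFrame({'Grade': [4]})))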