What does the high VIF for the constant term (intercept) indicate?

What does the high VIF for the constant term (intercept) indicate? - python

I am building a Linear regression model on a car dataset using RFE technique and statsmodels library. My final model has p-value well within 5% and has high F-statistics. VIF values for the predictor variables are well below 5 but for the constant term(intercept) VIF is 8.18. I have used add_constant method to add constant to the model. Following are my doubts:
What does High variance for the constant indicate ?
Should i ignore the constant term while calculating VIF?
These are my results:
I am new to machine learning and also posting question on this site for the 1st time. Kindly let me know if any more information is needed to answer my question.

statistical question are better asked on stats.stackexchange. However, I just went through this for statsmodels, e.g. https://github.com/statsmodels/statsmodels/issues/2376
First, there is no multicollinearity problem in your model and data. p-values are low and confidence intervals are pretty narrow, so the parameters in the model should be a good estimates. A vif of 8 is not large.
A large vif in the constant indicates that the (slope) explanatory variables have also a large constant component. An example would be when a variable has a large mean but only a small variance. An example for perfect collinearity with the constant and rank deficiency of the design matrix is the dummy variable trap, when we did not remove one of the levels of a categorical variable in dummy encoding and the dummies sum to 1 and, therefore, replicate a constant.
The purpose of including the constant in the vif computation is to discover this kind of problems with the design matrix exog provided by the user. It would not show up if we compute vif on demeaned or standardized explanatory variables.
There has been a long standing debate in statistics and econometrics about whether multicollinearity measures should include a constant or work only with demeaned explanatory variables.
I am currently preparing an extension to statsmodels that gives users the option to compute both versions, with and without constant.
In some cases reparameterization, demeaning and scaling, can improve numerical precision and prediction. So we want to have measures that check the actual design matrix provided by users, but also check a standardized version of the data to see whether demeaning and scaling might improve numerical precision.

Related

Implementing ARIMA(X) model in TensorFlow

I'm currently scratching my head about how I might implement a classic ARIMA(X) model using base TensorFlow (and optionally Keras). The equation I am attempting to setup has the following form:
Where d represents the level of differencing applied to the input observed time series, p is the auto-regressive order, and q is the moving average order. The part which is stumping me currently is the calculation/estimation of the residuals epsilon. The auto-regression portion is a simple linear regression on the lagged samples, while the same is true for the terms involving the exogenous series (X). When I am estimating the residuals, should I simply feed the q-many previous steps into the current estimated parameters, and compute the residuals as y_true-y_predict? Though this also begs the question of: How does one estimate the residuals for observations where you have no previous observations? Do we simply estimate residuals 0 through q simply on a chosen random distribution of set variance (e.g. Normal, Poisson, etc.) with a mean of 0?
I have looked at the source for the statsmodels package to try to understand it, but it is quite opaque. Part of the reason for implementing the model this way is that it needs to fit into a fairly standard ecosystem at the company I work for, and we need control over what slices of data the model is fitted to at a given time step. This is because some data may arrive (much) later than the time stamp it relates to, due to lag at the source etc.
Thank you for any help you might be able to offer.

Evaluating ARIMA models with the AIC

Having come across ARIMA/seasonal ARIMA recently, I am wondering why the AIC is chosen as an estimator for the applicability of a model. According to Wikipedia, it evaluates the goodness of the fit while punishing non-parsimonious models in order to prevent overfitting. Many grid search functions such as auto_arima in Python or R use it as an evaluation metric and suggest the model with the lowest AIC as the best fit.
However, in my case, choosing a simple model (with the lowest AIC -> small amount of parameters) just results in a model, that strongly follows previous in-sample observations and performs very badly on the test sample data. I don't see how overfitting is prevented just by choosing a small number of parameters...
ARIMA(1,0,1)(0,0,0,53); AIC=-16.7
Am I misunderstanding something? What could be a workaround to prevent this?

In the case of an ARIMA model whatever the parameters of the model are it will follow past observations, in the sense that you predict next values given previous values from your data. Now, auto.arima just tries some models and gives you the one with the lowest AIC by default or some other information criterion e.g BIC. This does not mean anything more than what the definition of those criteria are: so the model with the lowest AIC is the one that gives minimizes the AIC function. In case of time series analysis after you make sure that time series is stationary, I would recommend that you examine the ACF and PACF plots of your time series and read this
P.S I don't get this straight orange line in your plot after the dashed vertical line.

We usually use some form of cross-validation to protect against overfitting. It is well known that leave-one-out cross-validation is asymptotically equivalent to AIC under some assumptions about normality etc. Indeed, back when we had less computing power, AIC and other information criteria were handy exactly because they accomplish something very similar to cross-validation analytically.
Also, note that by their nature ARMA(1,1) models -- and other stationary ARMA models for that matter -- tend to converge to a constant rather quickly. The easiest way to see this is to write down the expressions of y_t+1, y_t+2 as a function of y_t. You will see that the expression has exponentials of numbers less than 1 (your AR and MA parameters), which quickly converge to zero as t grows. Also see this discussion.
The reason why your 'observed' data (to the left of the dashed line) does not exhibit this behaviour is that for each period you get a new realisation of random error term epsilon_t. On the right hand side, you do not get these realisations of random shocks, but instead they are replaced with their expressed value 0.

How to approach feature elimination?

I have been working on a couple of dataset to build predictive models based on them. However I am left a bit bewildered when its coming to elimination of features.
The first one is the Boston Housing dataset and the second is Bigmart Sales dataset. I will focus my question around these two however I would also appreciate relatively generalized answers too.
Boston Housing : I have constructed a correlation coefficient matrix and eliminated the features which has an absolute correlation coefficient of less than 0.50 with respect to the target variable medv. That is leaving me with three features. However, I also do understand that a correlation matrix can be highly deceptive and does not capture non-linear relationships and as a matter of fact features such as crim, indus etc does have non-linear relationship with medv and intuitively it simply does not feel correct to discard them right away.
Bigmart Sales : There are around 30+ features that is created after OneHotEncoding in Python. I have given a go to backward elimination method while I was constructing a linear regression model but I am not exactly sure how to apply backward elimination when I was working on a Decision Tree model for this dataset (not sure if it can actually be applied to Decision Tree at all).
It would be of great help if I can get some idea on how to approach to feature elimination for the above two cases. Let me know if you need more info, I will gladly provide.

It's extremely general question. I don't think that it possible to answer to your question in StackOverFlow format.
For every ML / Statistical model you need different Feature Elimination / Feature Engineering approach:
Linear / Logistic / GLM models require removal of correlated features
For Neural Nets / Boosted trees removal of features will heart performance of the model
Even for one type of models there's no single best way of doing Feature Elimination
If you can add more specific information to your question it'll be possible to discuss it in details.

This is a fun one without any definitive answers (No Free Lunch Theorems) that apply across the board. That said, there are many guidelines which typically have success in real-world problems. Those guidelines will work fine in the specific datasets you explicitly mentioned as well.
As with just about anything else, one must always consider the purpose of feature elimination. Without a goal or set of goals, any answer is valid. With an objective, not only can you hone in on a good answer, but it can open up the door to other ideas you may not have considered. Typically feature elimination is done for one of four reasons:
Increased Accuracy
Increased Generalization
Decreased Bias
Decreased Variance
Decreased Computational Costs
Ease of Explanation
Of course there are other reasons, but these cover the main use cases. With respect to any of those metrics, the obvious (and awful -- never do this) way to choose which ones to keep is to try all combinations in your model and see what happens. In the Boston Housing dataset, this yields 2^13=8192 possible combinations of features to test. The combinatorial growth is exponential, and not only is this approach likely to lead to survivorship bias, it is too expensive for most people and most data.
Barring any sort of a comprehensive examination of all possible options, one must use a heuristic of some kind to attempt to find the same results. I'll mention several:
Train the model n times, each with precisely one feature removed (a different feature each time). If a model has poor performance it indicates that the removed feature is important.
Train the model once with all features, and randomly perturb each input one feature at a time (this can be done stochastically if you don't want to waste time on every input). The features which cause the most classification error when perturbed are the ones which matter the most.
As you said, perform some sort of correlation testing with the target variable to determine feature importance and a cross-correlation to remove duplicated linear information.
These different approaches have different assumptions and goals. Feature removal is important from a computational standpoint (many machine learning algorithms are quadratic or worse in the number of features), and with that perspective the goal is to preserve the behavior of the model as best as possible while removing as much information (i.e., as much complexity) as possible. In the Boston Housing data set, your cross-correlation analysis would probably leave you with Charles River Proximity, Nitrous Oxide Concentration, and Average Room Number as the most relevant variables. Between those three you capture nearly all the accuracy a linear model can obtain on the data.
One thing to point out is that feature removal by definition removes information. This can improve accuracy and generalization for only a few reasons.
By removing redundant information, the model has less bias toward those features and is better able to generalize.
By removing noisy information, the model can focus its efforts on features with high informational content. Note that this affects non-deterministic models like neural networks more than models like linear regressions. Linear regressions always converge to the one unique solution (except in special cases that happen with a true 0% probability where there are multiple solutions).
When you're throwing a lot of features into an algorithm (50k different genes for an organism for example), it makes a lot of sense that some of them won't carry any information. By definition then, any variance they have is noise that the model may inadvertently pick up instead of the signal we want. Feature removal is a common strategy in that domain which improves accuracy dramatically.
Contrast that with the Boston Housing data which has 13 carefully curated features, all of which carry information (based on eyeballing crude scatter plots with respect to the target variable). That particular reasoning isn't likely to affect accuracy much. Moreover, there aren't enough features for there to be very much bias introduced with duplicated information.
On top of that, there are hundreds of data points covering the majority of the input space, so even if we did have bias problems or extraneous features, there is more than enough data that the effects will be negligible. Perhaps enough to make or break the 1st or 2nd place winners in Kaggle, but not enough to make the difference between a good analysis and a great analysis.
Especially if you're using a linear algorithm on top though, having fewer features can greatly aid in the explainability of a model. If you restrict your model to those three variables, it's pretty easy to tell a person that you know houses in the area are expensive because they're all waterfront, they're huge, and they have nice lawns (nitrous oxide indicates fertilizer usage).
Removing features is only a small portion of feature engineering, and another important technique is the addition of features. Adding features usually amounts to low-order polynomial interactions (as an example, the age variable has a fairly weak correlation to the medv variable, but if you square it then the data straightens out a bit and improves the correlation).
Adding features (and removing them) can be aided greatly with a little domain knowledge. I don't know a ton about housing, so I can't add a lot of help here, but in other domains like credit worthiness you can easily imagine combining debt and income features to get a ratio of debt to income as a single feature. Reshaping those features so that they linearly correlate to your output and represent physically meaningful quantities in the domain is a big part of obtaining accuracy and generalizability.
With respect to generalizability and domain knowledge, even with something as simple as a linear model it's important to be able to explain why a feature is important. Just because the data says that nitrous oxide matters in the test set doesn't mean that it will carry any predictive weight in the train set as well. Especially as the number of features grows and the amount of data shrinks, you will expect such correlations to occur purely by accident. Having a physical interpretation (nitrous oxide corresponds to nice lawns) yields confidence that the model isn't learning spurious correlations.

How exactly BIC in Augmented Dickey–Fuller test work in Python?

This question is on Augmented Dickey–Fuller test implementation in statsmodels.tsa.stattools python library - adfuller().
In principle, AIC and BIC are supposed to compute information criterion for a set of available models and pick up the best (the one with the lowest information loss).
But how do they operate in the context of Augmented Dickey–Fuller?
The thing which I don't get: I've set maxlag=30, BIC chose lags=5 with some informational criterion. I've set maxlag=40 - BIC still chooses lags=5 but the information criterion have changed! Why in the world would information criterion for the same number of lags differ with maxlag changed?
Sometimes this leads to change of the choice of the model, when BIC switches from lags=5 to lags=4 when maxlag is changed from 20 to 30, which makes no sense as lag=4 was previously available.

When we request automatic lag selection in adfulller, then the function needs to compare all models up to the given maxlag lags. For this comparison we need to use the same observations for all models. Because lagged observations enter the regressor matrix we loose observations as initial conditions corresponding to the largest lag included.
As a consequence autolag uses nobs - maxlags observations for all models. For calculating the test statistic for adfuller itself, we don't need model comparison anymore and we can use all observations available for the chosen lag, i.e. nobs - best_lag.
More general, how to treat initial conditions and different number of initial conditions is not always clear cut, autocorrelation and partial autocorrelation are largely based on using all available observations, full MLE for AR and ARMA models uses the stationary model to include the initial conditions, while conditional MLE or least squares drops them as necessary.

Python Statsmodels Testing Coefficients from Robust Linear Model based on M-Estimators

I have a linear model that I'm trying to fit to data with a good # of outliers in the endogenous variable, but not in the exogenous space. I've researched that RLM's based on M-estimators are good in this situation.
When I fit an RLM to my data in the follow way:
import numpy as np
import statsmodels.formula.api as smf
import statsmodels as sm
modelspec = ('cost ~ np.log(units) + np.log(units):item + item') #where item is a categorical variable
results = smf.rlm(modelspec, data = dataset, M = sm.robust.norms.TukeyBiweight()).fit()
print results.summary()
the summary results shows a z statistic, and seemingly the coefficient test of significance is based off of this rather than a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pg. 19-21
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases, each item has 40 or more observations. There are some items that have 21-25 observations. Is there reason to believe that RLM is not effective in a small sample environment? The line it produces must be the best fit line after reweighting outliers, but is the z-test effective for samples of this size (ie, is there a reason to believe the confidence interval produced by smf.rlm() does not produce 95% probability coverage? I know for t-tests this potentially can be an issue...)?
Thanks!

I have mostly only a general answer, I never read any small sample Monte Carlo studies for M-estimators.
To 1.
In many models, like M-estimators, RLM, or generalized linear models, GLM, we have only asymptotic results, except for maybe a few special cases. Asymptotic results provide conditions that the estimator is normally distributed. Given this, statsmodels defaults to using normal distribution for all models outside of the linear regression model, OLS, and similar, and chisquare instead of the F distribution for Wald tests with joint hypothesis.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release, and in the current development version, of statsmodels users can choose to use the t and F distribution for the results, instead of the normal and chisquare distribution. The defaults stay the same as they are now.
There are other cases where it is not clear whether the t-distribution, and which small sample degrees of freedom should be used. In many cases, statsmodels tries to follow the lead of STATA, for example in cluster robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then the test statistics could have incorrect coverage in small samples. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity or correlation robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness on the assumption on the outliers and for the choice of robust estimator. Another cross check: Does OLS with dropped outliers (observations with small weights in the RLM estimate) give a similar answer.
Finally as general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on "inliers". But is our discarding or down-weighting of outliers justified? Or, do we have missing non-linearities, missing variables, a mixture distribution or different regimes?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.