I was looking at the following official documentation from statsmodels:
https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
But when I try to run this code on a practice dataset (statsmodels.api already imported as sm)
variance_inflation_factor=sm.stats.outliers_influence.variance_inflation_factor()
vif=pd.DataFrame()
vif['VIF']=[variance_inflation_factor(X_train.values,i) for i in range(X_train.shape[1])]
vif['Predictors']=X_train.columns
I get the error message: module 'statsmodels.stats.api' has no attribute 'outliers_influence'
Can anyone tell me what is the appropriate way to get this working?
You don't need to define variance_inflation_factor by calling the function with no arguments, as in variance_inflation_factor=sm.stats.outliers_influence.variance_inflation_factor(). The AttributeError itself arises because sm.stats resolves to statsmodels.stats.api, which does not expose the outliers_influence submodule. Instead, import variance_inflation_factor directly; it is a function that takes two inputs, the design matrix and the index of the column to evaluate.
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_train = pd.DataFrame(np.random.standard_normal((1000, 5)), columns=[f"x{i}" for i in range(5)])
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]  # one VIF per predictor column
vif['Predictors'] = X_train.columns
print(vif)
which produces
VIF Predictors
0 1.002882 x0
1 1.004265 x1
2 1.001945 x2
3 1.004227 x3
4 1.003989 x4
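One caveat worth adding (an aside, not part of the original answer): variance_inflation_factor regresses each column on all the others, so if your model will include an intercept, it is common to add a constant column first and skip it when reporting the VIFs:
import statsmodels.api as sm
Xc = sm.add_constant(X_train)  # prepend an intercept column
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])]  # skip the constant
vif['Predictors'] = X_train.columns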
I would like to perform a simple linear regression using statsmodels, and I've tried several different methods by now, but I just can't get it to work. The code I have constructed now doesn't give me any errors, but it also doesn't show me the result.
I am trying to create a model for the variable Direction, which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the five lags of the returns. The dataframe df13 contains the lags and also the direction for each observed date. I tried the code below and, as I mentioned, it doesn't give an error but only prints:
Optimization terminated successfully.
         Current function value: 0.682314
         Iterations 5
However, I would like to see the typical table with all the beta values, their significance, etc.
Also, since Direction is a binary variable, would it be better to use a logit instead of a linear model? In the assignment, however, it appeared as a linear model.
And lastly, I'm sorry it's not displayed here correctly, but I don't know how to format it as code or insert my dataframe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression, so I updated that part.
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodels, but I would try converting your dataframes to numpy arrays. You can do this with
Xnum = X.to_numpy()
Ynum = Y.to_numpy()
And try passing those to the regressors.
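For completeness, here is a minimal sketch of that suggestion applied to the variables from the question (df13 must exist; note also that summary is a method, so it needs parentheses to produce the table of betas and p-values):
Xnum = X.to_numpy()  # X already includes the constant added via sm.add_constant in the question
Ynum = Y.to_numpy()
model = sm.Logit(Ynum, Xnum).fit()
print(model.summary())  # calling summary() prints the full coefficient table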
I use sklearn to impute some time-series which include NaN values. At the moment, I use the following:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
in which array is a numpy array of shape n_points x n_time_steps. It works fine, but I get a deprecation warning suggesting I should use SimpleImputer from sklearn.impute. Hence I replaced those lines with the following:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
but I get the following error on the last line:
ValueError: 'X' and 'missing_values' types are expected to be both numerical. Got X.dtype=float32 and type(missing_values)=<class 'str'>.
If anybody has any idea what the cause of this error is, I'd be glad if you let me know. I am using Python 3.6.7 with sklearn 0.20.1. Thanks!
If array contains missing values represented as np.nan, you should use np.nan as the missing_values argument to the constructor of SimpleImputer. That's the default argument, so this works:
imp = SimpleImputer(strategy='mean')
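Equivalently, you can pass it explicitly:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
signals = imp.fit_transform(array)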
I am working with an autoregressive model in Python using Statsmodels. The package is great and I am getting the exact results I need. However, testing for residual correlation (Breusch-Godfrey LM-test) doesn't seem to work, because I get an error message.
My code:
import pandas as pd
import datetime
import numpy as np
from statsmodels.tsa.api import VAR
import statsmodels.api as sm
df = pd.read_csv('US_data.csv')
# converting str formatted dates to datetime and setting the index
j = []
for i in df['Date']:
    j.append(datetime.datetime.strptime(i, '%Y-%m-%d').date())
df['Date'] = j
df = df.set_index('Date')
# dataframe contains three columns (GDP, INV and CONS)
# log difference
df = pd.DataFrame(np.log(df)*100)
df = df.diff()
p = 4  # lag order
model = VAR(df[1:])  # drop the first row of NaNs produced by diff()
results = model.fit(p, method='ols')
sm.stats.diagnostic.acorr_breusch_godfrey(results)
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-11abf518baae> in <module>()
----> 1 sm.stats.diagnostic.acorr_breusch_godfrey(results)
/home/****/anaconda3/lib/python3.6/site-packages/statsmodels/sandbox/stats/diagnostic.py in acorr_breusch_godfrey(results, nlags, store)
501 nlags = int(nlags)
502
--> 503 x = np.concatenate((np.zeros(nlags), x))
504
505 #xdiff = np.diff(x)
ValueError: all the input arrays must have same number of dimensions
A similar question was asked here over five months ago, but with no luck. Does anybody have an idea how to resolve this? Thank you very much in advance!
Those diagnostic tests were designed for univariate models like OLS, where we have a one-dimensional residual array. Most likely the only way to use them here is to apply the test to a single equation of the VAR system, or to loop over each equation or variable.
VARResults in statsmodels master has a test_whiteness_new method which is a test for no autocorrelation of the multivariate residuals of a VAR.
It uses a Portmanteau test, which I think is the same as Ljung-Box.
The statespace models also use Ljung-Box for the related tests.
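As a concrete workaround, here is a minimal sketch of the loop-over-equations idea, applying the univariate Ljung-Box test to each residual series (note: the return format of acorr_ljungbox differs across statsmodels versions, and I'm assuming results.resid yields one residual column per equation):
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox
resid = np.asarray(results.resid)  # residuals, one column per equation of the VAR
for k, name in enumerate(df.columns):
    print(name, acorr_ljungbox(resid[:, k], lags=[p]))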
I'm getting acquainted with Statsmodels so as to shift my more complicated stats completely over to python. However, I'm being cautious, so I'm cross-checking my results with SPSS, just to make sure I'm not making any obvious blunders. Most of time, there's no difference, but I have one example of a two-way ANOVA that's throwing up very different test statistics in Statsmodels and SPSS. (Relevant point: the sample sizes in the ANOVA are mismatched, so ANOVA may not be the appropriate model here.)
I'm selecting my model as follows:
import pandas as pd
import scipy as sp
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import statsmodels
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
Body = pd.read_csv(filepath)
Body = Body.dropna()
Body_lm = ols('Effect ~ C(Fiction) + C(Condition) + C(Fiction)*C(Condition)', data = Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=2)
The Statsmodels output is as below:
sum_sq df F PR(>F)
C(Fiction) 278.176684 1.0 307.624463 1.682042e-55
C(Condition) 4.294764 1.0 4.749408 2.971278e-02
C(Fiction):C(Condition) 10.776312 1.0 11.917092 5.970123e-04
Residual 520.861599 576.0 NaN NaN
The corresponding SPSS results (shown as a screenshot in the original post) are noticeably different.
Can anyone help explain the difference? Is it perhaps the unequal sample sizes being treated differently under the hood? Or am I choosing the wrong model?
Any help appreciated!
You should use sum coding when comparing the means of the variables.
By the way, you don't need to specify each variable in the interaction term if the * multiply operator is used:
“:” adds a new column to the design matrix with the product of the other two columns.
“*” will also include the individual columns that were multiplied together.
Your model should be:
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data = Body).fit()
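For completeness, a minimal sketch combining this with the anova_lm call from the question (typ=3 is my addition: SPSS reports Type III sums of squares by default, and Type III is only meaningful with sum-to-zero coding such as the above):
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data = Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=3)  # Type III, to match SPSS's default
print(table)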
Can anyone explain to me the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?
Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodels.api method returns a different parameter for TV from the statsmodels.formula.api and scikit-learn methods.
What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Came across this issue today and wanted to elaborate on @stellasia's answer, because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.
# Generate some relational data
import numpy as np
import statsmodels.api as sm
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
The difference is due to the presence or absence of an intercept:
in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted
in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api:
x1 = sm.add_constant(x1)
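With the constant added, the plain-API fit should reproduce the formula-API coefficients (a sketch reusing the variables from the question):
model2 = sm.OLS(y1, x1).fit()
print(model2.params)  # now includes a const term matching the smf.ols intercept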
I had a similar issue with the Logit function.
(I used patsy to create my matrices, so the intercept was there.)
My sm.logit was not converging.
My sm.formula.logit was converging however.
Data going in was exactly the same.
I changed the solver method to 'newton' and the sm.logit converged also.
Is it possible the two versions have different default solver methods?
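For reference, the solver can be chosen explicitly when fitting (a sketch; y and X stand in for your own endog/exog data):
result = sm.Logit(y, X).fit(method='newton', maxiter=100)
print(result.summary())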