I'm trying to run a multiple OLS regression using statsmodels and a pandas dataframe. There are missing values in different columns for different rows, and I keep getting the error message:
ValueError: array must not contain infs or NaNs
I saw this SO question, which is similar but doesn't exactly answer my question: statsmodel.api.Logit: valueerror array must not contain infs or nans
What I would like to do is run the regression and ignore all rows where there are missing values for the variables used in this regression. Right now I have:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.read_csv('cl_030314.csv')
results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df).fit()
I want something like missing = "drop".
Any suggestions would be greatly appreciated. Thanks so much.
You answered your own question. Just pass
missing='drop'
to ols:
import statsmodels.formula.api as smf
...
results = smf.ols(formula="da ~ cfo + rm_proxy + cpi + year",
                  data=df, missing='drop').fit()
If this doesn't work then it's a bug and please report it with a MWE on github.
FYI, note the import above. Not everything is available in the formula.api namespace, so you should keep it separate from statsmodels.api. Or just use
import statsmodels.api as sm
sm.formula.ols(...)
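For completeness, a minimal sketch of that second route, reusing the df and formula from the question:

import statsmodels.api as sm

# sm.formula.ols is the same estimator as smf.ols, reached through
# the main namespace instead of the formula namespace
results = sm.formula.ols("da ~ cfo + rm_proxy + cpi + year",
                         data=df, missing='drop').fit()
print(results.summary())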
The answer from jseabold works very well, but it may not be enough if you want to do some computation on the predicted and true values afterwards, e.g. with a function like mean_squared_error. In that case it may be better to get rid of the NaN rows for good:
import statsmodels.formula.api as smf

df = pd.read_csv('cl_030314.csv')
df_cleaned = df.dropna()  # drops every row containing a NaN in any column
results = smf.ols(formula="da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()
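Note that a bare df.dropna() discards rows with a NaN in any column, even ones the model never uses. To mirror missing='drop' more closely, restrict the drop to the regression variables (a small sketch reusing df and smf from above):

# only drop rows that are missing one of the variables in the model
model_vars = ['da', 'cfo', 'rm_proxy', 'cpi', 'year']
df_cleaned = df.dropna(subset=model_vars)
results = smf.ols(formula="da ~ cfo + rm_proxy + cpi + year",
                  data=df_cleaned).fit()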
I want to calculate Cook's distance (cooks_d) and DFFITS in Python using statsmodels.
Here is my code in Python:
import statsmodels.api as sm

X = your_str_cleaned[param]
y = your_str_cleaned['Visitor']
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
I tried using this for getting Cook's distance and DFFITS:
import statsmodels.stats.outliers_influence as st_inf
st_inf.OLSInfluence.summary_frame(results)
But I am getting this error:
'OLSResults' object has no attribute 'results'.
Can someone help me find where I am going wrong?
I experienced the same problem, so I had to find a way around it. I don't have much experience, and this doesn't fix the root issue with OLSInfluence, but it does give you summary_frame.
I will use pandas dataframes as the source of the data. Even if you have it in other objects (like arrays) you can transform them into a dataframe with relative ease. To show how it works, I will import the Boston housing prices data set from sklearn.datasets:
import pandas as pd
from sklearn.datasets import load_boston

# note: load_boston was removed in scikit-learn 1.2, so this example
# assumes an older scikit-learn version
boston = load_boston()

# build a DataFrame from the feature matrix
bos = pd.DataFrame(boston.data)
# add column names to bos
bos.columns = boston.feature_names
# add the target as column 'PRICE'
bos['PRICE'] = boston.target
Now let us consider the relation between the column 'RM' and the column 'PRICE', with 'RM' as the independent variable. For simplicity, let us consider simple OLS. Here comes the actual answer:
from statsmodels.formula.api import ols

m = ols('PRICE ~ RM', data=bos).fit()
infl = m.get_influence()
sm_fr = infl.summary_frame()
sm_fr has the cooks_d and dffits columns that you are looking for.
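For example, to pull those two diagnostics out of the frame (a short usage sketch; the 4/n cutoff is just a common rule of thumb, not part of the original answer):

# per-observation diagnostics
cooks_d = sm_fr['cooks_d']
dffits = sm_fr['dffits']

# e.g. flag influential points via the common 4/n rule for Cook's distance
influential = sm_fr[sm_fr['cooks_d'] > 4 / len(sm_fr)]
print(influential[['cooks_d', 'dffits']])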
I'm getting acquainted with Statsmodels so as to shift my more complicated stats completely over to Python. However, I'm being cautious, so I'm cross-checking my results against SPSS, just to make sure I'm not making any obvious blunders. Most of the time there's no difference, but I have one example of a two-way ANOVA that's throwing up very different test statistics in Statsmodels and SPSS. (Relevant point: the sample sizes in the ANOVA are mismatched, so ANOVA may not be the appropriate model here.)
I'm selecting my model as follows:
import pandas as pd
import scipy as sp
import numpy as np
import statsmodels.api as sm
import seaborn as sns
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
Body = pd.read_csv(filepath)
Body = Body.dropna()
Body_lm = ols('Effect ~ C(Fiction) + C(Condition) + C(Fiction)*C(Condition)', data = Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=2)
The Statsmodels output is as below:
sum_sq df F PR(>F)
C(Fiction) 278.176684 1.0 307.624463 1.682042e-55
C(Condition) 4.294764 1.0 4.749408 2.971278e-02
C(Fiction):C(Condition) 10.776312 1.0 11.917092 5.970123e-04
Residual 520.861599 576.0 NaN NaN
The corresponding SPSS results (posted as a screenshot, not reproduced here) show noticeably different test statistics.
Can anyone help explain the difference? Is it perhaps the unequal sample sizes being treated differently under the hood? Or am I choosing the wrong model?
Any help appreciated!
You should use sum (effects) coding when comparing the means of the variables. With unbalanced cell sizes, SPSS's default Type III sums of squares are only meaningful under sum-to-zero contrasts, whereas statsmodels' C() defaults to treatment (dummy) coding, which is why the two outputs disagree.
BTW, you don't need to spell out each variable in the interaction term when the * operator is used:
“:” adds a new column to the design matrix with the product of the other two columns.
“*” will also include the individual columns that were multiplied together.
Your model should be:
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data = Body).fit()
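Putting the pieces together (a sketch: the switch to typ=3 is my assumption about how to match SPSS's default Type III output, it is not stated in the original answer):

from statsmodels.formula.api import ols
import statsmodels.api as sm

# sum-to-zero coding makes each main effect a deviation from the grand
# mean, which is what Type III tests assume
Body_lm = ols('Effect ~ C(Fiction, Sum)*C(Condition, Sum)', data=Body).fit()
table = sm.stats.anova_lm(Body_lm, typ=3)
print(table)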
I'm trying to convert the following code from R to Python using the Statsmodels module:
model <- glm(goals ~ att + def + home - (1), data=df, family=poisson, weights=weight)
I've got a similar dataframe (named df) using pandas, and currently have the following line in Python (version 3.4 if it makes a difference):
model = sm.Poisson.from_formula("goals ~ att + def + home - 1", df).fit()
Or, using GLM:
smf.glm("goals ~ att + def + home - 1", df, family=sm.families.Poisson()).fit()
However, I can't get the weighting terms to work. Each record in the dataframe has a date, and I want more recent records to count more when fitting the model than older ones. I've not seen an example of it being used, but surely if it can be done in R, it can be done in statsmodels... right?
freq_weights is now supported for GLM Poisson, but unfortunately not for sm.Poisson.
To use it, pass freq_weights when creating the GLM:
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "goals ~ att + def + home - 1"
smf.glm(formula, df, family=sm.families.Poisson(), freq_weights=df['freq_weight']).fit()
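The question asks for recency weighting, so the weight column has to be built first. One hedged sketch, where the 'date' column name and the 10%-per-year decay rate are assumptions rather than anything from the question:

import pandas as pd

# hypothetical construction: decay each record's weight by 10% per
# year of age; the 'date' column and 0.9 rate are assumptions
age_days = (pd.Timestamp.today() - pd.to_datetime(df['date'])).dt.days
df['freq_weight'] = 0.9 ** (age_days / 365.25)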
I've encountered the same issue.
There is a workaround that should lead to the same results: add the weight on the logarithmic scale (np.log(weight)) as an explanatory variable with its coefficient fixed at 1 (the offset option).
I can see there is also an exposure option, which does the same thing as I explained above.
There are two solutions for setting up weights for Poisson regression. The first is to use freq_weights in the GLM function, as mentioned by MarkWPiper. The second is to go with Poisson regression and pass the weights as exposure. As documented here: "Log(exposure) is added to the linear prediction with coefficient equal to 1." This performs the same mathematical trick as mentioned by Yaron, although the parameter originally has a different meaning. A sample code is as follows:
import statsmodels.api as sm
# or: from statsmodels.discrete.discrete_model import Poisson
fitted = sm.Poisson.from_formula("goals ~ att + def + home - 1", data=df, exposure=df['weight']).fit()
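Equivalently, since log(exposure) just enters the linear predictor with a fixed coefficient of 1, you can pass the log-weights through the offset parameter instead; this is the same fit stated explicitly (a sketch reusing the question's formula and df):

import numpy as np
import statsmodels.api as sm

# offset = log(weight) is mathematically identical to exposure = weight
fitted = sm.Poisson.from_formula("goals ~ att + def + home - 1",
                                 data=df, offset=np.log(df['weight'])).fit()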
I have tried to use pd.cut to create a categorical variable from a continuous variable. I'd like to use this in a subsequent statsmodels formula-defined regression that includes this dummy variable. When I create a categorical variable this way, I get an error:
TypeError: data type not understood.
A test case is included below.
import numpy as np
import pandas as pd
import statsmodels as sm
import statsmodels.formula.api as smf
df = pd.DataFrame(np.random.randn(6,4))
df.columns = ['A', 'B', 'C', 'D']
df['ttt'] = pd.cut(df['D'], bins=2)
test = smf.ols('A ~ B + ttt', data=df).fit()
I'm sure I've done something obviously wrong. Any help would be appreciated.
I'm not sure exactly where statsmodels stands in terms of support for the newer pandas Categorical type. For the moment, you may have to convert the categorical back to object dtype for it to work (please check that the resulting OLS fit is sensible; I don't know the full details of what you're trying to do):
df['ttt_fixed'] = df.ttt.astype(object)  # np.object was removed in NumPy 1.24; plain object works
test = smf.ols('A ~ B + ttt_fixed', data=df).fit()
test.summary()
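If the Interval objects produced by pd.cut still trip up the formula interface, converting them to plain strings is another option (a hedged alternative; the ttt_str column name is mine):

# string labels such as "(-1.23, 0.45]" are unambiguous categorical levels
df['ttt_str'] = df['ttt'].astype(str)
test = smf.ols('A ~ B + ttt_str', data=df).fit()
print(test.summary())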
I'm trying to use Patsy to make endogenous and exogenous data matrices for use in binary logistic regression. I'm having problems setting the reference level of the endogenous side.
The problem with the following code is that the endogenous side has two columns (one per level), where binary logistic regression needs only one.
import pandas as pd
import statsmodels.api as sm
import patsy
# data:
url = 'http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv'
df = pd.read_csv(url)
df = df.iloc[:, 1:]  # drop the unnamed index column
df = df.loc[ ( df.Species == 'setosa') | ( df.Species == 'versicolor' ) ,]
df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species' ]
y, X = patsy.dmatrices("C(Species, Treatment('versicolor')) ~ Sepal_Length", data=df, return_type='dataframe')
The shape of y is (100, 2), but I only need one column. So how do I get Patsy to output the endogenous side so that I can use it directly in binary logistic regression?
Hmm, my advice would be to slice into y after you do the above. Patsy isn't really designed with LHS variables in mind. Statsmodels should work in this case (currently it doesn't, but that's a bug in statsmodels IMO; if you file a bug report on github, I can look into it).
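Concretely, the slicing might look like this (a sketch; which indicator column you keep determines which species is coded as 1, and the patsy-generated column labels are worth inspecting via y.columns first):

import statsmodels.api as sm

# keep a single indicator column as the binary response;
# y.iloc[:, 0] is 1 for one species and 0 for the other
endog = y.iloc[:, 0]
logit_res = sm.Logit(endog, X).fit()
print(logit_res.summary())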
FYI, as a shortcut to get to the Rdatasets data, you can use:
import statsmodels.api as sm
dta = sm.datasets.get_rdataset('iris', cache=True)
df = dta.data  # the DataFrame itself; dta.__doc__ holds the R documentation