patsy dmatrices raising "AssertionError" - python

Noob trying my first Negative Binomial Regression. iPython on Google's Colab. I load the dataset as a pandas df. The features (and Target) in the formula below all appear in the df (which I named "dataset").
I also bring in
from patsy import dmatrices
import statsmodels.api as sm
however, when I
formula = """Target ~ MeanAge + %White + %HHsNotWater + HHsIneq*10 + %NotSaLang + %male + %Informal + COGTACatG2B09 + %Poor + AGRating """
data = dataset
response, predictors = dmatrices(formula, data, return_type='dataframe')
nb_results = sm.GLM(response, predictors, family=sm.families.NegativeBinomial(alpha=0.15)).fit()
print(nb_results.summary())
I simply get AssertionError:, and an arrow to line four (the one starting "response"). I have no idea how to remedy this, and cannot find similar problems on this site - any sage guidance, please?

...the mistake I made was in the formula line. The formula parser reads the "%" and "*" in my feature names as operators, not as part of the names.
So changing each such feature from, e.g., %HHsNotWater to Q('%HHsNotWater') made all the difference. @njsmith at the pydata/patsy GitHub issues set me straight.
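As a minimal sketch of the fix described above (the data values are invented for illustration, and only a few of the original columns are included), patsy's Q() lets a column name containing operator characters be used as-is:

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data; the column names deliberately contain "%" and "*"
dataset = pd.DataFrame({
    "Target": [3, 1, 4, 1, 5],
    "MeanAge": [30.0, 41.5, 28.2, 35.0, 39.9],
    "%White": [0.10, 0.25, 0.05, 0.40, 0.30],
    "HHsIneq*10": [1.2, 3.4, 2.2, 0.9, 4.1],
})

# Q("...") quotes a name so the formula parser does not split it on operators
formula = 'Target ~ MeanAge + Q("%White") + Q("HHsIneq*10")'
response, predictors = dmatrices(formula, dataset, return_type="dataframe")

print(predictors.columns.tolist())
```

Renaming the columns up front (for example with dataset.rename) is another option if the Q() wrappers make the formula hard to read.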

Related

Python statsmodels logit wald test input

I have fitted a logistic regression model to some data, and everything is working great. I need to calculate the Wald statistic, which is a function of the model result.
My problem is that I do not understand, from the documentation, what the Wald test requires as input. Specifically, what is the R matrix and how is it generated?
I tried simply inputting the data I used to train and test the model as the R matrix, but I do not think this is correct. The documentation suggests examining the examples, but none of them demonstrates this test. I also asked the same question on Cross Validated but got shot down.
Kind regards
http://statsmodels.sourceforge.net/0.6.0/generated/statsmodels.discrete.discrete_model.LogitResults.wald_test.html#statsmodels.discrete.discrete_model.LogitResults.wald_test
The Wald test is used to test whether a predictor is significant, and has the form:
W = (beta_hat - beta_0) / SE(beta_hat) ~ N(0,1)
So the input should identify which coefficients you are testing: each row of R is one linear restriction on the parameters, not a row of data. Judging from the examples for t_test and f_test, it may be simpler to pass a string or an array that describes the hypothesis.
Here is their example using a string for f_test:
from statsmodels.datasets import longley
from statsmodels.formula.api import ols
dta = longley.load_pandas().data
formula = 'TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR'
results = ols(formula, dta).fit()
hypotheses = '(GNPDEFL = GNP), (UNEMP = 2), (YEAR/1829 = 1)'
f_test = results.f_test(hypotheses)
print(f_test)
And here is their example using an array:
import numpy as np
import statsmodels.api as sm
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
results = sm.OLS(data.endog, data.exog).fit()
r = np.zeros_like(results.params)
r[5:] = [1,-1]
T_test = results.t_test(r)
If you're still struggling to get the Wald test to work, include your code and I can try to help make it work.

Using weightings in a Poisson model using Statsmodels module

I'm trying to convert the following code from R to Python using the Statsmodels module:
model <- glm(goals ~ att + def + home - (1), data=df, family=poisson, weights=weight)
I've got a similar dataframe (named df) using pandas, and currently have the following line in Python (version 3.4 if it makes a difference):
model = sm.Poisson.from_formula("goals ~ att + def + home - 1", df).fit()
Or, using GLM:
smf.glm("goals ~ att + def + home - 1", df, family=sm.families.Poisson()).fit()
However, I can't get the weighting terms to work. Each record in the dataframe has a date, and I want more recent records to be more valuable for fitting the model than older ones. I've not seen an example of it being used, but surely if it can be done in R, it can be done on Statsmodels... right?
freq_weights is now supported on GLM with the Poisson family, but unfortunately not on sm.Poisson.
To use it, pass freq_weights when creating the GLM:
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "goals ~ att + def + home - 1"
smf.glm(formula, df, family=sm.families.Poisson(), freq_weights=df['freq_weight']).fit()
I've encountered the same issue.
There is a workaround that should lead to the same results: add the weight on the log scale (np.log(weight)) as an explanatory variable with its coefficient fixed at 1 (the offset option).
There is also an exposure option, which does the same thing as described above.
There are two solutions for setting up weights for Poisson regression. The first is to use freq_weights in the GLM function, as mentioned by MarkWPiper. The second is to stay with Poisson regression and pass the weights to exposure. As documented here: "Log(exposure) is added to the linear prediction with coefficient equal to 1." This does the same mathematical trick as mentioned by Yaron, although the parameter has a different original meaning. Note that an offset rescales the expected count rather than reweighting each observation's contribution to the likelihood, so it is worth checking the results against R. Sample code is as follows:
import statsmodels.api as sm
# or: from statsmodels.discrete.discrete_model import Poisson
fitted = sm.Poisson.from_formula("goals ~ att + def + home - 1", data=df, exposure=df['weight']).fit()

regression on trend + seasonal using python statsmodels

I have a question regarding regression in Python. To make a long story short, I need to find a model of the form yt = mt + st, where mt and st are the trend and seasonal components respectively. In my earlier regression analysis, I found that a good model for mt is a quadratic trend of the form mt = a0 + a1*t + a2*t^2.
Adding the seasonal component is where I am having the hardest time. I approached this in two ways: one through R, calling R objects from Python, and the other purely in Python. Following the example in my book, I did the following using R:
%load_ext rmagic
import rpy2.robjects as R
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
stats = importr('stats')
r_df = com.convert_to_r_dataframe(pd.DataFrame(data.logTotal))
%Rpush r_df
%R ss = as.factor(rep(1:12,length(r_df$logTotal)/12))
%R tt = 1:length(r_df$logTotal)
%R tt2 = cbind(tt,tt^2)
%R ts_model = lm(r_df$logTotal ~ tt2+ss-1)
%R print(summary(ts_model))
I get the right regression coefficients. But when I do the same thing in Python, I have trouble replicating them.
import statsmodels.formula.api as smf
ss_temp= pd.Categorical.from_array(np.repeat(np.arange(1,13),len(data.logTotal)/12))
dtemp = np.column_stack((t,t**2,data.logTotal))
dtemp = pd.DataFrame(dtemp,columns=['t','tsqr','logTotal'])
dtemp['ss'] = ss_temp
res_result = smf.ols(formula='logTotal ~ t+tsqr + C(ss) -1',data=dtemp).fit()
res_result.params
What am I doing wrong here? I first get an error saying 'data type not found', which points to the res_result formula. So I tried changing ss_temp to a Series, and then the above statements worked. However, my parameters were completely off when compared to the R output. I have spent a day on this to no avail. Can someone please help me or guide me as to what to do? Is there a Python equivalent to as.factor in R? I assumed that was Categorical in pandas.
Thanks
If the above is too hard, that's fine. I still have the residual model from my regression in R. But are there any ideas on how to convert this to a Python equivalent of what statsmodels interprets as the result of a regression? Thanks again
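One thing worth checking, offered as a guess rather than a confirmed diagnosis: R's rep(1:12, k) cycles through 1..12 k times, whereas np.repeat(np.arange(1, 13), k) repeats each value k times in a row, so the two seasonal labels differ. np.tile reproduces the R ordering. A minimal sketch on simulated monthly data (the series itself is invented; only the column names follow the question):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

n = 48                                   # four years of monthly data
t = np.arange(1, n + 1)

# np.tile cycles 1..12 like R's rep(1:12, n/12); np.repeat would give 1,1,1,1,2,...
month = np.tile(np.arange(1, 13), n // 12)

# Simulated series: quadratic trend plus a seasonal pattern plus small noise
rng = np.random.default_rng(0)
y = 1.0 + 0.05 * t + 0.001 * t**2 + np.sin(2 * np.pi * month / 12)
y = y + rng.normal(scale=0.01, size=n)

dtemp = pd.DataFrame({"t": t, "tsqr": t**2, "logTotal": y,
                      "ss": pd.Categorical(month)})

# No intercept, so the seasonal factor gets one coefficient per month
res_result = smf.ols("logTotal ~ t + tsqr + C(ss) - 1", data=dtemp).fit()
print(res_result.params)
```

pd.Categorical plays the role of as.factor here; wrapping the column in C() in the formula has the same effect for a plain integer column.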

Ignoring missing values in multiple OLS regression with statsmodels

I'm trying to run a multiple OLS regression using statsmodels and a pandas dataframe. There are missing values in different columns for different rows, and I keep getting the error message:
ValueError: array must not contain infs or NaNs
I saw this SO question, which is similar but doesn't exactly answer my question: statsmodel.api.Logit: valueerror array must not contain infs or nans
What I would like to do is run the regression and ignore all rows where there are missing variables for the variables I am using in this regression. Right now I have:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.read_csv('cl_030314.csv')
results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df).fit()
I want something like missing = "drop".
Any suggestions would be greatly appreciated. Thanks so much.
You answered your own question. Just pass
missing = 'drop'
to ols
import statsmodels.formula.api as smf
...
results = smf.ols(formula = "da ~ cfo + rm_proxy + cpi + year",
data=df, missing='drop').fit()
If this doesn't work then it's a bug and please report it with a MWE on github.
FYI, note the import above. Not everything is available in the formula.api namespace, so you should keep it separate from statsmodels.api. Or just use
import statsmodels.api as sm
sm.formula.ols(...)
The answer from jseabold works very well, but it may not be enough if you then want to do some computation on the predicted and true values, e.g. if you want to use the function mean_squared_error. In that case, it may be better to remove the NaN rows from the dataframe itself, restricted to the columns used in the regression:
df = pd.read_csv('cl_030314.csv')
# Drop only rows with a NaN in a column the model actually uses
df_cleaned = df.dropna(subset=['da', 'cfo', 'rm_proxy', 'cpi', 'year'])
results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()
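A small sketch comparing the two approaches discussed above on toy data (the values are invented and only two of the question's columns are used): passing missing='drop' and dropping the incomplete rows by hand should give the same fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with NaNs scattered across different columns
df = pd.DataFrame({
    "da":  [1.0, 2.1, np.nan, 4.2, 5.1, 6.3],
    "cfo": [0.5, np.nan, 1.5, 2.0, 2.4, 3.1],
    "cpi": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0],   # not used in the model
})

# Let statsmodels drop rows with missing values in the model's variables
dropped = smf.ols("da ~ cfo", data=df, missing="drop").fit()

# Drop the same rows explicitly, looking only at the columns in the formula
cleaned = df.dropna(subset=["da", "cfo"])
manual = smf.ols("da ~ cfo", data=cleaned).fit()

print(dropped.nobs, manual.nobs)
print(dropped.params)
```

Note that the NaN in cpi does not cost a row in either fit, because cpi is not part of the formula.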

Why are LASSO in sklearn (python) and matlab statistical package different?

I am using LassoCV from sklearn to select the best model by cross-validation. I found that cross-validation gives different results depending on whether I use sklearn or the Matlab statistics toolbox.
I used matlab and replicate the example given in
http://www.mathworks.se/help/stats/lasso-and-elastic-net.html
to get a figure like this
Then I saved the Matlab data and tried to replicate the figure with lasso_path from sklearn, and I got
Although there is some similarity between these two figures, there are also clear differences. As far as I understand, the parameter lambda in Matlab and alpha in sklearn are the same, but the figures suggest otherwise. Can somebody point out which is the correct one, or am I missing something? Furthermore, the coefficients obtained are also different (which is my main concern).
Matlab Code:
rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
save randomData.mat % To be used in python code
[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');
disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictos')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)
Python Code:
import scipy.io
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, LassoCV
# Load the data saved by the Matlab script
data = scipy.io.loadmat('randomData.mat')
X = data['X']
Y = data['Y'].flatten()
# Cross-validated choice of alpha
model = LassoCV(cv=10, max_iter=1000).fit(X, Y)
print('alpha', model.alpha_)
print('coef', model.coef_)
eps = 1e-2 # the smaller it is the longer is the path
# Current scikit-learn returns arrays (alphas, coefs, dual gaps), not a list of models
alphas_lasso, coefs_lasso, _ = lasso_path(X, Y, eps=eps)
plt.figure(1)
# coefs_lasso has shape (n_features, n_alphas); transpose to plot one line per feature
plt.semilogx(alphas_lasso, coefs_lasso.T)
plt.gca().invert_xaxis()
plt.xlabel('alpha')
plt.show()
I do not have Matlab, but be careful: the value obtained with cross-validation can be unstable, because it is influenced by the way you subdivide the samples.
Even if you run the cross-validation twice in Python, you can obtain two different results.
Consider this example:
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
0.00645093258722
0.00691712356467
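The snippet above relies on the old sklearn.cross_validation module. As a sketch of the same experiment with the current sklearn.model_selection API (the simulated data only imitates the Matlab example, and the seeds are arbitrary assumptions), the selected alpha can be compared across different shuffled fold assignments:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Simulated data in the spirit of the Matlab example: 5 exponential predictors
rng = np.random.default_rng(3)
X = np.column_stack([rng.exponential(scale=i, size=200) for i in range(1, 6)])
y = X @ np.array([0.0, 2.0, 0.0, -3.0, 0.0]) + 0.1 * rng.normal(size=200)

# Same data, three different ways of splitting it into 10 folds
alphas = []
for seed in (0, 1, 2):
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    alphas.append(LassoCV(cv=kf).fit(X, y).alpha_)

print(alphas)
```

Any spread in the printed values comes only from how the samples were split, which is the point the answer makes.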
It's possible that alpha = lambda / n_samples,
where n_samples = X.shape[0] in scikit-learn.
Another remark is that your path does not look as piecewise linear as it could or should. Consider reducing tol and increasing max_iter.
Hope this helps.
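To see where n_samples enters on the scikit-learn side, this sketch (simulated data, illustrative names) uses the fact that scikit-learn's Lasso minimizes (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1. Under that scaling, the smallest alpha that zeroes every coefficient is max|Xc' yc| / n_samples, where Xc and yc are the centered data. Whether Matlab's lambda uses the same scaling is not checked here:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = np.column_stack([rng.exponential(scale=i, size=200) for i in range(1, 6)])
y = X @ np.array([0.0, 2.0, 0.0, -3.0, 0.0]) + 0.1 * rng.normal(size=200)

# Lasso fits an intercept by default, which amounts to centering X and y
Xc = X - X.mean(axis=0)
yc = y - y.mean()
n_samples = X.shape[0]

# Smallest penalty for which all coefficients are exactly zero
alpha_max = np.max(np.abs(Xc.T @ yc)) / n_samples

coef_above = Lasso(alpha=1.01 * alpha_max).fit(X, y).coef_
coef_below = Lasso(alpha=0.5 * alpha_max).fit(X, y).coef_

print(alpha_max)
print(coef_above)   # all zeros
print(coef_below)   # at least one non-zero coefficient
```

Comparing alpha_max with the largest lambda on the Matlab path is one way to test whether the two penalties are on the same scale.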
I know this is an old thread, but:
I'm actually working on porting from glmnet (in R) to LassoCV, and I found that LassoCV does not handle normalizing the X matrix well on its own (even if you specify the parameter normalize = True).
Try normalizing the X matrix yourself before passing it to LassoCV.
If it is a pandas object,
(X - X.mean())/X.std()
It seems you also need to multiply alpha by 2.
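One detail to be aware of when standardizing by hand, shown as a small sketch on made-up data: pandas' std() uses the sample standard deviation (ddof=1) by default, while scikit-learn's StandardScaler divides by the population standard deviation (ddof=0), so the two only match if ddof=0 is passed explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.exponential(scale=2.0, size=(50, 3)),
                 columns=["a", "b", "c"])

# Manual standardization with the pandas default (ddof=1)
manual_default = (X - X.mean()) / X.std()

# Manual standardization matching StandardScaler (ddof=0)
manual_pop = (X - X.mean()) / X.std(ddof=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual_pop.values, scaled))       # True
print(np.allclose(manual_default.values, scaled))   # False
```

The difference is a constant factor of sqrt((n - 1) / n) per column, which is small for large samples but shifts the effective penalty slightly.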
Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.
These are the facts:
MathWorks selected an example and decided to include it in their documentation.
Your Matlab code produces exactly the same result as the example.
The alternative does not match the result, and has provided inaccurate results in the past.
This is my assumption:
The chance that MathWorks put an incorrect example in their documentation is negligible compared to the chance that a reproduction of this example by other means does not give the correct result.
The logical conclusion: your Matlab implementation of this example is reliable and the other is not.
This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion would be that you should continue with Matlab to select your model.
