I'm trying to convert the following code from R to Python using the Statsmodels module:
model <- glm(goals ~ att + def + home - (1), data=df, family=poisson, weights=weight)
I've got a similar dataframe (named df) using pandas, and currently have the following line in Python (version 3.4 if it makes a difference):
model = sm.Poisson.from_formula("goals ~ att + def + home - 1", df).fit()
Or, using GLM:
smf.glm("goals ~ att + def + home - 1", df, family=sm.families.Poisson()).fit()
However, I can't get the weighting terms to work. Each record in the dataframe has a date, and I want more recent records to be more valuable for fitting the model than older ones. I've not seen an example of it being used, but surely if it can be done in R, it can be done on Statsmodels... right?
freq_weights is now supported on GLM Poisson, but unfortunately not on sm.Poisson
To use it, pass freq_weights when creating the GLM:
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "goals ~ att + def + home - 1"
smf.glm(formula, df, family=sm.families.Poisson(), freq_weights=df['freq_weight']).fit()
I've encountered the same issue.
there is a workaround that should lead to same results. add the weight in logarithm scale (np.log(weight)) you need as one of the explanatory variables with beta equal to 1 (offset option).
I can see there is an option for the exposure which doing the same as I explained above.
There are two solutions for setting up weights for Poisson regression. The first is to use freq_weigths in the GLM function as mentioned by MarkWPiper. The second is to just go with Poisson regression and pass the weights to exposure. As documented here: "Log(exposure) is added to the linear prediction with coefficient equal to 1." This does the same mathematical trick as mentioned by Yaron, although the parameter has a different original meaning. A sample code is as follows:
import statsmodels.api as sm
# or: from statsmodels.discrete.discrete_model import Poisson
fitted = sm.Poisson.from_formula("goals ~ att + def + home - 1", data=df, exposure=df['weight']).fit()
Related
Noob trying my first Negative Binomial Regression. iPython on Google's Colab. I load the dataset as a pandas df. The features (and Target) in the formula below all appear in the df (which I named "dataset").
I also bring in
from patsy import dmatrices
import statsmodels.api as sm
however, when I
formula = """Target ~ MeanAge + %White + %HHsNotWater + HHsIneq*10 + %NotSaLang + %male + %Informal + COGTACatG2B09 + %Poor + AGRating """
data = dataset
response, predictors = dmatrices(formula, data, return_type='dataframe')
nb_results = sm.GLM(response, predictors, family=sm.families.NegativeBinomial(alpha=0.15)).fit()
print(nb_results.summary())
I simply get AssertionError:, and an arrow to line four (the one starting "response"). I have no idea how to remedy this, and cannot find similar problems on this site - any sage guidance, please?
...the mistake I made was in the formula line. Python sees the "%" and "*" in my feature names as very different instructions altogether.
So changing each feature from HHsHotWater to Q('HHsNotWater') etc, made all the difference. #njsmith at the pydata/patsy issues github set me straight.
I'm trying to replicate code from R that estimates a random intercept model. The R code is:
fit=lmer(resid~-1+(1|groupid),data=df)
I'm using the lmer command of the lme4 package to estimate random intercepts for the variable resid for observations in different groups (defined by groupid). There is no 'fixed effects' part, therefore no variable before the (1|groupid). Moreover, I do not want a constant estimated so that I get an intercept for each group.
Not sure how to do similar estimation in Python. I tried something like:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(25, 4), columns=list('ABCD'))
df['groupid'] = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5]
df['groupid'] = df['groupid'].astype('category')
###Random intercepts models
md = smf.mixedlm('A~B-1',data=df,groups=df['groupid'])
mdf = md.fit()
print(mdf.random_effects)
A is resid from the earlier example, while groupid is the same.
1) I am not sure whether the mdf.random_effects are the random intercepts I am looking for
2) I cannot remove the variable B, which I understand is the fixed effects part. If I try:
md = smf.mixedlm('A~-1',data=df,groups=df['groupid'])
I get an error that "Arrays cannot be empty".
Just trying to estimate the exact same model as in the R code. Any advice will be appreciated.
I have fitted a logisitic regression model to some data, everything is working great. I need to calculate the wald statistic which is a function of the model result.
My problem is that I do not understand, from the documentation, what the wald test requires as input? Specifically what is the R matrix and how is it generated?
I tried simply inputting the data I used to train and test the model as the R matrix, but I do not think this is correct. The documentation suggest examining the examples however none give an example of this test. I also asked the same question on crossvalidated but got shot down.
Kind regards
http://statsmodels.sourceforge.net/0.6.0/generated/statsmodels.discrete.discrete_model.LogitResults.wald_test.html#statsmodels.discrete.discrete_model.LogitResults.wald_test
The Wald test is used to test if a predictor is significant or not, of the form:
W = (beta_hat - beta_0) / SE(beta_hat) ~ N(0,1)
So somehow you'll want to input the predictors into the test. Judging from the example of the t.test and f.test, it may be simpler to input a string or tuple to indicate what you are testing.
Here is their example using a string for the f.test:
from statsmodels.datasets import longley
from statsmodels.formula.api import ols
dta = longley.load_pandas().data
formula = 'TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR'
results = ols(formula, dta).fit()
hypotheses = '(GNPDEFL = GNP), (UNEMP = 2), (YEAR/1829 = 1)'
f_test = results.f_test(hypotheses)
print(f_test)
And here is their example using a tuple:
import numpy as np
import statsmodels.api as sm
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
results = sm.OLS(data.endog, data.exog).fit()
r = np.zeros_like(results.params)
r[5:] = [1,-1]
T_test = results.t_test(r)
If you're still struggling getting the wald test to work, include your code and I can try to help make it work.
Can anyone explain to me the difference between ols in statsmodel.formula.api versus ols in statsmodel.api?
Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodel.api method returns a different parameter for TV from the statsmodel.formula.api and the scikit-learn methods.
What kind of ols algorithm is statsmodel.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Came across this issue today and wanted to elaborate on #stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string-formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formulas.api and plain statsmodels.api. #Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formulas.api and .api will require a user-added column vector of 1s if not using R-style string formulas.
# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
The difference is due to the presence of intercept or not:
in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept in fitted
in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api
x1 = sm.add_constant(x1)
I had a similar issue with the Logit function.
(I used patsy to create my matrices, so the intercept was there.)
My sm.logit was not converging.
My sm.formula.logit was converging however.
Data going in was exactly the same.
I changed the solver method to 'newton' and the sm.logit converged also.
Is it possible the two versions have different default solver methods.
I have a question regarding regression in python. To make a long story short, I need to find a model of form yt = mt + st where mt and st are trends and seasonal component respectively. In my earlier analysis, I have found that a good model for mt is a quadratic trend of type mt = a0 + a1*t + a2*t^2
through my regression analysis. Now, when I want to add the seasonal component, this is where I am having the hardest time. Now, I approached this two ways...one is through R programming where I am calling R objects into python and the other through python solely. Now, following the example in my book, I did the folliwng using R:
%load_ext rmagic
import rpy2.robjects as R
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
stats = importr('stats')
r_df = com.convert_to_r_dataframe(pd.DataFrame(data.logTotal))
%Rpush r_df
%R ss = as.factor(rep(1:12,length(r_df$logTotal)/12))
%R tt = 1:length(r_df$logTotal)
%R tt2 = cbind(tt,tt^2)
%R ts_model = lm(r_df$logTotal ~ tt2+ss-1)
%R print(summary(ts_model))
I get the right regression coefficients. But, if i do the same thing in python, this is where I am getting problem replicating it.
import statsmodels.formula.api as smf
ss_temp= pd.Categorical.from_array(np.repeat(np.arange(1,13),len(data.logTotal)/12))
dtemp = np.column_stack((t,t**2,data.logTotal))
dtemp = pd.DataFrame(dtemp,columns=['t','tsqr','logTotal'])
dtemp['ss'] = sstemp
res_result = smf.ols(formula='logTotal ~ t+tsqr + C(ss) -1',data=dtemp).fit()
res_result.params
What am i doing wrong here? I first get an error saying 'data type not found' which points to the res_result formula. So, then I tried changing ss_temp to a Series. Then, the above statements worked. However, my parameters were completely off when compared to the R output. I have been spending a day on this with no avail. Can someone please help me or guide me as to do or is there an python equivalent to as.factor in R? I assumed that was categorical in pandas.
Thanks
If the above is too hard, its fine. I still have the residual model from my regression in R. But, any ideas how to convert this to a python equivalent to what statsmodels interprets as a res from regression? thanks again