I posted an IPython notebook here: http://nbviewer.ipython.org/gist/dartdog/9008026
In it I worked through a standard statsmodels OLS fit and then a similar model with PyMC3, with the data provided via pandas; that part works great, by the way.
I can't see how to get the more standard parameters out of PyMC3. The examples seem to just use OLS to plot the base regression line. It seems that the PyMC3 trace should be able to give the parameters for the regression line itself, in addition to the individual probable traces, i.e. what is the highest-probability line?
Any further explanation of how to interpret alpha, beta and sigma is welcome!
Also, how do I use the PyMC3 model to estimate a future value of y given a new x, i.e. prediction with some probability?
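Roughly what I'm hoping to be able to do is something like this (just a sketch; alpha, beta and sigma are the variable names from my model, and trace is what comes out of pm.sample()):
a_hat = trace['alpha'].mean()      # point estimate (posterior mean) of the intercept
b_hat = trace['beta'].mean()       # point estimate of the slope
s_hat = trace['sigma'].mean()      # point estimate of the noise standard deviation

x_new = 5.0                        # some new x value (made up)
y_new = a_hat + b_hat * x_new      # the "most probable" line evaluated at x_new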
And lastly, PyMC3 has a newish GLM wrapper which I tried, and it seemed to get messed up (it could well be me, though).
The glm submodule sets some default priors which might very well not be appropriate for every case, and yours is one of them. You can change them by using the family argument, e.g.:
pm.glm.glm('y ~ x', data,
           family=pm.glm.families.Normal(
               priors={'sd': ('sigma', pm.Uniform.dist(0, 12000))}))
Unfortunately this isn't very well documented yet and requires some good examples.
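For context, here is a rough sketch of how that call sits inside a model (assuming the same pm.glm.glm API as above and a pandas DataFrame named data with columns x and y; the exact variable names in the trace may differ):
import pymc3 as pm

with pm.Model():
    pm.glm.glm('y ~ x', data,
               family=pm.glm.families.Normal(
                   priors={'sd': ('sigma', pm.Uniform.dist(0, 12000))}))
    trace = pm.sample(2000)

pm.summary(trace)   # posterior means and intervals for the intercept, slope and noise sd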
I am currently running a linear regression on my time-series data set. However, depending on which Python module I use, I get completely different results.
First I used sklearn, and my model had an R^2 score of about 0.65. After that I tried statsmodels.api, to get the summary of the regression since sklearn doesn't provide one, and I got a completely different R^2 score of 0.96.
After this, I used the linear model from statsmodels.formula.api and got yet another result, this time closer to my first one (R^2 of 0.65).
I want to know why this happens. It seems like a mistake on my part, but I am pretty sure I am using the same data for all of the regressions (converting the data frame to np.arrays where necessary). Can such large differences really come from differences in how the modules are implemented?
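For reference, this is roughly the pattern I am following, here on made-up data since I can't share mine (the column names are placeholders):
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# toy stand-in for my data set
rng = np.random.RandomState(0)
df = pd.DataFrame({'x1': rng.randn(100), 'x2': rng.randn(100)})
df['y'] = 3 + 2 * df['x1'] - df['x2'] + rng.randn(100)
X, y = df[['x1', 'x2']].values, df['y'].values

print(LinearRegression().fit(X, y).score(X, y))        # sklearn (fits an intercept by default)
print(sm.OLS(y, X).fit().rsquared)                     # statsmodels.api on the raw arrays (no constant column added)
print(smf.ols('y ~ x1 + x2', data=df).fit().rsquared)  # statsmodels.formula.api (the formula adds an intercept)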
Thank you for taking the time to read this.
Are there some Python packages that help with statistical linear regression? For example, I would hope such a program could automatically perform different types of statistical tests (t-test, F-test, etc.), automatically remove redundant variables, correct for heteroskedasticity, and so on. Or is LASSO just the best?
You can perform and visualize linear regression in Python with a wide array of packages like scipy, statsmodels and seaborn. LASSO is available through statsmodels as described here. When it comes to automated approaches to linear regression analysis, you could start with Forward Selection with statsmodels, which was described in an answer to the post Stepwise Regression in Python.
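As a quick illustration of the statsmodels side (a sketch on made-up data; the variable names are placeholders): the OLS summary already gives you the per-coefficient t-tests and the overall F-test, and fit_regularized with L1_wt=1.0 gives a LASSO-style fit:
import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
X = sm.add_constant(rng.randn(200, 3))     # intercept plus three made-up regressors
y = X @ [1.0, 2.0, 0.0, -1.5] + rng.randn(200)

ols_res = sm.OLS(y, X).fit()
print(ols_res.summary())                   # t-tests per coefficient and the overall F-test

# LASSO-style fit: pure L1 penalty (L1_wt=1.0) with penalty strength alpha
lasso_res = sm.OLS(y, X).fit_regularized(alpha=0.1, L1_wt=1.0)
print(lasso_res.params)                    # some coefficients shrunk towards (or to) zero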
I am using statsmodels OLS to run some linear regressions on my data. My problem is that I would like the coefficients to add up to 1 (I plan not to use a constant parameter).
Is it possible to specify at least one constraint on the coefficients in statsmodels OLS? I see no option to do so.
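The only workaround I can think of is re-parameterizing the model myself, something like the sketch below on made-up data, but I'd much rather use a built-in option if one exists:
import numpy as np
import statsmodels.api as sm

# If y = b1*x1 + ... + bk*xk with b1 + ... + bk = 1, then
# y - xk = b1*(x1 - xk) + ... + b_{k-1}*(x_{k-1} - xk),
# so I can regress (y - xk) on differenced columns and recover bk = 1 - sum(of the others).
rng = np.random.RandomState(0)             # made-up data just to show the shape of the trick
X = rng.randn(100, 3)
y = X @ [0.2, 0.5, 0.3] + 0.01 * rng.randn(100)

Z = X[:, :-1] - X[:, [-1]]                 # differenced regressors
res = sm.OLS(y - X[:, -1], Z).fit()
b = np.append(res.params, 1 - res.params.sum())
print(b, b.sum())                          # coefficients sum to 1 by construction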
Thanks in advance.
Coder.
I would like to fit a generalized linear model with a negative binomial distribution and L1 regularization (lasso) in Python.
Matlab provides the nice function:
lassoglm(X, y, distr)
where distr can be poisson, binomial etc.
I had a look at both statsmodels and scikit-learn, but I did not find any ready-to-use function or example that could point me towards a solution.
In Matlab it seems they minimize this:
min (1/N * Deviance(β0, β) + λ * sum(abs(β)))
where the deviance depends on the link function.
Is there a way to implement this easily with scikit-learn or statsmodels, or should I go for cvxopt?
statsmodels has had a fit_regularized method for the discrete models, including NegativeBinomial, for some time:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html
which doesn't have a docstring (I just noticed). The docstring for Poisson has the same information: http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html
and there should be some examples available in the documentation or unit tests.
It uses an interior-point algorithm with either scipy's slsqp or, optionally, cvxopt if it is installed. Compared to steepest descent or coordinate descent methods, this is only appropriate for cases where the number of features/explanatory variables is not too large.
Coordinate descent with elastic net for GLM is in a work-in-progress pull request and will most likely be available in statsmodels 0.8.
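A minimal sketch of what the call looks like (made-up count data; note that the alpha argument here is the L1 penalty weight, not the negative binomial dispersion parameter):
import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)                 # made-up count data
X = sm.add_constant(rng.randn(500, 4))
y = rng.poisson(np.exp(X @ [0.5, 1.0, 0.0, 0.0, -0.5]))

model = sm.NegativeBinomial(y, X)
# alpha is the L1 penalty weight, either a scalar or one weight per parameter
# (an array lets you leave the dispersion parameter unpenalized);
# method='l1' uses scipy's slsqp, method='l1_cvxopt_cp' uses cvxopt if installed
result = model.fit_regularized(method='l1', alpha=10.0)
print(result.params)                           # penalized coefficients, some driven to zero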
I have a basic linear regression with 80 numerical variables (no categorical variables). The training set has 1600 rows, the test set 700.
I would like a Python package that iterates through column combinations and finds the one with the best score, using either a custom score function or an out-of-the-box criterion like AIC (see the sketch at the end of this question for the brute-force version I want to avoid writing).
OR
If that doesn't exist, what do people here use for variable selection? I know R has some packages for this, but I don't want to deal with Rpy2.
I have no preference whether the solution uses scikit-learn, numpy, pandas, statsmodels, or something else.
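To be concrete, this is roughly the brute-force search I would rather not hand-roll (a sketch using statsmodels' AIC; train and the target column name are placeholders, and exhaustively checking every subset of 80 columns is of course not feasible, hence the question):
import itertools
import statsmodels.api as sm

def best_subset_by_aic(df, target, max_size=3):
    """Exhaustively score column subsets by AIC (only workable for small max_size)."""
    y = df[target]
    candidates = [c for c in df.columns if c != target]
    best = (float('inf'), None)
    for k in range(1, max_size + 1):
        for cols in itertools.combinations(candidates, k):
            X = sm.add_constant(df[list(cols)])
            best = min(best, (sm.OLS(y, X).fit().aic, cols))
    return best

# best_aic, best_cols = best_subset_by_aic(train, 'y', max_size=2)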
I can suggest using the Least Absolute Shrinkage and Selection Operator (Lasso). I haven't used it in a situation quite like yours, where you have to deal with so much data.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
I often write code to do linear regression with statsmodels like below:
import statsmodels.api as sm
model = sm.OLS(train_Y, train_X)   # endog (y) comes first, exog (X) second
results = model.fit()
If I want to do Lasso regression, I write code like below:
from sklearn import linear_model
model = linear_model.Lasso(alpha=1.0)   # alpha=1.0 is the default
results = model.fit(train_X, train_Y)
You have to decide on an appropriate alpha: it is a non-negative regularization strength, and the larger it is, the more coefficients are driven to exactly zero, which is what does the variable selection.
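If you don't want to pick alpha by hand, scikit-learn's LassoCV can choose it by cross-validation (a small sketch using the same train_X / train_Y as above):
from sklearn.linear_model import LassoCV

model = LassoCV(cv=5).fit(train_X, train_Y)
print(model.alpha_)                         # alpha chosen by cross-validation
print((model.coef_ != 0).sum(), 'variables kept')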
Try this.