Are there any Python packages that help with statistical linear regression? For example, I'd hope such a program could automatically perform different types of statistical tests (t-test, F-test, etc.), automatically remove redundant variables, correct for heteroskedasticity, and so on. Or is LASSO just the best?
You can perform and visualize linear regression in Python with a wide array of packages, such as scipy, statsmodels, and seaborn. LASSO is available through statsmodels as described here. When it comes to automated approaches to linear regression analysis, you could start with the forward selection with statsmodels that was described in an answer to the post Stepwise Regression in Python.
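As a rough sketch of that approach (forward_selected is an illustrative helper, not a statsmodels API, using adjusted R-squared as one possible selection criterion, in the spirit of that answer):

import statsmodels.formula.api as smf

def forward_selected(data, response):
    """Greedily add the predictor that most improves adjusted R-squared.

    data: pandas DataFrame whose column names are valid formula terms;
    response: name of the response column.
    """
    remaining = set(data.columns) - {response}
    selected = []
    best_score = float("-inf")
    while remaining:
        scored = []
        for candidate in remaining:
            rhs = " + ".join(selected + [candidate])
            fit = smf.ols(f"{response} ~ {rhs}", data).fit()
            scored.append((fit.rsquared_adj, candidate))
        score, candidate = max(scored)
        if score <= best_score:
            break
        remaining.remove(candidate)
        selected.append(candidate)
        best_score = score
    rhs = " + ".join(selected) or "1"
    return smf.ols(f"{response} ~ {rhs}", data).fit()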
I'm optimizing a logistic regression in sklearn through repeated k-fold cross-validation. I want to check out the confidence intervals and, based on other Stack Exchange answers, it seems easier to get that info from statsmodels.
Coming from an sklearn background, though, statsmodels is opaque. How can I translate the optimized settings for my logistic regression into statsmodels: things like the L2 penalty, C value, intercept, etc.?
I've done some research and it looks like statsmodels supports L2 indirectly through GLM with a Binomial family. The C value will need to be converted from C into alpha (whatever that means), and I have only a vague idea of how to specify the intercept (it looks like it has something to do with the add_constant function).
Can someone drop an example of how to do this kind of translation into statsmodels? I'm sure once I see it done, a lot of it will naturally fall into place in my head.
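A rough sketch of one possible translation, assuming statsmodels' elastic-net fit_regularized for GLM and an approximate mapping alpha ≈ 1/(C · n_samples), which you should verify on your own data; note that penalized fits in statsmodels generally don't report standard errors, so for conf_int() you'd refit without the penalty:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # stand-in for your features
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200) > 0).astype(int)

C = 1.0                                          # value tuned in sklearn
alpha = 1.0 / (C * X.shape[0])                   # assumed C -> alpha mapping

X_sm = sm.add_constant(X)                        # sklearn adds the intercept implicitly
model = sm.GLM(y, X_sm, family=sm.families.Binomial())
res = model.fit_regularized(method="elastic_net", alpha=alpha, L1_wt=0.0)
print(res.params)                                # [intercept, coef_1, coef_2, coef_3]

# For confidence intervals, refit without the penalty:
print(sm.GLM(y, X_sm, family=sm.families.Binomial()).fit().conf_int())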
I need to use KNIME for regression analysis. I am a Python user; I know KNIME as well, but not in depth!
I usually use statsmodels in Python for regression analysis and for working on statistical models.
However, for solving a regression problem as a machine learning problem, I use sklearn's regression models. Each of these Python packages has its own benefits depending on your task, and also a different view of the output, which is really important for addressing the problem in the right way.
Here is my question: does KNIME offer any special package for statistical models? If I plan to do a regression analysis, which nodes are recommended?
Many thanks for your help
There's a Linear Regression Learner node under Analytics > Mining > Linear/Polynomial Regression in the node repository. Does that do what you need?
I'm attempting to translate R code into Python and running into trouble trying to replicate the R lm (stats) function, which has a 'weights' argument allowing weights to be used in the fitting process.
My ultimate goal is to simply run a weighted linear regression in Python using the statsmodels library.
Searching through the statsmodels issues, I've located "caseweights in linear models #743" and "SUMM/ENH rare events, unbalanced sample, matching, weights #2701", which make me think this may not be possible with statsmodels.
Is it possible to add weights to GLM models in statsmodels, or alternatively, is there a better way to run a weighted linear regression in Python?
WLS has weights for the linear model, where the weights are interpreted as inverse variances for the result statistics.
http://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.WLS.html
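For example, a minimal WLS sketch on synthetic heteroskedastic data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
sigma = 0.5 + 0.2 * x                        # noise level grows with x
y = 2.0 + 3.0 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)
res = sm.WLS(y, X, weights=1.0 / sigma**2).fit()   # weights = inverse variances
print(res.params)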
The unreleased version of statsmodels has frequency weights for GLM, but no variance weights.
see freq_weights in http://www.statsmodels.org/dev/generated/statsmodels.genmod.generalized_linear_model.GLM.html
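Assuming a version that includes freq_weights, usage would look roughly like:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.3])))
w = rng.integers(1, 5, size=100)             # each row counted w[i] times

res = sm.GLM(y, X, family=sm.families.Poisson(), freq_weights=w).fit()
print(res.params)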
(There are many open issues to expand the types of weights and adding weights to other models, but those are not available yet.)
I would like to fit a generalized linear model with negative binomial link function and L1 regularization (lasso) in Python.
MATLAB provides the nice function:
lassoglm(X, y, distr)
where distr can be poisson, binomial, etc.
I had a look at both statsmodels and scikit-learn, but I did not find any ready-to-use function or example that could direct me towards a solution.
In MATLAB it seems they minimize this:
min (1/N * Deviance(β0,β) + λ * sum(abs(β)) )
where deviance depends on the link function.
Is there a way to implement this easily with scikit-learn or statsmodels, or should I go for cvxopt?
statsmodels has had a fit_regularized method for the discrete models, including NegativeBinomial, for some time:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html
which doesn't have the docstring (I just noticed). The docstring for Poisson has the same information: http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html
and there should be some examples available in the documentation or unit tests.
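A minimal sketch of the L1-penalized negative binomial fit on synthetic data (the penalty weight alpha=1.0 is just illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 4)))
mu = np.exp(X @ np.array([0.5, 1.0, -1.0, 0.0, 0.0]))
y = rng.negative_binomial(5, 5.0 / (5.0 + mu))

model = sm.NegativeBinomial(y, X)
# method="l1" uses scipy's slsqp; "l1_cvxopt_cp" uses cvxopt if installed
res = model.fit_regularized(method="l1", alpha=1.0)
print(res.params)                 # last entry is the dispersion parameter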
It uses an interior-point algorithm with either scipy's slsqp or, optionally, cvxopt if it is installed. Compared to steepest descent or coordinate descent methods, this is only appropriate for cases where the number of features/explanatory variables is not too large.
Coordinate descent with elastic net for GLM is in a work-in-progress pull request and will most likely be available in statsmodels 0.8.
I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.
In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.
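For reference, the loop looks roughly like this (drop_high_vif is just an illustrative helper around statsmodels' variance_inflation_factor; the threshold is arbitrary):

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=10.0):
    """Iteratively drop the column with the largest VIF above the threshold."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        vifs = [variance_inflation_factor(X[:, cols], i) for i in range(len(cols))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        del cols[worst]
    return cols   # indices of the columns to keep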
What would be the quickest implementation of OLS for larger datasets? The statsmodels OLS implementation seems to use numpy to invert matrices. Would a gradient-descent-based method be quicker? Does scikit-learn have an especially quick implementation?
Or maybe an MCMC-based approach using PyMC is quickest...
Update 1: It seems that the scikit-learn implementation of LinearRegression is a wrapper for the scipy implementation.
Update 2: Scipy OLS via scikit-learn's LinearRegression is twice as fast as statsmodels OLS in my very limited tests...
The scikit-learn SGDRegressor class is (IIRC) the fastest, but it would probably be more difficult to tune than a simple LinearRegression.
I would give each of those a try and see if they meet your needs. I also recommend subsampling your data: if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space without wasting a bunch of time on "repeat/uninteresting" data.
Once you find a few candidate models, you can try those on the whole dataset.
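A rough sketch of that workflow (sizes and hyperparameters are placeholders):

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 20))
y = X @ rng.normal(size=20) + rng.normal(size=1_000_000)

# Subsample for fast model exploration (assumes rows are i.i.d.)
idx = rng.choice(len(X), size=10_000, replace=False)
Xs, ys = X[idx], y[idx]

lr = LinearRegression().fit(Xs, ys)
sgd = SGDRegressor(max_iter=1000, tol=1e-4).fit(Xs, ys)
print(lr.coef_[:3])
print(sgd.coef_[:3])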
Stepwise methods are not a good way to perform model selection, as they are entirely ad hoc and depend highly on the direction in which you run the stepwise procedure. It's far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is to use reversible-jump MCMC, which fits models over the entire model space, not just the parameter space of a particular model.
PyMC does not implement rjMCMC itself, but it can be implemented. Note also that PyMC3 makes it really easy to fit regression models using its new glm submodule.
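Assuming the PyMC3-era API (later versions moved formula-based GLMs out of PyMC itself), usage looks roughly like:

import numpy as np
import pandas as pd
import pymc3 as pm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=200)

with pm.Model():
    pm.glm.GLM.from_formula("y ~ x", df)   # normal likelihood by default
    trace = pm.sample(1000, tune=1000)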