I'm attempting to translate R code into Python and running into trouble trying to replicate the R lm{stats} function which contains 'weights', allowing for weights to be used in the fitting process.
My ultimate goal is to simply run a weighted linear regression in Python using the statsmodels library.
Searching through the Statsmodels issues I've located caseweights in linear models #743 and SUMM/ENH rare events, unbalanced sample, matching, weights #2701 which make me think this may not be possible with Statsmodels.
Is it possible to add weights to GLM models in Statsmodels or alternatively, is there a better way to run a weighted linear regression in python?
WLS has weights for the linear model, where weights are interpreted as inverse variance for the result statistics.
http://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.WLS.html
The unreleased version of statsmodels has frequency weights for GLM, but no variance weights.
see freq_weights in http://www.statsmodels.org/dev/generated/statsmodels.genmod.generalized_linear_model.GLM.html
(There are many open issues to expand the types of weights and adding weights to other models, but those are not available yet.)
Related
Is there some python packages that helps to do statistical linear regression? For example, I hope such program could do something like automatically performing different types of statistical tests (t-test, F-test etc.) and then automatically removes redundant variable etc., correct for heteroskedasticity etc.. Or is LASSO just the best?
You can perform and visualize linear regression in Python with a wide array of packages like:
scipy, statsmodels and seaborn. LASSO is available through statsmodels as described here. When it comes to automated approches to linear regresssion analysis you could start with Forward Selection with statsmodels that was described in an answer in the post Stepwise Regression in Python.
I have quickly looked for Distributed Lag Model in StatsModels but can't find one. The one that is similar is VAR model. Can I transform VAR model to Distributed Lag Model and how? It will be great if there are already other packages which have Distributed Lag Model. Please let me know if so.
Thanks!
If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.
The target variable that I need to predict are probabilities (as opposed to labels). The corresponding column in my training data are also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, sk-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikits-learn, or a suitable Python package that scales to 100K data points of 1K dimension.
I want the regressor to use the structure of the problem. One such
structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in API. It is a scikit-learn's limitation.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as target by oversampling instances according to probabilities of their labels. e.g. if for given sample class_1 has probability 0.2, and class_2 has probability0.8, then generate 10 training instances (copied sample): 8 withclass_2as "ground truth target label" and 2 withclass_1`.
Obviously it is workaround and is not extremely efficient, but it should work properly.
If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.
Alternative solution is to implement (or find implementation?) of LogisticRegression in Tensorflow, where you can define loss function as you like it.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it is takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit (a_0 + a_1x_1 + a_2x_2 + ...)
where inverse-logit (z) = exp^(z) / (1 + exp^z)
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of y_t*log(y_o) + (1-y_t)*log(1 - y_o)
One way to do this is just to fork sci-kit learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by #jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with sigmoid kernel. You can use sci-kit learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use a sci-kit-learn's RBFsampler to explicitly construct an approximate rbf-kernel for your dataset.
Process your data through that kernel and then use sci-kit-learn's SGDRegressor with a hinge loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here
Instead of using predict in the scikit learn library use predict_proba function
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
I would like to fit a generalized linear model with negative binomial link function and L1 regularization (lasso) in python.
Matlab provides the nice function :
lassoglm(X,y, distr)
where distr can be poisson, binomial etc.
I had a look at both statmodels and scikit-learn but I did not find any ready to use function or example that could direct me towards a solution.
In matlab it seems they minimize this:
min (1/N * Deviance(β0,β) + λ * sum(abs(β)) )
where deviance depends on the link function.
Is there a way to implement this easily with scikit or statsmodels or I should go for cvxopt?
statsmodels has had for some time a fit_regularized for the discrete models including NegativeBinomial.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html
which doesn't have the docstring (I just saw). The docstring for Poisson has the same information http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html
and there should be some examples available in the documentation or unit tests.
It uses an interior algorithm with either scipy slsqp or optionally, if installed, cvxopt. Compared to steepest descend or coordinate descend methods, this is only appropriate for cases where the number of features/explanatory variables is not too large.
Coordinate descend with elastic net for GLM is in a work in progress pull request and will most likely be available in statsmodels 0.8.
I am going through the tutorial about Monte Carlo Markov Chain process with pymc library. I am also a newbie using pymc and try to establish my own MCMC process. I have faced couple of question that I couldn't find proper answer in pymc tutorial:
First: How could we define priors with pymc and then marginalise over the priors in the chain process?
My second question is about Dirichlet distribution , how is this distribution related to the prior information in MCMC and how should it be defined?
I recommend following the PyMC user's guide. It explicitly shows you how to specify your model (including priors). With MCMC, you end up getting marginals of all posterior values, so you don't need to know how to marginalize over priors.
The Dirichlet is often used as a prior to multinomial probabilities in Bayesian models. The values of the Dirichlet parameters can be used to encode prior information, typically in terms of a notional number of prior events corresponding to each element of the multinomial. For example, a Dirichlet with a vector of ones as the parameters is just a generalization of a Beta(1,1) prior to multinomial quantities.