I'm optimizing a logistic regression in sklearn through repeated kfold cross validation. I want to checkout the confidence interval and, based on other stack exchange answers, it seems easier to get that info from statsmodels.
Coming from a basis in sklearn though, statsmodels is opaque. How can I translate the optimized settings for my logistic regression into statsmodels. Things like the L2 penalty, C value, intercept, etc.?
I've done some research and it looks like statsmodel support L2 indirectly through GLM Binomials. The C value will need to be converted from C into alpha (whatever that means), and I have only a vague idea of how to specify the intercept (it looks like it has something to do with the add_constant function).
Can someone drop an example of how to do this kind of translation into statsmodels? I'm sure once I see it done, a lot of it will naturally fall into place in my head.
Related
my inputs are feature vectors, and my outputs/targets are floats between 0 and 1 for training.
I am familiar that XGBRFRegressor and XGBRegressor are capable of this, but I am seeking logistic regression equivalent rather than using gradient boosting. I guess I could manually do the log transformations turning it into a standard linear regression problem, but seeking to see if there is already a solution for this.
Thanks,
Will
I solved this by transforming to linear scale then training using any regression method, and then transforming back with a log transformation.
see an example at how to use sklearn when target variable is a proportion
I tried to do the univariate analysis (binary logistic regression, one feature each time) in Python with statsmodel to calculate the p-value for a different feature.
for f_col in f_cols:
model = sm.Logit(y,df[f_col].astype(float))
result = model.fit()
features.append(str(result.pvalues).split(' ')[0])
pvals.append(str(result.pvalues).split(' ')[1].split('\n')[0])
df_pvals = pd.DataFrame(list(zip(features, pvals)),
columns =['features', 'pvals'])
df_pvals
However, the result in the SPSS is different. The p-value of NYHA in the sm.Logit method is 0. And all of the p-values are different.
Is it right to use sm.Logit in the statsmodel to do the binary logistic regression?
Why there is a difference between the results? Probably sm.Logit use L1 regularization?
How should I get the same?
Many thanks!
SPSS regression modeling procedures include constant or intercept terms automatically, unless they're told not to do so. As Josef mentions, statsmodels appears to require you to explicitly add an intercept.
The target variable that I need to predict are probabilities (as opposed to labels). The corresponding column in my training data are also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, sk-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikits-learn, or a suitable Python package that scales to 100K data points of 1K dimension.
I want the regressor to use the structure of the problem. One such
structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in API. It is a scikit-learn's limitation.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as target by oversampling instances according to probabilities of their labels. e.g. if for given sample class_1 has probability 0.2, and class_2 has probability0.8, then generate 10 training instances (copied sample): 8 withclass_2as "ground truth target label" and 2 withclass_1`.
Obviously it is workaround and is not extremely efficient, but it should work properly.
If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.
Alternative solution is to implement (or find implementation?) of LogisticRegression in Tensorflow, where you can define loss function as you like it.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it is takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit (a_0 + a_1x_1 + a_2x_2 + ...)
where inverse-logit (z) = exp^(z) / (1 + exp^z)
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of y_t*log(y_o) + (1-y_t)*log(1 - y_o)
One way to do this is just to fork sci-kit learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by #jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with sigmoid kernel. You can use sci-kit learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use a sci-kit-learn's RBFsampler to explicitly construct an approximate rbf-kernel for your dataset.
Process your data through that kernel and then use sci-kit-learn's SGDRegressor with a hinge loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here
Instead of using predict in the scikit learn library use predict_proba function
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
I am trying out logistic regression from scratch in python.(through finding probability estimates,cost function,applying gradient descent for increasing the maximum likelihood).But I have a confusion regarding which estimates should I take for the first iteration process.I took all the estimates as 0(including the intercept).But the results are different from that we get in Scikit-learn.I want to know which are the initial estimates taken in Scikit-learn for logistic regression?
First of all scikit learn's LogsiticRegerssion uses regularization. So unless you apply that too , it is unlikely you will get exactly the same estimates. if you really want to test your method versus scikit's , it is better to use their gradient decent implementation of Logistic regersion which is called SGDClassifier . Make certain you put loss='log' for logistic regression and set alpha=0 to remove regularization, but again you will need to adjust the iterations and eta as their implementation is likely to be slightly different than yours.
To answer specifically about the initial estimates, I don't think it matters, but most commonly you set everything to 0 (including the intercept) and should converge just fine.
Also bear in mind GD (gradient Decent) models are hard to tune sometimes and you may need to apply some scaling(like StandardScaler) to your data beforehand as very high values are very likely to drive your gradient out of its slope. Scikit's implementation adjusts for that.
In python sklearn library, both RandomizedLogisticRegression and RandomizedLasso are supported as feature selection methods.
However, they are all using L1(Lasso) penalty, and I am not sure why both of them are implemented. In fact, I though that Lasso regression is the other term of L1-regularized logistic regression, but maybe there seems to be some difference.
I think even Linear SVM with L1 penalty(combined with resampling) will also produce the similar result.
Are there significant difference among them?
From: http://scikit-learn.org/stable/modules/feature_selection.html#randomized-l1
RandomizedLasso implements this strategy for regression settings, using the Lasso, while RandomizedLogisticRegression uses the logistic regression and is suitable for classification tasks. To get a full path of stability scores you can use lasso_stability_path.
RandomizedLasso is used for regression in which the outcome is continuous. RandomizedLogisticRegression on the other hand is for classification in which the outcome is a class label.