I'm pretty new to Python and have just started a class on regression and data modeling. I ran a logistic regression and my pseudo R-squared is infinity. I tried googling without much success; can anyone help me interpret what this means theoretically? Is it similar to an R-squared of 1 in linear regression, meaning that my model is hypothetically perfect but in reality probably not great?
I've included the top portion of my logit table below.
I'm optimizing a logistic regression in sklearn through repeated k-fold cross-validation. I want to check out the confidence intervals and, based on other Stack Exchange answers, it seems easier to get that info from statsmodels.
Coming from an sklearn background, though, statsmodels is opaque to me. How can I translate the optimized settings for my logistic regression into statsmodels: things like the L2 penalty, the C value, the intercept, etc.?
I've done some research and it looks like statsmodels supports L2 indirectly through a GLM with a Binomial family. The C value will need to be converted from C into alpha (whatever that means), and I have only a vague idea of how to specify the intercept (it looks like it has something to do with the add_constant function).
Can someone drop an example of how to do this kind of translation into statsmodels? I'm sure once I see it done, a lot of it will naturally fall into place in my head.
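For concreteness, here is the rough sketch I've pieced together so far. The data and the C value are placeholders, and the C-to-alpha conversion is my own guess (sklearn applies 1/C to the summed log-loss, while GLM.fit_regularized scales its penalty per observation), so it would need to be verified against the sklearn coefficients:

import numpy as np
import statsmodels.api as sm

# Placeholder data and C value; swap in your own X, y, and tuned C.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)
C = 1.0

# Guess at the conversion: alpha ~ 1 / (C * n_samples). Needs checking
# against the coefficients from the sklearn fit.
alpha = 1.0 / (C * X.shape[0])

# add_constant appends an intercept column, which seems to mirror fit_intercept=True.
X_sm = sm.add_constant(X)

model = sm.GLM(y, X_sm, family=sm.families.Binomial())
result = model.fit_regularized(alpha=alpha, L1_wt=0.0)  # L1_wt=0 -> pure L2 penalty
print(result.params)  # first entry should be the intercept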
I have two features, say F1 and F2, which have a correlation of about 0.9.
When I built my model, I first considered all the features for my regression model. Once I had my model, I then ran Lasso regression on it, with the hope that this would tackle any collinearity between the features. However, the Lasso regression kept both F1 and F2 in my model.
Two questions:
i) If F1 and F2 are highly correlated, but Lasso regression still kept both of them, what could this mean? Does it mean regularization doesn't work in some cases?
ii) How do I adjust my model or the Lasso regression model to kick out F1 or F2 from my model? (I am using sklearn.linear_model.LogisticRegression, have set penalty='l1' or 'elasticnet', tried very large and very small C values, tried the 'liblinear' and 'saga' solvers, and l1_ratio=1, but I still can't kick out either F1 or F2 from my model.)
Answers to your questions:
i) Lasso reduces coefficients gradually. You may find a nice picture in books authored by Robert Tibshirani, the person behind the Lasso, showing how some coefficients gradually fall to zero as the regularization strength increases (you may perform such an experiment yourself; see the sketch after this answer). The fact that the model still keeps both may mean one of two things: either the model deems both important, or there is not enough regularization to kill one of them.
ii) You're right that with penalty='l1' you are doing Lasso-style regularization, and the knob is the C parameter. The way it's coded in sklearn, the smaller the C, the stronger the regularization (C is the inverse of the regularization strength). In machine learning, though, your task is not to totally exclude collinearity ("to kill F1 or F2", in your parlance) but to find a model (or a set of parameters, if you wish) that generalizes best. That is done through model tuning via cross-validation. Warning: higher regularization means more underfitting.
I would add, though, that collinearity is somewhat dangerous for linear regression because it may give rise to model instability (differing coefficients on different subsamples). So, with linear regression, you may wish to check for this too.
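Here is a minimal sketch of the experiment mentioned above, with made-up data in which F2 is a noisy copy of F1; the feature names, correlation level, and chosen C values are just illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: F2 is a noisy copy of F1 (correlation roughly 0.9), plus one irrelevant feature.
rng = np.random.default_rng(0)
F1 = rng.normal(size=500)
F2 = F1 + 0.5 * rng.normal(size=500)
noise = rng.normal(size=500)
X = np.column_stack([F1, F2, noise])
y = (F1 + rng.normal(size=500) > 0).astype(int)

# Smaller C = stronger L1 regularization; watch the coefficients shrink toward zero.
for C in [10.0, 1.0, 0.1, 0.01]:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    clf.fit(X, y)
    print(f"C={C:>5}: coef = {clf.coef_.ravel().round(3)}")
# Which of F1/F2 gets dropped first (if either) depends on the data;
# the Lasso gives no guarantee of discarding exactly one of a correlated pair.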
I am using LinearRegression() from sklearn to predict. I have created different features for X and am trying to understand how I can select the best features automatically. Let's say I have defined 50 different features for X and only one output for y. Is there a way to select the best-performing features automatically instead of doing it manually?
Also, I can get the RMSE using the following command:
scores = np.sqrt(-cross_val_score(lm, X, y, cv=20, scoring='neg_mean_squared_error')).mean()
From here, how can I use this RMSE score? I mean, do I have to make multiple predictions? How am I going to use this RMSE? There must be a way to predict() using some optimisation, but I couldn't find out how.
sklearn doesn't actually have a stepwise algorithm for assessing the importance of features. However, it does provide recursive feature elimination, which is a greedy feature-elimination algorithm similar to sequential backward selection.
See the documentation here:
Recursive Feature Elimination
Note that this will not necessarily reduce your RMSE. You might try other techniques such as Ridge and Lasso regression as well; a short sketch of RFE follows.
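A minimal sketch of recursive feature elimination wired into the cross-validated RMSE from the question; the synthetic data and the choice of 10 retained features are placeholder assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for your 50 features.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, noise=5.0, random_state=0)

lm = LinearRegression()
# Greedily eliminate features until 10 remain (an arbitrary choice here).
selector = RFE(estimator=lm, n_features_to_select=10)
X_selected = selector.fit_transform(X, y)

rmse = np.sqrt(-cross_val_score(lm, X_selected, y, cv=20, scoring='neg_mean_squared_error')).mean()
print("kept features:", np.flatnonzero(selector.support_))
print("cross-validated RMSE:", rmse)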
RMSE measures the average magnitude of the prediction error.
RMSE gives high weight to large errors, and lower values are always better. RMSE can be improved only if you have a decent model. For feature selection, you can use PCA, stepwise regression, or a basic correlation technique. If you see a lot of multicollinearity, then go for Lasso or Ridge regression. Also, make sure you have a decent split of test and train data; if you have bad testing data, you will get poor results. Also, check the training-data R-squared and testing-data R-squared to make sure the model doesn't overfit.
It would be helpful if you added information on the number of observations in your test and train data and the R-squared value. Hope this helps.
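A small sketch of the train/test R-squared comparison mentioned above, on placeholder data standing in for the 50 features from the question:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data; swap in your own X and y.
X, y = make_regression(n_samples=300, n_features=50, n_informative=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lm = LinearRegression().fit(X_train, y_train)
print("train R^2:", lm.score(X_train, y_train))
print("test  R^2:", lm.score(X_test, y_test))  # a much lower test score suggests over-fitting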
While implementing a linear regression model on a bag of words, Python returned some very large/small values. train_data_features contains all the words that occur in the training data. The training data consists of about 400 comments, each less than 500 characters, with a ranking between 0 and 5. Afterwards, I created a bag of words for each document. While trying to perform a linear regression on the matrix of all bags of words,
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(train_data_features, train['dim_hate'])
coef = clf.coef_
words = vectorizer.get_feature_names()
for i in range(len(words)):
    print(str(words[i]) + " " + str(coef[i]))
the results seem very strange (just an example of 3 out of 4000 below). It shows the coefficients of the fitted regression function for the words.
btw -0.297473967075
land 54662731702.0
landesrekord -483965045.253
I'm very confused, because the target variable is between 0 and 5, but the coefficients are so different. Most of them are very large/small numbers, and I was expecting only values like the one for btw.
Do you have an idea, why the results are like they are?
It might be that your model is overfitting to the data, since it's trying to match the outputs exactly. You're right to be worried and suspicious: it means that your model will probably not generalize well to new data. You can try one of two things:
Run LinearRegression(normalize=True) and see if it helps with the coefficients, but it will only be a temporary solution. (Note that the normalize argument has since been removed from recent versions of sklearn, so you may need to scale the features yourself.)
Use Ridge regression instead. It is basically linear regression, except it adds a penalty for coefficients that are too large.
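For instance, a minimal sketch reusing the variable names from the question (train_data_features and train['dim_hate'] are assumed to exist; alpha=1.0 is an arbitrary starting point):

from sklearn.linear_model import Ridge

# Same fit as before, but with an L2 penalty that keeps the coefficients small.
# alpha controls the penalty strength (larger alpha = more shrinkage).
clf = Ridge(alpha=1.0)
clf.fit(train_data_features, train['dim_hate'])
coef = clf.coef_
print(coef[:10])  # the coefficients should now stay in a sensible range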
Check for correlated features in your data-set.
You may run into this problem if your features are highly correlated. For example, expenses per customer:
jan_expenses, feb_expenses, mar_expenses, Q1_expenses
The Q1 feature is the sum of jan through mar, so when fitting, your coefficients will go 'crazy' as the model struggles to find a line that best describes both the monthly features and the quarterly feature. Try removing the highly correlated features and re-run.
(BTW, Ridge regression also solved the problem for me, but I was curious as to why this happens, so I dug in a bit.)
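A made-up illustration of that effect, using synthetic monthly columns (the feature names and noise levels are arbitrary assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression

# Three monthly features plus a quarterly feature that is (almost) their sum.
rng = np.random.default_rng(0)
jan, feb, mar = rng.normal(size=(3, 100))
q1 = jan + feb + mar + 1e-4 * rng.normal(size=100)   # near-exact linear dependence
X = np.column_stack([jan, feb, mar, q1])
y = jan + 2 * feb + 3 * mar + 0.1 * rng.normal(size=100)

# With the redundant column, the fit exploits tiny differences between q1 and the
# monthly sum, so the coefficients can land far from the values used to generate y.
print(LinearRegression().fit(X, y).coef_)
# Dropping the quarterly column stabilizes them.
print(LinearRegression().fit(X[:, :3], y).coef_)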
I am learning Logistic Regression from sklearn and came across this : http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
I have created an implementation which shows me the accuracy scores for training and testing. However, it is very unclear how this was achieved. My questions are: What is the maximum likelihood estimate? How is it being calculated? What is the error measure? What is the optimisation algorithm used?
I know all of the above in theory, but I am not sure where, when, and how scikit-learn calculates it, or whether it's something I need to implement at some point. I have an accuracy rate of 83%, which was what I was aiming for, but I am very confused about how this was achieved by scikit-learn.
Would anyone be able to point me in the right direction?
I recently started studying LR myself. I still don't get many steps of the derivation, but I think I understand which formulas are being used.
First of all let's assume that you are using the latest version of scikit-learn and that the solver being used is solver='lbfgs' (which is the default I believe).
The code is here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py
What is the Maximum likelihood estimate? How is this being calculated?
The function to compute the likelihood estimate is this one https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L57
The interesting line is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
which is formula 7 of this tutorial. The function also computes the gradient of the likelihood, which is then passed to the minimization function (see below). One important thing is that the intercept is the w0 of the formulas in the tutorial. But that's only valid if fit_intercept is True.
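To make that line concrete, here is a minimal re-implementation of the same penalized negative log-likelihood in plain numpy (the function name and the stable log-sigmoid via logaddexp are my own choices, not sklearn's code):

import numpy as np

def penalized_neg_log_likelihood(w, w0, X, y, alpha):
    # y must be in {-1, +1}; w are the feature weights, w0 the intercept,
    # alpha the regularization strength (1 / C in sklearn terms).
    yz = y * (X @ w + w0)                    # per-sample margins
    log_logistic_yz = -np.logaddexp(0, -yz)  # log of the logistic function, numerically stable
    return -np.sum(log_logistic_yz) + 0.5 * alpha * np.dot(w, w)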
What is the error measure?
I'm sorry I'm not sure.
What is the optimisation algorithm used?
See the following lines in the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L389
It's this function http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html
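As a rough illustration (not sklearn's actual call, which also supplies the analytic gradient), a loss like the one sketched above can be handed to that scipy routine like this, on a tiny made-up problem:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Tiny made-up binary problem with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + rng.normal(size=100) > 0, 1, -1)
alpha = 1.0

def loss(w):
    yz = y * (X @ w)
    return -np.sum(-np.logaddexp(0, -yz)) + 0.5 * alpha * np.dot(w, w)

# approx_grad=True lets scipy estimate the gradient numerically;
# sklearn passes the exact gradient instead.
w_opt, f_min, info = fmin_l_bfgs_b(loss, x0=np.zeros(2), approx_grad=True)
print(w_opt, f_min)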
One very important thing is that the classes are +1 or -1! (This is for the binary case; in the literature 0 and 1 are common, but that won't work here.)
Also notice that numpy broadcasting rules are used in all formulas (that's why you don't see explicit loops).
This was my attempt at understanding the code. I slowly went mad, to the point of ripping apart the scikit-learn code (it only works for the binary case). This also served as inspiration.
Hope it helps.
Check out Prof. Andrew Ng's machine learning notes on Logistic Regression (starting from page 16): http://cs229.stanford.edu/notes/cs229-notes1.pdf
In logistic regression you minimize the cross entropy (which in turn maximizes the likelihood of y given x). To do this, the gradient of the cross entropy (cost) function is computed and used to update the weights assigned to each input. In simple terms, logistic regression comes up with a line that best discriminates your two binary classes by changing its parameters so that the cross entropy keeps going down. The 83% accuracy (I'm not sure which accuracy that is; you should be dividing your data into training/validation/testing sets) means the line logistic regression is using for classification can correctly separate the classes 83% of the time.
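A bare-bones sketch of that update loop on made-up data (plain gradient descent rather than the solver sklearn actually uses; the learning rate and iteration count are arbitrary):

import numpy as np

# Made-up binary data with labels in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)         # gradient of the mean cross entropy
    grad_b = np.mean(p - y)
    w -= lr * grad_w                        # move the parameters downhill
    b -= lr * grad_b

accuracy = np.mean((p > 0.5) == y)
print(w, b, accuracy)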
I would have a look at the following on GitHub:
https://github.com/scikit-learn/scikit-learn/blob/965b109bf2ac3a61dcbd02bc29dd8c9598c2b54c/sklearn/linear_model/logistic.py
The link is to the implementation of sklearn's logistic regression. It contains the optimization algorithms used, which include Newton conjugate gradient (newton-cg) and lbfgs (the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm), both of which make use of the gradient (and, for newton-cg, Hessian-vector products) of the loss function (_logistic_loss). _logistic_loss is the (penalized) negative log-likelihood.