Based on the logistic regression function:
$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$
I'm trying to extract the following values from my model in scikit-learn:
$\beta_0$ and $\beta_1$
where $\beta_0$ is the intercept and $\beta_1$ is the regression coefficient (as per Wikipedia).
Now, I think I can get $\beta_0$ by doing model.intercept_, but I've been struggling to get $\beta_1$. Any ideas?
You can access the coefficients of the features using model.coef_.
It gives an array of values that correspond to beta1, beta2 and so on. The size of the array depends on the number of explanatory variables your logistic regression uses.
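A minimal sketch of reading both attributes, assuming a binary problem fitted on synthetic data (the data, the true coefficients and the variable names are made up for illustration). It also checks that plugging the extracted values into the logistic function reproduces predict_proba:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for this sketch): two explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_logits = -0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(200) < 1 / (1 + np.exp(-true_logits))).astype(int)

model = LogisticRegression().fit(X, y)

beta0 = model.intercept_[0]  # intercept_ has shape (1,) for a binary problem
betas = model.coef_[0]       # coef_ has shape (1, n_features): beta1, beta2, ...

# Rebuild the predicted probability of class 1 by hand from beta0 and betas.
p_manual = 1 / (1 + np.exp(-(beta0 + X @ betas)))
print(beta0, betas)
print(np.allclose(p_manual, model.predict_proba(X)[:, 1]))
```

Note that LogisticRegression applies L2 regularization by default, so the fitted values are shrunk somewhat relative to an unpenalized fit.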
I tried to do a univariate analysis (binary logistic regression, one feature at a time) in Python with statsmodels to calculate the p-value for each feature.
import pandas as pd
import statsmodels.api as sm

features, pvals = [], []
for f_col in f_cols:
    # One single-feature model per column (no intercept column is added here)
    model = sm.Logit(y, df[f_col].astype(float))
    result = model.fit()
    features.append(f_col)
    pvals.append(result.pvalues.iloc[0])  # p-value of the only coefficient
df_pvals = pd.DataFrame(list(zip(features, pvals)),
                        columns=['features', 'pvals'])
df_pvals
However, the result in SPSS is different. The p-value of NYHA from sm.Logit is 0, and all of the other p-values differ as well.
Is it right to use sm.Logit in statsmodels to do binary logistic regression?
Why is there a difference between the results? Does sm.Logit perhaps use L1 regularization?
How can I get the same results?
Many thanks!
SPSS regression modeling procedures include constant or intercept terms automatically, unless they're told not to do so. As Josef mentions, statsmodels appears to require you to explicitly add an intercept.
Is there a Python package (statsmodels/scipy/pandas/etc.) with functionality for estimating the coefficients of a linear regression model with autoregressive errors, such as the following SAS implementation? http://support.sas.com/documentation/cdl/en/etsug/63348/HTML/default/viewer.htm#etsug_autoreg_sect003.htm
statsmodels http://www.statsmodels.org/dev/index.html has ARMA, ARIMA and SARIMAX models that take explanatory variables to model the mean. This corresponds to a linear model, y = X b + e, where the error term e follows an ARMA or seasonal ARMA process. AR errors are a special case when the moving average term has no lags.
statsmodels also has an autoregressive AR class but it does not allow for explanatory variables.
In these time series models, prediction is a conditional prediction that takes the history into account for forecasting.
statsmodels also has a GLSAR class which is a linear model that removes the effect of autocorrelated AR residuals. This uses feasible generalized least squares estimation and can only predict the unconditional term X b.
http://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARMA.html#statsmodels.tsa.arima_model.ARMA
http://www.statsmodels.org/dev/statespace.html#seasonal-autoregressive-integrated-moving-average-with-exogenous-regressors-sarimax
How does scikit-learn's sklearn.linear_model.LogisticRegression class work with regression as well as classification problems?
As given on the Wikipedia page as well as a number of sources, since the output of Logistic Regression is based on the sigmoid function, it returns a probability. Then how does the sklearn class work as both a classifier and regressor?
Logistic regression is a method for classification, not regression. This goes for scikit-learn as for anywhere else.
If you have entered continuous values as the target vector y, then LogisticRegression will most probably fail, as it interprets the unique values of y, i.e. np.unique(y) as different classes. So you may end up having as many classes as samples.
TL;DR: Logistic regression needs a categorical target variable, because it is a classification method.
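A short sketch of what this looks like in practice, assuming synthetic data. The exact behavior with a continuous target depends on the scikit-learn version (recent versions are expected to raise a ValueError rather than treat each value as a class), so the example handles both outcomes and then shows the usual fix of binarizing the target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y_continuous = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

try:
    LogisticRegression().fit(X, y_continuous)
    outcome = "fit ran; every distinct value of y was treated as a class"
except ValueError as exc:
    outcome = f"fit refused the continuous target: {exc}"
print(outcome)

# Turning the target into categories makes it a valid classification problem.
y_binary = y_continuous > 0
clf = LogisticRegression().fit(X, y_binary)
print(clf.classes_)
print(clf.predict_proba(X[:3]))  # probabilities, one column per class
```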
I have texts that are rated on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.
How can I perform binomial logistic regression to get the probability that test data is -100 or +100?
The closest I have got is SGDClassifier(penalty='l2', alpha=1e-05, n_iter=10), but this doesn't give the same results as SPSS when I use binomial logistic regression to predict the probability of -100 and +100. So I'm guessing this is not the right function?
SGDClassifier provides access to several linear classifiers, all trained with stochastic gradient descent. It defaults to a linear support vector machine unless you call it with a different loss function. loss='log' (spelled loss='log_loss' in scikit-learn 1.1 and later) will provide a probabilistic logistic regression.
See the documentation at:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
Alternatively, you could use sklearn.linear_model.LogisticRegression to classify your texts with a logistic regression.
It's not clear to me that you will get exactly the same results as you do with SPSS due to differences in implementation. However, I would not expect to see statistically significant differences.
Edited to add:
My suspicion is that the 99% accuracy you're getting with the SPSS logistic regression is training set accuracy, while the 87% that you're seeing with the scikit-learn logistic regression is test set accuracy. I found this question on the Data Science Stack Exchange where a different person is attempting an extremely similar problem, and getting ~99% accuracy on training sets and 90% test set accuracy.
https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features
My recommended path forward is as follows: try several different basic classifiers in scikit-learn, including the standard logistic regression and a linear SVM, and also rerun the SPSS logistic regression several times with different train/test subsets of your data and compare the results. If you continue to see a large divergence across classifiers that can't be accounted for by ensuring similar train/test data splits, then post the results that you're seeing into your question, and we can move forward from there.
Good luck!
If pos/neg, or the probability of pos, is really the only thing you need as output, then you can derive binary labels y as
y = score > 0
assuming you have the scores in a NumPy array score.
You can then feed this to a LogisticRegression instance, using the continuous score to derive relative weights for the samples:
clf = LogisticRegression()
# Cast to float so the in-place division also works for integer scores
sample_weight = np.abs(score).astype(float)
sample_weight /= sample_weight.sum()
clf.fit(X, y, sample_weight=sample_weight)
This gives maximum weight to tweets with scores ±100, and a weight of zero to tweets that are labeled neutral, varying linearly between the two.
If the dataset is very large, then as @brentlance showed, you can use SGDClassifier, but you have to give it loss="log" (loss="log_loss" in scikit-learn 1.1 and later) if you want a logistic regression model; otherwise, you'll get a linear SVM.
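Put together, the approach above might look like the following sketch. The feature matrix X and the score array are synthetic stand-ins (an assumption; in the question they would come from the rated texts):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the features and the -100..+100 ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
raw = 40 * (X @ np.array([1.0, -1.0, 0.5])) + rng.normal(scale=20, size=400)
score = np.clip(np.round(raw), -100, 100)

y = score > 0  # binary labels: True = positive, False = negative or neutral

# Weight each sample by how far its score is from neutral, normalized to sum to 1.
sample_weight = np.abs(score).astype(float)
sample_weight /= sample_weight.sum()

clf = LogisticRegression()
clf.fit(X, y, sample_weight=sample_weight)

proba_pos = clf.predict_proba(X)[:, 1]  # estimated probability of "positive"
print(proba_pos[:5])
```

One design point to be aware of: because the weights are normalized to sum to 1, the default L2 penalty counts for relatively more than it would with unit weights, which pulls the coefficients toward zero. Whether to keep the normalization or to raise C is a modelling choice.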
I am running a logistic regression using statsmodels and am trying to find the score of my regression. The documentation doesn't provide much information about the score method, unlike sklearn, which lets the user pass a test dataset along with the y values, i.e. lr.score(test_data, target). What parameters should I pass to statsmodels' score function, and how? Documentation: http://statsmodels.sourceforge.net/stable/generated/statsmodels.discrete.discrete_model.Logit.score.html#statsmodels.discrete.discrete_model.Logit.score
In statistics and econometrics score refers usually to the derivative of the log-likelihood function. That's the definition used in statsmodels.
Prediction performance measures for classification or regression with binary dependent variables have largely been neglected in statsmodels.
An open pull request is here
https://github.com/statsmodels/statsmodels/issues/1577
statsmodels does have performance measures for continuous dependent variables.
You pass it model parameters, i.e. the coefficients for the predictors. However, that method doesn't do what you think it does: it returns the score vector for the model, not the accuracy of its predictions (like the scikit-learn score method).
But you can always check the pseudo R-squared of the fitted results, e.g. result.prsquared.