I have a classification problem where I need to predict a class (0 or 1) from the data. Basically, I have a dataset with more than 300 features (including a target column for prediction) and more than 2000 rows (samples). I applied different classifiers as follows:
1. DecisionTreeClassifier()
2. RandomForestClassifier()
3. GradientBoostingClassifier()
4. KNeighborsClassifier()
Almost all the classifiers gave me similar results, with an AUC around 0.50, except for the random forest, which came out around 0.28. I would like to know whether it is correct to invert the random forest result like this:
1 - 0.28 = 0.72
and report 0.72 as the AUC. Is that correct?
Your intuition is not wrong: if a binary classifier indeed performs worse than random (i.e. AUC < 0.5), a valid strategy is to simply invert its predictions, i.e. report a 0 whenever the classifier predicts a 1, and vice versa. From the relevant Wikipedia entry (emphasis added):
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Nevertheless, the formally correct AUC for this inverted classifier would be obtained by first inverting the individual probabilistic predictions prob of your model:
prob_invert = 1 - prob
and then calculating the AUC using these prob_invert predictions (arguably this should give a similar result to the naive approach you describe of simply subtracting the AUC from 1, but I'm not quite sure of the exact outcome - see also this Quora answer).
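For illustration, here is a minimal sketch (assuming a fitted RandomForestClassifier named rf and a held-out X_test, y_test, which are not in your post) of computing the AUC of the inverted predictions with scikit-learn:

from sklearn.metrics import roc_auc_score

prob = rf.predict_proba(X_test)[:, 1]            # predicted probability of the positive class

auc = roc_auc_score(y_test, prob)                # the "bad" AUC, e.g. ~0.28
auc_inverted = roc_auc_score(y_test, 1 - prob)   # AUC of the inverted classifier

print(auc, auc_inverted)                         # in practice auc_inverted comes out as 1 - auc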
Needless to say, all this is based on the assumption that your whole process is correct, i.e. you don't have any modeling or coding errors (constructing a worse-than-random classifier is not exactly trivial).
There's no specific code for this.
Right now, I have a logistic regressor whose target column is is_promoted (boolean), with 0s and 1s. When I find the train and test accuracy and MSE, they are between 0 and 1.
I have a different model, a linear regressor, whose target column is `resale_price` with values of 10,000 and up. When I find the train and test accuracy, the values are negative, and sometimes go past -1. Their MSEs are also at least 5 digits long.
What I am wondering is,
In my logistic regressor, the values are 1 digit long, whereas my linear regressor works with values 5-6 digits long. Do bigger numbers produce a bigger MSE?
My linear regressor's train and test MSE are around 100,000. Could something be wrong with my data preparation?
MSE is not a suitable metric for logistic regression. In a machine learning context, logistic regression predicts membership of a binary class based on the input variables. As you state in the question, the predicted class can only be 1 or 0. The formula for MSE is
MSE = (1/n) * Σ (Y_actual - Y_predicted)²
Clearly, when both the predicted and actual Y values can only be 0 or 1, this formula doesn't make sense as a measure of model quality.
Metrics that make more sense for logistic regression as a classifier algorithm are classifier-specific metrics, such as accuracy, sensitivity, and specificity (see confusion matrix).
Linear regression is a regression algorithm, and so predicts a continuous outcome. In this situation, MSE is a suitable metric (along with R-squared, RMSE, MAE, and others).
And in answer to your second question: MSE is scale-dependent, so without further context this question cannot be answered. A scale-free metric for linear regression is R-squared, which measures how well the predicted values agree with the actual values (the proportion of variance explained), with R-squared = 1 being a perfect fit.
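To make the distinction concrete, here is a minimal sketch (the variable names such as Xc_train, y_class_train, Xr_train, y_price_train are placeholders, not taken from your code) of evaluating each model with metrics that match its type:

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error, r2_score

# classifier: report classification metrics
clf = LogisticRegression().fit(Xc_train, y_class_train)
y_class_pred = clf.predict(Xc_test)
print(accuracy_score(y_class_test, y_class_pred))
print(confusion_matrix(y_class_test, y_class_pred))

# regressor: report regression metrics
reg = LinearRegression().fit(Xr_train, y_price_train)
y_price_pred = reg.predict(Xr_test)
print(mean_squared_error(y_price_test, y_price_pred))   # scale-dependent
print(r2_score(y_price_test, y_price_pred))             # scale-free, 1.0 = perfect fit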
I have a binary classification problem where I use the following code to get my weighted average precision, weighted average recall, weighted average F-measure, and roc_auc.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

df = pd.read_csv(input_path + input_file)
X = df[features]
y = df["gold_standard"]  # a 1-D Series, to avoid sklearn's column-vector warning

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=k_fold,
                        scoring=('accuracy', 'precision_weighted', 'recall_weighted',
                                 'f1_weighted', 'roc_auc'))

print("accuracy")
print(np.mean(scores['test_accuracy']))
print("precision_weighted")
print(np.mean(scores['test_precision_weighted']))
print("recall_weighted")
print(np.mean(scores['test_recall_weighted']))
print("f1_weighted")
print(np.mean(scores['test_f1_weighted']))
print("roc_auc")
print(np.mean(scores['test_roc_auc']))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806, 0.6754, 0.6806, 0.6643, 0.7233
So we can see that in feature setting 1 we get better results for 'accuracy', 'precision_weighted', 'recall_weighted', and 'f1_weighted' compared to feature setting 2.
However, when it comes to 'roc_auc', feature setting 2 is better than feature setting 1. I found this weird because every other metric was better with feature setting 1.
On one hand, I suspect that this happens because I am using weighted scores for precision, recall, and F-measure, but not for roc_auc. Is it possible to compute a weighted roc_auc for binary classification in sklearn?
What is the real reason for these weird roc_auc results?
It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5 then class=1, else class=0.
Metrics like accuracy, precision, recall, and F1-score are calculated on the hard class predictions (0/1), i.e. after the threshold has been applied.
In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.
So, it can certainly happen, and it can indeed lead to confusion among new practitioners.
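To see this in code, here is a minimal sketch (assuming a held-out X_test, y_test, which are not in your snippet) comparing the threshold-dependent metrics with the threshold-free AUC for your fitted clf:

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

proba = clf.predict_proba(X_test)[:, 1]   # probability scores in [0, 1]
y_pred = (proba > 0.5).astype(int)        # hard 0/1 predictions at the implicit 0.5 threshold

print(accuracy_score(y_test, y_pred))     # depends on the 0.5 threshold
print(f1_score(y_test, y_pred))           # depends on the 0.5 threshold
print(roc_auc_score(y_test, proba))       # threshold-free, computed from the raw scores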
The second part of my answer in this similar question might be helpful for more details. Quoting:
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.
I am building a churn prediction model with logistic regression in Python. My model accuracy is 0.47 and it only predicts 0s. The realized y variable is actually 81 zeros and 92 ones.
The dataset I have contains only a few features and 220 users (records). If I set a reference time, it is even less (about 123 records for the training set and 173 for the testing set). So I think the sample size is too small to use logistic regression, but I still tried, because this is just a sample test and this small dataset is all I have. (Theoretically there is more data.)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy: {:.2f}'.format(logreg.score(x_test, y_test)))
Even if I don't hold out a test set, meaning I use the whole dataset to build the model, it still returns only 0s when I predict future churn.
Is it that my sample size is too small, or is it because the accuracy is less than 0.5 that it returns only one value (0 here)? Or did I do something wrong in the code?
Thanks very much!
There are several potential causes of heavily biased predictions from a logistic regression model. For the benefit of a general audience, I will list the most common ones, even though some of them may not apply to your case.
(Skewed output distribution) Your training data has a biased, imbalanced label distribution. If your training set contains, for example, 1 positive and 100,000 negatives, the bias/intercept term of the regression will be very small (strongly negative), and after applying the link function the predicted probabilities are practically zero.
(Sparsity) The feature space is large and your dataset is small, leading to sparse training data, so most new incoming data points have never been seen before. In the worst case, in which all features are categorical (factors), unseen factor values produce all-zero inputs because the correct one-hot column cannot be identified.
(Skewed input distribution) The feature space is small and your dataset is dense around a small region. If that region happens to contain mostly zeros, the predictions are always going to be zero, even for future inputs. For example, say my data X has two columns, gender and age, and most of my data points are 30-year-old males: in a 101-point dataset, 80 out of 100 30-year-old males like ice cream. The model will then predict that 30-year-old males like ice cream for future inputs, which will usually be 30-year-old males, assuming a similar input distribution.
You should check the distribution of the scores using the predict_proba function, and check the distribution of the input features using something like a pairplot.
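As a minimal sketch (using the logreg, x_train, y_train, and x_test from your snippet), you could inspect the score distribution like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

proba = logreg.predict_proba(x_test)[:, 1]         # predicted probability of class 1
print(np.percentile(proba, [0, 25, 50, 75, 100]))  # if everything is below 0.5, you get all-0 predictions

# one possible remedy for a skewed label distribution is class weighting
logreg_balanced = LogisticRegression(class_weight="balanced").fit(x_train, y_train)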
I have a pandas dataframe containing texts and labels, and I'm trying to predict the labels using scikit-learn's CountVectorizer, TfidfTransformer and MultinomialNB. Here's what the dataframe looks like:
text party
0 Herr ålderspresident! Att vara talman i Sverig... S
1 Herr ålderspresident! Ärade ledamöter av Sveri... M
2 Herr ålderspresident! Som företrädare för Alli... M
3 Val av andre vice talman Herr ålderspresident!... SD
4 Herr ålderspresident! Vänsterpartiet vill utny... V
When I construct a pipeline with the three estimators mentioned above, I only get a ~35% accuracy in my predictions, but when I remove the TfidfTransformer the accuracy is bumped up to a more reasonable ~75% accuracy.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),  # problematic row
                     ('clf', MultinomialNB()),
                     ])
text_clf = text_clf.fit(df.text.values, df.party.values)

test = df.sample(500, random_state=42)
docs_test = test.text.values
predicted = text_clf.predict(docs_test)
np.mean(predicted == test.party.values)
# Out: either 0.35 or 0.75 depending on whether I comment out the problematic row above
When I get 0.35 and inspect predicted I see that it almost exclusively contains one label ('S'). This is the most common label in the original dataset, but that shouldn't impact the predictions, right? Any ideas on why I get these strange results?
EDIT: Link to data where anforandetext and parti are the relevant columns.
The reason you are getting such a big difference is smoothing. If you check out the documentation of the MultinomialNB class, look at the alpha parameter: its default value is 1.0, which means the classifier applies plus-one smoothing by default. Plus-one smoothing is a very common technique used with relative frequency estimates to account for unseen data: we add 1 to all raw counts to account for unseen terms and for the sparsity of the document-term matrix.
However, when you use TF-IDF weights, the numbers you get are very small, mostly between 0 and 1. To illustrate, if I take your data and only convert it into TF-IDF weights, this is a small snapshot of the TF-IDF weights I obtain:
(0, 80914) 0.0698184481033
(0, 80552) 0.0304609466459
(0, 80288) 0.0301759343786
(0, 80224) 0.103630302925
(0, 80204) 0.0437500703747
(0, 80192) 0.0808649191625
You can see that these are really small numbers, and adding 1 to them for smoothing has a drastic effect on the calculations that multinomial naive Bayes makes: by adding 1 you completely change their scale for classification, and hence your estimates get distorted. I am assuming you have a good idea of how multinomial naive Bayes works; if not, then definitely see this video. The video and my answer should be sufficient to understand what is going wrong here.
You should either use a small value of alpha in the TF-IDF case, or build the TF-IDF weights after smoothing the raw counts. On a secondary note, please use cross-validation or a separate hold-out set to get your accuracy estimates; by testing the model on a sample of the training data, your accuracy numbers will be extremely biased.
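For example, a minimal sketch (the alpha=0.01 value is just an illustrative choice, not a recommendation) of lowering the smoothing parameter and evaluating with cross-validation:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(alpha=0.01)),  # much smaller than the default alpha=1.0
                     ])

scores = cross_val_score(text_clf, df.text.values, df.party.values, cv=5)
print(scores.mean())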
Hope that helps.
I am training my dataset using LinearSVC in scikit-learn. Can I calculate/get the probability with which a sample is classified under a given label?
For example, using SGDClassifier(loss="log") to fit the data enables the predict_proba method, which gives a vector of probability estimates P(y|x) per sample x:
>>> from sklearn.linear_model import SGDClassifier
>>> clf = SGDClassifier(loss="log").fit(X, y)
>>> clf.predict_proba([[1., 1.]])
array([[ 0.0000005, 0.9999995]])
Is there a similar function I can use to calculate the prediction probability when using svm.LinearSVC (multi-class classification)? I know there is a decision_function method to predict the confidence scores for samples in this case, but is there any way I can calculate probability estimates from this decision function?
No, LinearSVC will not compute probabilities because it is not trained to do so. Use sklearn.linear_model.LogisticRegression, which uses the same underlying algorithm as LinearSVC but with the log loss. It uses the standard logistic function for its probability estimates:
1. / (1 + exp(-decision_function(X)))
(For the same reason, SGDClassifier will only output probabilities with loss="log", not with its default loss function, which makes it learn a linear SVM.)
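A minimal sketch (assuming some binary training data X, y) of the relationship described above:

import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X, y)

manual = 1.0 / (1.0 + np.exp(-clf.decision_function(X)))  # logistic function applied to the margin
builtin = clf.predict_proba(X)[:, 1]                      # probability of the positive class

print(np.allclose(manual, builtin))                       # expected to be True for a binary problem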
Multi-class classification here is one-vs-all classification. For SGDClassifier, since the distance to the hyperplane corresponding to a particular class is returned, the probability is calculated as
(clip(decision_function(X), -1, 1) + 1) / 2
Refer to code for details.
You could implement a similar function; it seems reasonable to me for LinearSVC, although that probably needs some justification. Refer to the paper mentioned in the docs:
Zadrozny and Elkan, “Transforming classifier scores into multiclass probability estimates”, SIGKDD‘02, http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf
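A minimal sketch (binary case only; this is the naive clipping heuristic, not a calibrated probability) of applying the same idea to LinearSVC:

import numpy as np
from sklearn.svm import LinearSVC

svm = LinearSVC().fit(X, y)

scores = svm.decision_function(X)                 # signed distances to the hyperplane
pseudo_proba = (np.clip(scores, -1, 1) + 1) / 2   # naive mapping of the scores into [0, 1]

If you want better-justified probabilities, wrapping LinearSVC in sklearn's CalibratedClassifierCV is another option worth looking at.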
P.S. A comment from "Is there 'predict_proba' for LinearSVC?":
If you want probabilities, you should use either logistic regression or SVC; both can predict probabilities, but in very different ways.