I am trying to predict a set of labels using Logistic Regression from SciKit. My data is really imbalanced (there are many more '0' than '1' labels) so I have to use the F1 score metric during the cross-validation step to "balance" the result.
[Input]
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score

# generate_datasets is a helper defined elsewhere (not shown)
X_training, y_training, X_test, y_test = generate_datasets(df_X, df_y, 0.6)
logistic = LogisticRegressionCV(
    Cs=50,
    cv=4,
    penalty='l2',
    fit_intercept=True,
    scoring='f1'
)
logistic.fit(X_training, y_training)
print('Predicted: %s' % str(logistic.predict(X_test)))
print('F1-score: %f' % f1_score(y_test, logistic.predict(X_test)))
print('Accuracy score: %f' % logistic.score(X_test, y_test))
[Output]
>> Predicted: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
>> Actual: [0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1]
>> F1-score: 0.285714
>> Accuracy score: 0.782609
>> C:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:958:
UndefinedMetricWarning:
F-score is ill-defined and being set to 0.0 due to no predicted samples.
I certainly know that the problem is related to my dataset: it is too small (it is only a sample of the real one). However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the curtains?
It seems this is a known bug that has since been fixed, so you should try updating sklearn.
However, can anybody explain the meaning of the "UndefinedMetricWarning" warning that I am seeing? What is actually happening behind the curtains?
This is well-described at https://stackoverflow.com/a/34758800/1587329:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/classification.py
F1 = 2 * (precision * recall) / (precision + recall)
precision = TP/(TP+FP): as you've just said, if the predictor doesn't predict the positive class at all, precision is 0.
recall = TP/(TP+FN): if the predictor doesn't predict the positive class, TP is 0, so recall is 0.
So now you are dividing 0/0.
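The arithmetic can be checked by hand. The sketch below is plain Python rather than scikit-learn's own implementation, and the helper name `precision_recall_f1` is made up for illustration. It recomputes the scores from the Predicted/Actual arrays in the question, then shows the 0/0 case for a set of predictions that contains no positives at all, which is presumably what happens on one of the cross-validation folds.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1); None marks a 0/0 (undefined) value."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or precision + recall == 0:
        return precision, recall, None
    return precision, recall, 2 * precision * recall / (precision + recall)

# Arrays copied from the question's output.
actual    = [0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,1]
predicted = [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]

precision, recall, f1 = precision_recall_f1(actual, predicted)
print(round(f1, 6))  # 0.285714, the F1-score reported in the question

# A predictor that never outputs the positive class: TP = FP = 0.
all_negative = [0] * len(actual)
print(precision_recall_f1(actual, all_negative))  # (None, 0.0, None)
```

In the undefined case scikit-learn reports 0.0 and emits the warning instead of returning nothing, which is why the message says the score is "being set to 0.0".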
To fix the weighting problem (it's easy for the classifier to (almost) always predict the more prevalent class), you can use class_weight="balanced":
logistic = LogisticRegressionCV(
    Cs=50,
    cv=4,
    penalty='l2',
    fit_intercept=True,
    scoring='f1',
    class_weight="balanced"
)
The LogisticRegressionCV documentation says:
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
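To make the quoted formula concrete, here is a small sketch in plain Python that applies n_samples / (n_classes * count) to the 23 "Actual" labels from the question. The function name `balanced_class_weights` is an illustrative assumption, and `collections.Counter` stands in for `np.bincount`.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * samples_in_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}

# "Actual" labels from the question: 17 zeros and 6 ones.
y = [0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,1]
weights = balanced_class_weights(y)
print(weights)  # class 0 gets 23/34 (about 0.68), class 1 gets 23/12 (about 1.92)
```

The rare class ends up weighted roughly three times as heavily as the common one, so misclassifying a '1' costs more during fitting.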
Related
I have a 2D Training dataset with 23 samples. There are 6 true positives in the dataset and 10 true negatives. The data passed into the SVM is of shape (23,2) but the support_vectors_ actually used for training is (16,2).
The consequence is that the trained SVM model is biased toward the true positives, a bias it wouldn't have if the true negatives weren't dropped.
Circles are the SVM support vectors. Coloured are the input training vectors to the SVM.
Code:
print(features.shape, ground_truths.shape)
print(ground_truths)
svm_model = svm.SVC(kernel=kernel, degree=degree, C = regularization, probability=True)
svm_model.fit(features, ground_truths)
print(svm_model.support_vectors_.shape)
Output:
(23, 2) (23,)
[0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0]
(16, 2)
Visualization
Why is the SVM model dropping these samples?
I got a matrix Z (3000*2000), where each row describes a sample. Each column describes a single feature which is a nucleotide (A,G,T,C) and I have standardized the data so that each column contains only 0 and 1.
The matrix then looks like this:
[[1 1 1 ... 1 1 1]
[1 1 1 ... 1 1 1]
[0 1 0 ... 1 1 1]
...
[1 0 1 ... 0 1 1]
[1 1 0 ... 1 1 1]
[1 1 1 ... 1 1 0]]
And y looks like this:
[
'6484321.23'
'9646585.73'
'2346813.11'
...
'8369179.01'
'6200894.94'
'7927300.10']
I tried the following to fit a support vector machine:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
grid = {"C": np.logspace(-5,5,10), "gamma": np.logspace(-5,5,10)}
cv = KFold(n_splits=10)
for i,j in cv.split(Z,y):
    Z1 = Z[i]
    Z2 = Z[j]
    y1 = y[i]
    y2 = y[j]
    supportvectorrregression = SVR(kernel="rbf")
    gridsearch = GridSearchCV(supportvectorrregression, grid,cv=2, scoring="accuracy", iid=False)
    gridsearch.fit(Z1,y1)
    scores = gridsearch.decision_function(Z2)
And now I get this error:
ValueError Traceback (most recent call last)
10 supportvectorrregression = SVR(kernel="rbf")
11 gridsearch = GridSearchCV(supportvectorrregression, grid,cv=2, scoring="accuracy", iid=False)
---> 12 gridsearch.fit(Z1,y1)
ValueError: continuous is not supported
Why is continuous data not supported here?
What can I do?
I think the problem is within this line :
GridSearchCV(supportvectorrregression, grid,cv=2, scoring="accuracy", iid=False)
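You chose scoring="accuracy", but SVR is a regression model, so this is a regression problem. Accuracy is a classification metric and cannot be computed on continuous predictions, which is what the error is telling you.
Try swapping in an appropriate regression metric of your choice, for example scoring="neg_mean_squared_error" or scoring="r2"; see the metrics section of the scikit-learn documentation.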
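To illustrate why accuracy does not fit a continuous target, here is a hand-rolled sketch in plain Python. The three targets are the first values of y shown in the question; the predictions are made-up numbers, and the helper functions are illustrative stand-ins, not scikit-learn's implementations.

```python
# The question's y holds strings, so convert to floats before any scoring.
y_true = [float(v) for v in ['6484321.23', '9646585.73', '2346813.11']]
# Hypothetical predictions, invented purely for illustration.
y_pred = [6400000.0, 9700000.0, 2500000.0]

def exact_match_accuracy(true, pred):
    """Fraction of predictions that equal the target exactly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def mean_squared_error(true, pred):
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)

def r2(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean) ** 2 for t in true)
    return 1 - ss_res / ss_tot

print(exact_match_accuracy(y_true, y_pred))  # 0.0: no float matches exactly
print(mean_squared_error(y_true, y_pred))
print(r2(y_true, y_pred))
```

Even predictions that are close in relative terms score zero on exact-match accuracy, while the regression metrics reflect how close they are. Note also that SVR exposes predict rather than decision_function, so the last line of the loop in the question would likely need gridsearch.predict(Z2) once the scoring is fixed.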
I am using a multiclass classification-ready dataset with 14 continuous variables and classes from 1 to 10.
This is the data file:
https://drive.google.com/file/d/1nPrE7UYR8fbTxWSuqKPJmJOYG3CGN5y9/view?usp=sharing
My goal is to apply the scikit-learn Gaussian NB model to the data, but in a binary classification task where only class 2 is the positive label and the remainder of the classes are all negatives. For that, I did the following code:
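import numpy as np
import pandas as pd
from sklearn.metrics import matthews_corrcoef, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, CategoricalNB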
dataset = pd.read_csv("PD_21_22_HA1_dataset.txt", index_col=False, sep="\t")
x_d = dataset.values[:, :-1]
y_d = dataset.values[:, -1]
### train_test_split to split the dataframe into train and test sets
## with a partition of 20% for the test https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.20, random_state=23)
yc_TRAIN=np.array([int(i==2) for i in y_TRAIN])
mdl = GaussianNB()
mdl.fit(X_TRAIN, yc_TRAIN)
preds = mdl.predict(X_IVS)
# binarization of "y_true" array
yc_IVS=np.array([int(i==2) for i in y_IVS])
print("The Precision is: %7.4f" % precision_score(yc_IVS, preds))
print("The Matthews correlation coefficient is: %7.4f" % matthews_corrcoef(yc_IVS, preds))
But I get the following warning message when calculating precision:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
The Matthews correlation coefficient function also outputs 0 and gives a RuntimeWarning: invalid value encountered in double_scalars.
Furthermore, by inspecting preds, I got that the model predicts only negatives/zeros.
I've tried increasing the 20% test partition as some forums suggested but it didn't do anything.
Is this simply a problem of the model not being able to fit against the data or am I doing something wrong that may be inputting the wrong data format/type into the model?
Edit: yc_TRAIN is the result of turning all class 2 cases into my positive cases ("1") and the remaining classes into negatives ("0"), so it's a 1-D array of length 9450 (which matches my total number of prediction cases) with 8697 0s and 753 1s. It looks something like this:
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
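Both warnings have the same root: when the model predicts only zeros, TP and FP are both 0, so precision and the Matthews correlation coefficient each reduce to 0/0. A minimal sketch in plain Python follows; the confusion counts are hypothetical and the helper `precision_and_mcc` is illustrative, not the scikit-learn code.

```python
import math

def precision_and_mcc(tp, fp, fn, tn):
    """Return (precision, mcc); None marks a 0/0 (undefined) value."""
    precision = tp / (tp + fp) if (tp + fp) else None
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else None
    return precision, mcc

# Hypothetical counts for a model that predicts only the negative class.
print(precision_and_mcc(tp=0, fp=0, fn=150, tn=1740))  # (None, None)

# Hypothetical counts once some positives are predicted: both are defined.
print(precision_and_mcc(tp=50, fp=10, fn=100, tn=1730))
```

scikit-learn reports 0 in the undefined cases rather than raising, which matches the 0.0 precision and 0 MCC seen in the question.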
Your code looks fine; this is a classic problem with imbalanced datasets, and it actually means you do not have enough training data to correctly classify the rare positive class.
The only thing you could improve in the given code is to set stratify=y_d in train_test_split, in order to get a stratified training set; decreasing the size of the test set (i.e. leaving more samples for training) may also help:
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.10, random_state=23, stratify=y_d)
If this does not work, you should start thinking of applying class imbalance techniques (or different models); but this is not a programming question any more but a theory/methodology one, and it should be addressed at the appropriate SE sites and not here (see the intro and NOTE in the machine-learning tag info).
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
from sklearn.svm import SVC

clf = SVC(kernel = 'rbf', probability=True)
cv = KFold(n_splits=5, shuffle = False)
for train_index, test_index in cv.split(data[name][:, :]):
    # split features and labels for this fold
    X_train, X_test, y_train, y_test = data[name][train_index], data[name][test_index], labels[train_index], labels[test_index]
    clf.fit(X_train, y_train)
    print('\ndecision_function : \n', clf.decision_function(X_test))
    print('\npredict probabilty : \n', clf.predict_proba(X_test)[:,1])
    print('prediction : ', clf.predict(X_test))
scores_temp = cross_val_score(clf, data[name], labels, cv = cv)
y_pred = cross_val_predict(clf, data[name], labels, cv=cv)
cfm = confusion_matrix(labels, y_pred, labels = [1,0])
print('cross_val_score : ', scores_temp)
print('y_pred :', y_pred)
I am running a binary classification on 20 samples with 23 features each.
I use sklearn's SVM with 5-fold cross-validation. The cross_val_score function and the decision_function method give consistent results, but predict_proba, which I need, gives different ones.
With the default threshold of zero on decision_function, the classification comes out the way I want, but I wonder why clf.predict_proba sometimes gives the higher probability to the other class (some cases are printed correctly).
decision_function :
[-0.47648561 -1.10368407 -0.87143091 0.15767618]
predict probabilty :
[0.44280357 0.37062859 0.396561 0.51976092]
prediction : [0 0 0 1]
decision_function :
[ 0.32952592 -0.40706183 0.75996465 -0.03300075]
predict probabilty :
[0.5 0.5 0.5 0.5]
prediction : [1 0 1 0]
decision_function :
[-0.64538296 -0.17156661 0.19868302 -0.01625615]
predict probabilty :
[0.57873957 0.50805464 0.45274971 0.48478674]
prediction : [0 0 1 0]
decision_function :
[-0.07399867 -0.24230866 0.07278593 0.4155084 ]
predict probabilty :
[0.52367709 0.48493753 0.55733607 0.63592478]
prediction : [0 0 1 1]
decision_function :
[-0.30797883 -0.1165149 0.35051264 -0.0306706 ]
predict probabilty :
[0.48693078 0.5 0.53308813 0.50638279]
prediction : [0 0 1 0]
cross_val_score : [0.75 0.5 0.75 1. 0.75]
y_pred : [0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0]
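One likely explanation, offered as a sketch rather than a diagnosis: with probability=True, SVC calibrates probabilities by fitting a sigmoid to the decision values (Platt scaling) using an internal cross-validation, and the scikit-learn documentation warns that the resulting estimates may be inconsistent with predict, especially on very small datasets. The code below is plain Python; the sigmoid parameters A and B are hypothetical values chosen for illustration, not ones fitted by the library.

```python
import math

def platt_probability(decision_value, a, b):
    """Sigmoid used in Platt scaling: P(y=1 | f) = 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

A, B = -1.5, 0.4  # hypothetical calibration parameters

# Decision values from the first fold in the question.
for f in [-0.47648561, -1.10368407, -0.87143091, 0.15767618]:
    label = int(f > 0)                 # what predict() does: threshold at 0
    prob = platt_probability(f, A, B)  # what a calibrated sigmoid would give
    print(f, label, round(prob, 3))
```

Whenever B is not zero, the sigmoid crosses 0.5 at a decision value other than 0, so a sample with a small positive decision value (predicted as 1) can still receive a probability below 0.5. With only 16 training samples per fold, the fitted sigmoid can be quite unstable.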
I get this error when I try to compute the prediction accuracy. I have tried everything I could think of, including related Stack Overflow questions, but I still cannot fix the bug.
The code snippet with the bug is:
author_pred1 = model1.predict([ThreeGramTest, ThreeGramTest, ThreeGramTest,ThreeGramTest])
print("class prediction without argmax:",author_pred1)
author_pred1=np.argmax(author_pred1, axis=1)
# Evaluate
print("test data one hot lable", TestAuthorHot)
print("class prediction with argmax:",author_pred1)
# author_pred1 = author_pred1.astype("int64")
print("type of prediction output",type(author_pred1))
print("type of test data", type(TestAuthorHot))
print(np.array(np.unique(author_pred1, return_counts=True)).T)
print(np.array(np.unique(TestAuthorHot, return_counts=True)).T)
# accuracy = accuracy_score(TestAuthorHot, author_pred1.round(), normalize=False)# the bug is here
precision, recall, f1, support = score(TestAuthorHot, author_pred1)
ave_precision = np.average(precision, weights=support / np.sum(support))
ave_recall = np.average(recall, weights=support / np.sum(support))
To show the shapes, the values of the data are:
class prediction without argmax: [[3.9413989e-02 8.4685171e-03 2.7781539e-03 ... 5.0324947e-03
6.2263450e-07 3.1461464e-10]
[1.1533947e-02 4.0361892e-02 1.4060171e-02 ... 4.7175577e-05
1.4333490e-01 2.0528505e-07]
[4.5363868e-06 3.1557463e-03 1.4047540e-02 ... 1.3272668e-03
4.6724287e-07 5.9454552e-10]
...
[1.9417159e-04 1.7364822e-02 2.9031632e-03 ... 5.0036388e-04
1.3315305e-04 9.0704253e-07]
[1.8054984e-09 2.9453583e-08 2.3744430e-08 ... 2.7137769e-03
7.7114571e-08 4.9026494e-10]
[7.8946296e-06 5.9516740e-05 8.2868773e-10 ... 3.1905161e-04
2.5262805e-06 2.0384558e-09]]
test data one hot lable [[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 1 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 1 ... 0 0 0]]
class prediction with argmax: [ 7 37 37 ... 39 4 4]
How can I fix this?
The error happens because you are passing accuracy_score a 2D matrix (TestAuthorHot is a 2D one-hot matrix of labels). accuracy_score cannot compare a one-hot matrix against a 1D vector of class indices, so you need to convert TestAuthorHot to 1D so that it matches author_pred1 (which is 1D).
To do this you can simply do:
accuracy_score(np.argmax(TestAuthorHot, axis=1), author_pred1)
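The conversion can be illustrated without any libraries. The sketch below uses plain Python lists; the tiny one-hot matrix and predictions are hypothetical stand-ins for TestAuthorHot and author_pred1, and the `argmax` helper mimics what np.argmax(..., axis=1) does per row.

```python
def argmax(row):
    """Index of the largest entry in a row (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical one-hot labels (3 samples, 4 classes) and predicted class indices.
test_one_hot = [[0, 0, 1, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 1]]
predicted = [2, 1, 3]

# Collapse each one-hot row to its class index so both sides are 1D.
true_labels = [argmax(row) for row in test_one_hot]
accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

print(true_labels)  # [2, 0, 3]
print(accuracy)     # 2 of 3 correct
```

The same conversion is presumably needed for the call to score(TestAuthorHot, author_pred1) in the question, since it receives the same mismatched pair of inputs.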