I have built a number of sklearn classifier models to perform multi-label classification and I would like to calibrate their predict_proba outputs so that I can obtain confidence scores. I would also like to use metrics such as sklearn.metrics.recall_score to evaluate them.
I have 4 labels to predict and the true labels are multi-hot encoded (e.g. [0, 1, 1, 1]). As a result, CalibratedClassifierCV does not directly accept my data:
clf = tree.DecisionTreeClassifier(max_depth=15)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
This would return an error:
ValueError: classes [[0 1]
[0 1]
[0 1]
[0 1]] mismatch with the labels [0 1 2 3] found in the data
Thus, I tried to wrap it in a OneVsRestClassifier:
clf = OneVsRestClassifier(tree.DecisionTreeClassifier(max_depth=15), n_jobs=4)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
Note that MultiOutputClassifier and ClassifierChain do not work even though they possibly suit my problem better.
It works, but the predict output of the calibrated classifier is multi-class instead of multi-label because of its implementation. There are four classes ([0 1 2 3]), but even when no label should be assigned, it still predicts a 0.
Upon further inspection by means of calibration curves, it turns out the base estimator wrapped inside the calibrated classifier is not calibrated at all. That is, (calibrated_clf.calibrated_classifiers_)[0].base_estimator returns the same clf as before calibration.
I would like to observe the performance of my (calibrated) models doing deterministic (predict) and probabilistic (predict_proba) predictions. How should I design my model/wrap things in other containers to get both calibrated probabilities for each label and comprehensible label predictions?
In your example, you're using a DecisionTreeClassifier, which by default supports targets of dimension (n, m) where m > 1.
However, if you want the marginal probability of each class as a result, then use OneVsRestClassifier.
Notice that CalibratedClassifierCV expects the target to be 1-D, so the "trick" is to extend it to support multilabel classification with MultiOutputClassifier.
Full Example
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
# Generate a sample multilabel target
X, y = make_multilabel_classification(n_classes=4, random_state=0)
y
>>>
array([[1, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 1, 0],
...
[0, 0, 0, 0],
[0, 1, 1, 1],
[1, 1, 0, 1]])
# Split in train/test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.9, random_state=42
)
# Stratified CV splits on the target variable
cv = StratifiedKFold(n_splits=2)
# Decision trees support multilabel targets natively; wrap in OneVsRest to get marginal (per-class) probabilities
clf = OneVsRestClassifier(DecisionTreeClassifier(max_depth=10))
# Calibrate the estimator's probabilities
calibrated_clf = CalibratedClassifierCV(clf, cv=cv)
# CalibratedClassifierCV expects a 1-D target, so extend it to multi-target classification
multioutput_clf = MultiOutputClassifier(calibrated_clf).fit(X_train, y_train)
# Check predict
multioutput_clf.predict(X_test[-5:])
>>>
array([[0, 0, 1, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1]])
# Check predict_proba
multioutput_clf.predict_proba(X_test[-5:])
>>>
[array([[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685]]),
array([[0.59166537, 0.40833463],
[0.59166537, 0.40833463],
[0.40833361, 0.59166639],
[0.59166537, 0.40833463],
[0.59166537, 0.40833463]]),
array([[0.61666922, 0.38333078],
[0.61666427, 0.38333573],
[0.80000098, 0.19999902],
[0.61666427, 0.38333573],
[0.61666427, 0.38333573]]),
array([[0.26874774, 0.73125226],
[0.26874774, 0.73125226],
[0.45208444, 0.54791556],
[0.26874774, 0.73125226],
[0.26874774, 0.73125226]])]
Notice that the result from predict_proba is a list of 4 arrays, one per label; each array gives the probability of not belonging / belonging to class i. For example, the first row of the first array gives the probability that the first sample belongs to class 1, and so on.
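If you need a single (n_samples, n_labels) matrix of calibrated confidences, e.g. to threshold into multi-hot predictions for sklearn.metrics.recall_score, you can stack the positive-class column of each array. A minimal sketch, assuming the fitted multioutput_clf, X_test and y_test from the example above:
import numpy as np
from sklearn.metrics import recall_score
# column 1 of each per-label array is P(label == 1)
proba_list = multioutput_clf.predict_proba(X_test)
positive_proba = np.column_stack([p[:, 1] for p in proba_list])
# threshold at 0.5 to get multi-hot predictions comparable to y_test
y_pred = (positive_proba >= 0.5).astype(int)
print(recall_score(y_test, y_pred, average="macro"))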
Regarding the calibration curves, scikit-learn provides examples of plotting probability calibration curves for two-class and three-class targets.
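To check calibration per label yourself, a rough sketch (assuming the positive_proba matrix and y_test from the previous snippet) could plot one reliability curve per label with sklearn.calibration.calibration_curve:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
for i in range(y_test.shape[1]):
    # fraction of positives vs. mean predicted probability, per bin
    frac_pos, mean_pred = calibration_curve(y_test[:, i], positive_proba[:, i], n_bins=5)
    plt.plot(mean_pred, frac_pos, marker="o", label=f"label {i}")
plt.plot([0, 1], [0, 1], "k--", label="perfectly calibrated")
plt.xlabel("mean predicted probability")
plt.ylabel("fraction of positives")
plt.legend()
plt.title("Per-label calibration curves")
plt.show()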
Related
I am trying to train a model using an independent variable (an Arabic sentence) and dependent variables (multiclass, one-hot encoded). I used a tokenizer for the train and test sets.
The Model:
model = Sequential()
model.add(Embedding(num_words,32,input_length=max_length))
model.add(LSTM(64,dropout=0.1))
model.add(Dense(4,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer=optimizer, metrics=['accuracy'])
# some code here
model.fit(train_padded,y_train,epochs=1, validation_data=(test_padded,y_test))
The problem is when I use score = f1_score(y_test, ynew, average='weighted') as evaluation. It shows the following error:
ValueError: Classification metrics can't handle a mix of multilabel-indicator and multiclass targets
ynew and y_test values are the following:
ynew = array([2, 1, 3, ..., 3, 0, 1], dtype=int64)
y_test = array([[0, 0, 1, 0],
[0, 1, 0, 0],
[0, 0, 0, 1],
...,
[0, 0, 0, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]], dtype=uint8)
Both arguments of f1_score() must be in the same format: either one-hot encoding or label encoding. You cannot pass two differently encoded arguments. Use one of the following options.
Option 1: You could convert ynew to one-hot encoding.
# one-hot encode ynew before calculating f1_score
# (assumes keras is already imported, e.g. from tensorflow import keras)
ynew = keras.utils.to_categorical(ynew)
f1_score(y_test, ynew, average='weighted')
Option 2: You could convert ynew to one-hot encoding using LabelBinarizer.
from sklearn.preprocessing import LabelBinarizer
# one-hot encode ynew, before calculating f1_score
ynew = LabelBinarizer().fit_transform(ynew)
f1_score(y_test, ynew, average='weighted')
Option 3: You could convert y_test from one-hot encoding to label encoding.
import numpy as np
# label encode y_test, before calculating f1_score
y_test = np.argmax(y_test, axis=1)
f1_score(y_test, ynew, average='weighted')
I have been trying to cluster based on the SGD model parameters (coefficient and intercept). coef_ holds the weights w and intercept_ holds b.
How can those parameters be used with clustering (KMedoids) on a group of the learned model?
import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier()
clf.fit(X, Y)
So I want to cluster based on clf.coef_ (array([[19.47419669, 9.73709834]])) and clf.intercept_ (array([-10.])) for each learned model.
Build your X dataset for clustering by appending the coef_ and intercept_ arrays every time after you train a model, i.e.:
# flatten coef_ so it can be concatenated with intercept_, then stack as a new row
X = np.vstack((X, np.hstack((clf.coef_.ravel(), clf.intercept_))))
Once you have all your data in X, feed it to a KMedoids model, i.e.:
from sklearn_extra.cluster import KMedoids
kmed = KMedoids(n_clusters=N).fit(X)
Note that you have to specify N, and you should probably test the clustering results for a number of values of N before choosing the best one based on one or more clustering metrics.
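Putting it together, a rough end-to-end sketch (assuming sklearn_extra is installed, and that `datasets` is a hypothetical list of (X_i, Y_i) training sets, one per model to be learned; silhouette_score is just one example of a clustering metric for picking N):
import numpy as np
from sklearn import linear_model
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids
# build one row of parameters per trained model
param_rows = []
for X_i, Y_i in datasets:  # `datasets` is a placeholder, not a real variable
    clf = linear_model.SGDClassifier().fit(X_i, Y_i)
    # flatten coef_ (shape (1, n_features)) and append intercept_ (shape (1,))
    param_rows.append(np.hstack((clf.coef_.ravel(), clf.intercept_)))
X_params = np.vstack(param_rows)
# try several cluster counts and keep the one with the best silhouette score
best_n, best_score = None, -1.0
for n in range(2, min(10, len(X_params))):
    labels = KMedoids(n_clusters=n, random_state=0).fit_predict(X_params)
    score = silhouette_score(X_params, labels)
    if score > best_score:
        best_n, best_score = n, score
print(best_n, best_score)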
I have some SVM classifier (LinearSVC) outputting final classifications for every sample in the test set, something like
1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1
and so on.
The "truth" labels is also something like
1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1
I would like to run that SVM with some parameters, generate points for the ROC curve, and calculate the AUC.
I could do this by myself, but I am sure someone did it before me for cases like this.
Unfortunately, everything I can find is for cases where the classifier returns probabilities, rather than hard estimations, like here or here
I thought this would work, but from sklearn.metrics import plot_roc_curve is not found!
Is there anything online that fits my case?
Thanks
You could get around the problem by using sklearn.svm.SVC and setting the probability parameter to True.
As you can read:
probability: boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to
calling fit, will slow down that method as it internally uses 5-fold
cross-validation, and predict_proba may be inconsistent with predict.
Read more in the User Guide.
As an example (details omitted):
from sklearn.svm import SVC
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
.
.
.
model = SVC(kernel="linear", probability=True)
model.fit(X_train, y_train)
.
.
.
decision_scores = model.decision_function(X_test)
fpr, tpr, thres = roc_curve(y_test, decision_scores)
print('AUC: {:.3f}'.format(roc_auc_score(y_test, decision_scores)))
# roc curve
plt.plot(fpr, tpr, "b", label='Linear SVM')
plt.plot([0,1],[0,1], "k--", label='Random Guess')
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend(loc="best")
plt.title("ROC curve")
plt.show()
and you should get a plot of the resulting ROC curve.
NOTE that LinearSVC is MUCH FASTER than SVC(kernel="linear"), especially if the training set is very large or has many features.
You can use the decision function here:
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
X, y = make_classification(n_features=4, random_state=0)
clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(X, y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=0, tol=1e-05, verbose=0)
print(clf.predict([[0, 0, 0, 0]]))
#>>[1]
print(clf.decision_function([[0, 0, 0, 0]]))
#>>[ 0.2841757]
The cleanest way would be to use Platt scaling to convert the distance to the hyperplane, as given by decision_function, into a probability.
However, quick and dirty:
import math
[math.tanh(v)/2+0.5 for v in clf.decision_function([[0, 0, 0, 0],[1,1,1,1]])]
#>>[0.6383826839666699, 0.9635586809605969]
As Platt scaling preserves the order of the examples, the resulting ROC curve will be consistent.
In addition:
Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
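Following that advice, a minimal sketch reusing the clf, X and y from the snippet above (for illustration only; the curve should really be computed on held-out data): decision_function scores can be passed straight to roc_curve and roc_auc_score, since ROC/AUC only need a ranking of the samples, not probabilities.
from sklearn.metrics import roc_curve, roc_auc_score
scores = clf.decision_function(X)  # signed distances to the hyperplane
fpr, tpr, thresholds = roc_curve(y, scores)
print("AUC: {:.3f}".format(roc_auc_score(y, scores)))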
I use GridSearchCV to fit an SVM, and I want to know the number of support vectors for all the fitted models. For now I can only access this SVM attribute for the best model.
Toy example:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = SVC()
params = {'C': [0.01, 0.1, 1]}
search = GridSearchCV(estimator=clf, cv=2, param_grid=params, return_train_score=True)
search.fit(X, y);
with the number of support vectors for the best model:
search.best_estimator_.n_support_
How can I get n_support_ for all models, just as we get the train/test error separately for each value of the parameter C?
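GridSearchCV only keeps the refit best estimator, so one possible workaround (a sketch, not the only approach) is to rerun the same parameter grid yourself with cross_validate(return_estimator=True), which keeps every fitted fold estimator so you can read off n_support_:
from sklearn.model_selection import ParameterGrid, cross_validate
for param in ParameterGrid(params):
    cv_res = cross_validate(SVC(**param), X, y, cv=2, return_estimator=True)
    for i, est in enumerate(cv_res["estimator"]):
        print(param, "fold", i, "n_support_:", est.n_support_)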
I know that the SVM (specifically linear SVC) has an optional parameter probability=True when you instantiate it, and model.predict_proba() is then supposed to give the probability of each prediction along with the label (1 or 0). However, I keep getting the numpy error "use all() on a 1-dimensional array" when I call predict_proba(), and I can only figure out how to get a prediction in the form of a label (1 or 0) using model.predict().
The documentation example works fine for me when setting the flag probability=True. The problem has to be in your input data. Try this very simple example:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.predict_proba([[-0.8, -1]]))
You can use CalibratedClassifierCV.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
model_svc = LinearSVC()
model = CalibratedClassifierCV(model_svc)
model.fit(X_train, y_train)
pred_class = model.predict(X_test)
probability = model.predict_proba(X_test)
You will get the predicted probability scores as array values.