I tried to use GridSearchCV for a multi-class case based on the answer from here:
Accelerating the prediction
But I got a ValueError: multiclass format is not supported.
How can I use this method for the multi-class case?
The following code is from the answer in the above link.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9, 0.3])
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
print(gscv.best_params_)
From the documentation on roc_auc_score:
Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format.
By "label indicator format", they mean each label value is represented as a binary column (rather than as a unique target value in a single column). You don't want to do that for your predictor, as it could result in non-mutually-exclusive predictions (i.e., predicting both label 2 and 4 for case p1, or predicting no labels for case p2).
Pick or custom-implement a scoring function that is well-defined for the multiclass problem, such as F1 score. Personally I find informedness more convincing than F1 score, and easier to generalize to the multiclass problem than roc_auc_score.
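For illustration, here is a minimal sketch of that idea (my own adaptation of the question's setup, not the linked answer's code, using the current sklearn.model_selection API; the data-generation parameters are placeholders):
# Sketch: score a multiclass SVC pipeline with macro-averaged F1, which is
# well-defined for more than two classes and yields the single number that
# GridSearchCV needs to rank parameter combinations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

X, y = make_classification(n_samples=3000, n_features=5, n_classes=3,
                           n_informative=3, random_state=0)
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
param_space = dict(svc__C=np.logspace(-5, 0, 5), svc__gamma=np.logspace(-2, 2, 10))

# average='macro' collapses the per-class F1 scores into one value
my_scorer = make_scorer(f1_score, average='macro')

gscv = GridSearchCV(pipe, param_space, scoring=my_scorer, cv=5)
gscv.fit(X, y)
print(gscv.best_params_)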
GridSearchCV supports multi-class scoring. You can set the scoring parameter to 'f1_macro', for example:
gsearch1 = GridSearchCV(estimator=est1, param_grid=params_test1, scoring='f1_macro', cv=5, n_jobs=-1)
Or scoring='roc_auc_ovr'.
GridSearchCV supports multi-class problems naturally when the chosen metric already accepts the classifier's y_true and y_pred/y_score formats by default.
Otherwise, you have to do some customization using make_scorer.
For common metrics such as AUROC for multiple classes, sklearn offers the 'roc_auc_ovr' scorer, which in the source file is defined as
roc_auc_ovr_scorer = make_scorer(roc_auc_score, needs_proba=True,
                                 multi_class='ovr')
To handle a multi-class problem with a classifier such as LogisticRegression, the one-vs-rest ('ovr') setting is required and y_true must contain the categorical class labels. With that, the setting above works directly.
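As a hedged illustration (the data set and parameter grid here are my own, not part of the original answer), the built-in string can be passed straight to GridSearchCV for any classifier that exposes predict_proba:
# Sketch: multiclass AUROC (one-vs-rest) via the built-in 'roc_auc_ovr' scorer string.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)   # y holds the categorical class labels 0/1/2
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.1, 1.0, 10.0]},
                    scoring='roc_auc_ovr', cv=5)
grid.fit(X, y)
print(grid.best_params_)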
Some other binary-classification metrics can also be extended to the multi-class case by wrapping the respective function. E.g., average_precision_score can be wrapped as
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import OneHotEncoder

def multi_auprc(y_true_cat, y_score):
    # one-hot encode the categorical labels into the indicator format expected
    # by average_precision_score for the multilabel/multiclass case
    y_true = OneHotEncoder().fit_transform(y_true_cat.reshape(-1, 1)).toarray()
    return average_precision_score(y_true, y_score)
The metric can then be defined for GridSearchCV as
{
    'auprc': make_scorer(multi_auprc, needs_proba=True, greater_is_better=True)
}
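As a hedged sketch (the estimator and parameter grid below are placeholders of my own, not from the original answer), that dictionary can then be passed to GridSearchCV together with a refit key:
# Sketch: wire the wrapped multiclass AUPRC into GridSearchCV.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),      # placeholder estimator with predict_proba
    param_grid={'C': [0.1, 1.0, 10.0]},     # placeholder grid
    scoring={'auprc': make_scorer(multi_auprc, needs_proba=True,
                                  greater_is_better=True)},
    refit='auprc',    # with dict scoring, refit must name one of the keys
    cv=5,
)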
I am working on a simple multioutput classification problem and noticed this error showing up whenever running the below code:
ValueError: Target is multilabel-indicator but average='binary'. Please
choose another average setting, one of [None, 'micro', 'macro', 'weighted', 'samples'].
I understand the problem it is referencing, i.e., when evaluating multilabel models one needs to explicitly set the type of averaging. Nevertheless, I am unable to figure out where this average argument should go, since only the accuracy_score, precision_score, and recall_score built-in functions have this argument, and I do not use them explicitly in my code. Moreover, since I am doing a RandomizedSearch, I cannot just pass precision_score(average='micro') to the scoring or refit arguments either, since precision_score() requires the true and predicted y labels to be passed. This is why this former SO question and this one here, both with a similar issue, didn't help.
My code with example data generation is as follows:
from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
X, Y = make_multilabel_classification(
    n_samples=1000,
    n_features=2,
    n_classes=5,
    n_labels=2
)
pipe = Pipeline(
    steps=[
        ('scaler', MinMaxScaler()),
        ('model', MultiOutputClassifier(MultinomialNB()))
    ]
)
search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions={'model__estimator__alpha': (0.01, 1)},
    scoring=['accuracy', 'precision', 'recall'],
    refit='precision',
    cv=5
).fit(X, Y)
What am I missing?
From the scikit-learn docs, I see that you can pass a callable that returns a dictionary where the keys are the metric names and the values are the metric scores. This means you can write your own scoring function, which has to take the estimator, X_test, and y_test as inputs. It must then compute y_pred and use that to compute the scores you want, which you can do with the built-in metric functions, specifying whatever keyword arguments (such as average) they need. In code that would look like
from sklearn.metrics import accuracy_score, precision_score, recall_score

def my_scorer(estimator, X_test, y_test) -> dict[str, float]:
    y_pred = estimator.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='micro'),
        'recall': recall_score(y_test, y_pred, average='micro'),
    }
search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions={'model__estimator__alpha': (0.01, 1)},
    scoring=my_scorer,
    refit='precision',
    cv=5
).fit(X, Y)
From the table of scoring metrics, note f1_micro, f1_macro, etc., and the notes "suffixes apply as with ‘f1’" given for precision and recall. So e.g.
search = RandomizedSearchCV(
    ...
    scoring=['accuracy', 'precision_micro', 'recall_macro'],
    ...
)
I have a classification problem where I want to get the roc_auc value using cross_validate in sklearn. My code is as follows.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = ('accuracy', 'roc_auc'))
However, I get the following error.
ValueError: multiclass format is not supported
Please note that I selected roc_auc specifically because it supports both binary and multiclass classification, as mentioned in: https://scikit-learn.org/stable/modules/model_evaluation.html
I have a binary classification dataset too. Please let me know how to resolve this error.
I am happy to provide more details if needed.
By default multi_class='raise', so you need to change this explicitly.
From the docs:
multi_class {‘raise’, ‘ovr’, ‘ovo’}, default=’raise’
Multiclass only. Determines the type of configuration to use. The
default value raises an error, so either 'ovr' or 'ovo' must be passed
explicitly.
'ovr':
Computes the AUC of each class against the rest [3] [4]. This treats
the multiclass case in the same way as the multilabel case. Sensitive
to class imbalance even when average == 'macro', because class
imbalance affects the composition of each of the ‘rest’ groupings.
'ovo':
Computes the average AUC of all possible pairwise combinations of
classes [5]. Insensitive to class imbalance when average == 'macro'.
Solution:
Use make_scorer (docs):
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
myscore = make_scorer(roc_auc_score, multi_class='ovo', needs_proba=True)
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = myscore)
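On recent scikit-learn versions (0.22+), the same result can also be obtained without make_scorer, since predefined multiclass AUC scorer strings exist; a minimal sketch reusing clf, X, and y from above:
# Sketch: predefined strings 'roc_auc_ovr', 'roc_auc_ovo' (and their '_weighted'
# variants) avoid building the scorer by hand.
cross_validate(clf, X, y, cv=10, scoring=('accuracy', 'roc_auc_ovo'))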
I would like sklearn's cross_val_score function to return the accuracy for each of the classes instead of the average accuracy over all classes.
Function:
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None,
    scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None,
    pre_dispatch='2*n_jobs', error_score='raise-deprecating')
Reference
How can I do it?
This is not possible with cross_val_score. The approach you suggest would mean cross_val_score would have to return an array of arrays. However, if you look at the source code, you will see that the output of cross_val_score has to be:
Returns
-------
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.
As a result, cross_val_score checks if the scoring method you are using is multimetric or not. If it is, it will throw you an error like:
ValueError: scoring must return a number, got ... instead
Edit:
As correctly pointed out in a comment above, an alternative is to use cross_validate instead. Here is how it would work on the Iris dataset, for instance:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, recall_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
scoring = {'recall0': make_scorer(recall_score, average = None, labels = [0]),
'recall1': make_scorer(recall_score, average = None, labels = [1]),
'recall2': make_scorer(recall_score, average = None, labels = [2])}
cross_validate(DecisionTreeClassifier(),X,y, scoring = scoring, cv = 5, return_train_score = False)
Note that this is also supported by GridSearchCV (see the sketch after the note below).
NB: You cannot return "accuracy by each class"; I guess you meant recall, which is basically the proportion of correct predictions among the data points that actually belong to a class.
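A hedged sketch of the GridSearchCV variant (the parameter grid is a placeholder of my own; each scorer here uses average='macro' restricted to a single label so that it returns a scalar, which GridSearchCV requires for ranking):
# Sketch: per-class recall as multiple metrics in GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

# average='macro' restricted to one label yields that class's recall as a scalar.
per_class_scoring = {
    'recall0': make_scorer(recall_score, average='macro', labels=[0]),
    'recall1': make_scorer(recall_score, average='macro', labels=[1]),
    'recall2': make_scorer(recall_score, average='macro', labels=[2]),
}
grid = GridSearchCV(DecisionTreeClassifier(),
                    param_grid={'max_depth': [2, 3, 4]},   # placeholder grid
                    scoring=per_class_scoring,
                    refit='recall0',   # multimetric scoring needs an explicit refit key
                    cv=5)
grid.fit(X, y)
print(grid.cv_results_['mean_test_recall1'])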
I am trying to learn how to find the best parameters for a classifier, so I am using GridSearchCV for a multi-class classification problem. Dummy code was generated in Does not GridSearchCV support multi-class? I am just using that code with n_classes=3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler,label_binarize
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.1, 0.9, 0.3],n_classes=3, n_clusters_per_class=1,n_informative=2)
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
f1_score
my_scorer = make_scorer(f1_score, greater_is_better=True)
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
I am trying to do one-hot encoding as advised here: Scikit-learn GridSearch giving "ValueError: multiclass format is not supported" error. Also, sometimes there will be datasets like the Toxic Comment Classification dataset on Kaggle that give you binarized labels.
y = label_binarize(y, classes=[0, 1, 2])
for i in classes:
    gscv.fit(X, y[i])
    print(gscv.best_params_)
I am getting:
ValueError: bad input shape (2000L, 3L)
I am not sure why I am getting this error. My objective is to find the best parameters for a multi-class classification problem.
There are two problems in the two parts of your code.
1) Let's start with the first part, where you have not one-hot encoded the labels. SVC supports the multi-class case just fine, but f1_score, when used inside GridSearchCV, does not work out of the box.
By default, f1_score returns the score of the positive label in the binary case, so it throws an error for a multi-class y.
Alternatively, it can return an array of scores (one for each class), but GridSearchCV only accepts a single value as the score, because it needs a single number to find the best score and best combination of hyper-parameters. So you need to pass an averaging method to f1_score to get a single value from the array.
According to the f1_score documentation, following averaging methods are allowed:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’,
‘samples’, ‘weighted’]
So change your make_scorer like this:
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')
Change the 'average' param above as it suits you.
2) Now coming to the second part: when you one-hot encode the labels, the shape of y becomes 2-d, but SVC only supports a 1-d array as y, as specified in the documentation:
fit(X, y, sample_weight=None)[source]
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
But even if you encode the labels and use a classifier that supports 2-d labels, the first error would still have to be solved. So I would advise you not to one-hot encode the labels and just change the f1_score averaging, as sketched below.
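Putting the two fixes together, a minimal sketch (not the asker's exact data, and reusing pipe and param_space from the question): drop the label_binarize step, keep y as the original 1-d array of class labels, and give f1_score an averaging method:
# Sketch: keep y 1-d and average the F1 scores so GridSearchCV gets a single number.
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer, cv=5)
gscv.fit(X, y)              # y is the original 1-d label vector, not label_binarize(y)
print(gscv.best_params_)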
I am using scikit-learn's cross_val_predict function to perform cross-validation with an LDA classifier.
While I am doing a binary prediction, I would like to use this function to obtain the "raw" predictions as decimals before they are converted into binary labels.
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
cv = ShuffleSplit(n_splits=5, random_state=1)
scores = cross_val_score(clf, final_list, lab_list, cv=cv, scoring='roc_auc')
pred = cross_val_predict(clf, final_list, lab_list, cv=5)
Currently, pred is a binary list, whereas I would like a decimal output in order to perform further statistical analysis. Is this possible with the function used?
Thanks!
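If it helps, cross_val_predict accepts a method argument, so continuous outputs can be requested instead of hard class labels; a minimal sketch reusing the variables above:
# Sketch: ask cross_val_predict for continuous outputs instead of hard labels.
# method='predict_proba' yields class probabilities (one column per class);
# method='decision_function' yields the raw discriminant scores.
proba = cross_val_predict(clf, final_list, lab_list, cv=5, method='predict_proba')
raw_scores = cross_val_predict(clf, final_list, lab_list, cv=5, method='decision_function')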