I am trying to learn how to find the best parameters for a classifier, so I am using GridSearchCV for a multi-class classification problem. The dummy code below was taken from Does not GridSearchCV support multi-class?; I am just using that code with n_classes=3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.1, 0.9, 0.3],
                           n_classes=3, n_clusters_per_class=1, n_informative=2)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
param_space = dict(svc__C=np.logspace(-5, 0, 5), svc__gamma=np.logspace(-2, 2, 10))

my_scorer = make_scorer(f1_score, greater_is_better=True)
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
I am trying to one-hot encode the labels, as advised in Scikit-learn GridSearch giving "ValueError: multiclass format is not supported" error. Also, sometimes a dataset, like the Toxic Comment Classification dataset on Kaggle, will already give you binarized labels.
y = label_binarize(y, classes=[0, 1, 2])

for i in classes:
    gscv.fit(X, y[i])
    print(gscv.best_params_)
I am getting:
ValueError: bad input shape (2000L, 3L)
I am not sure why I am getting this error. My objective is to find the best parameters for a multi-class classification problem.
There are two problems in the two parts of your code.
1) Let's start with the first part, where you have not one-hot encoded the labels. SVC supports the multi-class case just fine, but f1_score, when used inside GridSearchCV, does not work out of the box.
By default, f1_score returns the score of the positive label, which only exists in binary classification, so it throws an error in your case.
It can also return an array of scores (one per class), but GridSearchCV only accepts a single value as the score, because it needs a single number to find the best score and the best combination of hyper-parameters. So you need to pass an averaging method to f1_score to collapse the array into a single value.
According to the f1_score documentation, the following averaging methods are allowed:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
So change your make_scorer like this:
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')
Change the 'average' param above as it suits you.
2) Now to the second part: when you one-hot encode the labels, the shape of y becomes 2-d, but SVC only supports a 1-d array for y, as specified in the documentation:
fit(X, y, sample_weight=None)[source]
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
But even if you encode the labels and use a classifier which supports 2-d labels, the first error still has to be solved. So I would advise you not to one-hot encode the labels and just change the f1_score call.
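Putting both fixes together, here is a minimal sketch for a current scikit-learn version; the dataset parameters are placeholders, y stays a plain 1-d array of integer labels, and the scorer uses micro averaging:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# y is a 1-d array of integer class labels (0, 1, 2) -- no one-hot encoding
X, y = make_classification(n_samples=3000, n_features=10, n_classes=3,
                           n_clusters_per_class=1, n_informative=3)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
param_space = dict(svc__C=np.logspace(-5, 0, 5), svc__gamma=np.logspace(-2, 2, 10))

# micro-averaged F1 collapses the per-class scores into one number for GridSearchCV
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')

gscv = GridSearchCV(pipe, param_space, scoring=my_scorer, cv=3)
gscv.fit(X, y)
print(gscv.best_params_)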
Related
I have a logistic regression that I want to know the AUC for.
I created the logistic regression model using statsmodels:
import statsmodels.api as sm
y = generate_data(dependent_var) # pseudocode
X = generate_data(independent_var) # pseudocode
X['constant'] = 1
logit_model=sm.Logit(y,X)
result=logit_model.fit()
Then, I use sklearn to get an AUC score for my model predictions:
from sklearn.metrics import roc_auc_score
roc_auc_score(y, result.predict())
The code runs and I get an AUC score; I just want to make sure I am passing variables between the package calls correctly.
Input for roc_auc_score:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
sklearn.metrics.roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)[source]
y_true array-like of shape (n_samples,) or (n_samples, n_classes)
True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).
y_score array-like of shape (n_samples,) or (n_samples, n_classes)
Target scores.
To validate the input parameters:
y_true: correct
y_score: result.predict(X), based on https://www.statsmodels.org/devel/examples/notebooks/generated/predict.html
But your validation is missing the concept of a train/test split, which is necessary in machine learning. Normally you always divide the dataset into a training part and a test part. In pseudo code this would look like:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

y = generate_data(dependent_var)   # pseudocode
X = generate_data(independent_var) # pseudocode
X['constant'] = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()

roc_auc_score(y_test, result.predict(X_test))
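A runnable version of that sketch, with make_classification standing in for the generate_data pseudocode (the dataset and split parameters are just placeholders):

import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# synthetic stand-in for the real dependent/independent variables
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# statsmodels does not add an intercept automatically
X = sm.add_constant(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

result = sm.Logit(y_train, X_train).fit()

# score on data the model has not seen during fitting
print(roc_auc_score(y_test, result.predict(X_test)))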
I have a classification problem where I want to get the roc_auc value using cross_validate in sklearn. My code is as follows.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = ('accuracy', 'roc_auc'))
However, I get the following error.
ValueError: multiclass format is not supported
Please note that I selected roc_auc specifically because it supports both binary and multiclass classification, as mentioned in: https://scikit-learn.org/stable/modules/model_evaluation.html
I have a binary classification dataset too. Please let me know how to resolve this error.
I am happy to provide more details if needed.
By default, multi_class='raise', so you need to change this explicitly.
From the docs:
multi_class {‘raise’, ‘ovr’, ‘ovo’}, default=’raise’
Multiclass only. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
'ovr': Computes the AUC of each class against the rest [3] [4]. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the ‘rest’ groupings.
'ovo': Computes the average AUC of all possible pairwise combinations of classes [5]. Insensitive to class imbalance when average == 'macro'.
Solution:
Use make_scorer (docs):
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
myscore = make_scorer(roc_auc_score, multi_class='ovo', needs_proba=True)
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = myscore)
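On recent scikit-learn versions (0.22+), the multi-class AUC scorers are also available as built-in scoring strings, so the custom scorer is optional:

# clf, X, y as defined above
cross_validate(clf, X, y, cv=10, scoring=('accuracy', 'roc_auc_ovo'))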
I need to set a specific threshold value and generate a confusion matrix. The data is in a csv file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
First, I received an error message: "AttributeError: predict_proba is not available when probability=False"
So I used this as a correction:
svc = SVC(C=1e9,gamma= 1e-07)
scv_calibrated = CalibratedClassifierCV(svc)
svc_model = scv_calibrated.fit(X_train, y_train)
I read a lot on the internet, but I didn't quite understand how a specific threshold value can be customized. It sounds pretty hard.
Now I get the wrong output:
array([[ 0, 0],
[5359, 65]])
I have no idea what is wrong. I need help, and I'm new at this.
Thanks.
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split

    svc = SVC(C=1e9, gamma=1e-07)
    scv_calibrated = CalibratedClassifierCV(svc)
    svc_model = scv_calibrated.fit(X_train, y_train)

    # set threshold as -220
    y_pred = (svc_model.predict_proba(X_test)[:, 1] >= -220)

    conf_matrix = confusion_matrix(y_pred, svc_model.predict(X_test))

    return conf_matrix

answer_four()
This function should return a confusion matrix, a 2x2 numpy array with 4 integers.
This code produces the expected output. Besides the fact that in the previous code I was using the confusion matrix incorrectly, I should also have used decision_function and filtered its output at the -220 threshold.
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split

    # SVC without specifying a kernel: the default is rbf
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)

    # decision_function returns confidence scores for the samples
    y_score = svc.decision_function(X_test)

    # apply the -220 threshold after training: scores above it become class 1,
    # the rest class 0 (the threshold is the separation boundary between the classes)
    y_score = np.where(y_score > -220, 1, 0)

    conf_matrix = confusion_matrix(y_test, y_score)

    return conf_matrix

answer_four()
#output:
array([[5320, 24],
[ 14, 66]])
You are using the confusion matrix in a wrong way.
The idea behind the confusion matrix is to have a picture as to how good our predictions y_pred are compared with the ground truth y_true, usually in a test set.
What you actually do here is computing a "confusion matrix" between your predictions with the custom threshold of -220 (y_pred), compared to some other predictions with the default threshold (the output of svc_model.predict(X_test)), which does not make any sense.
Your ground truth for the test set is y_test; so, to get the confusion matrix with the default threshold, you should use
confusion_matrix(y_test, svc_model.predict(X_test))
To get the confusion matrix with your custom threshold of -220, you should use
confusion_matrix(y_test, y_pred)
See the documentation for more details on the usage (the docs are your best friend, and should always be the first place to look when you have issues or doubts).
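Putting it together, a minimal sketch of both comparisons, assuming the X_train/X_test/y_train/y_test split from your code:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)

# confusion matrix at the default threshold (0 on the decision function)
print(confusion_matrix(y_test, svc.predict(X_test)))

# confusion matrix at the custom threshold of -220 on the decision function
y_pred_custom = (svc.decision_function(X_test) > -220).astype(int)
print(confusion_matrix(y_test, y_pred_custom))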
I am using logistic regression for prediction. My predictions are 0's and 1's. After training my model on the given data, and also when training on the important features (X_important_train), I get a score of around 70%, but when I use roc_auc_score(X, y) or roc_auc_score(X_important_train, y_train) I get a value error:
ValueError: multiclass-multioutput format is not supported
Code:
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

model.fit(X_important_train, y_train)
model.score(X_important_train, y_train)

roc_auc_score(X_important_train, y_train)
First of all, the roc_auc_score function expects its two input arrays (the true labels and the scores) to have the same shape.
sklearn.metrics.roc_auc_score(y_true, y_score, average=’macro’, sample_weight=None)
Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format.
y_true : array, shape = [n_samples] or [n_samples, n_classes]
True binary labels in binary label indicators.
y_score : array, shape = [n_samples] or [n_samples, n_classes]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
Now, the inputs are the true labels and the predicted scores, NOT the training data and labels, as you use them in the example you posted.
In more detail,
model.fit(X_important_train, y_train)
model.score(X_important_train, y_train)
# this is wrong here
roc_auc_score(X_important_train, y_train)
You should do something like:
y_pred = model.predict(X_test_data)
roc_auc_score(y_true, y_pred)
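For example, a minimal sketch with placeholder names (X_test and y_test are assumed to be your held-out data); note that for ROC AUC you would normally pass probability estimates or decision scores rather than hard 0/1 predictions:

# probability of the positive class on the held-out data
y_score = model.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_score)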
I tried to use GridSearchCV for a multi-class case based on the answer from here:
Accelerating the prediction
But I got a value error: multiclass format is not supported.
How can I use this method for multi-class case?
Following code is from the answer in above link.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer

X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9, 0.3])

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
param_space = dict(svc__C=np.logspace(-5, 0, 5), svc__gamma=np.logspace(-2, 2, 10))

my_scorer = make_scorer(roc_auc_score, greater_is_better=True)

gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
print(gscv.best_params_)
From the documentation on roc_auc_score:
Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format.
By "label indicator format", they mean each label value is represented as a binary column (rather than as a unique target value in a single column). You don't want to do that for your predictor, as it could result in non-mutually-exclusive predictions (i.e., predicting both label 2 and 4 for case p1, or predicting no labels for case p2).
Pick or custom-implement a scoring function that is well-defined for the multiclass problem, such as F1 score. Personally I find informedness more convincing than F1 score, and easier to generalize to the multiclass problem than roc_auc_score.
It supports multi-class.
You can set the scoring parameter to 'f1_macro', for example:
gsearch1 = GridSearchCV(estimator=est1, param_grid=params_test1, scoring='f1_macro', cv=5, n_jobs=-1)
Or use scoring='roc_auc_ovr'.
GridSearchCV supports multi-class naturally, as long as the scoring metric works with the classifier's default output for y_true and y_pred/y_score.
Otherwise, one has to do some customization of the score function using make_scorer.
For common metrics like AUROC in the multi-class case, sklearn offers the 'roc_auc_ovr' string, which actually refers to
roc_auc_ovr_scorer = make_scorer(roc_auc_score, needs_proba=True,
multi_class='ovr')
as in the source file.
To deal with a multi-class problem with a classifier such as LogisticRegression, ovr is required and y_true is in the format of categorical values. The above setting will then work directly.
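For instance, a minimal sketch with LogisticRegression and the built-in 'roc_auc_ovr' scoring string (the iris data and the parameter grid are just placeholders; assumes scikit-learn 0.22+):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)  # y holds categorical class labels 0, 1, 2

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.1, 1.0, 10.0]},
                    scoring='roc_auc_ovr',
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)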
Some other metrics defined for binary classification can also be extended by wrapping the respective function. E.g., average_precision_score can be wrapped as
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import average_precision_score

def multi_auprc(y_true_cat, y_score):
    y_true = OneHotEncoder().fit_transform(y_true_cat.reshape(-1, 1)).toarray()
    return average_precision_score(y_true, y_score)
The metric can then be defined for GridSearchCV as
{
'auprc': make_scorer(multi_auprc, needs_proba=True, greater_is_better=True)
}
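It can then be plugged into GridSearchCV like any other scorer. A minimal sketch, with the iris data and a RandomForestClassifier as placeholders (on newer scikit-learn versions make_scorer may expect response_method='predict_proba' instead of needs_proba=True):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [50, 100]},
    scoring={'auprc': make_scorer(multi_auprc, needs_proba=True, greater_is_better=True)},
    refit='auprc',  # with a dict of scorers, tell GridSearchCV which one to refit on
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)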