UserWarning with GridSearchCV scoring and refit params - python

I am using GridSearchCV to tune my MultinomialNB (MNB) model. However, I keep getting a UserWarning when setting the GridSearchCV scoring and refit params. I've read the docs and related StackOverflow questions over and over and followed the answers, but I'm still getting errors. It works when using only one metric, though. I am having a hard time understanding what's wrong.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
parameters = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]}
scorers = {
    "accuracy": make_scorer(accuracy_score),
    "precision": make_scorer(precision_score),
    "recall": make_scorer(recall_score),
    "f1_score": make_scorer(f1_score)
}
mnb = MultinomialNB()
classifier = GridSearchCV(mnb, parameters, return_train_score=False, cv=10, scoring=scorers, refit='accuracy')
classifier.fit(x_train_features, y)
Error
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:774: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 761, in _score
scores = scorer(estimator, X_test, y_test)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 103, in __call__
score = scorer._score(cached_call, estimator, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 264, in _score
return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py", line 1765, in precision_score
zero_division=zero_division,
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py", line 1544, in precision_recall_fscore_support
labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py", line 1357, in _check_set_wise_labels
f"pos_label={pos_label} is not a valid label. It "
ValueError: pos_label=1 is not a valid label. It should be one of ['Negative', 'Positive']

I didn't know how to override the pos_label parameter in the scorers with the above setup, so I
replaced my dataset labels instead: Positive -> 1 and Negative -> 0.
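For the record, make_scorer forwards extra keyword arguments to the underlying metric function, so the string labels could also have been kept by setting pos_label explicitly. A minimal sketch (treating 'Positive' as the positive class, which is an assumption based on the error message):
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# pos_label tells the binary metrics which label counts as the positive class;
# 'Positive' is assumed here from the labels listed in the ValueError
scorers = {
    "accuracy": make_scorer(accuracy_score),
    "precision": make_scorer(precision_score, pos_label='Positive'),
    "recall": make_scorer(recall_score, pos_label='Positive'),
    "f1_score": make_scorer(f1_score, pos_label='Positive')
}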

Related

Implementing GridSearchCV and Pipelines to perform Hyperparameters Tuning for KNN Algorithm

I have been reading about performing hyperparameter tuning for the KNN algorithm, and understood that the best practice is to make sure that for each fold my dataset is normalized and oversampled using a pipeline (to avoid data leakage and overfitting).
What I'm trying to do is identify the best number of neighbors (n_neighbors) that gives me the best training accuracy. In the code I have set the number of neighbors to the list range(1, 50) and the number of cross-validation folds to cv=10.
My code below:
# dataset reading & preprocessing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# oversampling
from imblearn.over_sampling import SMOTE
# KNN model related libraries
import cuml
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier
# loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")
# filling missing values with zeros
df = df.fillna(0)
# replace the byte-string objects with numeric strings
df["command response"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
df["binary result"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
# change the datatype of some features so they can be used later
df["command response"] = pd.to_numeric(df["command response"]).astype(float)
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)
# dataset splitting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]
# splitting the dataset into train and test for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)
# making a pipeline that normalizes, oversamples and classifies, before GridSearchCV
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='eculidean', output='input'))
])
# using GridSearchCV
neighbors = list(range(1, 50))
parameters = {
    'classifier__n_neighbors': neighbors
}
grid_search = GridSearchCV(pipe, parameters, cv=10)
grid_search.fit(X_train, y_bin_train)
print("Best Accuracy: {}" .format(grid_search.best_score_))
print("Best num of neighbors: {}" .format(grid_search.best_estimator_.get_params()['n_neighbors']))
At the grid_search.fit(X_train, y_bin_train) step, the program keeps repeating this error:
/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:619: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py", line 266, in fit
self._final_estimator.fit(Xt, yt, **fit_params_last_step)
File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/neighbors/kneighbors_classifier.pyx", line 176, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.fit
File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/neighbors/nearest_neighbors.pyx", line 397, in cuml.neighbors.nearest_neighbors.NearestNeighbors.fit
ValueError: Metric is not valid. Use sorted(cuml.neighbors.VALID_METRICS[brute]) to get valid options.
I'm not sure which side this error is coming from: is it because I'm importing the KNN algorithm from the cuML library instead of sklearn? Or is there something wrong with my Pipeline and GridSearchCV implementation?
This error indicates you've passed an invalid value for the metric parameter (in both scikit-learn and cuML). You've misspelled "euclidean".
import cuml
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier

X, y = datasets.make_classification(
    n_samples=100
)

pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='euclidean', output='input'))
])

parameters = {
    'classifier__n_neighbors': [1, 3, 6]
}

grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)
GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('normalization', MinMaxScaler()),
                                       ('oversampling', SMOTE()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid={'classifier__n_neighbors': [1, 3, 6]})
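One side observation on the original snippet (not part of the answer above): with a pipeline, the KNN parameters live under the step prefix, so the final print in the question would need the prefixed key, for example:
# nested pipeline parameters are addressed as '<step>__<param>'
print("Best num of neighbors: {}".format(grid_search.best_params_['classifier__n_neighbors']))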

Error when trying to run a GridSearchCV on sklearn Pipeline

I'm trying to run a sklearn pipeline with TFIDF vectorizer and XGBoost Classifier through a GridSearchCV, but it doesn't work because of an internal error. The data is 4000 sentences, marked either true or false (1 or 0). This is the code:
import numpy as np
import pandas as pd
from gensim import utils
import gensim.parsing.preprocessing as gsp
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")
train_x = train.iloc[:, 0]
train_y = train.iloc[:, 1]
test_x = test.iloc[:, 0]
test_y = test.iloc[:, 1]
folds = 4
xgb_parameters = {
    'xgboost__n_estimators': [1000, 1500],
    'xgboost__max_depth': [12, 15],
    'xgboost__learning_rate': [0.1, 0.12],
    'xgboost__objective': ['binary:logistic']
}
model = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                        ('xgboost', xgb.XGBClassifier())])
gs_cv = GridSearchCV(estimator=model,
                     param_grid=xgb_parameters,
                     n_jobs=1,
                     refit=True,
                     cv=2,
                     scoring=f1_score)
gs_cv.fit(train_x, train_y)
But I am getting an error:
>>> gs_cv.fit(train_x, train_y)
C:\Users\draga\miniconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[21:31:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass labels=0 0
1 1
2 1
3 0
4 1
..
2004 0
2005 0
2008 0
2009 0
2012 0
Name: Bad Sentence, Length: 2000, dtype: int64 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py:683: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 674, in _score
scores = scorer(estimator, X_test, y_test)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 74, in inner_f
return f(**kwargs)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1068, in f1_score
return fbeta_score(y_true, y_pred, beta=1, labels=labels,
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1192, in fbeta_score
_, _, f, _ = precision_recall_fscore_support(y_true, y_pred,
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1461, in precision_recall_fscore_support
labels = _check_set_wise_labels(y_true, y_pred, average, labels,
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1274, in _check_set_wise_labels
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 83, in _check_targets
check_consistent_length(y_true, y_pred)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
lengths = [_num_samples(X) for X in arrays if X is not None]
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
lengths = [_num_samples(X) for X in arrays if X is not None]
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 192, in _num_samples
raise TypeError(message)
TypeError: Expected sequence or array-like, got <class 'sklearn.pipeline.Pipeline'>
What could be the problem?
Do I need to include the transform method for TfidfVectorizer() in the pipeline?
The main problem is your scoring parameter for the search. Scorers for hyperparameter tuners in sklearn need to have the signature (estimator, X, y). You can use the make_scorer convenience function, or in this case just pass the name as a string, scoring="f1".
See the docs, the list of builtins and information on signatures.
(You do not need to explicitly use the transform method; that's handled internally by the pipeline.)
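A minimal sketch of the corrected call, assuming the model and xgb_parameters defined above:
from sklearn.metrics import f1_score, make_scorer

# pass the built-in scorer name; scoring=make_scorer(f1_score) would work too
gs_cv = GridSearchCV(estimator=model,
                     param_grid=xgb_parameters,
                     n_jobs=1,
                     refit=True,
                     cv=2,
                     scoring="f1")
gs_cv.fit(train_x, train_y)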

Getting error when trying to use cross validation

Trying to get a result out, but getting this error instead:
C:\Users\my_is\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:548: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\my_is\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\my_is\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 890, in fit
super().fit(
File "C:\Users\my_is\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 181, in fit
check_classification_targets(y)
File "C:\Users\my_is\anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 172, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
Here is my code:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeClassifier
data = load_boston()
c = np.array([1 if y > np.median(data['target']) else 0 for y in data['target']])
X_train, X_test, c_train, c_test = train_test_split(data['data'], c, random_state=0)
tree = DecisionTreeClassifier()
tree.fit(X_train, c_train)
#print(data.target)
#logReg = LogisticRegression()
#logReg.fit(X_train, c_train)
#result = cross_validate(logReg, data.data, data.target, cv=5, return_train_score=True)
result = cross_validate(tree, data.data, data.target, cv=5, return_train_score=True)
display(result)
I am completely new to Python and ML; any help is appreciated.
You have a mistake here:
result = cross_validate(tree, data.data, data.target, cv=5, return_train_score=True)
Should be:
result = cross_validate(tree, data.data, c, cv=5, return_train_score=True)
data.target holds the continuous Boston house prices, which a DecisionTreeClassifier rejects as 'continuous'; c is the binarized version of the target that you already built and used for fitting.
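A quick way to see the difference, using sklearn's own target-type check:
from sklearn.utils.multiclass import type_of_target

print(type_of_target(data.target))  # 'continuous' -> rejected by classifiers
print(type_of_target(c))            # 'binary'     -> valid classification target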

ValueError: cannot use sparse input in 'SVC' trained on dense data

I'm trying to run my classifier but I get this error
import pandas
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
dataset = pd.read_csv('all_topics_limpo.csv', encoding = 'utf-8')
data = pandas.get_dummies(dataset['verbatim_corrige'])
labels = dataset['label']
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.2, random_state = 0)
count_vector = CountVectorizer()
tfidf = TfidfTransformer()
classifier = OneVsOneClassifier(SVC(kernel = 'linear', random_state = 100))
#classifier = LogisticRegression()
train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(X_train, y_train)
test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)
predicted = classifier.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, predicted, labels = labels))
print("F-score")
print(f1_score(y_test, predicted))
print(precision_score(y_test, predicted))
print(recall_score(y_test, predicted))
print("cross validation")
test_counts = count_vector.fit_transform(data)
test_tfidf = tfidf.fit_transform(test_counts)
scores = cross_validation.cross_val_score(classifier, test_tfidf, labels, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
My output error:
ValueError: cannot use sparse input in 'SVC' trained on dense data
I can not execute my code because of this problem and I am not understanding anything of what is happening.
Full error output:
Traceback (most recent call last):
File "classification.py", line 42, in
predicted = classifier.predict(test_tfidf)
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 584, in predict
Y = self.decision_function(X)
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in decision_function
for est, Xi in zip(self.estimators_, Xs)]).T
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in
for est, Xi in zip(self.estimators_, Xs)]).T
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 548, in predict
y = super(BaseSVC, self).predict(X)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 308, in predict
X = self._validate_for_predict(X)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 448, in _validate_for_predict
% type(self).__name__)
ValueError: cannot use sparse input in 'SVC' trained on dense data
You get this error because your training & test data are not of the same kind: while you train on your initial X_train set:
classifier.fit(X_train, y_train)
you are trying to get predictions from a dataset which has undergone count vectorization & tf-idf transformations first:
predicted = classifier.predict(test_tfidf)
It is puzzling why you chose to do so, why you nevertheless compute train_counts and train_tfidf (you don't seem to actually use them anywhere), and why you also try to redefine predicted as classifier.predict(X_test) immediately afterwards. Normally, changing your training line to
classifier.fit(train_tfidf, y_train)
and getting rid of your second predicted definition should work OK...
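A minimal sketch of the resulting consistent flow (fit the transformers on training data only, then reuse them on the test data):
# fit the vectorizer and transformer on the training text, then reuse them
train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(train_tfidf, y_train)

# apply the already-fitted transformers to the test text
test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)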
Alternatively, you can convert the sparse matrix to a dense array before predicting:
test_tfidf = tfidf.transform(test_counts).toarray()
predicted = classifier.predict(test_tfidf)

GridSearch for an estimator inside a OneVsRestClassifier

I want to perform GridSearchCV in a SVC model, but that uses the one-vs-all strategy. For the latter part, I can just do this:
model_to_set = OneVsRestClassifier(SVC(kernel="poly"))
My problem is with the parameters. Let's say I want to try the following values:
parameters = {"C":[1,2,4,8], "kernel":["poly","rbf"],"degree":[1,2,3,4]}
In order to perform GridSearchCV, I should do something like:
cv_generator = StratifiedKFold(y, k=10)
model_tunning = GridSearchCV(model_to_set, param_grid=parameters, score_func=f1_score, n_jobs=1, cv=cv_generator)
However, when I execute it I get:
Traceback (most recent call last):
File "/.../main.py", line 66, in <module>
argclass_sys.set_model_parameters(model_name="SVC", verbose=3, file_path=PATH_ROOT_MODELS)
File "/.../base.py", line 187, in set_model_parameters
model_tunning.fit(self.feature_encoder.transform(self.train_feats), self.label_encoder.transform(self.train_labels))
File "/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.py", line 354, in fit
return self._fit(X, y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.py", line 392, in _fit
for clf_params in grid for train, test in cv)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 473, in __call__
self.dispatch(function, args, kwargs)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 296, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 124, in __init__
self.results = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.py", line 85, in fit_grid_point
clf.set_params(**clf_params)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 241, in set_params
% (key, self.__class__.__name__))
ValueError: Invalid parameter kernel for estimator OneVsRestClassifier
Basically, since the SVC is inside a OneVsRestClassifier and that's the estimator I send to the GridSearchCV, the SVC's parameters can't be accessed.
In order to accomplish what I want, I see two solutions:
When creating the SVC, somehow tell it not to use the one-vs-one strategy but the one-vs-all.
Somehow indicate the GridSearchCV that the parameters correspond to the estimator inside the OneVsRestClassifier.
I'm yet to find a way to do any of the mentioned alternatives. Do you know if there's a way to do any of them? Or maybe you could suggest another way to get to the same result?
Thanks!
When you use nested estimators with grid search you can scope the parameters with __ as a separator. In this case the SVC model is stored as an attribute named estimator inside the OneVsRestClassifier model:
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score
iris = load_iris()
model_to_set = OneVsRestClassifier(SVC(kernel="poly"))
parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             score_func=f1_score)
model_tunning.fit(iris.data, iris.target)
print model_tunning.best_score_
print model_tunning.best_params_
That yields:
0.973290762737
{'estimator__kernel': 'poly', 'estimator__C': 1, 'estimator__degree': 2}
For Python 3, the following code should be used
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
iris = load_iris()
model_to_set = OneVsRestClassifier(SVC(kernel="poly"))
parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf"],
    "estimator__degree": [1, 2, 3, 4],
}
model_tunning = GridSearchCV(model_to_set, param_grid=parameters,
                             scoring='f1_weighted')
model_tunning.fit(iris.data, iris.target)
print(model_tunning.best_score_)
print(model_tunning.best_params_)
param_grid = {"estimator__alpha": [10**-5, 10**-3, 10**-1, 10**1, 10**2]}
clf = OneVsRestClassifier(SGDClassifier(loss='log',penalty='l1'))
model = GridSearchCV(clf,param_grid, scoring = 'f1_micro', cv=2,n_jobs=-1)
model.fit(x_train_multilabel, y_train)
