Error when trying to run a GridSearchCV on sklearn Pipeline

I'm trying to run a sklearn pipeline with TFIDF vectorizer and XGBoost Classifier through a GridSearchCV, but it doesn't work because of an internal error. The data is 4000 sentences, marked either true or false (1 or 0). This is the code:
import numpy as np
import pandas as pd
from gensim import utils
import gensim.parsing.preprocessing as gsp
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")
train_x = train.iloc[:, 0]
train_y = train.iloc[:, 1]
test_x = test.iloc[:, 0]
test_y = test.iloc[:, 1]
folds = 4
xgb_parameters = {
'xgboost__n_estimators': [1000, 1500],
'xgboost__max_depth': [12, 15],
'xgboost__learning_rate': [0.1, 0.12],
'xgboost__objective': ['binary:logistic']
}
model = Pipeline(steps=[('tfidf', TfidfVectorizer()),
('xgboost', xgb.XGBClassifier())])
gs_cv = GridSearchCV(estimator=model,
param_grid=xgb_parameters,
n_jobs=1,
refit=True,
cv=2,
scoring=f1_score)
gs_cv.fit(train_x, train_y)
But I am getting an error:
>>> gs_cv.fit(train_x, train_y)
C:\Users\draga\miniconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[21:31:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass labels=0 0
1 1
2 1
3 0
4 1
..
2004 0
2005 0
2008 0
2009 0
2012 0
Name: Bad Sentence, Length: 2000, dtype: int64 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py:683: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 674, in _score
scores = scorer(estimator, X_test, y_test)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 74, in inner_f
return f(**kwargs)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1068, in f1_score
return fbeta_score(y_true, y_pred, beta=1, labels=labels,
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1192, in fbeta_score
_, _, f, _ = precision_recall_fscore_support(y_true, y_pred,
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1461, in precision_recall_fscore_support
labels = _check_set_wise_labels(y_true, y_pred, average, labels,
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1274, in _check_set_wise_labels
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 83, in _check_targets
check_consistent_length(y_true, y_pred)
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
lengths = [_num_samples(X) for X in arrays if X is not None]
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
lengths = [_num_samples(X) for X in arrays if X is not None]
File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 192, in _num_samples
raise TypeError(message)
TypeError: Expected sequence or array-like, got <class 'sklearn.pipeline.Pipeline'>
What could be the problem?
Do I need to include the transform method for TfidfVectorizer() in the pipeline?

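The main problem is your scoring parameter for the search. Scorers for hyperparameter tuners in sklearn need to have the signature (estimator, X, y), whereas f1_score has the signature (y_true, y_pred); that is why the traceback shows the Pipeline arriving where an array of labels was expected. You can use the make_scorer convenience function, or in this case just pass the name as a string, scoring="f1".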
See the docs, the list of builtins and information on signatures.
(You do not need to explicitly use the transform method; that's handled internally by the pipeline.)
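To illustrate the two options, here is a minimal sketch. The toy sentences, the step name "clf" and the use of LogisticRegression in place of XGBClassifier are assumptions made only to keep the example small and self-contained; the scoring argument is the part that matters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, for illustration only.
texts = ["good movie", "bad movie", "great film", "awful film",
         "good film", "bad film", "great movie", "awful movie"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# LogisticRegression stands in for XGBClassifier to keep the sketch light.
model = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                        ("clf", LogisticRegression())])
param_grid = {"clf__C": [0.1, 1.0]}

# Option 1: name a built-in scorer with a string.
gs_string = GridSearchCV(model, param_grid, cv=2, scoring="f1")
gs_string.fit(texts, labels)

# Option 2: wrap the metric so it gets the (estimator, X, y) signature.
gs_wrapped = GridSearchCV(model, param_grid, cv=2,
                          scoring=make_scorer(f1_score))
gs_wrapped.fit(texts, labels)

print(gs_string.best_score_, gs_wrapped.best_score_)
```

Passing the bare f1_score function, as in the question, would instead call f1_score(estimator, X_test, y_test) and fail during scoring.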

Related

Userwarning with GridSearchCV scoring and refit params

I am using GridSearchCV for tuning my MNB model. However, I keep getting a UserWarning when setting the scoring and refit params of GridSearchCV. I've read the docs and other related StackOverflow questions over and over and followed the answers, but I am still getting errors. It works when using only one metric, though. I am having a hard time understanding what's wrong.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
parameters = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]}
scorers = {
"accuracy": make_scorer(accuracy_score),
"precision": make_scorer(precision_score),
"recall": make_scorer(recall_score),
"f1_score": make_scorer(f1_score)
}
mnb = MultinomialNB()
classifier = GridSearchCV(mnb, parameters, return_train_score=False, cv=10, scoring=scorers, refit='accuracy')
classifier.fit(x_train_features, y)
Error
UserWarning,
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:774: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 761, in _score
scores = scorer(estimator, X_test, y_test)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 103, in __call__
score = scorer._score(cached_call, estimator, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 264, in _score
return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py", line 1765, in precision_score
zero_division=zero_division,
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py", line 1544, in precision_recall_fscore_support
labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py", line 1357, in _check_set_wise_labels
f"pos_label={pos_label} is not a valid label. It "
ValueError: pos_label=1 is not a valid label. It should be one of ['Negative', 'Positive']

I don't know how to override the pos_label parameter of the score metrics in the above setup, so I replaced my dataset labels instead: Positive -> 1 and Negative -> 0.
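For reference, make_scorer forwards extra keyword arguments to the metric, so pos_label can be set without recoding the labels. The sketch below assumes made-up count features and the string labels 'Negative' and 'Positive' from the error message; it is not the original dataset.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Made-up non-negative count features and string labels, for illustration only.
rng = np.random.RandomState(0)
x_train_features = rng.randint(0, 5, size=(40, 6))
y = np.array(["Positive", "Negative"] * 20)

# Extra keyword arguments given to make_scorer are passed on to the metric.
scorers = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, pos_label="Positive"),
    "recall": make_scorer(recall_score, pos_label="Positive"),
    "f1_score": make_scorer(f1_score, pos_label="Positive"),
}

classifier = GridSearchCV(MultinomialNB(), {"alpha": [1, 0.1, 0.01]},
                          cv=5, scoring=scorers, refit="accuracy")
classifier.fit(x_train_features, y)
print(classifier.cv_results_["mean_test_precision"])
```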

my data has 14 attributes and 303 observations but when applying knn value of k greater than 1 is giving error

I am getting this error -
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 11
The data I am using has 14 attributes and 303 observations. I want the number of neighbors to be 11 (or anything greater than one), but this error shows up every time.
Here is my code:
import pandas as pd
header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_cleaned_data.csv', names= header_names)
training_sizes = [1,25,50,75,100,150,200]
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
features = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
target = 'num'
from sklearn.neighbors import KNeighborsClassifier
train_size, train_scores, validation_scores = learning_curve(estimator = KNeighborsClassifier(n_neighbors=1), X=dataset[features], y=dataset[target], train_sizes=training_sizes, cv=5, scoring='neg_log_loss')
Here is the traceback of the error:
Traceback (most recent call last):
File "E:\HCU proj doc\heart_disease_scaling_and_learning_curve.py", line 15, in <module>
train_size, train_scores, validation_scores = learning_curve(estimator = KNeighborsClassifier(n_neighbors=11), X=dataset[features], y=dataset[target], train_sizes=training_sizes, cv=5, scoring='neg_log_loss')
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 1128, in learning_curve
for train, test in train_test_proportions)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 779, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 625, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 588, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 111, in apply_async
result = ImmediateResult(func)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 332, in __init__
self.results = batch()
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 488, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 528, in _score
score = scorer(estimator, X_test, y_test)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\metrics\scorer.py", line 138, in __call__
y_pred = clf.predict_proba(X)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\neighbors\classification.py", line 190, in predict_proba
neigh_dist, neigh_ind = self.kneighbors(X)
File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\neighbors\base.py", line 347, in kneighbors
(train_size, n_neighbors)
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 11
What is the problem? What is going wrong in the code?
What should I do to remove the error?

Your task is binary. So when you set a training size of 1, only a single sample is passed to the scoring function (log_loss in this case).
So only one label, either 0.0 or 1.0, will be present. That's the error. You need to supply all the labels to the metric function so that it can calculate the loss.
To solve this, you can do multiple things:
1) Don't pass training_sizes to learning_curve, as @desertnaut said, and let it use the default. In that case, the training data will be divided into 5 equally spaced incremental parts which (in most cases) will contain both labels in the training set, and log_loss will automatically identify them to calculate the score.
2) Change the training_sizes values to something more meaningful. Maybe just remove the 1 from it.
training_sizes = [25,50,75,100,150,200]
This is working for me for your data.
3) Change the scoring param to pass all the labels explicitly to log_loss, so that even if you specify 1 in training_sizes, the log_loss method knows that the data have 2 labels and calculates the loss accordingly.
from sklearn.metrics import log_loss, make_scorer
# This will calculate the 'neg_log_loss' as you wanted, just with one extra param
# (newer scikit-learn releases use response_method='predict_proba' instead of needs_proba=True)
scorer = make_scorer(log_loss, greater_is_better=False,
                     needs_proba=True,
                     labels=[0.0, 1.0]) #<== This is what you need.
And then, do this:
....
....
train_size, train_scores, validation_scores = learning_curve(KNeighborsClassifier(n_neighbors=1),
                                                             X=dataset[features],
                                                             y=dataset[target],
                                                             train_sizes=training_sizes,
                                                             cv=5,
                                                             scoring=scorer) #<== Add that here
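Both failure modes that a training size of 1 can trigger are easy to reproduce in isolation. The following is a minimal sketch with made-up arrays (not the Cleveland data): the first part shows the n_neighbors check from the traceback, the second shows why log_loss needs the labels argument when only one class is present.

```python
import math
import numpy as np
from sklearn.metrics import log_loss
from sklearn.neighbors import KNeighborsClassifier

# 1) A model asking for 11 neighbors cannot be scored after fitting on one sample.
X_one, y_one = np.array([[0.0, 1.0]]), np.array([0])
X_val = np.array([[0.5, 0.5]])
knn_error = None
try:
    KNeighborsClassifier(n_neighbors=11).fit(X_one, y_one).predict_proba(X_val)
except ValueError as exc:
    knn_error = str(exc)
print(knn_error)

# 2) log_loss cannot infer both classes from a single true label...
single_label_error = None
try:
    log_loss([1], [[0.2, 0.8]])
except ValueError as exc:
    single_label_error = str(exc)
print(single_label_error)

# ...unless the full label set is supplied explicitly.
loss = log_loss([1], [[0.2, 0.8]], labels=[0, 1])
print(loss)  # equals -log(0.8)
```

With n_neighbors greater than 1, every entry of training_sizes has to be at least as large as n_neighbors, which is another reason to drop the 1.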

I suspect that the problem concerns the way you are defining your target vector.
Try replacing this:
target = 'num'
with this:
target = ['num']
Hope this helps.

RandomForest score method ValueError

I am trying to find the score of a given data set with respect to some training data. I have written the following code:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [[10,20,30,40,50,60,70,80,90], [10,20,30,40,50,60,70,80,90]]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
On compiling and trying to run the program I am getting the error:
Traceback (most recent call last):
File "trial.py", line 16, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 89, in _check_targets
raise ValueError("{0} is not supported".format(y_type))
ValueError: multiclass-multioutput is not supported
On checking the documentation related to the score method it says:
score(X, y, sample_weight=None)
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
Both X and y in my case are arrays, 2d arrays.
I also went through this question but I couldn't understand where I am going wrong.
EDIT
So as per the answer and the comments that follow, I have edited the program as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
randomForest = RandomForestClassifier(n_estimators = 200)
mlb = MultiLabelBinarizer()
li_train1 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_train2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
li_text1 = [100,200]
li_text2 = [[1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9]]
randomForest.fit(li_train1, li_train2)
output = randomForest.score(li_train1, li_text1)
After this edit I am getting the error:
Traceback (most recent call last):
File "trial.py", line 19, in <module>
output = randomForest.score(li_train1, li_text1)
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 349, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 172, in accuracy_score
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 82, in _check_targets
"".format(type_true, type_pred))
ValueError: Can't handle mix of binary and multiclass-multioutput

According to the documentation:
Warning: At present, no metric in sklearn.metrics supports the multioutput-multiclass classification task.
The score method invokes sklearn's accuracy metric but this isn't supported for the multi-class, multi-output classification problem you've defined.
It's not clear from your question if you really intend to solve a multi-class, multi-output problem. If that's not your intention, then you should restructure your input arrays.
If on the other hand you really want to solve this kind of problem, you'll simply need to define your own scoring function.
UPDATE
Since you are not solving a multi-class, multi-label problem you should restructure your data so that it looks something like this:
from sklearn.ensemble import RandomForestClassifier
randomForest = RandomForestClassifier(n_estimators=200)
# training data
X = [
[1,2,3,4,5,6,7,8,9],
[1,2,3,4,5,6,7,8,9]
]
y = [0,1]
# fit the model
randomForest.fit(X,y)
# test data
Xtest = [
[1,2,0,4,5,6,0,8,9],
[1,1,3,1,5,0,7,8,9]
]
ytest = [0,1]
output = randomForest.score(Xtest,ytest)
print(output) # e.g. 0.5; the exact value can vary between runs because the forest is randomized
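To check how scikit-learn classifies a target array before calling score, type_of_target from sklearn.utils.multiclass can be used. This short sketch reuses the arrays from the question and shows where the names in the two error messages come from.

```python
from sklearn.utils.multiclass import type_of_target

# The 2-D label array from the question: one row of nine integer outputs per sample.
y_2d = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9]]
print(type_of_target(y_2d))        # multiclass-multioutput

# The 1-D labels from the edited question and from the answer.
print(type_of_target([100, 200]))  # binary
print(type_of_target([0, 1]))      # binary
```

Since the model in the question was fitted on the 2-D array, its predictions are multiclass-multioutput as well, which explains the "mix of binary and multiclass-multioutput" message after the edit.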

ValueError: Found arrays with inconsistent numbers of samples [1,299]

The data files are here and here; you can download them by clicking on the links. I am using Pandas, NumPy and Python 3.
Here is my code:
import pandas as pa
import numpy as nu
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
def get_accuracy(X_train, y_train, X_test, y_test):
    perceptron = Perceptron()
    perceptron.fit(X_train, y_train)
    perceptron.transform(X_train)
    prediction = perceptron.predict(X_test)
    result = accuracy_score(y_test, prediction)
    return result
test_data = pa.read_csv("C:/Users/Roman/Downloads/perceptron-test.csv")
test_data.columns = ["class", "f1", "f2"]
train_data = pa.read_csv("C:/Users/Roman/Downloads/perceptron-train.csv")
train_data.columns = ["class", "f1", "f2"]
scaler = StandardScaler()
scaler.fit_transform(train_data[train_data.columns[1:]]).reshape(-1,1)
X_train = scaler.transform(train_data[train_data.columns[1:]])
scaler.fit_transform(train_data[train_data.columns[0]])
y_train = scaler.transform(train_data[train_data.columns[0]])
scaler.fit_transform(test_data[test_data.columns[1:]])
X_test = scaler.transform(test_data[test_data.columns[1:]])
scaler.fit_transform(test_data[test_data.columns[0]])
y_test = scaler.transform(test_data[test_data.columns[0]])
scaled_accuracy = get_accuracy(nu.ravel(X_train), nu.ravel(y_train), nu.ravel(X_test), nu.ravel(y_test))
print(scaled_accuracy)
And here is the error that I get:
Traceback (most recent call last):
File "C:/Users/Roman/PycharmProjects/data_project-1/lecture_2_perceptron.py", line 33, in <module>
scaled_accuracy = get_accuracy(nu.ravel(X_train), nu.ravel(y_train), nu.ravel(X_test), nu.ravel(y_test))
File "C:/Users/Roman/PycharmProjects/data_project-1/lecture_2_perceptron.py", line 9, in get_accuracy
perceptron.fit(X_train, y_train)
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\linear_model\stochastic_gradient.py", line 545, in fit
sample_weight=sample_weight)
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\linear_model\stochastic_gradient.py", line 389, in _fit
X, y = check_X_y(X, y, 'csr', dtype=np.float64, order="C")
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 520, in check_X_y
check_consistent_length(X, y)
File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
"%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 299]
Without scaling the data everything works fine, but after scaling it does not.

You should not call fit_transform each time you use the scaler. You should fit it once, on the training data, and later only transform; otherwise you get a different representation for training and testing (leading to the error shown). There is also no point in scaling labels.
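A minimal sketch of that workflow, assuming made-up two-feature data in place of the perceptron CSV files: the scaler is fitted once on the training features, reused for the test features, and the labels are left alone. The feature matrices are also kept two-dimensional rather than flattened with ravel, since scikit-learn estimators expect X of shape (n_samples, n_features).

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales, binary labels.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(60, 2)) * [1.0, 100.0]
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 2)) * [1.0, 100.0]
y_test = (X_test[:, 0] > 0).astype(int)

# Fit the scaler once, on the training features only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the same fitted scaler for the test features; labels are not scaled.
X_test_scaled = scaler.transform(X_test)

perceptron = Perceptron(random_state=0)
perceptron.fit(X_train_scaled, y_train)
accuracy = accuracy_score(y_test, perceptron.predict(X_test_scaled))
print(accuracy)
```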

Unable to solve an error while running Gridsearch

I am new to the realm of machine learning, and I started competing in Kaggle competitions to get some practical experience. I am competing in the knowledge competition CIFAR-10 - Object Recognition in Images, where you have to classify thousands of images into 10 classes; all the data I use can be found there. I tried to implement a grid search to optimize the parameters of my machine learning algorithm, but whenever I try to fit my classifier with my training data I get an error. I have found the function that raised the error, and it has something to do with my labels not having the right type, but I have no clue how to solve it. The labels I use are strings, and I preprocess them so that I can feed them to the algorithm. Am I doing something wrong there? Or perhaps something goes wrong when I split the dataset for the grid search? Frankly, I lack the experience and knowledge to solve this problem, and I could definitely use your help.
The code involved:
import glob
import os
from sklearn.svm import SVC
from sklearn import preprocessing
import pandas as pd
from sklearn import cross_validation
from sklearn import metrics
from sklearn.grid_search import GridSearchCV
def label_preprocessing(Labels):
    Labels = np.array(Labels)[:,1]
    le = preprocessing.LabelEncoder()
    le.fit_transform(Labels)
    return Labels
def model_selection(train,Labels):
    parameters = {"C":[0.1,1,10,100],"gamma":[0.0001,0.001,0.01,0.1]}
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(train, Labels, test_size = 0.2, random_state = 0)
    svm = SVC()
    clf = GridSearchCV(svm,parameters)
    clf = clf.fit(X_train,y_train)
    print ("20 fold cv score: ",np.mean(cross_validation.cross_val_score(clf,X_test,y_test,cv = 10,scoring = "roc_auc")))
    return clf
if __name__ == "__main__":
    train_images = np.array(file_open(image_dir1,"*.png"))[:100]
    test_images = np.array(file_open(image_dir2,"*.png"))[:100]
    Labels = label_preprocessing(pd.read_csv(image_dir3)[:100])
    train_set = [matrix_image(image) for image in train_images]
    test_set = [matrix_image(image) for image in test_images]
    train_set = np.array(train_set)
    test_set = np.array(test_set)
    print("selecting best model and evaluating it")
    svm = model_selection(train_set,Labels)
    print("predicting stuff")
    result = svm.predict(test_set)
    np.savetxt("submission.csv", result, fmt = "%s", delimiter = ",")
full traceback:
Traceback (most recent call last):
File "C:\Users\Abdc\workspace\final_submission\src\SVM.py", line 49, in <module>
svm = model_selection(train_set,Labels)
File "C:\Users\Abdc\workspace\final_submission\src\SVM.py", line 35, in model_selection
clf = clf.fit(X_train,y_train)
File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 707, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 493, in _fit
for parameters in parameter_iterable
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 517, in __call__
self.dispatch(function, args, kwargs)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 312, in dispatch
job = ImmediateApply(func, args, kwargs)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 136, in __init__
self.results = func(*args, **kwargs)
File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 311, in fit_grid_point
this_score = clf.score(X_test, y_test)
File "C:\Python27\lib\site-packages\sklearn\base.py", line 294, in score
return accuracy_score(y, self.predict(X))
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 1064, in accuracy_score
y_type, y_true, y_pred = _check_clf_targets(y_true, y_pred)
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 123, in _check_clf_targets
raise ValueError("{0} is not supported".format(y_type))
ValueError: unknown is not supported
This is the function that raised the error. It can be found in the sklearn.metrics module:
def _check_clf_targets(y_true, y_pred):
    """Check that y_true and y_pred belong to the same classification task
    This converts multiclass or binary types to a common shape, and raises a
    ValueError for a mix of multilabel and multiclass targets, a mix of
    multilabel formats, for the presence of continuous-valued or multioutput
    targets, or for targets of different lengths.
    Column vectors are squeezed to 1d.
    Parameters
    ----------
    y_true : array-like,
    y_pred : array-like
    Returns
    -------
    type_true : one of {'multilabel-indicator', 'multilabel-sequences', \
                        'multiclass', 'binary'}
        The type of the true target data, as output by
        ``utils.multiclass.type_of_target``
    y_true : array or indicator matrix or sequence of sequences
    y_pred : array or indicator matrix or sequence of sequences
    """
    y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
    type_true = type_of_target(y_true)
    type_pred = type_of_target(y_pred)
    y_type = set([type_true, type_pred])
    if y_type == set(["binary", "multiclass"]):
        y_type = set(["multiclass"])
    if len(y_type) > 1:
        raise ValueError("Can't handle mix of {0} and {1}"
                         "".format(type_true, type_pred))
    # We can't have more than one value on y_type => The set is no more needed
    y_type = y_type.pop()
    # No metrics support "multiclass-multioutput" format
    if (y_type not in ["binary", "multiclass", "multilabel-indicator",
                       "multilabel-sequences"]):
        raise ValueError("{0} is not supported".format(y_type))
    if y_type in ["binary", "multiclass"]:
        y_true = column_or_1d(y_true)
        y_pred = column_or_1d(y_pred)
    return y_type, y_true, y_pred
Extra information about the Labels:
Content of labels and dtype:
In [21]:
Labels = np.array(Labels)[:,1]
Labels
Out[21]:
array(['frog', 'truck', 'truck', ..., 'truck', 'automobile', 'automobile'], dtype=object)
Content of labels after preprocessing
In [25]:
Labels = np.array(Labels)[:,1]
Labels
le = preprocessing.LabelEncoder()
Labels = le.fit_transform(Labels)
Labels
Out[25]:
array([6, 9, 9, ..., 9, 1, 1])
Shape of labels after preprocessing:
In [18]:
Labels = np.array(Labels)[:,1]
Labels.shape
le = preprocessing.LabelEncoder()
Labels = le.fit_transform(Labels)
Labels.shape
Out[18]:
(50000L,)
The original content can be found here: https://www.kaggle.com/c/cifar-10/data. Which contains an ID for a datapoint and its class Label. So it is a nx2 matrix.

This might be caused by issue #2374. As a workaround, you can try Labels = Labels.astype(str).
Also, I would suggest following the PEP8 code conventions when sharing Python code with the community. In particular, variable names are usually lowercase.
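As a rough sketch of the workaround on a small made-up label array (not the Kaggle file): the object-dtype array is converted with astype(str) and then encoded. Note that label_preprocessing in the question calls le.fit_transform(Labels) but returns the original Labels, so the encoded result is discarded there; assigning the return value, as below, is probably what was intended.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up object-dtype labels, similar in form to those shown in the question.
labels = np.array(["frog", "truck", "truck", "automobile"], dtype=object)

# Workaround from the answer: convert the object array to a string array.
labels_as_str = labels.astype(str)

# Keep the return value of fit_transform; the encoder does not modify its input.
le = LabelEncoder()
encoded = le.fit_transform(labels_as_str)
print(encoded)                        # [1 2 2 0]
print(le.inverse_transform(encoded))  # back to the original strings
```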
