MultiInputOutput Model RandomSearch with Scikit Pipelines - python

I am trying to compare different regression strategies for a forecasting problem:
Using algorithms that support multi-output regression by default (e.g. Linear Regression, trees, etc.).
Using algorithms with a wrapper to do multi-output regression (e.g. SVR, XGBoost).
Using a chained regressor to exploit correlations between my targets (as my forecast at t+1 is auto-correlated with the target at t+2).
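(For reference, that chained strategy corresponds to sklearn.multioutput.RegressorChain. A minimal sketch on toy stand-in data, with Ridge chosen arbitrarily as the base estimator:)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

rng = np.random.RandomState(0)
X, y = rng.rand(50, 4), rng.rand(50, 2)  # 50 samples, 4 features, 2 targets

# Each target is predicted from the features plus the predictions for earlier targets
chain = RegressorChain(Ridge(), order=[0, 1]).fit(X, y)
print(chain.predict(X).shape)  # (50, 2)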
The scikit-learn documentation for the multi-output wrappers is not great, but it does mention that:
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline).
The latter have parameters of the form <component>__<parameter> so that it’s possible to
update each component of a nested object.
Therefore I am building my pipeline as:
pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)),
                             ('estimator', xgb.XGBRegressor())])
And then creating the wrapper as:
mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)
Following the scikit-learn documentation for pipelines, I am defining my XGBoost parameters as:
parameters = [
    {
        'estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        'estimator__max_depth': [10, 100, 1000],
        # etc...
    }
]
And then I am running my cross-validation as:
randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0, n_iter=5,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', error_score='raise',
                                       return_train_score=True,
                                       scoring='neg_mean_absolute_error')
However, I am getting the following issue:
ValueError: Invalid parameter reg_alpha for estimator Pipeline(steps=[('scaler', StandardScaler()),
('variance_selector', VarianceThreshold(threshold=0.03)),
('estimator',
XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, gamma=None, gpu_id=None,
importance_type='gain',
interaction_constraints=None, learning_rate=None,
max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))]). Check the list of available parameters with `estimator.get_params().keys()`.
Did I misunderstand the scikit-learn documentation? I have also tried setting the parameters as estimator__estimator__param, as maybe this is the way to access them when they sit inside the mimo_wrapper, but this has proved unsuccessful. (Example below:)
parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [10, 100, 1000]
}
random_grid = RandomizedSearchCV(estimator=pipeline_xgboost, param_distributions=parameters, random_state=0, n_iter=5,
                                 n_jobs=-1, refit=True, cv=3, verbose=True,
                                 pre_dispatch='2*n_jobs', error_score='raise',
                                 return_train_score=True,
                                 scoring='neg_mean_absolute_error')
hyperparameters_tuning = random_grid.fit(df.drop(columns=TARGETS+UMAPS),
                                         df[TARGETS])
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/tmp/ipykernel_11898/2539017483.py in <module>
----> 1 hyperparameters_tuning = random_grid.fit(final_file_df_with_aggregates.drop(columns=TARGETS+UMAPS),
2 final_file_df_with_aggregates[TARGETS])
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
889 return results
890
--> 891 self._run_search(evaluate_candidates)
892
893 # multimetric is determined here because in the case of a callable
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
1764 def _run_search(self, evaluate_candidates):
1765 """Search n_iter candidates from param_distributions"""
-> 1766 evaluate_candidates(
1767 ParameterSampler(
1768 self.param_distributions, self.n_iter, random_state=self.random_state
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
836 )
837
--> 838 out = parallel(
839 delayed(_fit_and_score)(
840 clone(base_estimator),
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1054
1055 with self._backend.retrieval_context():
-> 1056 self.retrieve()
1057 # Make sure that we get a last message telling us we are done
1058 elapsed_time = time.time() - self._start_time
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
933 try:
934 if getattr(self._backend, 'supports_timeout', False):
--> 935 self._output.extend(job.get(timeout=self.timeout))
936 else:
937 self._output.extend(job.get())
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
437 raise CancelledError()
438 elif self._state == FINISHED:
--> 439 return self.__get_result()
440 else:
441 raise TimeoutError()
/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result
Funnily enough, I have noticed that setting the estimator parameters outside the random search function works well:
parameters = dict({
    'estimator__max_depth': [10, 100, 1000]
})
mimo_wrapper.estimator.set_params(estimator__max_depth=200)
And as you can see, max_depth is now changed:
Pipeline(steps=[('scaler', StandardScaler()),
('variance_selector', VarianceThreshold(threshold=0.03)),
('estimator',
XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, gamma=None, gpu_id=None,
importance_type='gain',
interaction_constraints=None, learning_rate=None,
max_delta_step=None, max_depth=200,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))])
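For reference, the full set of tunable names can be listed on the wrapper itself, which is what the error message's hint about get_params().keys() amounts to. A quick inspection sketch, assuming the mimo_wrapper defined above:
# Every key here is a legal entry in the search space for the wrapper
for name in mimo_wrapper.get_params().keys():
    print(name)
# Among others, this prints names with a doubled prefix, e.g.:
# estimator__estimator__max_depth
# estimator__estimator__reg_alpha
(the first estimator is the pipeline inside MultiOutputRegressor; the second is the pipeline step holding the XGBRegressor).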

Dear colleagues, it seems that this was due to a problem in xgb.XGBRegressor; in any case, the right way of creating parameters for the MultiOutputRegressor within a pipeline is:
parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [10, 100, 1000]
}
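Putting it all together, a minimal end-to-end sketch of the working setup (with a small illustrative grid and random stand-in arrays, since the original df, TARGETS and UMAPS are not shown):
import numpy as np
import xgboost as xgb
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X, y = rng.rand(100, 10), rng.rand(100, 2)  # 100 samples, 10 features, 2 targets

pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)),
                             ('estimator', xgb.XGBRegressor())])
mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)

# estimator        -> the pipeline inside MultiOutputRegressor
# estimator__estimator -> the XGBRegressor step inside that pipeline
parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [3, 5, 10],
}
randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0, n_iter=5,
                                       cv=3, scoring='neg_mean_absolute_error')
randomized_search.fit(X, y)
print(randomized_search.best_params_)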

Related

Error using sci-kit learn MLPClassifier: 'str' object has no attribute 'decode'

I am creating, tuning, and fitting various scikit-learn models for a classification problem. The structure of the code below works fine for all other methods (SVM, Logistic Regression, Random Forest, Decision Tree, etc.).
When running MLPClassifier() and its hyperparameter tuning, I get the following error message, originating from the last line, where I fit:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8160/4102235488.py in <module>
19
20 reverse_mlp_en_clf = GridSearchCV(reverse_mlp_en, reverse_mlp_en_hyperparameters, cv = 5, verbose = 1, n_jobs = 5, scoring = 'f1')
---> 21 reverse_mlp_en_best_model = reverse_mlp_en_clf.fit(X_en, y_en)
~\AppData\Roaming\Python\Python38\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
737 refit_start_time = time.time()
738 if y is not None:
--> 739 self.best_estimator_.fit(X, y, **fit_params)
740 else:
741 self.best_estimator_.fit(X, **fit_params)
~\AppData\Roaming\Python\Python38\site-packages\sklearn\neural_network\_multilayer_perceptron.py in fit(self, X, y)
992 self : returns a trained MLP model.
993 """
--> 994 return self._fit(X, y, incremental=(self.warm_start and
995 hasattr(self, "classes_")))
996
~\AppData\Roaming\Python\Python38\site-packages\sklearn\neural_network\_multilayer_perceptron.py in _fit(self, X, y, incremental)
372 # Run the LBFGS solver
373 elif self.solver == 'lbfgs':
--> 374 self._fit_lbfgs(X, y, activations, deltas, coef_grads,
375 intercept_grads, layer_units)
376 return self
~\AppData\Roaming\Python\Python38\site-packages\sklearn\neural_network\_multilayer_perceptron.py in _fit_lbfgs(self, X, y, activations, deltas, coef_grads, intercept_grads, layer_units)
468 },
469 args=(X, y, activations, deltas, coef_grads, intercept_grads))
--> 470 self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
471 self.loss_ = opt_res.fun
472 self._unpack(opt_res.x)
~\AppData\Roaming\Python\Python38\site-packages\sklearn\utils\optimize.py in _check_optimize_result(solver, result, max_iter, extra_warning_msg)
241 " https://scikit-learn.org/stable/modules/"
242 "preprocessing.html"
--> 243 ).format(solver, result.status, result.message.decode("latin1"))
244 if extra_warning_msg is not None:
245 warning_msg += "\n" + extra_warning_msg
AttributeError: 'str' object has no attribute 'decode'
Code is as follows:
# Reverse hyperparameter tuning - MLP English
# Create model
reverse_mlp_en = MLPClassifier()
# Define parameters, store in dictionary
reverse_mlp_en_hidden_layer_sizes = [(50,), (100,), (150,), (200,)]
reverse_mlp_en_activation = ['logistic', 'tanh', 'relu']
reverse_mlp_en_solver = ['lbfgs']  # definition missing from the original post; 'lbfgs' is implied by the traceback
reverse_mlp_en_alpha = [0.0001, 0.05]
reverse_mlp_en_learning_rate = ['constant', 'adaptive']
reverse_mlp_en_hyperparameters = dict(hidden_layer_sizes=reverse_mlp_en_hidden_layer_sizes,
                                      activation=reverse_mlp_en_activation,
                                      solver=reverse_mlp_en_solver,
                                      alpha=reverse_mlp_en_alpha,
                                      learning_rate=reverse_mlp_en_learning_rate)
# Get model with the best parameters
reverse_mlp_en_clf = GridSearchCV(reverse_mlp_en, reverse_mlp_en_hyperparameters, cv=5, verbose=1, n_jobs=5, scoring='f1')
reverse_mlp_en_best_model = reverse_mlp_en_clf.fit(X_en, y_en)
Libraries are up to date, if I am not mistaken.
If anyone has an idea, that would be greatly appreciated - I just find it weird how:
the same code, but using different models, works fine, and
how the error is only shown after fitting everything (which takes 40 minutes or so).
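One observation from the traceback itself: the failure happens in sklearn.utils.optimize._check_optimize_result, where scikit-learn calls result.message.decode("latin1") on the message returned by SciPy's L-BFGS solver. If the installed SciPy already returns that message as a str (newer releases do), the .decode call fails exactly like this, so a scikit-learn/SciPy version mismatch is a plausible culprit. A quick diagnostic sketch:
import scipy
import sklearn

# If scikit-learn predates the fix that stopped decoding the L-BFGS message,
# while SciPy is recent enough to return str instead of bytes, the combination
# reproduces AttributeError: 'str' object has no attribute 'decode'
print('scikit-learn:', sklearn.__version__)
print('scipy:', scipy.__version__)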

IndexError: positional indexers are out-of-bounds - RandomizedSearchCV() - Random Forest

I am trying to build a random forest model using a walk-forward validation approach.
I use TimeBasedCV() to split my data accordingly: TimeBasedCV()
My code looks like this:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 10]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

from sklearn.model_selection import RandomizedSearchCV
from random import randint, uniform

tscv = TimeBasedCV(train_period=60, test_period=12, freq='months')
index_output = tscv.split(X_train, date_column='Date')

rf = RandomForestRegressor()

model = RandomizedSearchCV(
    estimator=rf,
    param_distributions=random_grid,
    n_iter=10,
    n_jobs=-1,
    cv=index_output,
    verbose=5,
    random_state=42,
    return_train_score=True)

model.fit(X_train.drop('Date', axis=1), y_train)
model.cv_results_
The error message for my model.fit is:
IndexError: positional indexers are out-of-bounds
Do I have to adjust my randomized search? Or is this error due to an error in my data?
IndexError Traceback (most recent call last)
<ipython-input-71-eebc6186b2c3> in <module>
18 return_train_score = True)
19
---> 20 model.fit(X_train,y_train)
21 model.cv_results_
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
708 return results
709
--> 710 self._run_search(evaluate_candidates)
711
712 # For multi-metric evaluation, store the best_index_, best_params_ and
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1482 evaluate_candidates(ParameterSampler(
1483 self.param_distributions, self.n_iter,
-> 1484 random_state=self.random_state))
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
687 for parameters, (train, test)
688 in product(candidate_params,
--> 689 cv.split(X, y, groups)))
690
691 if len(out) < 1:
~\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1015
1016 with self._backend.retrieval_context():
-> 1017 self.retrieve()
1018 # Make sure that we get a last message telling us we are done
1019 elapsed_time = time.time() - self._start_time
~\anaconda3\lib\site-packages\joblib\parallel.py in retrieve(self)
907 try:
908 if getattr(self._backend, 'supports_timeout', False):
--> 909 self._output.extend(job.get(timeout=self.timeout))
910 else:
911 self._output.extend(job.get())
~\anaconda3\lib\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
560 AsyncResults.get from multiprocessing."""
561 try:
--> 562 return future.result(timeout=timeout)
563 except LokyTimeoutError:
564 raise TimeoutError()
~\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
~\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
IndexError: positional indexers are out-of-bounds
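Whatever TimeBasedCV returns as cv must be 0-based positional indices into the exact arrays passed to fit; a common cause of this error is the splitter handing back DataFrame index labels (or indices computed on a differently sized frame) instead. For comparison, a minimal walk-forward sketch using scikit-learn's built-in TimeSeriesSplit, which sidesteps that bookkeeping (toy data standing in for X_train/y_train):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.RandomState(42)
X, y = rng.rand(200, 5), rng.rand(200)  # time-ordered toy data

# Each fold trains on the past and tests on the following 12 observations
tscv = TimeSeriesSplit(n_splits=5, test_size=12)

random_grid = {'n_estimators': [200, 500, 1000],
               'max_depth': [10, 50, None]}
model = RandomizedSearchCV(estimator=RandomForestRegressor(),
                           param_distributions=random_grid,
                           n_iter=5, cv=tscv, random_state=42)
model.fit(X, y)
print(model.best_params_)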

GridSearchCV parameters

I'm trying to use GridSearchCV with KMeans clustering to explore the optimal number of clusters to use in order to get the best results on a classification problem.
I've got the following code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

faces = fetch_olivetti_faces()
X_data, y_data = faces.data, faces.target

log_reg = LogisticRegression()
split = StratifiedShuffleSplit(n_splits=1, test_size=.2, random_state=42)
for train_index, test_index in split.split(X_train, y_train):
    X_train_set, y_train_set = X_data[train_index,], y_data[train_index,]
    X_test_set, y_test_set = X_data[test_index,], y_data[test_index,]

pipeline = Pipeline([
    ('kmeans', KMeans(n_clusters=30)),
    ('log_reg', LogisticRegression())
])

cluster_grid = dict(n_clusters=range(2,100))
grid = GridSearchCV(pipeline, cluster_grid)
grid.fit(X_train_set, y_train_set, cv=5, verbose=2)
Here's the entire traceback:
-------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-80e6a3932897> in <module>
----> 1 grid.fit(X_train_set, y_train_set, cv=5, verbose=2)
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
686 return results
687
--> 688 self._run_search(evaluate_candidates)
689
690 # For multi-metric evaluation, store the best_index_, best_params_ and
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
1147 def _run_search(self, evaluate_candidates):
1148 """Search all candidates in param_grid"""
-> 1149 evaluate_candidates(ParameterGrid(self.param_grid))
1150
1151
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
665 for parameters, (train, test)
666 in product(candidate_params,
--> 667 cv.split(X, y, groups)))
668
669 if len(out) < 1:
~/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
919 # remaining jobs.
920 self._iterating = False
--> 921 if self.dispatch_one_batch(iterator):
922 self._iterating = self._original_iterator is not None
923
~/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
501 train_scores = {}
502 if parameters is not None:
--> 503 estimator.set_params(**parameters)
504
505 start_time = time.time()
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in set_params(self, **kwargs)
162 self
163 """
--> 164 self._set_params('steps', **kwargs)
165 return self
166
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in _set_params(self, attr, **params)
48 self._replace_estimator(attr, name, params.pop(name))
49 # 3. Step parameters and other initialisation arguments
---> 50 super().set_params(**params)
51 return self
52
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/base.py in set_params(self, **params)
222 'Check the list of available parameters '
223 'with `estimator.get_params().keys()`.' %
--> 224 (key, self))
225
226 if delim:
ValueError: Invalid parameter n_clusters for estimator Pipeline(memory=None,
steps=[('kmeans',
KMeans(algorithm='auto', copy_x=True, init='k-means++',
max_iter=300, n_clusters=30, n_init=10, n_jobs=None,
precompute_distances='auto', random_state=None,
tol=0.0001, verbose=0)),
('log_reg',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='warn', tol=0.0001, verbose=0,
warm_start=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
I have no idea what the heck is going on... I'm not sure how to interpret this error message, but my parameter grid doesn't seem to be out of whack. PLEASE HELP!
When you are using a pipeline you need to give the parameters as follows:
cluster_grid = {
    'kmeans__n_clusters': range(2,100)
}
# adding n_jobs to run in parallel
grid = GridSearchCV(pipeline, cluster_grid, n_jobs=-1)
where kmeans is taken from ('kmeans', KMeans())
So, your code should look like the following:
pipeline = Pipeline([
    ('kmeans', KMeans()),
    ('log_reg', LogisticRegression())
])
cluster_grid = {
    'kmeans__n_clusters': range(2,100)
}
# adding n_jobs to run in parallel
grid = GridSearchCV(pipeline, cluster_grid, n_jobs=-1)
The parameter n_clusters is only applicable to KMeans, not LogisticRegression.
Specify in your cluster_grid that the grid param is only meant for KMeans:
# Parameters of pipelines can be set using ‘__’ separated parameter names:
cluster_grid = dict(kmeans__n_clusters=range(2,100))
Reference : https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
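For completeness, a runnable version of the corrected search (note, as an aside, that cv and verbose belong in the GridSearchCV constructor, not in fit, which the original call gets wrong):
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_olivetti_faces
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

faces = fetch_olivetti_faces()
X_data, y_data = faces.data, faces.target

pipeline = Pipeline([
    ('kmeans', KMeans(n_clusters=30)),
    ('log_reg', LogisticRegression())
])

# step name 'kmeans' + '__' + parameter name, as in the answers above
cluster_grid = dict(kmeans__n_clusters=range(2, 100))

grid = GridSearchCV(pipeline, cluster_grid, cv=5, verbose=2, n_jobs=-1)
grid.fit(X_data, y_data)
print(grid.best_params_)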

sklearn GridSearchCV : ValueError: X has 21 features per sample; expecting 19

I'm attempting to run GridSearchCV for Logistic Regression in sklearn and the code is giving me the following error:
ValueError: X has 21 features per sample; expecting 19
The shapes of the training and testing data are
X_train.shape
(891L, 21L)
X_test.shape
(418L, 21L)
The code I'm using to run the GridSearchCV is:
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV

logistic = LogisticRegression()
parameters = [{'C': [1.0, 10.0, 100.0, 1000.0],
               'fit_intercept': ['True', 'False'],
               'intercept_scaling': [0, 1, 10, 100, 1000],
               'class_weight': ['auto'],
               'random_state': [26],
               'tol': [0.001, 0.01, 0.1, 1, 10, 100]
               }]

logistic = GridSearchCV(LogisticRegression(),
                        parameters,
                        cv=3,
                        refit=True,
                        verbose=1)

logistic = logistic.fit(X_train, y_train)
logit_pred = logistic.predict(X_test)
The traceback I'm getting is:
ValueError Traceback (most recent call last)
C:\Code\kaggle\titanic\titanic.py in <module>()
351
352
--> 353 logistic = logistic.fit(X_train, y_train)
354
355 logit_pred = logistic.predict(X_test)
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
594
595 """
--> 596 return self._fit(X, y, ParameterGrid(self.param_grid))
597
598
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in _fit(self, X, y, parameter_iterable)
376 train, test, self.verbose, parameters,
377 self.fit_params, return_parameters=True)
--> 378 for parameters in parameter_iterable
379 for train, test in cv)
380
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __call__(self, iterable)
651 self._iterating = True
652 for function, args, kwargs in iterable:
--> 653 self.dispatch(function, args, kwargs)
654
655 if pre_dispatch == "all" or n_jobs == 1:
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in dispatch(self, func, args, kwargs)
398 """
399 if self._pool is None:
--> 400 job = ImmediateApply(func, args, kwargs)
401 index = len(self._jobs)
402 if not _verbosity_filter(index, self.verbose):
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __init__(self, func, args, kwargs)
136 # Don't delay the application, to avoid keeping the input
137 # arguments in memory
--> 138 self.results = func(*args, **kwargs)
139
140 def get(self):
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters)
1238 else:
1239 estimator.fit(X_train, y_train, **fit_params)
-> 1240 test_score = _score(estimator, X_test, y_test, scorer)
1241 if return_train_score:
1242 train_score = _score(estimator, X_train, y_train, scorer)
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
1294 score = scorer(estimator, X_test)
1295 else:
-> 1296 score = scorer(estimator, X_test, y_test)
1297 if not isinstance(score, numbers.Number):
1298 raise ValueError("scoring must return a number, got %s (%s) instead."
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\scorer.pyc in _passthrough_scorer(estimator, *args, **kwargs)
174 def _passthrough_scorer(estimator, *args, **kwargs):
175 """Function that wraps estimator.score"""
--> 176 return estimator.score(*args, **kwargs)
177
178
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\base.pyc in score(self, X, y, sample_weight)
289 """
290 from .metrics import accuracy_score
--> 291 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
292
293
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in predict(self, X)
213 Predicted class label per sample.
214 """
--> 215 scores = self.decision_function(X)
216 if len(scores.shape) == 1:
217 indices = (scores > 0).astype(np.int)
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in decision_function(self, X)
194 if X.shape[1] != n_features:
195 raise ValueError("X has %d features per sample; expecting %d"
--> 196 % (X.shape[1], n_features))
197
198 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 21 features per sample; expecting 19
Why is GridSearchCV expecting a different number of features than the dataset contains?
UPDATE:
Thanks for the response, Andy. The datasets are all of type numpy.ndarray with dtype float64:
type(X_train)   type(y_train)   type(X_test)
numpy.ndarray   numpy.ndarray   numpy.ndarray
The steps right before I bring them into sklearn:
train_data = traindf.values
test_data = testdf.values
X_train = train_data[0::, 1::] # training features
y_train = train_data[0::, 0] # training targets
X_test = test_data[0::, 0::] # test features
The next step is the GridSearchCV code I typed above...
UPDATE 2: Link to Data
Here is a link to the datasets
The error is caused by intercept_scaling=0. It looks like a bug in scikit-learn.
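Following that answer, a sketch of the corrected grid: drop 0 from intercept_scaling, and (a separate nit, not the cause of this error) use real booleans for fit_intercept rather than the strings 'True'/'False', which are both truthy:
parameters = [{'C': [1.0, 10.0, 100.0, 1000.0],
               'fit_intercept': [True, False],           # booleans, not strings
               'intercept_scaling': [1, 10, 100, 1000],  # 0 removed
               'class_weight': ['auto'],
               'random_state': [26],
               'tol': [0.001, 0.01, 0.1, 1, 10, 100]}]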

Scikit-learn GridSearch giving "ValueError: multiclass format is not supported" error

I'm trying to use GridSearch for parameter estimation of LinearSVC() as follows -
clf_SVM = LinearSVC()
params = {
    'C': [0.5, 1.0, 1.5],
    'tol': [1e-3, 1e-4, 1e-5],
    'multi_class': ['ovr', 'crammer_singer'],
}
gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
gs.fit(corpus1, y)
corpus1 has shape (1726, 7001) and y has shape (1726,)
This is a multiclass classification, and y has values from 0 to 3, both inclusive, i.e. there are four classes.
But this is giving me the following error -
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-220-0c627bda0543> in <module>()
5 }
6 gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
----> 7 gs.fit(corpus1, y)
/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.pyc in fit(self, X, y)
594
595 """
--> 596 return self._fit(X, y, ParameterGrid(self.param_grid))
597
598
/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
376 train, test, self.verbose, parameters,
377 self.fit_params, return_parameters=True)
--> 378 for parameters in parameter_iterable
379 for train, test in cv)
380
/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
651 self._iterating = True
652 for function, args, kwargs in iterable:
--> 653 self.dispatch(function, args, kwargs)
654
655 if pre_dispatch == "all" or n_jobs == 1:
/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
398 """
399 if self._pool is None:
--> 400 job = ImmediateApply(func, args, kwargs)
401 index = len(self._jobs)
402 if not _verbosity_filter(index, self.verbose):
/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
136 # Don't delay the application, to avoid keeping the input
137 # arguments in memory
--> 138 self.results = func(*args, **kwargs)
139
140 def get(self):
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters)
1238 else:
1239 estimator.fit(X_train, y_train, **fit_params)
-> 1240 test_score = _score(estimator, X_test, y_test, scorer)
1241 if return_train_score:
1242 train_score = _score(estimator, X_train, y_train, scorer)
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
1294 score = scorer(estimator, X_test)
1295 else:
-> 1296 score = scorer(estimator, X_test, y_test)
1297 if not isinstance(score, numbers.Number):
1298 raise ValueError("scoring must return a number, got %s (%s) instead."
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/scorer.pyc in __call__(self, clf, X, y)
136 y_type = type_of_target(y)
137 if y_type not in ("binary", "multilabel-indicator"):
--> 138 raise ValueError("{0} format is not supported".format(y_type))
139
140 try:
ValueError: multiclass format is not supported
Remove scoring='roc_auc' and it will work, since the ROC AUC score does not support multiclass targets directly.
from:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
"Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format."
Try:
from sklearn import preprocessing
y = preprocessing.label_binarize(y, classes=[0, 1, 2, 3])
before you train. This will perform a "one-hot" encoding of your y.
As has been pointed out, you must first binarize y:
y = label_binarize(y, classes=[0, 1, 2, 3])
and then use a multiclass learning algorithm like OneVsRestClassifier or OneVsOneClassifier. For example:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

clf_SVM = OneVsRestClassifier(LinearSVC())
params = {
    'estimator__C': [0.5, 1.0, 1.5],
    'estimator__tol': [1e-3, 1e-4, 1e-5],
}
gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
gs.fit(corpus1, y)
You can use to_categorical (from Keras) directly rather than preprocessing.label_binarize(), depending on your problem. The problem actually comes from using scoring='roc_auc'; note that roc_auc does not support multiclass targets.
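A self-contained version of that recipe on synthetic stand-in data (corpus1 itself is not shown in the question):
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(200, 20)            # stand-in for corpus1
y = rng.randint(0, 4, size=200)  # four classes, 0..3

# Binarize to label-indicator format: shape (200, 4), one column per class
y_bin = label_binarize(y, classes=[0, 1, 2, 3])

clf_SVM = OneVsRestClassifier(LinearSVC())
params = {
    'estimator__C': [0.5, 1.0, 1.5],
    'estimator__tol': [1e-3, 1e-4, 1e-5],
}
gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
gs.fit(X, y_bin)
print(gs.best_params_)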
