I am trying to compare different regression strategies for a forecasting problem:
Using algorithms that support multi-output regression natively (e.g. Linear Regression, trees, etc.).
Wrapping algorithms that do not support it natively (e.g. SVR, XGBoost) with a multi-output wrapper.
Using the chained regressor to exploit correlations between my targets (as my forecast at t+1 is auto-correlated with the target at t+2).
The scikit-learn documentation for the multi-output wrappers is not great, but it does mention the following:
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline).
The latter have parameters of the form <component>__<parameter> so that it’s possible to
update each component of a nested object.
Therefore I am building my pipeline as:
pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
('variance_selector', VarianceThreshold(threshold=0.03)),
('estimator', xgb.XGBRegressor())])
And then creating the wrapper as:
mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)
Following the scikit-learn pipeline documentation, I am defining my XGBoost parameters as:
parameters = [
    {
        'estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        'estimator__max_depth': [10, 100, 1000],
        # etc...
    }
]
And then I am running my cross validation as:
randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0, n_iter=5,
n_jobs=-1, refit=True, cv=3, verbose=True,
pre_dispatch='2*n_jobs', error_score='raise',
return_train_score=True,
scoring='neg_mean_absolute_error')
However, I am getting the following error:
ValueError: Invalid parameter reg_alpha for estimator Pipeline(steps=[('scaler', StandardScaler()),
('variance_selector', VarianceThreshold(threshold=0.03)),
('estimator',
XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, gamma=None, gpu_id=None,
importance_type='gain',
interaction_constraints=None, learning_rate=None,
max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))]). Check the list of available parameters with `estimator.get_params().keys()`.
Did I misunderstand the scikit-learn documentation? I have also tried setting the parameters as estimator__estimator__param, as maybe this is the way to access them when they are inside the mimo_wrapper, but this has proved unsuccessful (example below):
parameters = {
'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
'estimator__estimator__max_depth': [10, 100, 1000]
}
random_grid = RandomizedSearchCV(estimator=pipeline_xgboost, param_distributions=parameters,random_state=0, n_iter=5,
n_jobs=-1, refit=True, cv=3, verbose=True,
pre_dispatch='2*n_jobs', error_score='raise',
return_train_score=True,
scoring='neg_mean_absolute_error')
hyperparameters_tuning = random_grid.fit(df.drop(columns=TARGETS+UMAPS),
df[TARGETS])
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/tmp/ipykernel_11898/2539017483.py in <module>
----> 1 hyperparameters_tuning = random_grid.fit(final_file_df_with_aggregates.drop(columns=TARGETS+UMAPS),
2 final_file_df_with_aggregates[TARGETS])
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
889 return results
890
--> 891 self._run_search(evaluate_candidates)
892
893 # multimetric is determined here because in the case of a callable
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
1764 def _run_search(self, evaluate_candidates):
1765 """Search n_iter candidates from param_distributions"""
-> 1766 evaluate_candidates(
1767 ParameterSampler(
1768 self.param_distributions, self.n_iter, random_state=self.random_state
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
836 )
837
--> 838 out = parallel(
839 delayed(_fit_and_score)(
840 clone(base_estimator),
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1054
1055 with self._backend.retrieval_context():
-> 1056 self.retrieve()
1057 # Make sure that we get a last message telling us we are done
1058 elapsed_time = time.time() - self._start_time
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
933 try:
934 if getattr(self._backend, 'supports_timeout', False):
--> 935 self._output.extend(job.get(timeout=self.timeout))
936 else:
937 self._output.extend(job.get())
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
437 raise CancelledError()
438 elif self._state == FINISHED:
--> 439 return self.__get_result()
440 else:
441 raise TimeoutError()
/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result
Funnily enough, I have noticed that setting the estimator parameters outside the random search function works well:
parameters = dict({
'estimator__max_depth': [10, 100, 1000]
})
mimo_wrapper.estimator.set_params(estimator__max_depth=200)
And as you can see, max_depth is now changed:
Pipeline(steps=[('scaler', StandardScaler()),
('variance_selector', VarianceThreshold(threshold=0.03)),
('estimator',
XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, gamma=None, gpu_id=None,
importance_type='gain',
interaction_constraints=None, learning_rate=None,
max_delta_step=None, max_depth=200,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))])
Dear colleagues, it seems this was due to a problem in XGBRegressor; in any case, the right way of creating parameters for the MultiOutputRegressor wrapping a pipeline is:
parameters = {
'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
'estimator__estimator__max_depth': [10, 100, 1000]
}
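Putting it together, a minimal sketch of the full search with the double estimator__estimator__ prefix (synthetic data and a reduced grid stand in for the original df, TARGETS and parameter ranges; the valid keys can always be checked with sorted(mimo_wrapper.get_params().keys())):
import numpy as np
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real features and two correlated targets
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
Y = rng.rand(200, 2)

pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)),
                             ('estimator', xgb.XGBRegressor())])
mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)

# First 'estimator' addresses the estimator wrapped by MultiOutputRegressor (the pipeline),
# the second addresses the pipeline step named 'estimator' (the XGBRegressor).
parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [3, 6, 10],
}

randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0,
                                       n_iter=5, cv=3, n_jobs=-1,
                                       scoring='neg_mean_absolute_error')
randomized_search.fit(X, Y)
print(randomized_search.best_params_)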
I am trying to build a random forest model using a walk-forward validation approach.
I use TimeBasedCV() to split my data accordingly.
My code looks like this:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,10]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
from sklearn.model_selection import RandomizedSearchCV
from random import randint, uniform
tscv = TimeBasedCV(train_period=60,test_period=12,freq='months')
index_output = tscv.split(X_train, date_column='Date')
rf = RandomForestRegressor()
model = RandomizedSearchCV(
estimator = rf,
param_distributions = random_grid,
n_iter = 10,
n_jobs = -1,
cv = index_output,
verbose=5,
random_state = 42,
return_train_score = True)
model.fit(X_train.drop('Date', axis=1),y_train)
model.cv_results_
The error message for my model.fit is:
IndexError: positional indexers are out-of-bounds
Do I have to adjust my randomized search, or is this error due to an error in my data?
IndexError Traceback (most recent call last)
<ipython-input-71-eebc6186b2c3> in <module>
18 return_train_score = True)
19
---> 20 model.fit(X_train,y_train)
21 model.cv_results_
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
708 return results
709
--> 710 self._run_search(evaluate_candidates)
711
712 # For multi-metric evaluation, store the best_index_, best_params_ and
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1482 evaluate_candidates(ParameterSampler(
1483 self.param_distributions, self.n_iter,
-> 1484 random_state=self.random_state))
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
687 for parameters, (train, test)
688 in product(candidate_params,
--> 689 cv.split(X, y, groups)))
690
691 if len(out) < 1:
~\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1015
1016 with self._backend.retrieval_context():
-> 1017 self.retrieve()
1018 # Make sure that we get a last message telling us we are done
1019 elapsed_time = time.time() - self._start_time
~\anaconda3\lib\site-packages\joblib\parallel.py in retrieve(self)
907 try:
908 if getattr(self._backend, 'supports_timeout', False):
--> 909 self._output.extend(job.get(timeout=self.timeout))
910 else:
911 self._output.extend(job.get())
~\anaconda3\lib\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
560 AsyncResults.get from multiprocessing."""
561 try:
--> 562 return future.result(timeout=timeout)
563 except LokyTimeoutError:
564 raise TimeoutError()
~\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
~\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
IndexError: positional indexers are out-of-bounds
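For comparison, here is a minimal sketch of the same walk-forward idea using scikit-learn's built-in TimeSeriesSplit instead of the custom TimeBasedCV (synthetic data stands in for X_train/y_train). The essential constraint is that the train/test indices handed to cv must be positional indices into exactly the array that is passed to fit, so the Date column is dropped before the splits are consumed:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic monthly data standing in for X_train / y_train
n = 120
X_train = pd.DataFrame({'Date': pd.date_range('2010-01-01', periods=n, freq='MS'),
                        'f1': np.random.rand(n),
                        'f2': np.random.rand(n)})
y_train = pd.Series(np.random.rand(n))

# Features actually used for fitting; the CV indices will refer to this frame
X_fit = X_train.drop('Date', axis=1)

random_grid = {'n_estimators': [200, 500, 1000],
               'max_depth': [10, 50, None]}

tscv = TimeSeriesSplit(n_splits=5)  # expanding-window, time-ordered splits

model = RandomizedSearchCV(estimator=RandomForestRegressor(),
                           param_distributions=random_grid,
                           n_iter=5, cv=tscv, random_state=42,
                           return_train_score=True)
model.fit(X_fit, y_train)
print(model.best_params_)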
I am performing a grid search with H2O using the Python API, with the following code:
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid import H2OGridSearch
hyper_parameters = {'ntrees':[10, 50, 100, 200], 'max_depth':[5, 10, 15, 20, 25], 'balance_classes':[True, False]}
search_criteria = {
"strategy": "RandomDiscrete",
"max_runtime_secs": 600,
"max_models": 30,
"stopping_metric": 'AUTO',
"stopping_tolerance": 0.0001,
'seed': 42
}
grid_search = H2OGridSearch(H2ORandomForestEstimator, hyper_parameters, search_criteria=search_criteria)
grid_search.train(x=events_names_x,
y="total_rsvps",
training_frame=train,
validation_frame=test)
Once it has run, I want to print the models and predict in order of AUC:
grid_search.sort_by('auc', False)
I get the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-272-b250bf2b838e> in <module>()
----> 1 grid_search.sort_by('auc', False)
/Users/stereo/.pyenv/versions/3.5.2/lib/python3.5/site-packages/h2o/grid/grid_search.py in sort_by(self, metric, increasing)
663
664 if metric[-1] != ')': metric += '()'
--> 665 c_values = [list(x) for x in zip(*sorted(eval('self.' + metric + '.items()'), key=lambda k_v: k_v[1]))]
666 c_values.insert(1, [self.get_hyperparams(model_id, display=False) for model_id in c_values[0]])
667 if not increasing:
/Users/stereo/.pyenv/versions/3.5.2/lib/python3.5/site-packages/h2o/grid/grid_search.py in <module>()
/Users/stereo/.pyenv/versions/3.5.2/lib/python3.5/site-packages/h2o/grid/grid_search.py in auc(self, train, valid, xval)
606 :return: The AUC.
607 """
--> 608 return {model.model_id: model.auc(train, valid, xval) for model in self.models}
609
610 def aic(self, train=False, valid=False, xval=False):
/Users/stereo/.pyenv/versions/3.5.2/lib/python3.5/site-packages/h2o/grid/grid_search.py in <dictcomp>(.0)
606 :return: The AUC.
607 """
--> 608 return {model.model_id: model.auc(train, valid, xval) for model in self.models}
609
610 def aic(self, train=False, valid=False, xval=False):
/Users/stereo/.pyenv/versions/3.5.2/lib/python3.5/site-packages/h2o/model/model_base.py in auc(self, train, valid, xval)
669 tm = ModelBase._get_metrics(self, train, valid, xval)
670 m = {}
--> 671 for k, v in viewitems(tm): m[k] = None if v is None else v.auc()
672 return list(m.values())[0] if len(m) == 1 else m
673
/Users/stereo/.pyenv/versions/3.5.2/lib/python3.5/site-packages/h2o/model/metrics_base.py in auc(self)
158 :return: Retrieve the AUC for this set of metrics.
159 """
--> 160 return self._metric_json['AUC']
161
162 def aic(self):
KeyError: 'AUC'
Any advice on how I can:
print the models in order of performance
forecast with the model with the highest AUC
What you need is:
sorted_grid = grid_search.get_grid(sort_by='auc',decreasing=True)
print(sorted_grid)
You can change decreasing to False if you prefer.
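From there, the best model by AUC is the first entry of the sorted grid, and it can be used for prediction; a short sketch, assuming test is the H2OFrame used as the validation frame above:
# Best model by AUC is the first one in the sorted grid
best_model = sorted_grid.models[0]

# Forecast with the best model
predictions = best_model.predict(test)
predictions.head()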
I'm attempting to run GridSearchCV for Logistic Regression in sklearn and the code is giving me the following error:
ValueError: X has 21 features per sample; expecting 19
The shapes of the training and testing data are
X_train.shape
(891L, 21L)
X_test.shape
(418L, 21L)
The code I'm using to run the GridSearchCV is:
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
logistic = LogisticRegression()
parameters = [{'C' : [1.0, 10.0, 100.0, 1000.0],
'fit_intercept' : ['True', 'False'],
'intercept_scaling' : [0, 1, 10, 100, 1000],
'class_weight' : ['auto'],
'random_state' : [26],
'tol' : [0.001, 0.01, 0.1, 1, 10, 100]
}]
logistic = GridSearchCV(LogisticRegression(),
parameters,
cv=3,
refit=True,
verbose=1)
logistic = logistic.fit(X_train, y_train)
logit_pred = logistic.predict(X_test)
The traceback I'm getting is:
ValueError Traceback (most recent call last)
C:\Code\kaggle\titanic\titanic.py in <module>()
351
352
--> 353 logistic = logistic.fit(X_train, y_train)
354
355 logit_pred = logistic.predict(X_test)
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
594
595 """
--> 596 return self._fit(X, y, ParameterGrid(self.param_grid))
597
598
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in _fit(self, X, y, parameter_iterable)
376 train, test, self.verbose, parameters,
377 self.fit_params, return_parameters=True)
--> 378 for parameters in parameter_iterable
379 for train, test in cv)
380
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __call__(self, iterable)
651 self._iterating = True
652 for function, args, kwargs in iterable:
--> 653 self.dispatch(function, args, kwargs)
654
655 if pre_dispatch == "all" or n_jobs == 1:
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in dispatch(self, func, args, kwargs)
398 """
399 if self._pool is None:
--> 400 job = ImmediateApply(func, args, kwargs)
401 index = len(self._jobs)
402 if not _verbosity_filter(index, self.verbose):
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __init__(self, func, args, kwargs)
136 # Don't delay the application, to avoid keeping the input
137 # arguments in memory
--> 138 self.results = func(*args, **kwargs)
139
140 def get(self):
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters)
1238 else:
1239 estimator.fit(X_train, y_train, **fit_params)
-> 1240 test_score = _score(estimator, X_test, y_test, scorer)
1241 if return_train_score:
1242 train_score = _score(estimator, X_train, y_train, scorer)
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
1294 score = scorer(estimator, X_test)
1295 else:
-> 1296 score = scorer(estimator, X_test, y_test)
1297 if not isinstance(score, numbers.Number):
1298 raise ValueError("scoring must return a number, got %s (%s) instead."
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\scorer.pyc in _passthrough_scorer(estimator, *args, **kwargs)
174 def _passthrough_scorer(estimator, *args, **kwargs):
175 """Function that wraps estimator.score"""
--> 176 return estimator.score(*args, **kwargs)
177
178
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\base.pyc in score(self, X, y, sample_weight)
289 """
290 from .metrics import accuracy_score
--> 291 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
292
293
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in predict(self, X)
213 Predicted class label per sample.
214 """
--> 215 scores = self.decision_function(X)
216 if len(scores.shape) == 1:
217 indices = (scores > 0).astype(np.int)
C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in decision_function(self, X)
194 if X.shape[1] != n_features:
195 raise ValueError("X has %d features per sample; expecting %d"
--> 196 % (X.shape[1], n_features))
197
198 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 21 features per sample; expecting 19
Why is GridSearchCV expecting a different number of features than the dataset contains?
UPDATE:
Thanks for the response, Andy. The datasets are all of type numpy.ndarray and the dtype is float64.
type(X_Train) type(y_train) type(X_test)
numpy.ndarray numpy.ndarray numpy.ndarray
The steps right before I bring them into sklearn:
train_data = traindf.values
test_data = testdf.values
X_train = train_data[0::, 1::] # training features
y_train = train_data[0::, 0] # training targets
X_test = test_data[0::, 0::] # test features
The next step is the GridSearchCV code I typed above...
UPDATE 2: Link to Data
Here is a link to the datasets
The error is caused by intercept_scaling=0. It looks like a bug in scikit-learn.
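For reference, a minimal sketch of the same search with intercept_scaling=0 removed (synthetic data stands in for the Titanic features; it also uses booleans rather than the strings 'True'/'False' for fit_intercept, drops class_weight='auto', which later scikit-learn versions renamed to 'balanced', and moves random_state into the estimator):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

# Synthetic stand-in for the 21-feature training data and binary target
rng = np.random.RandomState(0)
X_train = rng.rand(100, 21)
y_train = rng.randint(0, 2, 100)

parameters = [{'C': [1.0, 10.0, 100.0, 1000.0],
               'fit_intercept': [True, False],           # booleans, not strings
               'intercept_scaling': [1, 10, 100, 1000],  # 0 removed
               'tol': [0.001, 0.01, 0.1]}]               # reduced grid for brevity

logistic = GridSearchCV(LogisticRegression(random_state=26),
                        parameters, cv=3, refit=True, verbose=1)
logistic = logistic.fit(X_train, y_train)
print(logistic.best_params_)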
I'm trying to use GridSearchCV for parameter estimation of LinearSVC() as follows:
clf_SVM = LinearSVC()
params = {
'C': [0.5, 1.0, 1.5],
'tol': [1e-3, 1e-4, 1e-5],
'multi_class': ['ovr', 'crammer_singer'],
}
gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
gs.fit(corpus1, y)
corpus1 has shape (1726, 7001) and y has shape (1726,).
This is a multiclass classification problem, and y has values from 0 to 3, both inclusive, i.e. there are four classes.
But this is giving me the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-220-0c627bda0543> in <module>()
5 }
6 gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
----> 7 gs.fit(corpus1, y)
/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.pyc in fit(self, X, y)
594
595 """
--> 596 return self._fit(X, y, ParameterGrid(self.param_grid))
597
598
/usr/local/lib/python2.7/dist-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
376 train, test, self.verbose, parameters,
377 self.fit_params, return_parameters=True)
--> 378 for parameters in parameter_iterable
379 for train, test in cv)
380
/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
651 self._iterating = True
652 for function, args, kwargs in iterable:
--> 653 self.dispatch(function, args, kwargs)
654
655 if pre_dispatch == "all" or n_jobs == 1:
/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
398 """
399 if self._pool is None:
--> 400 job = ImmediateApply(func, args, kwargs)
401 index = len(self._jobs)
402 if not _verbosity_filter(index, self.verbose):
/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
136 # Don't delay the application, to avoid keeping the input
137 # arguments in memory
--> 138 self.results = func(*args, **kwargs)
139
140 def get(self):
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters)
1238 else:
1239 estimator.fit(X_train, y_train, **fit_params)
-> 1240 test_score = _score(estimator, X_test, y_test, scorer)
1241 if return_train_score:
1242 train_score = _score(estimator, X_train, y_train, scorer)
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
1294 score = scorer(estimator, X_test)
1295 else:
-> 1296 score = scorer(estimator, X_test, y_test)
1297 if not isinstance(score, numbers.Number):
1298 raise ValueError("scoring must return a number, got %s (%s) instead."
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/scorer.pyc in __call__(self, clf, X, y)
136 y_type = type_of_target(y)
137 if y_type not in ("binary", "multilabel-indicator"):
--> 138 raise ValueError("{0} format is not supported".format(y_type))
139
140 try:
ValueError: multiclass format is not supported
Remove scoring='roc_auc' and it will work, as the roc_auc scorer does not support multiclass targets.
from:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
"Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format."
Try:
from sklearn import preprocessing
y = preprocessing.label_binarize(y, classes=[0, 1, 2, 3])
before you train. This will perform a "one-hot" encoding of your y.
As has been pointed out, you must first binarize y:
y = label_binarize(y, classes=[0, 1, 2, 3])
and then use a multiclass learning algorithm like OneVsRestClassifier or OneVsOneClassifier. For example:
clf_SVM = OneVsRestClassifier(LinearSVC())
params = {
'estimator__C': [0.5, 1.0, 1.5],
'estimator__tol': [1e-3, 1e-4, 1e-5],
}
gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
gs.fit(corpus1, y)
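Putting those pieces together, a minimal end-to-end sketch (synthetic data stands in for corpus1 and y, which are not shown in the question):
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

# Synthetic stand-in for corpus1 (1726 x 7001 in the question) and a 4-class y
rng = np.random.RandomState(0)
corpus1 = rng.rand(400, 50)
y = rng.randint(0, 4, 400)

# Binarize y into an (n_samples, 4) label-indicator matrix so roc_auc applies
y_bin = label_binarize(y, classes=[0, 1, 2, 3])

clf_SVM = OneVsRestClassifier(LinearSVC())
params = {
    'estimator__C': [0.5, 1.0, 1.5],
    'estimator__tol': [1e-3, 1e-4, 1e-5],
}

gs = GridSearchCV(clf_SVM, params, cv=5, scoring='roc_auc')
gs.fit(corpus1, y_bin)
print(gs.best_params_, gs.best_score_)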
You can use to_categorical (from Keras) rather than preprocessing.label_binarize(), depending on your problem. The error actually comes from using scoring='roc_auc'; note that roc_auc does not support multiclass targets.