GridSearchCV not giving the most optimal settings? - python

I'm working on an XGBoost model and tried GridSearchCV from scikit-learn. After searching for the optimal parameter settings, I got this result:
Fitting 4 folds for each of 2304 candidates, totalling 9216 fits
Best parameters found: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 11, 'n_estimators': 200, 'subsample': 0.9}
After training a model with these settings and predicting on unseen test data (from a train/test split), I got a result. As an experiment I changed some settings by hand and got a better result than with the optimal parameters from the grid search.
Is this because the test set differs from the training data? If I had another test set, would the best-scoring settings be different again? I think the answer to both is yes, but how do other people deal with this effect?
You get the results from the grid search, but do you always use those settings, or do you do what I do? What would be your final settings for the model you want to deploy?
Hope to receive some inspirational thoughts:)
My final code for train/test after manual fine tuning:
xgb_skl_tuned_2 = xgb.XGBRegressor(
colsample_bytree = 0.7,
subsample = 0.9,
learning_rate = 0.3,
max_depth = 5,
min_child_weight = 13,
gamma = 10,
n_estimators = 50
)
xgb_skl_tuned_2.fit(X_train_2,y_train_2)
preds_2 = xgb_skl_tuned_2.predict(X_test_2)
rmse = mean_squared_error(y_test_2, preds_2, squared=False)  # squared=False returns the RMSE, not the MSE
print('Model RMSE: {}'.format(rmse))
Also checked this thread: parameters tuning with GridsearchCV not giving best result

Related

Configuration of GridSearchCV for AdaBoost and its base learner

I'm running grid search on AdaBoost with DecisionTreeClassifier as its base learner to get the best parameters for AdaBoost and DecisionTree.
The search on a dataset of shape (130000, 22) has been running for 18 hours, so I'm wondering whether it's just another typical day of waiting for training or whether there is an issue with the setup.
Are the base learner, grid search, training and params set up correctly?
ada_params = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"base_estimator__min_samples_leaf": [*np.arange(100,1500,100)],
"base_estimator__max_depth": [5,10,13,15],
"base_estimator__max_features": [5,10,15],
"n_estimators": [500, 700, 1000, 1500],
"learning_rate": [0.001, 0.01, 0.1, 0.3]
}
dt_base_learner = DecisionTreeClassifier(random_state = 42, max_features="auto", class_weight = "balanced")
ada_clf = AdaBoostClassifier(base_estimator = dt_base_learner)
ada_search = GridSearchCV(ada_clf, param_grid=ada_params, scoring = 'f1', cv=kf)
ada_search.fit(scaled_X_train, y_train)
If I am not mistaken, your grid search tests 2 * 2 * 14 * 4 * 3 * 4 * 4 = 10,752 different model configurations (the two criterion and two splitter options count as well), each fitted once per cross-validation split, and the number of splits in kf is not shown. You should definitely try to reduce the number of combinations in the GridSearchCV or go for RandomizedSearchCV or BayesSearchCV from skopt.
GridSearchCV will not finish until every combination has been fitted. Check the RandomizedSearchCV documentation, increase the number of sampled combinations (n_iter) a few at a time, and set n_jobs=-1 to parallelize as much as possible.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
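A minimal sketch of the randomized search suggested above, not the asker's actual setup. The synthetic data, the reduced value lists and n_iter=5 are assumptions chosen so the example runs quickly. The prefix lookup is there because the base-learner parameter is named "estimator" in newer scikit-learn releases and "base_estimator" in older ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (130000, 22) dataset (assumption).
X, y = make_classification(n_samples=300, n_features=22, random_state=42)

base = DecisionTreeClassifier(random_state=42, class_weight="balanced")
ada = AdaBoostClassifier(base, random_state=42)

# Look up the base-learner prefix instead of hard-coding it.
prefix = "estimator" if "estimator" in ada.get_params() else "base_estimator"

# Deliberately small value lists (assumption); 3 * 5 * 3 * 3 = 135 combinations.
param_distributions = {
    prefix + "__max_depth": [2, 3, 5],
    prefix + "__min_samples_leaf": [5, 10, 15, 20, 25],
    "n_estimators": [10, 20, 40],
    "learning_rate": [0.01, 0.1, 0.3],
}

N_JOBS = 1  # set to -1 on a real machine to use all cores

# Only n_iter combinations are sampled and fitted, not all 135.
search = RandomizedSearchCV(
    ada,
    param_distributions,
    n_iter=5,
    scoring="f1",
    cv=3,
    n_jobs=N_JOBS,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With n_iter=5 and cv=3 this runs 15 fits in total. Raising n_iter step by step shows whether the best score is still improving before you commit to a longer run.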

sklearn GridSearchCV gives questionable results

I have input data X_train with dimension (477 x 200) and y_train with length 477.
I want to use a support vector machine regressor and I am doing grid search.
param_grid = {'kernel': ['poly', 'rbf', 'linear','sigmoid'], 'degree': [2,3,4,5], 'C':[0.01,0.1,0.3,0.5,0.7,1,1.5,2,5,10]}
grid = GridSearchCV(estimator=regressor_2, param_grid=param_grid, scoring='neg_root_mean_squared_error', n_jobs=1, cv=3, verbose = 1)
grid_result = grid.fit(X_train, y_train)
For grid_result.best_params_ I get {'C': 0.3, 'degree': 2, 'kernel': 'linear'} with a score of -7.76, and {'C': 10, 'degree': 2, 'kernel': 'rbf'} gives -8.0.
However, when I do
regressor_opt = SVR(kernel='linear', degree=2, C=0.3)
regressor_opt.fit(X_train,y_train)
y_train_pred = regressor_opt.predict(X_train)
print("rmse=", np.sqrt(np.mean((y_train - y_train_pred)**2)))
I get 7.4 and when I do
regressor_2 = SVR(kernel='rbf', degree=2, C=10)
regressor_2.fit(X_train,y_train)
y_train_pred = regressor_2.predict(X_train)
print("rmse=", np.sqrt(np.mean((y_train - y_train_pred)**2)))
I get 5.9. This is clearly better than 7.4 but in the gridsearch the negative rmse I got for that parameter combination was -8 and therefore worse than 7.4.
Can anybody explain to me what is going on? Should I not use scoring='neg_root_mean_squared_error'?
GridSearchCV will give you the score based on the left-out data. This is fundamentally how cross-validation works. What you're doing when you train and assess on the full train set is failing to do that cross-validation; you will be obtaining an overly optimistic result. You see this slightly for the linear kernel (7.4 vs 7.76) and more exaggerated for the more flexible RBF kernel (5.9 vs 8). GridSearchCV has, I expect correctly, identified that your more flexible model does not generalise as well.
You should be able to see this effect more clearly by taking your specific estimators (regressor_opt and regressor_2) and using sklearn's cross_validate() to get the results for left-out folds. I expect you will see regressor_2 performing considerably worse than your optimistic value of 5.9. You may find that an informative exercise.
Remember, you want a model that will perform best on new data, not a model that fits arbitrarily well to your training data.
I suggest that further discussion of this belongs on Cross Validated rather than on Stack Overflow.
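The cross_validate() check suggested in this answer could look roughly like the sketch below. The data is synthetic (only its 477 x 200 shape comes from the question), so the numbers will not match the asker's. The point is only to put the training score and the left-out-fold score side by side for both kernels.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

# Synthetic stand-in with the same shape as X_train in the question (assumption).
X, y = make_regression(n_samples=477, n_features=200, noise=10.0, random_state=0)

candidates = {
    "linear": SVR(kernel="linear", C=0.3),
    "rbf": SVR(kernel="rbf", C=10),
}

results = {}
for name, model in candidates.items():
    cv = cross_validate(
        model,
        X,
        y,
        cv=3,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,  # also score each model on the folds it was fitted on
    )
    # Scores are negated RMSE values, so flip the sign back.
    results[name] = {
        "train_rmse": -cv["train_score"].mean(),
        "cv_rmse": -cv["test_score"].mean(),
    }
    print(name, results[name])
```

The cv_rmse column is what GridSearchCV reports (with the sign flipped); train_rmse is the kind of number you get from predicting on X_train after fitting on X_train.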

How to deal with overfitting of xgboost classifier?

I use XGBoost for multi-class classification of spectrogram images (data link: automotive target classification). There are 5 classes; the training data includes 20000 samples (5000 per class), the test data includes 5000 samples (1000 per class), and the original image size is 144*400. This is my code snippet:
train_data, train_label, test_data, test_label = load_data(data_dir, resampleX=4, resampleY=5)
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
cv_params = {'n_estimators': [100,200,300,400,500], 'learning_rate': [0.01, 0.1]}
other_params = {'learning_rate': 0.1, 'n_estimators': 100,
'max_depth': 5, 'min_child_weight': 1, 'seed': 27, 'nthread': 6,
'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0,
'reg_alpha': 0, 'reg_lambda': 1,
'objective': 'multi:softmax', 'num_class': 5}
model = XGBClassifier(**other_params)
classifier = GridSearchCV(estimator=model, param_grid=cv_params, cv=3, verbose=1, n_jobs=6)
classifier.fit(train_data, train_label)
print("The best parameters are %s with a score of %0.2f" % (classifier.best_params_, classifier.best_score_))
During hyperparameter tuning, following https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/, I first tuned n_estimators with GridSearchCV (n_estimators=[100,200,300,400,500]) on the training data, then tested on the test data. Then I also tried GridSearchCV with both 'n_estimators' and 'learning_rate'.
The best hyperparameters are n_estimators=500 and learning_rate=0.1 with best_score_=0.83. When I use this best estimator to classify, I get 100% correct results on the training data, but on the test data I only get a precision of [0.864 0.777 0.895 0.856 0.882] and a recall of [0.941 0.919 0.764 0.874 0.753]. I guess that n_estimators=500 is overfitting, but I don't know how to choose n_estimators and learning_rate at this step.
To reduce dimensionality I tried PCA, but n_components>3500 is needed to retain 95% of the variance, so I use downsampling instead, as shown in the code.
Sorry for the incomplete info, I hope it is clear this time. Many thanks!
Why not try Optuna for XGBoost hyperparameter tuning, with pruning and with XGBoost's early_stopping_rounds parameter?
Here is a notebook of mine, as a guide only. It requires XGBoost 1.6 or later, because early_stopping_rounds is handled differently in earlier versions, where it is passed to the fit() method.
https://www.kaggle.com/code/josephramon/sba-optuna-xgboost

How to correctly compute the optimal C and gamma for my SVM?

I am trying to compute the optimal C and Gamma for my SVM. When trying to run my script I get this error:
ValueError: Invalid parameter max_features for estimator SVC. Check the list of available parameters with estimator.get_params().keys().
I went through the docs to understand what n_estimators actually means so that I know what values I should fill in there. But it is not totally clear to me. Could someone tell me what this value should be so that I can run my script in order to find the optimal C and gamma?
my code:
if __name__=='__main__':
    fname = "/home/John/labels.csv"
    labels = pd.read_csv(fname, header=None).as_matrix()[:, 1]
    labels = map(itemgetter(1),
                 map(os.path.split,
                     map(os.path.dirname, labels)))
    fname = "/home/John/reps.csv"
    embeddings = pd.read_csv(fname, header=None).as_matrix()
    le = LabelEncoder().fit(labels)
    labelsNum = le.transform(labels)
    nClasses = len(le.classes_)
    svcClassifier = SVC(kernel='rbf', probability=True, C=10, gamma=10)
    #classifier = OneVsRestClassifier(svcClassifier).fit(embeddings, labelsNum)
    param_grid = {
        'n_estimators': [200, 700],
        'max_features': ['auto', 'sqrt', 'log2']
    }
    CV_rfc = GridSearchCV(estimator=svcClassifier, param_grid=param_grid, cv= 5)
    CV_rfc.fit(embeddings, labelsNum)
    print CV_rfc.best_params_
After trying I manually found out that in my case C=10 and gamma=10 give the best results. I would however like to use this function to find out what the optimal values should be.
My code is inspired by this post: How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)
The SVC class has no argument max_features or n_estimators as these are arguments of the RandomForest you used as a base for your code. If you want to optimize the model regarding C and gamma you can try to use:
param_grid = {
'C': [0.1, 0.5, 1.0],
'gamma': [0.1, 0.5, 1.0]
}
Furthermore, I also recommend searching for the optimal kernel, which can be rbf, linear or poly in the sklearn framework.
Edit: The values here are arbitrary and only meant to illustrate the general approach. You should try many different values, and both the values and their range depend on your situation.
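A runnable sketch of the same idea. The iris data and the log-spaced ranges are assumptions for illustration, not values recommended for the asker's embeddings; a logarithmic grid is simply a common way to cover several orders of magnitude of C and gamma with few candidates.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small built-in dataset standing in for the embeddings and labels (assumption).
X, y = load_iris(return_X_y=True)

# Log-spaced candidates: 5 values of C times 5 values of gamma = 25 combinations.
param_grid = {
    "C": np.logspace(-2, 2, 5),      # 0.01 ... 100
    "gamma": np.logspace(-3, 1, 5),  # 0.001 ... 10
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

If the best value sits on the edge of a range, extend the range in that direction and search again.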

XGBoost early stopping cv versus GridSearchCV

I am trying XGBoost to solve a regression problem. During hyperparameter tuning, XGBoost's early stopping CV never stops for my code/data, whatever value num_boost_round is set to. It also produces poorer RMSE scores than GridSearchCV. What am I doing wrong here?
And if I am not doing anything wrong, what advantages does early stopping CV offer over GridSearchCV?
GridSearchCV:
import math
def RMSE(y_true, y_pred):
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    print 'RMSE: %2.3f' % rmse
    return rmse
scorer = make_scorer(RMSE, greater_is_better=False)
cv_params = {'max_depth': [2,8], 'min_child_weight': [1,5]}
ind_params = {'learning_rate': 0.01, 'n_estimators': 1000,
'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8,
'reg_alpha':0, 'reg_lambda':1} #regularization => L1 : alpha, L2 : lambda
optimized_GBM = GridSearchCV(xgb.XGBRegressor(**ind_params),
cv_params,
scoring = scorer,
cv = 5, verbose=1,
n_jobs = 1)
optimized_GBM.fit(train_X, train_Y)
optimized_GBM.grid_scores_
Output:
[mean: -62.42736, std: 5.18004, params: {'max_depth': 2, 'min_child_weight': 1},
mean: -62.42736, std: 5.18004, params: {'max_depth': 2, 'min_child_weight': 5},
mean: -57.11358, std: 3.62918, params: {'max_depth': 8, 'min_child_weight': 1},
mean: -57.12148, std: 3.64145, params: {'max_depth': 8, 'min_child_weight': 5}]
XGBoost CV:
our_params = {'eta': 0.01, 'max_depth':8, 'min_child_weight':1,
'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8,
'objective': 'reg:linear', 'booster':'gblinear',
'eval_metric':'rmse',
'silent':False}
num_rounds=1000
cv_xgb = xgb.cv(params = our_params,
dtrain = train_mat,
num_boost_round = num_rounds,
nfold = 5,
metrics = ['rmse'], # Make sure you enter metrics inside a list or you may encounter issues!
early_stopping_rounds = 100, # Look for early stopping that minimizes error
verbose_eval = True)
print cv_xgb.shape
print cv_xgb.tail(5)
Output:
(1000, 4)
test-rmse-mean test-rmse-std train-rmse-mean train-rmse-std
995 89.937926 0.263546 89.932823 0.062540
996 89.937773 0.263537 89.932671 0.062537
997 89.937622 0.263526 89.932517 0.062535
998 89.937470 0.263516 89.932364 0.062532
999 89.937317 0.263510 89.932210 0.062525
I have the same issue with XGBoost ignoring num_boost_round (when early stopping is specified) and continuing to fit. I would wager that this is a bug.
As for the advantages of early stopping over GridSearchCV:
The advantage is that you don't have to try a series of values for num_boost_round; training stops automatically near the best one.
Early stopping is designed to find the optimum number of boosting iterations. If you specify a very large number for num_boost_round (e.g. 10000) and the best number of trees turns out to be 5261, it will stop at 5261+early_stopping_rounds, giving you a model that is pretty close to the optimum.
If you wanted to find the same optimum using GridSearchCV without early stopping, you would have to try many different values of num_boost_round (e.g. 100, 200, 300, ..., 5000, 5100, 5200, 5300, etc.). This would take much longer.
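The stopping rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not XGBoost's implementation; the function name and the error values are made up.

```python
def early_stopping_round(val_errors, patience):
    """Return (best_round, stop_round) for a sequence of validation errors.

    Training stops once `patience` consecutive rounds have passed without
    a new best validation error.
    """
    best_round, best_err = 0, float("inf")
    for rnd, err in enumerate(val_errors):
        if err < best_err:
            best_round, best_err = rnd, err
        elif rnd - best_round >= patience:
            return best_round, rnd
    # The error never stalled for `patience` rounds: all rounds were used.
    return best_round, len(val_errors) - 1


# Hypothetical validation errors: best at round 3, then no further improvement.
errors = [10, 8, 7, 6, 6.5, 6.2, 6.1, 6.3, 7]
print(early_stopping_round(errors, patience=3))  # -> (3, 6)
```

The run stops at round 3 + 3 = 6, which mirrors the "5261 + early_stopping_rounds" example. If the validation error keeps improving by tiny amounts every round, the rule never triggers and all rounds are used.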
The property that early stopping exploits is that there is an optimal number of boosting steps, after which the validation error will start to increase.
Why doesn't it work in your case?
It is impossible to say precisely without having the data, but it is probably because of a combination of the following:
num_boost_round is too small (and you run into the bug where xgboost resets and starts over, creating a never-ending loop)
early_stopping_rounds is too large (maybe your data has a strongly oscillating convergence behavior; try a smaller value and see whether the CV error is good enough)
something might be strange about your validation data
Why are you seeing different results between GridSearchCV and xgboost.cv?
Difficult to tell without a fully working example, but have you checked the default values for all the variables that you specify in only one of the two interfaces (like 'reg_alpha':0, 'reg_lambda':1, 'objective': 'reg:linear', 'booster':'gblinear'), and whether your definition of RMSE exactly matches xgboost's definition? For example, 'booster':'gblinear' appears only in the xgb.cv call, so that run fits a linear model, while XGBRegressor uses its default tree booster.
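One way to do that check mechanically is to diff the two parameter dictionaries after translating the scikit-learn wrapper names to the native ones. The helper below is hypothetical (not part of xgboost), and the alias table only covers the names used in the question. The dictionaries are copied from the question, with the best grid point (max_depth=8, min_child_weight=1) added to the scikit-learn side.

```python
# Wrapper name -> native xgboost name, for the parameters used in the question.
SKLEARN_TO_NATIVE = {"learning_rate": "eta", "reg_alpha": "alpha", "reg_lambda": "lambda"}


def diff_params(sklearn_params, native_params):
    """Return (only_sklearn, only_native, different) as sorted lists of keys."""
    translated = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in sklearn_params.items()}
    # n_estimators corresponds to num_boost_round, which xgb.cv takes separately.
    translated.pop("n_estimators", None)
    only_sklearn = sorted(set(translated) - set(native_params))
    only_native = sorted(set(native_params) - set(translated))
    shared = set(translated) & set(native_params)
    different = sorted(k for k in shared if translated[k] != native_params[k])
    return only_sklearn, only_native, different


ind_params = {"learning_rate": 0.01, "n_estimators": 1000, "seed": 0,
              "subsample": 0.8, "colsample_bytree": 0.8,
              "reg_alpha": 0, "reg_lambda": 1,
              "max_depth": 8, "min_child_weight": 1}
our_params = {"eta": 0.01, "max_depth": 8, "min_child_weight": 1, "seed": 0,
              "subsample": 0.8, "colsample_bytree": 0.8,
              "objective": "reg:linear", "booster": "gblinear",
              "eval_metric": "rmse", "silent": False}

only_sklearn, only_native, different = diff_params(ind_params, our_params)
print("only in GridSearchCV setup:", only_sklearn)
print("only in xgb.cv setup:", only_native)
print("set to different values:", different)
```

Every key that shows up on only one side falls back to that interface's default on the other side, so those are the first places to look when the two scores disagree.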
