I use XGBoost to do multi-class classification of spectrogram images (data link: automotive target classification). There are 5 classes; the training data contains 20000 samples (5000 per class), the test data contains 5000 samples (1000 per class), and the original image size is 144*400. This is my code snippet:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

train_data, train_label, test_data, test_label = load_data(data_dir, resampleX=4, resampleY=5)
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
cv_params = {'n_estimators': [100,200,300,400,500], 'learning_rate': [0.01, 0.1]}
other_params = {'learning_rate': 0.1, 'n_estimators': 100,
'max_depth': 5, 'min_child_weight': 1, 'seed': 27, 'nthread': 6,
'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0,
'reg_alpha': 0, 'reg_lambda': 1,
'objective': 'multi:softmax', 'num_class': 5}
model = XGBClassifier(**other_params)
classifier = GridSearchCV(estimator=model, param_grid=cv_params, cv=3, verbose=1, n_jobs=6)
classifier.fit(train_data, train_label)
print("The best parameters are %s with a score of %0.2f" % (classifier.best_params_, classifier.best_score_))
During hyperparameter tuning, following https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/, I first tuned n_estimators with GridSearchCV (n_estimators=[100, 200, 300, 400, 500]) on the training data and then evaluated on the test data. After that I also tried GridSearchCV with both 'n_estimators' and 'learning_rate'.
The best hyperparameters were n_estimators=500 and learning_rate=0.1 with best_score_=0.83. When I use this best estimator to classify, I get 100% accuracy on the training data, but on the test data the precision is only [0.864 0.777 0.895 0.856 0.882] and the recall is [0.941 0.919 0.764 0.874 0.753]. I suspect n_estimators=500 is overfitting, but I don't know how to choose n_estimators and learning_rate at this step.
For dimensionality reduction I tried PCA, but more than 3500 components are needed to reach 95% explained variance, so I use downsampling instead, as shown in the code.
Sorry for the incomplete info, hope this time is clear. Many thanks!
Why not try Optuna for XGBoost hyperparameter tuning, with pruning and with XGBoost's early_stopping_rounds parameter?
Here is a notebook of mine as a guide only. The XGBoost version must be 1.6 or later, though, as early_stopping_rounds is handled differently (passed to the fit() method) in XGBoost versions below 1.6.
https://www.kaggle.com/code/josephramon/sba-optuna-xgboost
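For illustration, here is a minimal sketch of that idea (pruning omitted for brevity), assuming XGBoost >= 1.6 so that early_stopping_rounds can be passed to the constructor, and reusing the train_data/train_label arrays from the question; the search ranges and validation split below are only examples, not tuned values:

import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out a validation set for early stopping (the split itself is illustrative).
X_tr, X_val, y_tr, y_val = train_test_split(train_data, train_label, test_size=0.2,
                                            stratify=train_label, random_state=27)

def objective(trial):
    model = xgb.XGBClassifier(
        n_estimators=1000,  # generous cap; early stopping picks the effective number of trees
        learning_rate=trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        max_depth=trial.suggest_int('max_depth', 3, 8),
        subsample=trial.suggest_float('subsample', 0.6, 1.0),
        colsample_bytree=trial.suggest_float('colsample_bytree', 0.6, 1.0),
        early_stopping_rounds=50,  # constructor argument in XGBoost >= 1.6
        n_jobs=6)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    return accuracy_score(y_val, model.predict(X_val))

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)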
I'm working on an XGBoost model and I also tried GridSearchCV from scikit-learn. After I did a search for the most optimal parameter settings, I got this result:
Fitting 4 folds for each of 2304 candidates, totalling 9216 fits
Best parameters found: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 11, 'n_estimators': 200, 'subsample': 0.9}
Now, after training a model with these settings and predicting on unseen test data (a train/test split was used), I got a result. As a test I changed some settings, and then I got a better result than with the "most optimal" parameters from the grid search.
Is this because the test set is different from the training data? If I had another test set, would the best-scoring settings be different again? I think both questions can be answered with yes, but how do other people deal with this effect?
You get the results from the grid search, but do you always use those settings, or do you do the same as I do? What will be your final settings for the model you want to deploy?
Hope to receive some inspirational thoughts:)
My final code for train/test after manual fine tuning:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

xgb_skl_tuned_2 = xgb.XGBRegressor(
    colsample_bytree=0.7,
    subsample=0.9,
    learning_rate=0.3,
    max_depth=5,
    min_child_weight=13,
    gamma=10,
    n_estimators=50
)
xgb_skl_tuned_2.fit(X_train_2, y_train_2)
preds_2 = xgb_skl_tuned_2.predict(X_test_2)

rmse = mean_squared_error(y_test_2, preds_2, squared=False)  # squared=False returns RMSE
print('Model RMSE: {}'.format(rmse))
Also checked this thread: parameters tuning with GridsearchCV not giving best result
I am experiencing a problem where fine-tuning the hyperparameters using GridSearchCV doesn't really improve my classifiers. I figured the improvement should be bigger than that; the biggest improvement I've gotten for a classifier with my current code is around ±0.03. I have a dataset with eight columns and an unbalanced binary outcome. For scoring I use f1 and I use KFold with 10 splits. I was hoping someone could spot something that is off or that I should look at. Thank you!
I use the following code:
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score
from sklearn.metrics import make_scorer, f1_score

# models, classDict, random_state, X and y are defined elsewhere in the script.
model_parameters = {
    "GaussianNB": {
    },
    "DecisionTreeClassifier": {
        'min_samples_leaf': range(5, 9),
        'max_depth': [None, 1, 2, 3, 4]  # max_depth must be None or >= 1
    },
    "KNeighborsClassifier": {
        'n_neighbors': range(1, 10),
        'weights': ["distance", "uniform"]
    },
    "SVM": {
        'kernel': ["poly"],
        'C': np.linspace(0.5, 15, 30)  # C must be strictly positive
    },
    "LogisticRegression": {
        'C': np.linspace(0.5, 15, 30),  # C must be strictly positive
        'penalty': ["l1", "l2", "elasticnet", "none"]
    }
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

n_splits = 10
scoring_method = make_scorer(f1_score, average="micro")
cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)

for model_name, parameters in model_parameters.items():
    # models is a dict with the 5 classifiers
    model = models[model_name]
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1,
                               scoring=scoring_method, verbose=False).fit(X_train, y_train)
    cvScore = cross_val_score(grid_search.best_estimator_, X_test, y_test,
                              cv=cv, scoring='f1').mean()
    classDict[model_name] = cvScore
If your classes are unbalanced, then when you do KFold you should keep the proportion between the two targets in each fold.
Having unbalanced folds can lead to very poor results.
Check the Stratified K-Folds cross-validator:
Provides train/test indices to split data in train/test sets.
This cross-validation object is a variation of KFold that returns
stratified folds. The folds are made by preserving the percentage of
samples for each class.
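As a minimal sketch (reusing the n_splits and random_state variables from the question), the change is a drop-in replacement of the KFold object:

from sklearn.model_selection import StratifiedKFold

# StratifiedKFold preserves the class proportions in every fold.
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)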
There are also a lot of techniques to handle unbalanced datasets. Depending on the context:
upsampling the minority class (using, for example, resample from sklearn; see the sketch at the end of this answer)
undersampling the majority class (libraries such as imbalanced-learn have useful tools for both under- and oversampling)
handling the imbalance within your specific ML model
For example, in SVC there is an argument you can set when you create the model, class_weight='balanced':
clf_3 = SVC(kernel='linear',
            class_weight='balanced',  # penalize minority-class errors more heavily
            probability=True)
which penalizes errors on the minority class more heavily.
You can change your config as follows (note that in a GridSearchCV param_grid every value must be a list of candidates):
"SVM": {
    'kernel': ["poly"],
    'C': np.linspace(0.5, 15, 30),
    'class_weight': ['balanced']
}
For LogisticRegression you can set the weights explicitly instead, reflecting the proportion of your classes:
LogisticRegression(class_weight={0: 1, 1: 10})  # if the problem is binary
changing the grid search dict in this way:
"LogisticRegression": {
    'C': np.linspace(0.5, 15, 30),
    'penalty': ["l1", "l2", "elasticnet", "none"],
    'class_weight': [{0: 1, 1: 10}]
}
Anyway, the approach depends on the model used. For a neural network, for example, you can change the loss function to penalize the minority class with a weighted calculation (the same idea as for logistic regression).
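For completeness, here is a minimal sketch of the upsampling option mentioned at the top of this answer, assuming a binary problem where class 1 is the minority and X_train/y_train are NumPy arrays as in the question:

import numpy as np
from sklearn.utils import resample

# Split the training set by class.
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_maj, y_maj = X_train[y_train == 0], y_train[y_train == 0]

# Resample the minority class with replacement until it matches the majority size.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])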
I am doing logistic regression with Python scikit-learn. I have an imbalanced dataset where 2/3 of the data points have label y=0 and 1/3 have label y=1.
I do a stratified split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, stratify=y)
My grid for the hyperparameter search is:
grid = {
'penalty': ['l1', 'l2', 'elasticnet'],
'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
Then I do a grid search including class_weight='balanced':
grid_search = GridSearchCV(
estimator=LogisticRegression(
max_iter=200,
random_state=1111111111,
class_weight='balanced',
multi_class='auto',
fit_intercept=True
),
param_grid=grid,
scoring=score,
cv=5,
refit=True
)
My first question is about the score. This is the method GridSearchCV uses to choose the "best" classifier, i.e. to find the best hyperparameters. Since I fit the LogisticRegression with class_weight='balanced', should I use the classic score='accuracy', or do I still need to use score='balanced_accuracy'? And why?
So I go on and find the best classifier:
best_clf = grid_search.fit(X_train, y_train)
y_pred = best_clf.predict(X_test)
And now I want to calculate evaluation metrics, for example also the accuracy (again) and the f1-score.
Second question: Do I need to use the "normal" accuracy/f1 here, or the balanced/weighted accuracy/f1?
"Normal":
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label=1, average='binary')
Or balanced/weighted:
acc_weighted = balanced_accuracy_score(y_test, y_pred, sample_weight=y_weights)
f1_weighted = f1_score(y_test, y_pred, sample_weight=y_weights, average='weighted')
If I should be using the balanced/weighted version, my third question concerns the parameter sample_weight=y_weights. How should I set the weights? To achieve balance (although, as I said, I am not sure whether setting class_weight='balanced' already achieves that), I should weight label y=0 with 1/3 and y=1 with 2/3, right? Like this:
y_weights = [x*(1/3)+(1/3) for x in y_test]
Or should I enter here the real distribution and scale label y=0 with 2/3 and label y=1 with 1/3? Like this:
y_weights = [x*(-1/3)+(2/3) for x in y_test]
My final question is: For evaluation, what would be the baseline accuracy that I compare my accuracy to?
0.33 (class 1), 0.5 (after balancing), or 0.66 (class 0)?
Edit: By baseline I mean a model that naively classifies all data as "1" or a model that classifies all data as "0". A problem is that I don't know whether I can choose freely. For example, say I get an accuracy or a balanced_accuracy of 0.66. If I compare with the baseline "always 1" (accuracy 0.33?), my model is better; if I compare with the baseline "always 0" (accuracy 0.66?), my model is worse.
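For reference, both naive baselines can be computed with scikit-learn's DummyClassifier (a sketch only, assuming the X_train/X_test split from above):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

for label in (0, 1):
    dummy = DummyClassifier(strategy='constant', constant=label).fit(X_train, y_train)
    y_dummy = dummy.predict(X_test)
    print(label,
          accuracy_score(y_test, y_dummy),           # roughly 0.66 for "always 0", 0.33 for "always 1"
          balanced_accuracy_score(y_test, y_dummy))  # 0.5 for either constant baseline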
Thank you all very much for helping me.
I am currently working with a small dataset of 20x300. Since I have so few datapoints, I was wondering if I could use an approach similar to leave-one-out cross-validation but for testing.
Here's what I was thinking:
1. train/test split the data, with only one data point in the test set
2. train the model on the training data, potentially with grid search / cross-validation
3. use the best model from step 2 to make a prediction on the one data point and save the prediction in an array
4. repeat the previous steps until all the data points have been in the test set
5. calculate your preferred metric of choice (accuracy, f1-score, AUC, etc.) using these predictions
The pros of this approach would be:
You don't have to "reserve/waste" data for testing, so you can train with more data points.
The cons would be:
This approach suffers from potential(?) data leakage.
You are calculating an accuracy metric from a bunch of predictions that potentially came from different models, due to the grid searches, so I'm not sure how accurate it is going to be.
I have tried the standard approach of train/test splitting, but since I need to hold out at least 5 points for testing, I don't have enough points for training and the ROC AUC becomes very bad.
Here's some code so you can see what I'm talking about:
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_curve, auc

loo = LeaveOneOut()
y_preds = []

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    param_grid = {'C': [1e2, 5e2, 1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6],
                  'gamma': [0.00001, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1],
                  'degree': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                  'kernel': ['rbf', 'linear', 'poly'],
                  'class_weight': ['balanced', None]}

    model = SVC(probability=True)
    skf = StratifiedKFold(n_splits=num_cv_folds)
    grid_search = GridSearchCV(model, param_grid,
                               n_jobs=-1,
                               cv=skf,
                               scoring='roc_auc',
                               verbose=0,
                               refit=True)  # the old iid argument was removed in recent scikit-learn
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_

    y_pred_test = best_model.predict_proba(X_test)
    y_preds.append(y_pred_test[0][1])  # probability of the positive class (column 1), matching pos_label=1

fpr, tpr, thresholds = roc_curve(y, y_preds, pos_label=1)
auc_roc1 = auc(fpr, tpr)
I would really appreciate some feedback about whether this approach is actually feasible or not, and why.
I am trying XGBoost to solve a regression problem. In the process of hyperparameter tuning, XGBoost's early-stopping cv never stops for my code/data, whatever the parameter num_boost_round is set to. Also, it produces poorer RMSE scores than GridSearchCV. What am I doing wrong here?
And if I am not doing anything wrong, what advantages does early-stopping cv offer over GridSearchCV?
GridSearchCV:
import math
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer

def RMSE(y_true, y_pred):
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    print('RMSE: %2.3f' % rmse)
    return rmse

scorer = make_scorer(RMSE, greater_is_better=False)

cv_params = {'max_depth': [2, 8], 'min_child_weight': [1, 5]}
ind_params = {'learning_rate': 0.01, 'n_estimators': 1000,
              'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'reg_alpha': 0, 'reg_lambda': 1}  # regularization => L1: alpha, L2: lambda

optimized_GBM = GridSearchCV(xgb.XGBRegressor(**ind_params),
                             cv_params,
                             scoring=scorer,
                             cv=5, verbose=1,
                             n_jobs=1)
optimized_GBM.fit(train_X, train_Y)
optimized_GBM.grid_scores_  # grid_scores_ is from an older scikit-learn; recent versions use cv_results_
Output:
[mean: -62.42736, std: 5.18004, params: {'max_depth': 2, 'min_child_weight': 1},
mean: -62.42736, std: 5.18004, params: {'max_depth': 2, 'min_child_weight': 5},
mean: -57.11358, std: 3.62918, params: {'max_depth': 8, 'min_child_weight': 1},
mean: -57.12148, std: 3.64145, params: {'max_depth': 8, 'min_child_weight': 5}]
XGBoost CV:
our_params = {'eta': 0.01, 'max_depth':8, 'min_child_weight':1,
'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8,
'objective': 'reg:linear', 'booster':'gblinear',
'eval_metric':'rmse',
'silent':False}
num_rounds=1000
cv_xgb = xgb.cv(params = our_params,
dtrain = train_mat,
num_boost_round = num_rounds,
nfold = 5,
metrics = ['rmse'], # Make sure you enter metrics inside a list or you may encounter issues!
early_stopping_rounds = 100, # Look for early stopping that minimizes error
verbose_eval = True)
print(cv_xgb.shape)
print(cv_xgb.tail(5))
Output:
(1000, 4)
test-rmse-mean test-rmse-std train-rmse-mean train-rmse-std
995 89.937926 0.263546 89.932823 0.062540
996 89.937773 0.263537 89.932671 0.062537
997 89.937622 0.263526 89.932517 0.062535
998 89.937470 0.263516 89.932364 0.062532
999 89.937317 0.263510 89.932210 0.062525
I have the same issue with XGBoost ignoring num_boost_round (when early stopping is specified) and continuing to fit. I would wager that this is a bug.
As for the advantages of early stopping over GridSearchCV:
The advantage is that you don't have to try a series of values for num_boost_round; you automatically stop at the best one.
Early stopping is designed to find the optimum number of boosting iterations. If you specify a very large number for num_boost_round (e.g. 10000) and the best number of trees turns out to be 5261, it will stop at 5261 + early_stopping_rounds, giving you a model that is pretty close to the optimum.
If you wanted to find the same optimum using GridSearchCV without early stopping, you would have to try many different values of num_boost_round (e.g. 100, 200, 300, ..., 5000, 5100, 5200, 5300, ...). This would take much longer.
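As an illustration only (not based on the asker's data), the pattern looks like this in code; train_mat is the DMatrix from the question, num_boost_round is set deliberately high, and note that reg:linear has since been renamed reg:squarederror in recent XGBoost releases:

import xgboost as xgb

cv_results = xgb.cv(params={'eta': 0.01, 'max_depth': 8,
                            'objective': 'reg:squarederror', 'eval_metric': 'rmse'},
                    dtrain=train_mat,            # DMatrix from the question (assumed available)
                    num_boost_round=10000,       # deliberately generous upper bound
                    nfold=5,
                    early_stopping_rounds=50,
                    verbose_eval=False)

# With early stopping, xgb.cv keeps rows only up to the best iteration,
# so the length of the returned DataFrame gives the optimal number of rounds.
best_num_rounds = len(cv_results)
print(best_num_rounds, cv_results['test-rmse-mean'].min())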
The property that early stopping exploits is that there is an optimal number of boosting steps after which the validation error starts to increase. So ...
Why doesn't it work in your case?
It is impossible to say precisely without the data, but it is probably a combination of the following:
num_boost_round is too small (and you run into the bug where xgboost resets and starts over, creating a never-ending loop)
early_stopping_rounds is too large (maybe your data has strongly oscillating convergence behavior; try a smaller value and see whether the CV error gets good enough)
something might be strange about your validation data
Why are you seeing different results between GridSearchCV and xgboost.cv?
It is difficult to tell without a fully working example, but have you checked all the default values for the variables that you only specify in one of the two interfaces (like 'reg_alpha': 0, 'reg_lambda': 1, 'objective': 'reg:linear', 'booster': 'gblinear'), and whether your definition of RMSE exactly matches xgboost's definition?
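One way to sanity-check that last point (a sketch only, assuming XGBoost >= 1.6 for the constructor-level eval_metric and a hypothetical held-out split valid_X/valid_Y): fit the same booster through the sklearn wrapper with eval_metric='rmse' and compare XGBoost's reported validation RMSE against the custom RMSE() function defined in the question.

import xgboost as xgb

reg = xgb.XGBRegressor(learning_rate=0.01, n_estimators=200, eval_metric='rmse')
reg.fit(train_X, train_Y, eval_set=[(valid_X, valid_Y)], verbose=False)

print(reg.evals_result()['validation_0']['rmse'][-1])  # XGBoost's own rmse on the validation set
print(RMSE(valid_Y, reg.predict(valid_X)))             # the custom definition from the question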