Python XGBoost GridSearchCV killed, how to fix?

I'm new to xgboost in Python, and today I was trying to follow the tutorial here: https://jessesw.com/XG-Boost/.
I then tried xgboost on my own data, and it works fine without grid search. When I followed the tutorial to add the grid search, though, it does not seem to work. This is my code:
cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 500, 'seed': 0,
              'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'reg:linear'}
optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params),
                             cv_params,
                             cv=5, n_jobs=2, verbose=2)
optimized_GBM.fit(train_x, train['label'])
And I got this output:
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] max_depth=3, min_child_weight=1................................
//anaconda/bin/python.app: line 3: 906 Killed: 9 //anaconda/python.app/Contents/MacOS/python "$#"
Any advice would be appreciated!

In my case the cause was colsample_bytree. It was set to 0.1 while the total number of features was less than 10.
The failure was an assertion triggered when a tree tried to learn on a dataset with n samples and 0 features.
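To make that failure mode concrete, here is a minimal sketch of a pre-flight check you could run before fitting. It assumes the number of columns sampled per tree is roughly floor(n_features * colsample_bytree); that rounding rule is an assumption for illustration (XGBoost's exact behaviour may differ between versions), and check_colsample is a hypothetical helper, not part of XGBoost.

```python
import math


def check_colsample(n_features: int, colsample_bytree: float) -> int:
    """Return the estimated number of columns per tree, or raise if it would be 0.

    Assumption: columns per tree ~ floor(n_features * colsample_bytree).
    """
    if not 0.0 < colsample_bytree <= 1.0:
        raise ValueError("colsample_bytree must be in (0, 1]")
    n_sampled = math.floor(n_features * colsample_bytree)
    if n_sampled < 1:
        raise ValueError(
            f"colsample_bytree={colsample_bytree} with {n_features} features "
            "would leave 0 columns per tree"
        )
    return n_sampled


print(check_colsample(20, 0.8))

try:
    check_colsample(8, 0.1)  # fewer than 10 features with a 0.1 fraction
except ValueError as err:
    print("rejected:", err)
```

With fewer than 10 features, a fraction of 0.1 rounds down to zero columns under this assumption, which matches the situation described above.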

Related

GridSearchCV not giving the most optimal settings?

I'm working on an XGBoost model and I also tried GridSearchCV from scikit-learn. After searching for the best parameter settings, I got this result:
Fitting 4 folds for each of 2304 candidates, totalling 9216 fits
Best parameters found: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 11, 'n_estimators': 200, 'subsample': 0.9}
Now, after training a model with these settings and predicting on unseen test data (I used a train/test split), I got a result. As a test I changed some settings and got a better result than with the best parameters from the grid search.
Is this because the test set differs from the training data? If I had another test set, would different settings give the best score? I think the answer to both is yes, but how do other people deal with this effect?
You get the results from the grid search, but do you always use these settings, or do you do what I do? What would be your final settings for the model you want to deploy?
Hope to receive some inspirational thoughts :)
My final code for train/test after manual fine tuning:
xgb_skl_tuned_2 = xgb.XGBRegressor(
    colsample_bytree=0.7,
    subsample=0.9,
    learning_rate=0.3,
    max_depth=5,
    min_child_weight=13,
    gamma=10,
    n_estimators=50
)
xgb_skl_tuned_2.fit(X_train_2, y_train_2)
preds_2 = xgb_skl_tuned_2.predict(X_test_2)
# squared=False makes mean_squared_error return the RMSE
rmse = mean_squared_error(y_test_2, preds_2, squared=False)
print('Model RMSE: {}'.format(rmse))
I also checked this thread: parameters tuning with GridsearchCV not giving best result
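One way to look at this effect is to put each candidate's cross-validation score, together with its spread across folds, next to its score on the held-out test set. The sketch below uses Ridge on synthetic data as a stand-in for XGBRegressor so that it runs quickly; the dataset, the alpha grid, and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: any regression dataset would do).
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
    scoring="neg_root_mean_squared_error",
    cv=4,
)
search.fit(X_train, y_train)

rows = []
results = search.cv_results_
for params, mean_score, std_score in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    # Refit each candidate on the full training set and score it on the holdout.
    model = Ridge(**params).fit(X_train, y_train)
    test_rmse = float(np.sqrt(np.mean((y_test - model.predict(X_test)) ** 2)))
    rows.append({
        "params": params,
        "cv_rmse": float(-mean_score),  # scorer is negated RMSE
        "cv_std": float(std_score),     # spread across the 4 folds
        "test_rmse": test_rmse,
    })

for row in rows:
    print(row)
```

If the gap between two candidates is smaller than the fold-to-fold standard deviation, a different ranking on a single test set is not surprising. Note also that picking settings by the test score turns the test set into a second validation set, so a fresh holdout would be needed for an unbiased final estimate.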

Tune XGB Parameters

I am working on a project with a dataset of aircraft engines and their lifetimes. I need to use XGBRegressor and get the best possible score for my model on my validation data.
I am having trouble understanding the XGBRegressor documentation. Do you know how I could automate the search for the best parameters instead of testing everything by hand?
I have attached the part of my code related to XGB.
from xgboost import XGBRegressor
xgb = XGBRegressor(learning_rate = 0.3, max_depth = 7, n_estimators = 230, subsample = 0.7, colsample_bylevel = 0.7, colsample_bytree = 0.7, min_child_weight = 4, reg_alpha = 10, reg_lambda = 10)
xgb.fit(X_train, y_train)
The following answer will help you achieve this; you can add more hyperparameters or more estimators to test different approaches. If you set cv=5 it will do 5-fold cross-validation, but if you have a specific validation split and only want results on that split, you can pass this to cv:
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(len(X_train))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
cv_indices = [(train_idx, test_idx)]
Otherwise use cv=5 to do 5-fold CV while searching for parameters.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

dict_classifiers = {
    "XGB": XGBRegressor()
}
params = {
    "XGB": {'min_child_weight': [1, 5, 10],
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': [3, 4, 5], "n_estimators": [300, 600],
            "learning_rate": [0.001, 0.01, 0.1],
            }
}
for classifier_name in dict_classifiers.keys() & params:
    print("training: ", classifier_name)
    gridSearch = GridSearchCV(
        estimator=dict_classifiers[classifier_name],
        param_grid=params[classifier_name], cv=cv_indices)
    gridSearch.fit(X_train,  # should have shape (n_samples, n_features)
                   y_train.reshape((-1)))  # should be an array with shape (n_samples,)
    print(gridSearch.best_score_, gridSearch.best_params_)
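Before launching the search it helps to know how much work this grid asks for. Here is a small sketch using only the standard library; the grid is copied from the answer above, and count_fits is a hypothetical helper name.

```python
from math import prod

# Same grid as in the answer above.
param_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 1, 1.5, 2, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 4, 5],
    "n_estimators": [300, 600],
    "learning_rate": [0.001, 0.01, 0.1],
}


def count_fits(grid, n_splits):
    """Return (number of candidates, number of model fits during the search)."""
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * n_splits


print(count_fits(param_grid, n_splits=1))  # single train/validation split
print(count_fits(param_grid, n_splits=5))  # 5-fold CV
```

The grid has 2430 combinations, so the single-split variant needs 2430 fits and cv=5 needs 12150 (not counting the final refit on the best parameters). If that is too slow, trimming the lists or switching to a randomized search are the usual options.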

Gridsearch with LightGBM verbose displays just one line

I've been running a Randomized Grid Search in sklearn with LightGBM in Sagemaker, but when I run the fit line, it only displays one message that says Fitting 3 folds for each of 100 candidates, totalling 300 fits and nothing more, no messages showing the process or metrics.
Here's the code that I am using:
fit_params = {#'boosting_type': 'gbdt',
              #"objective":'binary',
              "eval_metric": 'auc',
              "eval_set": [(X_test, y_test)],
              'eval_names': ['valid'],
              'verbose': 60,
              #'is_unbalance':True,
              #'n_estimators':10000,
              "early_stopping_rounds": 500,
              'categorical_feature': 'auto'}

import lightgbm as lgb
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import RandomizedSearchCV

param_test = {'num_leaves': sp_randint(6, 50),
              'max_depth': sp_randint(3, 9),
              'min_child_samples': sp_randint(150, 600),
              'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
              'learning_rate': sp_uniform(loc=0.001, scale=0.01),
              'subsample': sp_uniform(loc=0.2, scale=0.8),
              'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
              'reg_alpha': [0, 1e-1, 1e-2, 5e-2, 0.5, 0.25, 1, 2, 5, 7, 10, 50, 100],
              'reg_lambda': [0, 1e-1, 1e-2, 5e-2, 0.5, 0.25, 1, 5, 10, 20, 50, 100],
              'class_weight': [None, 'balanced']
              }
n_HP_points_to_test = 100
clf1 = lgb.LGBMClassifier(boosting_type='gbdt', metric='None', objective='binary',
                          is_unbalance=True, n_estimators=10000, random_state=314, n_jobs=40)
gs1 = RandomizedSearchCV(
    estimator=clf1, param_distributions=param_test,
    n_iter=n_HP_points_to_test,
    scoring='roc_auc',
    cv=3,
    refit=True,
    random_state=314,
    n_jobs=30,
    verbose=100)
And the final line to fit and launch the Search:
gs1.fit(X_train, y_train, **fit_params)
I've read other questions saying that the output is printed to the log terminal, but I can't find it in SageMaker or on my local machine.
Has any of you come across this situation when searching for the optimal parameters? Do you have another piece of code for running GridSearchCV or RandomizedSearchCV with LightGBM that works for you?
I forgot to mention that the fitting never finishes. I even added a line to write best_params_ to a .txt file, and the file never appeared. Meanwhile all the cores I used (about 40) were still busy in htop, but 15 hours for this process seems excessive.
Thank you so much in advance!!
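A back-of-the-envelope look at what this configuration requests may help explain the runtime. The sketch below only multiplies numbers taken from the question (100 candidates, 3 folds, n_jobs=30 on the search, n_jobs=40 on the model, n_estimators=10000, about 40 cores); search_workload and threads_per_worker are hypothetical helpers, and whether thread oversubscription is what actually stalls this run is an assumption, not something the question confirms.

```python
def search_workload(n_iter, cv_folds, search_n_jobs, model_n_jobs, max_rounds):
    """Summarise how much work a randomized search configuration asks for."""
    total_fits = n_iter * cv_folds
    return {
        "total_fits": total_fits,
        # Each parallel search worker starts a model that asks for its own threads.
        "threads_requested": search_n_jobs * model_n_jobs,
        # Upper bound: every fit runs all rounds (early stopping may cut this).
        "max_boosting_rounds": total_fits * max_rounds,
    }


def threads_per_worker(n_cores, search_n_jobs):
    """Model threads per search worker that keep the total within n_cores."""
    return max(1, n_cores // search_n_jobs)


workload = search_workload(
    n_iter=100, cv_folds=3, search_n_jobs=30, model_n_jobs=40, max_rounds=10000
)
print(workload)
print(threads_per_worker(n_cores=40, search_n_jobs=30))
```

Under these numbers the search asks for 300 fits, up to 1200 threads at once on roughly 40 cores, and up to 3,000,000 boosting rounds in total. The learning rate is also drawn from the range 0.001 to 0.011 with a 500-round early-stopping window, so individual fits may well run for thousands of rounds. Setting n_jobs on only one of the two levels is a common way to avoid the nested parallelism.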

How to deal with overfitting of xgboost classifier?

I use xgboost for multi-class classification of spectrogram images (data link: automotive target classification). There are 5 classes; the training data includes 20000 samples (each class 5000 samples), the test data includes 5000 samples (each class 1000 samples), and the original image size is 144*400. This is my code snippet:
train_data, train_label, test_data, test_label = load_data(data_dir, resampleX=4, resampleY=5)
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
cv_params = {'n_estimators': [100,200,300,400,500], 'learning_rate': [0.01, 0.1]}
other_params = {'learning_rate': 0.1, 'n_estimators': 100,
                'max_depth': 5, 'min_child_weight': 1, 'seed': 27, 'nthread': 6,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0,
                'reg_alpha': 0, 'reg_lambda': 1,
                'objective': 'multi:softmax', 'num_class': 5}
model = XGBClassifier(**other_params)
classifier = GridSearchCV(estimator=model, param_grid=cv_params, cv=3, verbose=1, n_jobs=6)
classifier.fit(train_data, train_label)
print("The best parameters are %s with a score of %0.2f" % (classifier.best_params_, classifier.best_score_))
During hyperparameter tuning, following https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/, I first tuned n_estimators with GridSearchCV (n_estimators=[100,200,300,400,500]) on the training data, then tested on the test data. Then I also tried GridSearchCV with both 'n_estimators' and 'learning_rate'.
The best hyperparameters are n_estimators=500 and learning_rate=0.1 with best_score_=0.83. When I use this best estimator to classify, I get 100% correct results on the training data, but on the test data only precision of [0.864 0.777 0.895 0.856 0.882] and recall of [0.941 0.919 0.764 0.874 0.753]. I guess n_estimators=500 is overfitting, but I don't know how to choose n_estimators and learning_rate at this step.
To reduce dimensionality I tried PCA, but more than 3500 components are needed to retain 95% of the variance, so I use downsampling instead, as shown in the code.
Sorry for the incomplete info; I hope it is clear this time. Many thanks!
Why not try Optuna for XGBoost hyperparameter tuning, with pruning and XGBoost's early_stopping_rounds parameter?
Here is a notebook of mine, as a guide only. It needs XGBoost 1.6 or later, because in earlier versions early_stopping_rounds is handled differently (it is passed to the fit() method).
https://www.kaggle.com/code/josephramon/sba-optuna-xgboost
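For readers unfamiliar with early_stopping_rounds: the idea is to stop adding trees once the validation metric has not improved for a fixed number of rounds, and to keep the best round. This also gives a data-driven value for n_estimators instead of a grid over it. Below is a plain-Python illustration of that rule; it is not XGBoost's actual implementation, and best_round_with_patience and the validation scores are made up.

```python
def best_round_with_patience(scores, patience, higher_is_better=True):
    """Return (best_round, last_round_run) for a list of per-round validation scores.

    Training stops once `patience` rounds in a row fail to beat the best score.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best_idx, best = 0, scores[0]
    for i in range(1, len(scores)):
        improved = scores[i] > best if higher_is_better else scores[i] < best
        if improved:
            best_idx, best = i, scores[i]
        elif i - best_idx >= patience:
            return best_idx, i  # stopped early at round i
    return best_idx, len(scores) - 1  # ran through every round


# Hypothetical validation accuracy after each boosting round.
val_scores = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79]

print(best_round_with_patience(val_scores, patience=3))
print(best_round_with_patience(val_scores, patience=10))
```

With patience=3 the loop stops at round 5 and keeps round 2, never seeing the later improvement at round 6; with patience=10 it runs to the end and keeps round 6. That trade-off is why the patience value is usually scaled with the learning rate.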

GridSearchCV with multiple scoring functions?

Depending on the scoring function you pass to GridSearchCV, the results for grid.best_estimator_ might differ. I am wondering whether it is possible to run a single GridSearch in sklearn and get several scores (the actual values of each scoring function) in the output?
Something like:
clf = GridSearchCV(model,param_grid,scoring=['mean_square_error','r2_score'])
And as output get:
clf.grid_scores_:
[MSE mean: -0.00000, R2 mean: -0.01975: {'max_depth': 2, 'learning_rate': 0.05, 'min_child_weight': 4, 'n_estimators': 25},
 MSE mean: -0.00001, R2 mean: -0.01975: {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4, 'n_estimators': 25},
 MSE mean: -0.00002, R2 mean: -0.01975: {'max_depth': 4, 'learning_rate': 0.05, 'min_child_weight': 4, 'n_estimators': 25}, etc.]
The idea is to get a score for every evaluation metric at every combination of model hyperparameters. Assume that I have 10 different scoring functions for GridSearchCV. It would be extremely time consuming to run GridSearchCV 10 times to see which model parameters are best for each scoring function. The idea is to run it only once and get a number (score) for every scoring function within grid_scores_.
It seems that this was almost implemented in sklearn in 2015, but unfortunately the project was never finished: https://github.com/scikit-learn/scikit-learn/pull/2759
I'm looking for a way of doing this on my own.
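For what it's worth, newer scikit-learn releases (0.19 and later, as far as I know) accept several metrics in one search: scoring takes a list or dict of scorers, refit names the metric used to pick best_params_, and the per-metric results land in cv_results_ rather than grid_scores_. A minimal sketch, using DecisionTreeRegressor on synthetic data as a stand-in for the real model; the dataset, the grid, and the scorer labels "mse" and "r2" are chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}

# Dict keys become suffixes in cv_results_, e.g. "mean_test_mse".
scoring = {"mse": "neg_mean_squared_error", "r2": "r2"}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring=scoring,
    refit="r2",  # with several metrics, refit must name one (or be False)
    cv=3,
)
search.fit(X, y)

results = search.cv_results_
for params, mse, r2 in zip(
    results["params"], results["mean_test_mse"], results["mean_test_r2"]
):
    print(f"MSE mean: {mse:.3f}, R2 mean: {r2:.3f}: {params}")
```

Note that the built-in scorer names are 'neg_mean_squared_error' and 'r2' rather than 'mean_square_error' and 'r2_score', and that the MSE scorer is negated so that larger is always better.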
