I've been running a randomized grid search (RandomizedSearchCV) in sklearn with LightGBM on SageMaker, but when I run the fit line it only displays one message, Fitting 3 folds for each of 100 candidates, totalling 300 fits, and nothing more: no messages showing progress or metrics.
Here's the code that I am using:
fit_params = {
    # 'boosting_type': 'gbdt',
    # 'objective': 'binary',
    'eval_metric': 'auc',
    'eval_set': [(X_test, y_test)],
    'eval_names': ['valid'],
    'verbose': 60,
    # 'is_unbalance': True,
    # 'n_estimators': 10000,
    'early_stopping_rounds': 500,
    'categorical_feature': 'auto',
}
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

param_test = {
    'num_leaves': sp_randint(6, 50),
    'max_depth': sp_randint(3, 9),
    'min_child_samples': sp_randint(150, 600),
    'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
    'learning_rate': sp_uniform(loc=0.001, scale=0.01),
    'subsample': sp_uniform(loc=0.2, scale=0.8),
    'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
    'reg_alpha': [0, 1e-1, 1e-2, 5e-2, 0.5, 0.25, 1, 2, 5, 7, 10, 50, 100],
    'reg_lambda': [0, 1e-1, 1e-2, 5e-2, 0.5, 0.25, 1, 5, 10, 20, 50, 100],
    'class_weight': [None, 'balanced'],
}
n_HP_points_to_test = 100

clf1 = lgb.LGBMClassifier(boosting_type='gbdt', metric='None', objective='binary',
                          is_unbalance=True, n_estimators=10000, random_state=314,
                          n_jobs=40)

gs1 = RandomizedSearchCV(
    estimator=clf1,
    param_distributions=param_test,
    n_iter=n_HP_points_to_test,
    scoring='roc_auc',
    cv=3,
    refit=True,
    random_state=314,
    n_jobs=30,
    verbose=100)
And the final line to fit and launch the Search:
gs1.fit(X_train, y_train, **fit_params)
I've read other questions that say the output is printed to the terminal log, but I can't find it in SageMaker or on my local machine.
Have any of you come across this situation when searching for optimal parameters? Do you have another piece of code that runs GridSearchCV or RandomizedSearchCV successfully with LightGBM?
I forgot to mention that the fit never finishes. I even added a line to write best_params_ to a .txt file, and it never appeared; meanwhile all the cores I used, around 40, were still busy in htop. Fifteen hours for this process seems excessive.
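One thing worth noting: with n_jobs=40 on the estimator and n_jobs=30 on the search, up to 1200 threads compete for the machine, which by itself can stall progress, and joblib workers print to their own streams, which a SageMaker notebook may never show. A minimal sketch of a leaner configuration, assuming a recent LightGBM (>= 3.3) where per-iteration logging and early stopping are passed as callbacks:
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

# Keep per-model threads low; the search parallelizes across candidates.
clf = lgb.LGBMClassifier(objective='binary', n_estimators=10000,
                         random_state=314, n_jobs=2)

fit_params = {
    'eval_set': [(X_test, y_test)],
    'eval_metric': 'auc',
    # Verbosity and early stopping go through callbacks in recent versions:
    'callbacks': [lgb.log_evaluation(period=60),
                  lgb.early_stopping(stopping_rounds=500)],
}

gs = RandomizedSearchCV(clf, param_test, n_iter=100, scoring='roc_auc',
                        cv=3, n_jobs=4, verbose=3)  # 4 workers x 2 threads = 8 cores
gs.fit(X_train, y_train, **fit_params)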
Thank you so much in advance!!
Related
I'm trying to tune parameters for XGBoost Text Classification.
parameters = {
    'max_depth': range(2, 11, 2),
    'n_estimators': range(200, 250, 10),
    'learning_rate': [0.1, 0.01, 0.09],
    'min_child_weight': range(0, 20, 4),
}
grid_search = GridSearchCV(
    estimator=model,
    param_grid=parameters,
    scoring='roc_auc',
    n_jobs=10,
    cv=10,
    verbose=True,
)
However, the code runs for hours and gives no results. Any recommendations about the ranges, or anything else, would be appreciated!
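For reference, the grid above spans 5 * 5 * 3 * 5 = 375 candidates, which at cv=10 means 3750 fits, so hours of runtime are expected. A hedged sketch of a cheaper alternative, reusing the model and parameters names above and assuming training arrays named X_train and y_train: sample a fixed number of candidates with RandomizedSearchCV and fewer folds.
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    estimator=model,                 # the XGBoost model from the question
    param_distributions=parameters,  # lists/ranges are sampled uniformly
    n_iter=30,                       # 30 candidates x 3 folds = 90 fits
    scoring='roc_auc',
    cv=3,
    n_jobs=10,
    verbose=2,
    random_state=0,
)
search.fit(X_train, y_train)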
I am working on a project with a dataset of aircraft engines and their lifetimes. I need to use XGBRegressor to get the best possible performance on my validation data.
I am having trouble understanding the XGBRegressor documentation, and I was wondering how I could automate the search for optimal parameters instead of testing everything by hand.
I attached a part of my code related to XGB.
from xgboost import XGBRegressor

xgb = XGBRegressor(learning_rate=0.3, max_depth=7, n_estimators=230,
                   subsample=0.7, colsample_bylevel=0.7, colsample_bytree=0.7,
                   min_child_weight=4, reg_alpha=10, reg_lambda=10)
xgb.fit(X_train, y_train)
The following answer will help you achieve this; you can add more hyperparameters or more classifiers to test different approaches. If you set cv=5 it will do 5-fold cross-validation, but if you have a specific validation set and only want to evaluate on it, you can build the indices and pass them to cv:
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(len(X_train))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
cv_indices = [(train_idx, test_idx)]
Otherwise, use cv=5 to do 5-fold CV while searching for parameters.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

dict_classifiers = {
    "XGB": XGBRegressor()
}

params = {
    "XGB": {'min_child_weight': [1, 5, 10],
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': [3, 4, 5],
            'n_estimators': [300, 600],
            'learning_rate': [0.001, 0.01, 0.1],
            }
}

# iterate over the classifiers that also have a parameter grid defined
for classifier_name in dict_classifiers.keys() & params:
    print("training: ", classifier_name)
    gridSearch = GridSearchCV(
        estimator=dict_classifiers[classifier_name],
        param_grid=params[classifier_name],
        cv=cv_indices)
    gridSearch.fit(X_train,                # should have shape (n_samples, n_features)
                   y_train.reshape((-1)))  # should be an array of shape (n_samples,)
    print(gridSearch.best_score_, gridSearch.best_params_)
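An equivalent way to pin a single fixed validation fold, if you prefer a named sklearn helper over a hand-built list of index tuples, is PredefinedSplit (a sketch; test_idx comes from the split above):
import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = np.full(len(X_train), -1)  # -1 = row is never used for validation
test_fold[test_idx] = 0                # fold 0 = the fixed validation rows
ps = PredefinedSplit(test_fold)        # pass cv=ps to GridSearchCV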
I'm using xgb and have tuned my hyperparameters using hyperopt. However, when I plot the train set and validation set curves after fitting my model, I notice that the lines intersect. What does that mean? Also, the validation line doesn't start near the training line.
I'm using early_stopping_rounds = 20 when I fit my model prior to plotting this graph.
The hyperparameters I got from HyperOpt are as follows:
{'booster': 'gbtree',
'colsample_bytree': 0.8814444518931106,
'eta': 0.0712456143241873,
'eval_metric': 'ndcg',
'gamma': 0.8925113465433823,
'max_depth': 8,
'min_child_weight': 5,
'objective': 'rank:pairwise',
'reg_alpha': 2.2193560083517383,
'reg_lambda': 1.8600142721064354,
'seed': 0,
'subsample': 0.9818535865621624}
I thought Hyperopt should be giving me the best parameters. What can I possibly change to improve this?
Edit
I changed n_estimators from 527 to 160, and it is giving me this graph now.
But I'm not sure whether this graph is okay. Any advice is much appreciated!
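Hyperopt only optimizes the hyperparameters it is told to search over; the number of boosting rounds is usually better chosen by early stopping on the validation metric than fixed by hand. A sketch, assuming the native xgboost API with DMatrix objects named dtrain and dvalid (names not in the question):
import xgboost as xgb

evals_result = {}
booster = xgb.train(
    params,                          # the HyperOpt parameters above
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dvalid, 'valid')],
    early_stopping_rounds=20,
    evals_result=evals_result,       # per-round metrics for plotting the curves
)
print(booster.best_iteration)        # round where the validation ndcg peaked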
I have 2 regressors:
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

params = {
    'num_leaves': [7, 14, 21, 28, 31, 50],
    'learning_rate': [0.1, 0.03, 0.003],
    'max_depth': [-1, 3, 5],
    'n_estimators': [50, 100, 200, 500],
}

grid = GridSearchCV(lgb.LGBMRegressor(random_state=0), params, scoring='r2', cv=5)
grid.fit(X_train, y_train)

reg = lgb.LGBMRegressor(random_state=0)
reg.fit(X_train, y_train)  # fit on the training data before calling reg.predict
As you can see, I defined a random_state for both regressors. GridSearchCV should find the best params for the estimator to increase its score. But:
r2_score(y_train, grid.predict(X_train)) # output is 0.69
r2_score(y_train, reg.predict(X_train)) # output is 0.84
So, how can I find the best params for LGBMRegressor?
Based on the documentation here, after calling grid.fit() you can find the best estimator (a ready, fitted model) and the best params here:
grid.best_estimator_
grid.best_params_
FYI: random_state only affects the random parts of training (when shuffling, for example).
In your case the models' parameters are different, so the results of your R2 metric differ accordingly.
So, I believe you would have to script it like:
from sklearn.metrics import r2_score

params = {
    'num_leaves': [7, 14, 21, 28, 31, 50],
    'learning_rate': [0.1, 0.03, 0.003],
    'max_depth': [-1, 3, 5],
    'n_estimators': [50, 100, 200, 500],
}

grid = GridSearchCV(lgb.LGBMRegressor(random_state=0), params, scoring='r2', cv=5)
grid.fit(X_train, y_train)

reg = lgb.LGBMRegressor(random_state=0)
reg.fit(X_train, y_train)

lgbm_tuned = grid.best_estimator_   # model refit with the best parameters
r2_tuned = grid.best_score_         # best cross-validated R2 found by the search
r2_regular = r2_score(y_train, reg.predict(X_train))
where r2_tuned is the best cross-validated score found by the grid search, lgbm_tuned is your model refit with the best parameters, and r2_regular is your score with default parameters.
It is odd to find a worse result after a grid search, especially when the search space includes LightGBM's default parameters. Keep in mind, though, that r2_regular is measured on the training data, which rewards overfitting, while grid.best_score_ is cross-validated, so the two numbers are not directly comparable.
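One way to make the comparison fair is to score both models the same way, with cross-validation rather than R2 on the training data (a sketch, reusing the names above):
from sklearn.model_selection import cross_val_score

# Cross-validated R2 of the default model, comparable to grid.best_score_
cv_default = cross_val_score(lgb.LGBMRegressor(random_state=0),
                             X_train, y_train, scoring='r2', cv=5).mean()
print(grid.best_score_, cv_default)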
I'm new to xgboost on Python and today I was trying to follow the tutorial here: https://jessesw.com/XG-Boost/.
Then I tried xgboost with my own data, and it works fine without grid search. Then I followed the tutorial to do the grid search, but it does not seem to work. This is my code:
cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 500, 'seed': 0,
              'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'reg:linear'}

optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params),
                             cv_params,
                             cv=5, n_jobs=2, verbose=2)
optimized_GBM.fit(train_x, train['label'])
And I got this output:
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] max_depth=3, min_child_weight=1................................
//anaconda/bin/python.app: line 3: 906 Killed: 9 //anaconda/python.app/Contents/MacOS/python "$#"
Any advice would be appreciated!
In my case the cause was colsample_bytree: it was 0.1 while the total number of features was less than 10.
The failure was an assertion triggered when a tree tries to learn on a dataset with n samples and 0 features.
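A small guard avoids this: make sure the column fraction still leaves at least one feature per tree (a sketch; X_train stands in for your feature matrix):
requested = 0.1                                      # fraction you wanted to try
n_features = X_train.shape[1]
colsample_bytree = max(requested, 1.0 / n_features)  # >= 1 feature survives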