LightGBM hyperparameter tuning RandomizedSearchCV - python

I have a dataset with the following dimensions for training and testing sets:
X_train = (58149, 9)
y_train = (58149,)
X_test = (24921, 9)
y_test = (24921,)
The code that I have for RandomizedSearchCV using LightGBM classifier is as follows:
# Parameters to be used for RandomizedSearchCV-
rs_params = {
    # 'bagging_fraction': [0.6, 0.66, 0.7],
    'bagging_fraction': sp_uniform(0.5, 0.8),
    'bagging_frequency': sp_randint(5, 8),
    # 'feature_fraction': [0.6, 0.66, 0.7],
    'feature_fraction': sp_uniform(0.5, 0.8),
    'max_depth': sp_randint(10, 13),
    'min_data_in_leaf': sp_randint(90, 120),
    'num_leaves': sp_randint(1200, 1550)
}
# Initialize a RandomizedSearchCV object using 5-fold CV-
rs_cv = RandomizedSearchCV(estimator=lgb.LGBMClassifier(), param_distributions=rs_params, cv = 5, n_iter=100)
# Train on training data-
rs_cv.fit(X_train, y_train)
When I execute this code, it gives me the following error:
LightGBMError: Check failed: bagging_fraction <=1.0 at
/__w/1/s/python-package/compile/src/io/config_auto.cpp, line 295.
Any idea as to what's going wrong?

I removed sp_uniform and sp_randint from your code and it runs without the error:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb
np.random.seed(0)
d1 = np.random.randint(2, size=(100, 9))
d2 = np.random.randint(3, size=(100, 9))
d3 = np.random.randint(4, size=(100, 9))
Y = np.random.randint(7, size=(100,))
X = np.column_stack([d1, d2, d3])
# Each tuple is a list of candidate values to pick from, not a (low, high) range
rs_params = {
    'bagging_fraction': (0.5, 0.8),
    'bagging_freq': (5, 8),  # LightGBM's name for this parameter is bagging_freq
    'feature_fraction': (0.5, 0.8),
    'max_depth': (10, 13),
    'min_data_in_leaf': (90, 120),
    'num_leaves': (1200, 1550)
}
# Initialize a RandomizedSearchCV object using 5-fold CV-
rs_cv = RandomizedSearchCV(estimator=lgb.LGBMClassifier(), param_distributions=rs_params, cv = 5, n_iter=100,verbose=1)
# Train on training data-
rs_cv.fit(X, Y)
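According to the documentation, bagging_fraction must satisfy 0.0 < bagging_fraction <= 1.0.
Assuming sp_uniform is scipy.stats.uniform, its two arguments are loc and scale, so sp_uniform(0.5, 0.8) samples from [0.5, 1.3] rather than [0.5, 0.8], and any draw above 1.0 fails the check (the same applies to feature_fraction). To sample from [0.5, 0.8], use sp_uniform(0.5, 0.3).
Add verbose=1 to RandomizedSearchCV so that you can see progress messages for each fit.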
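The range covered by sp_uniform(0.5, 0.8) can be checked directly. This is a minimal sketch that assumes sp_uniform and sp_randint are the usual aliases for scipy.stats.uniform and scipy.stats.randint (the question does not show its imports); the variable names are illustrative only.

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# uniform(loc, scale) covers [loc, loc + scale], not [low, high].
too_wide = sp_uniform(0.5, 0.8)   # covers [0.5, 1.3]
in_range = sp_uniform(0.5, 0.3)   # covers [0.5, 0.8]

# ppf(0) and ppf(1) give the lower and upper ends of each distribution.
too_wide_bounds = (too_wide.ppf(0), too_wide.ppf(1))
in_range_bounds = (in_range.ppf(0), in_range.ppf(1))

# randint(low, high) excludes `high`, so this can only yield 5, 6 or 7.
draws = sp_randint(5, 8).rvs(size=200, random_state=0)
freq_values = sorted({int(v) for v in draws})

print(too_wide_bounds, in_range_bounds, freq_values)
```

With the second form, every sampled bagging_fraction stays at or below 0.8, so the LightGBM check on the upper bound can no longer fail.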

Related

TypeError: 'Pipeline' object is not callable function with Optuna

Trying to run an Optuna study that has a function with a Pipeline in. I, kind of, understand the error but have no idea what the solution is...
Trying to run the following code... It works fine when running XGBClassifier on preprocessed data that doesn't need to run through a pipeline..
def objective(trial):
    """Define the objective function"""
    params = {
        'max_depth': trial.suggest_int('max_depth', 1, 9),
        'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_loguniform('gamma', 1e-8, 1.0),
        'subsample': trial.suggest_loguniform('subsample', 0.01, 1.0),
        'colsample_bytree': trial.suggest_loguniform('colsample_bytree', 0.01, 1.0),
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-8, 1.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-8, 1.0),
        'eval_metric': 'mlogloss',
        'use_label_encoder': False
    }
    # Define model
    xgbmodel = XGBClassifier(random_state = 1)
    # Bundle preprocessing and modeling code in a pipeline
    xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                   # ('skb', SelectKBest(chi2, k = 10)),
                                   ('xgbmodel', xgbmodel)
                                   ])
    # Fit the random search model
    start_time = timer(None) # timing starts from this point for "start_time" variable
    # Fit the model
    optuna_model = xgb_pipeline(**params)
    optuna_model.fit(X_train, y_train)
    # Make predictions
    y_pred = optuna_model.predict(X_valid)
    # Evaluate predictions
    accuracy = accuracy_score(y_valid, y_pred)
    return accuracy
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
Get an error that starts..
[W 2023-01-11 19:30:05,914] Trial 2 failed because of the following error: TypeError("'Pipeline' object is not callable")
So, I got it to work. Below is my understanding of what each change does...
def objective(trial):
    """Define the objective function"""
    params = {
        'xgbmodel__max_depth': trial.suggest_int('max_depth', 1, 9),
        'xgbmodel__learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 1.0),
        'xgbmodel__n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'xgbmodel__min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'xgbmodel__gamma': trial.suggest_loguniform('gamma', 1e-8, 1.0),
        'xgbmodel__subsample': trial.suggest_loguniform('subsample', 0.01, 1.0),
        'xgbmodel__colsample_bytree': trial.suggest_loguniform('colsample_bytree', 0.01, 1.0),
        'xgbmodel__reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-8, 1.0),
        'xgbmodel__reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-8, 1.0),
        'xgbmodel__eval_metric': 'mlogloss',
        'xgbmodel__use_label_encoder': False
    }
    # Define model
    xgbmodel = XGBClassifier()
    # Bundle preprocessing and modeling code in a pipeline
    xgb_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        # ('skb', SelectKBest(chi2, k = 10)),
        ('xgbmodel', xgbmodel)
    ])
    # Set the trial's parameters on the pipeline, then fit the model
    optuna_model = xgb_pipeline.set_params(**params)
    optuna_model.fit(X_train, y_train)
    # Make predictions
    y_pred = optuna_model.predict(X_valid)
    # Evaluate predictions
    accuracy = accuracy_score(y_valid, y_pred)
    return accuracy
In params, each key needs the xgbmodel__ prefix. This tells the pipeline which step to apply the parameter to; in this case the second step, 'xgbmodel'.
Then, before fitting the model to the training data, you set the parameters using the set_params(**params) method. Calling set_params on its own gives an error; it has to be called on the object it applies to, in this case the pipeline: xgb_pipeline.set_params(**params). Hopefully, I have used the right terminology.
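The step-prefix convention can be tried without XGBoost or Optuna. This is a minimal sketch using scikit-learn only; StandardScaler and LogisticRegression are stand-ins chosen for illustration, and the step name "model" plays the role that 'xgbmodel' plays above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The step names are what the "<step>__<param>" prefix refers to.
pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("model", LogisticRegression()),
])

# A Pipeline instance is not callable, so pipeline(**params) raises TypeError.
# set_params routes each "model__..." key to the step named "model"
# and returns the same pipeline object.
params = {"model__C": 0.1, "model__max_iter": 500}
tuned = pipeline.set_params(**params)
tuned.fit(X, y)

print(tuned is pipeline)
print(pipeline.named_steps["model"].C)
```

Because set_params returns the pipeline itself, assigning the result to a new name (as optuna_model does above) is optional; the original pipeline object already carries the new values.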

How can I adjust ranges for parameters for GridSearchCV?

I'm trying to tune parameters for XGBoost Text Classification.
parameters = {
    'max_depth': range(2, 11, 2),
    'n_estimators': range(200, 250, 10),
    'learning_rate': [0.1, 0.01, 0.09],
    'min_child_weight': range(0, 20, 4)
}
grid_search = GridSearchCV(
    estimator=model,
    param_grid=parameters,
    scoring = 'roc_auc',
    n_jobs = 10,
    cv = 10,
    verbose=True
)
However the code runs for hours but gives no results. Any recommendations about the ranges or any other things will be appreciated!
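A quick count shows why this takes so long. The sketch below only does arithmetic on the grid above: a grid search tries every combination, so the list sizes multiply (5 x 5 x 3 x 5 = 375 candidates), and each candidate is fitted once per fold. The variable names are illustrative.

```python
from math import prod

parameters = {
    "max_depth": list(range(2, 11, 2)),
    "n_estimators": list(range(200, 250, 10)),
    "learning_rate": [0.1, 0.01, 0.09],
    "min_child_weight": list(range(0, 20, 4)),
}
cv_folds = 10

# Grid search tries every combination, so the sizes multiply.
n_candidates = prod(len(values) for values in parameters.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

That is 3750 cross-validation fits of models with 200 or more trees each. Fewer values per parameter, fewer folds, or a randomized search with a fixed n_iter all reduce the count directly.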

Tune XGB Parameters

I am working on a project with a dataset of aircraft engines and their lifetime. I need to use XGBRegressor to have the highest success rate of my model on my validation data.
I am having trouble understanding the XGBRegressor documentation. Do you know how I could search for the optimal parameters automatically instead of testing everything by hand?
I attached a part of my code related to XGB.
from xgboost import XGBRegressor
xgb = XGBRegressor(learning_rate = 0.3, max_depth = 7, n_estimators = 230, subsample = 0.7, colsample_bylevel = 0.7, colsample_bytree = 0.7, min_child_weight = 4, reg_alpha = 10, reg_lambda = 10)
xgb.fit(X_train, y_train)
The following will help you achieve this; you can add more hyperparameters or more estimators to try different approaches. Setting cv=5 runs 5-fold cross-validation, but if you have a specific validation set and only want results on that set, you can pass a single predefined split to cv:
import numpy as np
from sklearn.model_selection import train_test_split
indices = np.arange(len(X_train))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
cv_indices=[(train_idx, test_idx)]
Otherwise, use cv=5 to do 5-fold CV while searching for parameters.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
dict_classifiers = {
    "XGB": XGBRegressor()
}
params = {
    "XGB": {'min_child_weight': [1, 5, 10],
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': [3, 4, 5], "n_estimators": [300, 600],
            "learning_rate": [0.001, 0.01, 0.1],
            }
}
for classifier_name in dict_classifiers.keys() & params:
    print("training: ", classifier_name)
    gridSearch = GridSearchCV(
        estimator=dict_classifiers[classifier_name], param_grid=params[classifier_name], cv=cv_indices)
    gridSearch.fit(X_train,  # should have shape (n_samples, n_features)
                   y_train.reshape((-1)))  # this should be an array with shape (n_samples,)
    print(gridSearch.best_score_, gridSearch.best_params_)
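The single predefined split can be tried on its own with a light estimator. This is a sketch on synthetic data; Ridge and the alpha grid are stand-ins chosen so the example runs quickly, and are not part of the answer above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One fixed (train, validation) split expressed as index arrays.
indices = np.arange(len(X_train))
train_idx, valid_idx = train_test_split(indices, test_size=0.2, random_state=0)
cv_indices = [(train_idx, valid_idx)]

# Ridge is a stand-in; any scikit-learn compatible regressor can take its place.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 1.0, 100.0]}, cv=cv_indices)
search.fit(X_train, y_train)

print(search.n_splits_, search.best_params_)
```

Passing a list with one (train, validation) pair means each candidate is fitted once and scored on the same held-out rows, which is cheaper than cv=5 but gives a noisier estimate.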

How to perform nested Cross Validation (LightGBM Regression) with Bayesian Hyperparameter optimization and TimeSeriesSplit?

I want to do predictions with a Regression model.
I try to optimize my LightGBM model for the best hyperparameters while aiming for the lowest generalization RMSE score without overfitting/underfitting.
All the examples I've seen use classification, split the data randomly without regard for time-series order, and use GridSearch, none of which applies to my problem.
How can I get bayesian hyperparameter optimization for my final model while using nested CV and TimeSeriesSplit?
My code for simple CV so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
... import data via pandas ...
y = df["target"] # target y
features = df.drop("target", axis=1).columns
X = df[features] # features X
days = len(df)- 60 # 2 Months for test data / ~20%
X_train, X_test = X[:days], X[days:]
y_train, y_test = y[:days], y[days:]
# hyperopt
random_state = 42
# the CV splitter must exist before the function, since it is used as a default argument
cvTSS = TimeSeriesSplit(max_train_size=None, n_splits=10)
def lightgbm_cv(params, random_state=random_state, cv=cvTSS, X=X_train, y=y_train):
    params = {
        'n_estimators': int(params['n_estimators']),
        'max_depth': int(params['max_depth']),
        'learning_rate': params['learning_rate'],
        'min_child_weight': params['min_child_weight'],
        'feature_fraction': params['feature_fraction'],
        'bagging_fraction': params['bagging_fraction'],
        'bagging_freq': int(params['bagging_freq']),
        'num_leaves': int(params['num_leaves']),
        'max_bin': int(params['max_bin']),
        'num_iterations': int(params['num_iterations']),
        'objective': 'rmse',
    }
    # we use these params to create a new LGBM Regressor
    model = lgb.LGBMRegressor(random_state=random_state, **params)
    # and then conduct the cross validation with the same folds as before
    score = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error", n_jobs=-1).mean()
    print(score)
    return score
space={
    'n_estimators': hp.quniform('n_estimators', 100, 10_000, 1),
    'max_depth' : hp.quniform('max_depth', 2, 100, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 2),
    'min_child_weight': hp.choice('min_child_weight', np.arange(1, 8, 1, dtype=int)),
    'feature_fraction': hp.quniform('feature_fraction', 0.1, 1, 0.1),
    'bagging_fraction': hp.quniform('bagging_fraction', 0.1, 1, 0.1),
    'bagging_freq': hp.quniform('bagging_freq', 1, 1_000, 1),
    "num_leaves": hp.quniform('num_leaves', 10, 1_000, 1),
    "max_bin": hp.quniform('max_bin', 10, 2_000, 1),
    "num_iterations": hp.quniform('num_iterations', 100, 10_000, 1),
    'objective': 'rmse',
    #'verbose': 0,
}
# trials will contain logging information
trials = Trials()
n_iter = 100
# note: fmin has no "stratified" argument, so it is not passed here
best=fmin(fn=lightgbm_cv, # function to optimize
          space=space,
          algo=tpe.suggest, # optimization algorithm, hyperopt will select its parameters automatically
          max_evals=n_iter, # maximum number of iterations
          trials=trials, # logging
          rstate=np.random.RandomState(random_state) # fixing random state for the reproducibility
          )
# computing the score on the test set - some parameters from "space" are missing here, not important atm
model = lgb.LGBMRegressor(random_state=random_state, n_estimators=int(best['n_estimators']),
max_depth=int(best['max_depth']),learning_rate=best['learning_rate'])
model.fit(X_train, y_train)
tpe_test_score = mean_squared_error(y_test, model.predict(X_test), squared=False)
print("Best RMSE {:.3f} params {}".format( lightgbm_cv(best), best))
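The nested structure asked about can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not a LightGBM or hyperopt recipe: Ridge stands in for the LightGBM regressor, GridSearchCV stands in for the Bayesian search, and the data is synthetic. The point is only the loop layout, with TimeSeriesSplit used at both levels so that tuning never sees data later than the outer test fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=300)

outer_cv = TimeSeriesSplit(n_splits=4)
inner_cv = TimeSeriesSplit(n_splits=3)

outer_rmse = []
for train_idx, test_idx in outer_cv.split(X):
    # Inner loop: tune hyperparameters using only data before the outer test fold.
    search = GridSearchCV(
        Ridge(),
        param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
        cv=inner_cv,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X[train_idx], y[train_idx])

    # Outer loop: score the refitted best model on the later, unseen fold.
    pred = search.predict(X[test_idx])
    outer_rmse.append(float(np.sqrt(mean_squared_error(y[test_idx], pred))))

print(outer_rmse, float(np.mean(outer_rmse)))
```

The mean of outer_rmse estimates generalization error for the whole tuning procedure. To use hyperopt, the inner GridSearchCV would be replaced by an fmin call whose objective runs cross_val_score with inner_cv on the outer training rows only.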

Operations on Booster (XGBoost)

I need some help doing a bagging aggregation of different XGBoost models (of type Booster). The idea is to then store a single model, the final one, in a pickle file.
I start by creating a dummy dataframe:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold
import pickle
dummy_df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
dummy_df['D'] = -15 * dummy_df['A'] + 32 * dummy_df['B']
X = dummy_df.drop('D', axis=1)
y = dummy_df['D']
I establish some parameters I'd like to test (resulting for instance from a gridsearch):
params = {'eta': 0.06, # learning rate
          'tree_method': "auto",#considering my dummy df, might be more interesting to use "gblinear" of course...
          'max_depth': 3,
          'subsample': 0.75,
          'colsample_bytree': 0.75,
          'colsample_bylevel': 0.75,
          'min_child_weight': 5,
          'alpha': 10,
          'objective': 'reg:linear',
          'eval_metric': 'rmse',
          'random_state': 99,
          'silent': True}
Finally, I create my cross-validation scheme:
accu = 0
n_splits = 5
folds = KFold(n_splits=n_splits, shuffle=True, random_state=1)
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, y)):
    train_x, train_y = X.iloc[train_idx], y.iloc[train_idx]
    valid_x, valid_y = X.iloc[valid_idx], y.iloc[valid_idx]
    dtrain = xgb.DMatrix(train_x, train_y)
    dvalid = xgb.DMatrix(valid_x, valid_y)
    watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
    model = xgb.train(params, dtrain, 2500, watchlist, maximize=False, early_stopping_rounds=40, verbose_eval=50)
    if accu == 0:
        model_to_save = model
        accu += 1
    else:
        model_to_save += model
It trains properly for the first and second iterations of my for loop, but when it needs to add the models from the first two iterations (final line), I get the following error:
TypeError: unsupported operand type(s) for +=: 'Booster' and 'Booster'
Is there any way in Python to add 2 Boosters? And also to divide a Booster by an integer since I'll have to divide at the end model_to_save by n_splits?
PS: Storing all the XGBoost models is not an option considering other constraints I can face later on.
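As the TypeError says, Booster does not support + or +=. One possible workaround, offered here as an assumption rather than an XGBoost feature, is to collect the per-fold models in a single wrapper object and average their predictions; the wrapper is then the one object that gets pickled. It still holds every fold's model internally, which may or may not satisfy the constraint in the PS. AveragedModel and ConstantModel are hypothetical names, and ConstantModel only stands in for a trained Booster so that the sketch runs without XGBoost.

```python
from statistics import fmean


class AveragedModel:
    """Hypothetical wrapper: holds several fitted models and averages their predictions."""

    def __init__(self):
        self.models = []

    def add(self, model):
        self.models.append(model)

    def predict(self, data):
        # Collect one prediction list per model, then average row by row.
        per_model = [list(model.predict(data)) for model in self.models]
        return [fmean(row) for row in zip(*per_model)]


class ConstantModel:
    """Toy stand-in for a trained Booster; it predicts the same value for every row."""

    def __init__(self, value):
        self.value = value

    def predict(self, data):
        return [self.value for _ in data]


ensemble = AveragedModel()
for fold_value in (1.0, 2.0, 3.0):
    # In the cross-validation loop above this would be ensemble.add(model).
    ensemble.add(ConstantModel(fold_value))

averaged = ensemble.predict([[0.1], [0.2]])
print(averaged)
```

With real Boosters, predict would receive an xgb.DMatrix, and dividing by n_splits is no longer needed because the mean is taken at prediction time.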