Operations on Booster (XGBoost) - python

I need some help doing a bagging aggregation of several XGBoost models (of type Booster). The idea is to then store one model, the final one, in a pickle file.
I start by creating a dummy dataframe:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold
import pickle
dummy_df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
dummy_df['D'] = -15 * dummy_df['A'] + 32 * dummy_df['B']
X = dummy_df.drop('D', axis=1)
y = dummy_df['D']
I establish some parameters I'd like to test (resulting for instance from a gridsearch):
params = {'eta': 0.06,  # learning rate
          'tree_method': "auto",  # considering my dummy df, "gblinear" might of course be more interesting...
          'max_depth': 3,
          'subsample': 0.75,
          'colsample_bytree': 0.75,
          'colsample_bylevel': 0.75,
          'min_child_weight': 5,
          'alpha': 10,
          'objective': 'reg:linear',
          'eval_metric': 'rmse',
          'random_state': 99,
          'silent': True}
Finally, I create my cross-validation scheme:
accu = 0
n_splits = 5
folds = KFold(n_splits=n_splits, shuffle=True, random_state=1)
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, y)):
    train_x, train_y = X.iloc[train_idx], y.iloc[train_idx]
    valid_x, valid_y = X.iloc[valid_idx], y.iloc[valid_idx]
    dtrain = xgb.DMatrix(train_x, train_y)
    dvalid = xgb.DMatrix(valid_x, valid_y)
    watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
    model = xgb.train(params, dtrain, 2500, watchlist, maximize=False,
                      early_stopping_rounds=40, verbose_eval=50)
    if accu == 0:
        model_to_save = model
        accu += 1
    else:
        model_to_save += model
It trains properly for the first and second iterations of my for loop, but when it needs to add the first two models together (final line), I get the following error:
TypeError: unsupported operand type(s) for +=: 'Booster' and 'Booster'
Is there any way in Python to add two Boosters? And also to divide a Booster by an integer, since at the end I'll have to divide model_to_save by n_splits?
PS: Storing all the XGBoost models is not an option considering other constraints I can face later on.
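For reference, Booster implements neither __add__ nor __truediv__, so two trained boosters cannot be combined arithmetically. A common workaround, sketched below under the assumption that keeping the fold models together is acceptable, is to pickle the list of fold Boosters as one artifact and average their predictions at inference time; the np.mean takes care of the division by n_splits:

import pickle
import numpy as np
import xgboost as xgb  # these imports are already in place above

# Inside the CV loop, append each trained Booster instead of `model_to_save += model`:
fold_models = []  # fold_models.append(model) after every xgb.train(...) call

def bagged_predict(boosters, df):
    """Average the predictions of several Boosters (simple bagging)."""
    dmat = xgb.DMatrix(df)
    return np.mean([b.predict(dmat) for b in boosters], axis=0)

# A list of Boosters pickles as a single object, so only one file is stored:
with open("bagged_model.pkl", "wb") as f:
    pickle.dump(fold_models, f)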


Related

LightGBM model predicts infinite values

I have a real estate dataset from Belgium (available here) and I want to use the LightGBM model to predict the price column.
My code is as follows:
import numpy as np
import pandas as pd
import lightgbm as lgb

# data: the Belgian real estate DataFrame loaded beforehand
one_hot_state_of_the_building = pd.get_dummies(data.state_of_the_building)
one_hot_city = pd.get_dummies(data.city_name, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')
one_hot_region = pd.get_dummies(data.region, prefix="region")
# removing categorical features
data.drop(['city_name', 'state_of_the_building', 'province', 'region'], axis=1, inplace=True)
# merging the one-hot encoded features back into our dataset 'data'
data = pd.concat([data, one_hot_city, one_hot_state_of_the_building, one_hot_province, one_hot_region], axis=1)
x = data.drop('price', axis=1)
y = data.price
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3)
train_data = lgb.Dataset(x_train, label=y_train)
# setting parameters for lightgbm
hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['l1', 'l2'],
    'learning_rate': 0.005,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    "max_depth": 8,
    "num_leaves": 128,
    "max_bin": 512,
    "num_iterations": 100000
}
gbm = lgb.LGBMRegressor(**hyper_params)
gbm.fit(x_train, y_train,
        eval_set=[(x_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=1000)
y_pred = gbm.predict(x_train, num_iteration=gbm.best_iteration_)
# Basic RMSE
from sklearn.metrics import mean_squared_log_error
print('The rmse of prediction is:', round(mean_squared_log_error(y_pred, y_train) ** 0.5, 5))
The RMSE is 0.19. When I run the following code:
test_pred = np.expm1(gbm.predict(x_test, num_iteration=gbm.best_iteration_))
x_test["Price"] = test_pred
x_test.to_csv("results.csv", columns=["Column1", "Price"], index=False)
I get a warning:
/tmp/ipykernel_50181/3204387303.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
x_test["Price"] = test_pred
and when I look at my "results.csv" file it's full of Inf in the predicted column.
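pandas raises the warning because it cannot tell whether x_test is a view of the original frame; a minimal fix, assuming the standard pandas idiom, is to take an explicit copy before assigning the new column. Separately, np.expm1 overflows to inf for float64 inputs above roughly 709, so applying it to predictions of a target that was never log1p-transformed will produce exactly the kind of Inf values seen in results.csv:

# Work on an explicit copy so the assignment does not hit a possible view
results = x_test.copy()
results["Price"] = test_pred  # apply np.expm1 only if the model was trained on log1p(price)
results.to_csv("results.csv", columns=["Column1", "Price"], index=False)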

Tune XGB Parameters

I am working on a project with a dataset of aircraft engines and their lifetime. I need to use XGBRegressor to get the best possible performance from my model on my validation data.
I am having trouble understanding the XGBRegressor documentation; I was wondering if you know how I could optimize the search for the best parameters instead of testing everything by hand.
I attached a part of my code related to XGB.
from xgboost import XGBRegressor
xgb = XGBRegressor(learning_rate=0.3, max_depth=7, n_estimators=230,
                   subsample=0.7, colsample_bylevel=0.7, colsample_bytree=0.7,
                   min_child_weight=4, reg_alpha=10, reg_lambda=10)
xgb.fit(X_train, y_train)
The following will help you achieve this; you can add more hyperparameters or more estimators to test different approaches. If you set cv=5 it will do 5-fold cross-validation, but if you have a specific validation split you want to use, you can build the indices and pass them to cv:
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(len(X_train))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
cv_indices = [(train_idx, test_idx)]
Otherwise, use cv=5 to do 5-fold CV while searching for parameters.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

dict_classifiers = {
    "XGB": XGBRegressor()
}
params = {
    "XGB": {'min_child_weight': [1, 5, 10],
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': [3, 4, 5],
            "n_estimators": [300, 600],
            "learning_rate": [0.001, 0.01, 0.1],
            }
}
for classifier_name in dict_classifiers.keys() & params:
    print("training: ", classifier_name)
    gridSearch = GridSearchCV(
        estimator=dict_classifiers[classifier_name],
        param_grid=params[classifier_name], cv=cv_indices)
    gridSearch.fit(X_train,               # should have shape (n_samples, n_features)
                   y_train.reshape((-1))) # this should be an array with shape (n_samples,)
    print(gridSearch.best_score_, gridSearch.best_params_)
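Since GridSearchCV refits on the full training data by default (refit=True), the tuned model can be used directly afterwards; a short usage sketch, where X_valid is a hypothetical held-out validation set:

best_xgb = gridSearch.best_estimator_    # already refit on all of X_train with the best params
valid_preds = best_xgb.predict(X_valid)  # X_valid: your held-out validation data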

How to perform nested Cross Validation (LightGBM Regression) with Bayesian Hyperparameter optimization and TimeSeriesSplit?

I want to make predictions with a regression model.
I am trying to optimize my LightGBM model for the best hyperparameters while aiming for the lowest generalization RMSE score, without overfitting/underfitting.
All the examples I've seen use classification, split randomly without awareness of time-series data, and use GridSearch, none of which is applicable to my problem.
How can I get bayesian hyperparameter optimization for my final model while using nested CV and TimeSeriesSplit?
My code for simple CV so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
... import data via pandas ...
y = df["target"] # predictor y
features = df.drop("target", axis=1).columns
X = df[traffic_features] # features X
days = len(df)- 60 # 2 Months for test data / ~20%
X_train, X_test = X[:days], X[days:]
y_train, y_test = y[:days], y[days:]
# hyperopt
random_state = 42
cvTSS = TimeSeriesSplit(max_train_size=None, n_splits=10)  # must exist before being used as a default argument below

def lightgbm_cv(params, random_state=random_state, cv=cvTSS, X=X_train, y=y_train):
    params = {
        'n_estimators': int(params['n_estimators']),
        'max_depth': int(params['max_depth']),
        'learning_rate': params['learning_rate'],
        'min_child_weight': params['min_child_weight'],
        'feature_fraction': params['feature_fraction'],
        'bagging_fraction': params['bagging_fraction'],
        'bagging_freq': int(params['bagging_freq']),
        'num_leaves': int(params['num_leaves']),
        'max_bin': int(params['max_bin']),
        'num_iterations': int(params['num_iterations']),
        'objective': 'rmse',
    }
    # we use these params to create a new LGBM regressor
    model = lgb.LGBMRegressor(random_state=random_state, **params)
    # and then conduct the cross-validation with the same folds as before
    score = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error", n_jobs=-1).mean()
    print(score)
    return score
space = {
    'n_estimators': hp.quniform('n_estimators', 100, 10_000, 1),
    'max_depth': hp.quniform('max_depth', 2, 100, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 2),
    'min_child_weight': hp.choice('min_child_weight', np.arange(1, 8, 1, dtype=int)),
    'feature_fraction': hp.quniform('feature_fraction', 0.1, 1, 0.1),
    'bagging_fraction': hp.quniform('bagging_fraction', 0.1, 1, 0.1),
    'bagging_freq': hp.quniform('bagging_freq', 1, 1_000, 1),
    "num_leaves": hp.quniform('num_leaves', 10, 1_000, 1),
    "max_bin": hp.quniform('max_bin', 10, 2_000, 1),
    "num_iterations": hp.quniform('num_iterations', 100, 10_000, 1),
    'objective': 'rmse',
    # 'verbose': 0,
}
# trials will contain logging information
trials = Trials()
n_iter = 100
best = fmin(fn=lightgbm_cv,   # function to optimize
            space=space,
            algo=tpe.suggest, # hyperopt will select its parameters automatically
            max_evals=n_iter, # maximum number of iterations
            trials=trials,    # logging
            rstate=np.random.RandomState(random_state)  # fix the random state for reproducibility
            )
# computing the score on the test set - some parameters from "space" are missing here, not important atm
model = lgb.LGBMRegressor(random_state=random_state,
                          n_estimators=int(best['n_estimators']),
                          max_depth=int(best['max_depth']),
                          learning_rate=best['learning_rate'])
model.fit(X_train, y_train)
tpe_test_score = mean_squared_error(y_test, model.predict(X_test), squared=False)
print("Best RMSE {:.3f} params {}".format(lightgbm_cv(best), best))

GridSearchCV - FitFailedWarning: Estimator fit failed

I am running this:
# Hyperparameter tuning - Random Forest #
# Hyperparameters' grid
parameters = {'n_estimators': list(range(100, 250, 25)),
              'criterion': ['gini', 'entropy'],
              'max_depth': list(range(2, 11, 2)),
              'max_features': [0.1, 0.2, 0.3, 0.4, 0.5],
              'class_weight': [{0: 1, 1: i} for i in np.arange(1, 4, 0.2).tolist()],
              'min_samples_split': list(range(2, 7))}
# Instantiate random forest
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0)
# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters,
                                scoring='balanced_accuracy', cv=5, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)
and I am receiving this warning:
.../anaconda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536:
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
TypeError: '<' not supported between instances of 'str' and 'int'
Why is this and how can I fix it?
I had a similar FitFailedWarning with different details. After many runs I found that the parameter values being passed were the problem; try:
parameters = {'n_estimators': [100, 125, 150, 175, 200, 225, 250],
              'criterion': ['gini', 'entropy'],
              'max_depth': [2, 4, 6, 8, 10],
              'max_features': [0.1, 0.2, 0.3, 0.4, 0.5],
              'class_weight': [0.2, 0.4, 0.6, 0.8, 1.0],
              'min_samples_split': [2, 3, 4, 5, 6, 7]}
This passed for me; in my case it happened with XGBClassifier, where the value datatypes were somehow getting mixed up.
Another cause is a value outside the allowed range: for example, XGBClassifier's 'subsample' parameter has a maximum of 1.0, and setting it to 1.1 will also trigger a FitFailedWarning.
For me this was giving the same error; the culprit was the string 'None' in max_depth (note the quotes), which mixes str and int values. This grid fails:
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'criterion': ['gini', 'entropy'],
              'max_depth': ['None', 5, 10, 20, 30, 40, 50, 60, 70],
              'min_samples_split': [5, 10, 20, 25, 30, 40, 50],
              'max_features': ['sqrt', 'log2'],
              'max_leaf_nodes': [5, 10, 20, 25, 30, 40, 50],
              'min_samples_leaf': [1, 100, 200, 300, 400, 500]
              }
Code that runs properly:
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'criterion': ['gini', 'entropy'],
              'max_depth': [5, 10, 20, 30, 40, 50, 60, 70],
              'min_samples_split': [5, 10, 20, 25, 30, 40, 50],
              'max_features': ['sqrt', 'log2'],
              'max_leaf_nodes': [5, 10, 20, 25, 30, 40, 50],
              'min_samples_leaf': [1, 100, 200, 300, 400, 500]
              }
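The underlying TypeError is easy to reproduce: as soon as scikit-learn has to compare the string 'None' with an integer while validating the grid, the comparison fails. A two-line illustration:

sorted(['None', 5, 10])
# TypeError: '<' not supported between instances of 'int' and 'str'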
I got the same error too, and when I passed hyperparameters as in the MachineLearningMastery example, I got output without the warning.
Try this approach if you run into similar issues:
# grid search logistic regression model on the sonar dataset
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = LogisticRegression()
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
Make sure the y-variable is an int, not bool or str.
Change your last line of code to make the y series a 0 or 1, for example:
classifiers_grid.fit(X, list(map(int, y)))

How to get the params from a saved XGBoost model

I'm trying to train an XGBoost model using the params below:
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'lambda': 0.8,
    'alpha': 0.4,
    'max_depth': 10,
    'max_delta_step': 1,
    'verbose': True
}
Since my input data is too big to be fully loaded into memory, I adopt incremental training:
xgb_clf = xgb.train(xgb_params, input_data, num_boost_round=rounds_per_batch,
                    xgb_model=model_path)
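For context, incremental training usually loops over data chunks, saving the booster each round and feeding its path (or the Booster object itself) back in through xgb_model; a rough sketch with a hypothetical iterate_batches generator:

model_path = None
for X_batch, y_batch in iterate_batches():  # hypothetical generator over data chunks
    dtrain = xgb.DMatrix(X_batch, label=y_batch)
    booster = xgb.train(xgb_params, dtrain, num_boost_round=rounds_per_batch,
                        xgb_model=model_path)  # continue from the previous model; None on the first pass
    booster.save_model("incremental.model")
    model_path = "incremental.model"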
The code for prediction is
xgb_clf = xgb.XGBClassifier()
booster = xgb.Booster()
booster.load_model(model_path)
xgb_clf._Booster = booster
raw_probas = xgb_clf.predict_proba(x)
The result seemed good, but when I tried to invoke xgb_clf.get_xgb_params(), I got a param dict in which all params were set to default values.
I can guess the root cause: since I didn't pass any params in when I initialized the model, it was initialized with the default values, yet when it predicted it used an internal booster that had been fitted with some pre-defined params.
However, I wonder whether there is any way, after I assign a pre-trained booster to an XGBClassifier, to see the real params that were used to train the booster, rather than those used to initialize the classifier.
You seem to be mixing the sklearn API with the functional API in your code; if you stick to either one, the parameters should persist in the pickle. Here's an example using the sklearn API.
import pickle
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_digits

digits = load_digits(n_class=2)
y = digits['target']
X = digits['data']
xgb_params = {
    'objective': 'binary:logistic',
    'reg_lambda': 0.8,
    'reg_alpha': 0.4,
    'max_depth': 10,
    'max_delta_step': 1,
}
clf = xgb.XGBClassifier(**xgb_params)
clf.fit(X, y, eval_metric='auc', verbose=True)
pickle.dump(clf, open("xgb_temp.pkl", "wb"))
clf2 = pickle.load(open("xgb_temp.pkl", "rb"))
assert np.allclose(clf.predict(X), clf2.predict(X))
print(clf2.get_xgb_params())
which produces
{'base_score': 0.5,
'colsample_bylevel': 1,
'colsample_bytree': 1,
'gamma': 0,
'learning_rate': 0.1,
'max_delta_step': 1,
'max_depth': 10,
'min_child_weight': 1,
'missing': nan,
'n_estimators': 100,
'objective': 'binary:logistic',
'reg_alpha': 0.4,
'reg_lambda': 0.8,
'scale_pos_weight': 1,
'seed': 0,
'silent': 1,
'subsample': 1}
If you are training like this:
dtrain = xgb.DMatrix(x_train, label=y_train)
model = xgb.train(model_params, dtrain, model_num_rounds)
Then the model returned is a Booster.
import json
json.loads(model.save_config())
The model.save_config() function lists the model parameters in addition to other configuration options.
To add to @ytsaig's answer: if you use the early_stopping_rounds argument in the clf.fit() method, certain additional parameters are generated but not returned by clf.get_xgb_params(). These can be accessed directly as clf.best_score, clf.best_iteration and clf.best_ntree_limit.
Ref: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit
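A short sketch of those attributes in use, assuming a held-out X_valid/y_valid pair and an xgboost version where eval_metric and early_stopping_rounds are still fit() arguments:

clf = xgb.XGBClassifier(**xgb_params)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
        eval_metric='auc', early_stopping_rounds=10)
print(clf.best_score, clf.best_iteration, clf.best_ntree_limit)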
