LightGBM reproducibility issue in Python

I am running LightGBM on a classification problem, with cross-validation (using sklearn) to get the optimal hyperparameter values.
Although I specified the random_state when creating the model object, rerunning the grid search results in different optimal parameters each time.
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt',
                                   objective='multiclass',
                                   num_class=4,
                                   random_state=42)

grid = {
    'num_leaves': [60, 70, 80, 100, 120],
    'feature_fraction': [0.5, 0.7],
    'bagging_fraction': [0.7, 0.8],
    'num_trees': [50, 80, 100],
    'C': [0, 0.3, 0.5, 1]
}

GBM_grid_search = GridSearchCV(estimator=lgb_estimator,
                               param_grid=grid,
                               scoring='f1_micro',
                               cv=15,
                               n_jobs=2)

lgb_model_trained = GBM_grid_search.fit(X=X_train, y=y_train)
My training data is split using a fixed random seed, so no issues on that side.
What's causing the randomness, and how can I solve it?

AFAIK, setting the random seed (random_state in LGBMClassifier) does not guarantee reproducibility if LightGBM is running in parallel (n_jobs > 1). If you need reproducibility and still want to use all your n cores, you should find or create a way to run n instances of LightGBM with n_jobs=1 each. The simplest way: with 2 cores, you can run 2 instances of your original code, the first with bagging_fraction=0.7 and the second with bagging_fraction=0.8, each with n_jobs=1. Or simply set n_jobs=1 without any other changes, if reproducibility is more important than speed.
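For illustration, a minimal sketch of the last option (favoring reproducibility over speed), reusing the estimator and grid from the question: LightGBM itself runs single-threaded, while GridSearchCV remains free to parallelize over folds and candidates, since each of those fits is independent and seeded. Note that the 'C' entry in the grid is not an LGBMClassifier parameter, so the search would reject it anyway and it should be dropped.

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Single-threaded LightGBM with a fixed seed -> deterministic fits;
# parallelism is moved to the CV level instead.
lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt',
                                   objective='multiclass',
                                   num_class=4,
                                   random_state=42,
                                   n_jobs=1)

GBM_grid_search = GridSearchCV(estimator=lgb_estimator,
                               param_grid=grid,   # grid as defined above, minus 'C'
                               scoring='f1_micro',
                               cv=15,
                               n_jobs=2)          # CV-level parallelism only
lgb_model_trained = GBM_grid_search.fit(X=X_train, y=y_train)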

Related

Configuration of GridSearchCV for AdaBoost and its base learner

I'm running grid search on AdaBoost with DecisionTreeClassifier as its base learner to get the best parameters for AdaBoost and DecisionTree.
The search on a dataset (130000, 22) has been running for 18 hours, so I'm wondering if this is just another typical day of waiting for training, or whether there might be an issue with the setup.
Are the base learner, grid search, training and params set up correctly?
ada_params = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"base_estimator__min_samples_leaf": [*np.arange(100,1500,100)],
"base_estimator__max_depth": [5,10,13,15],
"base_estimator__max_features": [5,10,15],
"n_estimators": [500, 700, 1000, 1500],
"learning_rate": [0.001, 0.01, 0.1, 0.3]
}
dt_base_learner = DecisionTreeClassifier(random_state = 42, max_features="auto", class_weight = "balanced")
ada_clf = AdaBoostClassifier(base_estimator = dt_base_learner)
ada_search = GridSearchCV(ada_clf, param_grid=ada_params, scoring = 'f1', cv=kf)
ada_search.fit(scaled_X_train, y_train)
If I am not mistaken, your grid search tests 2 * 2 * 14 * 4 * 3 * 4 * 4 = 10,752 different model configurations, each cross-validated over an unknown number of splits. You should definitely try to reduce the number of combinations in the GridSearchCV, or go for RandomizedSearchCV or BayesSearchCV from skopt.
Grid search will not finish until all fits are done. Check the RandomizedSearchCV documentation, increase the number of sampled candidates a few at a time (n_iter), and set n_jobs=-1 to parallelize as much as possible:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
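As a rough sketch of that suggestion, reusing ada_clf, ada_params and kf from the question (n_iter=50 is an arbitrary starting point, not a recommendation):

from sklearn.model_selection import RandomizedSearchCV

# Sample 50 random parameter combinations instead of the full grid;
# n_jobs=-1 parallelizes the fits over all available cores.
ada_random_search = RandomizedSearchCV(ada_clf,
                                       param_distributions=ada_params,
                                       n_iter=50,
                                       scoring='f1',
                                       cv=kf,
                                       n_jobs=-1,
                                       random_state=42)
ada_random_search.fit(scaled_X_train, y_train)
print(ada_random_search.best_params_)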

GridSearchCV not giving the most optimal settings?

I'm working on an XGBoost model and I also tried GridSearchCV from scikit-learn. After I did a search for the most optimal parameter settings, I got this result:
Fitting 4 folds for each of 2304 candidates, totalling 9216 fits
Best parameters found: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 11, 'n_estimators': 200, 'subsample': 0.9}
Now, after training a model with these settings and doing a prediction on unseen test data (a train/test split was used), I got a result. As a test I changed some settings, and then I got a better result than with the most optimal parameters from the grid search.
Is this because the test set is different from the training data? If I had another test set, would those settings also be different for the best score? I think both questions can be answered with yes, but how do other people deal with this effect?
Because you get the results from the grid search, but do you always use these settings, or do you do the same as I do? What will be your final settings for the model you want to deploy?
Hope to receive some inspirational thoughts :)
My final code for train/test after manual fine-tuning:
xgb_skl_tuned_2 = xgb.XGBRegressor(colsample_bytree=0.7,
                                   subsample=0.9,
                                   learning_rate=0.3,
                                   max_depth=5,
                                   min_child_weight=13,
                                   gamma=10,
                                   n_estimators=50)

xgb_skl_tuned_2.fit(X_train_2, y_train_2)
preds_2 = xgb_skl_tuned_2.predict(X_test_2)

rmse = mean_squared_error(y_test_2, preds_2, squared=False)  # squared=False returns RMSE
print('Model RMSE: {}'.format(rmse))
Also checked this thread: parameters tuning with GridsearchCV not giving best result

Different results obtained from different versions of XGBoost on a regression model fitted on the same database

I have written some Python code to do regression work with XGBoost, but when I run it on two different computers with two different versions of XGBoost and Python, the results are drastically different. My code is long, but I would like to show some parts of it here. The parts I am presenting are the hyperparameter tuning using the xgb.cv() command, and the fitting and prediction using Scikit-learn's XGBRegressor with the optimized parameters obtained by the tuning. The parameters to be tuned are stored in the following dictionary with initial arbitrary values:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import preprocessing
from model_functions import GaussRankScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt
import xgboost as xgb
from xgboost import XGBRegressor
import shap
import operator
from sklearn.model_selection import GridSearchCV
import joblib
import plotly.graph_objs as go
import scipy as sp
import seaborn as sns
from numpy import asarray
from sklearn.svm import SVR
from scipy.stats import ttest_ind
from sklearn.impute import SimpleImputer
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
import scipy.stats as stats
from yellowbrick.regressor import residuals_plot, ResidualsPlot
import sys
from scipy.stats import pearsonr
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'learning_rate': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'booster': 'gbtree',
    'nthread': -1,
    'validate_parameters': 'True',
    'alpha': 0.2,
    'lambda': 0.001,
    'colsample_bylevel': 0.9,
    'verbose': 0,
    'gamma': 0.01,
    'max_delta_step': 0.1,
    'silent': 0
}
The parameter tuning is done as described below, using for loops. In each loop, two of the parameters are tuned, except for the learning rate and gamma, which are optimized individually. Each pair of parameters is optimized in its own for loop, and the params dictionary is updated with the best value found for them at the end of that loop. The loops are all similar, the only difference between them being the parameters optimized. xgb.cv() is used for the cross-validation part of the process, and the evaluation metric used to choose the best value for each parameter is RMSE. The following is the loop responsible for optimizing the learning rate (a.k.a. eta):
df_x = dfnum.iloc[:, :-1]
df_y = dfnum.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(df_x, df_y,
                                                     test_size=0.1,
                                                     random_state=42)

# Converting features' distributions to normal distribution
gauss_scaler = GaussRankScaler()
X_trainnum = gauss_scaler.fit_transform(X_train)
X_testnum = gauss_scaler.transform(X_test)

# Scaling all the features to be between 0 and 1
scaler = preprocessing.MinMaxScaler()
X_trainnum = scaler.fit_transform(X_trainnum)
X_testnum = scaler.transform(X_testnum)

num_boost_round = 999
dtrain = xgb.DMatrix(X_trainnum, label=y_train)
dtest = xgb.DMatrix(X_testnum, label=y_test)

min_rmse = float("Inf")
best_params = None

for learning_rate in [.3, .2, .1, .05, .01, .005]:
    params['learning_rate'] = learning_rate
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=3,
        metrics=['rmse'],
        early_stopping_rounds=10
    )
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = learning_rate

print('')
print("Best parameter: learning_rate = {}, RMSE: {}".format(best_params, min_rmse))
print('')

params['learning_rate'] = best_params
After all the parameters are tuned in this fashion, the updated and optimized set of parameters is passed to the XGBRegressor and the model is fitted on the database at hand:
print('Fitting the model')
best_model = XGBRegressor(**params,early_stopping_rounds=10,num_boost_round=999)
best_model.fit(X_trainnum, y_train)
joblib.dump(best_model,'best_model_grid')
y_pred = best_model.predict(X_testnum)
y_pred1 = best_model.predict(X_trainnum)
I am using Python and XGBoost through Anaconda on both of my machines (my personal laptop and my office PC). The XGBoost version on my laptop is 0.90 and the Python version is 3.7.10. My office PC, on the other hand, runs Python 3.8.11 and XGBoost 1.4.2.
When running my code on my personal laptop with the older versions of Python and XGBoost, the code runs smoothly without any warnings or errors. However, when it is run on my office PC with the newer versions of Python and XGBoost, at each step of the for loops containing the xgb.cv() command, designed to do the hyperparameter tuning, I receive this warning message:
[13:38:44] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573:
Parameters: { "early_stoppage" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
The warning messages then change to:
Hyperparameter tuning.
[13:38:47] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573:
Parameters: { "silent", "verbose" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
And finally, when the model is fitted with XGBRegressor, it changes to:
[15:08:00] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573:
Parameters: { "early_stopping_rounds", "num_boost_round", "silent", "verbose" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
The results obtained with the older versions of the language and library used in this project are much better than the results obtained with the newer versions; the difference is very significant. The database I work with consists of 11 numerical features and a numerical target feature.
I have researched and browsed this website and other sources and sought help from a number of data analysts on this, but unfortunately I have not been able to find a solution or a reason for this problem.
I would be really thankful and appreciative if someone could help me with this issue.
I will focus on this snippet of code:
best_model = XGBRegressor(**params,early_stopping_rounds=10,num_boost_round=999)
The correct version should be:
# Removed `verbose`, `eval_metric`. Replaced `nthread` with `n_jobs`.
# Kept the objective as "reg:squarederror" since you are doing regression instead of classification.
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'booster': 'gbtree',
    'n_jobs': 10,
    'validate_parameters': 'True',
    'alpha': 0.2,
    'lambda': 0.001,
    'colsample_bylevel': 0.9,
    'gamma': 0.01,
    'max_delta_step': 0.1,
}
# Notice the `n_estimators`.
model = XGBRegressor(**params, n_estimators=999)

# Passed `early_stopping_rounds`, `verbose`, `eval_metric` here.
# Replaced the `eval_metric` with `rmse` since you are doing regression instead of classification.
# Added `eval_set` since you need to carry out evaluation.
model.fit(
    X,
    y,
    early_stopping_rounds=10,
    verbose=True,
    eval_metric="rmse",
    eval_set=[(X, y)],
)
You can find the documentation of the estimator here: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
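As a side note, a hedged variant not part of the answer above: if you want early stopping to be driven by data the booster has not been trained on, you can hold out a validation split and pass it as eval_set. X_trainnum and y_train are the arrays built in the question, and params is the cleaned dictionary above.

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Hold out 10% of the training data purely for early stopping.
X_tr, X_val, y_tr, y_val = train_test_split(X_trainnum, y_train,
                                            test_size=0.1, random_state=42)

model = XGBRegressor(**params, n_estimators=999)
model.fit(X_tr, y_tr,
          early_stopping_rounds=10,
          eval_metric="rmse",
          eval_set=[(X_val, y_val)],   # monitored for early stopping
          verbose=False)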

Retrieving specific classifiers and data from GridSearchCV

I am running a Python 3 classification script on a server using the following code:
# define knn classifier for transformed data
knn_classifier = neighbors.KNeighborsClassifier()

# define KNN parameters
knn_parameters = [{
    'n_neighbors': [1, 3, 5, 7, 9, 11],
    'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'n_jobs': [-1],
    'weights': ['uniform', 'distance']}]

# Stratified k-fold (default for classifier)
# n = 5 folds is default
knn_models = GridSearchCV(estimator=knn_classifier, param_grid=knn_parameters, scoring='accuracy')

# fit grid search models to transformed training data
knn_models.fit(X_train_transformed, y_train)
I then save the GridSearchCV object using pickle:
# save model
with open('knn_models.pickle', 'wb') as f:
    pickle.dump(knn_models, f)
So I can test the classifiers on smaller datasets on my local machine by running:
knn_models = pickle.load(open("knn_models.pickle", "rb"))
validation_knn_model = knn_models.best_estimator_
Which is great if I only want to evaluate the best estimator on a validation set. But what I'd actually like to do is:
pull the original data out of the GridSearchCV object (I'm assuming it's stored somewhere in the object because to classify the new validation set, this is required)
try a few specific classifiers with almost all of the best parameters as determined by the grid search but changing a specific input parameter i.e. k = 3, 5, 7
retrieve y_pred i.e. the predictions for each validation set for all of the new classifiers that I tested above
GridSearchCV does not include the original data (and it would arguably be absurd if it did). The only data it includes is its own bookkeeping, i.e. the detailed scores and parameters tried for each CV fold. The best_estimator_ returned is the only thing needed to apply the model to any new data encountered, but if, as you say, you would like to dig deeper into the details, the full results are returned in its cv_results_ attribute.
Adapting the example from the documentation to the knn classifier with your own knn_parameters grid (but removing n_jobs, which only affects the fitting speed and is not a real hyperparameter of the algorithm), and keeping cv=3 for simplicity, we have:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import pandas as pd
iris = load_iris()
knn_parameters = [{
'n_neighbors': [1,3,5,7, 9, 11],
'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'weights': ['uniform', 'distance']}]
knn_classifier = KNeighborsClassifier()
clf = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy', n_jobs=-1, cv=3)
clf.fit(iris.data, iris.target)
clf.best_estimator_
# result:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
So, as said, this last result tells you all you need to know to apply the algorithm to any new data (validation, test, data from deployment, etc.). Also, you may find that actually removing the n_jobs entry from the knn_parameters grid and asking instead for n_jobs=-1 in the GridSearchCV object results in a much faster CV procedure. Nevertheless, if you want to use n_jobs=-1 in your final model, you can easily manipulate the best_estimator_ to do so:
clf.best_estimator_.n_jobs = -1
clf.best_estimator_
# result
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
weights='uniform')
This actually answers your second question, since you can similarly manipulate the best_estimator_ to change other hyperparameters, too.
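For instance, a minimal sketch of your second and third bullet points: clone the best estimator, change only n_neighbors, refit, and collect predictions. X_val and y_val stand for whatever validation set you have on your local machine; they are not stored in the grid search object.

from sklearn.base import clone

# Re-fit near-optimal variants and collect their predictions on a
# separate validation set (X_val/y_val are your own held-out data).
for k in [3, 5, 7]:
    model = clone(clf.best_estimator_).set_params(n_neighbors=k)
    model.fit(iris.data, iris.target)     # refit on the training data used above
    y_pred = model.predict(X_val)         # predictions for this variant
    print(k, model.score(X_val, y_val))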
So, having found the best model is where most people would stop. But if, for any reason, you want to dig further into the details of the whole grid search process, the detailed results are returned in the cv_results_ attribute, which you can even import to a pandas dataframe for easier inspection:
cv_results = pd.DataFrame.from_dict(clf.cv_results_)
For example, the cv_results dataframe includes a column rank_test_score which, as its name clearly implies, contains the rank of each parameter combination:
cv_results['rank_test_score']
# result:
0 481
1 481
2 145
3 145
4 1
...
571 1
572 145
573 145
574 433
575 1
Name: rank_test_score, Length: 576, dtype: int32
Here 1 means best, and you can readily see that there is more than one combination ranked 1 - so in fact here we have more than one "best" model (i.e. parameter combination)! Although here this is most probably due to the relative simplicity of the iris dataset used, there is no reason in principle why it could not happen in a real case, too. In such cases, the returned best_estimator_ is just the first of these occurrences - here combination number 4:
cv_results.iloc[4]
# result:
mean_fit_time 0.000669559
std_fit_time 1.55811e-05
mean_score_time 0.00474652
std_score_time 0.000488042
param_algorithm auto
param_leaf_size 5
param_n_neighbors 5
param_weights uniform
params {'algorithm': 'auto', 'leaf_size': 5, 'n_neigh...
split0_test_score 0.98
split1_test_score 0.98
split2_test_score 0.98
mean_test_score 0.98
std_test_score 0
rank_test_score 1
Name: 4, dtype: object
which, as you can easily see, has the same parameters as our best_estimator_ above. But now you can inspect all the "best" models, simply by:
cv_results.loc[cv_results['rank_test_score']==1]
which, in my case, results in no fewer than 144 models (out of the total 6*12*4*2 = 576 models tried)! So, you can in fact select among more choices, or even use additional criteria, say the standard deviation of the returned score (the lower the better, although here it is at its minimum value of 0), instead of relying simply on the maximum mean score, which is what the automatic procedure returns.
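As a quick illustration of that last idea, you could tie-break the rank-1 configurations by the spread of their fold scores (purely illustrative; nothing in GridSearchCV does this for you):

# All configurations tied at rank 1, ordered by per-fold score spread.
best_ties = cv_results.loc[cv_results['rank_test_score'] == 1]
print(best_ties.sort_values('std_test_score')[['params', 'mean_test_score', 'std_test_score']].head())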
Hopefully these will be enough to get you started...

How to correctly compute the optimal C and gamma for my SVM?

I am trying to compute the optimal C and Gamma for my SVM. When trying to run my script I get this error:
ValueError: Invalid parameter max_features for estimator SVC. Check the list of available parameters with estimator.get_params().keys().
I went through the docs to understand what n_estimators actually means so that I know what values I should fill in there. But it is not totally clear to me. Could someone tell me what this value should be so that I can run my script in order to find the optimal C and gamma?
my code:
if __name__ == '__main__':
    fname = "/home/John/labels.csv"
    labels = pd.read_csv(fname, header=None).as_matrix()[:, 1]
    labels = map(itemgetter(1),
                 map(os.path.split,
                     map(os.path.dirname, labels)))

    fname = "/home/John/reps.csv"
    embeddings = pd.read_csv(fname, header=None).as_matrix()

    le = LabelEncoder().fit(labels)
    labelsNum = le.transform(labels)
    nClasses = len(le.classes_)

    svcClassifier = SVC(kernel='rbf', probability=True, C=10, gamma=10)
    # classifier = OneVsRestClassifier(svcClassifier).fit(embeddings, labelsNum)

    param_grid = {
        'n_estimators': [200, 700],
        'max_features': ['auto', 'sqrt', 'log2']
    }
    CV_rfc = GridSearchCV(estimator=svcClassifier, param_grid=param_grid, cv=5)
    CV_rfc.fit(embeddings, labelsNum)
    print(CV_rfc.best_params_)
After trying manually, I found out that in my case C=10 and gamma=10 give the best results. I would however like to use this function to find out what the optimal values should be.
My code is inspired by this post: How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)
The SVC class has no argument max_features or n_estimators, as these are arguments of the RandomForest you used as a base for your code. If you want to optimize the model with respect to C and gamma, you can try:
param_grid = {
    'C': [0.1, 0.5, 1.0],
    'gamma': [0.1, 0.5, 1.0]
}
Furthermore, I also recommend that you search for the optimal kernel, which can be rbf, linear or poly in the sklearn framework.
Edit: The values here are just arbitrary and meant to illustrate the general approach. You should add many different values, whose choice and range depend on your situation.
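Putting that together, a minimal sketch (grid values are again arbitrary; embeddings and labelsNum are the arrays built in the question):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search C, gamma and the kernel jointly; widen the ranges around
# whatever works manually (e.g. C=10, gamma=10 from the question).
param_grid = {
    'C': [0.1, 1.0, 10.0, 100.0],
    'gamma': [0.01, 0.1, 1.0, 10.0],
    'kernel': ['rbf', 'linear', 'poly']
}

CV_svc = GridSearchCV(estimator=SVC(probability=True),
                      param_grid=param_grid,
                      cv=5)
CV_svc.fit(embeddings, labelsNum)
print(CV_svc.best_params_)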
