The predictions from StackingRegressor (Sklearn) are not reproducible - python

I am training a regression model with StackingRegressor, and I found that its predictions are not consistent even though I am using the same random_state.
Here is my code:
import lightgbm
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor, StackingRegressor

random_seed = 42

mdl_lgbm = lightgbm.LGBMRegressor(colsample_bytree=0.6,
                                  learning_rate=0.05,
                                  max_depth=6,
                                  min_child_samples=227,
                                  min_child_weight=10,
                                  n_estimators=1800,
                                  num_leaves=45,
                                  reg_alpha=0,
                                  reg_lambda=1,
                                  subsample=0.6,
                                  n_jobs=-1,
                                  random_state=random_seed)

mdl_xgb = xgb.XGBRegressor(subsample=0.5,
                           n_estimators=900,
                           min_child_weight=8,
                           max_depth=6,
                           learning_rate=0.03,
                           colsample_bytree=0.8,
                           n_jobs=-1,
                           reg_alpha=2,
                           reg_lambda=50,
                           objective='reg:squarederror',
                           random_state=random_seed)

mdl_rf = RandomForestRegressor(bootstrap=True,
                               max_depth=110,
                               max_features='auto',
                               min_samples_leaf=5,
                               min_samples_split=5,
                               n_estimators=1430,
                               n_jobs=-1,
                               random_state=random_seed)

# Base models
base_mdl_names = {
    'XGB': mdl_xgb,
    'LGBM': mdl_lgbm,
    'RF': mdl_rf,
}

final_estimator = xgb.XGBRegressor(subsample=0.3,
                                   n_estimators=1200,
                                   min_child_weight=2,
                                   max_depth=5,
                                   learning_rate=0.06,
                                   colsample_bytree=0.8,
                                   n_jobs=-1,
                                   reg_alpha=1,
                                   reg_lambda=0.1,
                                   objective='reg:squarederror',
                                   random_state=random_seed)

base_estimators = list()
for name, mdl in base_mdl_names.items():
    base_estimators.append((name, mdl))

stacked_mdl = StackingRegressor(estimators=base_estimators,
                                final_estimator=final_estimator,
                                cv=5,
                                passthrough=True)

stacked_mdl.fit(X_train, y_train)
Please note that I am not changing X_train. When I use the trained model for prediction, the results are not reproducible. I mean that if I re-train the model, the results will be different even though every input is the same. Any clue why this is happening?

StackingRegressor uses cv (cross-validation) internally. Therefore you also have to fix its random_state in order to get exactly the same cross-validation splits at each run.
You should do it as follows:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, random_state=random_seed, shuffle=True)
stacked_mdl = StackingRegressor(estimators=base_estimators,
                                final_estimator=final_estimator,
                                cv=kfold,
                                passthrough=True)
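If you want to double-check that the pipeline is now deterministic, a minimal sketch (assuming X_train, y_train and the estimators defined above are available) is to fit two fresh clones of the stack on the same data and compare their predictions:

import numpy as np
from sklearn.base import clone

# Fit two independent clones of the stacked model on identical data
mdl_a = clone(stacked_mdl).fit(X_train, y_train)
mdl_b = clone(stacked_mdl).fit(X_train, y_train)

# With the KFold (and every estimator's random_state) fixed, the two sets of
# predictions should agree up to floating-point noise
print(np.allclose(mdl_a.predict(X_train), mdl_b.predict(X_train)))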

Related

KNN algorithm with GridSearchCV

I'm trying to create a KNN model with GridSearchCV but am getting an error pertaining to param_grid: "ValueError: Invalid parameter classifier_leaf_size for estimator KNeighborsClassifier(). Check the list of available parameters with estimator.get_params().keys().".
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

scores = cross_val_score(knn, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
print('Training Accuracy - KNN Classification: ', knn.score(x_train, y_train))
print('Testing Accuracy - KNN Classification', knn.score(x_test, y_test))
plt.show()

# classification report
cr = classification_report(y_test, y_pred)
print(cr, "\n")

# grid
estimator_KNN = KNeighborsClassifier(algorithm='auto')
knn_grid_set_up = {'n_neigbors': (1, 10, 1),
                   'classifier_leaf_size': (20, 40, 1),
                   'p': (1, 2),
                   'classifier_weights': ('uniform', 'distance')
                   }
grid_search_KNN = GridSearchCV(
    estimator=estimator_KNN,
    param_grid=knn_grid_set_up,
    scoring='accuracy',
    n_jobs=-1,
    cv=5
)
knn_grid.fit(x_train, y_train)
What causes the error? I've read the documentation and tried various methods, but I still can't understand what is going on.
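As a side note, the error message itself points at a quick way to list the parameter names that GridSearchCV will accept for a bare KNeighborsClassifier; this is only an illustrative check, not part of the original code:

from sklearn.neighbors import KNeighborsClassifier

# Valid param_grid keys are exactly the estimator's own parameter names,
# e.g. 'n_neighbors', 'leaf_size', 'weights', 'p'. A 'classifier__' prefix is
# only needed when the estimator is a Pipeline step named 'classifier'.
print(sorted(KNeighborsClassifier().get_params().keys()))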

show overfitting with sklearn & random forest

I followed this tutorial to create a simple image classification script:
https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/
import time
import scipy.io
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train_data = scipy.io.loadmat('extra_32x32.mat')

# extract the images and labels from the dictionary object
X = train_data['X']
y = train_data['y']

X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2], X.shape[3]).T
y = y.reshape(y.shape[0],)

X, y = shuffle(X, y, random_state=42)
....

clf = RandomForestClassifier()
print(clf)
start_time = time.time()
# Output of print(clf):
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#                        max_depth=None, max_features='auto', max_leaf_nodes=None,
#                        min_impurity_split=1e-07, min_samples_leaf=1,
#                        min_samples_split=2, min_weight_fraction_leaf=0.0,
#                        n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
#                        verbose=0, warm_start=False)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
It gave me an accuracy of approximately 0.7.
Is there some way to visualize or show where/when/if the model is overfitting? I believe this can be shown by training the model until we see the training accuracy keep increasing while the validation accuracy decreases. But how can I do so in the code?
There are multiple ways you can test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html). As the documentation explains, it returns a dictionary with train scores (if you pass return_train_score=True) and test scores for the metrics that you supply.
Sample code:
from sklearn.model_selection import cross_validate

model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
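To illustrate how you would read the result (the key names below assume the default single scorer), compare the mean train and test scores; a large gap suggests overfitting:

import numpy as np

# cross_validate returns arrays of per-fold scores
print("mean train score:", np.mean(cv_dict['train_score']))
print("mean test score: ", np.mean(cv_dict['test_score']))
# a train score much higher than the test score is a typical sign of overfitting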
You can also simply create a hold-out test set with train_test_split and compare your training and test scores on that held-out data, as in the sketch below.
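A minimal sketch of that hold-out comparison (assuming X and y from the question are already loaded):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Compare the two scores: a training score far above the test score
# indicates the model is overfitting the training data
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))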
Another option is to use a library like Optuna, which will test various hyperparameters for you; you can then combine it with the methods mentioned above.

Results from GridSearchCV/RandomizedSearchCV cannot be reproduced by running a single model using the same parameters

I am running RandomizedSearchCV with 5 folds in order to find the best parameters. I have a hold-out set (X_test) that I use for prediction. The relevant portion of my code is:
svc = SVC(class_weight=class_weights, random_state=42)

Cs = [0.01, 0.1, 1, 10, 100, 1000, 10000]
gammas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
param_grid = {'C': Cs,
              'gamma': gammas,
              'kernel': ['linear', 'rbf', 'poly']}

my_cv = TimeSeriesSplit(n_splits=5).split(X_train)

rs_svm = RandomizedSearchCV(SVC(), param_grid, cv=my_cv, scoring='accuracy',
                            refit='accuracy', verbose=3, n_jobs=1, random_state=42)
rs_svm.fit(X_train, y_train)
y_pred = rs_svm.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
clfreport = classification_report(y_test, y_pred)
print(rs_svm.best_params_)
The result is the following classification report:
Now, I am interested in reproducing this result with a stand-alone model (no RandomizedSearchCV) using the selected parameters:
from sklearn.model_selection import TimeSeriesSplit

tcsv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tcsv.split(X_train):
    train_index_ = int(train_index.shape[0])
    test_index_ = int(test_index.shape[0])
    X_train_, y_train_ = X_train[0:train_index_], y_train[0:train_index_]
    X_test_, y_test_ = X_train[test_index_:], y_train[test_index_:]

    class_weights = compute_class_weight('balanced', np.unique(y_train_), y_train_)
    class_weights = dict(enumerate(class_weights))

    svc = SVC(C=0.01, gamma=0.1, kernel='linear', class_weight=class_weights, verbose=True,
              random_state=42)
    svc.fit(X_train_, y_train_)
    y_pred_ = svc.predict(X_test)

    cm = confusion_matrix(y_test, y_pred_)
    clfreport = classification_report(y_test, y_pred_)
In my understanding the clfreports should be identical, but my results after this run are:
Does anyone have any suggestions why that might be happening?
Given your 1st code snippet, where you use RandomizedSearchCV to find the best hyperparameters, you don't need to do any splitting again; so, in your 2nd snippet, you should just fit on the whole of your training set using the found hyperparameters and class weights, and then predict on your test set:
class_weights = compute_class_weight('balanced', np.unique(y_train), y_train)
class_weights = dict(enumerate(class_weights))

svc = SVC(C=0.01, gamma=0.1, kernel='linear', class_weight=class_weights, verbose=True, random_state=42)
svc.fit(X_train, y_train)
y_pred_ = svc.predict(X_test)

cm = confusion_matrix(y_test, y_pred_)
clfreport = classification_report(y_test, y_pred_)
The discussion in Order between using validation, training and test sets might be useful for clarifying the procedure...

How to get SHAP values of the model averaged by folds?

This is how I get SHAP values from a model trained on a single fold:
clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='auc', verbose=100, early_stopping_rounds=200)

import shap  # package used to calculate SHAP values

# Create object that can calculate SHAP values
explainer = shap.TreeExplainer(clf)

# Calculate SHAP values
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)
As you know, the results from different folds might be different - how do I average these shap_values?
Because we have the following rule:
It is fine to average the SHAP values from models with the same output trained on the same input features, just make sure to also average the expected_value from each explainer. However, if you have non-overlapping test sets then you can't average the SHAP values from the test sets since they are for different samples. You could just explain the SHAP values for the whole dataset using each of your models and then average that into a single matrix. (It's fine to explain examples in your training set, just remember you may be overfit to them.)
So we need some hold-out dataset here to follow that rule. I did something like this to get everything to work as expected:
import numpy as np
import lightgbm as lgb
import shap
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score

shap_values = None

(X_train, X_test, y_train, y_test) = train_test_split(df[feat], df['target'].values,
                                                       test_size=0.2, shuffle=True,
                                                       stratify=df['target'].values,
                                                       random_state=42)

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds_idx = [(train_idx, val_idx)
             for train_idx, val_idx in folds.split(X_train, y=y_train)]

auc_scores = []
oof_preds = np.zeros(df[feat].shape[0])
test_preds = []

for n_fold, (train_idx, valid_idx) in enumerate(folds_idx):
    train_x, train_y = df[feat].iloc[train_idx], df['target'].iloc[train_idx]
    valid_x, valid_y = df[feat].iloc[valid_idx], df['target'].iloc[valid_idx]

    clf = lgb.LGBMClassifier(nthread=4, boosting_type='gbdt', is_unbalance=True, random_state=42,
                             learning_rate=0.05, max_depth=3,
                             reg_lambda=0.1, reg_alpha=0.01, min_child_samples=21,
                             subsample_for_bin=5000, metric='auc', n_estimators=5000)
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (valid_x, valid_y)],
            eval_metric='auc', verbose=False, early_stopping_rounds=100)

    explainer = shap.TreeExplainer(clf)
    if shap_values is None:
        shap_values = explainer.shap_values(X_test)
    else:
        shap_values += explainer.shap_values(X_test)

    oof_preds[valid_idx] = clf.predict_proba(valid_x)[:, 1]
    auc_scores.append(roc_auc_score(valid_y, oof_preds[valid_idx]))

print('AUC: ', np.mean(auc_scores))

shap_values /= 10  # number of folds
shap.summary_plot(shap_values, X_test)
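One detail from the quoted rule that the loop above does not handle: the expected_value of each fold's explainer should also be averaged, so that force or decision plots built from the averaged SHAP values use a matching baseline. A hedged sketch of that addition (the expected_values list is my naming, not from the original code):

import numpy as np

expected_values = []  # one baseline per fold

# inside the fold loop, right after creating the TreeExplainer:
expected_values.append(explainer.expected_value)

# after the loop, alongside shap_values /= 10:
expected_value = np.mean(expected_values, axis=0)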

LightGBM Error - length not same as data

I am using LightGBM to find feature importances, but I am getting the error LightGBMError: b'len of label is not same with #data'.
X.shape
(73147, 12)
y.shape
(73147,)
Code:
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type='goss',
                           n_estimators=10000, class_weight='balanced')

# Fit the model twice to avoid overfitting
for i in range(2):
    # Split into training and validation set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=i)

    # Train using early stopping
    model.fit(X, y_train, early_stopping_rounds=100, eval_set=[(X_test, y_test)],
              eval_metric='auc', verbose=200)

    # Record the feature importances
    feature_importances += model.feature_importances_
See screenshot below:
You seem to have a typo in your code; instead of
model.fit(X, y_train, [...])
it should be
model.fit(X_train, y_train, [...])
As it is now, X and y_train indeed do not have the same length, hence your error.
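For completeness, a minimal sketch of the corrected loop (same hyperparameters as in the question; only the first argument of fit changes):

for i in range(2):
    # Split into training and validation set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=i)

    # Train on X_train (not X), so features and labels have matching lengths
    model.fit(X_train, y_train, early_stopping_rounds=100, eval_set=[(X_test, y_test)],
              eval_metric='auc', verbose=200)

    # Record the feature importances
    feature_importances += model.feature_importances_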
