show overfitting with sklearn & random forest

show overfitting with sklearn & random forest - python

I followed this tutorial to create a simple image classification script:
https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/
train_data = scipy.io.loadmat('extra_32x32.mat')
# extract the images and labels from the dictionary object
X = train_data['X']
y = train_data['y']
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2],X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)
....
clf = RandomForestClassifier()
print(clf)
start_time = time.time()
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test,preds))
It gave me an accuracy of approximately 0.7.
Is there someway to visualize or show where/when/if the model is overfitting? I believe this can be shown by training the model until we see that the accuracy of training is increasing and the validation data is decreasing. But how can I do so in the code?

There are multiple ways you can test overfitting and underfitting. If you want to look specifically at train and test scores and compare them you can do this with sklearns cross_validate[https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate]. If you read the documentation it will return you a dictionary with train scores (if supplied as train_score=True) and test scores in metrics that you supply.
sample code
model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
You can also simply create a hold out test set with train test split and compare your training and test scores using the test data set.

Another option is to use a library like Optuna, which will test various hyperparameters for you and you could use the methods mentioned above.

Related

learning curve of multiclass classification task

I'm trying to do a multiclass classification using multiple machine learning using this function that I have created:
def model_roc(X, y):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
pipeline1 = imbpipeline(steps = [['pca' , PCA()],
['smote', SMOTE('not majority',random_state=11)],
['scaler', MinMaxScaler()],
['classifier', LogisticRegression(random_state=11,max_iter=1000)]])
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
param_grid1 = {'classifier__penalty': ['l1', 'l2'],'classifier__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search1 = GridSearchCV(estimator=pipeline1,
param_grid=param_grid1,
scoring=make_scorer(roc_auc_score, average='weighted',multi_class='ovo', needs_proba=True),
cv=stratified_kfold,
n_jobs=-1)
print("#"*100)
print(pipeline1['classifier'])
grid_search1.fit(X_train, y_train)
cv_score1 = grid_search1.best_score_
test_score1 = grid_search1.score(X_test, y_test)
print('cv_score',cv_score1, 'test_score',test_score1)
return
I have 2 questions:
Can I get multiple metrics from the same function(ROC/Accuracy/precision and F1_score) as it is imbalanced data with multiple classes?
I need to plot the learning curve, but I don't know how to do this out of my function.

to plot a learning curve of roc_auc_score
def plot_learning_curves(model,X,y):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
train_scores ,test_scores = [],[]
for m in range(1,len(X_train)):
model.fit(X_train[:m],y_train[:m])
y_train_predict =model.predict(X_train[:m])
y_test_predict = model.predict(X_test)
train_scores.append(roc_auc_score(y_train,y_train_predict))
test_scores.append(roc_auc_score(y_test,y_test_predict))
plt.plot(train_scores,"r-+",linewidth=2,label="train")
plt.plot(test_scores,"b-",linewidth=2,label="test")
we iterate m from 1 to the whole length of the training set, slicing m records from the training data and training the model on this subset.
The model's roc_auc_score will be calculated and added to a list, showing the progress of it's roc_auc_score as the model trains on more and more instances.
-modified from ageron-handsonml2

The predictions from StackingRegressor (Sklearn) are not reproducible

I am working on training a regression model using StackingRegressor and I found out that the prediction from this model is not consistent while I am using the same random_state.
Here is my code:
random_seed = 42
mdl_lgbm = lightgbm.LGBMRegressor(colsample_bytree=0.6,
learning_rate=0.05,
max_depth=6,
min_child_samples=227,
min_child_weight=10,
n_estimators=1800,
num_leaves=45,
reg_alpha=0,
reg_lambda=1,
subsample=0.6,
n_jobs=-1,
random_state=random_seed)
mdl_xgb = xgb.XGBRegressor(subsample=0.5,
n_estimators=900,
min_child_weight=8,
max_depth=6,
learning_rate=0.03,
colsample_bytree=0.8,
n_jobs=-1,
reg_alpha=2,
reg_lambda=50,
objective='reg:squarederror',
random_state=random_seed)
mdl_rf = RandomForestRegressor(bootstrap=True,
max_depth=110,
max_features='auto',
min_samples_leaf=5,
min_samples_split=5,
n_estimators=1430,
n_jobs=-1,
random_state=random_seed)
# Base models
base_mdl_names = {
'XGB': mdl_xgb,
'LGBM': mdl_lgbm,
'RF': mdl_rf,
}
final_estimator = xgb.XGBRegressor(subsample=0.3,
n_estimators=1200,
min_child_weight=2,
max_depth=5,
learning_rate=0.06,
colsample_bytree=0.8,
n_jobs=-1,
reg_alpha=1,
reg_lambda=0.1,
objective='reg:squarederror',
random_state=random_seed)
base_estimators = list()
for name, mdl in base_mdl_names.items():
base_estimators.append((name, mdl))
stacked_mdl = StackingRegressor(estimators=base_estimators,
final_estimator=final_estimator,
cv=5,
passthrough=True)
stacked_mdl.fit(X_train, y_train)
Please note that I am not changing X_train. When I use the trained model for the prediction, the results are not reproducable. I mean that if I re-train the model, the results will be different while every input is the same. Any clue why this happening?

StackingRegressor use cv (cross validation). Therefore you also have to set its random_state in order to have exactly the same cross validation split at each run.
You should do as follow:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, random_state=random_seed, shuffle=True)
stacked_mdl = StackingRegressor(estimators=base_estimators,
final_estimator=final_estimator,
cv=kfold,
passthrough=True)

Can i know train and test accuracy individually if am using (cross_val_score)

I used a random-forest classifier to classify my dataset; I want to use cross-validation; my problem is I could not find a way to know the accuracy of train and test splits, so is this possible?
here is the code I used, which tell me about the
X_test, X_rem, y_test, y_rem = train_test_split(X,Y, test_size=0.1)
X_valid, X_train, y_valid, y_train = train_test_split(X_rem,y_rem, train_size=0.80)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = RandomForestClassifier(n_estimators =400)
scoresranP = cross_val_score(model, X_rem, y_rem, cv=cv, n_jobs=-1)
print('Accuracy of Random Forest: %.3f (%.3f)' % (mean(scoresranP), std(scoresranP)))
If I am right, in this case, the "scoresranP" will give me the training accuracy, so can I get the test accuracy using the test split?
I am wrong. Can anyone tell me if I can using cross_val_score with three splits (train, valid, test)?
I really appreciate any help you can provide.

If I am correct Random Forest does not gives you a training accuracy as it is a regression learning method.

Getting error while running in jupyter notebook

ERROR
Invalid parameter C for estimator
DecisionTreeClassifier(class_weight=None, criterion='gini',
max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'). Check the list of available parameters with estimator.get_params().keys().
CODE
def train(X_train,y_train,X_test):
# Scaling features
X_train=preprocessing.scale(X_train)
X_test=preprocessing.scale(X_test)
Cs = 10.0 ** np.arange(-2,3,.5)
gammas = 10.0 ** np.arange(-2,3,.5)
param = [{'gamma': gammas, 'C': Cs}]
skf = StratifiedKFold(n_splits=5)
skf.get_n_splits(X_train, y_train)
cvk = skf
classifier = DecisionTreeClassifier()
clf = GridSearchCV(classifier,param_grid=param,cv=cvk)
clf.fit(X_train,y_train)
print("The best classifier is: ",clf.best_estimator_)
clf.best_estimator_.fit(X_train,y_train)
# Estimate score
scores = model_selection.cross_val_score(clf.best_estimator_, X_train,y_train, cv=5)
print (scores)
print('Estimated score: %0.5f (+/- %0.5f)' % (scores.mean(), scores.std() / 2))
title = 'Learning Curves (SVM, rbf kernel, $\gamma=%.6f$)' %clf.best_estimator_.gamma
plot_learning_curve(clf.best_estimator_, title, X_train, y_train, cv=5)
plt.show()
# Predict class
y_pred = clf.best_estimator_.predict(X_test)
return y_test,y_pred

It looks like you are making the param an array with a single dictionary inside. param needs to be just a dictionary:
EDIT:
Looking into this further, as mentioned by #DzDev, passing an array containing a single dictionary is also a valid way to pass in parameters.
It appears that your issue is that you are mixing the concepts of two different types of estimators. You are passing in the parameters for svm.SVC but are sending in a DecisionTreeClassifier estimator. So it turns out the error is exactly as it says, 'C' is not a valid parameter. You should update to either using a svm.SVC estimator or updates your parameters to be correct for the DecisionTreeClassifier.

Make a graph of accuracy by epoch with mlpclassifier in scikit-learn

I would like to test the accuracy by epoch in scikit-learn. However, so far, i have been unsuccessful.
This is my code part of classifying with mlpclassifier:
NUM_EPOCHS = 1000
LOG_FOR_EVERY = 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = MLPClassifier(hidden_layer_sizes=(18, 175, 256), batch_size=528,
learning_rate_init=0.0001, beta_1=0.001,
beta_2=0.001, max_iter=1, warm_start=True)
for i in range(NUM_EPOCHS):
clf.fit(X_train, y_train.ravel())
I have also made a graph with this result but i need to make it continuous and further increase accuracy.
Why is the accuracy not increasing?

Please try with :
clf.fit(X_train, y_train.ravel(), epochs=NUM_EPOCHS)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

show overfitting with sklearn & random forest - python

Another option is to use a library like Optuna, which will test various hyperparameters for you and you could use the methods mentioned above.

Related

learning curve of multiclass classification task

The predictions from StackingRegressor (Sklearn) are not reproducible

Can i know train and test accuracy individually if am using (cross_val_score)

Getting error while running in jupyter notebook

Make a graph of accuracy by epoch with mlpclassifier in scikit-learn

Categories

Resources