K-fold Cross Validation Queries

K-fold Cross Validation Queries - python

I am trying to perform K-Fold Cross Validation and GridSearchCV to optimise my Gradient Boost model - following the link -
https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the screenshot of the Model Report below:
1) How is the accuracy of 0.814365 calculated? Where in the script does it do a train test split? If you change cv_folds=5 to cv_folds=any integer, then the accuracy is still 0.814365. Infact, removing the cv_folds and inputting performCV=False also gives the same accuracy.
(Note my sk learn No CV 80/20 train test gives accuracy of around 0.79-0.80)
2) Again, how is the AUC Score (Train) calculated? And should this be ROC-AUC rather than AUC? My sk learn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV Score so much lower than the AUC (Train) Score? It looks like they are both using roc_auc (my sklearn model gives 0.77 for the ROC AUC)
df = pd.read_csv("123.csv")
target = 'APPROVED' #item to predict
IDcol = 'ID'
def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
#Fit the algorithm on the data
alg.fit(ddf[predictors], ddf['APPROVED'])
#Predict training set:
ddf_predictions = alg.predict(ddf[predictors])
ddf_predprob = alg.predict_proba(ddf[predictors])[:,1]
#Perform cross-validation:
if performCV:
cv_score = cross_validation.cross_val_score(alg, ddf[predictors], ddf['APPROVED'], cv=cv_folds, scoring='roc_auc')
#Print model report:
print ("\nModel Report")
print ("Accuracy : %f" % metrics.accuracy_score(ddf['APPROVED'].values, ddf_predictions))
print ("AUC Score (Train): %f" % metrics.roc_auc_score(ddf['APPROVED'], ddf_predprob))
if performCV:
print ("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g" % (npy.mean(cv_score),npy.std(cv_score),npy.min(cv_score),npy.max(cv_score)))
#Print Feature Importance:
if printFeatureImportance:
feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
#Choose all predictors except target & IDcols
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)

The main reason your cv_score appears low is because comparing it to the training accuracy isn't a fair comparison. Your training accuracy is being calculated using the same data that was used to fit the model whereas the cv_score is the average score from the testing folds within your cross validation. As you can imagine a model will perform better making predictions using data it's already been trained on as opposed to having to make predictions based on new data the model has never seen before.
Your accuracy_score and auc calculations are appearing fixed because you are always using the same inputs (ddf["APPROVED"], ddf_predictions and ddf_predprob) into the calculations. The performCV section doesn't actually transform any of those datasets, so if you're using the same model, model parameters, and input data you'll get the same predictions that are going into the calculations.
Based on your comments there are a number of reasons the cv_score accuracy could be lower than the accuracy on your full testing set. One of the main reasons is you're allowing your model to access more data for training when you use the full training set as opposed to using a subset of the training data with each cv fold. This is especially true if your data size isn't all that large. If your data set isn't large then that data is more important in training and can provide better performance.

Related

Cross validation and logistic regression

I am analyzing a dataset from kaggle and want to apply a logistic regression model to predict something. This is the data: https://www.kaggle.com/code/mohamedadelhosny/stroke-prediction-data-analysis-challenge/data
I split the data into train and test, and want to use cross validation to inssure highest accuracy possible. I did some pre-processing and used the dummy function over catigorical features, got to a certain point in the code, and and I don't know how to proceed. I cant figure out how to use the results of the cross validation, it's not so straight forward.
This is what I got so far:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
X = data_Enco.iloc[:, data_Enco.columns != 'stroke'].values # features
Y = data_Enco.iloc[:, 6] # labels
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
logisticModel = LogisticRegression(class_weight='balanced')
# evaluate model
scores = cross_val_score(logisticModel, scaled_X_train, Y_train, scoring='accuracy', cv=cv)
print('average score = ', np.mean(scores))
print('std of scores = ', np.std(scores))
average score = 0.7483538453549359
std of scores = 0.0190400919099899
So far so good.. I got the results of the model for each 10 splits. But now what? how do I build a confusion matrix? how do I calculate the recall, precesion..? I have the right code without performing cross validation, I just dont know how to adapt it.. how do I use the scores of the cross_val_score function ?
logisticModel = LogisticRegression(class_weight='balanced')
logisticModel.fit(scaled_X_train, Y_train) # Train the model
predictions_log = logisticModel.predict(scaled_X_test)
## Scoring the model
logisticModel.score(scaled_X_test,Y_test)
## Confusion Matrix
Y_pred = logisticModel.predict(scaled_X_test)
real_data = Y_test
print('Observe the difference between the real data and the data predicted by the knn classifier:\n')
print('Predictions: ',Y_pred,'\n\n')
print('Real Data:m', real_data,'\n')
cmtx = pd.DataFrame(
confusion_matrix(real_data, Y_pred, labels=[0, 1]),
index = ['real 0: ', 'real 1:'], columns = ['pred 0:', 'pred 1:']
)
print(cmtx)
print('Accuracy score is: ',accuracy_score(real_data, Y_pred))
print('Precision score is: ',precision_score(real_data, Y_pred))
print('Recall Score is: ',recall_score(real_data, Y_pred))
print('F1 Score is: ',f1_score(real_data, Y_pred))

The performance of a model on the training dataset is not a good estimator of the performance on new data because of overfitting.
Cross-validation is used to obtain an estimation of the performance of your model on new data, i.e. without overfitting. And you correctly applied it to compute the mean and variance of the accuracy of your model. This should be a much better approximation of the accuracy on your test dataset than the accuracy on your training dataset. And that is it.
However, cross-validation is usually used to do model selection. Say you have two logistic regression models that use different sets of independent variables. E.g., one is using only age and gender while the other one is using age, gender, and bmi. Or you want to compare logistic regression with an SVM model.
I.e. you have several possible models and you want to decide which one is best. Of course, you cannot just compare the training dataset accuracies of all the models because those are spoiled by overfitting. And if you use the performance on the test dataset for choosing the best model, the test dataset becomes part of the training, you will have leakage, and thus the performance on the test dataset cannot be used anymore for a final, untainted performance measure. That is why cross-validation is used which creates those splits that contain different versions of validation sets.
So the idea is to
apply cross-validation to each of your candidate models,
use the scores of those cross-validations to choose the best model,
retrain that best model on the complete training dataset to get a final version of your best model, and
to finally apply this final version to the test dataset to obtain some untainted evaluation.
But note, that those three steps are for model selection. However, you have only a single model, the logistic regression, so there is nothing to select from. If you fit your model, let's call it m(p) where p denotes the parameters, to e.g. five folds of CV, you get five different fitted versions m(p1), m(p2), ..., m(p5) of the same model.
So if you have only one model, you fit it to the complete training dataset, maybe use CV to have an additional estimate for the performance on new data, but that's it. But you have already done this. There is no "selection of best model", that is only for if you have several models as described above, like e.g. logistic regression and SVM.

is it overfitting or data leakage problem?

I have applied Sklearn DecisionTreeClassifier() on a personalized dataset to perform binary classification (class 0 and class 1).
Initially classes were not balanced I tried to balance them using :
rus = RandomUnderSampler(random_state=42, replacement=True)
data_rus, target_rus = rus.fit_resample(X, y)
So my dataset was balanced with 186404 samples for class 1 and 186404 samples for class 2. The training samples were : 260965 and the testing samples were : 111843
I calculated the accuracy using sklearn.metrics and I got the next result:
clf=tree.DecisionTreeClassifier("entropy",random_state = 0)
clf.fit(data_rus, target_rus)
accuracy_score(y_test,clf.predict(X_test)) # I got 100% for both training and testing
clf.score(X_test, y_test) # I got 100% for both training and testing
So, I got 100% as accuracy for both testing and training phases I am sure that the result is abnormal and I could not understand if it is an overfitting or data leakage despite I had shuffled my data before splitting it. Then I have decided to plot both training and testing accuracy using
sklearn.model_selection.validation_curve
I got the next figure and I could not interpret it :
I tried two other classification algorithms : Logistic Regression and SVM, I have got the next testing accuracy: 99,84 and 99,94%, respectively.
Update
In my original dataset I have 4 categorical columns I mapped then using the next code:
DataFrame['Color'] = pd.Categorical(DataFrame['Color'])
DataFrame['code_Color'] = DataFrame.Color.cat.codes
After using the RandomUnderSampler to under sample my original data so to get a class balance I splitted the data into train and test datasets train_test_split of sklearn
Any idea could be helpful for me please!

High Validation AUC - Low Test AUC

I am working on a dataset called HR Attrition from kaggle (In class competition) it contains 1628 rows and 27 columns.
Most of the features are categorical in nature, I am using Random Forest and validating using Stratified K fold (10 folds) and my validation AUC is pretty high, around 0.98-99. On submitting I cant get an AUC of more than 0.85 which is a huge deviation. I have tried many things like PCA and feature selection but my validation is not trustworthy the submission score doesn`t improve.
train_data = pd.read_csv('train.csv')
# label encoding
lbl = LabelEncoder()
cat_feats = [f for f in train_data.columns if train_data[f].dtype == object]
for f in cat_feats:
train_data[f] = lbl.fit_transform(train_data[f])
test_data[f] = lbl.transform(test_data[f])
train_id = train_data.Id
train_data = train_data.drop(['Behaviour','Id'],axis = 1) # behaviour has
# only 1 value
X = train_data.drop('Attrition',axis = 1)
y = train_data.Attrition
# Standard Scaling
skf = StratifiedKFold(n_splits = 10,random_state=42,shuffle=True)
numeric = ['Age','MonthlyIncome','EmployeeNumber']
# target encoding
categorical = [f for f in X.columns if f not in numeric]
pre_pipe = make_column_transformer((TargetEncoder(),categorical),
(StandardScaler(),numeric))
pipe_rf = make_pipeline(pre_pipe,RandomForestClassifier())
print('RF:',np.mean(cross_val_score(X=X,y=y,cv=skf,estimator=pipe_rf,scoring='accuracy')))
Using target encoding my validation gave me an average of 98% accuracy (the data is balanced so using accuracy the AUC is almost 1) but the submission score is at max 85%. What should I do?

I am just being naive here because generally cross validation score shouldn’t be that far from test score.
I just want to make sure we are talking the same metrics.
The cross-validation scores return accuracy
Maybe the competition is on AUC(Area under curve)
Accuracy can be 98% but AUC can still be only 85%
If you want auc in cross validation predict update the last line with
print('RF:',np.mean(cross_val_score(X=X,y=y,cv=skf,estimator=pipe_rf,scoring= ‘roc_auc')))

Why all the true positives are classified as true negatives in the machine learning model?

I fit a random forest model for the data. I divided my dataset into training and testing in the ratio of 70:30 and trained the model. I got an accuracy of 80% for the test data. Then I took a benchmark dataset and tested the model with that dataset. That dataset only contained data with true labels(1). But when I get the prediction for the benchmark dataset using the model all the true positives are classified as true negatives. Accuracy is 90%. Why is that? Is there a way to interpret this?
X = dataset.iloc[:, 1:11].values
y=dataset.iloc[:,11].values
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,shuffle='true')
XBench_test=benchmarkData.iloc[:, 1:11].values
YBench_test=benchmarkData.iloc[:,11].values
classifier=RandomForestClassifier(n_estimators=35,criterion='entropy',max_depth=30,min_samples_split=2,min_samples_leaf=1,max_features='sqrt',class_weight='balanced',bootstrap='true',random_state=0,oob_score='true')
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
y_pred_benchmark=classifier.predict(XBench_test)
print("Accuracy on test data: {:.4f}".format(classifier.score(X_test, y_test)))\*This gives 80%*\
print("Accuracy on benchmark data: {:.4f}".format(classifier.score(XBench_test, YBench_test))) \*This gives 90%*\

I'll take a shot at providing a better way to interpret your results. In cases where you have an imbalanced data set accuracy is not going to be a good way to measure your performance.
Here is a common example:
Imagine you have a disease that is present in only .01% of people. If you predict no one has the disease you have an accuracy of 99.99% but your model is not a good model.
In this example it appears your benchmark data set (commonly referred to as a test dataset) has imbalanced classes and you are getting an accuracy of 90% when you call the classifier.score method. In this case, accuracy is not a good way to interpret the model. You should instead look at other metrics.
Other common metrics may be to look at precision and recall to determine how your model is performing. In this case since all True positives are predicted as negative your precision AND your recall would be 0, meaning your model is not differentiating very well.
Going further if you have imbalanced classes it may be better to check different thresholds of scores and look at metrics like ROC_AUC. These metrics look at the probability scores outputted by the model (predict_proba for sklearn) and test different thresholds. Perhaps your model works well at a lower threshold and the positive cases consistently score higher than the negative cases.
Here is an additional article about ROC_AUC.
Sci-kit learn has a few different metric scores you can use they are located here.
Here is one way you could implement ROC AUC into your code.
X = dataset.iloc[:, 1:11].values
y=dataset.iloc[:,11].values
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,shuffle='true')
XBench_test=benchmarkData.iloc[:, 1:11].values
YBench_test=benchmarkData.iloc[:,11].values
classifier=RandomForestClassifier(n_estimators=35,criterion='entropy',max_depth=30,min_samples_split=2,min_samples_leaf=1,max_features='sqrt',class_weight='balanced',bootstrap='true',random_state=0,oob_score='true')
classifier.fit(X_train,y_train)
#use predict_proba
y_pred=classifier.predict_proba(X_test)
y_pred_benchmark=classifier.predict_proba(XBench_test)
from sklearn.metrics import roc_auc_score
## instead of measuring accuracy use ROC AUC)
print("Accuracy on test data: {:.4f}".format(roc_auc_score(X_test, y_test)))\*This gives 80%*\
print("Accuracy on benchmark data: {:.4f}".format(roc_auc_score(XBench_test, YBench_test))) \*This gives 90%*\

Why are the grid_scores_ higher than the score for full training set? (sklearn, Python, GridSearchCV)

I'm building a logistic regression model as follows:
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'roc_auc')
I looked at the roc_auc score for the best estimator:
grid_search_object.best_score_
Out[195]: 0.94505225726738229
However, when I used the best estimator to score the full training set, I got a worse score:
grid_search_object.best_estimator_.score(X,Y)
Out[196]: 0.89636762322433028
How can this be? What am I doing wrong?
Edit: Nevermind. I'm an idiot. grid_search_object.best_estimator_.score calculates accuracy, not auc_roc. Right?
But if that is the case, how does GridSearchCV compute the grid_scores_? Does it pick the best decision threshold for each parameter, or is the decision threshold always at 0.5? For area under the ROC curve, decision threshold doesn't matter, but it does for say, f1_score.

If you evaluated the best_estimator_ on the full training set it is not surprising that the scores are different from the best_score_, even if the scoring methods are the same:
The best_score_ is the average over your cross-validation fold scores of the best model (best in exactly that sense: scores highest on average over folds).
When scoring on the whole training set, your score may be higher or lower than this. Especially if you have some sort of temporal structure in your data and you are using the wrong data splitting, scores on the full set can be worse.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.