High Validation AUC - Low Test AUC - python

I am working on a dataset called HR Attrition from Kaggle (an in-class competition); it contains 1628 rows and 27 columns.
Most of the features are categorical. I am using a Random Forest, validated with stratified K-fold (10 folds), and my validation AUC is very high, around 0.98-0.99. On submission, however, I can't get an AUC above 0.85, which is a huge gap. I have tried many things, such as PCA and feature selection, but my validation score is not trustworthy and the submission score doesn't improve.
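import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from category_encoders import TargetEncoder  # assumed source of TargetEncoder

train_data = pd.read_csv('train.csv')
# test_data is assumed to be loaded the same way from the competition's test file
# label encoding
lbl = LabelEncoder()
cat_feats = [f for f in train_data.columns if train_data[f].dtype == object]
for f in cat_feats:
    train_data[f] = lbl.fit_transform(train_data[f])
    test_data[f] = lbl.transform(test_data[f])
train_id = train_data.Id
# Behaviour has only 1 value
train_data = train_data.drop(['Behaviour', 'Id'], axis=1)
X = train_data.drop('Attrition', axis=1)
y = train_data.Attrition
skf = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
numeric = ['Age', 'MonthlyIncome', 'EmployeeNumber']
# target encoding for the categorical columns, standard scaling for the numeric ones
categorical = [f for f in X.columns if f not in numeric]
pre_pipe = make_column_transformer((TargetEncoder(), categorical),
                                   (StandardScaler(), numeric))
pipe_rf = make_pipeline(pre_pipe, RandomForestClassifier())
print('RF:', np.mean(cross_val_score(X=X, y=y, cv=skf, estimator=pipe_rf, scoring='accuracy')))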
With target encoding, my validation accuracy averages 98% (the data is balanced, so with accuracy this high the AUC is almost 1), but the submission score is at most 85%. What should I do?

This may be a naive point, because a cross-validation score generally shouldn't be that far from the test score, but I just want to make sure we are talking about the same metric.
Your cross_val_score call returns accuracy, while the competition may be scored on AUC (area under the curve). Accuracy can be 98% while AUC is still only 85%.
If you want AUC from cross-validation, update the last line to:
print('RF:',np.mean(cross_val_score(X=X,y=y,cv=skf,estimator=pipe_rf,scoring='roc_auc')))
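To see how the choice of scoring string changes what cross_val_score reports, here is a minimal sketch on synthetic data. The dataset, the names X_demo, y_demo, skf_demo and clf_demo, and the model settings are assumptions for illustration only, not the competition data or the asker's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem (assumed stand-in for the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# Same model, same folds, two different metrics
acc_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=skf_demo, scoring="accuracy")
auc_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=skf_demo, scoring="roc_auc")

print("mean accuracy:", acc_scores.mean())
print("mean ROC AUC: ", auc_scores.mean())
```

The two numbers measure different things (thresholded labels versus ranking of predicted scores), so they should only be compared with a leaderboard score computed with the same metric.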

Related

All probability values are less than 0.5 on unseen data

I have 15 features with a binary response variable, and I am interested in predicting probabilities rather than 0 or 1 class labels. When I trained and tested the RF model with 500 trees, CV, balanced class weights, and balanced samples in the data frame, I achieved good accuracy and also a good Brier score. As you can see in the image, the predicted probabilities of class 1 on the test data range between 0 and 1.
Here is the histogram of predicted probabilities on test data, with the majority of values at 0-0.2 and 0.9-1, which is quite accurate:
But when I try to predict probabilities for unseen data, that is, all data points for which the 0 or 1 value is unknown, the predicted probabilities for class 1 are only between 0 and 0.5. Why is that? Shouldn't the values range from 0.5 to 1?
Here is the histogram of predicted probabilities on unseen data:
I am using sklearn's RandomForestClassifier in Python. The code is below:
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Read the CSV
df = pd.read_csv('path/df_all.csv')
# Change the type of the variables as needed
df = df.astype({'probabilities': 'int32', 'CPZ_CI_new.tif': 'category'})
# Response variable is between 0 and 1, holding the actual probability values
y = df['probabilities']
# Separate majority and minority classes
df_majority = df[y == 0]
df_minority = df[y == 1]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,      # sample with replacement
                                 n_samples=100387,  # to match majority class
                                 random_state=42)   # reproducible results
# Combine majority class with upsampled minority class
df1 = pd.concat([df_majority, df_minority_upsampled])
y = df1['probabilities']
X = df1.iloc[:, 1:138]
# Change integer values to category
y_01 = y.astype('category')
# Split into training and testing sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y_01, test_size=0.30, random_state=42, stratify=y)
# Model
model = RandomForestClassifier(n_estimators=500,
                               max_features='sqrt',
                               n_jobs=-1,
                               oob_score=True,
                               bootstrap=True,
                               random_state=0, class_weight='balanced')
# I had 137 variables; to select the optimal subset, I used RFECV
rfecv = RFECV(model, step=1, min_features_to_select=1, cv=10, scoring='neg_brier_score')
rfecv.fit(X_train, y_train)
# Retrain the model with only the 15 selected variables
rf = RandomForestClassifier(n_estimators=500,
                            max_features='sqrt',
                            n_jobs=-1,
                            oob_score=True,
                            bootstrap=True,
                            random_state=0, class_weight='balanced')
# X1_train / X1_valid are the same data frames with only the 15 selected variables
# (assumed here to be built from the RFECV support mask)
X1_train = X_train.loc[:, rfecv.support_]
X1_valid = X_valid.loc[:, rfecv.support_]
rf.fit(X1_train, y_train)
# Print the ROC metric (computed here from hard class predictions, not probabilities)
print('roc_auc_score_testing:', metrics.roc_auc_score(y_valid, rf.predict(X1_valid)))
# Predicted probabilities of class 1 on training and test data
predt = rf.predict_proba(X1_train)[:, 1]
predv = rf.predict_proba(X1_valid)[:, 1]
print('brier_score_training:', metrics.brier_score_loss(y_train, predt))
print('brier_score_testing:', metrics.brier_score_loss(y_valid, predv))
#Output is,
roc_auc_score_testing: 0.9832652130944419
brier_score_training: 0.002380976369884945
brier_score_testing: 0.01669848089917487
# Later, I built a data frame (sample_img) from images of those 15 variables and used the same model to predict probabilities.
IMG_pred = rf.predict_proba(sample_img)
IMG_pred = IMG_pred[:, 1]
The results shown for your test data are not valid; your procedure contains a mistake with two serious consequences that invalidate them.
The mistake is that you upsample the minority class before splitting into training and test sets; you should instead split first, and then upsample only the training data, not the test data.
The first reason why such a procedure is invalid is that some of the duplicates created by upsampling will end up in both the training and the test splits. The algorithm is then tested on samples it has already seen during training, which violates the most fundamental requirement of a test set. For more details, see my own answer in Process for oversampling data for imbalanced binary classification; quoting from there:
I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...
The second reason is that this procedure reports biased performance measures on a test set that is no longer representative of reality. Remember, we want the test set to be representative of the real unseen data, which will of course be imbalanced. Artificially balancing the test set and claiming X% accuracy, when a great part of that accuracy is due to the artificially upsampled minority class, makes no sense and gives a misleading impression. For details, see my own answer in Balance classes in cross validation (the rationale is identical for a train-test split, as here).
The second reason is why your procedure would still be wrong even if you had not made the first mistake and had instead upsampled the training and test sets separately after splitting.
In short, you should fix the procedure so that you first split into training and test sets, and then upsample your training set only.
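The corrected order of operations can be sketched as follows. This is a minimal illustration on a synthetic data frame; the column names x1, x2 and label and the 10% minority rate are assumptions, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumed stand-in for the real data frame)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "x1": rng.normal(size=1000),
    "x2": rng.normal(size=1000),
    "label": (rng.random(1000) < 0.1).astype(int),
})

# 1) Split FIRST, so the test set keeps the original class imbalance
train_df, test_df = train_test_split(
    df_demo, test_size=0.3, random_state=42, stratify=df_demo["label"]
)

# 2) Upsample the minority class in the TRAINING split only
train_major = train_df[train_df["label"] == 0]
train_minor = train_df[train_df["label"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=42
)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced["label"].value_counts())
print(test_df["label"].value_counts())
```

Because resampling happens after the split, no duplicated row can appear on both sides, and metrics computed on test_df reflect the original class distribution.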

Do I need to create a new classifier for each fold in K-Fold Cross Validation?

I am trying to train a classifier to detect imperatives.
There are 2000 imperatives and 2000 non-imperatives in my data.
I used 10% of 4000 (400) to be my Test set, and the rest of 3600 sentences as Training set for the classifiers.
I tried to apply the concept of K-Fold Cross Validation.
Part of my code is below:
import nltk

featuresets = [(document_features(d, word_features), c) for (d, c) in train]
# first 360 sentences (first 10% of the data) are the first test_set
train_set, test_set = featuresets[360:], featuresets[:360]
classifier = nltk.NaiveBayesClassifier.train(train_set)
a = nltk.classify.accuracy(classifier, test_set)
# second 10% of the sentences are the second test_set
train_set2, test_set2 = (featuresets[:360] + featuresets[720:],
                         featuresets[360:720])
classifier2 = classifier.train(train_set2)
b = nltk.classify.accuracy(classifier2, test_set2)
# third 10% of the data is the third test_set
train_set3, test_set3 = (featuresets[:720] + featuresets[1080:],
                         featuresets[720:1080])
classifier3 = classifier2.train(train_set3)
c = nltk.classify.accuracy(classifier3, test_set3)
# fourth 10% of the data is the fourth test_set
train_set4, test_set4 = (featuresets[:1080] + featuresets[1440:],
                         featuresets[1080:1440])
classifier4 = classifier3.train(train_set4)
d = nltk.classify.accuracy(classifier4, test_set4)
I repeated the same training step 10 times (only 4 are shown in my code), because each of the 10 parts of the data needs to serve as validation data once for K-fold cross validation.
My question is whether I should create a new classifier each time (classifier = nltk.NaiveBayesClassifier.train(train_set)), train it, and take the average of the individual classifiers' accuracy scores as the overall accuracy, or whether I should keep training the previously trained classifier on the new data (as I do now), so that the last classifier is the one trained 10 times.
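For reference, the standard K-fold procedure fits a fresh model on each fold's training portion and averages the per-fold scores. Below is a minimal, library-free sketch of that loop. The helper k_fold_accuracy and the toy majority-label "classifier" are hypothetical stand-ins written for this example, not NLTK functions.

```python
from collections import Counter

def k_fold_accuracy(featuresets, train_fn, accuracy_fn, k=10):
    """Average accuracy over k folds, training a NEW model for every fold."""
    fold_size = len(featuresets) // k  # leftover items always stay in training
    scores = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test_fold = featuresets[start:stop]
        train_folds = featuresets[:start] + featuresets[stop:]
        model = train_fn(train_folds)  # fresh model, never reused across folds
        scores.append(accuracy_fn(model, test_fold))
    return sum(scores) / k

# Toy stand-ins (hypothetical): the "model" is just the most common training label
def train_majority(labeled):
    return Counter(label for _, label in labeled).most_common(1)[0][0]

def accuracy_majority(model, labeled):
    return sum(label == model for _, label in labeled) / len(labeled)

toy = [({"i": i}, "a" if i % 2 == 0 else "b") for i in range(20)]
mean_acc = k_fold_accuracy(toy, train_majority, accuracy_majority, k=5)
print(mean_acc)
```

With NLTK, the same loop could be called as k_fold_accuracy(featuresets, nltk.NaiveBayesClassifier.train, nltk.classify.accuracy), since both functions have the shapes the helper expects.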

K-fold Cross Validation Queries

I am trying to perform K-Fold Cross Validation and GridSearchCV to optimise my Gradient Boost model - following the link -
https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the screenshot of the Model Report below:
1) How is the accuracy of 0.814365 calculated? Where in the script does it do a train/test split? If you change cv_folds=5 to any other integer, the accuracy is still 0.814365. In fact, removing cv_folds and passing performCV=False also gives the same accuracy.
(Note: my sklearn 80/20 train/test split without CV gives an accuracy of around 0.79-0.80.)
2) Again, how is the AUC Score (Train) calculated? And should this be labelled ROC-AUC rather than AUC? My sklearn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV Score so much lower than the AUC Score (Train)? It looks like both use roc_auc (my sklearn model gives 0.77 for the ROC AUC).
import matplotlib.pyplot as plt
import numpy as npy
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score  # lived in sklearn.cross_validation in old releases

df = pd.read_csv("123.csv")
target = 'APPROVED'  # item to predict
IDcol = 'ID'

def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the full data
    alg.fit(ddf[predictors], ddf['APPROVED'])
    # Predict on the training set (the same data used for fitting)
    ddf_predictions = alg.predict(ddf[predictors])
    ddf_predprob = alg.predict_proba(ddf[predictors])[:, 1]
    # Perform cross-validation
    if performCV:
        cv_score = cross_val_score(alg, ddf[predictors], ddf['APPROVED'], cv=cv_folds, scoring='roc_auc')
    # Print model report
    print("\nModel Report")
    print("Accuracy : %f" % metrics.accuracy_score(ddf['APPROVED'].values, ddf_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(ddf['APPROVED'], ddf_predprob))
    if performCV:
        print("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g" % (npy.mean(cv_score), npy.std(cv_score), npy.min(cv_score), npy.max(cv_score)))
    # Print feature importance
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')

# Choose all predictors except target & ID columns
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)
The main reason your cv_score appears low is because comparing it to the training accuracy isn't a fair comparison. Your training accuracy is being calculated using the same data that was used to fit the model whereas the cv_score is the average score from the testing folds within your cross validation. As you can imagine a model will perform better making predictions using data it's already been trained on as opposed to having to make predictions based on new data the model has never seen before.
Your accuracy_score and auc calculations are appearing fixed because you are always using the same inputs (ddf["APPROVED"], ddf_predictions and ddf_predprob) into the calculations. The performCV section doesn't actually transform any of those datasets, so if you're using the same model, model parameters, and input data you'll get the same predictions that are going into the calculations.
Based on your comments there are a number of reasons the cv_score accuracy could be lower than the accuracy on your full testing set. One of the main reasons is you're allowing your model to access more data for training when you use the full training set as opposed to using a subset of the training data with each cv fold. This is especially true if your data size isn't all that large. If your data set isn't large then that data is more important in training and can provide better performance.
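The gap between a score computed on the training data and the cross-validated score can be reproduced with a small sketch. The synthetic dataset and the names X_demo, y_demo and gbm_demo are assumptions for illustration; the real data will give different numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Noisy synthetic data (assumed), so held-out performance is clearly below training
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=10
)

# "AUC Score (Train)": fit and score on the SAME rows; does not depend on cv_folds
gbm_demo = GradientBoostingClassifier(random_state=10)
gbm_demo.fit(X_demo, y_demo)
train_auc = roc_auc_score(y_demo, gbm_demo.predict_proba(X_demo)[:, 1])

# "CV Score": each fold is scored on rows the model did not see while fitting
cv_auc = cross_val_score(
    GradientBoostingClassifier(random_state=10), X_demo, y_demo, cv=5, scoring="roc_auc"
)

print("train AUC:  ", train_auc)
print("mean CV AUC:", cv_auc.mean())
```

Changing cv here only changes cv_auc; train_auc stays the same because it never touches the cross-validation splits, which mirrors the "fixed" accuracy and AUC in the Model Report.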

Any difference between H2O and Scikit-Learn metrics scoring?

I used H2O to build some machine learning models for a binary classification problem, and the test results looked pretty good. But then I checked and found something weird. Out of curiosity, I printed the model's predictions for the test set and found that my model actually predicts 0 (negative) all the time, yet the AUC is around 0.65 and the precision is not 0.0. I then used scikit-learn to compare the metric scores, and (as expected) they are different. Scikit-learn yielded 0.0 precision and a 0.5 AUC score, which I think is correct. Here's the code that I used:
model = h2o.load_model(model_path)
predictions = model.predict(Test_data).as_data_frame()
# H2O version to print the AUC score
auc = model.model_performance(Test_data).auc()
# Python version to print the AUC score
auc_sklearn = sklearn.metrics.roc_auc_score(y_true, predictions['predict'].tolist())
Any thoughts? Thanks in advance!
There is no difference between H2O and scikit-learn scoring; you just need to understand how to read the output so you can compare them accurately.
If you look at the data in predictions['predict'], you'll see that it's a predicted class, not a raw predicted value. AUC uses the latter, so you'll need to use the correct column. See below:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# Generate predictions on a test set
pred = model.predict(test)
Examine the output:
In [4]: pred.head()
Out[4]:
  predict        p0        p1
---------  --------  --------
        0  0.715077  0.284923
        0  0.778536  0.221464
        0  0.580118  0.419882
        1  0.316875  0.683125
        0  0.71118   0.28882
        1  0.342766  0.657234
        1  0.297636  0.702364
        0  0.594192  0.405808
        1  0.513834  0.486166
        0  0.70859   0.29141
[10 rows x 3 columns]
Now compare to sklearn:
from sklearn.metrics import roc_auc_score
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
# 0.78170751032654806
Here you see that they are approximately the same. AUC is an approximate method, so you'll see differences after a few decimal places when you compare different implementations.
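The effect of passing class labels instead of probabilities to roc_auc_score can be shown with scikit-learn alone. The arrays below are made-up values chosen so that every probability is under 0.5, as in the question; they are not output from any H2O model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and class-1 probabilities (all below 0.5)
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
p1_demo = np.array([0.05, 0.10, 0.20, 0.15, 0.40, 0.30, 0.25, 0.12, 0.08, 0.45])

# Thresholding at 0.5 turns every prediction into class 0
labels_demo = (p1_demo >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true_demo, labels_demo)  # constant scores
auc_from_probs = roc_auc_score(y_true_demo, p1_demo)       # real ranking

print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_probs)
```

A constant vector of labels carries no ranking information, so its AUC is 0.5, while the probabilities still rank most positives above most negatives and give a much higher AUC.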

Why are the grid_scores_ higher than the score for full training set? (sklearn, Python, GridSearchCV)

I'm building a logistic regression model as follows:
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
                     'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'roc_auc')
I looked at the roc_auc score for the best estimator:
grid_search_object.best_score_
Out[195]: 0.94505225726738229
However, when I used the best estimator to score the full training set, I got a worse score:
grid_search_object.best_estimator_.score(X,Y)
Out[196]: 0.89636762322433028
How can this be? What am I doing wrong?
Edit: Nevermind. I'm an idiot. grid_search_object.best_estimator_.score calculates accuracy, not auc_roc. Right?
But if that is the case, how does GridSearchCV compute the grid_scores_? Does it pick the best decision threshold for each parameter, or is the decision threshold always at 0.5? For area under the ROC curve, decision threshold doesn't matter, but it does for say, f1_score.
If you evaluated the best_estimator_ on the full training set it is not surprising that the scores are different from the best_score_, even if the scoring methods are the same:
The best_score_ is the average over your cross-validation fold scores of the best model (best in exactly that sense: scores highest on average over folds).
When scoring on the whole training set, your score may be higher or lower than this. Especially if you have some sort of temporal structure in your data and you are using the wrong data splitting, scores on the full set can be worse.
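The different quantities discussed here can be computed side by side. This is a minimal sketch with the current scikit-learn API on synthetic data; the names X_demo, y_demo, pipe_demo and grid_demo and the parameter grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

pipe_demo = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
grid_demo = GridSearchCV(
    pipe_demo,
    {"model__C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=10),
    scoring="roc_auc",
)
grid_demo.fit(X_demo, y_demo)

# Mean ROC AUC over the held-out folds for the best parameter setting
cv_auc = grid_demo.best_score_
# Pipeline.score falls back to the classifier's default metric: accuracy
train_acc = grid_demo.best_estimator_.score(X_demo, y_demo)
# ROC AUC on the full training set, computed explicitly from decision scores
train_auc = roc_auc_score(y_demo, grid_demo.best_estimator_.decision_function(X_demo))

print("best_score_ (CV ROC AUC):    ", cv_auc)
print("best_estimator_.score (acc): ", train_acc)
print("ROC AUC on full training set:", train_auc)
```

Calling grid_demo.score(X_demo, y_demo) instead of best_estimator_.score uses the scorer passed as scoring, so it returns the ROC AUC rather than accuracy.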
