I tried to use H2O to build some machine learning models for a binary classification problem, and the test results looked pretty good. But then I checked and found something weird. Out of curiosity I printed the model's predictions for the test set, and I found that the model actually predicts 0 (negative) all the time, yet the AUC is around 0.65 and the precision is not 0.0. I then used scikit-learn just to compare the metric scores, and (as expected) they were different: scikit-learn yielded a precision of 0.0 and an AUC of 0.5, which I think is correct. Here's the code that I used:
import h2o
import sklearn.metrics

model = h2o.load_model(model_path)
predictions = model.predict(Test_data).as_data_frame()

# H2O version of the AUC score
auc = model.model_performance(Test_data).auc()

# scikit-learn version of the AUC score
auc_sklearn = sklearn.metrics.roc_auc_score(y_true, predictions['predict'].tolist())
Any thoughts? Thanks in advance!
There is no difference between H2O and scikit-learn scoring; you just need to understand how to interpret the output so you can compare them accurately.
If you look at the data in predictions['predict'], you'll see that it's a predicted class, not a raw predicted value. AUC uses the latter, so you need to score the correct column. See below:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# Generate predictions on a test set
pred = model.predict(test)
Examine the output:
In [4]: pred.head()
Out[4]:
predict p0 p1
--------- -------- --------
0 0.715077 0.284923
0 0.778536 0.221464
0 0.580118 0.419882
1 0.316875 0.683125
0 0.71118 0.28882
1 0.342766 0.657234
1 0.297636 0.702364
0 0.594192 0.405808
1 0.513834 0.486166
0 0.70859 0.29141
[10 rows x 3 columns]
Now compare to sklearn:
from sklearn.metrics import roc_auc_score
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
# 0.78170751032654806
Here you can see that they are approximately the same. The AUC is computed numerically from the ROC curve, so you'll see small differences a few decimal places in when you compare different implementations.
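To see where the original comparison went wrong, you can score both columns side by side. A minimal sketch reusing the pred_df and y_true frames from above (exact values will vary):
from sklearn.metrics import roc_auc_score
# AUC from the hard class labels in 'predict' -- this is what the original
# comparison did; it throws away the ranking information, so the score
# collapses toward 0.5 when one class dominates the predictions
auc_from_labels = roc_auc_score(y_true, pred_df['predict'].tolist())
# AUC from the predicted probability of the positive class -- this is what
# H2O's model_performance() reports
auc_from_probs = roc_auc_score(y_true, pred_df['p1'].tolist())
print(auc_from_labels, auc_from_probs)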
I am experimenting with machine learning classification algorithms.
I am comparing different methods (logistic regression, KNN, and predictions based on KMeans centroids) to assign a category.
Everything worked perfectly, except that KMeans inverted the labels 0 and 1. The results are still correct; it's just that the categories no longer correspond.
The confusion matrix is therefore flipped between True and False, and my accuracy score, instead of 99%, is now at 1%.
Cluster 0 should be the one related to False and cluster 1 to True. In addition, the True cases outnumber the False ones in this dataset, but that may not hold in another one.
Is there any solution to fix the labels beforehand or to reassign the KMeans cluster labels?
I don't have this issue with KNN or logistic regression, whose categories correspond correctly to 0 and 1.
Here is my code, for a dataframe of 1500 rows and 6 columns, to predict the category, 0 or 1, i.e. False or True:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans

# KMeans model initialization
km = KMeans(n_clusters=2)
km.fit(X_train_std)

# centroids definition
centroid = km.cluster_centers_
c_km = pd.DataFrame(centroid, columns=X_name)

# prediction for the 2 clusters
y_pred_km = km.predict(X_test_std)

# store the predictions and map cluster ids to True/False
pred['pred_km'] = y_pred_km
pred['is_genuine_km'] = pred['pred_km'].apply(lambda x: True if x > 0 else False)

# plot the confusion matrix & accuracy score
fig, ax = plt.subplots(1, 1)
cm_km = metrics.confusion_matrix(y_test, y_pred_km)
cm_display_km = metrics.ConfusionMatrixDisplay(cm_km, display_labels=['False', 'True'])
cm_display_km.plot(ax=ax)
ax.set_title('K-Means Confusion Matrix \n Accuracy = %0.3f' % metrics.accuracy_score(y_test, y_pred_km))
plt.show()
I assume you are using scikit-learn. In that case, you can pass km = KMeans(n_clusters=2, random_state=42) to seed the random number generator, so it delivers the same clustering on each run.
See the KMeans documentation for the random_state parameter:
Use an int to make the randomness deterministic.
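A minimal sketch of the seeded call, assuming scikit-learn's KMeans and the X_train_std / X_test_std arrays from the question:
from sklearn.cluster import KMeans
# Fixing random_state makes the centroid initialisation deterministic,
# so repeated runs produce the same clustering for the same data
km = KMeans(n_clusters=2, random_state=42)
km.fit(X_train_std)
y_pred_km = km.predict(X_test_std)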
I am running a convolutional neural network. After it finishes running, I use some metrics to evaluate the performance of the model; two of these metrics are auc and roc_auc_score from sklearn.
AUC function: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html?highlight=auc#sklearn.metrics.auc
AUROC function: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score
The code I am using is the following:
from sklearn import metrics

print(pred)
fpr, tpr, thresholds = metrics.roc_curve(true_classes, pred, pos_label=1)
print("-----AUC-----")
print(metrics.auc(fpr, tpr))
print("----ROC AUC-----")
print(metrics.roc_auc_score(true_classes, pred))
Here true_classes is an array of the form [0 1 0 1 1 0], where 1 is the positive label and 0 the negative, and pred contains the model's predictions:
prediction = classifier.predict(test_final)

# collect the first output value of each prediction
prediction1 = []
for preds in prediction:
    prediction1.append(preds[0])

pred = prediction1
However, I am getting the same AUC and ROC AUC values no matter how many times I run the test. (To be clear: the AUC and ROC AUC values within each test are the same, not that they stay the same across tests. For example, for test 1 I get AUC = 0.987 and ROC_AUC = 0.987, and for test 2 I get AUC = 0.95 and ROC_AUC = 0.95.) Am I doing something wrong, or is this normal?
As per the documentation linked, metrics.auc is a general-purpose method that calculates the area under a curve from the points of that curve, while metrics.roc_auc_score is a specific method for calculating the area under the ROC curve.
You would not expect to see different results if you're using the same data to calculate both: metrics.roc_auc_score does the same thing as metrics.auc and, most likely, uses metrics.auc itself under the hood (i.e. it applies the general method to the specific task of calculating the area under the ROC curve).
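A quick way to convince yourself is to compute both on the same toy data; the two calls should agree. A minimal sketch (the labels and scores below are made up for illustration):
from sklearn import metrics

y_true = [0, 1, 0, 1, 1, 0]               # toy ground-truth labels
y_score = [0.1, 0.8, 0.3, 0.7, 0.6, 0.4]  # toy predicted scores

# general-purpose trapezoidal area under an arbitrary curve
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label=1)
auc_general = metrics.auc(fpr, tpr)

# convenience wrapper that builds the ROC curve internally
auc_specific = metrics.roc_auc_score(y_true, y_score)

print(auc_general, auc_specific)  # the two numbers should match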
I have a variable whose value I need to predict as closely as possible, but never exceed. For example, given y_true = 9000, I want y_pred to be any value in the range [0, 9000], as close to 9000 as possible; if y_true = 8000, then y_pred should be in [0, 8000]. In other words, I want to put a restriction on the predicted value, and that threshold is individual to each pair of prediction and target in the sample: if y_true = [8750, 9200, 8900, 7600], then y_pred should be [<=8750, <=9200, <=8900, <=7600]. The task is to never predict more than the target while getting as close to it as possible. Predicting zero everywhere would technically be a valid answer, but I need to get as close as possible.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

data, target = np.array(data), np.array(df_tar)
X_train, X_test, y_train, y_test = train_test_split(data, target)
gbr = GradientBoostingRegressor(max_depth=1, n_estimators=100)
%time gbr.fit(X_train, np.ravel(y_train))
print(gbr.score(X_test, y_test), gbr.score(X_train, y_train))
Given the complexity of actually changing a model inside sklearn so that it respects the constraint you want, I strongly suggest you apply this filter after the prediction instead: cap each predicted value at its corresponding target (e.g. replace anything over 9000 with 9000). Afterwards, manually compute the score, which I believe is MSE in this scenario.
Here is a full working example of my approach:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as mse
import numpy as np
X = [[8500,9500],[9200,8700],[8500,8250],[5850,8800]]
y = [8750,9200,8900,7600]
data, target = np.array(X),np.array(y)
gbr = GradientBoostingRegressor(max_depth=1,n_estimators=100)
gbr.fit(data,np.ravel(target))
predictions = gbr.predict(data)
print(predictions) ## The original predictions
Output:
[8750.14958301 9199.23464805 8899.87846735 7600.73730159]
Perform the replacement:
fixed_predictions = np.array([z if y>z else y for y,z in zip(target,predictions)])
print(fixed_predictions)
[8750. 9199.23464805 8899.87846735 7600. ]
Compute the new score on the capped predictions:
score = mse(target, fixed_predictions)
print(score)
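The same capping can be written without the list comprehension; a small equivalent sketch using NumPy's element-wise minimum (same target and predictions arrays as above):
import numpy as np
from sklearn.metrics import mean_squared_error as mse

# cap each prediction at its own target value, element-wise
fixed_predictions = np.minimum(predictions, target)
score = mse(target, fixed_predictions)
print(score)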
I am working on the HR Attrition dataset from Kaggle (an in-class competition). It contains 1628 rows and 27 columns.
Most of the features are categorical. I am using Random Forest and validating with stratified K-fold (10 folds), and my validation AUC is pretty high, around 0.98-0.99. On submitting, I can't get an AUC of more than 0.85, which is a huge gap. I have tried many things, like PCA and feature selection, but my validation is not trustworthy and the submission score doesn't improve.
import numpy as np
import pandas as pd
from category_encoders import TargetEncoder  # assuming the category_encoders TargetEncoder
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

train_data = pd.read_csv('train.csv')
# (test_data is read the same way from the competition's test file)

# label encoding
lbl = LabelEncoder()
cat_feats = [f for f in train_data.columns if train_data[f].dtype == object]
for f in cat_feats:
    train_data[f] = lbl.fit_transform(train_data[f])
    test_data[f] = lbl.transform(test_data[f])

train_id = train_data.Id
train_data = train_data.drop(['Behaviour', 'Id'], axis=1)  # Behaviour has only 1 value

X = train_data.drop('Attrition', axis=1)
y = train_data.Attrition

skf = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)

numeric = ['Age', 'MonthlyIncome', 'EmployeeNumber']
categorical = [f for f in X.columns if f not in numeric]

# target encoding for categoricals, standard scaling for numerics
pre_pipe = make_column_transformer((TargetEncoder(), categorical),
                                   (StandardScaler(), numeric))
pipe_rf = make_pipeline(pre_pipe, RandomForestClassifier())

print('RF:', np.mean(cross_val_score(X=X, y=y, cv=skf, estimator=pipe_rf, scoring='accuracy')))
Using target encoding, my validation gives an average of 98% accuracy (the data is balanced, so with that accuracy the AUC is almost 1), but the submission score is at most 85%. What should I do?
This may be a naive point, because generally the cross-validation score shouldn't be that far from the test score, but I just want to make sure we are talking about the same metrics.
Your cross-validation call returns accuracy, while the competition is presumably scored on AUC (area under the curve). Accuracy can be 98% while the AUC is still only 85%.
If you want AUC in your cross-validation, update the last line to:
print('RF:', np.mean(cross_val_score(X=X, y=y, cv=skf, estimator=pipe_rf, scoring='roc_auc')))
I am trying to perform K-fold cross-validation and GridSearchCV to optimise my gradient boosting model, following this link:
https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the screenshot of the Model Report below:
1) How is the accuracy of 0.814365 calculated? Where in the script does it do a train/test split? If you change cv_folds=5 to any other integer, the accuracy is still 0.814365. In fact, removing cv_folds and passing performCV=False also gives the same accuracy.
(Note: my scikit-learn model with no CV and an 80/20 train/test split gives an accuracy of around 0.79-0.80.)
2) Again, how is the AUC Score (Train) calculated? And should this be ROC AUC rather than AUC? My scikit-learn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV score so much lower than the AUC (Train) score? It looks like they both use roc_auc (my scikit-learn model gives 0.77 for the ROC AUC).
import pandas as pd
import numpy as npy
import matplotlib.pyplot as plt
from sklearn import metrics, cross_validation  # cross_validation is the old (pre-0.20) sklearn module
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("123.csv")
target = 'APPROVED'  # item to predict
IDcol = 'ID'

def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the data
    alg.fit(ddf[predictors], ddf['APPROVED'])

    # Predict training set:
    ddf_predictions = alg.predict(ddf[predictors])
    ddf_predprob = alg.predict_proba(ddf[predictors])[:, 1]

    # Perform cross-validation:
    if performCV:
        cv_score = cross_validation.cross_val_score(alg, ddf[predictors], ddf['APPROVED'], cv=cv_folds, scoring='roc_auc')

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %f" % metrics.accuracy_score(ddf['APPROVED'].values, ddf_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(ddf['APPROVED'], ddf_predprob))
    if performCV:
        print("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g" % (npy.mean(cv_score), npy.std(cv_score), npy.min(cv_score), npy.max(cv_score)))

    # Print feature importance:
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')

# Choose all predictors except target & ID columns
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)
The main reason your cv_score appears low is that comparing it to the training accuracy isn't a fair comparison. Your training accuracy is calculated on the same data that was used to fit the model, whereas the cv_score is the average score over the held-out folds of your cross-validation. As you can imagine, a model performs better making predictions on data it has already been trained on than on new data it has never seen before.
Your accuracy_score and AUC calculations appear fixed because you are always feeding the same inputs (ddf["APPROVED"], ddf_predictions and ddf_predprob) into them. The performCV section doesn't actually transform any of those datasets, so if you use the same model, model parameters, and input data, you will get the same predictions and therefore the same metric values.
Based on your comments, there are a number of reasons the cv_score accuracy could be lower than the accuracy on your held-out test set. One of the main reasons is that your model gets to train on more data when you use the full training set than it does within each CV fold, where only a subset is available. This is especially true if your data set isn't very large; with a small data set, each additional training example matters more and can noticeably improve performance.
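As a small illustration of that gap, here is a hedged sketch on a synthetic dataset (not the asker's data) comparing the resubstitution AUC with the mean held-out CV AUC; the former is typically the larger of the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=10)
gbm = GradientBoostingClassifier(random_state=10)

# "AUC Score (Train)": scored on the same rows the model was fit on
gbm.fit(X, y)
train_auc = roc_auc_score(y, gbm.predict_proba(X)[:, 1])

# "CV Score": average AUC over held-out folds the model never saw during fitting
cv_auc = np.mean(cross_val_score(gbm, X, y, cv=5, scoring='roc_auc'))

print(train_auc, cv_auc)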