How to improve precision without lowering recall on an unbalanced dataset? - python

I have to use a Decision Tree for binary classification on an unbalanced dataset (50000 instances of class 0, 1000 of class 1). To get a good recall (0.92) I used RandomOverSampler from the imblearn module, plus pruning via the max_depth parameter.
The problem is that the precision is very low (0.44): I have too many false positives.
I tried to train a dedicated classifier for the borderline instances that generate the false positives.
First I split the dataset into train and test sets (80%-20%).
Then I split train into train2 and test2 sets (66%-33%).
I used a dtc (#1) to predict test2 and kept only the instances predicted as positive.
Then I trained a dtc (#2) on all of these instances, with the goal of building a classifier that can distinguish the borderline cases.
I used the dtc (#3) trained on the first oversampled train set to predict the official test set and got Recall=0.92 and Precision=0.44.
Finally I applied dtc (#2) only to the instances predicted as positive by dtc (#3), hoping to separate TP from FP, but it does not work very well: I got Rec=0.79 and Prec=0.69.
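# random_state must be an integer; the 80%-20% split corresponds to test_size=0.2
x_train, X_test, y_train, Y_test = train_test_split(df2.drop('k', axis=1), df2['k'], test_size=0.2, random_state=0)
x_res, y_res = ros.fit_resample(x_train, y_train)
# x_train.index holds index labels, so select rows with .loc rather than .iloc
df_to_trick = df2.loc[x_train.index, :]
# ... split df_to_trick 0.66-0.33 into train2/test2, trained and tested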
confusion_matrix(y_test, predicted1)  # dtc1 on test2
array([[13282,   266],
       [   18,   289]])
# dtc2 is trained only on these (266 + 289) instances
confusion_matrix(Y_test, predicted3)  # dtc3 on the official test set
array([[9950,  294],
       [  20,  232]])
confusion_matrix(true, predicted4)  # dtc2 applied to the (294 + 232) instances
array([[204,  90],
       [ 34, 198]])
I have to choose between dtc3 (Recall=0.92, Prec=0.44) and the whole convoluted process (Recall=0.79, Prec=0.69).
Do you have any ideas to improve these metrics? My goal is about (0.8/0.9).

Keep in mind that precision and recall depend on the threshold you choose (in sklearn the default threshold is 0.5, i.e. any instance with a predicted probability > 0.5 for the positive class is classified as positive), and that there is always a trade-off between precision and recall. ...
I think that in the case you are describing (trying to fine-tune your classifier given your model's performance limitations) you can choose a higher or lower cut-off threshold that gives a more favorable precision-recall trade-off ...
The code below can help you visualize how precision and recall change as you move the decision threshold:
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # Inputs e.g. as returned by sklearn.metrics.precision_recall_curve:
    # precisions and recalls have one more entry than thresholds, hence [:-1]
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')
    plt.show()
Other suggestions to improve your model's performance are to use an alternative pre-processing method (SMOTE instead of random oversampling) or to choose a more complex classifier (a random forest or another ensemble of trees, or a boosting approach such as AdaBoost or gradient boosting).
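As a rough sketch of the threshold idea, the snippet below picks the threshold with the highest precision among those that still reach a target recall. Everything specific in it is an assumption made for illustration: the synthetic data from make_classification, the tree settings, and the target_recall value of 0.8. In practice the threshold should be chosen on a separate validation set rather than on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 2% positives), standing in for the real dataset.
X, y = make_classification(n_samples=20000, n_features=20, n_informative=5,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

# Probability of the positive class instead of hard 0/1 predictions.
scores = clf.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, scores)

# precisions/recalls have one more entry than thresholds, so drop the last one.
target_recall = 0.8  # hypothetical requirement
meets_target = recalls[:-1] >= target_recall
best = np.argmax(np.where(meets_target, precisions[:-1], -1.0))
chosen_threshold = thresholds[best]

# Apply the chosen threshold instead of the default 0.5.
y_pred = (scores >= chosen_threshold).astype(int)
print(chosen_threshold, precisions[best], recalls[best])
```

The arrays precisions, recalls and thresholds can also be passed directly to plot_precision_recall_vs_threshold to inspect the whole curve.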

Related

RandomForestClassifier with n_estimators=1 and min_impurity_decrease=0 yields 100% train accuracy on complex dataset

I have a RandomForestClassifier in sklearn with the following parameters:
clf = RandomForestClassifier(n_estimators=1,
                             min_impurity_decrease=0,
                             max_features=3,
                             bootstrap=False,
                             random_state=j,
                             criterion='entropy',
                             warm_start=False)
Features and Labels:
My feature matrix X (pd.DataFrame) has shape (100,110), i.e. 110 features.
The labels are vectors of length 10, i.e. I am trying to predict 10 targets.
The feature matrix contains a feature x_i for i=1,...,10 from which I construct the labels:
y_i = np.sign(X['x_i'].diff(1).shift(-1))
The labels are in {-1,1} (no label value 0 possible) and they are the one-step-ahead change of the feature x_i for i=1,...,10. The label vector at each time t is then given as the set of all labels at time t:
y[t] = {y_i[t] | i=1,...,10}
While training the classifier, I noticed that the training accuracy is 100% for min_impurity_decrease=0, and the test accuracy is between 50-60%. This happens even in the extreme case of n_estimators=1, which is just a single tree.
When I increase min_impurity_decrease>0, the training accuracy decreases, while the test accuracy remains roughly the same.
It seems to me that there is some kind of leakage going on during the training, because of which the classifier achieves such high training accuracy and overfits.
Strangely, even if I entirely remove the features x_i from the feature matrix, and just use them to construct the labels, the high training accuracy persists. Finally, even if I only predict the label for a single target x_i, i.e. y_i, the training accuracy is still 100%. Also I checked my dataset several times, it should be ok. I am out of explanations.
Why is such a high accuracy achieved even with a single tree? It seems highly illogical to me.
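Set the max_depth argument to something other than None, which is the default. With max_depth=None the tree keeps splitting until every leaf is pure, which gives 100% training accuracy whenever no two identical feature rows carry different labels; that is the reason. It is plain overfitting rather than leakage. Use max_depth=2, 4, 8, etc.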
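A small sketch of this effect, under the assumption of purely random data with the same shape as in the question (100 rows, 110 features) and labels that have nothing to do with the features. A single unrestricted tree still memorizes the training set, while a depth-limited tree does not.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 110))    # random features, same shape as in the question
y = rng.choice([-1, 1], size=100)  # random labels, unrelated to X

# max_depth=None (the default): split until every leaf is pure.
full_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)
# Depth-limited tree: at most 4 leaves, so it cannot memorize 100 random labels.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

full_acc = full_tree.score(X, y)
shallow_acc = shallow_tree.score(X, y)
print(full_acc, shallow_acc)
```

Since there is no signal at all in this data, perfect training accuracy here can only come from memorization, which matches the roughly 50-60% test accuracy reported in the question.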

Why roc_auc produces weird results in sklearn?

I have a binary classification problem where I use the following code to get my weighted average precision, weighted average recall, weighted average f-measure and roc_auc.
df = pd.read_csv(input_path+input_file)
X = df[features]
y = df[["gold_standard"]]
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=k_fold, scoring = ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'))
print("accuracy")
print(np.mean(scores['test_accuracy'].tolist()))
print("precision_weighted")
print(np.mean(scores['test_precision_weighted'].tolist()))
print("recall_weighted")
print(np.mean(scores['test_recall_weighted'].tolist()))
print("f1_weighted")
print(np.mean(scores['test_f1_weighted'].tolist()))
print("roc_auc")
print(np.mean(scores['test_roc_auc'].tolist()))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806 0.6754 0.6806 0.6643 0.7233
So we can see that feature setting 1 gives better results than feature setting 2 for 'accuracy', 'precision_weighted', 'recall_weighted' and 'f1_weighted'.
However, for 'roc_auc' feature setting 2 is better than feature setting 1. I found this weird because every other metric was better with feature setting 1.
On the one hand, I suspect this happens because I am using weighted scores for precision, recall and f-measure but not for roc_auc. Is it possible to compute a weighted roc_auc for binary classification in sklearn?
What is the real reason for these weird roc_auc results?
It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
Probabilistic classifiers (like RF here) produce probability outputs p in [0, 1].
To get hard class predictions (0/1), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5 then class=1, else class=0.
Metrics like accuracy, precision, recall, and f1-score are calculated over the hard class predictions 0/1, i.e. after the threshold has been applied.
In contrast, AUC measures the performance of a binary classifier averaged over the range of all possible thresholds, and not for a particular threshold.
So it can certainly happen that one feature setting scores better on the thresholded metrics while the other scores better on AUC, and this can indeed lead to confusion among new practitioners.
The second part of my answer in this similar question might be helpful for more details. Quoting:
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.
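A toy illustration of the point above, using made-up scores rather than anything from the question: scores_a gives higher accuracy at the 0.5 threshold, while scores_b ranks every positive above every negative and therefore has the higher AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores_a = np.array([0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.90, 0.05])
scores_b = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60])

def hard_predictions(scores, threshold=0.5):
    # Apply a decision threshold to turn probabilities into 0/1 labels.
    return (scores > threshold).astype(int)

acc_a = accuracy_score(y_true, hard_predictions(scores_a))  # 7 of 8 correct
acc_b = accuracy_score(y_true, hard_predictions(scores_b))  # 5 of 8 correct
auc_a = roc_auc_score(y_true, scores_a)  # one positive ranked below all negatives
auc_b = roc_auc_score(y_true, scores_b)  # perfect ranking

print(acc_a, acc_b)
print(auc_a, auc_b)
```

Model B looks worse at the default threshold only because most of its positive scores fall below 0.5; with a lower threshold it would separate the classes perfectly.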

Why all the true positives are classified as true negatives in the machine learning model?

I fit a random forest model to the data. I divided my dataset into training and testing sets in a 70:30 ratio and trained the model. I got an accuracy of 80% on the test data. Then I took a benchmark dataset and tested the model on it. That dataset only contains data with true labels (1). But when I get the predictions for the benchmark dataset from the model, all the true positives are classified as true negatives. Accuracy is 90%. Why is that? Is there a way to interpret this?
X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
# shuffle, bootstrap and oob_score expect booleans, not the string 'true'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values
classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', class_weight='balanced', bootstrap=True, random_state=0, oob_score=True)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred_benchmark = classifier.predict(XBench_test)
print("Accuracy on test data: {:.4f}".format(classifier.score(X_test, y_test)))  # This gives 80%
print("Accuracy on benchmark data: {:.4f}".format(classifier.score(XBench_test, YBench_test)))  # This gives 90%
I'll take a shot at providing a better way to interpret your results. When you have an imbalanced dataset, accuracy is not a good way to measure performance.
Here is a common example:
Imagine a disease that is present in only 0.01% of people. If you predict that no one has the disease you get an accuracy of 99.99%, but your model is not a good model.
In this example it appears that your benchmark dataset (commonly referred to as a test dataset) has imbalanced classes and you are getting an accuracy of 90% when you call the classifier.score method. In this case, accuracy is not a good way to interpret the model. You should instead look at other metrics.
Other common metrics are precision and recall. In this case, since all true positives are predicted as negative, your recall would be 0 and your precision would be 0 as well (or undefined if the model predicts no positives at all), meaning your model is not differentiating the classes very well.
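The disease example can be reproduced in a few lines. The numbers below are illustrative assumptions (10,000 people, exactly one of them sick), and the "model" simply predicts that nobody has the disease.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 10,000 people, one positive case: a prevalence of 0.01%.
y_true = np.zeros(10_000, dtype=int)
y_true[0] = 1

# A "model" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# No positive predictions at all, so precision is 0/0; zero_division=0
# makes scikit-learn report 0 instead of emitting a warning.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)

print(accuracy, precision, recall)
```

Accuracy comes out at 99.99% even though the single positive case is missed, which is why precision and recall are the more informative numbers here.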
Going further, if you have imbalanced classes it may be better to check different score thresholds and to look at metrics like ROC AUC. These metrics use the probability scores output by the model (predict_proba in sklearn) and evaluate them across different thresholds. Perhaps your model works well at a lower threshold and the positive cases consistently score higher than the negative cases.
Here is an additional article about ROC_AUC.
scikit-learn has a few different metric scores you can use; they are located here.
Here is one way you could add ROC AUC to your code.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values
classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', class_weight='balanced', bootstrap=True, random_state=0, oob_score=True)
classifier.fit(X_train, y_train)
# use predict_proba and keep the probability of the positive class (column 1)
y_pred = classifier.predict_proba(X_test)[:, 1]
y_pred_benchmark = classifier.predict_proba(XBench_test)[:, 1]
# instead of measuring accuracy use ROC AUC: roc_auc_score(true labels, scores)
print("ROC AUC on test data: {:.4f}".format(roc_auc_score(y_test, y_pred)))
# note: roc_auc_score raises a ValueError if the true labels contain only one class,
# so the benchmark set must include both positives and negatives
print("ROC AUC on benchmark data: {:.4f}".format(roc_auc_score(YBench_test, y_pred_benchmark)))

Inverse ROC-AUC value?

I have a classification problem where I need to predict a class in (0,1) for given data. Basically I have a dataset with more than 300 features (including a target value for prediction) and more than 2000 rows (samples). I applied different classifiers as follows:
1. DecisionTreeClassifier()
2. RandomForestClassifier()
3. GradientBoostingClassifier()
4. KNeighborsClassifier()
Almost all the classifiers gave me similar results, around 0.50 AUC, except Random Forest, which gave around 0.28. I would like to know whether it is correct to invert the Random Forest result like this:
1-0.28= 0.72
and report it as the AUC. Is that correct?
Your intuition is not wrong: if a binary classifier indeed performs worse than random (i.e. AUC < 0.5), a valid strategy is to simply invert its predictions, i.e. report a 0 whenever the classifier predicts a 1, and vice versa; from the relevant Wikipedia entry (emphasis added):
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Nevertheless, the formally correct way to obtain the AUC of this inverted classifier is to first invert the individual probabilistic predictions prob of your model:
prob_invert = 1 - prob
and then calculate the AUC using these predictions prob_invert (this gives the same number as the naive approach you describe of simply subtracting the AUC from 1, because inverting the scores reverses the ordering of every positive-negative pair; see also this Quora answer).
Needless to say, all this is based on the assumption that your whole process is correct, i.e. you don't have any modeling or coding errors (constructing a worse-than-random classifier is not exactly trivial).
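A quick numerical check of the inversion, using synthetic scores that are deliberately worse than random (an assumption made only for this illustration): computing the AUC on 1 - prob gives the same value as 1 minus the original AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)

# Deliberately bad scores: positives tend to receive LOWER probabilities.
noise = rng.normal(scale=0.2, size=500)
prob = np.round(np.clip(0.5 - 0.2 * (y_true - 0.5) + noise, 0.0, 1.0), 3)

auc = roc_auc_score(y_true, prob)
prob_invert = 1 - prob
auc_inverted = roc_auc_score(y_true, prob_invert)

print(auc, auc_inverted, 1 - auc)
```

An AUC far below 0.5 on real data is still a strong hint that something in the pipeline (for example the label encoding or the probability column being used) is flipped, so it is worth checking that before reporting the inverted value.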

Using XGboost_Regressor in Python results in very good training performance but poor in prediction

I have been trying to use XGBRegressor in Python. It is by far one of the best ML techniques I have used. However, on some datasets I get a very high training R-squared but really poor performance in prediction or testing. I have tried playing with gamma, depth, and subsampling to reduce the complexity of the model and to make sure it is not overfitted, but there is still a huge difference between training and testing. I was wondering if someone could help me with this:
Below is the code I am using:
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)
scaler = MinMaxScaler()
scaler.fit(X_train)  # note: the scaler is fitted but never applied in the code shown
xgb = xgboost.XGBRegressor(colsample_bytree=0.7,
                           gamma=0,
                           learning_rate=0.01,
                           max_depth=1,
                           min_child_weight=1.5,
                           n_estimators=100000,
                           reg_alpha=0.75,
                           reg_lambda=0.45,
                           subsample=0.8,
                           seed=1000)
Here is the performance in training vs testing:
Training :
MAE: 0.10 R^2: 0.99
Testing:
MAE: 1.47 R^2: -0.89
XGBoost tends to overfit the data, so reduce n_estimators and max_depth, and use the iteration at which the training loss and the validation loss are still close to each other.
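One way to find such an iteration is sketched below. It uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBRegressor (a swapped-in model, chosen so the example has no extra dependency), synthetic data from make_regression, and hypothetical hyperparameter values. The validation error is computed after every boosting stage and the stage with the lowest error is kept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Small synthetic regression problem, standing in for the real data.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

n_estimators = 300
model = GradientBoostingRegressor(n_estimators=n_estimators, learning_rate=0.1,
                                  max_depth=2, subsample=0.8, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees.
train_mae = [mean_absolute_error(y_train, p) for p in model.staged_predict(X_train)]
val_mae = [mean_absolute_error(y_val, p) for p in model.staged_predict(X_val)]

# Number of trees at which the validation error is lowest.
best_n = int(np.argmin(val_mae)) + 1
print(best_n, train_mae[best_n - 1], val_mae[best_n - 1])
```

The training error keeps falling as trees are added, so only the validation curve says when to stop; refitting with n_estimators=best_n gives the smaller model.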
The issue here is overfitting. You need to tune some of the parameters (Source).
n_estimators: 80-200 if the dataset is large (of the order of 100,000 rows), 800-1200 if it is medium to small
learning_rate: between 0.01 and 0.1
subsample: between 0.8 and 1
colsample_bytree: fraction of columns used by each tree. Values from 0.3 to 0.8 if you have many features or columns, or 0.8 to 1 if you have only a few.
gamma: either 0, 1 or 5
Since you have already set max_depth very low, you can try tuning the parameters above. Also, if your dataset is very small, a gap between training and test performance is expected. Check whether the split between training and test data is a good one, for example whether the test data has an almost equal percentage of Yes and No in the output column.
You need to try various options. XGBoost and random forest will certainly give an overfitted model on small data. You can try:
1. Naive Bayes. It is good for small datasets, but it gives the same weight to every feature.
2. Logistic Regression: try tuning the regularisation parameter and see where your recall score peaks. Another thing to try here is class_weight='balanced'.
3. Logistic Regression with cross-validation: this is good for small data as well.
As I said earlier, also check your data and make sure it is not biased towards one kind of result. For example, if the result is yes in 50 cases out of 70, it is highly biased and you may not get high accuracy.
