I have a dataset with 1400 observations and 19 columns. The target variable takes the values 1 (the class I am most interested in) and 0, and the class distribution is imbalanced (70:30).
With the code below I am getting suspicious results (all metrics equal to 1). I cannot tell whether this is caused by overfitting, by the class imbalance, or by the feature selection (I used Pearson correlation, since all features are numeric/boolean).
I suspect the steps I followed are wrong.
import numpy as np
import math
import pandas as pd
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

y = df['Label']
X = df.drop('Label', axis=1)
def create_cv(X, y):
    """Split X and y into train and test sets, preserving the class ratio."""
    if type(X) != np.ndarray:
        X = X.values
        y = y.values

    test_size = 1 / 5
    proportion_of_true = y[y == 1].shape[0] / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = math.floor(num_test_samples - num_test_true_labels)

    # note: the slices below take the first/last rows of each class as-is, without shuffling
    y_test = np.concatenate([y[y == 0][:num_test_false_labels], y[y == 1][:num_test_true_labels]])
    y_train = np.concatenate([y[y == 0][num_test_false_labels:], y[y == 1][num_test_true_labels:]])
    X_test = np.concatenate([X[y == 0][:num_test_false_labels], X[y == 1][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[y == 0][num_test_false_labels:], X[y == 1][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test
X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
precision recall f1-score support
0 1.00 1.00 1.00 24
1 1.00 1.00 1.00 70
accuracy 1.00 94
macro avg 1.00 1.00 1.00 94
weighted avg 1.00 1.00 1.00 94
Has anyone run into similar issues when building a classifier on imbalanced data, using CV and/or undersampling? I am happy to share the whole dataset in case you want to reproduce the output.
What I would like is a clear set of steps to follow that shows me what I am doing wrong.
I know that, to reduce overfitting and handle imbalanced data, there are methods such as random over/undersampling, SMOTE, and CV. My idea is to:
Split the data into train/test sets taking the imbalance into account
Perform CV on the train set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set (f1-score)
as also outlined in this question: CV and under sampling on a test fold.
I think the steps above make sense, but I am happy to receive any feedback you might have on this.
When you have imbalanced data you have to perform stratification. The usual approach is to oversample the class with fewer samples.
Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you first grab the samples from the less represented class and use the size of that set to compute how many samples to take from the other class:
This code may help you split your dataset that way:
import random
import pandas as pd

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets."""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)
    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))
    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]
    return train, test
def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset`, but with stratification."""
    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]
    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos) / len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg) / len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]
    # DataFrame.append was removed in recent pandas versions, so use pd.concat instead
    train = pd.concat([train_pos, train_neg]).sample(frac=1).reset_index(drop=True)
    test = pd.concat([test_pos, test_neg]).sample(frac=1).reset_index(drop=True)
    return train, test
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model on test_ds.
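As a small follow-up, here is a hedged sketch of how that cross-validation step might look with scikit-learn, assuming train_ds and target_attr from above and that the positive class is encoded as 1 (the choice of classifier is just for illustration):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# separate features and label in the stratified training set
X_tr = train_ds.drop(columns=[target_attr])
y_tr = train_ds[target_attr]

# 5-fold cross-validation on the training set only; test_ds stays untouched
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X_tr, y_tr, cv=5, scoring='f1')
print(scores.mean(), scores.std())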
There is another solution at the model level - using models that support sample weights, such as gradient boosted trees. Of those, CatBoost is usually the best, as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier

y = df['Label']
X = df.drop('Label', axis=1)

# weight applied to class 1: number of negatives divided by number of positives,
# so that the total weight of each class is equal
label_ratio = (y == 0).sum() / (y == 1).sum()

model = CatBoostClassifier(scale_pos_weight=label_ratio)
model.fit(X, y)
This works because CatBoost assigns each sample a weight, so you can set the class weighting in advance (scale_pos_weight).
This is better than downsampling, and is technically equivalent to oversampling (but requires less memory).
Also, a major part of treating imbalanced data is making sure your metrics are weighted as well, or at least well defined, since you may want equal (or deliberately skewed) performance across the classes.
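For example, a hedged sketch of class-aware averaging in scikit-learn, assuming predictions y_pred for a held-out y_test (hypothetical names):

from sklearn.metrics import f1_score

f1_macro = f1_score(y_test, y_pred, average='macro')        # every class counts equally
f1_weighted = f1_score(y_test, y_pred, average='weighted')  # classes weighted by their support
print(f1_macro, f1_weighted)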
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
your implementation of stratified train/test creation is not optimal, as it lacks randomness. Data very often comes in batches, so it is not good practice to take sequences of data as-is, without shuffling.
as #sturgemeister mentioned, a 3:7 class ratio is not critical, so you should not worry too much about class imbalance. When you artificially change the data balance in training, you will need to compensate for it, for some algorithms, by multiplying by the prior.
as for your "perfect" results: either your model is overtrained, or the model really does classify the data perfectly. Use a different train/test split to check this (see the quick sketch after these points).
another point: your test set is only 94 data points. That is definitely not 1/5 of 1400. Check your numbers.
to get realistic estimates you need lots of test data. This is the reason why you need to apply a cross-validation strategy.
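A quick, hedged way to run the different-split check mentioned above, assuming X and y from the question (the classifier settings mirror the original code):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# refit on several different random stratified splits; a genuinely perfect
# dataset will score ~1.0 every time, an overfit or leaky setup usually won't
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    print(seed, f1_score(y_te, clf.predict(X_te)))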
as for a general strategy for 5-fold CV, I suggest the following:
1. split your data into 5 folds with respect to the labels (this is called a stratified split; you can use the StratifiedShuffleSplit or StratifiedKFold functions)
2. take 4 splits and train your model. If you want to use under/oversampling, modify the data in those 4 training splits only.
3. apply the model to the remaining part. Do not under/oversample the data in the test part. This way you get a realistic performance estimate. Save the results.
4. repeat 2. and 3. for all test splits (5 times in total). Important: do not change the parameters of the model (e.g. tree depth) between folds - they should be the same for all splits.
5. now you have all your data points tested without having been trained on. This is the core idea of cross-validation. Concatenate all the saved results and estimate the performance; a sketch of this loop follows.
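A minimal, hedged sketch of that loop, assuming X and y are numpy arrays (e.g. the .values of the question's DataFrame); the undersampling step here is just a plain random downsample of the majority class in each training fold, for illustration:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_true_all, y_pred_all = [], []

for train_idx, test_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]   # the test fold is left untouched

    # undersample the majority class in the training fold only
    pos = np.where(y_tr == 1)[0]
    neg = np.where(y_tr == 0)[0]
    if len(pos) <= len(neg):
        keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    else:
        keep = np.concatenate([neg, rng.choice(pos, size=len(neg), replace=False)])

    clf = DecisionTreeClassifier(max_depth=5, random_state=0)  # same parameters for every fold
    clf.fit(X_tr[keep], y_tr[keep])
    y_true_all.append(y_te)
    y_pred_all.append(clf.predict(X_te))

# concatenate the out-of-fold predictions and estimate the performance once
print(classification_report(np.concatenate(y_true_all), np.concatenate(y_pred_all)))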
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data in a train/validation/test set, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You start by taking a test set out of your data, as your create_cv function does. Then you split the remaining training data into e.g. 3 splits, and for each i in {1, 2, 3} you train on the splits j != i and evaluate on split i. The documentation explains it with prettier, colorful figures; you should have a look! It can be quite cumbersome to implement yourself, but fortunately scikit-learn does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit-learn handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization, not too much (under-fitting) nor too little (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right hyper-parameter to consider for controlling the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratify=y)

tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(tree, parameters, cv=5)  # an integer cv uses a stratified k-fold for classifiers, see documentation
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())
You can also replace the cv argument with a fancier splitter, such as StratifiedGroupKFold (the same group never appears in both the training and test folds).
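For instance, a hedged sketch of passing an explicit splitter object instead of an integer, reusing tree and parameters from the snippet above (group-aware splitters additionally need a groups array, not shown here):

from sklearn.model_selection import GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = GridSearchCV(tree, parameters, cv=cv)
clf.fit(X_train, y_train)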
I would also advise looking at random forests, which are less interpretable but are said to perform better in practice.
Just wanted to add thresholding and cost-sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists in finding a new threshold for classifying positive vs. negative (it is generally 0.5, but it can be treated as a hyper-parameter). The latter consists in weighting the classes to cope with their imbalance. This article was really useful to me for understanding how to deal with imbalanced data sets. In it you can also find cost-sensitive learning, with a specific explanation using a decision tree as the model. All the other approaches are nicely reviewed as well, including Adaptive Synthetic Sampling, informed undersampling, etc.
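To make both ideas concrete, here is a hedged sketch assuming train and validation splits X_train, y_train, X_val, y_val (hypothetical names): cost-sensitive learning via the class_weight parameter that scikit-learn trees expose, and threshold tuning via predict_proba:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# cost-sensitive learning: weight classes inversely to their frequency
clf = DecisionTreeClassifier(max_depth=5, class_weight='balanced').fit(X_train, y_train)

# thresholding: treat the decision threshold as a hyper-parameter
probs = clf.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"best threshold {best_t:.2f} with F1 {max(scores):.3f}")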
I fit a random forest model to the data. I divided my dataset into training and testing sets in the ratio 70:30 and trained the model. I got an accuracy of 80% on the test data. Then I took a benchmark dataset and tested the model with it. That dataset only contains data with true labels (1). But when I get the predictions for the benchmark dataset from the model, all the positives are classified as negatives, yet the accuracy is 90%. Why is that? Is there a way to interpret this?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)

XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values

classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30,
                                    min_samples_split=2, min_samples_leaf=1, max_features='sqrt',
                                    class_weight='balanced', bootstrap=True, random_state=0,
                                    oob_score=True)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
y_pred_benchmark = classifier.predict(XBench_test)

print("Accuracy on test data: {:.4f}".format(classifier.score(X_test, y_test)))                 # This gives 80%
print("Accuracy on benchmark data: {:.4f}".format(classifier.score(XBench_test, YBench_test)))  # This gives 90%
I'll take a shot at providing a better way to interpret your results. When you have an imbalanced data set, accuracy is not a good way to measure performance.
Here is a common example:
Imagine you have a disease that is present in only 0.01% of people. If you predict that no one has the disease, you have an accuracy of 99.99%, but your model is not a good model.
In this example it appears your benchmark data set (commonly referred to as a test dataset) has imbalanced classes and you are getting an accuracy of 90% when you call the classifier.score method. In this case, accuracy is not a good way to interpret the model. You should instead look at other metrics.
Other common metrics to look at are precision and recall, to determine how your model is performing. In this case, since all the positives are predicted as negative, your precision AND your recall would be 0, meaning your model is not differentiating the classes at all.
Going further if you have imbalanced classes it may be better to check different thresholds of scores and look at metrics like ROC_AUC. These metrics look at the probability scores outputted by the model (predict_proba for sklearn) and test different thresholds. Perhaps your model works well at a lower threshold and the positive cases consistently score higher than the negative cases.
Here is an additional article about ROC_AUC.
scikit-learn has a few different metric scores you can use; they are located here.
Here is one way you could implement ROC AUC into your code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)

XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values

classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30,
                                    min_samples_split=2, min_samples_leaf=1, max_features='sqrt',
                                    class_weight='balanced', bootstrap=True, random_state=0,
                                    oob_score=True)
classifier.fit(X_train, y_train)

# use predict_proba to get probability scores for the positive class
y_scores = classifier.predict_proba(X_test)[:, 1]
y_scores_benchmark = classifier.predict_proba(XBench_test)[:, 1]

# instead of measuring accuracy, use ROC AUC (note: it needs both classes present
# in the true labels, so it cannot be computed on a single-class benchmark set)
print("ROC AUC on test data: {:.4f}".format(roc_auc_score(y_test, y_scores)))
print("ROC AUC on benchmark data: {:.4f}".format(roc_auc_score(YBench_test, y_scores_benchmark)))
Suppose I want to predict the percentage likelihood (1-100%) that a 3rd year student graduates college.
I have a training data set with 100 observations, all of which contain examples of students classified to be "Highly likely to Graduate".
I have another data set consisting of say 500 observations (where we don't know if any have graduated).
My question is: how would I go about getting a probability value for all 500 students that describes how likely they are to graduate, based on a number of features (anywhere between 1 and 5 features, such as grade scores, living on or off campus, etc.), using a model trained on the first dataset? What approaches would you suggest?
I would recommend you use OneClassSVM, which is an unsupervised outlier detector. Since your training data contains only samples from one class, i.e. "Highly likely to Graduate", training a logistic regression or a neural network may not work here. It is better to treat whatever data you have as inliers, and the other category (not likely to graduate) as outliers. Once you fit a OneClassSVM model you can use decision_function to get the signed distance to the separating hyperplane, which will be positive for an inlier and negative for an outlier. On top of that you can apply a sigmoid function to turn it into a probability. I have shown an example below:
import numpy as np
from sklearn.svm import OneClassSVM

X = [[0], [0.44], [0.45], [0.46], [1]]
clf = OneClassSVM(gamma='auto').fit(X)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

prob = clf.decision_function([[0.455]])  # not an outlier
sigmoid(prob)
# array([0.50027839])

prob = clf.decision_function([[5]])  # an outlier
sigmoid(prob)
# array([0.11356841])
I am building a churn prediction model with logistic regression in Python. My model accuracy is 0.47 and it only predicts 0s. The realized y variable is actually 81 zeros and 92 ones.
The data set I have has only a few features and 220 users (records). If I set a reference time it is even less (about 123 records for the training set and 173 for the testing set). So I think the sample size is too small to use logistic regression, but I still tried, because this is just a sample test and this small data set is all I got (in theory there is more data).
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy: {:.2f}'.format(logreg.score(x_test, y_test)))
Even if I don't test the model, meaning I use the whole data set to build it, when I predict future churn it still returns only 0s.
Is it that my sample size is too small, or is it because the accuracy is less than 0.5 that it returns only one value (0 here)? Or did I do something wrong in the code?
Thanks very much!
There are several potential causes for heavily biased predictions from a logistic regression model. For the benefit of a general audience I will list the most common ones, even though some of them don't apply to your case.
(Skewed output distribution) Your training data has a biased, imbalanced label distribution. If your training set contains, for example, 1 positive and 100000 negatives, the bias/intercept term of the regression will be very small, and after applying the link function the predictions can be practically zero.
(Sparsity) The feature space is large and your dataset is small, leading to sparse training data. Most new incoming data points therefore look unlike anything seen before. In the worst case, in which all features are categorical, unseen factor values result in zeros because the correct one-hot column cannot be identified.
(Skewed input distribution) The feature space is small and your dataset is dense around a small region. If that region happens to contain mostly zeros, the predictions will always be zero, even for future inputs. For example, say my data X has two columns, gender and age, most of my data points are 30-year-old males, and 80 out of the 100 30-year-old males in a 101-point dataset like ice cream. The model will predict that 30-year-old males like ice cream, and future inputs will mostly be 30-year-old males under a similar input distribution.
You should check the distribution of the scores using the predict_proba function, and the distribution of the input features using something like pairplot.
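A hedged sketch of both checks, assuming the logreg model and the x_train/x_test/y_train arrays from the question are numpy arrays, and that matplotlib and seaborn are available:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# 1) distribution of predicted scores: if every score sits below 0.5,
#    the model will never predict class 1
scores = logreg.predict_proba(x_test)[:, 1]
plt.hist(scores, bins=20)
plt.xlabel("predicted probability of churn")
plt.show()

# 2) pairwise feature distributions, colored by the label
df_plot = pd.DataFrame(x_train, columns=[f"f{i}" for i in range(x_train.shape[1])])
df_plot["label"] = y_train
sns.pairplot(df_plot, hue="label")
plt.show()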
I have a classification problem where I need to predict a class (0 or 1) from the data. Basically, I have a dataset with more than 300 features (including the target value for prediction) and more than 2000 rows (samples). I applied different classifiers as follows:
1. DecisionTreeClassifier()
2. RandomForestClassifier()
3. GradientBoostingClassifier()
4. KNeighborsClassifier()
Almost all the classifiers gave me similar results, around 0.50 AUC, except random forest, which gave around 0.28. I would like to know whether it is correct to invert the random forest result like this:
1 - 0.28 = 0.72
and report it as the AUC. Is that correct?
Your intuition is not wrong: if a binary classifier performs worse than random (i.e. AUC < 0.5), a valid strategy is to simply invert its predictions (i.e. report a 0 whenever the classifier predicts a 1, and vice versa); from the relevant Wikipedia entry (emphasis added):
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Nevertheless, the formally correct AUC for this inverted classifier, would be to first invert the individual probabilistic predictions prob of your model:
prob_invert = 1 - prob
and then calculate the AUC using these predictions prob_invert (arguably the process should give similar results to the naive approach you describe of simply subtracting the AUC from 1, but I'm not quite sure of the exact result - see also this Quora answer).
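A minimal sketch of that inversion, assuming a fitted classifier clf and a held-out X_test/y_test (hypothetical names):

from sklearn.metrics import roc_auc_score

prob = clf.predict_proba(X_test)[:, 1]      # predicted probability of class 1
print(roc_auc_score(y_test, prob))          # e.g. ~0.28, worse than random

prob_invert = 1 - prob                      # invert the probabilistic predictions
print(roc_auc_score(y_test, prob_invert))   # should land around 1 - AUC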
Needless to say, all this is based on the assumption that your whole process is correct, i.e. you don't have any modeling or coding errors (constructing a worse-than-random classifier is not exactly trivial).