I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).
Using the code below I am getting weird values (all 1s). I am not figuring out if this is due to a problem of overfitting/imbalance data or to feature selection (I used Pearson correlation since all values are numeric/boolean).
I am thinking that the steps followed are wrong.
import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import f1_score
y = df['Label']
X = df.drop('Label',axis=1)
def create_cv(X,y):
if type(X)!=np.ndarray:
X=X.values
y=y.values
test_size=1/5
proportion_of_true=y[y==1].shape[0]/y.shape[0]
num_test_samples=math.ceil(y.shape[0]*test_size)
num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])
X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
return X_train,X_test,y_train,y_test
X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
precision recall f1-score support
0 1.00 1.00 1.00 24
1 1.00 1.00 1.00 70
accuracy 1.00 94
macro avg 1.00 1.00 1.00 94
weighted avg 1.00 1.00 1.00 94
Has anyone experienced similar issues in building a classifier when data has imbalance, using CV and/or under sampling? Happy to share the whole dataset, in case you might want to replicate the output.
What I would like to ask you for some clear answer to follow that can show me the steps and what I am doing wrong.
I know that, to reduce overfitting and work with balance data, there are some methods such as random sampling (over/under), SMOTE, CV. My idea is
Split the data on train/test taking into account imbalance
Perform CV on trains set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set
(f1-score)
as also outlined in this question: CV and under sampling on a test fold .
I think the steps above should make sense, but happy to receive any feedback that you might have on this.
When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has less values.
Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you grab first the samples from the less represented class use the size of the set to compute how many samples to get from the other class:
This code may help you split your dataset that way:
def split_dataset(dataset: pd.DataFrame, train_share=0.8):
"""Splits the dataset into training and test sets"""
all_idx = range(len(dataset))
train_count = int(len(all_idx) * train_share)
train_idx = random.sample(all_idx, train_count)
test_idx = list(set(all_idx).difference(set(train_idx)))
train = dataset.iloc[train_idx]
test = dataset.iloc[test_idx]
return train, test
def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
"""Splits the dataset as in `split_dataset` but with stratification"""
data_pos = dataset[dataset[target_attr] == positive_class]
data_neg = dataset[dataset[target_attr] != positive_class]
if len(data_pos) < len(data_neg):
train_pos, test_pos = split_dataset(data_pos, train_share)
train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
# set.difference makes the test set larger
test_neg = test_neg.iloc[0:len(test_pos)]
else:
train_neg, test_neg = split_dataset(data_neg, train_share)
train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
# set.difference makes the test set larger
test_pos = test_pos.iloc[0:len(test_neg)]
return train_pos.append(train_neg).sample(frac = 1).reset_index(drop = True), \
test_pos.append(test_neg).sample(frac = 1).reset_index(drop = True)
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.
There is another solution that is in the model-level - using models that support weights of samples, such as Gradient Boosted Trees. Of those, CatBoost is usually the best as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier
y = df['Label']
X = df.drop('Label',axis=1)
label_ratio = (y==1).sum() / (y==0).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)
And so forth.
This works because Catboost treats each sample with a weight, so you can determine class weights in advance (scale_pos_weight).
This is better than downsampling, and is technically equal to oversampling (but requires less memory).
Also, a major part of treating imbalanced data, is making sure your metrics are weighted as well, or at least well-defined, as you might want equal performance (or skewed performance) on these metrics.
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
your implementation of stratified train/test creation is not optimal, as it lacks randomness. Very often data comes in batches, so it is not a good practice to take sequences of data as is, without shuffling.
as #sturgemeister mentioned, classes ratio 3:7 is not critical, so you should not worry too much of class imbalance. When you artificially change data balance in training you will need to compensate it by multiplication by prior for some algorithms.
as for your "perfect" results either your model overtrained or the model is indeed classifies the data perfectly. Use different train/test split to check this.
another point: your test set is only 94 data points. It is definitely not 1/5 of 1400. Check your numbers.
to get realistic estimates, you need lots of test data. This is the reason why you need to apply Cross Validation strategy.
as for general strategy for 5-fold CV I suggest following:
split your data to 5 folds with respect to labels (this is called stratified split and you can use StratifiedShuffleSplit function)
take 4 splits and train your model. If you want to use under/oversampling, modify the data in those 4 training splits.
apply the model to the remaining part. Do not under/over sample data in the test part. This way you get realistic performance estimate. Save the results.
repeat 2. and 3. for all test splits (totally 5 times obviously). Important: do not change parameters (e.g. tree depth) of the model when training - they should be the same for all splits.
now you have all your data points tested without being trained on them. This is the core idea of cross validation. Concatenate all the saved results, and estimate the performance .
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data in a train/validation/test set, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You will start by taking out a test set from your data, as your create_cv function does. Then, you split the rest of the training data in e.g. 3 splits. Then, you do, for i in {1, 2, 3}: train on data j != i, evaluate on data i. The documentation explains it with prettier and colorful figures, you should have a look! It can be quite cumbersome to implement, but hopefully scikit does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization, not too big (under-fitting) nor too small (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right metric to consider to estimate the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratified=True)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(svc, parameters, cv=5) # Specifying cv does StratifiedShuffleSplit, see documentation
clf.fit(iris.data, iris.target)
sorted(clf.cv_results_.keys())
You can also replace the cv variable by a fancier shuffler, such as StratifiedGroupKFold (no intersection between groups).
I would also advise looking towards random trees, which are less interpretable but said to have better performances in practice.
Just wanted to add thresholding and cost sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists in finding a new threshold for classifying positive vs negative classes (generally is 0.5 but it can be treated as an hyper parameter). The latter consists on weighting the classes to cope with their unbalancedness. This article was really useful to me to understand how to deal with unbalanced data sets. In it, you can find also cost sensitive learning with a specific explanation using decision tree as a model. Also all other approaches are really nicely reviewed including: Adaptive Synthetic Sampling, informed undersampling etc.
I'm training a binary classifier using python and the popular scikit-learn module's SVM class. After training I use the predict method to make a classification as laid out in sci-kit's SVC documentation.
I would like to know more about the significance of my sample features to the resulting classification made by the trained decision_function (support vectors). Any strategies for evaluating feature significance when making predictions with such a model are welcome.
Thanks!
Andre
So, how do we interpret feature significance for a given sample's classification?
I think using a linear kernel is the most straightforward way to first approach this because of the significance/relative simplicity of the svc.coef_ attribute of a trained model. check out Bitwise's answer.
Below I will train a linear kernel SVM using scikit training data. Then we will look at the coef_ attribute. I will include a simple plot showing how the dot product of the classifier's coefficients and training feature data divide the resulting classes.
from sklearn import svm
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
data = load_breast_cancer()
X = data.data # training features
y = data.target # training labels
lin_clf = svm.SVC(kernel='linear')
lin_clf.fit(X,y)
scores = np.dot(X, lin_clf.coef_.T)
b0 = y==0 # boolean or "mask" index arrays
b1 = y==1
malignant_scores = scores[b1]
benign_scores = scores[b1]
fig = plt.figure()
fig.suptitle("score breakdown by classification", fontsize=14, fontweight='bold')
score_box_plt = ply.boxplot(
[malignant_scores, benign_scores],
notch=True,
labels=list(data.target_names),
vert=False
)
plt.show(score_box_plt)
As you can see we do seem to have accessed the appropriate intercept and coefficient values. There is obvious separation of class scores with our decision boundary hovering around 0.
Now that we have a scoring system based on our linear coefficients we can easily investigate how each feature contributed to final classification. Here we display each features effect on the final score of that sample.
## sample we're using X[2] --> classified benign, lin_clf score~(-20)
lin_clf.predict(X[2].reshape(1,30))
contributions = np.multiply(X[2], lin_clf.coef_.reshape((30,)))
feature_number = np.arange(len(contributions)) +1
plt.bar(feature_number, contributions, align='center')
plt.xlabel('feature index')
plt.ylabel('score contribution')
plt.title('contribution to classification outcome by feature index')
plt.show(feature_contrib_bar)
We can also simply sort this same data to get a contribution-ranked list of features for a given classification to see which feature contributed the most to the score we are assessing the composition of.
abs_contributions = np.flip(np.sort(np.absolute(contributions)), axis=0)
feat_and_contrib = []
for contrib in abs_contributions:
if contrib not in contributions:
contrib = -contrib
feat = np.where(contributions == contrib)
feat_and_contrib.append((feat[0][0], contrib))
else:
feat = np.where(contributions == contrib)
feat_and_contrib.append((feat[0][0], contrib))
# sorted by max abs value. each row a tuple:;(feature index, contrib)
feat_and_contrib
From that ranked list we can see that the top five feature indices that contributed to the final score (of around -20 along with a classification 'benign') were [0, 22, 13, 2, 21] which correspond to the feature names in our data set; ['mean radius', 'worst perimeter', 'area error', 'mean perimeter', 'worst texture'].
Suppose You have Bag of word Featurization and you want to know which words are important
for classification then use this code for linear svm
weights = np.abs(lr_svm.coef_[0])
sorted_index = np.argsort(wt)[::-1]
top_10 = sorted_index[:10]
terms = text_vectorizer.get_feature_names()
for ind in top_10:
print(terms[ind])
You can use SelectFromModel in sklearn to get the names of the most relevant features of your model. Here is an example of extracting the features for LassoCV.
You can also check out this example which makes use of coef_ attribute in SVM to visualize the top most features.
Can I use sklearn's BaggingClassifier to produce continuous predictions? Is there a similar package? My understanding is that the bagging classifier predicts several classifications with different models, then reports the majority answer. It seems like this algorithm could be used to generate probability functions for each classification then reporting the mean value.
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict(X_test)
If you're interested in predicting probabilities for the classes in your classifier, you can use the predict_proba method, which gives you a probability for each class. It's a one-line change to your code:
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict_proba(X_test)
The shape of Y_pred will be [n_samples, n_classes].
If your Y_train values are continuous and you want to predict those continuous values (i.e., you're working on a regression problem), then you can use the BaggingRegressor instead.
I typically use BaggingRegressor() for continuous values, and then compare performance with RMSE. example below:
from sklearn.ensemble import BaggingReressor
trees = BaggingRegressor()
trees.fit(X_train,Y_train)
scores_RMSE = math.sqrt(metrics.mean_squared_error(Y_test, trees.predict(X_test))
I am doing a task of text classification(7000 texts evenly distributed by 10 labels). And by exploring SVM and and Logistic Regression
clf1 = svm.LinearSVC()
clf1.fit(X, y)
clf1.predict(X_test)
score1 = clf1.score(X_test,y_true)
clf2 = linear_model.LogisticRegression()
clf2.fit(X, y)
clf2.predict(X_test)
score2 = clf2.score(X_test,y_true)
I got two accuracies, score1 and score2 I guess whether I could improve my accuracy by developing an ensemble system combining the outputs of the two classifiers above.
I have learnt knowledge on ensemble by myself and I know there are bagging,boosting,and stacking.
However, I do not know how to use the scores predicted from my SVM and Logistic Regression in ensemble. Could anyone give me some ideas or show me some example code?
You can just multiply the probabilities, or use another combination rule.
In order to do that in a more generic way (try several rules)
you can use brew.
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = [clf1, clf2]
ens = Ensemble(classifiers=clfs)
# Since you have only 2 classifiers 'majority_vote' is note an option,
# rule = ['mean', 'majority_vote', 'max', 'min', 'median']
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
ensemble_clf.predict(X)
Also, keep in mind that the classifiers should be diverse enough to give a good combination result.
If you had fewer features, I'd say you should check out some Dynamic Classifier/Ensemble Selection (also provided in brew) but since you probably have many features, euclidean distance probably do not make sense to get the region of competence of each classifier. Best thing is to check out by hand which kind of labels each classifiers tends to get right based on the confusion matrix.
I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class, and "0.0" for all the rest.
I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?!
The classifier can mistake finance-investing or chords-strings, but the predict_proba() output shows no sign of hesitation...
A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction, without, for start, restricting the vocabulary with stop_words, or min/max_df --> I have been getting very large vectors.
- I've been training the classifier on an hierarchical category tree (shallow: not more than 3 layers deep) with 7 texts (manually categorized) per category. It is, for now, flat training: I am not taking the hierarchy into account.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?
Here's the code I'm using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.externals import joblib
Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl')
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 categories
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl')
...
#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba([Vec.toarray()[0]])[0]
MaxProb = max(Probs)
MaxProbIndex = np.where(Probs==MaxProb)[0][0]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)
Update:
Following the advice below, I tried MultinomialNB & LogisticRegression. They both return varying probabilities, and are better in any way for my task: much more accurate classification, smaller objects in memory & much better speed (MultinomialNB is lightning fast!).
I now have a new problem: the returned probabilities are very small - typically in the range 0.004-0.012. This is for the predicted/winning category (and the classification is is accurate).
"...the probability outputs from predict_proba are not to be taken too seriously"
I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression, which are much faster at predict time and also smaller. Their assumptions wrt. the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.