I am doing a task of text classification(7000 texts evenly distributed by 10 labels). And by exploring SVM and and Logistic Regression
clf1 = svm.LinearSVC()
clf1.fit(X, y)
clf1.predict(X_test)
score1 = clf1.score(X_test,y_true)
clf2 = linear_model.LogisticRegression()
clf2.fit(X, y)
clf2.predict(X_test)
score2 = clf2.score(X_test,y_true)
I got two accuracies, score1 and score2 I guess whether I could improve my accuracy by developing an ensemble system combining the outputs of the two classifiers above.
I have learnt knowledge on ensemble by myself and I know there are bagging,boosting,and stacking.
However, I do not know how to use the scores predicted from my SVM and Logistic Regression in ensemble. Could anyone give me some ideas or show me some example code?
You can just multiply the probabilities, or use another combination rule.
In order to do that in a more generic way (try several rules)
you can use brew.
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = [clf1, clf2]
ens = Ensemble(classifiers=clfs)
# Since you have only 2 classifiers 'majority_vote' is note an option,
# rule = ['mean', 'majority_vote', 'max', 'min', 'median']
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
ensemble_clf.predict(X)
Also, keep in mind that the classifiers should be diverse enough to give a good combination result.
If you had fewer features, I'd say you should check out some Dynamic Classifier/Ensemble Selection (also provided in brew) but since you probably have many features, euclidean distance probably do not make sense to get the region of competence of each classifier. Best thing is to check out by hand which kind of labels each classifiers tends to get right based on the confusion matrix.
Related
I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).
Using the code below I am getting weird values (all 1s). I am not figuring out if this is due to a problem of overfitting/imbalance data or to feature selection (I used Pearson correlation since all values are numeric/boolean).
I am thinking that the steps followed are wrong.
import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import f1_score
y = df['Label']
X = df.drop('Label',axis=1)
def create_cv(X,y):
if type(X)!=np.ndarray:
X=X.values
y=y.values
test_size=1/5
proportion_of_true=y[y==1].shape[0]/y.shape[0]
num_test_samples=math.ceil(y.shape[0]*test_size)
num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])
X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
return X_train,X_test,y_train,y_test
X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
precision recall f1-score support
0 1.00 1.00 1.00 24
1 1.00 1.00 1.00 70
accuracy 1.00 94
macro avg 1.00 1.00 1.00 94
weighted avg 1.00 1.00 1.00 94
Has anyone experienced similar issues in building a classifier when data has imbalance, using CV and/or under sampling? Happy to share the whole dataset, in case you might want to replicate the output.
What I would like to ask you for some clear answer to follow that can show me the steps and what I am doing wrong.
I know that, to reduce overfitting and work with balance data, there are some methods such as random sampling (over/under), SMOTE, CV. My idea is
Split the data on train/test taking into account imbalance
Perform CV on trains set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set
(f1-score)
as also outlined in this question: CV and under sampling on a test fold .
I think the steps above should make sense, but happy to receive any feedback that you might have on this.
When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has less values.
Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you grab first the samples from the less represented class use the size of the set to compute how many samples to get from the other class:
This code may help you split your dataset that way:
def split_dataset(dataset: pd.DataFrame, train_share=0.8):
"""Splits the dataset into training and test sets"""
all_idx = range(len(dataset))
train_count = int(len(all_idx) * train_share)
train_idx = random.sample(all_idx, train_count)
test_idx = list(set(all_idx).difference(set(train_idx)))
train = dataset.iloc[train_idx]
test = dataset.iloc[test_idx]
return train, test
def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
"""Splits the dataset as in `split_dataset` but with stratification"""
data_pos = dataset[dataset[target_attr] == positive_class]
data_neg = dataset[dataset[target_attr] != positive_class]
if len(data_pos) < len(data_neg):
train_pos, test_pos = split_dataset(data_pos, train_share)
train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
# set.difference makes the test set larger
test_neg = test_neg.iloc[0:len(test_pos)]
else:
train_neg, test_neg = split_dataset(data_neg, train_share)
train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
# set.difference makes the test set larger
test_pos = test_pos.iloc[0:len(test_neg)]
return train_pos.append(train_neg).sample(frac = 1).reset_index(drop = True), \
test_pos.append(test_neg).sample(frac = 1).reset_index(drop = True)
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.
There is another solution that is in the model-level - using models that support weights of samples, such as Gradient Boosted Trees. Of those, CatBoost is usually the best as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier
y = df['Label']
X = df.drop('Label',axis=1)
label_ratio = (y==1).sum() / (y==0).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)
And so forth.
This works because Catboost treats each sample with a weight, so you can determine class weights in advance (scale_pos_weight).
This is better than downsampling, and is technically equal to oversampling (but requires less memory).
Also, a major part of treating imbalanced data, is making sure your metrics are weighted as well, or at least well-defined, as you might want equal performance (or skewed performance) on these metrics.
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
your implementation of stratified train/test creation is not optimal, as it lacks randomness. Very often data comes in batches, so it is not a good practice to take sequences of data as is, without shuffling.
as #sturgemeister mentioned, classes ratio 3:7 is not critical, so you should not worry too much of class imbalance. When you artificially change data balance in training you will need to compensate it by multiplication by prior for some algorithms.
as for your "perfect" results either your model overtrained or the model is indeed classifies the data perfectly. Use different train/test split to check this.
another point: your test set is only 94 data points. It is definitely not 1/5 of 1400. Check your numbers.
to get realistic estimates, you need lots of test data. This is the reason why you need to apply Cross Validation strategy.
as for general strategy for 5-fold CV I suggest following:
split your data to 5 folds with respect to labels (this is called stratified split and you can use StratifiedShuffleSplit function)
take 4 splits and train your model. If you want to use under/oversampling, modify the data in those 4 training splits.
apply the model to the remaining part. Do not under/over sample data in the test part. This way you get realistic performance estimate. Save the results.
repeat 2. and 3. for all test splits (totally 5 times obviously). Important: do not change parameters (e.g. tree depth) of the model when training - they should be the same for all splits.
now you have all your data points tested without being trained on them. This is the core idea of cross validation. Concatenate all the saved results, and estimate the performance .
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data in a train/validation/test set, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You will start by taking out a test set from your data, as your create_cv function does. Then, you split the rest of the training data in e.g. 3 splits. Then, you do, for i in {1, 2, 3}: train on data j != i, evaluate on data i. The documentation explains it with prettier and colorful figures, you should have a look! It can be quite cumbersome to implement, but hopefully scikit does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization, not too big (under-fitting) nor too small (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right metric to consider to estimate the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratified=True)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(svc, parameters, cv=5) # Specifying cv does StratifiedShuffleSplit, see documentation
clf.fit(iris.data, iris.target)
sorted(clf.cv_results_.keys())
You can also replace the cv variable by a fancier shuffler, such as StratifiedGroupKFold (no intersection between groups).
I would also advise looking towards random trees, which are less interpretable but said to have better performances in practice.
Just wanted to add thresholding and cost sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists in finding a new threshold for classifying positive vs negative classes (generally is 0.5 but it can be treated as an hyper parameter). The latter consists on weighting the classes to cope with their unbalancedness. This article was really useful to me to understand how to deal with unbalanced data sets. In it, you can find also cost sensitive learning with a specific explanation using decision tree as a model. Also all other approaches are really nicely reviewed including: Adaptive Synthetic Sampling, informed undersampling etc.
As mentioned in the title, Im using SelectFromModel from sklearn to select features for both my random forest and gradient boosting classification models.
#feature selection performed on training dataset to prevent overfitting
sel = SelectFromModel(GradientBoostingClassifier(n_estimators=10, learning_rate=0.25,max_depth=1, max_features = 15, random_state=0).fit(X_train_bin, y_train))
sel.fit(X_train_bin, y_train)
#returns a boolean array to indicate which features are of importance (above the mean threshold)
sel.get_support()
#shows the names of the selected features
selected_feat= X_train_bin.columns[(sel.get_support())]
selected_feat
The boolean array that is returned for random forest and gradient boosting model are COMPLETELY different. random forest feature selection tells me to drop an additional 4 columns (out of 25 features) and the feature selection on the gradient boosting model is telling me to drop nearly everything. What is happening here?
EDIT: I'm trying to compare the performance of these 2 models on my dataset. Should I move the threshold so I at least have approximately the same amount of features to train on?
There's no reason for them to select the same variables. GradientBoostingClassifier builds each tree to improve on the error of the previous step, while RandomForestClassifier trains independent trees that have nothing to do with each others' errors.
Another reason why they may select different features is criterion, which is entropy for Random Forests and Friedman MSE for Gradient Boosting. Finally, it could be because both algorithms select random subsets of features when making each split. Hence, they did not compare the same variables in the same order, which will naturally yield different importances.
I am trying to use sklearn.mixture.GaussianMixture for classification of pixels in an hyper-spectral image. There are 15 classes (1-15). I tried using the method http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html. In here the mean is initialize with means_init,I also tried this but my accuracy is poor (about 10%). I also tried to change type of covariance, threshold, maximum iterations and number of initialization but the results are same.
Am I doing correct? Please provide inputs.
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.io as sio
from sklearn.model_selection import train_test_split
uh_data =sio.loadmat('/Net/hico/data/users/nikhil/contest_uh_casi.mat')
data = uh_data['contest_uh_casi']
uh_labels = sio.loadmat('/Net/hico/data/users/nikhil/contest_gt_tr.mat')
labels = uh_labels['contest_gt_tr']
reshaped_data = np.reshape(data,(data.shape[0]*data.shape[1],data.shape[2]))
print 'reshaped data :',reshaped_data.shape
reshaped_label = np.reshape(labels,(labels.shape[0]*labels.shape[1],-1))
print 'reshaped label :',reshaped_label.shape
con_data = np.hstack((reshaped_data,reshaped_label))
pre_data = con_data[con_data[:,144] > 0]
total_data = pre_data[:,0:144]
total_label = pre_data[:,144]
train_data, test_data, train_label, test_label = train_test_split(total_data, total_label, test_size=0.30, random_state=42)
classifier = GaussianMixture(n_components = 15 ,covariance_type='diag',max_iter=100,random_state = 42,tol=0.1,n_init = 1)
classifier.means_init = np.array([train_data[train_label == i].mean(axis=0)
for i in range(1,16)])
classifier.fit(train_data)
pred_lab_train = classifier.predict(train_data)
train_accuracy = np.mean(pred_lab_train.ravel() == train_label.ravel())*100
print 'train accuracy:',train_accuracy
pred_lab_test = classifier.predict(test_data)
test_accuracy = np.mean(pred_lab_test.ravel()==test_label.ravel())*100
print 'test accuracy:',test_accuracy
My data has 66485 pixels and 144 features each. I also tried to do after applying some feature reduction techniques like PCA, LDA, KPCA etc, but the results are still the same.
Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. You should try out actual supervised techniques, since you clearly do have access to labels. Scikit-learn offers lots of these, including Random Forest, KNN, SVM, ... pick your favourite. GMM simply tries to fit mixture of Gaussians into your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work - but only for trivial problems, where classes are so well separated that even Naive Bayes would work, in general however it is simply invalid tool for the problem.
GMM is not a classifier, but generative model. You can use it to a classification problem by applying Bayes theorem. It's not true that classification based on GMM works only for trivial problems. However it's based on mixture of Gauss components, so fits the best problems with high level features.
Your code incorrectly use GMM as classifier. You should use GMM as a posterior distribution, one GMM per each class.
Can I use sklearn's BaggingClassifier to produce continuous predictions? Is there a similar package? My understanding is that the bagging classifier predicts several classifications with different models, then reports the majority answer. It seems like this algorithm could be used to generate probability functions for each classification then reporting the mean value.
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict(X_test)
If you're interested in predicting probabilities for the classes in your classifier, you can use the predict_proba method, which gives you a probability for each class. It's a one-line change to your code:
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict_proba(X_test)
The shape of Y_pred will be [n_samples, n_classes].
If your Y_train values are continuous and you want to predict those continuous values (i.e., you're working on a regression problem), then you can use the BaggingRegressor instead.
I typically use BaggingRegressor() for continuous values, and then compare performance with RMSE. example below:
from sklearn.ensemble import BaggingReressor
trees = BaggingRegressor()
trees.fit(X_train,Y_train)
scores_RMSE = math.sqrt(metrics.mean_squared_error(Y_test, trees.predict(X_test))
I'm trying to predict a binary variable with both random forests and logistic regression. I've got heavily unbalanced classes (approx 1.5% of Y=1).
The default feature importance techniques in random forests are based on classification accuracy (error rate) - which has been shown to be a bad measure for unbalanced classes (see here and here).
The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking the Gini VIM of a predictor of interest is the sum over the forest of the decreases of Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees.
My question is : is that kind of method implemented in scikit-learn (like it is in the R package party) ? Or maybe a workaround ?
PS : This question is kind of linked with an other.
scoring is just a performance evaluation tool used in test sample, and it does not enter into the internal DecisionTreeClassifier algo at each split node. You can only specify the criterion (kind of internal loss function at each split node) to be either gini or information entropy for the tree algo.
scoring can be used in a cross-validation context where the goal is to tune some hyperparameters (like max_depth). In your case, you can use a GridSearchCV to tune some of your hyperparameters using the scoring function roc_auc.
After doing some researchs, this is what I came out with :
from sklearn.cross_validation import ShuffleSplit
from collections import defaultdict
names = db_train.iloc[:,1:].columns.tolist()
# -- Gridsearched parameters
model_rf = RandomForestClassifier(n_estimators=500,
class_weight="auto",
criterion='gini',
bootstrap=True,
max_features=10,
min_samples_split=1,
min_samples_leaf=6,
max_depth=3,
n_jobs=-1)
scores = defaultdict(list)
# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)
acc = roc_auc_score(Y_test, rf.predict(X_test))
for i in range(X_train.shape[1]):
X_t = X_test.copy()
np.random.shuffle(X_t[:, i])
shuff_acc = roc_auc_score(Y_test, rf.predict(X_t))
scores[names[i]].append((acc-shuff_acc)/acc)
print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
feat, score in scores.items()], reverse=True))
Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]
The output is not very sexy, but you got the idea. The weakness of this approach is that feature importance seems to be very parameters dependent. I ran it using differents params (max_depth, max_features..) and I'm getting a lot different results. So I decided to run a gridsearch on parameters (scoring = 'roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.
I took my inspiration from this (great) notebook.
All suggestions/comments are most welcome !