xgboost: Sample Weights for Imbalanced Data?

xgboost: Sample Weights for Imbalanced Data? - python

I have a highly unbalanced dataset of 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any changes in the modelling results? All of the metrics in the classification report (confusion matrix) are the same. Is there an issue with the implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=process_text)), # convert strings to integer counts
('tfidf', TfidfTransformer()), # convert integer counts to weighted TF-IDF scores
('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train))) # train on TF-IDF vectors w/ Naive Bayes classifier
])
Sample of Dataset:
data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
'class': ['government', 'military', 'other']})
Classification Report

First, most important: use a multiclass eval_metric. eval_metric=merror or mlogloss, then post us the results. You showed us ['precision','recall','f1-score','support'], but that's suboptimal, or outright broken unless you computed them in a multi-class-aware, imbalanced-aware way.
Second, you need weights. Your class ratio is military: government: other 1:30:18, or as percentages 2:61:37%.
You can manually set per-class weights with xgb.DMatrix..., weights)
Look inside your pipeline (use print or verbose settings, dump values), don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18 and try more extreme values. Reciprocals so the rarer class gets higher weight.
Also try setting min_child_weight much higher, so it requires a few exemplars (of the minority classes). Start with min_child_weight >= 2(* weight of rarest class) and try going higher. Beware of overfitting to the very rare minority class (this is why people use StratifiedKFold crossvalidation, for some protection, but your code isn't using CV).
We can't see your other parameters for xgboost classifier (how many estimators? early stopping on or off? what was learning_rate/eta? etc etc.). Seems like you used the defaults - they'll be terrible. Or else you're not showing your code. Distrust xgboost's defaults, esp. for multiclass, don't expect xgboost to give good out-of-the-box results. Read the doc and experiment with values.
Do all that experimentation, post your results, check before concluding "it doesn't work". Don't expect optimal results from out-of-the-box. Distrust or double-check the sklearn util functions, try manual alternatives. (Often, just because sklearn has a function to do something, doesn't mean it's good or best or suitable for all use-cases, like imbalanced multiclass)

Related

imbalabced data set score after smote

Is it correct to use 'accuracy' as a metric for an imbalanced data set after using oversampling methods such as SMOTE or we have to use other metrics such as AUROC or other presicion-recall related metrics?

You can use accuracy for the dataset after using SMOTE since now it shouldn't be imbalanced as far as I know. You should try the other metrics though for a more detailed evaluation (classification_report_imbalenced combines some metrics)

SMOTE and similar imbalance treatment techniques will be only be applied to you training data. When you have a largely imbalanced data set, say 99% against 1%, accuracy on the TEST set might still give you a value of 99% by always choosing the larger class.
Therefore, you should definitely switch to another metric.
Popular variants are the F1 score, but there is also a balanced version of the accuracy, see scikit-learn BA page.
As mentioned by #Nocry, applying several evaluation measures, might give you a better feeling. For example, check how accuracy (the regular variant) and balanced accuracy perform with and without using SMOTE, then you should see the difference.

Classification: Tweet Sentiment Analysis - Order of steps

I am currently working on a tweet sentiment analysis and have a few questions regarding the right order of the steps. Please assume that the data was already preprocessed and prepared accordingly. So this is how I would proceed:
use train_test_split (80:20 ratio) to withhold a test
data set.
vectorize x_train since the tweets are not numerical.
In the next steps, I would like to identify the best classifier. Please assume those were already imported. So I would go on by:
hyperparameterization (grid-search) including a cross-validation approach.
In this step, I would like to identify the best parameters of each
classifier. For KNN the code is as follows:
model = KNeighborsClassifier()
n_neighbors = range(1, 10, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
# define grid search
grid = dict(n_neighbors=n_neighbors, weights=weights ,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(train_tf, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
compare the accuracy (depending on the best hyperparameters) of the classifiers
choose the best classifier
take the withheld test data set (from train_test_split()) and use the best classifier on the test data
Is this the right approach or would you recommend changing something (e. g. doing the cross-validation alone and not within the hyperparametrization)? Does it make sense to test the test data as the final step or should I do it earlier to assess the accuracy for an unknown data set?

There are lots of ways to do this and people have strong opinions about it and I'm not always convinced they fully understand what they advocate.
TL;DR: Your methodology looks great and you're asking sensible questions.
Having said that, here are some things to consider:
Why are you doing train-test split validation?
Why are you doing hyperparameter tuning?
Why are you doing cross-validation?
Yes, each of these techniques are good at doing something specific; but that doesn't necessarily mean they should all be part of the same pipeline.
First off, let's answer these questions:
Train-Test Split is useful for testing your classifier's inference abilities. In other words, we want to know how well a classifier performs in general (not on the data we used for training). The test portion allows us to evaluate our classifier without using our training portion.
Hyperparameter-Tuning is useful for evaluating the effect of hyperparameters on the performance of a classifier. For it to be meaningful, we must compare two (or more) models (using different hyperparameters) but trained preferably using the same training portion (to eliminate selection bias). What do we do once we know the best performing hyperparameters? Will this set of hyperparameters always perform optimally? No. You will see that, due to the stochastic nature of classification, one hyperparameter set may work best in experiment A then another set of hyperparameters may work best on experiment B. Rather, hyperparameter tuning is good for generalizing about which hyperparameters to use when building a classifier.
Cross-validation is used to smooth out some of the stochastic randomness associated with building classifiers. So, a machine learning pipeline may produce a classifier that is 94% accurate using 1 test-fold and 83% accuracy using another test-fold. What does it mean? It might mean that 1 fold contains samples that are easy. Or it might mean that the classifier, for whatever reason, is actually better. You don't know because it's a black box.
Practically, how is this helpful?
I see little value in using test-train split and cross-validation. I use cross-validation and report accuracy as an average over the n-folds. It is already testing my classifier's performance. I don't see why dividing your training data further to do another round of train-test validation is going to help. Use the average. Having said that, I use the best performing model of the n-fold models created during cross-validation as my final model. As I said, it's black-box, so we can't know which model is best but, all else being equal, you may as well use the best performing one. It might actually be better.
Hyperparameter-tuning is useful but it can take forever to do extensive tuning. I suggest adding hyperparameter tuning to your pipeline but only test 2 sets of hyperparameters. So, keep all your hyperparameters constant except 1. e.g. Batch size = {64, 128}. Run that, and you'll be able to say with confidence, "Oh, that made a big difference: 64 works better than 128!" or "Well, that was a waste of time. It didn't make much difference either way." If the difference is small, ignore that hyperparameter and try another pair. This way, you'll slowly tack towards optimal without all the wasted time.
In practice, I'd say leave the extensive hyperparameter-tuning to academics and take a more pragmatic approach.
But yeah, you're methodology looks good as it is. I think you thinking about what you're doing and that already puts you a step ahead of the pack.

Unstable accuracy of Gaussian Mixture Model classifier from sklearn

I have some data (MFCC features for speaker recognition), from two different speakers. 60 vectors of 13 features for each person (in total 120). Each of them has their label (0 and 1). I need to show the results on confusion matrix. But GaussianMixture model from sklearn is unstable. For each program run i receive different scores (sometimes accuracy is 0.4, sometimes 0.7 ...). I don't know what I am doing wrong, because analogically i created SVM and k-NN models and they are working fine (stable accuracy around 0.9). Do you have any idea what am I doing wrong?
gmmclf = GaussianMixture(n_components=2, covariance_type='diag')
gmmclf.fit(X_train, y_train) #X_train are mfcc vectors, y_train are labels
ygmm_pred_class = gmmclf.predict(X_test)
print(accuracy_score(y_test, ygmm_pred_class))
print(confusion_matrix(y_test, ygmm_pred_class))

Short answer: you should simply not use a GMM for classification.
Long answer...
From the answer to a relevant thread, Multiclass classification using Gaussian Mixture Models with scikit learn (emphasis in the original):
Gaussian Mixture is not a classifier. It is a density estimation
method, and expecting that its components will magically align with
your classes is not a good idea. [...] GMM simply tries to fit mixture of Gaussians
into your data, but there is nothing forcing it to place them
according to the labeling (which is not even provided in the fit
call). From time to time this will work - but only for trivial
problems, where classes are so well separated that even Naive Bayes
would work, in general however it is simply invalid tool for the
problem.
And a comment by the respondent himself (again, emphasis in the original):
As stated in the answer - GMM is not a classifier, so asking if you
are using "GMM classifier" correctly is impossible to answer. Using
GMM as a classifier is incorrect by definition, there is no "valid"
way of using it in such a problem as it is not what this model is
designed to do. What you could do is to build a proper generative
model per class. In other words construct your own classifier where
you fit one GMM per label and then use assigned probability to do
actual classification. Then it is a proper classifier. See
github.com/scikit-learn/scikit-learn/pull/2468
(For what it may worth, you may want to notice that the respondent is a research scientist in DeepMind, and the very first person to be awarded the machine-learning gold badge here at SO)
To elaborate further (and that's why I didn't simply flag the question as a duplicate):
It is true that in the scikit-learn documentation there is a post titled GMM classification:
Demonstration of Gaussian mixture models for classification.
which I guess did not exist back in 2017, when the above response was written. But, digging into the provided code, you will realize that the GMM models are actually used there in the way proposed by lejlot above; there is no statement in the form of classifier.fit(X_train, y_train) - all usage is in the form of classifier.fit(X_train), i.e. without using the actual labels.
This is exactly what we would expect from a clustering-like algorithm (which is indeed what GMM is), and not from a classifier. It is true again that scikit-learn offers an option for providing also the labels in the GMM fit method:
fit (self, X, y=None)
which you have actually used here (and again, probably did not exist back in 2017, as the above response implies), but, given what we know about GMMs and their usage, it is not exactly clear what this parameter is there for (and, permit me to say, scikit-learn has its share on practices that may look sensible from a purely programming perspective, but which made very little sense from a modeling perspective).
A final word: although fixing the random seed (as suggested in a comment) may appear to "work", trusting a "classifier" that gives a range of accuracies between 0.4 and 0.7 depending on the random seed is arguably not a good idea...

In sklearn, the labels of clusters in gmm do not mean anything. So, each time you run a gmm, the labels may vary. It might be one reason the results are not robust.

When do feature selection in imblearn pipeline with cross-validation and grid search

Currently I am building a classifier with heavily imbalanced data. I am using the imblearn pipeline to first to StandardScaling, SMOTE, and then the classification with gridSearchCV. This ensures that the upsampling is done during the cross-validation. Now I want to include feature_selection into my pipeline. How should I include this step into the pipeline?
model = Pipeline([
('sampling', SMOTE()),
('classification', RandomForestClassifier())
])
param_grid = {
'classification__n_estimators': [10, 20, 50],
'classification__max_depth' : [2,3,5]
}
gridsearch_model = GridSearchCV(model, param_grid, cv = 4, scoring = make_scorer(recall_score))
gridsearch_model.fit(X_train, y_train)
predictions = gridsearch_model.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

It does not necessarily make sense to include feature selection in a pipeline where your model is a random forest(RF). This is because the max_depth and max_features arguments of the RF model essentially control the amounts of features included when building the individual trees (the max depth of n just says that each tree in your forest will be built for n nodes, each with a split consisting of a combination of max_features amount of features). Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
You can simply investigate your trained model for the top ranked features. When training an individual tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. So then you actually don't need to retrain the forest for different feature sets, because the feature importance (already computed in the sklearn model) tells you all the info you'd need.
P.S. I would not waste time grid searching n_estimators either, because more trees will result in better accuracy. More trees means more computational cost and after a certain number of trees, the improvement is too small, so maybe you have to worry about that, but otherwise you will gain performance from a large-ish number of n_estimator and you're not really in trouble of overfitting either.

do you mean feature selection form sklearn? https://scikit-learn.org/stable/modules/feature_selection.html
You can run it in the beginning. You will basically adjust your columns of X (X_train, and X_test accordingly). It is important that you fit your feature selection only with the training data (as your test data should be unseen at that point in time).
How should I include this step into the pipeline?
so you should run it before your code.

There is no "how" as if there is a concrete recipe, it depends on your goal.
If you want to check which set of features gives you the best performance (according to your metrics, here recall), you could use sklearn's sklearn.feature_selection.RFE (Recursive Feature Elimination) or it's cross validation variant sklearn.feature_selection.RFECV.
The first one fit's your model with whole set of features, measures their importance and prunes the least impactful ones. This operation continues until the desired number of features are left. It is quite computationally intensive though.
Second one starts with all features and removes step features every time trying out all possible combinations of learned models. This continues until min_features_to_select is hit. It is VERY computationally intensive, way more than the first one.
As this operation is rather infeasible to use in connection with hyperparameters search, you should do it with a fixed set of defaults before GridSearchCV or after you have found some suitable values with it. In the first case, features choice will not depend on the hyperparams you've found, while for the second case the influence might be quite high. Both ways are correct but would probably yield different results and models.
You can read more about RFECV and RFE in this StackOverflow answer.

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400.000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but extremely poorly when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test set at random and the data set i pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0,64
R^2 score when testing on testing set: -10^23 (approximatly)

While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree on his answer that neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need somehow to take care of your data, hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably most importantly than anything else, you should consider adapting your input data. The very first thing I'd try is, assuming you are really trying to predict a price as your code implies, to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very very hard to fit that single object that was sold for 1 Million $ ignoring everything below $1k. Just as important, check if you have any non-numeric (categorical) columns and keep them only if you need them, in case reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical data for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however if you're not familiar with those it is certainly not worth the trouble!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.