Python machine learning, feature selection

I am working on a classification task related to written text and I wonder how important it is to perform some kind of "feature selection" procedure in order to improve the classification results.
I am using a number of features (around 40) related to the subject, but I am not sure whether all of them are really relevant, or in which combinations. I am experimenting with SVM (scikit-learn) and LDAC (mlpy).
If I have a mix of relevant and irrelevant features, I assume I will get poor classification results. Should I perform a "feature selection procedure" before classification?
scikit-learn has an RFE procedure that can rank the features using a tree-based estimator. Is it meaningful to rank the features with a tree-based RFE to choose the most important ones, and then perform the actual classification with SVM (non-linear) or LDAC? Or should I implement some kind of wrapper method that uses the same classifier to rank the features (trying to classify with different groups of features would be very time-consuming)?

Just try and see if it improves the classification score as measured with cross-validation. Also, before trying RFE, I would try less CPU-intensive schemes such as univariate chi2 feature selection.
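For instance, a minimal sketch of univariate chi2 selection in scikit-learn, compared against the full feature set under the same cross-validation; the random data here is only a placeholder for your 40 text features (note that chi2 requires non-negative values):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Placeholder for your data: 40 non-negative features (chi2 requires >= 0).
    rng = np.random.RandomState(0)
    X = np.abs(rng.randn(200, 40))
    y = rng.randint(0, 2, 200)

    # Score the full feature set and a chi2-reduced set with the same CV.
    full = cross_val_score(SVC(), X, y, cv=5).mean()
    reduced = cross_val_score(make_pipeline(SelectKBest(chi2, k=15), SVC()),
                              X, y, cv=5).mean()
    print("all 40 features: %.3f, top 15 by chi2: %.3f" % (full, reduced))

Keeping the selector inside a pipeline ensures it is re-fitted on each training fold, so the cross-validation score stays honest.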

Having 40 features is not too bad. Some machine-learning methods are impeded by irrelevant features, but many are quite robust to them (e.g. naive Bayes, SVM, decision trees). You probably don't need to do feature selection unless you decide to add many more features in.
It's not a bad idea to throw away useless features, but don't waste your own mental time on trying that out unless you have a particular motivation to.

Related

Feature importance with LightGBM

I have trained models using several algorithms, including Random Forest from scikit-learn and LightGBM, and these models perform similarly in terms of accuracy and other statistics.
The issue is the inconsistent behavior between these two algorithms in terms of feature importance. I used default parameters, and I know that they use different methods for calculating feature importance, but I suppose the features highly correlated with the target should always have the most influence on the model's predictions. Random Forest makes more sense to me because the highly correlated features appear at the top, while that is not the case for LightGBM.
Is there a way to explain this behavior, and is this LightGBM result trustworthy enough to be presented?
[Plots: Random Forest feature importance, LightGBM feature importance, and correlation with the target]
I have had a similar issue. The default feature importance for LGBM is based on 'split', and when I changed this to 'gain', the plots gave similar results.
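For reference, a minimal sketch of switching the importance type in LightGBM's plotting helper; the tiny random-data model here is just a placeholder for your trained classifier:

    import lightgbm as lgb
    import numpy as np

    # Tiny placeholder model; substitute your trained LGBMClassifier.
    rng = np.random.RandomState(0)
    X, y = rng.randn(500, 10), rng.randint(0, 2, 500)
    model = lgb.LGBMClassifier(n_estimators=50).fit(X, y)

    # 'split' (the default) counts how often a feature is used in a split;
    # 'gain' sums the loss reduction those splits achieve, which tends to
    # agree better with random-forest impurity importances.
    lgb.plot_importance(model, importance_type='split')
    lgb.plot_importance(model, importance_type='gain')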
Well, GBM is often shown to perform better than random forest, especially LightGBM. A properly tuned LightGBM will most likely beat random forest in terms of both performance and speed.
GBM advantages :
More actively developed. A lot of new features have been added to modern GBM implementations (xgboost, lightgbm, catboost), which improves their performance, speed, and scalability.
GBM disadvantages :
Number of parameters to tune
Tendency to overfit easily
If you aren't completely sure the hyperparameters are tuned correctly for the LightGBM, stick with the Random Forest; this will be easier to use and maintain.

Advice for my plan - large dataset of students and grades, looking to classify bottom 2%

I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, this dataset has 36 variables for about 30 million students as predictors, and the students' grades as the responses.
My goal is to be able to predict whether a student will fail out (i.e. be in the bottom 2nd percentile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research, I planned to account for this by increasing the cost of an incorrect classification in the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
For an imbalanced dataset, class weighting is the most common approach. But with such a large dataset (30M training examples) split 98% : 2% between the two classes, I'd say it is hard to keep the model unbiased against the minority class using class weights alone, since weighting is not very different from shrinking the training set down to a balanced one.
Here are some steps for evaluating model accuracy:
Split your dataset into train, evaluation, and test sets.
For the evaluation metric, I suggest these alternatives:
a. Make sure the first class makes up at least 20% of both the evaluation and test sets.
b. Use precision and recall as the evaluation metrics for your model (rather than the F1 score).
c. Use Cohen's kappa score (coefficient) as the evaluation metric.
From my own perspective, I prefer b.
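A minimal sketch of options b and c with scikit-learn's metrics, using random placeholder labels at the question's 98 : 2 ratio:

    import numpy as np
    from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

    # Placeholders for the evaluation-set labels and predictions (1 = fail).
    rng = np.random.RandomState(0)
    y_true = rng.binomial(1, 0.02, 100000)
    y_pred = rng.binomial(1, 0.02, 100000)

    # Option b: precision and recall for the minority (failing) class.
    print("precision:", precision_score(y_true, y_pred, pos_label=1))
    print("recall:", recall_score(y_true, y_pred, pos_label=1))

    # Option c: Cohen's kappa corrects for chance agreement, so it stays
    # informative at a 98:2 class ratio, unlike raw accuracy.
    print("kappa:", cohen_kappa_score(y_true, y_pred))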
Since you are using TensorFlow, I assume you are familiar with deep learning, so use deep learning instead of classical machine learning; that gives you many additional alternatives. Anyway, here are some steps for both the machine learning and the deep learning approach.
For Machine Learning Algorithms
1. Decision tree algorithms (especially Random Forest).
2. If the features have almost no correlation with each other (correlations approaching zero, e.g. 0.01), I would try Complement Naive Bayes classifiers for multinomial features, or Gaussian Naive Bayes with class weights for continuous features.
3. Try some nonparametric learning algorithms. You may not be able to fit this training set with Support Vector Machines (SVM) easily, because the dataset is rather large, but you could try.
4. Try unsupervised learning algorithms (this sometimes gives you a more generic model).
For Deep Learning Algorithms
1. Encoder and decoder architectures, or simply generative adversarial networks (GANs).
2. Siamese networks.
3. Train a model using 1D convolution layers.
4. Use class weights (see the sketch after this list).
5. Use balanced batches of the training set, randomly chosen (also sketched below).
You have many other alternatives. From my own perspective, I would try hard to get it working with 1, 3, or 5. For deep learning, the 5th approach sometimes works very well, and I recommend trying it together with 1 and 3.
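To make approaches 4 and 5 concrete, here is a minimal TensorFlow 2 / Keras sketch with random placeholder data; the two-layer network, batch size, and the 49:1 weight are illustrative assumptions, not tuned values:

    import numpy as np
    import tensorflow as tf

    # Placeholder data: 36 socioeconomic features, ~2% positive (fail) labels.
    rng = np.random.RandomState(0)
    X = rng.randn(100000, 36).astype('float32')
    y = (rng.rand(100000) < 0.02).astype('float32')

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(36,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

    # Approach 4: class weights inversely proportional to class frequency
    # (2% positives -> roughly 0.98 / 0.02 = 49 on the minority class).
    model.fit(X, y, epochs=2, batch_size=4096, class_weight={0: 1.0, 1: 49.0})

    # Approach 5: balanced batches, drawing each class with probability 0.5.
    pos = tf.data.Dataset.from_tensor_slices((X[y == 1], y[y == 1])).shuffle(5000).repeat()
    neg = tf.data.Dataset.from_tensor_slices((X[y == 0], y[y == 0])).shuffle(50000).repeat()
    balanced = tf.data.experimental.sample_from_datasets([neg, pos], [0.5, 0.5]).batch(4096)
    model.fit(balanced, epochs=2, steps_per_epoch=200)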

Python classification define feature importance

I am wondering if it is possible to define feature importances/weights in Python classification methods? For example:
model = tree.DecisionTreeClassifier(feature_weight = ...)
I've seen that RandomForest has an attribute feature_importances_, which shows the importance of features based on the analysis. But is it possible to define the feature importance for the analysis in advance?
Thank you very much for your help in advance!
The feature importance determination in random forest classifiers uses a random forest-specific method (invert all binary tests over the feature, and get the additional classification error).
Feature importance is thus a concept that relates to the predictive ability of the model, not the training phase. Now, if you want to make it so that your model favours some feature over others, you will have to find some trick that depends on the model.
Regarding sklearn's DecisionTreeClassifier, such a trick does not appear to be trivial. You could customize your class weights, if you know some classes will be more easily predicted by the features you want to favour; but this seems pretty dirty.
In other types of models, such as ones using kernels, you can do this more easily, by setting hyperparameters which directly relate to features.
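As an illustration of that kind of trick, here is a minimal sketch that imposes hand-picked (hypothetical) per-feature weights on an RBF SVM by scaling the feature columns; stretching an axis makes the kernel distances more sensitive to that feature:

    import numpy as np
    from sklearn.svm import SVC

    # Placeholder data with 4 features.
    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 4), rng.randint(0, 2, 100)

    # Hypothetical importances: larger weight = more influence on the kernel.
    weights = np.array([2.0, 1.0, 0.5, 1.0])

    # Scale the columns before fitting, and again before predicting.
    clf = SVC(kernel='rbf').fit(X * weights, y)
    print(clf.predict(X[:5] * weights))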
If you are trying to limit overfitting, I would also simply suggest that you remove the features you know to be less important.

What are good metrics to evaluate the performance of a multi-class classifier?

I'm trying to run a classifier on a set of about 1000 objects, each with 6 floating-point variables. I've used scikit-learn's cross-validation features to generate an array of the predicted values for several different models. I've then used sklearn.metrics to compute the accuracy of my classifiers, and the confusion table. Most classifiers have around 20-30% accuracy. Below is the confusion table for the SVC classifier (25.4% accuracy).
Since I'm new to machine learning, I'm not sure how to interpret that result, or whether there are other good metrics for evaluating the problem. Intuitively speaking, given that the classifier got 25% of the predictions right, I believe it is at least somewhat effective, right? How can I express that with statistical arguments?
If this is a confusion table, I think your classifier predicts class E the majority of the time. Your class E is probably overrepresented in your dataset; accuracy is not a good metric when your classes do not have the same number of instances.
For example, if you have 3 classes A, B, C, and class A is overrepresented (90%) in the test dataset, then a classifier that predicts class A all the time will reach 90% accuracy.
A good metric to use is log loss; logistic regression is a good algorithm that optimizes this metric.
See https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class
Another solution is to oversample your small classes.
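A minimal sketch of the log-loss suggestion with scikit-learn, on synthetic placeholder data of roughly the question's shape (1000 samples, 6 features, several classes):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    # Placeholder data: 1000 samples, 6 features, 5 classes.
    X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                               n_classes=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Log loss is computed on predicted probabilities, not hard labels.
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("log loss:", log_loss(y_te, clf.predict_proba(X_te)))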
First of all, I find it very difficult to read raw confusion tables. Plotting the table as an image gives a much better intuitive understanding of what is going on.
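For example, a minimal sketch that renders a confusion matrix as an image with matplotlib, using random placeholder labels in place of your cross-validated predictions:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Placeholders for the true labels and cross-validated predictions.
    rng = np.random.RandomState(0)
    y_true = rng.randint(0, 6, 1000)
    y_pred = rng.randint(0, 6, 1000)

    cm = confusion_matrix(y_true, y_pred)
    plt.imshow(cm, interpolation='nearest', cmap='Blues')
    plt.colorbar()
    plt.xlabel('predicted class')
    plt.ylabel('true class')
    plt.show()

A dominant bright column (e.g. every row bright at class E) then jumps out immediately, which is much harder to spot in a table of numbers.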
It is advisable to have a single-number metric to optimize, since that is easier and faster. If you find that your system doesn't perform as you expect, revise your choice of metric.
Accuracy is usually a good metric if you have the same number of examples in every class. Otherwise (which seems to be the case here), I'd advise using the F1 score, which takes into account both the precision and the recall of your estimator.
EDIT: However, it is up to you to decide whether the ~25% accuracy, or whatever metric you use, is "good enough". If you are classifying whether a robot should shoot a person, you should probably revise your algorithm, but if you are deciding whether data is pseudo-random or truly random, 25% accuracy could be more than enough to prove the point.

Ensemble feature selection from feature sets

I have a question about ensemble feature selection.
My dataset consists of 1000 samples with about 30000 features, and they are classified into label A or label B.
What I want to do is pick some features which can classify the labels efficiently.
I used three types of methods: a univariate method (Pearson's coefficient), lasso regression, and SVM-RFE (recursive feature elimination), so I got three feature sets from them. I used Python's scikit-learn for the feature selection.
Now I am thinking of an ensemble feature selection approach, because the number of features is so large. In this case, what is the way to build an integrated subset from the 3 feature sets?
What I can think of is taking the union of the sets and running lasso regression or SVM-RFE again, or just taking the intersection of the sets.
Can anyone give an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently", one thing you can do is to use your classification algorithm (e.g. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of the features found by the three methods you used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case, I believe the best way to select features in your case is to select the ones that optimize your goal, which seems to be classification accuracy, hence the cross-validation proposal.
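A minimal sketch of that proposal with scikit-learn's RFECV, wrapping a linear SVM; the synthetic data is a placeholder with far fewer features than your 30000, to keep the example cheap:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.svm import LinearSVC

    # Placeholder for the union of your three feature sets.
    X, y = make_classification(n_samples=1000, n_features=200, n_informative=10,
                               random_state=0)

    # Recursively drop the weakest features, keeping the subset that
    # maximizes cross-validated accuracy of the wrapped classifier.
    selector = RFECV(LinearSVC(max_iter=5000), step=10, cv=5, scoring='accuracy')
    selector.fit(X, y)
    print("features kept:", selector.n_features_)
    X_reduced = selector.transform(X)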
