SelectKBest(mutual_info_regression) for classification - python

I am new to feature selection and am not sure whether I understood how to use feature selection for sentiment analysis. Does it make sense to try out SelectKBest(f_classif), SelectKBest(f_regression), SelectKBest(mutuial_info_classif) and SelectKBest(f_info_regression) for a classification problem? I would have thought it does not, only SelectKBest with f_classif and mutual_info_classif.

f_classif
ANOVA F-value between label/feature for classification tasks.
mutual_info_classif
Mutual information for a discrete target.
f_regression
F-value between label/feature for regression tasks.
mutual_info_regression
Mutual information for a continuous target.
You have a discrete target, so no regression here. However, I would not use F-test based type of feature selection for large data sets, because it is based on statistical tests, and for large data sets any difference can become statistically significant and the selection peformance is almost none. (But as part of the task, you definitely can compare them.)

Related

How does Tensorflow's Decision Forests handle categorical data?

I'm evaluating two different unsupervised ML algorithms, Isolation Forest and LSTM Autoencoder model, to identify anomalies in a large time series data. This dataset includes mostly categorical data such as Ip Adresses, cloud subscription Ids,tenant Ids, userAgents, and client Application Ids.
When reading a tutorial on an implementation of a Tensorflow's Decision Tree (TF-DF) model, it mentions that the model handles non-label categorical values natively and
there is no need for preprocessing in the form of one-hot encoding, normalization or extra is_present feature.
Does anybody know how Tensorflow handles the categorical features behind the scenes (assuming they do some transformation into a numeric representation)?
Tl;dr: There is a natural way of using categorical features in decision trees/forests that requires no encoding. Tensorflow Decision Forests uses this and a number of standard transformations to handle categorical features.
Tensorflow Decision Forest (TF-DF) constructs decision tree / decision forest models. A single decision tree recursively splits the dataset along its features. Splits along categorical features can naturally be performed through so-called in-set conditions. For instance, a tree can express a condition like userAgents \in \{“Mozilla/5.0”, “InternetExplorer/10.0”\}. Other types of conditions are also possible. Tensorflow Decision Forests (TF-DF) can construct in-set conditions if the dataset contains categorical features.
More specifically, Tensorflow Decision Forests uses the C++ library Yggdrasil Decision Forests (YDF) under the hood for any advanced computations. YDF offers three different algorithms for finding a good categorical split of the data. For example, the Random algorithm will just try out many possible splits at random and pick the best one.
For performance and quality reasons, YDF also preprocesses categorical features: If a categorical value is very rare, YDF may consider it “out-of-dictionary”, the threshold for “rare” being user-configurable. Furthermore, YDF maps the categorical features to integers by decreasing item frequency, with the mapping stored as part of the model. Note that this is purely an internal encoding; the algorithms are aware that a feature is categorical, hence typical issues with integer encodings do not apply.
Finally, Tensorflow Decision Forests (TF-DF) uses Keras, which expects classification tasks to have an integer label. Therefore, TF-DF users have to encode the label themselves or use the built-in pd_dataframe_to_tf_dataset.
Note that this answer only applies to Tensorflow Decision Forests. Other parts of Tensorflow may need manual encoding.

What is the difference between Linear regression classifier and linear regression to extract the confidential interval?

I am a beginner with machine learning. I want to use time series linear regression to extract confidential interval of my dataset. I don't need to use the linear regression as a classifier. Firstly what is the difference between the two cases? Secondly in python, Is there different way to implement them ?
The main difference is the classifier will compute a probabilty about a label. The regression will compute a quantitative output.
Generally, classifier is used to compute a probability of label, and a regression is often use to compute a quantity. For instance if you want to compute the price of a flat considering some criterias you will use a regression, if you want to compute a label (luxurious, modest, ...) about the same flat considering some criterias you will use classifier.
But to use regressions in order to compute a threshold to seperate labels observed is a technic often used too. That is the case of linear SVM, which compute a boundary between labels. It is called decision boundary. Warning, the main drawback with linear is that is linear: it means the boundary will necessary be a straight line to separate labels. Sometimes it is good enough, sometimes it is not.
Logistic regression is an exception because it compute a probability actually. Its name is misleading.
For regression, when you want to compute a quantitative output, you can use a confidence interval to have an idea about the error. In a classification there is not confidence interval, even if you use linear SVM, it is non sensical. You can use the decision function but it is difficult to interpret in reality, or use the predicted probabilities and to check the number of time the label is wrong and compute a ratio of error. There are plethora ratios available considering your problematic, and it is buntly the subject of a whole book actually.
Anyway, if you're computing a time series, as far as I know your goal is to obtain a quantitative output, then you do not need a classifier as you said. And about extracting it depends totally of the object you used to compute it in python: meaning it depends of the available attributes of the object used. Then depends of the library too. So it would be very better, to answer to you, if you would indicate which libraries and objects you are using.

What is the difference between xgboost, extratreeclassifier, and randomforrestclasiffier?

I am new to all these methods and am trying to get a simple answer to that or perhaps if someone could direct me to a high level explanation somewhere on the web. My googling only returned kaggle sample codes.
Are the extratree and randomforrest essentially the same? And xgboost uses boosting when it chooses the features for any particular tree i.e. sampling the features. But then how do the other two algorithms select the features?
Thanks!
Extra-trees(ET) aka. extremely randomized trees is quite similar to random forest (RF). Both methods are bagging methods aggregating some fully grow decision trees. RF will only try to split by e.g. a third of features, but evaluate any possible break point within these features and pick the best. However, ET will only evaluate a random few break points and pick the best of these. ET can bootstrap samples to each tree or use all samples. RF must use bootstrap to work well.
xgboost is an implementation of gradient boosting and can work with decision trees, typical smaller trees. Each tree is trained to correct the residuals of previous trained trees. Gradient boosting can be more difficult to train, but can achieve a lower model bias than RF. For noisy data bagging is likely to be most promising. For low noise and complex data structures boosting is likely to be most promising.

Ensemble feature selection from feature sets

I have a question about ensemble feature selection.
My data set is consist of 1000 samples with about 30000 features, and they are classified into label A or label B.
What I want to do is picking of some features which can classify the label efficiently.
I used three type of methods, univariate method(Pearson's coefficient), lasso regression and SVM-RFE(recursive feature elimination), so I got three feature sets from them. I used python scikit-learn for feature selection.
Then I am thinking of ensemble feature selection approach, because the size of features were so large. In this case, what is the way to make integrated subset with 3 feature sets?
What can I think is taking union of the sets and using lasso regression or SVM-RFE again, or just take the intersection of the sets.
Can anyone give an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently" one thing you can do is to use your classification algorithm (i.e. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of features from the previous three methods you used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case I believe the best way to select features in your case is to select the ones that optimize your goal, which seems to be classification accuracy, thus the CV proposal.

Python machine learning, feature selection

I am working on a classification task related to written text and I wonder how important it is to perform some kind of "feature selection" procedure in order to improve the classification results.
I am using a number of features (around 40) related to the subject, but I am not sure if all the features are really relevant or not and in which combinations. I am experementing with SVM (scikits) and LDAC (mlpy).
If a have a mix of relevant and irrelevant features, I assume I will get poor classification results. Should I perform a "feature selection procedure" before classification?
Scikits has an RFE procedure that is tree-based that is able to rank the features. Is it meaningful to rank the features with a tree-based RFE to choose the most important features and to perform the actual classification with SVM (non linear) or LDAC? Or should I implement some kind of wrapper method using the same classifier to rank the features (trying to classify with different groups of features would be very time consuming)?
Just try an see if it improves the classification score as measured with cross validation. Also before trying RFE, I would try less CPU intensive schemes such as univariate chi2 feature selection.
Having 40 features is not too bad. Some machine-learning is impeded by irrelevant features, but many things are quite robust to them (eg naive Bayes, SVM, decision trees). You probably don't need to do feature selection unless you decide to add many more features in.
It's not a bad idea to throw away useless features, but don't waste your own mental time on trying that out unless you have a particular motivation to.

Categories