How to implement feature selection for categorical variables? - python

I'm having trouble selecting the important features. The dataset has about 100 features, both categorical and numerical, and the target variable is True or False. I need to drop the features that are not related to the target variable. Which methods can be used other than Random Forest feature importance? I'm using Python. In R I can use the Boruta package to select the important features, but I don't know how to do this in Python.

Selecting relevant features can be done by computing the p-value of each feature with respect to the target; see https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf.
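To make the p-value idea concrete, here is a minimal sketch with scikit-learn's SelectKBest, assuming a pandas DataFrame with a boolean target column named "target" (the file name and column name are placeholders):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical input: "data.csv" and the "target" column are placeholders.
df = pd.read_csv("data.csv")
X = pd.get_dummies(df.drop(columns="target"))  # one-hot encode the categorical columns
y = df["target"].astype(int)                   # True/False -> 1/0

# ANOVA F-test p-values; chi2 is an alternative for the (non-negative) one-hot columns.
selector = SelectKBest(score_func=f_classif, k=20)  # keep the 20 highest-scoring features
selector.fit(X, y)

p_values = pd.Series(selector.pvalues_, index=X.columns).sort_values()
print(p_values.head(20))  # smallest p-values = strongest relation to the target
X_reduced = X.loc[:, selector.get_support()]
```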

Related

IterativeImputer - sample_posterior

Sklearn implements an imputer called IterativeImputer. I believe it works by predicting the missing feature values in a round-robin fashion, using an estimator.
It has an argument called sample_posterior, but I can't figure out when I should use it.
sample_posterior : boolean, default=False
    Whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation. Estimator must support return_std in its predict method if set to True. Set to True if using IterativeImputer for multiple imputations.
I looked at the source code, but it still wasn't clear to me. Should I use this if I have multiple features that I am going to fill with the iterative imputer, or should I use this if I plan to use the imputer multiple times, like for a training set and then a validation set?
Even with multiple features, and a training and validation/test set, you don't need sample_posterior. The "multiple imputations" part of the docstring means generating more than one imputed dataset, i.e. several copies of the data with the missing values filled in differently; see e.g. the Wikipedia article on multiple imputation.
Normally, IterativeImputer imputes the missing values of a feature using the predictions of a model built on the other features (iteratively, round robin, etc.). If you use a model that produces not just a single prediction but an output distribution (the posterior), then you can sample from that distribution randomly, hence sample_posterior. By running it multiple times, with different random seeds, these random choices are different, and you get multiple imputed datasets.
The documentation on that isn't great, but there's a (somewhat aged) PR for an extended example, and a toy example on SO.
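A toy sketch of that multiple-imputation use, with BayesianRidge as the estimator (its predict supports return_std, as the docstring requires); the data and the number of imputations are arbitrary placeholders:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to import IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Each random seed yields a differently imputed copy of the same dataset.
imputed_datasets = []
for seed in range(5):  # 5 imputations, an arbitrary choice
    imp = IterativeImputer(estimator=BayesianRidge(),  # predict() supports return_std
                           sample_posterior=True,
                           random_state=seed)
    imputed_datasets.append(imp.fit_transform(X))

# With sample_posterior=False all five copies would be identical;
# with True, the filled-in values differ from seed to seed.
for Xi in imputed_datasets:
    print(Xi)
```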

Correlations of feature columns in TensorFlow

I've recently started exploring feature columns in TensorFlow.
If I understood the documentation right, feature columns are just a 'frame' for transformations applied just before the data is fed to the model. So, if I want to use them, I define some feature columns, create a DenseFeatures layer from them, and when I feed data into the model, all features go through that DenseFeatures layer, get transformed, and then go into the first Dense layer of my NN.
My question is: is it possible to check the correlation of the transformed features with my target variable?
For example, I have a categorical feature that corresponds to the day of the week (Mon/Tue/.../Sun), which I could encode as 1/2/.../7. Its correlation with the target will not be the same as that of the categorical feature column (e.g. an indicator column), because the model doesn't understand that 7 is the maximum of the sequence, whereas with categories it becomes a one-hot encoded feature with clear boundaries.
Let me know if anything is unclear.
I will be grateful for any help!
TensorFlow does not provide a built-in feature importance measure the way scikit-learn's tree ensembles or XGBoost do.
However, you can test the importance of (or relationship between) a feature and the target in TensorFlow as follows; a rough code sketch appears after this list.
1) Shuffle the values of the particular feature whose relationship with the target you want to test. That is, if your feature is fea1, the value at df['fea1'][0] becomes the value at df['fea1'][4], the value at df['fea1'][2] becomes the value at df['fea1'][3], and so on.
2) Fit the model to the modified training data and check the accuracy on the validation data.
3) If the accuracy drops drastically, the feature was strongly related to the target; if the accuracy barely changes, the feature isn't of great importance (large error increase = high importance).
You can do the same with the other features in your training data.
This can take some time and effort.
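A rough sketch of the shuffle-and-refit test above. Here build_model is assumed to return a compiled tf.keras model with an accuracy metric, and train_df/val_df are pandas DataFrames with a "target" column; all of these names are placeholders.

```python
import numpy as np

def accuracy_with_shuffled_feature(feature, train_df, val_df, build_model, target="target"):
    shuffled = train_df.copy()
    # Shuffle only the column under test, leaving everything else aligned.
    shuffled[feature] = np.random.permutation(shuffled[feature].values)

    model = build_model()
    model.fit(shuffled.drop(columns=target).values, shuffled[target].values,
              epochs=10, verbose=0)
    _, accuracy = model.evaluate(val_df.drop(columns=target).values,
                                 val_df[target].values, verbose=0)
    return accuracy

# Compare against a baseline model trained on the untouched data:
# a large drop in accuracy suggests the shuffled feature matters for the target.
```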

Is feature selection agnostic to regression/classification model chosen?

So as the title suggests, my question is whether feature selection algorithms are independent of the regression/classification model chosen. Maybe some feature selection algorithms are independent and some are not? If so, can you name a few of each kind? Thanks.
It depends on the algorithm you use to select features. Filter methods, which are applied prior to modeling, are of course agnostic, since they use statistical measures such as the chi-squared statistic or correlation coefficients to discard unnecessary features.
If you use embedded methods, where features are selected during model fitting, different models may well find value in different feature sets. Lasso and Elastic Net are a couple of examples (Ridge regression also regularizes, but it only shrinks coefficients rather than setting them to zero, so it does not select features on its own).
It's worth noting that some model types perform well with sparse data or missing values while others do not.
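A small scikit-learn sketch of the distinction, on a synthetic dataset (everything here is illustrative): the filter selection is independent of any downstream model, while the embedded selection is tied to the particular Lasso that produced it.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, random_state=0)

# Filter method: ranks features with a statistic, no model involved, so the
# selected set is the same regardless of which model you fit afterwards.
filter_idx = SelectKBest(f_regression, k=5).fit(X, y).get_support(indices=True)

# Embedded method: the selection is a by-product of fitting one particular model
# (here a Lasso); a different model could keep a different subset.
lasso_idx = SelectFromModel(Lasso(alpha=0.1)).fit(X, y).get_support(indices=True)

print("filter  :", filter_idx)
print("embedded:", lasso_idx)
```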

xgboost feature importance of categorical variable

I am using XGBClassifier to train in Python and there are a handful of categorical variables in my training dataset. Originally, I planned to convert each of them into a few dummies before feeding in my data, but then the feature importance will be calculated for each dummy, not for the original categorical variables. Since I also need to rank all of my original variables (numerical + categorical) by importance, I am wondering how to get the importance of the original variables. Is it simply a matter of adding up the dummies' importances?
You could probably get by with summing the individual categories' importances into their original, parent category. But unless these features are high-cardinality, my two cents would be to report them individually; I tend to err on the side of being more explicit when reporting model performance and importance measures.
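If you do go the summing route, here is one possible sketch, assuming the dummies were created with pd.get_dummies and therefore follow the "<column>_<level>" naming pattern (the model and column names in the usage line are hypothetical):

```python
import pandas as pd

def grouped_importance(model, feature_names, categorical_cols):
    """Sum dummy-level importances back into their parent categorical column."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    totals = {}
    for name, value in importances.items():
        # Assumes pd.get_dummies naming: "<original_column>_<level>".
        parent = next((c for c in categorical_cols if name.startswith(c + "_")), name)
        totals[parent] = totals.get(parent, 0.0) + value
    return pd.Series(totals).sort_values(ascending=False)

# e.g. grouped_importance(xgb_clf, X_train.columns, ["color", "city"])
```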

Ensemble feature selection from feature sets

I have a question about ensemble feature selection.
My data set consists of 1000 samples with about 30000 features, classified into label A or label B.
What I want to do is pick some features that can classify the labels efficiently.
I used three types of methods: a univariate method (Pearson's correlation coefficient), lasso regression, and SVM-RFE (recursive feature elimination), so I got three feature sets from them. I used Python scikit-learn for the feature selection.
Now I am thinking of an ensemble feature selection approach, because the number of features is so large. In this case, what is the way to build an integrated subset from the 3 feature sets?
What I can think of is taking the union of the sets and running lasso regression or SVM-RFE again, or just taking the intersection of the sets.
Can anyone give me an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently", one thing you can do is use your classification algorithm (e.g. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of the features from the previous three methods, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case, I believe the best way to select features here is to select the ones that optimize your goal, which seems to be classification accuracy, hence the CV proposal.
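A minimal sketch of that suggestion with scikit-learn's RFECV, starting from the union of the three feature sets (pearson_idx, lasso_idx, rfe_idx, X and y are placeholders for whatever you already have):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Indices kept by the three earlier methods (placeholders).
union_idx = sorted(set(pearson_idx) | set(lasso_idx) | set(rfe_idx))
X_union = X[:, union_idx]

selector = RFECV(estimator=SVC(kernel="linear"),  # linear kernel exposes coef_ for ranking
                 step=0.1,                        # drop 10% of the remaining features per step
                 cv=5,
                 scoring="accuracy")
selector.fit(X_union, y)

kept = np.array(union_idx)[selector.support_]
print("kept", selector.n_features_, "of", len(union_idx), "features")
```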
