Feature selection without a class label column - Python

I want to perform feature selection on the list of 52 features that I have.
But I do not have the class label column in my dataset.
So how do I select features without depending on a class label?
For example, the SelectKBest algorithm chooses features based on the relationship between the features and the class labels. As said, in my case I do not have a class label column at all.
Thanks in advance.

You are looking for feature selection for unsupervised learning. Some common methods:
- Removing low-variance features (an implementation here).
- Removing correlated features (can be implemented using corr() from pandas).
- Performing Principal Feature Analysis (PFA) (a Python implementation here).
More advanced methods can be found in the Feature Selection for Clustering section of this book.
There is also PCA, which is feature extraction rather than selection, but it is useful if the ultimate goal is dimensionality reduction.
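For reference, here is a minimal sketch of the first two label-free methods in Python, assuming the 52 features live in a pandas DataFrame named df (a placeholder) and using arbitrary variance and correlation cutoffs:

```python
# Sketch: label-free feature selection via variance and correlation filters.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def drop_low_variance(df, threshold=0.01):
    """Remove features whose variance falls below the threshold (cutoff is an assumption)."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    return df.loc[:, selector.get_support()]

def drop_correlated(df, cutoff=0.9):
    """Remove one feature from every pair with |correlation| above the cutoff."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)

# Example usage on random stand-in data (52 columns, as in the question):
df = pd.DataFrame(np.random.rand(100, 52), columns=[f"f{i}" for i in range(52)])
reduced = drop_correlated(drop_low_variance(df), cutoff=0.9)
print(reduced.shape)
```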

Related

Can I select 100 features using LDA when there are only 2 classes?

In Python, can I get the 100 best features out of 200k by performing Linear Discriminant Analysis on data that has 2 classes?
Although LDA is typically presented for multi-class problems, it can also be used in binary classification problems. However, LDA can produce at most n_classes - 1 components, so with 2 classes it yields a single discriminant direction rather than 100 features.
You can use LDA for dimensionality reduction, which aims to reduce the number of features. Feature selection, on the other hand, is the process of selecting a subset of features from the full set of features.
So LDA is a kind of feature extraction, not feature selection. This means LDA will create a new set of features rather than select the best existing ones.
In essence, the original features no longer exist and new features are constructed from the available data that are not directly comparable to the original data [1].
Check this link for further reading
[1] Linear Discriminant Analysis for Dimensionality Reduction in Python
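To make the extraction-vs-selection point concrete, here is a minimal sketch on synthetic binary-class data (the shapes are illustrative, not the asker's 200k features): scikit-learn's LinearDiscriminantAnalysis allows at most n_classes - 1 components, so with 2 classes the transform returns one constructed feature, not a subset of the originals.

```python
# Sketch: LDA as dimensionality reduction on synthetic binary-class data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(500, 200)           # 500 samples, 200 original features (stand-in sizes)
y = np.random.randint(0, 2, size=500)  # 2 classes

# With 2 classes, LDA permits at most n_classes - 1 = 1 component.
lda = LinearDiscriminantAnalysis(n_components=1)
X_new = lda.fit_transform(X, y)
print(X_new.shape)  # (500, 1): one new feature built from all originals
```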

Implementing Scikit Learn's FeatureHasher for High Cardinality Data

Background: I am working on a binary classification of health insurance claims. The data I am working with has approximately 1 million rows and a mix of numeric features and categorical features (all of which are nominal and discrete). The issue I am facing is that several of my categorical features have high cardinality, with many values that are very uncommon or unique. I have plotted the 8 categorical features with the highest counts of unique factor levels.
Alternative to Dummy Variables: I have been reading up on feature hashing and understand that this method is an alternative that offers a fast and space-efficient way of vectorizing features, and is particularly suitable for categorical data with high cardinality. I plan to use Scikit Learn's FeatureHasher to perform feature hashing on my categorical features with more than 100 unique feature levels (I will create dummy variables for the remaining categorical features with fewer than 100 unique feature levels). Before I implement this, I have a few questions relating to feature hashing and how it relates to model performance in machine learning:
What is the primary advantage of using feature hashing as opposed to dummying only the most frequently occurring factor levels? I assume there is less information loss with the feature hashing approach, but I need more clarification on what advantages hashing provides for machine learning algorithms when dealing with high cardinality.
I am interested in evaluating feature importance after evaluating a few separate classification models. Is there a way to evaluate hashed features in the context of how they relate to the original categorical levels? Is there a way to reverse hashes or does feature hashing inevitably lead to loss of model interpretability?
Sorry for the long post and questions. Any feedback/recommendations would be much appreciated!
Feature hashing can support new categories during inference that were not seen in training. With dummy encoding, you can only encode a fixed set of previously seen categories. If you encounter a category not seen in training, you're out of luck.
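To make that concrete, here is a minimal sketch of FeatureHasher on one high-cardinality column; the values, the bucket count n_features=32, and the idea of a "provider id" column are illustrative assumptions, not the asker's data. Note that a value never seen before still hashes into the same fixed-width space:

```python
# Sketch: hash a high-cardinality categorical column into a fixed number of buckets.
from sklearn.feature_extraction import FeatureHasher

train_values = ["p1", "p2", "p3", "p1"]
new_values = ["p9999"]  # category never seen during training

hasher = FeatureHasher(n_features=32, input_type="string")

# Wrap each value in a list so the whole category string is hashed,
# not its individual characters.
X_train = hasher.transform([[v] for v in train_values]).toarray()
X_new = hasher.transform([[v] for v in new_values]).toarray()

print(X_train.shape, X_new.shape)  # (4, 32) (1, 32): unseen values still fit the same 32 columns
```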
For feature importance, there are two canonical approaches.
a) Train/evaluate your model with and without each feature to see its effect. This can be computationally expensive.
b) Train/evaluate your model with the feature and also with that feature permuted among all samples.
With feature hashing, each feature expands to multiple columns so b) will be tricky and I haven't found any packages that do permutation importance of feature hashed columns.
So, I think a) is probably your best bet, considering you only have 1 million rows.
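A rough sketch of approach a), measured at the level of whole (pre-hashing) features; the synthetic data, the RandomForestClassifier, and the 3-fold cross-validation here are stand-ins, not the asker's pipeline:

```python
# Sketch of approach a): drop-column importance via cross-validated score differences.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
baseline = cross_val_score(clf, X, y, cv=3).mean()

# Re-fit with each feature removed; the drop in score is that feature's importance.
for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)
    score = cross_val_score(clf, X_drop, y, cv=3).mean()
    print(f"feature {i}: importance ~ {baseline - score:.4f}")
```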
Also, you'll probably get better answers to ML questions on Cross Validated than on Stack Overflow.

Is feature selection agnostic to regression/classification model chosen?

So as the title suggests, my question is whether feature selection algorithms are independent of the regression/classification model chosen. Maybe some feature selection algorithms are independent and some are not? If so, can you name a few of each kind? Thanks.
It depends on the algorithm you use to select features. Filter methods that are applied prior to modeling are of course agnostic, as they use statistical measures such as the chi-squared statistic or a correlation coefficient to get rid of unnecessary features.
If you use embedded methods, where features are selected during model creation, it is possible that different models will find value in different feature sets. Lasso and Elastic Net are a couple of examples of these (Ridge regression, by contrast, only shrinks coefficients and does not drive them to exactly zero, so it does not select features on its own).
It's worth noting that some model types perform well with sparse data or missing values while others do not.
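As an illustration of the two families, a model-agnostic filter step and a model-dependent embedded step might look like the sketch below; the synthetic data, k=10, and alpha=0.1 are arbitrary assumptions:

```python
# Sketch: a model-agnostic filter method vs. a model-dependent embedded method.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, random_state=0)

# Filter: ranks features by a univariate statistic, no model involved.
filter_mask = SelectKBest(f_regression, k=10).fit(X, y).get_support()

# Embedded: the Lasso model itself zeroes out coefficients of unhelpful features.
embedded_mask = SelectFromModel(Lasso(alpha=0.1)).fit(X, y).get_support()

# The two subsets need not agree, which is the point made above.
print(filter_mask.sum(), embedded_mask.sum())
```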

How to implement feature selection for categorical variables?

I'm having a problem selecting the important features. The features in the dataset are both categorical and numerical, and the target variable is False or True. There are about 100 features, so I need to drop the ones that are not related to the target variable. Which methods can be used other than Random Forest feature importance? I'm using Python. In R I can use the Boruta package to select the important features, but I do not know how to do this in Python.
Selecting relevant features can be done by calculating the p-value of each feature with respect to the target; see https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf.
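A minimal sketch of p-value-based selection for the categorical side, assuming the categoricals are one-hot encoded first; the toy DataFrame and the column names "color", "region", and "target" are placeholders:

```python
# Sketch: chi-squared p-values for one-hot-encoded categorical features vs. a boolean target.
import pandas as pd
from sklearn.feature_selection import chi2

df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green", "blue", "red"],
    "region": ["n", "s", "n", "e", "s", "n"],
    "target": [True, False, True, False, False, True],
})

X = pd.get_dummies(df[["color", "region"]])  # one-hot encode the categoricals
y = df["target"]

chi2_stats, p_values = chi2(X, y)
scores = pd.Series(p_values, index=X.columns).sort_values()
print(scores)  # smaller p-value = stronger association; drop columns with large p-values
```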

Ensemble feature selection from feature sets

I have a question about ensemble feature selection.
My data set consists of 1000 samples with about 30000 features, and the samples are classified into label A or label B.
What I want to do is pick some features which can classify the labels efficiently.
I used three types of methods: a univariate method (Pearson's correlation coefficient), lasso regression, and SVM-RFE (recursive feature elimination), so I got three feature sets from them. I used Python's scikit-learn for the feature selection.
Now I am thinking of an ensemble feature selection approach, because the number of features is so large. In this case, what is the best way to build an integrated subset from the 3 feature sets?
What I can think of is taking the union of the sets and applying lasso regression or SVM-RFE again, or just taking the intersection of the sets.
Can anyone give an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently", one thing you can do is to use your classification algorithm (e.g. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of features from the previous three methods you used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case I believe the best way to select features in your case is to select the ones that optimize your goal, which seems to be classification accuracy, thus the CV proposal.
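A sketch of that cross-validated RFE suggestion, starting from the union of the three feature sets; the stand-in data matrix, the three index arrays, and the linear SVC estimator are placeholders for whatever the original selections actually were:

```python
# Sketch: RFE with cross-validation on the union of three previously selected feature sets.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X = np.random.rand(1000, 500)           # stand-in for the 1000 x 30000 matrix
y = np.random.randint(0, 2, size=1000)  # labels A/B encoded as 0/1

pearson_idx = np.arange(0, 50)    # hypothetical indices picked by the univariate method
lasso_idx = np.arange(30, 80)     # hypothetical indices picked by lasso
svmrfe_idx = np.arange(60, 110)   # hypothetical indices picked by SVM-RFE

union_idx = np.union1d(np.union1d(pearson_idx, lasso_idx), svmrfe_idx)
X_union = X[:, union_idx]

# A linear SVC exposes coef_, which RFECV needs in order to rank features.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5, scoring="accuracy")
selector.fit(X_union, y)

selected = union_idx[selector.support_]  # map back to indices in the original feature space
print(len(selected))
```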
