So as the title suggests, my question is whether feature selection algorithms are independent of the regression/classification model chosen. Maybe some feature selection algorithms are independent and some are not? If so, can you name a few of each kind? Thanks.
It depends on the algorithm you use to select features. Filter methods, which are applied prior to modeling, are of course model-agnostic: they use statistical measures such as the chi-squared test or correlation coefficients to discard uninformative features, so the result can be fed to any model.
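For illustration, here is a minimal sketch of a filter method in scikit-learn, using the standard iris data purely as an example; the chi-squared scores are computed against the target before any model is chosen:

```python
# Minimal filter-method sketch: score features with a chi-squared test
# before any model is chosen. The iris data is just an illustration.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared score vs. the target.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)  # per-feature chi-squared statistics
print(X_reduced.shape)   # (150, 2): same rows, fewer columns
```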
If you use embedded methods, where features are selected during model fitting, different models may well find value in different feature sets. Lasso and Elastic Net are classic examples (Ridge regression shrinks coefficients toward zero but rarely makes them exactly zero, so it does not really select features).
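As a sketch of the embedded flavour (synthetic data, illustrative alpha), scikit-learn's SelectFromModel can wrap a Lasso so that features whose coefficients are driven to zero during fitting are discarded:

```python
# Minimal embedded-method sketch: the L1 penalty zeroes out weak
# coefficients during fitting, and SelectFromModel keeps the rest.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print(np.flatnonzero(selector.get_support()))  # indices of surviving features
```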
It's worth noting that some model types perform well with sparse data or missing values while others do not.
I want to perform feature selection on the list of 52 features that I have.
But I do not have the class label column in my dataset.
So how do I select features without depending on a class label?
For example, the SelectKBest algorithm chooses features based on the relationship between the features and the class labels. As I said, in my case there is no class label column at all.
Thanks in advance.
You are looking for feature selection for unsupervised learning. Some common methods:
- Removing low-variance features (an implementation here).
- Removing correlated features (can be implemented using corr() from pandas).
- Performing Principal Feature Analysis (PFA) (a python implementation here).
More advanced methods can be found in the Feature Selection for Clustering section of this book.
There is also PCA, which is not feature selection but feature extraction; it is still useful if the ultimate goal is dimensionality reduction.
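Here is a minimal sketch of the first two methods above, with no class label needed anywhere; the thresholds (zero variance, 0.9 correlation) are illustrative choices, not fixed rules:

```python
# Minimal unsupervised feature-selection sketch: drop constant columns,
# then drop one feature from each highly correlated pair.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=[f"f{i}" for i in range(5)])
df["f4"] = df["f0"] * 0.99 + rng.normal(scale=0.01, size=100)  # near-duplicate
df["f3"] = 1.0                                                 # constant column

# 1) Remove zero-variance (constant) features.
vt = VarianceThreshold(threshold=0.0)
kept = df.columns[vt.fit(df).get_support()]
df = df[kept]

# 2) Remove one feature from each pair with |correlation| above 0.9.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
df = df.drop(columns=to_drop)

print(df.columns.tolist())  # f3 (constant) and f4 (duplicate) are gone
```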
I am wondering if it is possible to define feature importances/weights in Python classification methods? For example:
model = tree.DecisionTreeClassifier(feature_weight = ...)
I've seen that RandomForest has an attribute feature_importances_, which shows the importance of each feature computed after fitting. But is it possible to define the feature importances in advance, before the analysis?
Thank you very much for your help in advance!
The feature importance determination in random forest classifiers uses a random-forest-specific method (invert all the binary tests over a feature and measure the additional classification error).
Feature importance is thus a concept that relates to the predictive ability of the model, not the training phase. Now, if you want to make it so that your model favours some feature over others, you will have to find some trick that depends on the model.
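For reference, a minimal sketch of how this looks in scikit-learn: feature_importances_ is the impurity-based measure read off after fitting, while permutation_importance is closer to the shuffle-and-measure idea described above. Neither is something you set in advance.

```python
# Minimal sketch: importances in scikit-learn are computed after fitting,
# not supplied beforehand. The breast-cancer data is just an illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

print(model.feature_importances_)  # impurity-based, one value per feature

# Permutation importance: drop in score when a feature's values are shuffled.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```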
Regarding sklearn's DecisionTreeClassifier, such a model-dependent trick does not appear to be trivial. You could customize your class weights, if you know some classes will be more easily predicted by the features you want to favour; but this seems pretty dirty.
In other types of models, such as ones using kernels, you can do this more easily, by setting hyperparameters which directly relate to features.
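For instance, with an RBF kernel, rescaling a feature's column changes its influence on the kernel distance, so per-feature weights act like importances chosen in advance. A sketch, with purely illustrative weights:

```python
# Minimal sketch of favouring features in a kernel model: rescaling a
# column changes its contribution to the RBF distance, so per-feature
# weights behave like soft importances set before training.
# The weights below are illustrative, not tuned values.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC

weights = np.array([2.0, 1.0, 1.0, 0.5])  # favour feature 0, damp feature 3

X, y = load_iris(return_X_y=True)
model = make_pipeline(
    StandardScaler(),                            # put features on equal footing
    FunctionTransformer(lambda Z: Z * weights),  # then apply the chosen weights
    SVC(kernel="rbf"),
)
model.fit(X, y)
print(model.score(X, y))
```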
If you are trying to limit overfitting, I would also simply suggest that you remove the features you know to be less important.
I am using tensorflow's DNNRegressor to model a multivariate regression problem. I want to form an optimal feature set from a mixed bag of categorical and continuous features. What would be the best way to proceed? The reason I want this approach to be independent of the model is that I couldn't find much about feature selection/evaluation in the direct context of tensorflow.
TensorFlow is mostly a library for machine learning algorithms, so you need to use other libraries for the preprocessing.
Scikit-learn is good in many cases. You should try it; it contains feature selection methods. I'm not sure about categorical features, but if they are not supported you can always convert them to numerical ones.
They suggest:
For regression: f_regression, mutual_info_regression
And for any problem, you can use their first method VarianceThreshold
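A minimal sketch of that scikit-learn route on synthetic numeric data (categorical columns would first need encoding, e.g. with pd.get_dummies):

```python
# Minimal sketch: variance filter, then univariate scores against a
# continuous target. Data is synthetic and purely numeric for illustration.
from sklearn.datasets import make_regression
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       f_regression, mutual_info_regression)

X, y = make_regression(n_samples=300, n_features=15, n_informative=4,
                       random_state=0)

X = VarianceThreshold(threshold=0.0).fit_transform(X)  # drop constant columns

# Keep the 5 best features under each univariate criterion.
X_f = SelectKBest(f_regression, k=5).fit_transform(X, y)
X_mi = SelectKBest(mutual_info_regression, k=5).fit_transform(X, y)
print(X_f.shape, X_mi.shape)  # (300, 5) each
```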
I have quickly looked for a Distributed Lag Model in statsmodels but can't find one. The closest thing is the VAR model. Can I transform a VAR model into a Distributed Lag Model, and how? It would be great if other packages already have a Distributed Lag Model; please let me know if so.
Thanks!
If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.
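To make the first point concrete, here is a minimal sketch of a finite distributed lag model fit with plain OLS in statsmodels; the lag depth of 3 and the simulated coefficients are illustrative only:

```python
# Minimal finite distributed lag sketch: regress y_t on x_t, ..., x_{t-p}
# with OLS. The lag depth p = 3 and the true coefficients are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 3
x = rng.normal(size=n)
y = (0.5 * x + 0.3 * np.roll(x, 1) + 0.1 * np.roll(x, 2)
     + rng.normal(scale=0.1, size=n))

# Build the covariate matrix of lagged predictors.
df = pd.DataFrame({"y": y, "x_lag0": x})
for k in range(1, p + 1):
    df[f"x_lag{k}"] = df["x_lag0"].shift(k)
df = df.dropna()  # the first p rows are lost to lagging

res = sm.OLS(df["y"], sm.add_constant(df.drop(columns="y"))).fit()
print(res.params)  # recovers roughly 0.5, 0.3, 0.1, ~0 for the lags
```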
I have a question about ensemble feature selection.
My data set consists of 1000 samples with about 30,000 features, classified into label A or label B.
What I want to do is pick some features which can classify the labels efficiently.
I used three types of methods: a univariate method (Pearson's correlation coefficient), lasso regression, and SVM-RFE (recursive feature elimination), so I got three feature sets from them. I used Python's scikit-learn for the feature selection.
I am now thinking of an ensemble feature selection approach, because the number of features is so large. In this case, what is the way to make an integrated subset from the 3 feature sets?
What I can think of is taking the union of the sets and running lasso regression or SVM-RFE again, or just taking the intersection of the sets.
Can anyone give an idea?
I guess what you do depends on how you want to use these features afterwards. If your goal is to "classify the label efficiently", one thing you can do is use your classification algorithm (e.g. SVC, Lasso, etc.) as a wrapper and do Recursive Feature Elimination (RFE) with cross-validation.
You can start from the union of the features from the three methods you already used, or from scratch for the given type of model you want to fit, since the number of examples is small. In any case, I believe the best way to select features here is to select the ones that optimize your goal, which seems to be classification accuracy; hence the CV proposal.
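A minimal sketch of that proposal, assuming synthetic data and a placeholder index set standing in for the union of your three feature sets:

```python
# Minimal sketch of CV-based RFE over the union of candidate feature
# sets. The data and the union indices are placeholders for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=10, random_state=0)

union_idx = np.arange(50)  # placeholder for the union of the 3 sets
X_union = X[:, union_idx]

# A linear kernel lets RFE rank features by coefficient magnitude.
selector = RFECV(SVC(kernel="linear"), step=5,
                 cv=StratifiedKFold(5), scoring="accuracy")
selector.fit(X_union, y)

print(union_idx[selector.support_])  # features kept by CV accuracy
```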