I am wondering if it is possbile to define feature importances/weights in Pyhton Classification methods? For example:
model = tree.DecisionTreeClassifier(feature_weight = ...)
I've seen in RandomForest there is an attribute feature_importance, which shows the importance of features based on analysis. But is it possible that I could define the feature importance for analysis in advance?
Thank you very much for your help in advance!
The feature importance determination in random forest classifiers uses a random forest-specific method (invert all binary tests over the feature, and get the additional classification error).
Feature importance is thus a concept that relates to the predictive ability of the model, not the training phase. Now, if you want to make it so that your model favours some feature over others, you will have to find some trick that depends on the model.
Regarding sklearn's DecisionTreeClassifier, such a trick does not appear to be trivial. You could custom your class weights, if you know some classes will be more easily predicted by some features that you want to favour; but this seems pretty dirty.
In other types of models, such as ones using kernels, you can do this more easily, by setting hyperparameters which directly relate to features.
If you are trying to limit an overfitting, I would also simply suggest that you remove the features you know to be less important.
Related
I have trained a model using several algorithms, including Random Forest from skicit-learn and LightGBM. and these model performs similarly in term of accuracy and other stats.
The issue is the inconsistent behavior between these two algorithms in terms of feature importance. I used default parameters and I know that they are using different method for calculating the feature importance but I suppose the highly correlated features should always have the most influence to the model's prediction. Random Forest makes more sense to me because the highly correlated features appear at top while it is not the case for LightGBM.
Is there a way to explain for this behavior and does this result with LightGBM is trustworthy to be presented?
Random Forest feature importance
LightGBM feature importance
Correlation with target
I have had a similar issue. The default feature importance for LGBM is based on 'split', and when I changed this to 'gain', the plots gave similar results.
Well, GBM is often shown to perform better especially when you comparing with random forest. Especially when comparing it with LightGBM. A properly-tuned LightGBM will most likely win in terms of performance and speed compared with random forest.
GBM advantages :
More developed. A lot of new features are developed for modern GBM model (xgboost, lightgbm, catboost) which affect its performance, speed, and scalability.
GBM disadvantages :
Number of parameters to tune
Tendency to overfit easily
If you aren't completely sure the hyperparameters are tuned correctly for the LightGBM, stick with the Random Forest; this will be easier to use and maintain.
So as the title suggests, my question is whether feature selection algorithms are independent of the regression/classification model chosen. Maybe some feature selection algorithms are independent and some are not? If so can you name a few of each kind? Thanks.
It depends on the algorithm you use to select features. Filter methods that are done prior to modeling are of course agnostic, as they're using statistical methods like chi-squared or correlation coefficient to get rid of unnecessary features.
If you use embedded methods where features are selected during model creation, it is possible that different models will find value in different feature sets. Lasso, Elastic Net, Ridge Regression are a few examples of these.
It's worth noting that some model types perform well with sparse data or missing values while others do not.
I am dealing with a multi-class problem (4 classes) and I am trying to solve it with scikit-learn in Python.
I saw that I have three options:
I simply instantiate a classifier, then I fit with train and evaluate with test;
classifier = sv.LinearSVC(random_state=123)
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsRest object, generating a new classifier that I use for train and test;
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsOne object, generating a new classifier that I use for train and test.
classifier = OneVsOneClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I understand the difference between OneVsRest and OneVsOne, but I cannot understand what I am doing in the first scenario where I do not explicitly pick up any of these two options. What does scikit-learn do in that case? Does it implicitly use OneVsRest?
Any clarification on the matter would be highly appreciated.
Best,
MR
Edit:
Just to make things clear, I am not specifically interested in the case of SVMs. For example, what about RandomForest?
Updated answer: As clarified in the comments and edits, the question is more about the general setting of sklearn, and less about the specific case of LinearSVC which is explained below.
The main difference here is that some of the classifiers you can use have "built-in multiclass classification support", i.e. it is possible for that algorithm to discern between more than two classes by default. One example for this would for example be a Random Forest, or a Multi-Layer Perceptron (MLP) with multiple output nodes.
In these cases, having a OneVs object is not required at all, since you are already solving your task. In fact, using such a strategie might even decreaes your performance, since you are "hiding" potential correlations from the algorithm, by letting it only decide between single binary instances.
On the other hand, algorithms like SVC or LinearSVC only support binary classification. So, to extend these classes of (well-performing) algorithms, we instead have to rely on the reduction to a binary classification task, from our initial multiclass classification task.
As far as I am aware of, the most complete overview can be found here:
If you scroll down a little bit, you can see which one of the algorithms is inherently multiclass, or uses either one of the strategies by default.
Note that all of the listed algorithms under OVO actually now employ a OVR strategy by default! This seems to be slightly outdated information in that regard.
Initial answer:
This is a question that can easily be answered by looking at the relevant scikit-learn documentation.
Generally, the expectation on Stackoverflow is that you have at least done some form of research on your own, so please consider looking into existing documentation first.
multi_class : string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)
Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while
"crammer_singer" optimizes a joint objective over all classes. While
crammer_singer is interesting from a theoretical perspective as it is
consistent, it is seldom used in practice as it rarely leads to better
accuracy and is more expensive to compute. If "crammer_singer" is
chosen, the options loss, penalty and dual will be ignored.
So, clearly, it uses one-vs-rest.
The same holds by the way for the "regular" SVC.
I am using tensorflow's DNNRegressor to model a multivariate regression problem. I want to form an optimal feature-set from a mixed bag of categorical and continuous features. What would be the best way to proceed? The reason, I want this approach to be independent of the model is because I couldn't find much about feature selection/evaluation in direct context of tensorflow.
Tensorflow is mostly library for machine learning algorithms. So, you need to use other libraries for preprocessing.
Scikit-library is good in many cases. You should try it, it contains the feature selection methods. I'm not sure about the categorical features, but if not you always can convert it to numerical ones.
They suggest:
For regression: f_regression, mutual_info_regression
And for any problem, you can use their first method VarianceThreshold
I am working on a classification task related to written text and I wonder how important it is to perform some kind of "feature selection" procedure in order to improve the classification results.
I am using a number of features (around 40) related to the subject, but I am not sure if all the features are really relevant or not and in which combinations. I am experementing with SVM (scikits) and LDAC (mlpy).
If a have a mix of relevant and irrelevant features, I assume I will get poor classification results. Should I perform a "feature selection procedure" before classification?
Scikits has an RFE procedure that is tree-based that is able to rank the features. Is it meaningful to rank the features with a tree-based RFE to choose the most important features and to perform the actual classification with SVM (non linear) or LDAC? Or should I implement some kind of wrapper method using the same classifier to rank the features (trying to classify with different groups of features would be very time consuming)?
Just try an see if it improves the classification score as measured with cross validation. Also before trying RFE, I would try less CPU intensive schemes such as univariate chi2 feature selection.
Having 40 features is not too bad. Some machine-learning is impeded by irrelevant features, but many things are quite robust to them (eg naive Bayes, SVM, decision trees). You probably don't need to do feature selection unless you decide to add many more features in.
It's not a bad idea to throw away useless features, but don't waste your own mental time on trying that out unless you have a particular motivation to.