Determine what features to drop / select using GridSearch in scikit-learn - python

How does one determine what features/columns/attributes to drop using GridSearch results?
In other words, if GridSearch reports that max_features should be 3, can we determine which EXACT 3 features to use?
Let's take the classic Iris data set with 4 features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
iris = datasets.load_iris()
all_inputs = iris.data
all_labels = iris.target
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
cross_validation = StratifiedKFold(n_splits=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_labels)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
Let's say we get that max_features is 3. How do I find out which 3 features were the most appropriate here?
Putting in max_features = 3 will work for fitting, but I want to know which attributes were the right ones.
Do I have to generate the possible list of all feature combinations myself to feed GridSearch, or is there an easier way?

max_features is one hyperparameter of your decision tree.
It does not drop any of your features before training, nor does it find good or bad features.
Your decision tree looks at all features to find the best feature to split up your data based on your labels. If you set max_features to 3, as in your example, then at each split your decision tree looks at three randomly chosen features and takes the best of those to make the split. This makes your training faster and adds some randomness to your classifier (which might also help against overfitting).
Your classifier determines the best feature to split on using a criterion (like Gini impurity or information gain, i.e. the reduction in entropy). So you can either take such a measurement for feature importance, or
use an estimator that has the attribute feature_importances_
as @gorjan mentioned.
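To answer the last part of the question directly: you don't have to enumerate feature combinations yourself. One common pattern (a sketch, not the only way) is to put a univariate selector such as SelectKBest into a Pipeline and let GridSearchCV search over its k; the fitted selector then tells you exactly which columns were kept via get_support():

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
pipe = Pipeline([('select', SelectKBest(score_func=f_classif)),
                 ('tree', DecisionTreeClassifier())])
parameter_grid = {'select__k': [1, 2, 3, 4],
                  'tree__max_depth': [1, 2, 3, 4, 5]}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, cv=10)
grid_search.fit(iris.data, iris.target)

# Boolean mask over the original columns kept by the best pipeline
mask = grid_search.best_estimator_.named_steps['select'].get_support()
print([name for name, keep in zip(iris.feature_names, mask) if keep])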

If you use an estimator that has the attribute feature_importances_ you can simply do:
feature_importances = grid_search.best_estimator_.feature_importances_
This will return an array of length n_features telling you how important each feature was to the best estimator found by the grid search. Additionally, if you want to use, say, a linear classifier (logistic regression) that doesn't have the attribute feature_importances_, what you could do is:
# Get the best estimator's coefficients
estimator_coeff = grid_search.best_estimator_.coef_
# Multiply the model coefficients by the standard deviation of the data
coeff_magnitude = np.std(all_inputs, axis=0) * estimator_coeff
which is also an indication of feature importance. If a model's coefficient is >> 0 or << 0, that means, in layman's terms, that the model is trying hard to capture the signal present in that feature.
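For the iris example, a minimal sketch of how you might rank the features by name (iris.feature_names comes with the dataset; the plain fitted tree here just stands in for grid_search.best_estimator_):

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
tree = DecisionTreeClassifier(max_features=3, random_state=0)
tree.fit(iris.data, iris.target)

# Pair each importance with its feature name and sort, most important first
for importance, name in sorted(zip(tree.feature_importances_, iris.feature_names), reverse=True):
    print('{}: {:.3f}'.format(name, importance))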

Related

Can I show feature importance for MultiOutputClassifier?

I'm trying to recover the feature importances of a multi-output classifier that uses a RandomForest.
Building the MultiOutput model works without problems:
import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
## Generate data
x, y = make_multilabel_classification(n_samples=1000,
                                      n_features=15,
                                      n_labels=5,
                                      n_classes=3,
                                      random_state=12,
                                      allow_unlabeled=True)
x_train = x[:700,:]
x_test = x[701:,:]
y_train = y[:700,:]
y_test = y[701:,:]
## Generate model
forest = RandomForestClassifier(n_estimators = 100, random_state = 1)
multi_forest = MultiOutputClassifier(forest, n_jobs = -1).fit(x_train, y_train)
## Make prediction
dfOutput_multi_forest = multi_forest.predict_proba(x_test)
The prediction dfOutput_multi_forest does not show any problems, but I want to recover the feature importance of the multi_forest for interpretation of the output.
Using multi_forest.feature_importance_ throws the following error message:
AttributeError: 'MultiOutputClassifier' object has no attribute 'feature_importance_'
Does anyone know how to retrieve the feature importance?
I'm using scikit v0.20.2
Indeed, it doesn't appear that Sklearn's MultiOutputClassifier has an attribute that contains some sort of amalgamation of the feature importances of all the estimators (in your case, all the RandomForest classifiers) used in the model.
However, it is possible to access the feature importances of each RandomForest classifier, and then average them all together to give you each feature's average importance, across all RandomForest classifiers.
MultiOutputClassifier objects have an attribute called estimators_. If you run multi_forest.estimators_, you will get a list containing an object for each of your RandomForest classifiers.
For each of these RandomForest classifier objects, you can access its feature importances through the feature_importances_ attribute.
To put it all together, this was my approach:
feat_impts = []
for clf in multi_forest.estimators_:
    feat_impts.append(clf.feature_importances_)
np.mean(feat_impts, axis=0)
I ran the example code you pasted into your question, and then ran the above block of code to output a list of the following 15 averages:
array([0.09830467, 0.0912088 , 0.05738045, 0.1211305 , 0.03901933,
0.05429491, 0.06929378, 0.06404416, 0.05676634, 0.04919717,
0.05244265, 0.0509295 , 0.05615341, 0.09202444, 0.04780991])
Which contains the average importance of each of your 15 features, across each of the 3 random forest classifiers used in your MultiOutputClassifier.
This should at least help you to see which features, on the whole, tended to be more important in making predictions for each of your 3 classes.
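Since pandas is already imported in your example, an optional extension of the snippet above labels and sorts those averages (the f0..f14 names are made up here, because make_multilabel_classification doesn't provide feature names):

import numpy as np
import pandas as pd

# Average the per-output importances and attach readable labels
avg_impts = np.mean(feat_impts, axis=0)
labels = ['f{}'.format(i) for i in range(len(avg_impts))]
print(pd.Series(avg_impts, index=labels).sort_values(ascending=False))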

SKLearn how to get decision probabilities for LinearSVC classifier

I am using scikit-learn's LinearSVC classifier for text mining. I have the y value as a label 0/1 and the X value as the TfidfVectorizer of the text document.
I use a pipeline like below
pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LinearSVC())
])
For a prediction, I would like to get the confidence score or probability of a data point being classified as 1, in the range (0, 1).
I currently use the decision function feature
pipeline.decision_function(test_X)
However, it returns positive and negative values that seem to indicate confidence, and I am not too sure what they mean. Is there a way to get the values in the range 0-1?
For example here is the output of the decision function for some of the data points
-0.40671879072078421,
-0.40671879072078421,
-0.64549376401063352,
-0.40610652684648957,
-0.40610652684648957,
-0.64549376401063352,
-0.64549376401063352,
-0.5468745098794594,
-0.33976011539714374,
0.36781572474117097,
-0.094943829974515004,
0.37728641897721765,
0.2856211778200019,
0.11775493140003235,
0.19387473663623439,
-0.062620918785563556,
-0.17080866610522819,
0.61791016307670399,
0.33631340372946961,
0.87081276844501176,
1.026991628346146,
0.092097790098391641,
-0.3266704728249083,
0.050368652422013376,
-0.046834129250376291,
You can't.
However, you can use sklearn.svm.SVC with kernel='linear' and probability=True.
It may run longer, but you can get probabilities from this classifier by using the predict_proba method.
import sklearn.svm

clf = sklearn.svm.SVC(kernel='linear', probability=True)
clf.fit(X, y)
clf.predict_proba(X_test)
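Dropped into the pipeline from the question, that would look roughly like this (a sketch; the step names are kept from the question, and train_X/train_y are placeholder names for your training data):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', SVC(kernel='linear', probability=True))
])
# pipeline.fit(train_X, train_y)
# pipeline.predict_proba(test_X)[:, 1]  # column 1 is P(label == 1)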
If you insist on using the LinearSVC class, you can wrap it in a sklearn.calibration.CalibratedClassifierCV object and fit the calibrated classifier which will give you a probabilistic classifier.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets
#Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # Using only two features
y = iris.target #3 classes: 0, 1, 2
linear_svc = LinearSVC() #The base estimator
# This is the calibrated classifier which can give probabilistic classifier
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # sigmoid uses Platt's scaling; see the docs for other methods
                                        cv=3)
calibrated_svc.fit(X, y)
# predict
prediction_data = [[2.3, 5],
                   [4, 7]]
predicted_probs = calibrated_svc.predict_proba(prediction_data) #important to use predict_proba
print(predicted_probs)
Here is the output:
[[ 9.98626760e-01 1.27594869e-03 9.72912751e-05]
[ 9.99578199e-01 1.79053170e-05 4.03895759e-04]]
which shows probabilities for each class for each data point.
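To connect this back to the text pipeline in the question, you can wrap the LinearSVC step directly (a sketch; step names kept from the question, and train_X/train_y/test_X are placeholders):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=3))
])
# pipeline.fit(train_X, train_y)
# pipeline.predict_proba(test_X)  # each row sums to 1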

RFECV does not return same features for same data

I have a dataframe X which is comprised of 60 features and ~ 450k outcomes. My response variable y is categorical (survival, no survival).
I would like to use RFECV to reduce the number of significant features for my estimator (right now, logistic regression) on Xtrain, which I would like to score by the area under an ROC curve. features_selected is a list of all features.
from sklearn.cross_validation import StratifiedKFold, train_test_split
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm
# Create train and test datasets to evaluate each model
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,train_size = 0.70)
# Use RFECV to reduce features
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(ytrain, 10), scoring='roc_auc')
# Fit the features to the response variable
X_new = rfecv.fit_transform(Xtrain[features_selected], ytrain)
I have a few questions:
a) X_new returns different features when run on separate occasions (one time it returned 5 features, another run it returned 9. One is not a subset of the other). Why would this be?
b) Does this imply an unstable solution? While using the same seed for StratifiedKFold should solve this problem, does this mean I need to reconsider the approach in totality?
c) In general, how do I approach tuning? E.g., features are selected BEFORE tuning in my current implementation. Would tuning affect the significance of certain features? Or should I tune simultaneously?
In k-fold cross-validation, the original sample is randomly partitioned into k equal size sub-samples. Therefore, it's not surprising to get different results every time you execute the algorithm. Source
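To make the runs repeatable, seed every source of randomness explicitly. A minimal sketch, using the modern sklearn.model_selection API (the random_state values are arbitrary; X and y are your data from above):

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Fix the train/test split, the CV fold assignment, and the estimator
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.70, random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
rfecv = RFECV(estimator=LogisticRegression(random_state=42), cv=cv, scoring='roc_auc')
X_new = rfecv.fit_transform(Xtrain, ytrain)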
Another approach is the so-called Pearson's correlation coefficient. With this method, you calculate the correlation coefficient between each pair of features and remove one feature from every highly correlated pair. Because it does not depend on random fold assignment, this method can be considered a stable solution to such a problem. Source
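A minimal sketch of that idea with pandas (assuming X is your 60-feature dataframe; the 0.9 threshold is an arbitrary choice):

import numpy as np
import pandas as pd

# Absolute pairwise Pearson correlations between the features
corr = X.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop one feature out of every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)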

How to specify the prior probability for scikit-learn's Naive Bayes

I'm using the scikit-learn machine learning library (Python) for a machine learning project. One of the algorithms I'm using is the Gaussian Naive Bayes implementation. One of the attributes of the GaussianNB() function is the following:
class_prior_ : array, shape (n_classes,)
I want to alter the class prior manually since the data I use is very skewed and the recall of one of the classes is very important. By assigning a high prior probability to that class the recall should increase.
However, I can't figure out how to set the attribute correctly. I've read the below topics already but their answers don't work for me.
How can the prior probabilities manually set for the Naive Bayes clf in scikit-learn?
How do I know what prior's I'm giving to sci-kit learn? (Naive-bayes classifiers.)
This is my code:
gnb = GaussianNB()
gnb.class_prior_ = [0.1, 0.9]
gnb.fit(data.XTrain, yTrain)
yPredicted = gnb.predict(data.XTest)
I figured this was the correct syntax, and I could find out which class belongs to which place in the array by playing with the values, but the results remained unchanged. Also, no errors were raised.
What is the correct way of setting the attributes of the GaussianNB algorithm from scikit-learn library?
Link to the scikit documentation of GaussianNB
@Jianxun Li: there is in fact a way to set prior probabilities in GaussianNB. It's called 'priors' and it's available as a parameter. See the documentation:
"Parameters: priors : array-like, shape (n_classes,)
Prior probabilities of the classes. If specified the priors are not adjusted according to the data."
So let me give you an example:
from sklearn.naive_bayes import GaussianNB
# minimal dataset
X = [[1, 0], [1, 0], [0, 1]]
y = [0, 0, 1]
# use empirical prior, learned from y
mn = GaussianNB()
print(mn.fit(X, y).predict([[1, 1]]))
print(mn.class_prior_)
>>>[0]
>>>[ 0.66666667 0.33333333]
But if you change the prior probabilities, it will give a different answer, which is what you are looking for, I believe.
# use custom prior to make 1 more likely
mn = GaussianNB(priors=[0.1, 0.9])
mn.fit(X, y).predict([[1, 1]])
>>>array([1])
The GaussianNB() implemented in scikit-learn does not allow you to set the class prior. If you read the online documentation, you see that .class_prior_ is an attribute rather than a parameter. Once you fit GaussianNB(), you can access the class_prior_ attribute. It is calculated by simply counting the frequency of each label in your training sample.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
# simulate data with unbalanced weights
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9])
# your GNB estimator
gnb = GaussianNB()
gnb.fit(X, y)
gnb.class_prior_
Out[168]: array([ 0.105, 0.895])
gnb.get_params()
Out[169]: {}
You see the estimator is smart enough to take into account the unbalanced weight issue. So you don't have to manually specify the priors.
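Note that, as the first answer above points out, newer scikit-learn versions do accept a priors parameter at construction time, so the same imbalanced example can be forced to a chosen prior (a sketch; the 50/50 prior here is arbitrary):

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, weights=[0.1, 0.9])

# Override the empirical prior with an explicit one
gnb = GaussianNB(priors=[0.5, 0.5])
gnb.fit(X, y)
print(gnb.class_prior_)  # now reflects the supplied priors, not the label counts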
