AUC-based Feature Importance using Random Forest - python

I'm trying to predict a binary variable with both random forests and logistic regression. I've got heavily unbalanced classes (approx 1.5% of Y=1).
The default feature importance techniques in random forests are based on classification accuracy (error rate) - which has been shown to be a bad measure for unbalanced classes (see here and here).
The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking the Gini VIM of a predictor of interest is the sum over the forest of the decreases of Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees.
My question is: is an AUC-based importance measure of this kind implemented in scikit-learn (as it is in the R package party)? Or is there a workaround?
PS: This question is related to another one.

scoring is just a performance evaluation tool applied to the test sample; it does not enter the internal DecisionTreeClassifier algorithm at each split node. You can only specify the criterion (a kind of internal loss function at each split node) to be either gini or information entropy for the tree algorithm.
scoring can be used in a cross-validation context where the goal is to tune some hyperparameters (like max_depth). In your case, you can use GridSearchCV to tune some of your hyperparameters using the scoring function roc_auc.
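As a minimal sketch of that suggestion: the synthetic data, the small parameter grid, and the forest size below are illustrative assumptions, not values from the question. It tunes a random forest with GridSearchCV using scoring="roc_auc" on an unbalanced binary target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic unbalanced data (about 10% positives); a stand-in for the real dataset.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical grid; the criterion inside each tree is still gini,
# roc_auc is only used to compare the fitted candidates.
param_grid = {"max_depth": [2, 4], "max_features": [3, "sqrt"]}
search = GridSearchCV(RandomForestClassifier(n_estimators=30, random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated AUC of the best candidate
```

The best candidate is refit on all the data and is available as search.best_estimator_.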

After doing some research, this is what I came up with:
import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

names = db_train.iloc[:, 1:].columns.tolist()
# -- Gridsearched parameters (only n_estimators is shown)
model_rf = RandomForestClassifier(n_estimators=500)
scores = defaultdict(list)
# -- Fit the model (could be cross-validated)
rf = model_rf.fit(X_train, Y_train)
acc = roc_auc_score(Y_test, rf.predict(X_test))
for i in range(X_train.shape[1]):
    # Shuffle one column of the test set and record the relative drop in AUC
    X_t = X_test.copy()
    np.random.shuffle(X_t[:, i])
    shuff_acc = roc_auc_score(Y_test, rf.predict(X_t))
    scores[names[i]].append((acc - shuff_acc) / acc)
print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))
Features sorted by their score:
[(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]
The output is not very sexy, but you get the idea. The weakness of this approach is that the feature importance seems to depend heavily on the model parameters. I ran it with different parameters (max_depth, max_features, ...) and got very different results. So I decided to run a grid search on the parameters (scoring='roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.
I took my inspiration from this (great) notebook.
All suggestions/comments are most welcome !
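For comparison, newer scikit-learn versions ship sklearn.inspection.permutation_importance, which accepts a scoring argument. The sketch below is not the code from the post: the synthetic data and the Var1..Var6 names are assumptions made for illustration. With scoring="roc_auc" the AUC is computed from predicted scores rather than from hard class predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic unbalanced data; with shuffle=False the first two columns are informative.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1],
                           shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Each column of the held-out set is shuffled n_repeats times;
# the importance is the mean drop in AUC.
result = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=0)

names = [f"Var{i + 1}" for i in range(X.shape[1])]  # hypothetical feature names
ranked = sorted(zip(result.importances_mean, names), reverse=True)
for drop, name in ranked:
    print(f"{name}: {drop:.4f}")
```

Repeating the shuffle several times also gives result.importances_std, which helps judge whether small differences between features are meaningful.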


How does cross-validated recursive feature elimination drop features in each iteration (sklearn RFECV)?

I am using sklearn.feature_selection.RFECV to reduce the number of features in my final model. With non-cross-validated RFE, you can choose exactly how many features to select. However, with RFECV, you can only specify min_features_to_select, which acts more like a lower bound.
So how does RFECV drop features in each iteration? I understand normal RFE, but how does cross validation come into play?
Here are my instances:
clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.03, n_estimators=500,
                                 subsample=1.0, criterion='friedman_mse', min_samples_leaf=100,
                                 max_depth=7, max_features='sqrt', random_state=123)
rfe = RFECV(estimator=clf, step=1, min_features_to_select=35, cv=5, scoring='roc_auc',
            verbose=1, n_jobs=-1)
rfe.fit(X_train, y_train)
I could not find anything more specific in the documentation or user guide.
Your guess (edited out now) thinks of an algorithm that cross-validates the elimination step itself, but that is not how RFECV works. (Indeed, such an algorithm might stabilize RFE itself, but it wouldn't inform about the optimal number of features, and that is the goal of RFECV.)
Instead, RFECV runs separate RFEs on each of the training folds, down to min_features_to_select. These are very likely to result in different orders of elimination and different final features, but none of that is taken into consideration: only the scores of the resulting models, for each number of features, on the test fold are retained. (Note that RFECV has a scoring parameter that RFE lacks.) Those scores are then averaged, and the best score corresponds to the chosen n_features_. Finally, a last RFE is run on the entire dataset with that target number of features.
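The procedure described above can be sketched by hand. This is an illustration of the idea, not scikit-learn's actual implementation: the data, the logistic regression estimator, and min_features = 2 are assumptions. It relies on the fact that with step=1, RFE's ranking_ records the elimination order (rank 1 for the survivors, rank 2 for the last feature dropped, and so on).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
estimator = LogisticRegression(max_iter=1000)
min_features = 2
counts = np.arange(min_features, X.shape[1] + 1)  # candidate feature counts

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # One independent RFE per training fold, down to min_features.
    rfe = RFE(clone(estimator), n_features_to_select=min_features, step=1)
    rfe.fit(X[train_idx], y[train_idx])
    scores = []
    for k in counts:
        # The k features that were still present when k features remained.
        mask = rfe.ranking_ <= k - min_features + 1
        model = clone(estimator).fit(X[train_idx][:, mask], y[train_idx])
        proba = model.predict_proba(X[test_idx][:, mask])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    fold_scores.append(scores)

# Only the scores per feature count are kept; which features each fold chose is ignored.
mean_scores = np.mean(fold_scores, axis=0)
best_k = int(counts[np.argmax(mean_scores)])

# Final RFE on the whole dataset with the chosen number of features.
final = RFE(clone(estimator), n_features_to_select=best_k, step=1).fit(X, y)
print(best_k, final.support_)
```

The features picked by the final RFE may differ from those any single fold selected; cross-validation only decides how many to keep.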
source code

Ridge regression model using cross validation technique and Grid-search technique

I wrote Python code for ridge regression, using cross-validation and grid search together, and I got an output result. I want to check whether the steps I took to build my regression model are correct. Can someone explain?
from sklearn.linear_model import Ridge
ridge_reg = Ridge()
from sklearn.model_selection import GridSearchCV
params_Ridge = {'alpha': [1,0.1,0.01,0.001,0.0001,0] , "fit_intercept": [True, False], "solver": ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
Ridge_GS = GridSearchCV(ridge_reg, param_grid=params_Ridge, n_jobs=-1)
Ridge_GS.fit(x_train, y_train)
Ridge_GS.best_params_
output - {'alpha': 1, 'fit_intercept': True, 'solver': 'cholesky'}
Ridgeregression = Ridge(random_state=3, **Ridge_GS.best_params_)
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=Ridgeregression, X=x_train, y=y_train, cv=5)
all_accuracies
output - array([0.93335508, 0.8984485 , 0.91529146, 0.89309012, 0.90829416])
print(all_accuracies.mean())
output - 0.909695864130532
Ridgeregression.fit(x_train, y_train)
Ridgeregression.score(x_test, y_test)
output - 0.9113458623386644
Is 0.9113458623386644 my ridge regression accuracy (R squared)?
If it is, then what is the meaning of the 0.909695864130532 value?
Yes the score method from Ridge regression returns your R-squared value (docs).
In case you are not aware of how the CV method works: it splits your data into 5 equal chunks. Then, for each combination of parameters, it fits the model five times, using each chunk once as the evaluation set and the remainder of the data as the training set. The best parameter set is the one that gives the highest average score.
Your main question seems to be why the average of your CV scores is lower than the score of the model trained on the full training set and evaluated on the test set. This is not necessarily surprising, since the full training set is larger than any of the CV training subsets used for the all_accuracies values. More training data will generally get you a more accurate model.
The test set score (i.e. your second 'score', 0.91...) is most likely to represent how your model will generalize to unseen data. This is what you should quote as the 'score' of your model. The performance on the CV set is biased, since this is the data on which you based your parameter choices.
In general your method looks correct. The step where you refit ridge regression using cross_val_score seems unnecessary. Once you have found your best parameters from GridSearchCV, I would go straight to fitting on the full training dataset (as you do at the end).
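A minimal sketch of that shorter workflow, on synthetic data and with a reduced alpha grid (both are assumptions made for illustration): GridSearchCV refits the best parameter set on the full training data by default (refit=True), so the fitted search object can be scored on the test set directly.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for x_train / x_test from the question.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

alphas = [1, 0.1, 0.01, 0.001]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5)
search.fit(x_train, y_train)  # cross-validates, then refits the best model on x_train

cv_r2 = search.best_score_                 # mean CV R-squared of the best alpha
test_r2 = search.score(x_test, y_test)     # R-squared of the refit model on the test set
print(search.best_params_, cv_r2, test_r2)
```

Here cv_r2 plays the role of the 0.9097 value in the question and test_r2 the role of the 0.9113 value.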

Why is Random Forest with a single tree much better than a Decision Tree classifier?

I apply the decision tree classifier and the random forest classifier to my data with the following code:
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

def decision_tree(train_X, train_Y, test_X, test_Y):
    clf = tree.DecisionTreeClassifier().fit(train_X, train_Y)
    return clf.score(test_X, test_Y)

def random_forest(train_X, train_Y, test_X, test_Y):
    clf = RandomForestClassifier(n_estimators=1)
    # as posted, this fits on X, Y rather than on the train_X, train_Y arguments
    clf = clf.fit(X, Y)
    return clf.score(test_X, test_Y)
Why are the results so much better for the random forest classifier (over 100 runs, randomly sampling 2/3 of the data for training and 1/3 for testing)?
100%|███████████████████████████████████████| 100/100 [00:01<00:00, 73.59it/s]
Algorithm: Decision Tree
Min : 0.3883495145631068
Max : 0.6476190476190476
Mean : 0.4861783113770316
Median : 0.48868030937802126
Stdev : 0.047158171852401135
Variance: 0.0022238931724605985
100%|███████████████████████████████████████| 100/100 [00:01<00:00, 85.38it/s]
Algorithm: Random Forest
Min : 0.6846846846846847
Max : 0.8653846153846154
Mean : 0.7894823428836184
Median : 0.7906101571063208
Stdev : 0.03231671150915106
Variance: 0.0010443698427656967
Isn't a random forest with one estimator just a decision tree?
Have I done something wrong or misunderstood the concept?
"Isn't a random forest with one estimator just a decision tree?"
Well, this is a good question, and the answer turns out to be no; the Random Forest algorithm is more than a simple bag of individually-grown decision trees.
Apart from the randomness induced by ensembling many trees, the Random Forest (RF) algorithm also incorporates randomness when building individual trees in two distinct ways, neither of which is present in the simple Decision Tree (DT) algorithm.
The first is the number of features to consider when looking for the best split at each tree node: while DT considers all the features, RF considers a random subset of them, of size equal to the parameter max_features (see the docs).
The second is that, while DT considers the whole training set, a single RF tree considers only a bootstrapped sub-sample of it; from the docs again:
The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
The RF algorithm is essentially the combination of two independent ideas: bagging, and random selection of features (see the Wikipedia entry for a nice overview). Bagging is essentially my second point above, but applied to an ensemble; random selection of features is my first point above, and it seems that it had been independently proposed by Tin Kam Ho before Breiman's RF (again, see the Wikipedia entry). Ho had already suggested that random feature selection alone improves performance. This is not exactly what you have done here (you still use the bootstrap sampling idea from bagging, too), but you could easily replicate Ho's idea by setting bootstrap=False in your RandomForestClassifier() arguments. The fact is that, given this research, the difference in performance is not unexpected...
To replicate exactly the behaviour of a single tree in RandomForestClassifier(), you should use both bootstrap=False and max_features=None arguments, i.e.
clf = RandomForestClassifier(n_estimators=1, max_features=None, bootstrap=False)
in which case neither bootstrap sampling nor random feature selection will take place, and the performance should be roughly equal to that of a single decision tree.
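A small sketch comparing the three configurations side by side; the synthetic data and the model labels are assumptions made for illustration, and the actual scores will depend on the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Default single-tree forest: bootstrap sample plus random feature subsets.
    "RF, 1 tree, defaults": RandomForestClassifier(n_estimators=1, random_state=0),
    # Both sources of randomness switched off.
    "RF, 1 tree, no randomness": RandomForestClassifier(
        n_estimators=1, max_features=None, bootstrap=False, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(train_X, train_Y)
    results[name] = (model.score(train_X, train_Y), model.score(test_X, test_Y))
    print(name, results[name])  # (training accuracy, test accuracy)
```

One visible effect of bootstrap=False is on the training accuracy: a fully grown tree that sees every training row fits the training set perfectly, whereas a tree grown on a bootstrap sample has not seen some of the rows and will typically score lower on them.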

RFECV does not return same features for same data

I have a dataframe X which is comprised of 60 features and ~ 450k outcomes. My response variable y is categorical (survival, no survival).
I would like to use RFECV to reduce the number of significant features for my estimator (right now, logistic regression) on Xtrain, scoring by the area under the ROC curve. "Features Selected" is a list of all features.
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm
# Create train and test datasets to evaluate each model
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.70)
# Use RFECV to reduce features
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(n_splits=10), scoring='roc_auc')
# Fit the features to the response variable
X_new = rfecv.fit_transform(Xtrain[features_selected], ytrain)
I have a few questions:
a) X_new returns different features when run on separate occasions (one time it returned 5 features, another run it returned 9. One is not a subset of the other). Why would this be?
b) Does this imply an unstable solution? While using the same seed for StratifiedKFold should solve this problem, does this mean I need to reconsider the approach in totality?
c) In general, how do I approach tuning? E.g., features are selected BEFORE tuning in my current implementation. Would tuning affect the significance of certain features? Or should I tune simultaneously?
In k-fold cross-validation, the original sample is randomly partitioned into k equal size sub-samples. Therefore, it's not surprising to get different results every time you execute the algorithm. Source
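To make the random partitioning repeatable, the random steps can be seeded. The sketch below uses synthetic data and a hypothetical helper select_features; note that in the question's code train_test_split is also called without a seed, so it is seeded here as well.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)

def select_features(seed):
    """Run the split + RFECV pipeline with every random step seeded."""
    Xtrain, Xtest, ytrain, ytest = train_test_split(
        X, y, train_size=0.70, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    rfecv = RFECV(LogisticRegression(max_iter=1000), cv=cv, scoring="roc_auc")
    rfecv.fit(Xtrain, ytrain)
    return rfecv.support_  # boolean mask of the selected features

first = select_features(seed=42)
second = select_features(seed=42)
print((first == second).all())
```

Seeding makes a run repeatable, but it does not make the selection stable: if a different seed gives a very different feature set, the selection is still sensitive to the split.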
Another approach is based on Pearson's correlation coefficient. With this method, you calculate the correlation coefficient between each pair of features and remove features that are highly correlated with another one. This could be considered a stable solution to such a problem. Source
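A minimal sketch of such a correlation filter with pandas; the toy columns and the 0.95 threshold are assumptions chosen for illustration. It looks only at the upper triangle of the absolute correlation matrix so that each pair is considered once and one feature of each highly correlated pair is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                       # independent feature
})

# Absolute Pearson correlation between every pair of features.
corr = df.corr(method="pearson").abs()

# Keep only the entries above the diagonal so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Because it involves no random splitting, this filter returns the same result on every run for the same data, although it ignores the response variable entirely.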

How to ensemble SVM and Logistic Regression with python

I am working on a text classification task (7000 texts evenly distributed over 10 labels). I have explored SVM and Logistic Regression:
clf1 = svm.LinearSVC()
clf1.fit(X, y)
score1 = clf1.score(X_test, y_true)
clf2 = linear_model.LogisticRegression()
clf2.fit(X, y)
score2 = clf2.score(X_test, y_true)
I got two accuracies, score1 and score2, and I wonder whether I could improve my accuracy by building an ensemble system that combines the outputs of the two classifiers above.
I have studied ensembles on my own, and I know there are bagging, boosting, and stacking.
However, I do not know how to use the scores predicted by my SVM and Logistic Regression in an ensemble. Could anyone give me some ideas or show me some example code?
You can just multiply the probabilities, or use another combination rule.
In order to do that in a more generic way (try several rules)
you can use brew.
from brew.base import Ensemble
from brew.base import EnsembleClassifier
from brew.combination.combiner import Combiner
# create your Ensemble
clfs = [clf1, clf2]
ens = Ensemble(classifiers=clfs)
# Since you have only 2 classifiers 'majority_vote' is not an option,
# rule = ['mean', 'majority_vote', 'max', 'min', 'median']
comb = Combiner(rule='mean')
# now create your ensemble classifier
ensemble_clf = EnsembleClassifier(ensemble=ens, combiner=comb)
Also, keep in mind that the classifiers should be diverse enough to give a good combination result.
If you had fewer features, I'd say you should check out some Dynamic Classifier/Ensemble Selection (also provided in brew), but since you probably have many features, Euclidean distance probably does not make sense for finding the region of competence of each classifier. The best thing is to check by hand which kinds of labels each classifier tends to get right, based on the confusion matrix.
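If you would rather stay within scikit-learn, the 'mean' rule can be sketched with VotingClassifier(voting="soft"), which averages predicted class probabilities. This is a swapped-in alternative to brew, not the code from the answer, and the synthetic data is an assumption. Since LinearSVC has no predict_proba, it is wrapped in CalibratedClassifierCV here to obtain probabilities.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the vectorized texts.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Calibration gives the SVM a predict_proba method.
svm_proba = CalibratedClassifierCV(LinearSVC(), cv=3)
logreg = LogisticRegression(max_iter=1000)

# Soft voting averages the two probability vectors and predicts the arg max.
ensemble = VotingClassifier(
    estimators=[("svm", svm_proba), ("logreg", logreg)], voting="soft")
ensemble.fit(X_train, y_train)

proba = ensemble.predict_proba(X_test)
score = ensemble.score(X_test, y_test)
print(score)
```

Whether the ensemble beats the individual classifiers depends on how different their errors are, which is the diversity point made above.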
