SKLearn Decision Tree Classifier Depth/Order - python

While reviewing the decision tree documentation here, I noticed the classifier does not have a means to adjust the "order" of the fit. Specifically, regarding the call:
tree.DecisionTreeClassifier()
I would like to play around with high / low "orders" to see how the decision surface visualization changes.
The call to the Regressor does seem to have this feature:
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
Does the DecisionTreeClassifier() call not have comparable arguments? I would presume in some instances it would be vital.

DecisionTreeClassifier has a max_depth argument, too. See the docs.
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(max_depth=2)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
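If you want to see how the decision surface changes with depth, here is a minimal sketch (not from the original post; it restricts iris to its first two features so the surface can be drawn in 2-D):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X2, y = iris.data[:, :2], iris.target            # first two features only
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, depth in zip(axes, (2, 5)):
    clf = DecisionTreeClassifier(max_depth=depth).fit(X2, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)             # decision surface
    ax.scatter(X2[:, 0], X2[:, 1], c=y, s=15)     # training points
    ax.set_title("max_depth=%d" % depth)
plt.show()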

Related

plot_tree and graphviz don't display the same tree in random forest estimator

I was trying to compare the plot of a decision tree using plot_tree and graphviz. The results were similar when I used either DecisionTreeClassifier or DecisionTreeRegressor. When I used RandomForestRegressor, I found that in order to plot a tree I need to look at an individual tree via the .estimators_[0] attribute. Below is the code I am trying to use:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
df = pd.read_csv('auto.csv')
X, y = pd.get_dummies(df.drop(columns='mpg')), df['mpg']
SEED = 1
# Split dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
# Instantiate a random forest regressor 'rf' with 400 estimators
rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=SEED)
# Fit 'rf' to the training set
rf.fit(X_train, y_train)
Here is the difference.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
plot_tree(rf.estimators_[0], feature_names=X.columns, filled=True, rounded=True)
Using graphviz on rf.estimators_[0] yields different result.
import graphviz
from sklearn.tree import export_graphviz
tree_graph = export_graphviz(rf.estimators_[0], out_file=None,
                             feature_names=X.columns,
                             filled=True, rounded=True)
display(graphviz.Source(tree_graph))
Could anyone please explain why they are different, since I used the same rf.estimators_[0]? Which one should I use to explain the result correctly?
Thanks very much!

Right way to use RFECV and Permutation Importance - Sklearn

There is a proposal to implement this in Sklearn #15075, but in the meantime, eli5 is suggested as a solution. However, I'm not sure if I'm using it the right way. This is my code:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import eli5
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
perm = eli5.sklearn.PermutationImportance(estimator, scoring='r2', n_iter=10, random_state=42, cv=3)
selector = RFECV(perm, step=1, min_features_to_select=1, scoring='r2', cv=3)
selector = selector.fit(X, y)
selector.ranking_
#eli5.show_weights(perm) # fails: AttributeError: 'PermutationImportance' object has no attribute 'feature_importances_'
There are a few issues:
I am not sure if I am using cross-validation the right way. Should PermutationImportance use cv to validate importances on a validation set, or should cross-validation be done only in RFECV? (In the example I used cv=3 in both cases, but I'm not sure that's the right thing to do.)
If I uncomment the last line, I get an AttributeError: 'PermutationImportance' ... Is this because I fit using RFECV? What I'm doing is similar to the last snippet here: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
As a less important issue, this gives me a warning when I set cv in eli5.sklearn.PermutationImportance:
.../lib/python3.8/site-packages/sklearn/utils/validation.py:68: FutureWarning: Pass classifier=False as keyword args. From version 0.25 passing these as positional arguments will result in an error warnings.warn("Pass {} as keyword args. From version 0.25 "
The whole process is a bit vague. Is there a way to do it directly in Sklearn? e.g. by adding a feature_importances attribute?
Since the objective is to select the optimal number of features with permutation importance and recursive feature elimination, I suggest using RFECV and PermutationImportance in conjunction with a CV splitter like KFold. The code could then look like this:
import warnings
from eli5 import show_weights
from eli5.sklearn import PermutationImportance
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR
warnings.filterwarnings("ignore", category=FutureWarning)
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
splitter = KFold(n_splits=3) # 3 folds as in the example
estimator = SVR(kernel="linear")
selector = RFECV(
    PermutationImportance(estimator, scoring='r2', n_iter=10, random_state=42, cv=splitter),
    cv=splitter,
    scoring='r2',
    step=1
)
selector = selector.fit(X, y)
selector.ranking_
show_weights(selector.estimator_)
Regarding your issues:
PermutationImportance calculates the feature importances and RFECV computes the r2 score with the same strategy, according to the splits provided by KFold.
You called show_weights on the unfitted PermutationImportance object, which is why you got the error. Access the fitted object through the estimator_ attribute instead, as shown above.
The FutureWarning can be ignored.
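As a follow-up sketch (using the selector fitted above), the fitted RFECV object also records which features survived the elimination; these are standard scikit-learn attributes:
print(selector.n_features_)           # optimal number of features found
print(selector.support_)              # boolean mask of the selected features
X_reduced = X[:, selector.support_]   # keep only the selected columns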

cross_val_score return accuracy per class

I would like the cross_val_score from sklearn function to return the accuracy per each of the classes instead of the average accuracy of all the classes.
Function:
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None,
    scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None,
    pre_dispatch='2*n_jobs', error_score='raise-deprecating')
Reference
How can I do it?
This is not possible with cross_val_score. The approach you suggest would require cross_val_score to return an array of arrays. However, if you look at the source code, you will see that the output of cross_val_score has to be:
Returns
-------
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.
As a result, cross_val_score checks if the scoring method you are using is multimetric or not. If it is, it will throw you an error like:
ValueError: scoring must return a number, got ... instead
Edit:
As correctly pointed out in a comment above, an alternative is to use cross_validate instead. Here is how it would work on the iris dataset, for instance:
from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data
y = iris.target
scoring = {'recall0': make_scorer(recall_score, average=None, labels=[0]),
           'recall1': make_scorer(recall_score, average=None, labels=[1]),
           'recall2': make_scorer(recall_score, average=None, labels=[2])}
cross_validate(DecisionTreeClassifier(), X, y, scoring=scoring, cv=5, return_train_score=False)
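The call returns a dictionary keyed by scorer name, with one score per fold. A small sketch of how you might inspect it (the variable name is illustrative):
scores = cross_validate(DecisionTreeClassifier(), X, y, scoring=scoring, cv=5, return_train_score=False)
for name in ('test_recall0', 'test_recall1', 'test_recall2'):
    print(name, scores[name].mean())   # mean per-class recall across the 5 folds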
Note that this is also supported by the GridSearchCV methodology.
NB: You cannot return "accuracy by each class"; I guess you meant recall, which is basically the proportion of correct predictions amongst the data points that actually belong to a class.

Recursive feature elimination combined with nested (leave one group out) cross-validation in scikit

I want to do a binary classification for 30 groups of subjects with 230 samples by 150 features. I have found it very hard to implement, especially the feature selection and parameter tuning through nested leave-one-group-out cross-validation, while reporting the accuracy of two classifiers (SVM and random forest) and seeing which features were selected.
I'm new to this and I'm sure the following code is not correct:
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
X = ...       # the data (230 samples x 150 features)
y = ...       # binary labels, e.g. [1, 0, 1, 0, 0, 0, 1, 1, 1, ...]
groups = ...  # group label per sample, 30 groups in total
param_grid = [{'estimator__C': [0.01, 0.1, 1.0, 10.0]}]
inner_cross_validation = LeaveOneGroupOut().split(X, y, groups)
outer_cross_validation = LeaveOneGroupOut().split(X, y, groups)
estimator = SVC(kernel="linear")
selector = RFE(estimator, step=1)
grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search.fit(X, y)
scores = cross_val_score(grid_search, X, y,cv=outer_cross_validation)
I don't know where to set "the random forest classifier" in the above because I want to compare the accuracies between SVM and random forest.
Thank you very much for reading and hope that someone can help me.
Best regards
You should call the tree in the same way that you call the SVM:
#your libraries
from sklearn.tree import DecisionTreeClassifier
#....
estimator = SVC(kernel="linear")
estimator2 = DecisionTreeClassifier()  # ...tree parameters here...
selector = RFE(estimator, step=1)
selector2 = RFE(estimator2, step=1)
grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search2 = GridSearchCV(selector2, tree_param_grid, cv=inner_cross_validation)  # tree_param_grid: the grid for the tree
Please note that this procedure will lead to two different sets of selected features: one for the SVM and one for the decision tree.
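For completeness, here is a minimal runnable sketch of that idea, with synthetic placeholder data and illustrative parameter grids (it uses a plain cv=5 instead of the leave-one-group-out splitter to keep it short): each estimator gets its own RFE and its own grid search, so each ends up with its own selected features.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
rng = np.random.RandomState(0)
X = rng.rand(100, 20)                # placeholder for the real 230 x 150 data
y = rng.randint(0, 2, 100)           # placeholder binary labels
# SVM branch
svm_search = GridSearchCV(RFE(SVC(kernel="linear"), step=1),
                          {'estimator__C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
svm_search.fit(X, y)
# Decision tree branch, with its own (illustrative) parameter grid
tree_search = GridSearchCV(RFE(DecisionTreeClassifier(random_state=0), step=1),
                           {'estimator__max_depth': [3, 5, 10]}, cv=5)
tree_search.fit(X, y)
# Each fitted search exposes the features its own RFE kept
print(svm_search.best_estimator_.support_)
print(tree_search.best_estimator_.support_)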

Scikit-learn using GridSearchCV on DecisionTreeClassifier

I tried to use GridSearchCV on DecisionTreeClassifier, but get the following error:
TypeError: unbound method get_params() must be called with DecisionTreeClassifier instance as first argument (got nothing instead)
here's my code:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV
from sklearn.cross_validation import cross_val_score
X, Y = createDataSet(filename)
tree_para = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]}
clf = GridSearchCV(DecisionTreeClassifier, tree_para, cv=5)
clf.fit(X, Y)
In your call to GridSearchCV, the first argument should be an instantiated DecisionTreeClassifier object rather than the class itself. It should be:
clf = GridSearchCV(DecisionTreeClassifier(), tree_para, cv=5)
Check out the example here for more details.
Hope that helps!
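Once fitted, the search object exposes the winning settings; a small sketch using standard GridSearchCV attributes:
# After clf.fit(X, Y):
print(clf.best_params_)           # e.g. the chosen criterion and max_depth
print(clf.best_score_)            # mean cross-validated score of that setting
best_tree = clf.best_estimator_   # the refit DecisionTreeClassifier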
Another aspect regarding the parameters is that the grid search can be run with different combinations of parameters. The grid below would check every combination of criterion with max_depth:
tree_param = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]}
If needed, the grid search can also be run over multiple sets of parameter candidates.
For example:
tree_param = [{'criterion': ['entropy', 'gini'], 'max_depth': max_depth_range},
              {'min_samples_leaf': min_samples_leaf_range}]
In this case, the grid search runs over two sets of parameters: first over every combination of criterion and max_depth, and second over all provided values of min_samples_leaf alone.
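For instance, with illustrative ranges (max_depth_range and min_samples_leaf_range are not defined above, so the values here are placeholders), the two-grid search might look like this:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
max_depth_range = np.arange(3, 15)         # placeholder range
min_samples_leaf_range = [1, 5, 10, 20]    # placeholder range
tree_param = [{'criterion': ['entropy', 'gini'], 'max_depth': max_depth_range},
              {'min_samples_leaf': min_samples_leaf_range}]
clf = GridSearchCV(DecisionTreeClassifier(), tree_param, cv=5)
# clf.fit(X, Y) would then evaluate both parameter sets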
Here is the code for a decision tree grid search:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
def dtree_grid_search(X, y, nfolds):
    # create a dictionary of all values we want to test
    param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15)}
    # decision tree model
    dtree_model = DecisionTreeClassifier()
    # use grid search to test all values
    dtree_gscv = GridSearchCV(dtree_model, param_grid, cv=nfolds)
    # fit model to data
    dtree_gscv.fit(X, y)
    return dtree_gscv.best_params_
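A usage sketch (running it on the iris data, purely for illustration):
from sklearn.datasets import load_iris
iris = load_iris()
best_params = dtree_grid_search(iris.data, iris.target, nfolds=5)
print(best_params)   # e.g. {'criterion': ..., 'max_depth': ...}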
You need to add a () after the classifier:
clf = GridSearchCV(DecisionTreeClassifier(), tree_para, cv=5)
If the problem is still there, try replacing:
from sklearn.grid_search import GridSearchCV
with
from sklearn.model_selection import GridSearchCV
It sounds stupid but I had similar problems and I managed to solve them using this tip.
