Print decision tree and feature_importance when using BaggingClassifier - python

Obtaining the decision tree and the important features is easy when using DecisionTreeClassifier in scikit-learn. However, I am not able to obtain either of them if I add a bagging function, e.g., BaggingClassifier.
Since we need to fit the model using the BaggingClassifier, I cannot return the results (print the trees (graphs), feature_importances_, ...) related to the DecisionTreeClassifier.
Here is my script:
seed = 7
n_iterations = 199
DTC = DecisionTreeClassifier(random_state=seed,
max_depth=None,
min_impurity_split= 0.2,
min_samples_leaf=6,
max_features=None, #If None, then max_features=n_features.
max_leaf_nodes=20,
criterion='gini',
splitter='best',
)
#parametersDTC = {'max_depth':range(3,10), 'max_leaf_nodes':range(10, 30)}
parameters = {'max_features':range(1,200)}
dt = RandomizedSearchCV(BaggingClassifier(base_estimator=DTC,
#max_samples=1,
n_estimators=100,
#max_features=1,
bootstrap = False,
bootstrap_features = True, random_state=seed),
parameters, n_iter=n_iterations, n_jobs=14, cv=kfold,
error_score='raise', random_state=seed, refit=True) #min_samples_leaf=10
# Fit the model
fit_dt= dt.fit(X_train, Y_train)
print(dir(fit_dt))
tree_model = dt.best_estimator_
# Print the important features (NOT WORKING)
features = tree_model.feature_importances_
print(features)
rank = np.argsort(features)[::-1]
print(rank[:12])
print(sorted(list(zip(features))))
# Importing the image (NOT WORKING)
from sklearn.externals.six import StringIO
tree.export_graphviz(dt.best_estimator_, out_file='tree.dot') # necessary to plot the graph
dot_data = StringIO() # need to understand but it probably relates to read of strings
tree.export_graphviz(dt.best_estimator_, out_file=dot_data, filled=True, class_names= target_names, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
img = Image(graph.create_png())
print(dir(img)) # with dir we can check what are the possibilities in graph.create_png
with open("my_tree.png", "wb") as png:
    png.write(img.data)
I obtain errors like: 'BaggingClassifier' object has no attribute 'tree_' and 'BaggingClassifier' object has no attribute 'feature_importances'. Does anyone know how I can obtain them? Thanks.

Based on the documentation, a BaggingClassifier object indeed doesn't have the attribute feature_importances_. You could still compute it yourself as described in the answer to this question: Feature importances - Bagging, scikit-learn
You can access the trees that were produced during the fitting of BaggingClassifier using the attribute estimators_, as in the following example:
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
iris = datasets.load_iris()
clf = BaggingClassifier(n_estimators=3)
clf.fit(iris.data, iris.target)
clf.estimators_
clf.estimators_ is a list of the 3 fitted decision trees:
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1422640898, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1968165419, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=2103976874, splitter='best')]
So you can iterate over the list and access each one of the trees.
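As a rough sketch of that iteration, the per-tree feature_importances_ can be averaged into one ensemble-level vector. This assumes the base estimator is a decision tree, and it uses estimators_features_ to map each tree's importances back to the original columns, which matters when max_features or bootstrap_features make each tree see only a subset of the features. The iris data and the parameter values here are illustrative only, not the ones from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Illustrative settings: each tree is trained on 2 of the 4 columns,
# drawn with replacement (so a column can appear twice).
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=3,
                        max_features=0.5, bootstrap_features=True,
                        random_state=0)
clf.fit(X, y)

# Average the per-tree importances, mapped back to the original columns.
importances = np.zeros(X.shape[1])
for est, feats in zip(clf.estimators_, clf.estimators_features_):
    # np.add.at accumulates correctly even when feats contains duplicates
    np.add.at(importances, feats, est.feature_importances_)
importances /= len(clf.estimators_)

rank = np.argsort(importances)[::-1]
print(importances)
print(rank)

# Text rendering of the first tree; feature_0, feature_1 refer to the
# columns listed in clf.estimators_features_[0], not to the original order.
print(export_text(clf.estimators_[0]))
```

With a fitted search object such as the RandomizedSearchCV in the question, the same loop would presumably run over dt.best_estimator_.estimators_, and each individual tree (rather than the BaggingClassifier itself) is what tree.export_graphviz expects.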

Related

Display all parameters of an estimator including defaults

I am using Watson Studio and using a markdown notebook. In the notebook, I write the code:
from sklearn.tree import DecisionTreeClassifier
Tree_Loan= DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_Loan
and it displays
DecisionTreeClassifier(criterion='entropy', max_depth=4)
However, it should display something in the form of (this is from a different lab I've done using Skills Network Labs):
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
The best I can tell is that it is not importing the decision tree classifier. I have the same problem with svm from sklearn. Other functions in scikit-learn like train test split and k nearest neighbors work fine. A classmate says the rest of my code is correct and there is no reason for the error. What might be causing it?
It is importing the DecisionTreeClassifier; no problem there. By default, though, recent versions of sklearn print only the parameters that were passed to the estimator with non-default values.
If you want to see the "full" output, you can set print_changed_only to False via sklearn.set_config, like so:
>>> from sklearn.tree import DecisionTreeClassifier
>>> tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
>>> # only displays the changed parameters
>>> tree
DecisionTreeClassifier(criterion='entropy', max_depth=4)
>>> from sklearn import get_config, set_config
>>> # default setting
>>> get_config()["print_changed_only"]
True
>>> # now changing it
>>> set_config(print_changed_only=False)
>>> # now we get the default values, too
>>> tree
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
max_depth=4, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
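If changing the global print setting is not desirable, a small sketch of two alternatives: get_params() returns every parameter (defaults included) as a dictionary, and sklearn.config_context applies print_changed_only=False only inside a with block. The variable names are illustrative.

```python
from sklearn import config_context
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)

# Every constructor parameter, including those left at their defaults
params = tree.get_params()
for name in sorted(params):
    print(name, "=", params[name])

# Temporarily switch to the full repr without touching the global config
with config_context(print_changed_only=False):
    full_repr = repr(tree)
print(full_repr)
```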

TypeError: Expected sequence or array-like, got estimator RandomForestRegressor

I am building a Random Forest Regression model using scikit-learn model. When I am trying to calculate the RMSE error for this model, I am getting an error. Can anybody tell me how to solve the error?
The code snippet is shown below:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, forest_reg)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
The error is shown below:
Expected sequence or array-like, got estimator RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0, warm_start=False)
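The error message indicates that mean_squared_error received the fitted estimator (forest_reg) as its second argument instead of an array of predictions. A minimal sketch of the corrected call follows; housing_prepared and housing_labels are not shown in the question, so synthetic data from make_regression stands in for them here as an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Stand-in data (assumption): replaces housing_prepared / housing_labels
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest_reg = RandomForestRegressor(n_estimators=10, random_state=0)
forest_reg.fit(X, y)
forest_predictions = forest_reg.predict(X)

# Pass the predictions, not the estimator, as the second argument
forest_mse = mean_squared_error(y, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)
```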

Getting error while running in jupyter notebook

ERROR
Invalid parameter C for estimator
DecisionTreeClassifier(class_weight=None, criterion='gini',
max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'). Check the list of available parameters with estimator.get_params().keys().
CODE
def train(X_train,y_train,X_test):
    # Scaling features
    X_train=preprocessing.scale(X_train)
    X_test=preprocessing.scale(X_test)
    Cs = 10.0 ** np.arange(-2,3,.5)
    gammas = 10.0 ** np.arange(-2,3,.5)
    param = [{'gamma': gammas, 'C': Cs}]
    skf = StratifiedKFold(n_splits=5)
    skf.get_n_splits(X_train, y_train)
    cvk = skf
    classifier = DecisionTreeClassifier()
    clf = GridSearchCV(classifier,param_grid=param,cv=cvk)
    clf.fit(X_train,y_train)
    print("The best classifier is: ",clf.best_estimator_)
    clf.best_estimator_.fit(X_train,y_train)
    # Estimate score
    scores = model_selection.cross_val_score(clf.best_estimator_, X_train,y_train, cv=5)
    print (scores)
    print('Estimated score: %0.5f (+/- %0.5f)' % (scores.mean(), scores.std() / 2))
    title = 'Learning Curves (SVM, rbf kernel, $\gamma=%.6f$)' %clf.best_estimator_.gamma
    plot_learning_curve(clf.best_estimator_, title, X_train, y_train, cv=5)
    plt.show()
    # Predict class
    y_pred = clf.best_estimator_.predict(X_test)
    return y_test,y_pred
It looks like you are making param a list with a single dictionary inside. param needs to be just a dictionary.
EDIT:
Looking into this further, as mentioned by @DzDev, passing a list containing a single dictionary is also a valid way to pass in parameters.
Your actual issue appears to be that you are mixing two different types of estimators: you are passing the parameters for svm.SVC but supplying a DecisionTreeClassifier estimator. So the error is exactly as it says: 'C' is not a valid parameter for that estimator. You should either switch to an svm.SVC estimator or update your parameters to ones that DecisionTreeClassifier accepts.
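Both options can be sketched side by side. The grids below are deliberately small and the iris data is a stand-in, since the question's data is not shown; the point is only that each grid uses names that appear in the corresponding estimator's get_params().

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data (assumption)
cv = StratifiedKFold(n_splits=5)

# Option 1: keep the C/gamma grid and use the estimator it belongs to
svc_param = [{'gamma': 10.0 ** np.arange(-2, 1), 'C': 10.0 ** np.arange(-2, 1)}]
svc_search = GridSearchCV(SVC(kernel='rbf'), param_grid=svc_param, cv=cv)
svc_search.fit(X, y)

# Option 2: keep the decision tree and search over its own parameters
tree_param = [{'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5, 10]}]
tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid=tree_param, cv=cv)
tree_search.fit(X, y)

print(svc_search.best_params_)
print(tree_search.best_params_)

# The valid names for any estimator can be listed before building a grid
print(sorted(DecisionTreeClassifier().get_params().keys()))
```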

Failing to tune decision tree classifier parameters using gridsearch

I am trying to tune parameters using GridSearchCV but keep encountering this error message
ValueError: Invalid parameter decisiontreeclassifier for estimator DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'). Check the list of available parameters with `estimator.get_params().keys()`.
This is the code I have written
accuracy_score = make_scorer(accuracy_score,greater_is_better = True)
dtc = DecisionTreeClassifier()
depth = np.arange(1,30)
leaves = [1,2,4,5,10,20,30,40,80,100]
param_grid =[{'decisiontreeclassifier__max_depth':depth,
'decisiontreeclassifier__min_samples_leaf':leaves}]
grid_search = GridSearchCV(estimator = dtc,param_grid = param_grid,
scoring=accuracy_score,cv=10)
grid_search = grid_search.fit(X_train,y_train)
Use max_depth instead of decisiontreeclassifier__max_depth in your param_grid. (The same thing applies to the other parameter.) The notation that you're using is for pipelines with multiple estimators chained together.
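import numpy as np
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Named accuracy_scorer so that it does not shadow sklearn.metrics.accuracy_score
accuracy_scorer = make_scorer(accuracy_score, greater_is_better=True)
dtc = DecisionTreeClassifier()
depth = np.arange(1, 30)
leaves = [1, 2, 4, 5, 10, 20, 30, 40, 80, 100]
# Plain parameter names: the estimator is not wrapped in a pipeline
param_grid = [{'max_depth': depth,
               'min_samples_leaf': leaves}]
grid_search = GridSearchCV(estimator=dtc, param_grid=param_grid,
                           scoring=accuracy_scorer, cv=10)
grid_search = grid_search.fit(X_train, y_train)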
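For contrast, here is a sketch of the case where the prefixed names are the right ones: when the tree is a step inside a pipeline. make_pipeline names each step after its lowercased class name, so the tree's parameters become decisiontreeclassifier__max_depth and so on. The scaler step, the small grid, and the iris data are illustrative assumptions, not part of the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data (assumption)

# The step name is derived from the class name: 'decisiontreeclassifier'
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# <step name>__<parameter name> addresses a parameter of that step
param_grid = {'decisiontreeclassifier__max_depth': [2, 4, 6],
              'decisiontreeclassifier__min_samples_leaf': [1, 5, 10]}

grid_search = GridSearchCV(pipe, param_grid=param_grid,
                           scoring='accuracy', cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)
```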

Export machine learning model

I am creating a machine learning model and want to export it.
Suppose I am using the scikit-learn library and the Random Forest algorithm.
modelC=RandomForestClassifier(n_estimators=30)
m=modelC.fit(trainvec,yvec)
modelC.model
How can I export it, or is there a function for it?
Following the scikit-learn documentation on model persistence, you can use joblib:
In [1]: from sklearn.ensemble import RandomForestClassifier
In [2]: from sklearn import datasets
In [3]: import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
In [4]: iris = datasets.load_iris()
In [5]: X, y = iris.data, iris.target
In [6]: m = RandomForestClassifier(2).fit(X, y)
In [7]: m
Out[7]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=2, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
In [8]: joblib.dump(m, "filename.cls")
In fact, you can use pickle.dump instead of joblib, but joblib does a very good job at compressing the numpy arrays inside classifiers.
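A minimal round-trip sketch, showing the model being written with joblib.dump and read back with joblib.load. The temporary directory, the file name model.joblib, and the compress level are illustrative choices; a loaded model should only come from a trusted source, since loading unpickles arbitrary objects.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # illustrative file name
    joblib.dump(model, path, compress=3)          # write the fitted model
    restored = joblib.load(path)                  # read it back

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```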
