I am trying to tune parameters using GridSearchCV but keep encountering this error message
ValueError: Invalid parameter decisiontreeclassifier for estimator DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'). Check the list of available parameters with `estimator.get_params().keys()`.
This is the code I have written
accuracy_score = make_scorer(accuracy_score, greater_is_better=True)
dtc = DecisionTreeClassifier()
depth = np.arange(1, 30)
leaves = [1, 2, 4, 5, 10, 20, 30, 40, 80, 100]
param_grid = [{'decisiontreeclassifier__max_depth': depth,
               'decisiontreeclassifier__min_samples_leaf': leaves}]
grid_search = GridSearchCV(estimator=dtc, param_grid=param_grid,
                           scoring=accuracy_score, cv=10)
grid_search = grid_search.fit(X_train, y_train)
Use max_depth instead of decisiontreeclassifier__max_depth in your param_grid, and likewise min_samples_leaf instead of decisiontreeclassifier__min_samples_leaf. The double-underscore notation you are using addresses parameters as step_name__parameter and is only for pipelines with multiple estimators chained together; your estimator is a bare DecisionTreeClassifier, so the parameter names are used directly.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score
import numpy as np

# Use a distinct name for the scorer so the accuracy_score function is not shadowed
acc_scorer = make_scorer(accuracy_score, greater_is_better=True)
dtc = DecisionTreeClassifier()
depth = np.arange(1, 30)
leaves = [1, 2, 4, 5, 10, 20, 30, 40, 80, 100]
param_grid = [{'max_depth': depth,
               'min_samples_leaf': leaves}]
grid_search = GridSearchCV(estimator=dtc, param_grid=param_grid,
                           scoring=acc_scorer, cv=10)
grid_search = grid_search.fit(X_train, y_train)
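For reference, the double-underscore names from the question are exactly what you would use if the tree were wrapped in a pipeline. A minimal sketch, reusing depth, leaves and acc_scorer from above (the StandardScaler step is purely illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier())
# make_pipeline names each step after its class in lowercase, so the
# prefixed parameter names become valid here:
print(pipe.get_params().keys())  # includes 'decisiontreeclassifier__max_depth', ...
param_grid = [{'decisiontreeclassifier__max_depth': depth,
               'decisiontreeclassifier__min_samples_leaf': leaves}]
grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid,
                           scoring=acc_scorer, cv=10)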
I am training a regression model using StackingRegressor and I found that its predictions are not consistent even though I am using the same random_state.
Here is my code:
import lightgbm
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor, StackingRegressor

random_seed = 42
mdl_lgbm = lightgbm.LGBMRegressor(colsample_bytree=0.6,
                                  learning_rate=0.05,
                                  max_depth=6,
                                  min_child_samples=227,
                                  min_child_weight=10,
                                  n_estimators=1800,
                                  num_leaves=45,
                                  reg_alpha=0,
                                  reg_lambda=1,
                                  subsample=0.6,
                                  n_jobs=-1,
                                  random_state=random_seed)
mdl_xgb = xgb.XGBRegressor(subsample=0.5,
                           n_estimators=900,
                           min_child_weight=8,
                           max_depth=6,
                           learning_rate=0.03,
                           colsample_bytree=0.8,
                           n_jobs=-1,
                           reg_alpha=2,
                           reg_lambda=50,
                           objective='reg:squarederror',
                           random_state=random_seed)
mdl_rf = RandomForestRegressor(bootstrap=True,
                               max_depth=110,
                               max_features='auto',
                               min_samples_leaf=5,
                               min_samples_split=5,
                               n_estimators=1430,
                               n_jobs=-1,
                               random_state=random_seed)
# Base models
base_mdl_names = {
    'XGB': mdl_xgb,
    'LGBM': mdl_lgbm,
    'RF': mdl_rf,
}
final_estimator = xgb.XGBRegressor(subsample=0.3,
                                   n_estimators=1200,
                                   min_child_weight=2,
                                   max_depth=5,
                                   learning_rate=0.06,
                                   colsample_bytree=0.8,
                                   n_jobs=-1,
                                   reg_alpha=1,
                                   reg_lambda=0.1,
                                   objective='reg:squarederror',
                                   random_state=random_seed)
base_estimators = list()
for name, mdl in base_mdl_names.items():
    base_estimators.append((name, mdl))
stacked_mdl = StackingRegressor(estimators=base_estimators,
                                final_estimator=final_estimator,
                                cv=5,
                                passthrough=True)
stacked_mdl.fit(X_train, y_train)
Please note that I am not changing X_train. When I use the trained model for prediction, the results are not reproducible: if I re-train the model, the results are different even though every input is the same. Any clue why this is happening?
StackingRegressor uses cv (cross-validation) internally, so you also have to fix the random_state of the splitter in order to get exactly the same cross-validation splits on each run.
You can do it as follows:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, random_state=random_seed, shuffle=True)
stacked_mdl = StackingRegressor(estimators=base_estimators,
                                final_estimator=final_estimator,
                                cv=kfold,
                                passthrough=True)
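As a quick sanity check, a minimal sketch (X_train, y_train and X_test are assumed to exist as in the question) that fits the stack twice and compares predictions:

import numpy as np
from sklearn.base import clone

# Fit two fresh copies of the stack on the same data
mdl_a = clone(stacked_mdl).fit(X_train, y_train)
mdl_b = clone(stacked_mdl).fit(X_train, y_train)
# With the splitter and every estimator's random_state fixed,
# the two runs should produce matching predictions
print(np.allclose(mdl_a.predict(X_test), mdl_b.predict(X_test)))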
I followed this tutorial to create a simple image classification script:
https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/
train_data = scipy.io.loadmat('extra_32x32.mat')
# extract the images and labels from the dictionary object
X = train_data['X']
y = train_data['y']
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2], X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)
....
clf = RandomForestClassifier()
print(clf)
# Output of print(clf):
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#             max_depth=None, max_features='auto', max_leaf_nodes=None,
#             min_impurity_split=1e-07, min_samples_leaf=1,
#             min_samples_split=2, min_weight_fraction_leaf=0.0,
#             n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
#             verbose=0, warm_start=False)
start_time = time.time()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
It gave me an accuracy of approximately 0.7.
Is there some way to visualize or show where/when/if the model is overfitting? I believe overfitting shows up when the training accuracy keeps increasing while the validation accuracy decreases. But how can I do that in the code?
There are multiple ways you can test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html). As the documentation describes, it returns a dictionary with the test scores, and with the train scores as well if you pass return_train_score=True, for the metrics you supply.
Sample code:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy',
                               bootstrap=True, oob_score=True, verbose=1)
cv_dict = cross_validate(model, X, y, return_train_score=True)
You can also simply create a hold-out test set with train_test_split and compare your training and test scores on it, as in the sketch below.
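A minimal sketch of that hold-out comparison, reusing X and y from the question:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier().fit(X_train, y_train)
# A training score far above the test score is a sign of overfitting
print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))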
Another option is to use a library like Optuna, which will try various hyperparameters for you; you can then apply the evaluation methods mentioned above to its results.
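To actually visualize the train-versus-validation gap the question asks about, a minimal sketch using sklearn's learning_curve (matplotlib assumed to be available):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))
# Training accuracy staying high while validation accuracy stalls
# or drops indicates overfitting
plt.plot(train_sizes, train_scores.mean(axis=1), label='train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='validation')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()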
ERROR
Invalid parameter C for estimator
DecisionTreeClassifier(class_weight=None, criterion='gini',
max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'). Check the list of available parameters with estimator.get_params().keys().
CODE
def train(X_train, y_train, X_test):
    # Scaling features
    X_train = preprocessing.scale(X_train)
    X_test = preprocessing.scale(X_test)
    Cs = 10.0 ** np.arange(-2, 3, .5)
    gammas = 10.0 ** np.arange(-2, 3, .5)
    param = [{'gamma': gammas, 'C': Cs}]
    skf = StratifiedKFold(n_splits=5)
    skf.get_n_splits(X_train, y_train)
    cvk = skf
    classifier = DecisionTreeClassifier()
    clf = GridSearchCV(classifier, param_grid=param, cv=cvk)
    clf.fit(X_train, y_train)
    print("The best classifier is: ", clf.best_estimator_)
    clf.best_estimator_.fit(X_train, y_train)
    # Estimate score
    scores = model_selection.cross_val_score(clf.best_estimator_, X_train, y_train, cv=5)
    print(scores)
    print('Estimated score: %0.5f (+/- %0.5f)' % (scores.mean(), scores.std() / 2))
    title = 'Learning Curves (SVM, rbf kernel, $\gamma=%.6f$)' % clf.best_estimator_.gamma
    plot_learning_curve(clf.best_estimator_, title, X_train, y_train, cv=5)
    plt.show()
    # Predict class
    y_pred = clf.best_estimator_.predict(X_test)
    return y_test, y_pred
It looks like you are making param an array with a single dictionary inside, while param needs to be just a dictionary:
EDIT:
Looking into this further, as mentioned by @DzDev, passing an array containing a single dictionary is also a valid way to pass in parameters.
It appears that your issue is that you are mixing the concepts of two different types of estimators: you are passing in the parameters for svm.SVC but are sending in a DecisionTreeClassifier estimator. So the error is exactly what it says: 'C' is not a valid parameter. You should either switch to an svm.SVC estimator or update your parameters to be correct for the DecisionTreeClassifier.
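For example, a minimal sketch of the svm.SVC variant (an rbf kernel is assumed, matching the plot title in the question; cvk and the training data are reused from the question's function):

from sklearn import svm

Cs = 10.0 ** np.arange(-2, 3, .5)
gammas = 10.0 ** np.arange(-2, 3, .5)
param = [{'gamma': gammas, 'C': Cs}]
# C and gamma belong to svm.SVC, so the grid now matches the estimator
classifier = svm.SVC(kernel='rbf')
clf = GridSearchCV(classifier, param_grid=param, cv=cvk)
clf.fit(X_train, y_train)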
Obtaining the decision tree and the important features is easy when using DecisionTreeClassifier in scikit-learn. However, I am not able to obtain either of them when I add a bagging function, e.g., BaggingClassifier.
Since we need to fit the model using the BaggingClassifier, I cannot return the results (print the trees (graphs), feature_importances_, ...) related to the DecisionTreeClassifier.
Here is my script:
import numpy as np
import pydotplus
from IPython.display import Image
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

seed = 7
n_iterations = 199
# assumed: kfold was not defined in the original snippet
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
DTC = DecisionTreeClassifier(random_state=seed,
                             max_depth=None,
                             min_impurity_split=0.2,
                             min_samples_leaf=6,
                             max_features=None,  # If None, then max_features=n_features.
                             max_leaf_nodes=20,
                             criterion='gini',
                             splitter='best',
                             )
#parametersDTC = {'max_depth': range(3, 10), 'max_leaf_nodes': range(10, 30)}
parameters = {'max_features': range(1, 200)}
dt = RandomizedSearchCV(BaggingClassifier(base_estimator=DTC,
                                          #max_samples=1,
                                          n_estimators=100,
                                          #max_features=1,
                                          bootstrap=False,
                                          bootstrap_features=True,
                                          random_state=seed),
                        parameters, n_iter=n_iterations, n_jobs=14, cv=kfold,
                        error_score='raise', random_state=seed, refit=True)  #min_samples_leaf=10
# Fit the model
fit_dt = dt.fit(X_train, Y_train)
print(dir(fit_dt))
tree_model = dt.best_estimator_
# Print the important features (NOT WORKING)
features = tree_model.feature_importances_
print(features)
rank = np.argsort(features)[::-1]
print(rank[:12])
print(sorted(list(zip(features))))
# Importing the image (NOT WORKING)
from sklearn.externals.six import StringIO
tree.export_graphviz(dt.best_estimator_, out_file='tree.dot')  # necessary to plot the graph
dot_data = StringIO()  # need to understand but it probably relates to read of strings
tree.export_graphviz(dt.best_estimator_, out_file=dot_data, filled=True, class_names=target_names, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
img = Image(graph.create_png())
print(dir(img))  # with dir we can check what are the possibilities in graph.create_png
with open("my_tree.png", "wb") as png:
    png.write(img.data)
I get errors like 'BaggingClassifier' object has no attribute 'tree_' and 'BaggingClassifier' object has no attribute 'feature_importances_'. Does anyone know how I can obtain them? Thanks.
Based on the documentation, the BaggingClassifier object indeed doesn't have the attribute feature_importances_. You could still compute it yourself, as described in the answer to this question: Feature importances - Bagging, scikit-learn
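Following that linked approach, a minimal sketch that averages the per-tree importances of a fitted BaggingClassifier clf:

import numpy as np

# Average the importances of the individual fitted trees
# (note: with bootstrap_features=True you would also need
# clf.estimators_features_ to map importances back to the original columns)
feature_importances = np.mean(
    [est.feature_importances_ for est in clf.estimators_], axis=0)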
You can access the trees that were produced during the fitting of BaggingClassifier using the attribute estimators_, as in the following example:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
iris = datasets.load_iris()
clf = BaggingClassifier(n_estimators=3)
clf.fit(iris.data, iris.target)
clf.estimators_
clf.estimators_ is a list of the 3 fitted decision trees:
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1422640898, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1968165419, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=2103976874, splitter='best')]
So you can iterate over the list and access each one of the trees.
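For instance, a minimal sketch that exports the first of those trees to a PNG (pydotplus and the Graphviz binaries are assumed to be installed):

import pydotplus
from sklearn import tree

# Export the first bagged tree, just as with a standalone DecisionTreeClassifier
dot_data = tree.export_graphviz(clf.estimators_[0], out_file=None,
                                filled=True, rounded=True)
pydotplus.graph_from_dot_data(dot_data).write_png("bagged_tree_0.png")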
Let's consider a multivariate regression problem (2 response variables: Latitude and Longitude). A few machine learning model implementations, like Support Vector Regression (sklearn.svm.SVR), do not currently provide native support for multivariate regression. For this reason, sklearn.multioutput.MultiOutputRegressor can be used.
Example:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

svr_multi = MultiOutputRegressor(SVR(), n_jobs=-1)

# Fit the algorithm on the data
svr_multi.fit(X_train, y_train)
y_pred = svr_multi.predict(X_test)
My goal is to tune the parameters of SVR with sklearn.model_selection.GridSearchCV. Ideally, if the response were a single variable and not multiple, I would perform an operation as follows:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_svr = Pipeline([('scl', StandardScaler()),
                     ('reg', SVR())])
grid_param_svr = {
    'reg__C': [0.01, 0.1, 1, 10],
    'reg__epsilon': [0.1, 0.2, 0.3],
    'reg__degree': [2, 3, 4]  # parameters of the 'reg' step need the reg__ prefix
}
gs_svr = GridSearchCV(estimator=pipe_svr,
                      param_grid=grid_param_svr,
                      cv=10,
                      scoring='neg_mean_squared_error',
                      n_jobs=-1)
gs_svr = gs_svr.fit(X_train, y_train)
However, as my response y_train is 2-dimensional, I need to use the MultiOutputRegressor on top of SVR. How can I modify the above code to enable this GridSearchCV operation? If not possible, is there a better alternative?
For use without a pipeline, put estimator__ before the parameter names:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor

param_grid = {'estimator__min_samples_split': [10, 50],
              'estimator__min_samples_leaf': [50, 150]}
gb = GradientBoostingRegressor()
gs = GridSearchCV(MultiOutputRegressor(gb), param_grid=param_grid)
gs.fit(X, y)
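If you are ever unsure which prefixed names exist, the hint from the error messages above works here too:

# Lists names such as 'estimator__min_samples_split', 'estimator__learning_rate', ...
print(MultiOutputRegressor(GradientBoostingRegressor()).get_params().keys())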
I just found a working solution. In the case of nested estimators, the parameters of the inner estimator can be accessed with the estimator__ prefix:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_svr = Pipeline([('scl', StandardScaler()),
                     ('reg', MultiOutputRegressor(SVR()))])
grid_param_svr = {
    'reg__estimator__C': [0.1, 1, 10]
}
gs_svr = GridSearchCV(estimator=pipe_svr,
                      param_grid=grid_param_svr,
                      cv=2,
                      scoring='neg_mean_squared_error',
                      n_jobs=-1)
gs_svr = gs_svr.fit(X_train, y_train)
gs_svr.best_estimator_
Pipeline(steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)),
('reg', MultiOutputRegressor(estimator=SVR(C=10, cache_size=200,
coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1,
shrinking=True, tol=0.001, verbose=False), n_jobs=1))])
Thank you, Marco.
Adding to your answer, here is a short illustrative example of a Randomized Search applied to a multi-output GradientBoostingRegressor.
from sklearn.datasets import load_linnerud
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV
x, y = load_linnerud(return_X_y=True)
model = MultiOutputRegressor(GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100,
                                                       subsample=1.0, criterion='friedman_mse',
                                                       min_samples_split=2, min_samples_leaf=1,
                                                       min_weight_fraction_leaf=0.0, max_depth=3,
                                                       min_impurity_decrease=0.0, min_impurity_split=None,
                                                       init=None, random_state=None, max_features=None,
                                                       alpha=0.9, verbose=0, max_leaf_nodes=None,
                                                       warm_start=False, validation_fraction=0.1,
                                                       n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0))
hyperparameters = dict(estimator__learning_rate=[0.05, 0.1, 0.2, 0.5, 0.9],
                       estimator__loss=['ls', 'lad', 'huber'],
                       estimator__n_estimators=[20, 50, 100, 200, 300, 500, 700, 1000],
                       estimator__criterion=['friedman_mse', 'mse'],
                       estimator__min_samples_split=[2, 4, 7, 10],
                       estimator__max_depth=[3, 5, 10, 15, 20, 30],
                       estimator__min_samples_leaf=[1, 2, 3, 5, 8, 10],
                       estimator__min_impurity_decrease=[0, 0.2, 0.4, 0.6, 0.8],
                       estimator__max_leaf_nodes=[5, 10, 20, 30, 50, 100, 300])
randomized_search = RandomizedSearchCV(model, hyperparameters, random_state=0, n_iter=5, scoring=None,
                                       n_jobs=2, refit=True, cv=5, verbose=True,
                                       pre_dispatch='2*n_jobs', error_score='raise', return_train_score=True)
hyperparameters_tuning = randomized_search.fit(x, y)
print('Best Parameters = {}'.format(hyperparameters_tuning.best_params_))
tuned_model = hyperparameters_tuning.best_estimator_
print(tuned_model.predict(x))