I'm running a grid search on AdaBoost with DecisionTreeClassifier as its base learner, to find the best parameters for both AdaBoost and the decision tree.
The search on a dataset of shape (130000, 22) has been running for 18 hours, so I'm wondering whether this is just a typical long training run or whether there's an issue with the setup.
Are the base learner, grid search, training, and parameters set up correctly?
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

ada_params = {"base_estimator__criterion": ["gini", "entropy"],
              "base_estimator__splitter": ["best", "random"],
              "base_estimator__min_samples_leaf": [*np.arange(100, 1500, 100)],
              "base_estimator__max_depth": [5, 10, 13, 15],
              "base_estimator__max_features": [5, 10, 15],
              "n_estimators": [500, 700, 1000, 1500],
              "learning_rate": [0.001, 0.01, 0.1, 0.3]
              }

dt_base_learner = DecisionTreeClassifier(random_state=42, max_features="auto", class_weight="balanced")
ada_clf = AdaBoostClassifier(base_estimator=dt_base_learner)

# kf is a predefined cross-validation splitter (e.g. KFold)
ada_search = GridSearchCV(ada_clf, param_grid=ada_params, scoring='f1', cv=kf)
ada_search.fit(scaled_X_train, y_train)
If I am not mistaken, your grid search tests 2 * 2 * 14 * 4 * 3 * 4 * 4 = 10,752 different model configurations, each cross-validated over an unknown number of splits. You should definitely reduce the number of combinations in the GridSearchCV, or switch to RandomizedSearchCV or BayesSearchCV from skopt.
GridSearchCV will not finish until every fit is done. Check the RandomizedSearchCV documentation: it lets you increase the number of sampled candidates a few at a time (n_iter), and you can set n_jobs=-1 to parallelize as much as possible.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
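For illustration, a minimal RandomizedSearchCV sketch over the same parameter space (ada_clf, ada_params, kf, scaled_X_train and y_train are assumed to be defined as in the question; n_iter=50 is an arbitrary example value):

from sklearn.model_selection import RandomizedSearchCV

# Sample 50 random combinations instead of exhausting the 10,752-point grid
ada_random_search = RandomizedSearchCV(ada_clf,
                                       param_distributions=ada_params,
                                       n_iter=50,        # raise a few at a time if needed
                                       scoring='f1',
                                       cv=kf,
                                       n_jobs=-1,        # use all available cores
                                       random_state=42)
ada_random_search.fit(scaled_X_train, y_train)
print(ada_random_search.best_params_)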
Related
I'm working on an XGBoost model and I also tried GridSearchCV from scikit-learn. After searching for the optimal parameter settings, I got this result:
Fitting 4 folds for each of 2304 candidates, totalling 9216 fits
Best parameters found: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 11, 'n_estimators': 200, 'subsample': 0.9}
Now, after training a model with these settings and predicting on unseen test data (using a train/test split), I got a result. As an experiment I changed some settings manually, and then I got a better result than with the "optimal" parameters from the grid search.
Is this because the test set is different from the training data? If I had another test set, would the best settings be different again? I think both questions can be answered with yes, but how do other people deal with this effect?
You get the results from the grid search, but do you always use those settings, or do you do the same as I do? What would be your final settings for the model you want to deploy?
Hope to receive some inspirational thoughts :)
My final code for train/test after manual fine tuning:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

xgb_skl_tuned_2 = xgb.XGBRegressor(
    colsample_bytree=0.7,
    subsample=0.9,
    learning_rate=0.3,
    max_depth=5,
    min_child_weight=13,
    gamma=10,
    n_estimators=50
)

xgb_skl_tuned_2.fit(X_train_2, y_train_2)
preds_2 = xgb_skl_tuned_2.predict(X_test_2)

# squared=False makes mean_squared_error return the root mean squared error
rmse = mean_squared_error(y_test_2, preds_2, squared=False)
print('Model RMSE: {}'.format(rmse))
Also checked this thread: parameters tuning with GridsearchCV not giving best result
How can I use RandomizedSearchCV or GridSearchCV on only 30% of the data in order to speed up the process?
My X.shape is (94456, 100) and I'm trying to use RandomizedSearchCV or GridSearchCV, but it's taking a very long time. My code has been running for several hours and still shows no results.
My code looks like this:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Random Forest
param_grid = [
    {'n_estimators': np.arange(2, 25), 'max_features': [2, 5, 10, 25],
     'max_depth': np.arange(10, 50), 'bootstrap': [True, False]}
]
clf = RandomForestClassifier()
grid_search_forest = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search_forest.fit(X, y)
rf_best_model = grid_search_forest.best_estimator_

# Decision Tree
param_grid = {'max_depth': np.arange(1, 50), 'min_samples_split': [20, 30, 40]}
grid_search_dec_tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy')
grid_search_dec_tree.fit(X, y)
dt_best_model = grid_search_dec_tree.best_estimator_

# K Nearest Neighbors
knn = KNeighborsClassifier()
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid_search_knn = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid_search_knn.fit(X, y)
knn_best_model = grid_search_knn.best_estimator_
You can always sample a part of your data to fit your models. Although not designed for this purpose, train_test_split can be useful here (it takes care of shuffling, stratification, etc., which you would otherwise have to handle yourself when sampling manually):
from sklearn.model_selection import train_test_split
X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.70)
By asking for test_size=0.70, your training data X_train will now be 30% of your initial set X.
You should now replace all the .fit(X, y) statements in your code with .fit(X_train, y_train).
On a more general level, all these np.arange() statements in your grids look like overkill - I would suggest picking a few representative values in a list instead of searching the grid in that much detail. Random Forests in particular are notoriously insensitive to the number of trees n_estimators, and adding one tree at a time is hardly useful - go for something like 'n_estimators': [50, 100]...
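For illustration, a minimal sketch that combines the 30% subsample with a coarser random-forest grid (the specific values below are just examples, not recommendations tuned to your data):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Keep 30% of the data for the search
X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.70, random_state=0)

# Coarse grid with a few representative values instead of np.arange()
param_grid = {
    'n_estimators': [50, 100],
    'max_features': [5, 10, 25],
    'max_depth': [10, 20, 40],
    'bootstrap': [True, False],
}

grid_search_forest = GridSearchCV(RandomForestClassifier(), param_grid,
                                  cv=5, scoring='accuracy', n_jobs=-1)
grid_search_forest.fit(X_train, y_train)
rf_best_model = grid_search_forest.best_estimator_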
ShuffleSplit fits this problem very well. You can define your cv as:
cv = ShuffleSplit(n_splits=1, test_size=.3)
This means setting aside 30% of your training data for validating each hyper-parameter setting. cv=5, on the other hand, carries out a 5-fold cross-validation, which means going through 5 fit/predict rounds for each hyper-parameter setting.
So, this also requires very minimal change to your code. Just replace those cv=5 or cv=10 inside GridSearchCV with cv = ShuffleSplit(n_splits=1, test_size=.3) and you're done.
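For example, a sketch using the random-forest search from the question (param_grid is assumed to be defined as above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# One 70/30 split per candidate instead of a full k-fold cross-validation
cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)

grid_search_forest = GridSearchCV(RandomForestClassifier(), param_grid,
                                  cv=cv, scoring='accuracy', n_jobs=-1)
grid_search_forest.fit(X, y)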
I am running LightGBM on a classification problem, with cross-validation (using sklearn) to find the optimal hyperparameter values.
Although I specified random_state when creating the model object, rerunning the grid search gives different optimal parameters each time.
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt',
                                   objective='multiclass',
                                   num_class=4,
                                   random_state=42)

grid = {
    'num_leaves': [60, 70, 80, 100, 120],
    'feature_fraction': [0.5, 0.7],
    'bagging_fraction': [0.7, 0.8],
    'num_trees': [50, 80, 100],
    'C': [0, 0.3, 0.5, 1]
}

GBM_grid_search = GridSearchCV(estimator=lgb_estimator,
                               param_grid=grid,
                               scoring='f1_micro',
                               cv=15,
                               n_jobs=2)

lgb_model_trained = GBM_grid_search.fit(X=X_train, y=y_train)
My training data is split using a fixed random seed, so there are no issues on that side.
What's causing the randomness, and how can I solve it?
AFAIK, setting the random seed (random_state in LGBMClassifier) does not guarantee reproducibility if LightGBM is working in parallel (n_jobs > 1). If you need reproducibility and still want to use all your n cores, you should find or create a way to run n instances of LightGBM with n_jobs=1 each. The simplest way: with 2 cores you can run 2 instances of your original code, the first one with bagging_fraction=0.7, the second one with bagging_fraction=0.8, each with n_jobs=1. Or simply set n_jobs=1 without any other changes, if reproducibility is more important than speed.
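Following that last suggestion, a minimal sketch (reusing the estimator settings and the grid from the question; n_jobs=1 in both the classifier and GridSearchCV trades speed for reproducibility):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Single-threaded LightGBM so the random_state fully controls the result
lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt',
                                   objective='multiclass',
                                   num_class=4,
                                   random_state=42,
                                   n_jobs=1)

GBM_grid_search = GridSearchCV(estimator=lgb_estimator,
                               param_grid=grid,   # same grid as in the question
                               scoring='f1_micro',
                               cv=15,
                               n_jobs=1)          # no parallel search either
lgb_model_trained = GBM_grid_search.fit(X=X_train, y=y_train)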
I am trying to compute the optimal C and Gamma for my SVM. When trying to run my script I get this error:
ValueError: Invalid parameter max_features for estimator SVC. Check the list of available parameters with estimator.get_params().keys().
I went through the docs to understand what n_estimators actually means so that I know what values to fill in there, but it is still not totally clear to me. Could someone tell me what this value should be, so that I can run my script and find the optimal C and gamma?
my code:
if __name__ == '__main__':
    fname = "/home/John/labels.csv"
    labels = pd.read_csv(fname, header=None).as_matrix()[:, 1]
    labels = map(itemgetter(1),
                 map(os.path.split,
                     map(os.path.dirname, labels)))

    fname = "/home/John/reps.csv"
    embeddings = pd.read_csv(fname, header=None).as_matrix()

    le = LabelEncoder().fit(labels)
    labelsNum = le.transform(labels)
    nClasses = len(le.classes_)

    svcClassifier = SVC(kernel='rbf', probability=True, C=10, gamma=10)
    #classifier = OneVsRestClassifier(svcClassifier).fit(embeddings, labelsNum)

    param_grid = {
        'n_estimators': [200, 700],
        'max_features': ['auto', 'sqrt', 'log2']
    }
    CV_rfc = GridSearchCV(estimator=svcClassifier, param_grid=param_grid, cv=5)
    CV_rfc.fit(embeddings, labelsNum)
    print CV_rfc.best_params_
By manual trial I found that in my case C=10 and gamma=10 give the best results. I would, however, like to use this function to find out what the optimal values should be.
My code is inspired by this post: How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)
The SVC class has no max_features or n_estimators argument, as these are arguments of the RandomForestClassifier you used as a basis for your code. If you want to optimize the model with respect to C and gamma, you can try:
param_grid = {
    'C': [0.1, 0.5, 1.0],
    'gamma': [0.1, 0.5, 1.0]
}
Furthermore, I also recommend searching for the optimal kernel, which can be 'rbf', 'linear' or 'poly' in the sklearn framework.
Edit: the values here are arbitrary and only meant to illustrate the general approach. You should add more values, whose number and range depend on your situation.
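Putting this together, a minimal sketch (the value lists are arbitrary examples; embeddings and labelsNum are taken from the question):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Only parameters that SVC actually accepts (C, gamma, kernel, ...)
param_grid = {
    'C': [0.1, 0.5, 1.0, 10.0],
    'gamma': [0.1, 0.5, 1.0, 10.0],
    'kernel': ['rbf', 'linear', 'poly']
}

CV_svc = GridSearchCV(estimator=SVC(probability=True), param_grid=param_grid, cv=5)
CV_svc.fit(embeddings, labelsNum)
print(CV_svc.best_params_)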
I'm looking for a way to grid-search for hyperparameters in sklearn without using k-fold cross-validation, i.e. I want the grid search to train on one specific dataset (X1, y1 in the example below) and validate itself on a specific hold-out dataset (X2, y2 in the example below).
X1, y1 = train data
X2, y2 = validation data
clf_ = SVC(kernel='rbf', cache_size=1000)
Cs = [1, 10.0, 50, 100.0]
Gammas = [0.4, 0.42, 0.44, 0.46, 0.48, 0.5, 0.52, 0.54, 0.56]
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2, y2
                   n_jobs=8, verbose=10)
clf.fit(X1, y1)
Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2, y2
                   n_jobs=8, verbose=10)
Hard-coding n_jobs=8 does not make much sense: with n_jobs=-1 the processing will use all the cores on your machine, while with n_jobs=1 only one core is used.
If cv=5, it will run five cross-validation fits for every parameter combination.
In your case the total number of fits will be 4 (size of Cs) * 9 (size of Gammas) * 5 (if cv=5) = 180.
If you are using cross-validation, it does not make much sense to also hold out data for rechecking your model. If you are not confident about the performance, you can just increase cv to get a better estimate.
This will be very time consuming, especially for an SVM, so I would rather suggest RandomizedSearchCV, which lets you specify the number of parameter settings you want it to sample randomly.
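For example, a minimal sketch following this suggestion (reusing clf_, Cs, Gammas, X1 and y1 from the question; n_iter=10 is an arbitrary example):

from sklearn.model_selection import RandomizedSearchCV

# Sample only 10 of the 4 * 9 = 36 possible (C, gamma) combinations
random_search = RandomizedSearchCV(clf_,
                                   param_distributions=dict(C=Cs, gamma=Gammas),
                                   n_iter=10,
                                   cv=5,
                                   n_jobs=-1,
                                   verbose=10,
                                   random_state=42)
random_search.fit(X1, y1)
print(random_search.best_params_)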