Hyper-tuning of SVM regressor using Grid search and cuML - python

I tried tuning the SVM regressor's parameters using the code below. However, during the search for the best parameters, grid search always ends up selecting the first kernel listed in the grid, regardless of the other settings. I tried different combinations to see whether I could reach good results.
I tried different numbers of cross-validation folds (3, 5 (the default), and 7).
I added and removed the custom scoring function of the GridSearchCV model to see whether it had any impact.
I also tried higher values of the penalty parameter C (see the attached code).
Can anyone confirm this behaviour?
Does the order of the kernels in the parameter grid affect the search?
Note: I used cuML to take advantage of the GPU and speed up the search.
I can share the dataset if needed.
from cuml.svm import SVR
from sklearn.model_selection import GridSearchCV
#from sklearn.svm import SVR
from sklearn.metrics import make_scorer, mean_squared_error
from pprint import pprint
import numpy as np

# Hyper-tuning for SVM regressor
base_svr = SVR()

# If you want to use a custom score function, define it and wrap it with make_scorer.
def my_rmse_loss_func(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

scorer = make_scorer(mean_squared_error, greater_is_better=False)  # squared=False would give RMSE directly

param_grid_svr = {'C': [0.0001, 0.01, 0.1, 0.9, 1, 1.1, 2, 3],  # I also used 10, 100
                  'gamma': ['scale', 'auto'],
                  'kernel': ['rbf', 'poly', 'sigmoid', 'precomputed', 'linear'],
                  'epsilon': [0.01, 0.1, 0.2, 0.22, 0.3, 1]}

# Create a GridSearchCV object and fit it to the training data with 7 CV folds
svr_cv = GridSearchCV(base_svr, param_grid_svr, n_jobs=-1,  # also tried n_jobs=4
                      scoring='neg_mean_squared_error', cv=7,
                      verbose=3, return_train_score=True)

# Train the model on the GPU
svr_cv.fit(X_train, y_train)

# Predictions using the refitted best model
predictions_svr_tuned = svr_cv.predict(X_test)
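One way to check whether the first kernel really wins on merit, rather than simply appearing first, is to compare the cross-validated scores per kernel in cv_results_ after fitting. A minimal sketch, assuming svr_cv has been fitted as above and pandas is available:
import pandas as pd

# Collect the full grid-search results (one row per parameter combination).
results = pd.DataFrame(svr_cv.cv_results_)

# Best mean cross-validated test score per kernel (scores are negative MSE, so
# higher is better); candidates that failed to fit appear as NaN with the
# default error_score.
print(results.groupby('param_kernel')['mean_test_score'].max())

# The overall winner chosen by the search and its score.
print(svr_cv.best_params_)
print(svr_cv.best_score_)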

Related

AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'

I'm trying to tune my model using GridSearchCV in a Kaggle notebook. In order to benefit from the GPU, I used the hummingbird-ml package. Thanks in advance.
However, I get the following error:
AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'
Here is my code:
from hummingbird.ml import convert
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.metrics import make_scorer, mean_squared_error
from pprint import pprint
# Hyper-tuning for SVM regressor
import numpy as np
base_svr = SVR()
scorer = make_scorer(mean_squared_error, greater_is_better=False)
param_grid_svr = {'C': [0.01, 0.1, 1, 10, 100],
                  'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                  'kernel': ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed'],
                  'epsilon': [0.01, 0.1, 0.2, 0.3, 1]}
pprint(param_grid_svr)
# Create a GridSearchCV object and fit it to the training data
svr_gs = GridSearchCV(base_svr,param_grid_svr, n_jobs = -1 , scoring=scorer, cv=3 ,refit=True,verbose=2)
# Converting scikit-learn model to PyTorch on CPU
svr_gs_pytorch = convert(svr_gs, 'torch')
# Switching PyTorch from CPU to GPU
# (%%capture is a Jupyter cell magic; it has to be the first line of its own cell)
%%capture
svr_gs_pytorch.to('cuda')
# Train the model in GPU
svr_gs_pytorch.fit(X_train,y_train)
# print best parameter after tuning
svr_gs_pytorch.best_params_
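For reference, best_estimator_ and best_params_ only exist on a GridSearchCV object after its own fit method has been called; the sketch below shows that ordering with plain scikit-learn. Whether the fitted SVR can then be converted with hummingbird-ml is an assumption that is not verified here.
# Fit the grid search itself first; this is what populates best_estimator_ / best_params_.
svr_gs.fit(X_train, y_train)
print(svr_gs.best_params_)

# Only afterwards consider converting the fitted best estimator, and only if the
# model type is supported by hummingbird-ml (assumption, not verified here):
# svr_gs_pytorch = convert(svr_gs.best_estimator_, 'torch')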

How to run GridSearchCV inside RFECV?

I would like to write a script to select the most important features using RFECV. The estimator I want to use is logistic regression. In addition, I also want to do a GridSearchCV. In other words, I want to tune the parameters first using GridSearchCV and then update the parameters in each iteration of RFECV.
I have written the code below, but I'm not sure whether, when I use RFECV(GridSearchCV(LogisticRegression)), the parameters of the model are tuned and updated in each iteration of RFECV. Please give me some advice on this issue.
Thank you so much!
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = make_classification(n_samples=50,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           random_state=0)

class GridSeachWithCoef(GridSearchCV):
    @property
    def coef_(self):
        return self.best_estimator_.coef_

solvers = ['lbfgs', 'liblinear']
penalty = ['l1', 'l2']
c_values = np.logspace(-4, 4, 20)

param_grid = [
    {'penalty': penalty,
     'C': c_values,
     'solver': solvers}
]

GS = GridSeachWithCoef(LogisticRegression(), param_grid=param_grid, cv=3, verbose=True, n_jobs=-1)

min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(
    estimator=GS, cv=3, scoring="accuracy",
    min_features_to_select=min_features_to_select
)
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
(The code above was adapted from other people on the forum. Thank you for your code.)
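One way to see whether the inner grid search really is re-run on every elimination step is to count its fit calls. A minimal sketch, where the VerboseGridSearch subclass and rfecv_check object are purely illustrative names, not part of the original code:
class VerboseGridSearch(GridSeachWithCoef):
    # Illustrative subclass: reports each time RFECV refits the inner grid search.
    def fit(self, X, y=None, **fit_params):
        print("Grid search re-fitted on %d feature(s)" % X.shape[1])
        return super().fit(X, y, **fit_params)

rfecv_check = RFECV(
    estimator=VerboseGridSearch(LogisticRegression(), param_grid=param_grid,
                                cv=3, n_jobs=-1),
    cv=3, scoring="accuracy"
)
rfecv_check.fit(X, y)
# One line per elimination step (and per CV split) would indicate that the
# hyper-parameters are tuned again on each reduced feature set.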

Different results from a regression model fitted on the same database with different versions of XGBoost

I have written a Python script to do some regression work with XGBoost, but when I run it on two different computers with two different versions of XGBoost and Python, the results are drastically different. My code is long, but I would like to show some parts of it here. The parts I am presenting are the hyperparameter tuning using the xgb.cv() command, and fitting and prediction with scikit-learn's XGBRegressor using the optimized parameters obtained from that tuning. The parameters that will be tuned are stored in the following dictionary with initial arbitrary values:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import preprocessing
from model_functions import GaussRankScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt
import xgboost as xgb
from xgboost import XGBRegressor
import shap
import operator
from sklearn.model_selection import GridSearchCV
import joblib
import plotly.graph_objs as go
import scipy as sp
import seaborn as sns
from numpy import asarray
from sklearn.svm import SVR
from scipy.stats import ttest_ind
from sklearn.impute import SimpleImputer
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
import scipy.stats as stats
from yellowbrick.regressor import residuals_plot, ResidualsPlot
import sys
from scipy.stats import pearsonr
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'learning_rate': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'booster': 'gbtree',
    'nthread': -1,
    'validate_parameters': 'True',
    'alpha': 0.2,
    'lambda': 0.001,
    'colsample_bylevel': 0.9,
    'verbose': 0,
    'gamma': 0.01,
    'max_delta_step': 0.1,
    'silent': 0
}
The parameter tuning is done as described below using a series of for loops. In each loop, two of the parameters are tuned together, except for the learning rate and gamma, which are optimized individually. Each pair of parameters is optimized in its own for loop, and the parameter dictionary is updated with the best value found at the end of each loop. The loops are similar; the only difference between them is which parameters are optimized. xgb.cv() is used for the cross-validation part of the process, and RMSE is the evaluation metric used to choose the best value for each parameter. The following is the loop responsible for optimizing the learning rate (a.k.a. eta); a sketch of one of the paired-parameter loops is shown after it.
df_x = dfnum.iloc[:, :-1]
df_y = dfnum.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(df_x, df_y,
                                                     test_size=0.1,
                                                     random_state=42)

# Converting the features' distributions to normal distributions
gauss_scaler = GaussRankScaler()
X_trainnum = gauss_scaler.fit_transform(X_train)
X_testnum = gauss_scaler.transform(X_test)

# Scaling all the features to be between 0 and 1
scaler = preprocessing.MinMaxScaler()
X_trainnum = scaler.fit_transform(X_trainnum)
X_testnum = scaler.transform(X_testnum)

num_boost_round = 999
dtrain = xgb.DMatrix(X_trainnum, label=y_train)
dtest = xgb.DMatrix(X_testnum, label=y_test)

min_rmse = float("Inf")
best_params = None

for learning_rate in [.3, .2, .1, .05, .01, .005]:
    params['learning_rate'] = learning_rate
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=3,
        metrics=['rmse'],
        early_stopping_rounds=10
    )
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = learning_rate

print('')
print("Best parameter: learning_rate = {}, RMSE: {}".format(best_params, min_rmse))
print('')
params['learning_rate'] = best_params
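For completeness, here is a sketch of what one of the paired-parameter loops described above might look like, following the same pattern as the learning-rate loop. The max_depth / min_child_weight pair and the value ranges are assumptions for illustration, since the original loops are not shown:
import itertools

min_rmse = float("Inf")
best_pair = None

# Illustrative value ranges; the original ranges are not shown in the question.
for max_depth, min_child_weight in itertools.product([3, 6, 9], [1, 3, 5]):
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=3,
        metrics=['rmse'],
        early_stopping_rounds=10
    )
    mean_rmse = cv_results['test-rmse-mean'].min()
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_pair = (max_depth, min_child_weight)

params['max_depth'], params['min_child_weight'] = best_pair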
After all the parameters are tuned in this fashion, the updated dictionary of optimized parameters is passed to XGBRegressor and the model is fitted on the dataset at hand:
print('Fitting the model')
best_model = XGBRegressor(**params, early_stopping_rounds=10, num_boost_round=999)
best_model.fit(X_trainnum, y_train)
joblib.dump(best_model, 'best_model_grid')
y_pred = best_model.predict(X_testnum)
y_pred1 = best_model.predict(X_trainnum)
I am using Python and XGBoost through Anaconda on both of my machines (my personal laptop and my office PC). The XGBoost version on my laptop is 0.90 and the Python version is 3.7.10. My office PC, on the other hand, runs Python 3.8.11 and XGBoost 1.42.
When I run my code on my personal laptop with the older versions of Python and XGBoost, it runs smoothly without any warnings or errors. However, when it is run on my office PC with the newer versions of Python and XGBoost, at each step of the for loops containing the xgb.cv() command that do the hyperparameter tuning, I receive this warning message:
[13:38:44] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573:
Parameters: { "early_stoppage" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
The warning messages then change to:
Hyperparameter tuning.
[13:38:47] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573:
Parameters: { "silent", "verbose" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
And finally, when the model is fitted with XGBRegressor, it changes to:
[15:08:00] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573:
Parameters: { "early_stopping_rounds", "num_boost_round", "silent", "verbose" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
The results obtained with the older versions of XGBoost and Python are much better than those obtained with the newer versions; the difference is very significant. The database I work with consists of 11 numerical features and a numerical target feature.
I have researched and browsed this website and other sources and sought help from a number of data analysts, but unfortunately I have not been able to find a solution or a reason for this problem.
I would be really thankful if someone could help me with this issue.
I will focus on this snippet of code:
best_model = XGBRegressor(**params, early_stopping_rounds=10, num_boost_round=999)
The correct version should be:
# Removed `verbose`, `eval_metric`, and `silent`. Replaced `nthread` with `n_jobs` and `learning_rate` with `eta`.
# Use the "reg:squarederror" objective since you are doing regression rather than classification.
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'booster': 'gbtree',
    'n_jobs': 10,
    'validate_parameters': 'True',
    'alpha': 0.2,
    'lambda': 0.001,
    'colsample_bylevel': 0.9,
    'gamma': 0.01,
    'max_delta_step': 0.1,
}
# Notice the `n_estimators` parameter.
model = XGBRegressor(**params, n_estimators=999)

# Pass `early_stopping_rounds`, `verbose`, and `eval_metric` here instead.
# `eval_metric` is set to `rmse` since this is regression rather than classification.
# `eval_set` is added since evaluation data is needed for early stopping.
model.fit(
    X,
    y,
    early_stopping_rounds=10,
    verbose=True,
    eval_metric="rmse",
    eval_set=[(X, y)],
)
You can find the documentation for the estimator here: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

Optimizing learning rate and number of estimators for multioutput gradient boosting

I have a dataset with multiple outputs and am trying to use gradient boosting to predict all the values at once. I imported MultiOutputRegressor so multiple outputs can be predicted at once; I'm able to make it work for the default gradient boosting function. However, I'm running into an error when I try to optimize the gradient boosting function for each output.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import ensemble
params = {'max_depth': 3, 'n_estimators': 100, 'learning_rate': 0.1}
gradient_regressor = MultiOutputRegressor(ensemble.GradientBoostingRegressor(**params))
GradBoostModel = gradient_regressor.fit(X_train, y_train)
prediction_GradBoost = GradBoostModel.predict(X_test)
LR = {'learning_rate':[0.15, 0.125, 0.1, 0.75, 0.05], 'n_estimators':[50, 75, 100, 150, 200, 250, 300, 400]}
tuning = GridSearchCV(estimator = GradBoostModel, param_grid = LR, scoring = 'r2')
tuning.fit(X_train, y_train)
tuning.best_params_, tuning.best_score_
I'm trying to use GridSearchCV to cycle through the listed learning rates and number of estimators to find the optimal values. But, I get the following error:
Invalid parameter learning_rate for estimator MultiOutputRegressor.
Check the list of available parameters with `estimator.get_params().keys()`
I think I understand the reason for the error: when I try to optimize the gradient boosting parameters, they are passed through the MultiOutputRegressor, which doesn't recognize them. Is this the case? Also, how can I change my code, such that I can optimize these parameters for each output?
Indeed, the params are prefixed with estimator__. In general, to find out what params to use downstream in your pipeline, use the .get_params().keys() method on your model, e.g.:
print(GradBoostModel.get_params().keys())
dict_keys(['estimator__alpha', 'estimator__ccp_alpha', 'estimator__criterion', 'estimator__init', 'estimator__learning_rate',...
Full working example with the linnerud dataset:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
# Data
rng = np.random.RandomState(0)
X, y = load_linnerud(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
# Model
params = {'max_depth': 3, 'n_estimators': 100, 'learning_rate': 0.1}
gradient_regressor = MultiOutputRegressor(GradientBoostingRegressor(**params))
GradBoostModel = gradient_regressor.fit(X_train, y_train)
prediction_GradBoost = GradBoostModel.predict(X_test)
LR = {'estimator__learning_rate': [0.15, 0.125, 0.1, 0.75, 0.05], 'estimator__n_estimators': [50, 75, 100, 150, 200, 250, 300, 400]}
print('Params from GradBoostModel', GradBoostModel.get_params().keys())
tuning = GridSearchCV(estimator=GradBoostModel, param_grid=LR, scoring='r2')
tuning.fit(X_train, y_train)
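As a follow-up usage note (not part of the original answer), the tuned values and the refit model can then be inspected and used in the usual way:
# Best parameter combination found for the wrapped GradientBoostingRegressor
print(tuning.best_params_)
print(tuning.best_score_)

# GridSearchCV refits on the full training data by default, so the tuned
# multi-output model can be used for prediction directly.
prediction_tuned = tuning.predict(X_test)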

Ensuring right order of operations in random forest classification in scikit learn

I would like to ensure that the order of operations for my machine learning is right:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold

# 1. Initialize model
model = RandomForestClassifier(5000)

# 2. Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 3. Remove unimportant features
model = SelectFromModel(model, threshold=0.5).estimator

# 4. Cross-validate the model on the important features
k_fold = KFold(n_splits=10, shuffle=True)
for k, (train, test) in enumerate(k_fold.split(X)):
    model.fit(X[train], y[train])

# 5. Grid search for the best parameters
param_grid = {
    'n_estimators': [1000, 2500, 5000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [3, 5, X.shape[1]]
}
gs = GridSearchCV(estimator=model, param_grid=param_grid)
gs.fit(X, y)
model = gs.best_estimator_
# Now the model can be used for prediction
Please let me know if this order looks good or if something can be done to improve it.
-- EDIT, clarifying to reduce downvotes.
Specifically:
1. Should the SelectFromModel step be done after cross-validation?
2. Should the grid search be done before cross-validation?
The main problem with your approach is that you are confusing the feature selection transformer with the final estimator. What you need to do is create two stages, the transformer first:
rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
Then you need a second phase where you train a classifier on the reduced feature set.
clf = RandomForestClassifier(5000)
Once you have your phases, you can build a pipeline to combine the two into a final model.
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('fs', feat_selection),
    ('clf', clf),
])
Now you can perform a GridSearch on your model. Keep in mind you have two stages, so the parameters must be specified by stage fs or clf. In terms of the feature selection stage, you can also access the base estimator using fs__estimator. Below is an example of how to search parameters on any of the three objects.
params = {
    'fs__threshold': [0.5, 0.3, 0.7],
    'fs__estimator__max_features': ['auto', 'sqrt', 'log2'],
    'clf__max_features': ['auto', 'sqrt', 'log2'],
}
gs = GridSearchCV(model, params, ...)
gs.fit(X, y)
You can then make predictions with gs directly or using gs.best_estimator_.
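A short usage sketch, assuming the grid search above has been fitted (the attribute names are standard scikit-learn):
# Predict with the refit pipeline (feature selection followed by the classifier)
y_pred = gs.predict(X)

# Inspect which features the tuned SelectFromModel stage kept
best_pipeline = gs.best_estimator_
selected_mask = best_pipeline.named_steps['fs'].get_support()
print("Selected feature mask:", selected_mask)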
