I'm building a logistic regression model to predict a binary target. I want to try different values for several parameters via the param_grid argument, to find the combination that gives the best fit. This is my code:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

logModel = LogisticRegression(C=1, penalty='l1', solver='liblinear')

Grid_params = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],  # smaller C means stronger regularization
    'solver': ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'],
    'max_iter': [50, 100, 200, 500, 1000, 2500]
}

clf = GridSearchCV(logModel, param_grid=Grid_params, cv=10, verbose=True, n_jobs=-1, error_score='raise')
clf_fitted = clf.fit(X_train, Y_train)
And this is where I get the error. I have already read that some solvers don't work with l1 and some don't work with l2. How can I set up the param_grid in this case?
I also tried a plain logModel = LogisticRegression(), but that didn't work either.
Full error:
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
GridSearchCV accepts a list of dicts for exactly this purpose. Given that you need to include the solvers in the grid, you should be able to do something like this:
Grid_params = [
    # saga supports all penalties (elasticnet additionally needs an l1_ratio value)
    {'solver': ['saga'],
     'penalty': ['elasticnet', 'l1', 'l2', 'none'],
     'max_iter': [50, 100, 200, 500, 1000, 2500],
     'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    # newton-cg and lbfgs only support l2 or no penalty
    {'solver': ['newton-cg', 'lbfgs'],
     'penalty': ['l2', 'none'],
     'max_iter': [50, 100, 200, 500, 1000, 2500],
     'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    # add more parameter sets as needed...
]
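For reference, a minimal sketch of how the list of dicts plugs into the rest of your code (clf, X_train and Y_train as in the question); note that if you keep 'elasticnet' in the saga dict you would also have to supply an 'l1_ratio' entry there, otherwise LogisticRegression will complain:

clf = GridSearchCV(LogisticRegression(), param_grid=Grid_params,
                   cv=10, verbose=True, n_jobs=-1, error_score='raise')
clf_fitted = clf.fit(X_train, Y_train)

print(clf_fitted.best_params_)  # the winning solver/penalty/C/max_iter combination
print(clf_fitted.best_score_)   # its mean cross-validated score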
I am working on a project with a dataset of aircraft engines and their lifetimes. I need to use XGBRegressor and get the best possible score on my validation data.
I am having trouble understanding the XGBRegressor documentation, and I was wondering if you know how I could automate the search for optimal parameters instead of testing everything by hand.
I attached a part of my code related to XGB.
from xgboost import XGBRegressor

xgb = XGBRegressor(learning_rate=0.3, max_depth=7, n_estimators=230,
                   subsample=0.7, colsample_bylevel=0.7, colsample_bytree=0.7,
                   min_child_weight=4, reg_alpha=10, reg_lambda=10)
xgb.fit(X_train, y_train)
The following should help you achieve this; you can add more hyperparameters or more estimators to test different approaches. If you set cv=5, it will do 5-fold cross-validation; but if you have one specific validation split and only want results on it, you can build the index pair yourself and pass it as cv:
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(len(X_train))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
cv_indices = [(train_idx, test_idx)]
Otherwise use cv=5 to do 5-fold CV while searching for parameters.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

dict_classifiers = {
    "XGB": XGBRegressor()
}

params = {
    "XGB": {'min_child_weight': [1, 5, 10],
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': [3, 4, 5],
            'n_estimators': [300, 600],
            'learning_rate': [0.001, 0.01, 0.1],
            }
}

# iterate over the estimators that have a matching parameter grid
for classifier_name in dict_classifiers.keys() & params:
    print("training: ", classifier_name)
    gridSearch = GridSearchCV(estimator=dict_classifiers[classifier_name],
                              param_grid=params[classifier_name],
                              cv=cv_indices)
    gridSearch.fit(X_train,                  # should have shape (n_samples, n_features)
                   y_train.reshape((-1)))    # should be an array with shape (n_samples,)
    print(gridSearch.best_score_, gridSearch.best_params_)
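Note that this particular grid already has 3 * 5 * 3 * 3 * 3 * 2 * 3 = 2430 candidates, each of which is fit once per CV split. If that gets too slow, a minimal sketch (reusing the same dict_classifiers, params and cv_indices as above) that samples a fixed number of random combinations instead of trying them all:

from sklearn.model_selection import RandomizedSearchCV

for classifier_name in dict_classifiers.keys() & params:
    print("training: ", classifier_name)
    randomSearch = RandomizedSearchCV(estimator=dict_classifiers[classifier_name],
                                      param_distributions=params[classifier_name],
                                      n_iter=50,  # try 50 random combinations instead of all 2430
                                      cv=cv_indices,
                                      random_state=42)
    randomSearch.fit(X_train, y_train.reshape((-1)))
    print(randomSearch.best_score_, randomSearch.best_params_)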
I've been running a RandomizedSearchCV in sklearn with LightGBM in SageMaker, but when I run the fit line, it only displays one message that says Fitting 3 folds for each of 100 candidates, totalling 300 fits and nothing more; there are no messages showing the progress or metrics.
Here's the code that I am using:
fit_params = {# 'boosting_type': 'gbdt',
              # 'objective': 'binary',
              'eval_metric': 'auc',
              'eval_set': [(X_test, y_test)],
              'eval_names': ['valid'],
              'verbose': 60,
              # 'is_unbalance': True,
              # 'n_estimators': 10000,
              'early_stopping_rounds': 500,
              'categorical_feature': 'auto'}
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

param_test = {'num_leaves': sp_randint(6, 50),
              'max_depth': sp_randint(3, 9),
              'min_child_samples': sp_randint(150, 600),
              'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
              'learning_rate': sp_uniform(loc=0.001, scale=0.01),
              'subsample': sp_uniform(loc=0.2, scale=0.8),
              'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
              'reg_alpha': [0, 1e-1, 1e-2, 5e-2, 0.5, 0.25, 1, 2, 5, 7, 10, 50, 100],
              'reg_lambda': [0, 1e-1, 1e-2, 5e-2, 0.5, 0.25, 1, 5, 10, 20, 50, 100],
              'class_weight': [None, 'balanced']
              }

n_HP_points_to_test = 100

clf1 = lgb.LGBMClassifier(boosting_type='gbdt', metric='None', objective='binary',
                          is_unbalance=True, n_estimators=10000, random_state=314, n_jobs=40)
gs1 = RandomizedSearchCV(
estimator=clf1, param_distributions=param_test,
n_iter=n_HP_points_to_test,
scoring='roc_auc',
cv=3,
refit=True,
random_state=314,
n_jobs=30,
verbose=100)
And the final line to fit and launch the Search:
gs1.fit(X_train, y_train, **fit_params)
I've read other questions that say the output is printed to the log terminal, but I can't find it in SageMaker or on my local machine.
Has any of you come across this situation when searching for the optimal parameters? Do you have another piece of code to run this GridSearchCV or RandomizedSearchCV that works for you and LightGBM?
I forgot to mention that the fitting never finishes. I even added a line to write best_params_ to a .txt file, and it never appeared; besides, all the cores I used (about 40) were still busy in htop, but 15 hours for this process seems excessive.
Thank you so much in advance!!
I have 2 regressors:
import lightgbm as lgb
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
params = {
    'num_leaves': [7, 14, 21, 28, 31, 50],
    'learning_rate': [0.1, 0.03, 0.003],
    'max_depth': [-1, 3, 5],
    'n_estimators': [50, 100, 200, 500],
}
grid = GridSearchCV(lgb.LGBMRegressor(random_state=0), params, scoring='r2', cv=5)
grid.fit(X_train, y_train)
reg = lgb.LGBMRegressor(random_state=0)
reg.fit(X_train, y_train)
As you see, I defined a random_state for both regressors. GridSearchCV is supposed to find the best params for the estimator to increase its score. But:
r2_score(y_train, grid.predict(X_train)) # output is 0.69
r2_score(y_train, reg.predict(X_train)) # output is 0.84
So, how can I find the best params for LGBMRegressor?
Based on the documentation, after calling grid.fit() you can find the best estimator (the refitted, ready-to-use model) and the best params here:
grid.best_estimator_
grid.best_params_
FYI: random_state only fixes the random components (shuffling, for example).
In your case the two models end up with different params, so the results of your R2 metric differ accordingly.
So, I believe you would have to script it like this:
params = {
    'num_leaves': [7, 14, 21, 28, 31, 50],
    'learning_rate': [0.1, 0.03, 0.003],
    'max_depth': [-1, 3, 5],
    'n_estimators': [50, 100, 200, 500],
}

grid = GridSearchCV(lgb.LGBMRegressor(random_state=0), params, scoring='r2', cv=5)
grid.fit(X_train, y_train)

reg = lgb.LGBMRegressor(random_state=0)
reg.fit(X_train, y_train)

lgbm_tuned = grid.best_estimator_
r2_tuned = grid.best_score_          # best cross-validated R2 found by the grid search
r2_regular = r2_score(y_train, reg.predict(X_train))
where r2_tuned is the best cross-validated score found by the grid search, lgbm_tuned is your model refitted with the best parameters, and r2_regular is your training score with default parameters.
It is odd to find a worse result after a grid search, especially when the grid includes the default parameters for LightGBM. Keep in mind, though, that GridSearchCV selects parameters by cross-validated score, so the tuned model can legitimately score lower than the default one on the training data itself while still generalizing better.
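To compare the two models on equal footing, a minimal sketch (assuming the same X_train, y_train, grid and reg as above) is to score both with cross-validation rather than on the training data:

from sklearn.model_selection import cross_val_score

# cross-validated R2 of the default model
cv_r2_default = cross_val_score(lgb.LGBMRegressor(random_state=0),
                                X_train, y_train, scoring='r2', cv=5).mean()

# cross-validated R2 of the tuned model (same split strategy)
cv_r2_tuned = cross_val_score(grid.best_estimator_,
                              X_train, y_train, scoring='r2', cv=5).mean()

print(cv_r2_default, cv_r2_tuned)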
I have a dataset with the following dimensions for training and testing sets:
X_train = (58149, 9)
y_train = (58149,)
X_test = (24921, 9)
y_test = (24921,)
The code that I have for RandomizedSearchCV using LightGBM classifier is as follows:
import lightgbm as lgb
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import RandomizedSearchCV

# Parameters to be used for RandomizedSearchCV-
rs_params = {
    # 'bagging_fraction': [0.6, 0.66, 0.7],
    'bagging_fraction': sp_uniform(0.5, 0.8),
    'bagging_frequency': sp_randint(5, 8),
    # 'feature_fraction': [0.6, 0.66, 0.7],
    'feature_fraction': sp_uniform(0.5, 0.8),
    'max_depth': sp_randint(10, 13),
    'min_data_in_leaf': sp_randint(90, 120),
    'num_leaves': sp_randint(1200, 1550)
}
# Initialize a RandomizedSearchCV object using 5-fold CV-
rs_cv = RandomizedSearchCV(estimator=lgb.LGBMClassifier(), param_distributions=rs_params, cv = 5, n_iter=100)
# Train on training data-
rs_cv.fit(X_train, y_train)
When I execute this code, it gives me the following error:
LightGBMError: Check failed: bagging_fraction <=1.0 at
/__w/1/s/python-package/compile/src/io/config_auto.cpp, line 295.
Any idea as to what's going wrong?
I removed sp_uniform and sp_randint from your code and it works well:
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

# build a small random dataset just to demonstrate the search
np.random.seed(0)
d1 = np.random.randint(2, size=(100, 9))
d2 = np.random.randint(3, size=(100, 9))
d3 = np.random.randint(4, size=(100, 9))
Y = np.random.randint(7, size=(100,))
X = np.column_stack([d1, d2, d3])

rs_params = {
    'bagging_fraction': (0.5, 0.8),
    # note: LightGBM's canonical name for this parameter is bagging_freq, so 'bagging_frequency' is likely ignored
    'bagging_frequency': (5, 8),
    'feature_fraction': (0.5, 0.8),
    'max_depth': (10, 13),
    'min_data_in_leaf': (90, 120),
    'num_leaves': (1200, 1550)
}

# Initialize a RandomizedSearchCV object using 5-fold CV-
rs_cv = RandomizedSearchCV(estimator=lgb.LGBMClassifier(), param_distributions=rs_params,
                           cv=5, n_iter=100, verbose=1)

# Train on training data-
rs_cv.fit(X, Y, verbose=1)
And according to the documentation, bagging_fraction must satisfy 0.0 < bagging_fraction <= 1.0. The original error comes from sp_uniform(0.5, 0.8), which is scipy's uniform(loc=0.5, scale=0.8) and therefore samples values in [0.5, 1.3], so any draw above 1.0 violates that constraint.
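If you would rather keep sampling from continuous distributions, a minimal sketch (everything else in rs_params unchanged) is to choose loc and scale so that loc + scale stays at or below 1.0:

from scipy.stats import uniform as sp_uniform

rs_params_continuous = {
    # uniform(loc, scale) samples from [loc, loc + scale], so these stay within (0.5, 1.0]
    'bagging_fraction': sp_uniform(loc=0.5, scale=0.5),
    'feature_fraction': sp_uniform(loc=0.5, scale=0.5),
}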
Add verbose=1 so that you can see the individual fits of your model; verbose controls how much progress information is printed.