Calibrating Probabilities in lightgbm or XGBoost - python

Hi, I need help with calibrating probabilities in LightGBM. Below is my code:
cv_results = lgb.cv(params,
                    lgtrain,
                    nfold=10,
                    stratified=False,
                    num_boost_round=num_rounds,
                    verbose_eval=10,
                    early_stopping_rounds=50,
                    seed=50)
best_nrounds = cv_results.shape[0] - 1
lgb_clf = lgb.train(params,
                    lgtrain,
                    num_boost_round=10000,
                    valid_sets=[lgtrain, lgvalid],
                    early_stopping_rounds=50,
                    verbose_eval=10)
ypred = lgb_clf.predict(test, num_iteration=lgb_clf.best_iteration)

I am not sure about LightGBM, but in the case of XGBoost, if you want to calibrate the probabilities, the best (and most probably the only) way is to use CalibratedClassifierCV from sklearn.
You can find it here - https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
The only catch is that CalibratedClassifierCV only accepts sklearn-compatible estimators as input, so you might have to use the sklearn wrapper for XGBoost (XGBClassifier) instead of the native API's .train function.
You can find the XGBoost's sklearn wrapper here - https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier
I hope it answers your question.
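As a rough illustration (not from the original answer), here is a minimal sketch of wrapping the sklearn-style XGBoost classifier in CalibratedClassifierCV; X_train, y_train and X_test stand in for your own splits, and the hyperparameters are placeholders:
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

# Placeholder base model; any sklearn-compatible classifier works here.
base_clf = XGBClassifier(n_estimators=200, learning_rate=0.05)

# method='sigmoid' (Platt scaling) or 'isotonic'; each CV fold fits the base
# model and then fits the calibrator on the held-out part.
calibrated_clf = CalibratedClassifierCV(base_clf, method='sigmoid', cv=5)
calibrated_clf.fit(X_train, y_train)

calibrated_probs = calibrated_clf.predict_proba(X_test)[:, 1]
The same pattern should work with lightgbm.LGBMClassifier, since it also follows the sklearn estimator interface.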

Related

lightgbm how to predict_proba?

If I use LightGBM, there are two methods of using it. First method:
model=lgb.LGBMClassifier()
model.fit(X,y)
model.predict_proba(values)
I can use the predict_proba method to predict probabilities.
If I use it natively:
import lightgbm as lgb
dset = lgb.Dataset(X, label=y)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
model = lgb.train(params, dset, 100)
I can use the predict method, but the predict_proba method does not exist with this approach. Can anyone advise what I did wrong?
predict_proba only exists on the sklearn interface (LGBMClassifier); the Booster returned by the native lgb.train() does not have it. However, with objective='binary' the Booster's predict() already returns the probability of the positive class.
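A minimal sketch of that, reusing the params, X, y and values names from the question:
import lightgbm as lgb

dset = lgb.Dataset(X, label=y)
model = lgb.train(params, dset, 100)

# For objective='binary' this is the probability of the positive class,
# roughly what LGBMClassifier's predict_proba(values)[:, 1] would give
# for the same parameters and number of rounds.
proba_pos = model.predict(values)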

Using Custom Metric for Score Method in XGBoost

I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of f1-score and roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod = XGBClassifier()
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question. To use eval_metric, you need to provide data to evaluate on via eval_set:
mod = XGBClassifier()
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
You can check the auc using evals_result(), and it gives the auc for every iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
The importance score is calculated based on the average gain across all splits the feature is used in (see the help page). From your question, I suppose you want the model to maximize auc, as in cross-validation, but you cannot use auc as an objective in xgboost: gradient boosting methods require a differentiable loss function.
With an imbalanced dataset, you can try adjusting the scale_pos_weight parameter to control the balance of positive and negative weights. This is discussed on the xgboost website.
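As a sketch (not part of the original answer; names follow the question's snippet), you can compute the metrics you care about directly from the predictions instead of relying on score(), and set scale_pos_weight from the class counts:
from sklearn.metrics import roc_auc_score, f1_score

# Rebalance classes: ratio of negatives to positives is a common heuristic.
mod = XGBClassifier(scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum())
mod.fit(X_train, y_train)

auc = roc_auc_score(y_test, mod.predict_proba(X_test)[:, 1])
f1 = f1_score(y_test, mod.predict(X_test))
Note that feature_importances_ is derived from the tree structure itself (controlled by the classifier's importance_type parameter), not from whichever metric you evaluate with.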

Using Multiple Metric Evaluation with GridSearchCV

I am attempting to use multiple metrics in GridSearchCV. My project needs multiple metrics including "accuracy" and "f1 score". However, after following the sklearn models and online posts, I can't seem to get mine to work. Here is my code:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
clf = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1,30), 'algorithm': ['auto','ball_tree','kd_tree', 'brute'], 'weights': ['uniform', 'distance'],'p': range(1,5)}
#Metrics for Evualation:
met_grid= ['accuracy', 'f1'] #The metric codes from sklearn
custom_knn = GridSearchCV(clf, param_grid, scoring=met_grid, refit='accuracy', return_train_score=True)
custom_knn.fit(X_train, y_train)
y_pred = custom_knn.predict(X_test)
My error occurs at custom_knn.fit(X_train, y_train). Furthermore, if you comment out scoring=met_grid, refit='accuracy', return_train_score=True, it works.
Here is my error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Also, if you could explain multiple metric evaluation or refer me to someone who can, that would be much appreciated!
Thanks
f1 is a binary classification metric. For multi-class classification, you have to use an averaged f1 such as 'f1_macro', 'f1_micro', or 'f1_weighted', depending on the aggregation you want. You can find the exhaustive list of scoring strings available in sklearn here.
Try this!
scoring = ['accuracy', 'f1_macro']
custom_knn = GridSearchCV(clf, param_grid, scoring=scoring,
                          refit='accuracy', return_train_score=True, cv=3)
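With multiple metrics, the per-metric scores then show up in cv_results_ under keys suffixed with the metric name, and refit='accuracy' decides which model best_estimator_ is refit with (a small sketch, continuing the snippet above):
custom_knn.fit(X_train, y_train)

# One set of columns per scoring entry.
mean_acc = custom_knn.cv_results_['mean_test_accuracy']
mean_f1 = custom_knn.cv_results_['mean_test_f1_macro']

print(custom_knn.best_params_)   # chosen according to the refit metric
y_pred = custom_knn.predict(X_test)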

Using cross_val_predict against test data set

I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(train_df.shape[0], random_state=1)
predictions = cross_val_predict(lr,train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
cross_val_score and cross_val_predict do not fit the estimator you pass in; they clone it and fit the clones internally, on the fly, for each fold. If you look at the documentation (section 3.1.1.1), you'll see that fit is never mentioned there.
As #DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[features_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross validation, which lets you determine the accuracy of your model. Take a look at sklearn's cross-validation page.
I am not sure the question was answered. I had a similar thought. I want to compare the results (accuracy, for example) with a method that does not apply CV. The cross-validated accuracy is computed on X_train and y_train, while the other method fits the model on X_train and y_train and tests on X_test and y_test, so the comparison is not fair since they are on different datasets.
What you can do is use the fitted estimators returned by cross_validate (one per fold):
from sklearn.model_selection import cross_validate
lr_fit = cross_validate(lr, train_df[features_columns], train_df["target"],
                        cv=kf, return_estimator=True)
# return_estimator=True returns one fitted estimator per fold
y_pred = lr_fit['estimator'][0].predict(test_df[features_columns])
accuracy = (y_pred == test_df["target"]).mean()
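If you would rather not depend on a single fold's model, one option (a sketch, not from the original answer) is to average the predicted probabilities of all the fold estimators:
import numpy as np

mean_proba = np.mean([est.predict_proba(test_df[features_columns])[:, 1]
                      for est in lr_fit['estimator']], axis=0)
y_pred = (mean_proba >= 0.5).astype(int)
accuracy = (y_pred == test_df["target"]).mean()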

Use both Recursive Feature Eliminiation and Grid Search in SciKit-Learn

I have a machine learning problem and want to optimize my SVC estimators as well as the feature selection.
For optimizing the SVC estimators I essentially use the code from the docs. Now my question is, how can I combine this with recursive feature elimination with cross-validation (RFECV)? That is, for each parameter combination I want to run RFECV in order to determine the best combination of parameters and features.
I tried the solution from this thread, but it yields the following error:
ValueError: Invalid parameter C for estimator RFECV. Check the list of available parameters with `estimator.get_params().keys()`.
My code looks like this:
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-4, 1e-3], 'C': [1, 10]},
                    {'kernel': ['linear'], 'C': [1, 10]}]
estimator = SVC(kernel="linear")
selector = RFECV(estimator, step=1, cv=3, scoring=None)
clf = GridSearchCV(selector, tuned_parameters, cv=3)
clf.fit(X_train, y_train)
The error appears at clf = GridSearchCV(selector, tuned_parameters, cv=3).
I would use a Pipeline, but there is a more adequate answer here:
Recursive feature elimination and grid search using scikit-learn
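For reference, a rough sketch of the approach from that thread (reusing the imports and data from the question): because the SVC is nested inside the RFECV, GridSearchCV has to address its parameters through the estimator__ prefix. Note that RFECV requires an estimator exposing coef_ or feature_importances_, so only the linear-kernel part of the original grid applies here:
tuned_parameters = [{'estimator__C': [1, 10]}]
selector = RFECV(SVC(kernel="linear"), step=1, cv=3)
clf = GridSearchCV(selector, tuned_parameters, cv=3)
clf.fit(X_train, y_train)

print(clf.best_params_)                 # best hyperparameters, prefixed with estimator__
print(clf.best_estimator_.n_features_)  # number of features RFECV kept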
