There are two ways of using LightGBM.
The first method is the scikit-learn wrapper:
model=lgb.LGBMClassifier()
model.fit(X,y)
model.predict_proba(values)
With this I get a predict_proba method to predict probabilities.
The second method is the native API:
import lightgbm as lgb
dset = lgb.Dataset(X, label=y)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
model = lgb.train(params, dset, 100)
Here I get a predict method, but predict_proba doesn't exist. Can anyone advise what I did wrong?
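You did nothing wrong. predict_proba exists only on the scikit-learn wrapper (LGBMClassifier). The native lgb.train() returns a Booster, which only has predict(). With objective 'binary', that predict() already returns the probability of the positive class, not a hard label.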
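If the two-column layout of predict_proba is needed from the native API, something like the following sketch may help. It assumes a binary objective, where the native predict() returns a 1-D array of positive-class probabilities. The helper name to_proba_matrix and the sample values are hypothetical, used only for illustration.

```python
import numpy as np

def to_proba_matrix(p_positive):
    """Stack [P(class 0), P(class 1)] columns from positive-class probabilities."""
    p = np.asarray(p_positive, dtype=float)
    return np.column_stack([1.0 - p, p])

# Hypothetical stand-in for the array returned by model.predict(values)
p_positive = np.array([0.1, 0.8, 0.5])
proba = to_proba_matrix(p_positive)
```

Each row of proba then sums to 1, with the second column holding the same values the native predict() produced.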
I'm trying to find the best estimator using GridSearchCV, and I'm using refit=True, which is the default. Given that the documentation states:
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance
Should I do .fit on the training data afterwards as such:
classifier = GridSearchCV(estimator=model,param_grid = parameter_grid['param_grid'], scoring='balanced_accuracy', cv = 5, verbose=3, n_jobs=4,return_train_score=True, refit=True)
classifier.fit(x_training, y_train_encoded_local)
predictions = classifier.predict(x_testing)
balanced_error = balanced_accuracy_score(y_true=y_test_encoded_local,y_pred=predictions)
Or should I do it like this instead:
classifier = GridSearchCV(estimator=model,param_grid = parameter_grid['param_grid'], scoring='balanced_accuracy', cv = 5, verbose=3, n_jobs=4,return_train_score=True, refit=True)
predictions = classifier.predict(x_testing)
balanced_error = balanced_accuracy_score(y_true=y_test_encoded_local,y_pred=predictions)
You should do it like your first version. You always need to call classifier.fit; otherwise nothing is trained and predict will fail on the unfitted object. refit=True means that, once cross-validation is done, the best parameter combination is retrained on the entire training set.
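A minimal sketch of that flow, using a synthetic dataset and a LogisticRegression with a small, made-up parameter grid as stand-ins for the asker's model and parameter_grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for x_training / x_testing
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid, not the asker's
    scoring="balanced_accuracy",
    cv=5,
    refit=True,
)
search.fit(X_train, y_train)          # runs CV, then refits the best params on all of X_train
predictions = search.predict(X_test)  # delegates to search.best_estimator_
score = balanced_accuracy_score(y_true=y_test, y_pred=predictions)
```

Note that the value returned by balanced_accuracy_score is an accuracy (higher is better), so a name like balanced_error may be misleading.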
I am using sklearn's OneVsOneClassifier in a pipeline like so:
smt = SMOTE(random_state=42)
base_model = LogisticRegression()
pipeline = Pipeline([('sampler', smt), ('model', base_model)])
classifier = OneVsOneClassifier(estimator=pipeline)
classifier.fit(X_train, y_train)
# prediction
yhat = classifier.predict(X_test)
But then I cannot do:
yhat_prob = classifier.predict_proba(X_test)
AttributeError: 'OneVsOneClassifier' object has no attribute 'predict_proba'
scikit-learn's OneVsRestClassifier does provide a predict_proba method. I am surprised OneVsOneClassifier doesn't have it.
How do I then get class probability estimates from my pipeline above?
It's not clear how to use OvO to get probabilities, so it's not implemented. https://github.com/scikit-learn/scikit-learn/issues/6164
There is the decision_function method for a more nuanced version of predict.
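As a rough illustration of both options, the sketch below fits a OneVsOneClassifier and reads its decision_function scores, then fits a OneVsRestClassifier to get actual probabilities. It uses the iris dataset and a plain LogisticRegression as stand-ins; the SMOTE step from the question is left out to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# OvO: per-class scores aggregated from the pairwise classifiers (not probabilities)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
scores = ovo.decision_function(X)               # shape (n_samples, n_classes)
ovo_pred = ovo.classes_[scores.argmax(axis=1)]  # highest score wins

# OvR: exposes predict_proba when the base estimator does
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
proba = ovr.predict_proba(X)
```

The OvO scores are useful for ranking classes per sample, but they should not be read as calibrated probabilities.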
Issue
I am trying to use the XGBoost classifier and define a custom metric that uses the F-beta score, but negative predicted values are passed to the custom function, and after np.round I end up with three values: [-1, 0, 1].
Question
How do I solve this issue, as I want to know about the f_beta_score during the training eval?
Custom performance metric function
def fbeta_func(y_predicted, y_true):
f = fbeta_score(list(y_test),np.round(y_predicted),beta=0.5)
return 'f_beta', f
And then I am calling this function in the fit function:
from xgboost import XGBClassifier
clf = XGBClassifier(**param)
clf.fit(X_train, y_train,
eval_set=[(X_test, y_test)],
eval_metric=fbeta_func,
verbose=True)
Here are the parameters:
Hi, I need help calibrating probabilities in LightGBM.
Below is my code:
cv_results = lgb.cv(params,
lgtrain,
nfold=10,
stratified=False ,
num_boost_round = num_rounds,
verbose_eval=10,
early_stopping_rounds = 50,
seed = 50)
best_nrounds = cv_results.shape[0] - 1
lgb_clf = lgb.train(params,
lgtrain,
num_boost_round=10000 ,
valid_sets=[lgtrain,lgvalid],
early_stopping_rounds=50,
verbose_eval=10)
ypred = lgb_clf.predict(test, num_iteration=lgb_clf.best_iteration)
I am not sure about LightGBM, but in the case of XGBoost, the most straightforward way to calibrate the probabilities is to use CalibratedClassifierCV from sklearn.
You can find it here - https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
The only catch is that CalibratedClassifierCV only takes sklearn-compatible estimators as input, so you might have to use the sklearn wrapper for XGBoost instead of the native XGBoost API's .train function.
You can find the XGBoost's sklearn wrapper here - https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier
I hope it answers your question.
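A minimal sketch of that approach is below. It uses scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in so the example depends on scikit-learn only; the assumption is that an sklearn-wrapper model such as lgb.LGBMClassifier() or xgboost.XGBClassifier() could be passed in the same position.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in estimator; an sklearn-wrapper boosting model would go here instead
base = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Fit the base model on CV folds and learn an isotonic mapping on the held-out parts
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
calibrated.fit(X_train, y_train)
calibrated_proba = calibrated.predict_proba(X_test)[:, 1]
```

method="sigmoid" (Platt scaling) is the other built-in option and may be the safer choice when the calibration data is small.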
I'm working on a predictive model using XGBoost (latest version on PyPI: 0.6) in Python, and have been developing it by training on about half of my data. Now that I have my final model, I trained it on all my data, but got this message, which I've never seen before:
Tree method is automatically selected to be 'approx' for faster speed.
to use old behavior(exact greedy algorithm on single machine), set
tree_method to 'exact'
As a reproducible example, the following code also produces that message on my machine:
import numpy as np
import xgboost as xgb
rows = 10**7
cols = 20
X = np.random.randint(0, 100, (rows, cols))
y = np.random.randint(0,2, size=rows)
clf = xgb.XGBClassifier(max_depth=5)
clf.fit(X,y)
I've tried setting tree_method to 'exact' in both the initialization and fit() steps of my model, but each throws errors:
import xgboost as xgb
clf = xgb.XGBClassifier(tree_method = 'exact')
clf
> __init__() got an unexpected keyword argument 'tree_method'
my_pipeline.fit(X_train, Y_train, clf__tree_method='exact')
> self._final_estimator.fit(Xt, y, **fit_params) TypeError: fit() got an
> unexpected keyword argument 'tree_method'
How can I specify tree_method='exact' with XGBoost in Python?
According to the XGBoost parameter documentation, this is because the default for tree_method is "auto". The "auto" setting is data-dependent: for "small-to-medium" data, it will use the "exact" approach and for "very-large" datasets, it will use "approximate". When you started to use your whole training set (instead of 50%), you must have crossed the training-size threshold that changes the auto-value for tree_method. It's unclear from the docs how many observations are required to reach that threshold, but it seems that it's somewhere between 5 and 10 million rows (since you have rows = 10**7).
I don't know if the tree_method argument is exposed in the XGBoost Python module (it sounds like it's not, so maybe file a bug report?), but tree_method is exposed in the R API.
The docs describe why you see that warning message:
It is still not implemented in the scikit-learn API for XGBoost.
Hence the code example below uses the native API, where tree_method is passed in the parameter dictionary.
import xgboost as xgb
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits(n_class=2)
X = digits['data']
y = digits['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)
param = {'objective': 'binary:logistic',
'tree_method':'hist',
'grow_policy':"lossguide",
'eval_metric': 'auc'}
res = {}
bst = xgb.train(param, dtrain, 10, [(dtrain, 'train'), (dtest, 'test')], evals_result=res)
You can use the GPU from the sklearn API in XGBoost. You can use it like this:
import xgboost
xgb = xgboost.XGBClassifier(n_estimators=200, tree_method='gpu_hist', predictor='gpu_predictor')
xgb.fit(X_train, y_train)
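You can use different tree methods. Refer to the documentation to choose the most appropriate method for your needs. Note that XGBoost 2.0 and later deprecate 'gpu_hist' and the predictor parameter in favor of tree_method='hist' together with device='cuda'.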