This is in reference to understanding, internally, how the probabilities for a class are predicted using LightGBM.
Other packages, like sklearn, provide thorough detail for their classifiers. For example:
LogisticRegression returns:
Probability estimates.
The returned estimates for all classes are ordered by the label of
classes.
For a multi_class problem, if multi_class is set to be “multinomial”
the softmax function is used to find the predicted probability of each
class. Else use a one-vs-rest approach, i.e calculate the probability
of each class assuming it to be positive using the logistic function.
and normalize these values across all the classes.
RandomForest returns:
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as
the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the
same class in a leaf.
There are other Stack Overflow questions that provide further details, such as for:
Support Vector Machines
Multilayer Perceptron
I am trying to uncover those same details for LightGBM's predict_proba function. The documentation does not list the details of how the probabilities are calculated.
The documentation simply states:
Return the predicted probability for each class for each sample.
The source code is below:
def predict_proba(self, X, raw_score=False, start_iteration=0, num_iteration=None,
                  pred_leaf=False, pred_contrib=False, **kwargs):
    """Return the predicted probability for each class for each sample.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        Input features matrix.
    raw_score : bool, optional (default=False)
        Whether to predict raw scores.
    start_iteration : int, optional (default=0)
        Start index of the iteration to predict.
        If <= 0, starts from the first iteration.
    num_iteration : int or None, optional (default=None)
        Total number of iterations used in the prediction.
        If None, if the best iteration exists and start_iteration <= 0, the best iteration is used;
        otherwise, all iterations from ``start_iteration`` are used (no limits).
        If <= 0, all iterations from ``start_iteration`` are used (no limits).
    pred_leaf : bool, optional (default=False)
        Whether to predict leaf index.
    pred_contrib : bool, optional (default=False)
        Whether to predict feature contributions.

        .. note::

            If you want to get more explanations for your model's predictions using SHAP values,
            like SHAP interaction values,
            you can install the shap package (https://github.com/slundberg/shap).
            Note that unlike the shap package, with ``pred_contrib`` we return a matrix with an extra
            column, where the last column is the expected value.
    **kwargs
        Other parameters for the prediction.

    Returns
    -------
    predicted_probability : array-like of shape = [n_samples, n_classes]
        The predicted probability for each class for each sample.
    X_leaves : array-like of shape = [n_samples, n_trees * n_classes]
        If ``pred_leaf=True``, the predicted leaf of every tree for each sample.
    X_SHAP_values : array-like of shape = [n_samples, (n_features + 1) * n_classes] or list with n_classes length of such objects
        If ``pred_contrib=True``, the feature contributions for each sample.
    """
    result = super(LGBMClassifier, self).predict(X, raw_score, start_iteration, num_iteration,
                                                 pred_leaf, pred_contrib, **kwargs)
    if callable(self._objective) and not (raw_score or pred_leaf or pred_contrib):
        warnings.warn("Cannot compute class probabilities or labels "
                      "due to the usage of customized objective function.\n"
                      "Returning raw scores instead.")
        return result
    elif self._n_classes > 2 or raw_score or pred_leaf or pred_contrib:
        return result
    else:
        return np.vstack((1. - result, result)).transpose()
How exactly does LightGBM's predict_proba function work internally?
LightGBM, like all gradient boosting methods for classification, essentially combines decision trees and logistic regression. We start with the same logistic function representing the probabilities (the two-class special case of the softmax):
P(y = 1 | X) = 1/(1 + exp(-Xw))
The interesting twist is that the feature matrix X is composed of the terminal nodes of a decision tree ensemble. These are all then weighted by w, a parameter that must be learned. The mechanism used to learn the weights depends on the precise learning algorithm used. Similarly, the construction of X also depends on the algorithm. LightGBM, for example, introduced two novel features which won it the performance improvements over XGBoost: "Gradient-based One-Side Sampling" and "Exclusive Feature Bundling". Generally though, each row collects the terminal leaves reached by a sample, and the columns represent those terminal leaves.
So here is what the docs could say...
Probability estimates.
The predicted class probabilities of an input sample are computed as the
softmax of the weighted terminal leaves from the decision tree ensemble corresponding to the provided sample.
For further detail, you'd have to delve into boosting, XGBoost, and finally the LightGBM paper, but that seems a bit heavy-handed given the other documentation examples you've cited.
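A quick way to convince yourself of this in the binary case is to compare predict_proba against the sigmoid of the raw scores. The following is only a minimal sketch with synthetic data (make_classification and the hyperparameters are made up here), assuming the default binary objective; the raw_score=True behaviour follows the predict_proba source quoted above.

import numpy as np
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LGBMClassifier(n_estimators=20, random_state=0).fit(X, y)

raw = clf.predict_proba(X, raw_score=True)   # summed leaf values of the ensemble
p1 = 1.0 / (1.0 + np.exp(-raw))              # sigmoid of the raw score
proba = clf.predict_proba(X)                 # columns: [P(class 0), P(class 1)]

print(np.allclose(proba[:, 1], p1))          # expected: True for the default binary objective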
Short Explanation
Below we can see an illustration of what each method is calling under the hood. First, the predict_proba() method of the class LGBMClassifier is calling the predict() method from LGBMModel (it inherits from it).
LGBMClassifier.predict_proba() (inherits from LGBMModel)
|---->LGBMModel().predict() (calls LightGBM Booster)
|---->Booster.predict()
Then, it calls the predict() method from the LightGBM Booster (the Booster class). In order to call this method, the Booster should be trained first.
Basically, the Booster is the one that generates the predicted value for each sample by calling its predict() method. See below for a detailed follow-up of how this Booster works.
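As a quick illustration of that call chain (a sketch, not LightGBM's own code: the synthetic data and hyperparameters are made up), the trained Booster is exposed on the fitted classifier as booster_, and for the default binary objective its predict() already returns P(class 1):

import numpy as np
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LGBMClassifier(n_estimators=20, random_state=0).fit(X, y)

booster = clf.booster_            # the underlying lightgbm.Booster trained by fit()
p1 = booster.predict(X)           # P(class 1) per sample for the default binary objective
proba = clf.predict_proba(X)      # stacks [1 - p1, p1] as in the source shown above

print(np.allclose(proba[:, 1], p1))   # expected: True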
Detailed Explanation, or: How does the LightGBM Booster work?
We seek to answer the question: how does the LightGBM Booster work? By going through the Python code we can get a general idea of how it is trained and updated, but there are some further references to LightGBM's C++ libraries that I'm not in a position to explain. However, a general glimpse of the Booster's workflow is given below.
A. Initializing and Training the Booster
The _Booster of LGBMModel is initialized by calling the train() function; on line 595 of sklearn.py we see the following code:
self._Booster = train(params, train_set,
                      self.n_estimators, valid_sets=valid_sets, valid_names=eval_names,
                      early_stopping_rounds=early_stopping_rounds,
                      evals_result=evals_result, fobj=self._fobj, feval=feval,
                      verbose_eval=verbose, feature_name=feature_name,
                      callbacks=callbacks, init_model=init_model)
Note. train() comes from engine.py.
Inside train() we see that the Booster is initialized (line 231)
# construct booster
try:
    booster = Booster(params=params, train_set=train_set)
    ...
and updated at every training iteration (line 242).
for i in range_(init_iteration, init_iteration + num_boost_round):
    ...
    ...
    booster.update(fobj=fobj)
    ...
B. How does booster.update() work?
To understand how the update() method works we should go to line 2315 of basic.py. Here, we see that this function updates the Booster for one iteration.
There are two alternatives for updating the Booster, depending on whether or not you provide an objective function.
Objective Function is None
On line 2367 we get to the following code
if fobj is None:
    ...
    ...
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
        self.handle,
        ctypes.byref(is_finished)))
    self.__is_predicted_cur_iter = [False for _ in range_(self.__num_dataset)]
    return is_finished.value == 1
Notice that since no objective function (fobj) is provided, it updates the Booster by calling LGBM_BoosterUpdateOneIter from _LIB. In short, _LIB is the loaded C++ LightGBM library.
What is _LIB?
_LIB is a variable that stores the loaded LightGBM library by calling _load_lib() (line 29 of basic.py).
Then _load_lib() loads the LightGBM library by finding the path on your system to lib_lightgbm.dll (Windows) or lib_lightgbm.so (Linux).
Objective Function provided
When a custom objective function is provided, we get to the following case
else:
    ...
    ...
    grad, hess = fobj(self.__inner_predict(0), self.train_set)
where __inner_predict() is a method of LightGBM's Booster (see line 1930 of basic.py for more details of the Booster class), which predicts for the training and validation data. Inside __inner_predict() (line 3142 of basic.py) we see that it calls LGBM_BoosterGetPredict from _LIB to get the predictions, that is,
_safe_call(_LIB.LGBM_BoosterGetPredict(
    self.handle,
    ctypes.c_int(data_idx),
    ctypes.byref(tmp_out_len),
    data_ptr))
Finally, after updating the Booster range_(init_iteration, init_iteration + num_boost_round) times, it is trained. Thus, Booster.predict() can be called by LGBMClassifier.predict_proba().
Note. The Booster is trained as part of the model fitting step, specifically by LGBMModel.fit(); see line 595 of sklearn.py for code details.
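For reference, here is a minimal sketch of what a custom objective passed as fobj could look like; it reproduces the default binary log-loss in the (grad, hess) form that booster.update(fobj=...) expects. The function name is made up, and preds are assumed to be raw scores, as for the built-in binary objective.

import numpy as np

def binary_logloss_objective(preds, train_data):
    # preds: raw scores (as returned by __inner_predict()); train_data: the lightgbm.Dataset
    y = train_data.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw score
    grad = p - y                        # first derivative of the log loss w.r.t. the raw score
    hess = p * (1.0 - p)                # second derivative
    return grad, hess

Note that, per the predict_proba source quoted at the top, using such a custom objective makes the classifier return raw scores with a warning instead of probabilities.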
Related
I trained a simple random forest classifier; then when I test the prediction with the same test input:
rf_clf.predict([[50,0,500,0,20,0,250000,1.5,110,0,0,2]])
rf_clf.predict_proba([[50,0,500,0,20,0,250000,1.5,110,0,0,2]])
The first line returns array([1.]), whereas the second line returns array([[0.14, 0.86]]) where the prediction is the first float 0.14 right?
How come those two don't match? I'm a bit confused. Thanks.
The predict() function returns the class to which the sample belongs, and the predict_proba() function returns the probability of the sample belonging to the different output classes.
Example:
The output of predict() tells you that the sample belongs to class 1, i.e. array([1.]).
The output of predict_proba() gives you the probabilities of the sample belonging to each output class, array([[0.14, 0.86]]): a 14% probability of belonging to class 0 and an 86% probability of belonging to class 1.
Refer Docs: predict() docs, predict_proba() docs
Take a look at the documentation part of sklearn.ensemble.RandomForestClassifier, specifically the predict_proba method.
Returns: ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs > 1. The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
The output you're getting (array([[0.14, 0.86]])) is thus the list of probabilities for each of the classes known to the model, for each input sample. The method predict() simply predicts one class for each input (which is why you're getting array([1.]) in return).
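To see the relationship between the two methods concretely, here is a small sketch (hypothetical data, not your exact model) showing that predict() is just the argmax over the columns of predict_proba():

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=12, random_state=0)
rf_clf = RandomForestClassifier(random_state=0).fit(X, y)

proba = rf_clf.predict_proba(X)                                  # shape (n_samples, n_classes)
labels_from_proba = rf_clf.classes_.take(np.argmax(proba, axis=1))
print(np.array_equal(labels_from_proba, rf_clf.predict(X)))      # expected: True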
I have been constructing my own Extra Trees (XT) classifier in Rust for binary classification. To verify the correctness of my classifier, I have been comparing it against Sklearn's implementation of XT, but I constantly get different results. I thought that there must be a bug in my code at first, but now I realize it's not a bug, but instead a different method of calculating votes amongst the different trees in the ensemble. In my code, each tree votes based on the most frequent classification in a leaf's subset of data. So for example, if we are traversing a tree and find ourselves at a leaf node that has 40 classifications of 0 and 60 classifications of 1, the tree classifies the data as 1.
Looking at Sklearn's documentation for XT (As seen here), I read the following line in regards to the predict method
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
While this gives me some idea about how individual trees vote, I still have more questions. Perhaps an exact mathematical expression of how these weights are calculated would help, but I have yet to find one in the documentation.
I will provide more details in the upcoming paragraphs, but I wish to ask my question concisely here. How are these weights calculated at a high level, what are the mathematics behind it? Is there a way to change how individual XT trees calculate their votes?
---------------------------------------- Additional Details -----------------------------------------------
For my current tests, this is how I build my classifier
classifier = ExtraTreesClassifier(n_estimators=5, criterion='gini',
                                  max_depth=1, max_features=5, random_state=0)
To predict unseen transactions X, I use classifier.predict(X). Digging through the source code of predict (seen here, line 630-ish), I see that this is all the code that executes for binary classification
proba = self.predict_proba(X)

if self.n_outputs_ == 1:
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
What this code is doing is relatively obvious to me. It merely determines the most likely classification of transactions by taking the argmax of proba. What I fail to understand is how this proba value is computed in the first place. I believe that the predict_proba method that predict uses is defined here at line 650-ish. Here is what I believe the relevant source code to be
check_is_fitted(self)
# Check data
X = self._validate_X_predict(X)

# Assign chunk of trees to jobs
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

# avoid storing the output of every estimator by summing them here
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in np.atleast_1d(self.n_classes_)]
lock = threading.Lock()
Parallel(n_jobs=n_jobs, verbose=self.verbose,
         **_joblib_parallel_args(require="sharedmem"))(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba,
                                    lock)
    for e in self.estimators_)

for proba in all_proba:
    proba /= len(self.estimators_)

if len(all_proba) == 1:
    return all_proba[0]
else:
    return all_proba
I fail to understand what exactly is being calculated here. This is where my trail goes a bit cold and I get confused, and find myself in need of help.
Trees can predict probability estimates, according to the training sample proportions in each leaf. In your example, the probability of class 0 is 0.4, and 0.6 for class 1.
Random forests and extremely random trees in sklearn perform soft voting: each tree predicts the class probabilities as above, and then the ensemble just averages those across trees. That produces a probability for each class, and then the predicted class is the one with the largest probability.
In the code, the relevant bit is _accumulate_prediction, which just sums the probability estimates across trees, followed by the division by the number of estimators.
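Here is a tiny numeric sketch of that soft-voting rule (the per-tree estimates are made up): each tree reports its leaf class proportions, the ensemble averages them, and the predicted class is the argmax of that average.

import numpy as np

# hypothetical probability estimates from 5 trees for a single sample, 2 classes
per_tree_proba = np.array([[0.4, 0.6],
                           [0.3, 0.7],
                           [0.55, 0.45],
                           [0.2, 0.8],
                           [0.5, 0.5]])

ensemble_proba = per_tree_proba.mean(axis=0)   # what predict_proba() returns
predicted_class = ensemble_proba.argmax()      # what predict() returns
print(ensemble_proba, predicted_class)         # [0.39 0.61] 1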
I'm working on a binary semantic segmentation task where the distribution of one class is very small across any input image, hence only a few pixels are labeled with it. When using sparse_softmax_cross_entropy, the overall error is easily decreased by ignoring this class. Now, I'm looking for a way to weight the classes by a coefficient which penalizes misclassifications of the specific class more heavily than those of the other class.
The doc of the loss function states:
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weights is a tensor of shape [batch_size], then the loss weights apply to each corresponding sample.
If I understand this correctly, it says that specific samples in a batch get weighted differently from others. But this is actually not what I'm looking for. Does anyone know how to implement a weighted version of this loss function where the weights scale the importance of a specific class rather than of individual samples?
To answer my own question:
The authors of the U-Net paper used a pre-computed weight-map to handle imbalanced classes.
The Institute for Astronomy of ETH Zurich provides a TensorFlow-based U-Net package which contains a weighted version of the softmax cross-entropy loss (not the sparse variant; they flatten their labels and logits first):
class_weights = tf.constant(np.array(class_weights, dtype=np.float32))
weight_map = tf.multiply(flat_labels, class_weights)
weight_map = tf.reduce_sum(weight_map, axis=1)
loss_map = tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits, labels=flat_labels)
weighted_loss = tf.multiply(loss_map, weight_map)
loss = tf.reduce_mean(weighted_loss)
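For context, here is a sketch (TF 1.x-style API; the shapes and class weights are made up) of how flat_labels and flat_logits could be produced from the usual [batch, height, width, n_classes] segmentation tensors before the snippet above:

import tensorflow as tf  # assumes TF 1.x, matching softmax_cross_entropy_with_logits_v2 above

n_classes = 2
class_weights = [1.0, 10.0]   # hypothetical: penalize the rare foreground class more

# one-hot labels and logits for an image batch: [batch, height, width, n_classes]
logits = tf.placeholder(tf.float32, [None, None, None, n_classes])
labels = tf.placeholder(tf.float32, [None, None, None, n_classes])

# flatten to [n_pixels, n_classes] so a per-pixel weight map can be built
flat_logits = tf.reshape(logits, [-1, n_classes])
flat_labels = tf.reshape(labels, [-1, n_classes])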
My dataset already has weighted examples. And in this binary classification I also have far more of the first class compared to the second.
Can I use both sample_weight and further re-weight it with class_weight in the model.fit() function?
Or do I first make a new array of new_weights and pass it to the fit function as sample_weight?
Edit:
To further clarify, I already have individual weights for each sample in my dataset, and to further add to the complexity, the total sum of sample weights of the first class is far more than the total sample weights of the second class.
For example I currently have:
y = [0,0,0,0,1,1]
sample_weights = [0.01,0.03,0.05,0.02, 0.01,0.02]
so the sum of weights for class '0' is 0.11 and for class '1' is 0.03. So I should have:
class_weight = {0 : 1. , 1: 0.11/0.03}
I need to use both sample_weight AND class_weight features. If one overrides the other then I will have to create new sample_weights and then use fit() or train_on_batch().
So my question is, can I use both, or does one override the other?
You can surely do both if you want; the question is whether that is what you need. According to the Keras docs:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data [...].
So given that you mention that you "have far more of the first class compared to the second", I think you should go for the class_weight parameter. There you can indicate the ratio your dataset presents, so you can compensate for imbalanced classes. sample_weight is for when you want to define a weight or importance for each individual data element.
For example if you pass:
class_weight = {0 : 1. , 1: 50.}
you will be saying that every sample from class 1 counts as 50 samples from class 0, therefore giving more "importance" to your elements from class 1 (as you surely have fewer of those samples). You can customize this to fit your own needs. More info on imbalanced datasets in this great question.
Note: To further compare both parameters, keep in mind that passing class_weight as {0: 1., 1: 50.} would be equivalent to passing sample_weight as [1., 1., 1., ..., 50., 50., ...], given samples whose classes were [0, 0, 0, ..., 1, 1, ...].
As we can see, it is more practical to use class_weight in this case, and sample_weight is of use in more specific cases where you actually want to give an "importance" to each sample individually. Using both can also be done if the case requires it, but one has to keep in mind the cumulative effect.
Edit: As per your updated question, digging into the Keras source code, it seems that sample_weights indeed overrides class_weights; here is the piece of code that does it in the _standardize_weights method (line 499):
if sample_weight is not None:
    # ...does some error handling...
    return sample_weight  # simply returns the weights you passed
elif isinstance(class_weight, dict):
    # ...some error handling and computations...
    # then creates an array repeating the class weight to match your target classes
    weights = np.asarray([class_weight[cls] for cls in y_classes
                          if cls in class_weight])
    # ...more error handling...
    return weights
This means that you can only use one or the other, but not both. Therefore you will indeed need to multiply your sample_weights by the ratio you need to compensate for the imbalance.
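Using the numbers from the question, folding class_weight into sample_weight by hand is a one-liner (a sketch only; pass the combined array as sample_weight to fit() or train_on_batch()):

import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])
sample_weights = np.array([0.01, 0.03, 0.05, 0.02, 0.01, 0.02])
class_weight = {0: 1.0, 1: 0.11 / 0.03}

combined = sample_weights * np.array([class_weight[c] for c in y])
# model.fit(X, y, sample_weight=combined)   # then pass only the combined weights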
Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:
Everything gets normalized to a single sample-wise (or timestep-wise)
weight array. If both sample_weights and class_weights are provided,
the weights are multiplied together.
To add a little to DarkCygnus' answer, for those who actually need to use class weights & sample weights simultaneously:
Here is code that I use for generating sample weights when classifying multiclass temporal data in sequences:
(targets is an array of dimension [#temporal, #categories] with values being in set(#classes), class_weights is an array of [#categories, #classes]).
The generated sequence has the same length as the targets array, and the common use case in batching is to pad the targets with zeros, and the sample weights as well, up to the same size, thus making the network ignore the padded data.
import numpy as np

def multiclass_temoral_class_weights(targets, class_weights):
    s_weights = np.ones((targets.shape[0],))
    # if we are counting the classes, the weights do not exist yet!
    if class_weights is not None:
        for i in range(len(s_weights)):
            weight = 0.0
            for itarget, target in enumerate(targets[i]):
                weight += class_weights[itarget][int(round(target))]
            s_weights[i] = weight
    return s_weights
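A small usage sketch with made-up shapes (4 timesteps, 2 categories, 3 classes) may help illustrate the expected inputs to the function above:

import numpy as np

targets = np.array([[0, 2],
                    [1, 1],
                    [2, 0],
                    [0, 0]])                  # [#temporal, #categories], values in {0, 1, 2}
class_weights = np.array([[1.0, 2.0, 4.0],    # class weights for category 0
                          [1.0, 3.0, 5.0]])   # class weights for category 1

print(multiclass_temoral_class_weights(targets, class_weights))   # [6. 5. 5. 2.]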
I am using sklearn.svm.SVC from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() internally calculates the probability?
Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing parameter vectors A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
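To make the formula concrete, here is a tiny numeric sketch reproducing the example above. The A and B values here are hypothetical; after fitting SVC(probability=True), LibSVM's fitted values are exposed as the probA_ and probB_ attributes (note that the exact sign conventions inside scikit-learn/LibSVM can differ).

import numpy as np

def platt_probability(f, A, B):
    # P(y=1 | X) = 1 / (1 + exp(A * f(X) + B)), with f(X) the signed distance (decision_function)
    return 1.0 / (1.0 + np.exp(A * f + B))

# the example from above: f(X) = 10, A = 1, B = -9.9
print(platt_probability(10.0, A=1.0, B=-9.9))   # ≈ 0.475, even though f(X) > 0 predicts positive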
Actually, I found a slightly different answer: they use this code to convert the decision value to a probability
double fApB = decision_value * A + B;
if (fApB >= 0)
    return Math.exp(-fApB) / (1.0 + Math.exp(-fApB));
else
    return 1.0 / (1 + Math.exp(fApB));
Here the A and B values can be found in the model file (probA and probB).
This offers a way to convert a probability back into a decision value, and thus to the hinge loss.
Use that ln(0) = -200.