My dataset already has weighted examples. In this binary classification task, I also have far more examples of the first class than of the second.
Can I use sample_weight and additionally re-weight it with class_weight in the model.fit() function?
Or do I first have to build a new array of new_weights and pass that to the fit function as sample_weight?
Edit:
To further clarify: I already have individual weights for each sample in my dataset, and to add to the complexity, the total sum of sample weights of the first class is far larger than the total sum of sample weights of the second class.
For example I currently have:
y = [0,0,0,0,1,1]
sample_weights = [0.01,0.03,0.05,0.02, 0.01,0.02]
so the sum of weights for class '0' is 0.11 and for class '1' is 0.03. So I should have:
class_weight = {0 : 1. , 1: 0.11/0.03}
I need to use both sample_weight AND class_weight features. If one overrides the other then I will have to create new sample_weights and then use fit() or train_on_batch().
So my question is, can I use both, or does one override the other?
You can certainly do both if you want; the question is whether that is what you need. According to the Keras docs:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data [...].
So given that you mention that you "have far more of the first class compared to the second", I think you should go for the class_weight parameter. There you can indicate the ratio your dataset presents, compensating for the imbalanced classes. sample_weight is more for when you want to define a weight or importance for each individual data element.
For example if you pass:
class_weight = {0 : 1. , 1: 50.}
you will be saying that every sample from class 1 counts as 50 samples from class 0, therefore giving more "importance" to your elements from class 1 (as you surely have fewer of those samples). You can customize this to fit your own needs. More info on imbalanced datasets in this great question.
Note: To further compare the two parameters, keep in mind that passing class_weight as {0:1., 1:50.} would be equivalent to passing sample_weight as [1.,1.,1.,...,50.,50.,...], given you had samples whose classes were [0,0,0,...,1,1,...].
As we can see, it is more practical to use class_weight in this case, while sample_weight is useful in more specific cases where you actually want to give an "importance" to each sample individually. Using both can also be done if the case requires it, but one has to keep their cumulative effect in mind.
Edit: As per your new question, digging into the Keras source code it seems that sample_weight does indeed override class_weight; here is the piece of code that does it in the _standardize_weights method (line 499):
if sample_weight is not None:
    # ...does some error handling...
    return sample_weight  # simply returns the weights you passed
elif isinstance(class_weight, dict):
    # ...some error handling and computations...
    # then creates an array repeating the class weight to match your target classes
    weights = np.asarray([class_weight[cls] for cls in y_classes
                          if cls in class_weight])
    # ...more error handling...
    return weights
This means that you can only use one or the other, but not both. Therefore you will indeed need to multiply your sample_weights by the ratio you need to compensate for the imbalance.
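For example, a minimal sketch of doing that multiplication by hand (the numbers come from the question; model and x_train are assumed to be defined elsewhere):
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])
sample_weights = np.array([0.01, 0.03, 0.05, 0.02, 0.01, 0.02])
class_weight = {0: 1.0, 1: 0.11 / 0.03}

# fold the class ratio into the per-sample weights
new_weights = sample_weights * np.array([class_weight[c] for c in y])
model.fit(x_train, y, sample_weight=new_weights)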
Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:
Everything gets normalized to a single sample-wise (or timestep-wise)
weight array. If both sample_weights and class_weights are provided,
the weights are multiplied together.
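So on versions after that change you could, presumably, pass both directly to fit() and let Keras multiply them together per sample; a sketch reusing the arrays from above:
model.fit(x_train, y,
          sample_weight=sample_weights,
          class_weight={0: 1.0, 1: 0.11 / 0.03})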
To add a little to DarkCygnus' answer, for those who actually need to use class weights and sample weights simultaneously:
Here is some code that I use for generating sample weights when classifying multiclass temporal data in sequences:
(targets is an array of dimension [#temporal, #categories] with values in set(#classes), and class_weights is an array of [#categories, #classes].)
The generated sequence has the same length as the targets array; a common use case in batching is to pad both the targets and the sample weights with zeros up to the same size, thus making the network ignore the padded data.
import numpy as np

def multiclass_temporal_class_weights(targets, class_weights):
    s_weights = np.ones((targets.shape[0],))
    # if we are counting the classes, the weights do not exist yet!
    if class_weights is not None:
        for i in range(len(s_weights)):
            weight = 0.0
            for itarget, target in enumerate(targets[i]):
                weight += class_weights[itarget][int(round(target))]
            s_weights[i] = weight
    return s_weights
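Illustrative usage of the helper above (all numbers are made up; model and x are assumed to be defined elsewhere):
targets = np.array([[0, 1], [1, 0], [0, 0]])      # [#temporal, #categories]
class_weights = np.array([[1.0, 5.0],             # [#categories, #classes]
                          [1.0, 3.0]])
s_weights = multiclass_temporal_class_weights(targets, class_weights)
model.fit(x, targets, sample_weight=s_weights)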
This is about understanding how, internally, the probabilities for a class are predicted by LightGBM.
Other packages, like sklearn, provide thorough detail for their classifiers. For example:
LogisticRegression returns:
Probability estimates.
The returned estimates for all classes are ordered by the label of
classes.
For a multi_class problem, if multi_class is set to be “multinomial”
the softmax function is used to find the predicted probability of each
class. Else use a one-vs-rest approach, i.e calculate the probability
of each class assuming it to be positive using the logistic function.
and normalize these values across all the classes.
RandomForest returns:
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as
the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the
same class in a leaf.
There are other Stack Overflow questions which provide further details, such as for:
Support Vector Machines
Multilayer Perceptron
I am trying to uncover those same details for LightGBM's predict_proba function. The documentation does not list the details of how the probabilities are calculated.
The documentation simply states:
Return the predicted probability for each class for each sample.
The source code is below:
def predict_proba(self, X, raw_score=False, start_iteration=0, num_iteration=None,
pred_leaf=False, pred_contrib=False, **kwargs):
"""Return the predicted probability for each class for each sample.
Parameters
----------
X : array-like or sparse matrix of shape = [n_samples, n_features]
Input features matrix.
raw_score : bool, optional (default=False)
Whether to predict raw scores.
start_iteration : int, optional (default=0)
Start index of the iteration to predict.
If <= 0, starts from the first iteration.
num_iteration : int or None, optional (default=None)
Total number of iterations used in the prediction.
If None, if the best iteration exists and start_iteration <= 0, the best iteration is used;
otherwise, all iterations from ``start_iteration`` are used (no limits).
If <= 0, all iterations from ``start_iteration`` are used (no limits).
pred_leaf : bool, optional (default=False)
Whether to predict leaf index.
pred_contrib : bool, optional (default=False)
Whether to predict feature contributions.
.. note::
If you want to get more explanations for your model's predictions using SHAP values,
like SHAP interaction values,
you can install the shap package (https://github.com/slundberg/shap).
Note that unlike the shap package, with ``pred_contrib`` we return a matrix with an extra
column, where the last column is the expected value.
**kwargs
Other parameters for the prediction.
Returns
-------
predicted_probability : array-like of shape = [n_samples, n_classes]
The predicted probability for each class for each sample.
X_leaves : array-like of shape = [n_samples, n_trees * n_classes]
If ``pred_leaf=True``, the predicted leaf of every tree for each sample.
X_SHAP_values : array-like of shape = [n_samples, (n_features + 1) * n_classes] or list with n_classes length of such objects
If ``pred_contrib=True``, the feature contributions for each sample.
"""
result = super(LGBMClassifier, self).predict(X, raw_score, start_iteration, num_iteration,
pred_leaf, pred_contrib, **kwargs)
if callable(self._objective) and not (raw_score or pred_leaf or pred_contrib):
warnings.warn("Cannot compute class probabilities or labels "
"due to the usage of customized objective function.\n"
"Returning raw scores instead.")
return result
elif self._n_classes > 2 or raw_score or pred_leaf or pred_contrib:
return result
else:
return np.vstack((1. - result, result)).transpose()
How can I understand how exactly the predict_proba function for LightGBM is working internally?
LightGBM, like all gradient boosting methods for classification, essentially combines decision trees and logistic regression. We start with the same logistic function representing the probabilities (the softmax being its multi-class generalization):
P(y = 1 | X) = 1/(1 + exp(-Xw))
The interesting twist is that the feature matrix X is composed of the terminal nodes of a decision tree ensemble. These are all then weighted by w, a parameter that must be learned. The mechanism used to learn the weights depends on the precise learning algorithm, and so does the construction of X. LightGBM, for example, introduced two novel features that won it the performance improvements over XGBoost: "Gradient-based One-Side Sampling" and "Exclusive Feature Bundling". Generally, though, each row of X collects the terminal leaves reached by one sample and the columns represent the terminal leaves.
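For the binary case this can be checked directly against the source shown above; a minimal sketch, assuming an already fitted LGBMClassifier clf and a feature matrix X (both illustrative):
import numpy as np

raw = clf.predict_proba(X, raw_score=True)   # raw ensemble score (log-odds)
p1 = 1.0 / (1.0 + np.exp(-raw))              # logistic function applied to it
proba = np.vstack((1.0 - p1, p1)).T          # should match clf.predict_proba(X)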
So here is what the docs could say...
Probability estimates.
The predicted class probabilities of an input sample are computed as the
softmax of the weighted terminal leaves from the decision tree ensemble corresponding to the provided sample.
For further details, you'd have to delve into the details of boosting, XGBoost, and finally the LightGBM paper, but that seems a bit heavy-handed given the other documentation examples you've given.
Short Explanation
Below we can see an illustration of what each method is calling under the hood. First, the predict_proba() method of the class LGBMClassifier is calling the predict() method from LGBMModel (it inherits from it).
LGBMClassifier.predict_proba() (inherits from LGBMModel)
|---->LGBMModel().predict() (calls LightGBM Booster)
|---->Booster.predict()
Then, it calls the predict() method from the LightGBM Booster (the Booster class). In order to call this method, the Booster should be trained first.
Basically, the Booster is the one that generates the predicted value for each sample by calling its predict() method. See below for a detailed follow-up on how this Booster works.
Detailed Explanation, or: How does the LightGBM Booster work?
We seek to answer the question "how does the LightGBM Booster work?". By going through the Python code we can get a general idea of how it is trained and updated. But there are further references to the C++ libraries of LightGBM that I'm not in a position to explain. Still, a general glimpse of LightGBM's Booster workflow can be given.
A. Initializing and Training the Booster
The _Booster of LGBMModel is initialized by calling the train() function; on line 595 of sklearn.py we see the following code:
self._Booster = train(params, train_set,
self.n_estimators, valid_sets=valid_sets, valid_names=eval_names,
early_stopping_rounds=early_stopping_rounds,
evals_result=evals_result, fobj=self._fobj, feval=feval,
verbose_eval=verbose, feature_name=feature_name,
callbacks=callbacks, init_model=init_model)
Note. train() comes from engine.py.
Inside train() we see that the Booster is initialized (line 231)
# construct booster
try:
booster = Booster(params=params, train_set=train_set)
...
and updated at every training iteration (line 242).
for i in range_(init_iteration, init_iteration + num_boost_round):
...
...
booster.update(fobj=fobj)
...
B. How does booster.update() work?
To understand how the update() method works we should go to line 2315 of basic.py. Here, we see that this function updates the Booster for one iteration.
There are two alternatives for updating the booster, depending on whether or not you provide an objective function.
Objective Function is None
On line 2367 we get to the following code
if fobj is None:
...
...
_safe_call(_LIB.LGBM_BoosterUpdateOneIter(
self.handle,
ctypes.byref(is_finished)))
self.__is_predicted_cur_iter = [False for _ in range_(self.__num_dataset)]
return is_finished.value == 1
Notice that since the objective function (fobj) is not provided, the booster is updated by calling LGBM_BoosterUpdateOneIter from _LIB. In short, _LIB is the loaded C++ LightGBM library.
What is _LIB?
_LIB is a variable that stores the loaded LightGBM library by calling _load_lib() (line 29 of basic.py).
Then _load_lib() loads the LightGBM library by finding the path on your system to lib_lightgbm.dll (Windows) or lib_lightgbm.so (Linux).
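Roughly, that loading boils down to the following (a simplified sketch of what basic.py does, not the verbatim source):
import ctypes
from lightgbm.libpath import find_lib_path  # LightGBM helper that locates lib_lightgbm.*

lib_path = find_lib_path()                   # list of candidate library file paths
_LIB = ctypes.cdll.LoadLibrary(lib_path[0])  # load the first match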
Objective Function provided
When a custom objective function is provided, we get to the following case
else:
...
...
grad, hess = fobj(self.__inner_predict(0), self.train_set)
where __inner_predict() is a method of LightGBM's Booster (see line 1930 of basic.py for more details of the Booster class), which predicts for training and validation data. Inside __inner_predict() (line 3142 of basic.py) we see that it calls LGBM_BoosterGetPredict from _LIB to get the predictions, that is,
_safe_call(_LIB.LGBM_BoosterGetPredict(
self.handle,
ctypes.c_int(data_idx),
ctypes.byref(tmp_out_len),
data_ptr))
Finally, after being updated range_(init_iteration, init_iteration + num_boost_round) times, the booster is trained. Thus, Booster.predict() can be called by LGBMClassifier.predict_proba().
Note. The booster is trained as part of the model fitting step, specifically by LGBMModel.fit(); see line 595 of sklearn.py for code details.
I've set up a neural network regression model using Keras with one target. This works fine,
now I'd like to include multiple targets. The dataset includes a total of 30 targets, and I'd rather train one neural network instead of 30 different ones.
My problem is that in the preprocessing of the data I have to remove some target values, for a given example, as they represent unphysical values that are not to be predicted.
This creates the issue that I have a varying number of targets/outputs.
For example:
Targets =
None, 0.007798, 0.012522
0.261140, 2110.000000, 2440.000000
0.048799, None, None
How would I go about creating a keras.Sequential model (or one built with the functional API) with a varying number of outputs for a given input?
Edit: Could I perhaps first train a classification model that predicts the number of outputs given some test inputs, and then vary the number of outputs in the output layer according to this prediction? I guess I would have to use the functional API for something like that.
The "classification" edit here is unnecessary, i.e. ignore it. The number of outputs of the test targets is a known quantity.
(Sorry, I don't have enough reputation to comment)
First, do you know up front whether some of the output values will be invalid, or is predicting which outputs will actually be valid part of the problem?
If you don't know up front which outputs to disregard, you could go with something like the 2-step approach you described in your comment.
If it is deterministic (and you know how) which outputs will be valid for any given input, and your problem is just how to set up a proper model, here's how I would do that in keras (a sketch putting these steps together follows the list):
Use the functional API
Create 30 named output layers (e.g. out_0, out_1, ... out_29)
When creating the model, just use the outputs argument to list all 30 outputs
When compiling the model, specify a loss for each separate output, you can do this by passing a dictionary to the loss argument where the keys are the names of your output layers and the values are the respective losses
Assuming you'll use mean-squared error for all outputs, the dictionary will look something like {'out_0': 'mse', 'out_1': 'mse', ..., 'out_29': 'mse'}
When passing inputs to the model, pass three things per input: x, y, loss-weights
y has to be a dictionary where the key is the output layer name and the value is the target output value
The loss-weights are also a dictionary in the same format as y. The weights in your case can just be binary, 1 for each output that corresponds to a real value, 0 for each output that corresponds to unphysical values (so they are disregarded during training) for any given sample
Don't pass Nones for the unphysical targets; use some kind of numeric filler, otherwise you'll get issues. It is completely irrelevant what you use as filler since it will not affect the gradients during training
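Putting the steps above together, a minimal sketch (the number of input features, layer sizes, and variable names are illustrative, not from the question; the per-sample loss-weights are passed via Keras's sample_weight argument):
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_out = 30
inputs = keras.Input(shape=(10,))
h = layers.Dense(64, activation='relu')(inputs)
outputs = [layers.Dense(1, name='out_%d' % i)(h) for i in range(n_out)]

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam',
              loss={'out_%d' % i: 'mse' for i in range(n_out)})

# y and the per-sample loss weights are dicts keyed by output name; a weight
# of 0 masks an unphysical target (its y entry can hold any numeric filler)
x = np.random.rand(4, 10)
y = {'out_%d' % i: np.zeros((4, 1)) for i in range(n_out)}
w = {'out_%d' % i: np.ones((4,)) for i in range(n_out)}
w['out_0'][0] = 0.0  # the first sample's out_0 target is unphysical
model.fit(x, y, sample_weight=w, epochs=1)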
This will give you a trainable model. BUT once you move on from training and try to predict on new data, YOU will have to decide which outputs to disregard for each sample; the network will likely still give you "valid"-looking outputs for those.
One possible solution would be to have a separate output of "validity flags" which takes values in the range from zero to one. For example, your first target would become
y=[0.0, 0.007798, 0.012522]
yf=[0.0, 1.0, 1.0]
where zeros indicate invalid values.
Use sigmoid activation function for yf.
Loss function can be the sum of losses for y and yf.
During inference, analyze the network's yf output and only consider a y value valid if the corresponding yf exceeds the 0.5 threshold.
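A minimal sketch of this idea (input size, layer sizes, and names are illustrative):
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(10,))
h = layers.Dense(64, activation='relu')(inputs)
y_out = layers.Dense(3, name='y')(h)                          # target values
yf_out = layers.Dense(3, activation='sigmoid', name='yf')(h)  # validity flags

model = keras.Model(inputs, [y_out, yf_out])
# the total loss is the sum of the y loss and the yf loss
model.compile(optimizer='adam',
              loss={'y': 'mse', 'yf': 'binary_crossentropy'})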
I'm working on a binary semantic segmentation task where the distribution of one class is very small across any input image, so there are only a few pixels labeled with it. When using sparse_softmax_cross_entropy,
the overall error is easily decreased by ignoring this class. Now I'm looking for a way to weight the classes by a coefficient which penalizes misclassifications of this specific class more heavily than those of the other class.
The doc of the loss function states:
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weights is a tensor of shape [batch_size], then the loss weights apply to each corresponding sample.
If I understand this correctly, it says that specific samples in a batch get weighted differently compared to others. But this is actually not what I'm looking for. Does anyone know how to implement a weighted version of this loss function where the weights scale the importance of a specific class rather than of individual samples?
To answer my own question:
The authors of the U-Net paper used a pre-computed weight-map to handle imbalanced classes.
The Institute for Astronomy of ETH Zurich provided a Tensorflow-based U-Net package which contains a weighted version of the softmax cross-entropy (not sparse; they flatten their labels and logits first):
class_weights = tf.constant(np.array(class_weights, dtype=np.float32))
weight_map = tf.multiply(flat_labels, class_weights)
weight_map = tf.reduce_sum(weight_map, axis=1)
loss_map = tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits, labels=flat_labels)
weighted_loss = tf.multiply(loss_map, weight_map)
loss = tf.reduce_mean(weighted_loss)
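For the sparse variant asked about in the question, one possibility (a sketch, not taken from the cited package; flat_labels_sparse stands for the flattened integer label map, and the weights are made up) is to look the per-pixel weight up from the labels with tf.gather:
import tensorflow as tf

class_weights = tf.constant([1.0, 50.0])  # background, rare class
weight_map = tf.gather(class_weights, flat_labels_sparse)
loss_map = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=flat_labels_sparse, logits=flat_logits)
loss = tf.reduce_mean(loss_map * weight_map)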
First of all, let me describe my question and situation.
I want to do multi-label classification in chainer and my class imbalance problem is very serious.
In such cases I must slice the vector in order to calculate the loss function. For example, in multi-label classification most elements of the ground truth label vector are 0 and only a few of them are 1. In this situation, directly applying F.sigmoid_cross_entropy to all the 0/1 elements may keep training from converging, so I decided to use an a[[xx,xxx,...,xxx]] slice (a is the chainer.Variable output by the last FC layer) to pick specific elements for calculating the loss function.
In this case, because the label imbalance may give the rare class low classification performance, I want to give the rare ground-truth label variables a high loss weight during back propagation, and give the major label variables (which occur far too often in the ground truth) a low weight during back propagation.
How should I do it? What is your suggestion for training on an imbalanced multi-label classification problem in chainer?
You can use sigmoid_cross_entropy() in no-reduce mode (by passing reduce='no') to obtain a loss value at each spatial location, and the average() function for weighted averaging.
sigmoid_cross_entropy() first computes the loss value at each spatial location and for each data item along the batch dimension, and then takes the mean or sum over the spatial and batch dimensions (depending on the normalize option). You can disable the reduction part by passing reduce='no'. If you want to do a weighted average, you should specify this so that you get the loss value at each location and can reduce it yourself.
After that, the simplest way to do the weighted averaging manually is with average(), which accepts a weights argument indicating the weights for averaging. It first does a weighted summation of the input using the weights, and then divides the result by the sum of the weights. You can build an appropriate weight array with the same shape as the input and pass it to average() along with the raw (unreduced) loss values obtained from sigmoid_cross_entropy(..., reduce='no'). It is also fine to manually multiply by a weight array and take the sum, like F.sum(score * weight), if weight is appropriately scaled (e.g., summing up to 1).
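A minimal sketch of this recipe (toy shapes; the 50x weight for the rare positive labels is made up):
import numpy as np
import chainer.functions as F

logits = np.random.randn(4, 5).astype(np.float32)        # output of the last FC layer
labels = (np.random.rand(4, 5) < 0.1).astype(np.int32)   # sparse multi-hot ground truth

loss_map = F.sigmoid_cross_entropy(logits, labels, reduce='no')  # per-element loss
weight = np.where(labels == 1, 50.0, 1.0).astype(np.float32)     # up-weight rare positives
loss = F.average(loss_map, weights=weight)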
If you work on multi-label classification, how about using the softmax_cross_entropy loss?
softmax_cross_entropy can take the class imbalance into account via the class_weight argument.
https://github.com/chainer/chainer/blob/v3.0.0rc1/chainer/functions/loss/softmax_cross_entropy.py#L57
https://docs.chainer.org/en/stable/reference/generated/chainer.functions.softmax_cross_entropy.html
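Illustrative use of that argument (the weights and shapes are made up):
import numpy as np
import chainer.functions as F

logits = np.random.randn(4, 2).astype(np.float32)
t = np.array([0, 0, 0, 1], dtype=np.int32)
class_weight = np.array([1.0, 50.0], dtype=np.float32)  # up-weight the rare class
loss = F.softmax_cross_entropy(logits, t, class_weight=class_weight)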
I want to weight the training data based on a column in the training data set. Thereby giving more importance to certain training items than others. The weighting column should not be included as a feature for the input layer.
The Tensorflow documentation contains an example of how to use the label of the item to assign a custom loss and thereby assign weights:
# Ensures that the loss for examples whose ground truth class is `3` is 5x
# higher than the loss for all other examples.
weight = tf.multiply(4, tf.cast(tf.equal(labels, 3), tf.float32)) + 1
onehot_labels = tf.one_hot(labels, num_classes=5)
tf.contrib.losses.softmax_cross_entropy(logits, onehot_labels, weight=weight)
I am using this in a custom DNN with three hidden layers. In theory I simply need to replace labels in the example above with a tensor containing the weight column.
I am aware that there are several threads that already discuss similar problems e.g. defined loss function in tensorflow?
For some reason I am running into a lot of problems trying to bring my weight column in. It's probably just two easy lines of code, or maybe there is an easier way to achieve the same result.
I believe I found the answer:
weight_tf = features[:, -1]  # the last column of the features tensor holds the weights
loss = tf.losses.softmax_cross_entropy(target, logits, weights=weight_tf)
The weight is the last column in the features tensor.