I know that TensorFlow/Keras provides stateful metrics which can be updated using metric.update_state(). I understand that this stateful updating is done by taking an average/mean via the MeanMetricWrapper class.
What should I do if I want to update a metric using a different operation, for example addition? Say I would like to accumulate the loss across all batches instead of averaging it, so that I can print the total loss over the entire epoch rather than the per-batch average.
I am more interested in solutions that can work seamlessly with model.train_on_batch(). Thank you.
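To make the question concrete, this is roughly the direction I am imagining: a minimal sketch that subclasses tf.keras.metrics.Metric to sum instead of average (the class name SumLoss is just my placeholder; I don't know whether this is the idiomatic way):

import tensorflow as tf

class SumLoss(tf.keras.metrics.Metric):
    """Accumulates (sums) per-batch values instead of averaging them."""
    def __init__(self, name="sum_loss", **kwargs):
        super().__init__(name=name, **kwargs)
        self.total = self.add_weight(name="total", initializer="zeros")

    def update_state(self, values, sample_weight=None):
        # Add the batch value to the running total instead of keeping a mean.
        self.total.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.total

    def reset_state(self):  # reset_states() in older TF versions
        self.total.assign(0.0)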
Background of the Problem
I want to explain the output of machine learning (ML) models using SHapley Additive exPlanations (SHAP), which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of Leave One Out Cross Validation (LOOCV), the ML model will be different, because in each iteration I am training on a different dataset (one participant's data is left out). The model will also be different because I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? I have checked several tutorials (e.g. this one, this one) and also several SO questions (e.g. this one), but I could not find an answer to this problem.
Thanks for reading!
Update
I know that in LOOCV, the model obtained in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply SHAP to each model there will be 250 outputs! Instead, I want a single output that summarizes the performance of the 250 models.
You seem to be training a model on 250 data points while doing LOOCV. That is about choosing a model, with hyperparameters, that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters -- note, 250-fold LOOCV is already overkill; would you do that with 250,000 rows? -- rather, you are trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still only an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparameters may differ, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
It doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
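If you do decide to pool or average the per-fold explanations, a rough sketch could look like the following (assumptions: folds is a hypothetical iterable yielding the train/test split of each LOOCV iteration, every fold keeps the same feature columns, and XGBRegressor is the model, as in your question):

import numpy as np
import shap
from xgboost import XGBRegressor

# Collect the SHAP values of the single held-out row of every LOOCV fold
# and stack them into one (n_participants, n_features) matrix.
all_shap_values = []
for X_train, y_train, X_test in folds:
    model = XGBRegressor().fit(X_train, y_train)
    explainer = shap.Explainer(model)
    all_shap_values.append(explainer(X_test).values)

shap_matrix = np.vstack(all_shap_values)
# One combined plot for all 250 folds instead of 250 separate outputs.
shap.summary_plot(shap_matrix, feature_names=X_train.columns)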
I'm training an image classification model with PyTorch Lightning and running on a machine with more than one GPU, so I use the distributed backend recommended for best performance, ddp (DistributedDataParallel). This naturally splits up the dataset, so each GPU will only ever see one part of the data.
However, for validation, I would like to compute metrics like accuracy on the entire validation set and not just on a part. How would I do that? I found some hints in the official documentation, but they do not work as expected or are confusing to me. What's happening is that validation_epoch_end is called num_gpus times with 1/num_gpus of the validation data each. I would like to aggregate all results and only run the validation_epoch_end once.
In this section they state that when using dp/ddp2 you can add an additional function that is called like this:
def validation_step(self, batch, batch_idx):
    loss, x, y, y_hat = self.step(batch)
    return {"val_loss": loss, "y": y, "y_hat": y_hat}

def validation_step_end(self, *args, **kwargs):
    # do something here, I'm not sure what,
    # as it gets called in ddp directly after validation_step with the exact same values
    return args[0]
However, the results are not being aggregated and validation_epoch_end is still called num_gpus times. Is this kind of behavior not available for ddp? Is there some other way to achieve this aggregation behavior?
training_epoch_end() and validation_epoch_end() receive data that is aggregated from all training / validation batches of the particular process. They simply receive a list of what you returned in each training or validation step.
When using the DDP backend, there's a separate process running for every GPU. There's no simple way to access the data that another process is processing, but there's a mechanism for synchronizing a particular tensor between the processes.
The easiest approach for computing a metric on the entire validation set is to calculate the metric in pieces and then synchronize the resulting tensor, for example by taking the average. self.log() calls will automatically synchronize the value between GPUs when you use sync_dist=True. How the value is synchronized is determined by the reduce_fx argument, which by default is torch.mean.
If you're happy with averaging the metric over batches too, you don't need to override training_epoch_end() or validation_epoch_end() — self.log() will do the averaging for you.
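For example, something along these lines (just a sketch, assuming a standard classification LightningModule; the metric names are illustrative):

import torch.nn.functional as F

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    acc = (y_hat.argmax(dim=1) == y).float().mean()
    # sync_dist=True averages the logged value across all DDP processes;
    # on_epoch=True additionally averages it over the batches of the epoch.
    self.log("val_loss", loss, on_epoch=True, sync_dist=True)
    self.log("val_acc", acc, on_epoch=True, sync_dist=True)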
If the metric cannot be calculated separately for each GPU and then averaged, it can get a bit more challenging. It's possible to update some state variables at each step, and then synchronize the state variables at the end of an epoch and calculate the metric. The recommended way is to create a class that derives from the Metric class from the TorchMetrics project. Add the state variables in the constructor using add_state() and override the update() and compute() methods. The API will take care of synchronizing the state variables between the GPU processes.
There's already an accuracy metric in TorchMetrics and the source code is a good example of how to use the API.
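A minimal sketch of such a custom metric (the class name SumOverEpoch is only illustrative):

import torch
from torchmetrics import Metric

class SumOverEpoch(Metric):
    """Sums a value over all batches and, at compute time, over all GPUs."""
    def __init__(self):
        super().__init__()
        # dist_reduce_fx="sum" tells TorchMetrics how to combine this state
        # variable across DDP processes when compute() is called.
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, value: torch.Tensor):
        self.total += value.sum()

    def compute(self):
        return self.total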
I think you are looking for training_step_end/validation_step_end.
...So, when Lightning calls any of the training_step, validation_step, test_step you will only be operating on one of those pieces. (...) For most metrics, this doesn’t really matter. However, if you want to add something to your computational graph (like softmax) using all batch parts you can use the training_step_end step.
I am trying to write some custom layers in Keras. The ultimate goal is that certain parameters (updated according to a fixed formula after each batch of data is optimized over in the training process) be passed to the loss function. I do not believe it is possible to use dynamic loss functions in Keras, but I believe I should be able to pass these parameters to the loss function using multiple inputs and a custom layer.
I want to know whether it is possible to create a layer in Keras having parameters that are not trainable (and not optimized over at all in the training process), but instead updated according to a fixed formula at the end of each batch optimization in the training process.
The simplest example I can give: instead of optimizing a generic cost function (like cross-entropy), I want to optimize something proportional to the cross-entropy (c*cross_entropy). After one batch of data is processed in the training procedure, I want to set, for example, c = 1.2*c, and have that value of c used for the next batch of data.
(This should be more or less useless in this case as a positive constant times the loss function shouldn't affect the minima but it's fairly close to what I actually need to do).
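To make it concrete, this is roughly the effect I am after, shown here with a plain non-trainable tf.Variable and a callback rather than a custom layer (the names c, scaled_crossentropy and UpdateC are my own placeholders; I don't know whether this is the right approach):

import tensorflow as tf

# Non-trainable scaling factor that should never be optimized over.
c = tf.Variable(1.0, trainable=False, dtype=tf.float32)

def scaled_crossentropy(y_true, y_pred):
    # Loss proportional to the cross-entropy, scaled by the current value of c.
    return c * tf.keras.losses.categorical_crossentropy(y_true, y_pred)

class UpdateC(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        # Apply the fixed update formula after every batch.
        c.assign(1.2 * c)

# model.compile(optimizer="adam", loss=scaled_crossentropy)
# model.fit(x, y, callbacks=[UpdateC()])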
I know that one can simply set a single learning rate for all variables using something like in the tutorials:
opt = tf.train.GradientDescentOptimizer(learning_rate)
however, it would be nice if one could pass a dictionary that maps a variable name to its corresponding learning rate. Is that possible?
I know that one could simply use compute_gradients() followed by apply_gradients() and do it manually but that seems silly. Is there a smarter way to assign specific learning rates to specific variables?
Is the only way to do this to create a specific optimizer for each learning rate, as in:
# Create an optimizer with the desired parameters.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains tf.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)
and simply give the specific learning rate to each optimizer? But that would mean we have a list of optimizers, and hence we would need to run the update rule with sess.run for each optimizer. Right?
As far as I can tell this is not possible, mostly because it is not really valid gradient descent then. There are plenty of optimizers which learn their own variable-specific scaling factors (like Adam or AdaGrad). Specifying a per-variable (constant) learning rate would mean that you do not follow the gradient anymore; while this makes sense for mathematically well-formulated methods, simply setting the factors to pre-defined values is just a heuristic, which I believe is the reason it is not implemented in core TF.
As you said, you can always do it on your own: define your own optimizer and iterate over the variables between computing the gradients and applying them, which would be around 3-4 lines of code (one to compute the gradients, one to iterate and add multiplication ops, and one to apply them back). As far as I know, this is the simplest solution to achieve your goal.
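Roughly along these lines (a sketch only; lr_map is a hypothetical dictionary from variable name to a relative learning rate, cost is your loss tensor, and dense gradients are assumed):

# Scale each variable's gradient by its own factor before applying it.
opt = tf.train.GradientDescentOptimizer(learning_rate=1.0)
grads_and_vars = opt.compute_gradients(cost)
scaled = [(grad * lr_map.get(var.op.name, 1.0), var)
          for grad, var in grads_and_vars if grad is not None]
train_op = opt.apply_gradients(scaled)
# sess.run(train_op) then performs one update with the per-variable scaling applied.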
In the TensorFlow CNN tutorial, the evaluation code computes the accuracy, but I want to build on that to get the confusion matrix.
Three different approaches immediately came to mind:
I tried to directly compute the prediction result instead of top_k_op in TensorFlow, so that I could use sklearn. But I failed, because the evaluation uses multiple threads to do the computation (line 88);
I tried to load the trained Variables and give a new placeholder to cifar10.inference, but failed again, because it defines batch_image as the input (line 225);
The last approach is to define a new operation to replace line 128
top_k_op = tf.nn.in_top_k(logits, labels, 1)
but I could not find a proper operation that could do that.
This has been troubling me for several days. Please help. Thank you in advance.
You can use sklearn's confusion_matrix only after running inference on the whole dataset.
Meaning, if you are modifying the eval_only function, you should just accumulate all the scores in some thread-safe container (a list). Then, after all the threads are stopped (line 113), you can run a single confusion matrix computation.
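A sketch of that accumulation approach (sess, logits, labels and num_iter are assumed to come from the tutorial's evaluation code):

import numpy as np
from sklearn.metrics import confusion_matrix

# Accumulate predictions and true labels over all evaluation batches,
# then compute a single confusion matrix at the end.
all_preds, all_labels = [], []
for _ in range(num_iter):
    logits_val, labels_val = sess.run([logits, labels])
    all_preds.append(np.argmax(logits_val, axis=1))
    all_labels.append(labels_val)

cm = confusion_matrix(np.hstack(all_labels), np.hstack(all_preds))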
Additionally, if you want to do it in the graph, TensorFlow recently got a confusion_matrix op that you can try using. That said, it only works on a single batch, so you will need to increase your batch size to get any kind of resolution, or write a custom aggregator.
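If you go the in-graph route, such a custom aggregator could look roughly like this (NUM_CLASSES would be 10 for CIFAR-10; logits and labels as in the tutorial):

# Sum the per-batch confusion matrices into a non-trainable variable.
batch_cm = tf.confusion_matrix(labels, tf.argmax(logits, axis=1),
                               num_classes=NUM_CLASSES)
total_cm = tf.Variable(tf.zeros([NUM_CLASSES, NUM_CLASSES], dtype=tf.int32),
                       trainable=False)
update_cm = total_cm.assign_add(batch_cm)
# Run update_cm once per evaluation batch, then read total_cm at the end.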