I am an intermediate learner in PyTorch, and recently I have seen people use torch.inference_mode() instead of the familiar torch.no_grad() when validating a trained agent in reinforcement learning (RL) experiments. I checked the documentation, which has a table listing the two flags for disabling gradient computation. To be honest, when I read the descriptions they sound exactly the same to me. Has someone figured out an explanation?
So I have been searching the web for a few days and I think I have my explanation. torch.inference_mode() was added as an even more optimized way of doing inference with PyTorch (compared with torch.no_grad()). I listened to the PyTorch podcast, and it explains why there are two different flags.
Version counting of tensors: Say you have PyTorch code that you used to train an agent. When you run inference on the trained model under torch.no_grad(), some PyTorch machinery is still active, such as tensor version counting: a version counter is allocated every time a tensor is created and is incremented (a version bump) whenever that tensor is mutated. Tracking the versions of all tensors has a computational cost, and PyTorch can't simply drop it, because it has to watch for mutations of tensors saved for the backward computation, whether they are made directly to the tensor or indirectly through an alias.
View tracking of tensors: PyTorch tensors are strided, which means PyTorch uses strides in the backend for indexing, so it can directly access specific elements in the memory block. But for torch.autograd, what happens if you take a tensor, create a new view of it, and mutate that view with a tensor that is involved in the backward computation? Under torch.no_grad(), PyTorch still records some view metadata, which is needed to keep track of which tensors require gradients and which do not. This adds further overhead to your computation.
So torch.autograd checks for these changes, which are no longer tracked when you switch to torch.inference_mode() (instead of torch.no_grad()). If your code does not rely on the two points above, inference mode works and reduces execution time. (The PyTorch dev team says they have seen a speedup of 5-10% when deploying models in production at Facebook.)
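To make the difference concrete, here is a minimal sketch, assuming a PyTorch version that ships torch.inference_mode() (1.9 or later). The Linear layer is only a hypothetical stand-in for a trained policy network, and `_version` is an internal attribute used here purely for illustration.

```python
import torch

model = torch.nn.Linear(4, 2)   # hypothetical stand-in for a trained policy network
obs = torch.randn(1, 4)

# no_grad: no graph is recorded, but the output is a normal tensor
# that still carries a version counter.
with torch.no_grad():
    y = model(obs)
    v_before = y._version       # internal attribute, shown for illustration only
    y.add_(1.0)                 # an in-place mutation bumps the counter
    v_after = y._version

# inference_mode: the output is an "inference tensor" without this bookkeeping.
with torch.inference_mode():
    z = model(obs)

print(y.requires_grad, y.is_inference(), v_before, v_after)
print(z.requires_grad, z.is_inference())
```

The trade-off is that a tensor created under inference mode cannot later be used in a computation that autograd has to record, so torch.no_grad() remains the safer choice when the outputs flow back into training code.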
I would like to integrate an attentional component into the LSTM model I'm creating. Unfortunately, with tensorflow 2.3.1 that I'm using, it appears that if you subclass the LSTMCell, you have to run the model on CPU. From the tensorflow documentation:
CuDNN is only available at the layer level, and not at the cell level.
Which means I'm relegated to the CPU if I try something like this:
output = keras.layers.RNN(AttentionLSTMCell(400), return_sequences=True, stateful=False)(input_layer)
Here AttentionLSTMCell is a custom class I made that takes in some additional constants (generally an output of the previous timestep and some new input) that condition the output of the LSTM. In fact, the documentation seems to suggest that only specific kinds of conditioning are allowed. I am about to dig into creating a full custom Layer (perhaps by copying the existing one and seeing if I can add my new inputs in call), but is there a better way? This makes prototyping quite difficult. Large recurrent networks are slow to train, especially in my case, where I integrate image data as input.
I am trying to deploy a trained U-Net with TensorRT. The model was trained using Keras (with Tensorflow as backend). The code is very similar to this one: https://github.com/zhixuhao/unet/blob/master/model.py
When I converted the model to UFF format, using some code like this:
import os
import uff

# idx, trt_fname and output_names are defined elsewhere in the script
uff_fname = os.path.join("./models/", "model_" + idx + ".uff")
uff_model = uff.from_tensorflow_frozen_model(
    frozen_file=os.path.join("./models", trt_fname),
    output_nodes=output_names,
    output_filename=uff_fname,
)
I will get the following warning:
Warning: No conversion function registered for layer: ResizeNearestNeighbor yet.
Converting up_sampling2d_32_12/ResizeNearestNeighbor as custom op: ResizeNearestNeighbor
Warning: No conversion function registered for layer: DataFormatVecPermute yet.
Converting up_sampling2d_32_12/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer as custom op: DataFormatVecPermute
I tried to avoid this by replacing the upsampling layer with upsampling (bilinear interpolation) and with transposed convolution, but the converter threw similar errors. I checked https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html and it seems none of these operations is supported yet.
Is there any workaround for this problem? Is there another format/framework that TensorRT accepts and that has upsampling supported? Or is it possible to replace it with other, supported operations?
I also saw somewhere that one can add custom operations to replace the ones TensorRT does not support, though I am not sure what that workflow looks like. It would be really helpful if someone could point me to an example of custom layers.
Thank you in advance!
The warnings appear because these operations are not yet supported by TensorRT, as you already mentioned.
Unfortunately there is no easy way to fix this. You either have to modify the graph (even after training) so that it uses only a combination of supported operations, or write these operations yourself as custom layers.
However, there is a better way to run inference on other devices in C++. You can use TensorFlow together with TensorRT. TensorRT will analyze the graph for ops that it supports and convert them to TensorRT nodes, and the rest of the graph will be handled by TensorFlow as usual. More information here. This solution is much faster than rewriting the operations yourself. The only complicated part is building TensorFlow from source on your target device and generating the dynamic library tensorflow_cc. Recently there have been many guides and growing support for TensorFlow ports to various architectures, e.g. ARM.
Update 09/28/2019
Nvidia released TensorRT 6.0.1 about two weeks ago and added a new API called "IResizeLayer". This layer supports "Nearest" interpolation and can thus be used to implement upsampling. No need to use custom layers/plugins any more!
Original answer:
Thanks for all the answers and suggestions posted here!
In the end, we implemented the network in TensorRT C++ API directly and loaded the weights from the .h5 model file. We haven't got the time to profile and polish the solution yet, but the inference seems to be working according to the test images we fed in.
Here's the workflow we've adopted:
Step 1: Code the upsampling layer.
In our U-Net model, all the upsampling layers have a scaling factor of (2, 2) and they all use ResizeNearestNeighbor interpolation. Essentially, the pixel value at (x, y) in the original tensor goes to four pixels in the new tensor: (2x, 2y), (2x+1, 2y), (2x, 2y+1) and (2x+1, 2y+1). This can easily be coded as a CUDA kernel function.
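The mapping above can be written as a small NumPy reference, which is handy for checking a CUDA kernel against known-good output. This is only a sketch of the indexing rule, not the kernel we used; the function name and the (C, H, W) layout are assumptions for illustration.

```python
import numpy as np

def upsample_nearest_2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) array to (C, 2H, 2W).

    Each input pixel (i, j) is copied to the 2x2 output block
    starting at (2i, 2j).
    """
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.arange(4, dtype=np.float32).reshape(1, 2, 2)
y = upsample_nearest_2x(x)
print(y[0])
```

Comparing the plugin output with this reference on a few random tensors is a cheap way to catch off-by-one indexing mistakes.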
Once we had the upsampling kernel, we needed to wrap it with the TensorRT API, specifically the IPluginV2Ext class. The developer reference describes which functions need to be implemented. I'd say enqueue() is the most important one, because that is where the CUDA kernel gets executed.
There are also examples in the TensorRT Samples folder. For my version, these resources are helpful:
Github: Leaky Relu as custom layer
TensorRT-5.1.2.2/samples/sampleUffSSD
TensorRT-5.1.2.2/samples/sampleSSD
Step 2: Code the rest of the network using TensorRT API
The rest of the network should be quite straightforward. Just find and call the appropriate "addxxxLayer" functions from the TensorRT network definition.
One thing to keep in mind:
Depending on which version of TRT you are using, the way to add padding can be different. I think the newest version (5.1.5) allows developers to add parameters in addConvolution() so that the proper padding mode can be selected.
My model was trained using Keras, where the default padding mode gives the right and bottom sides more padding if the total amount of padding is odd. Check this Stack Overflow link for details. There's a mode in 5.1.5 that represents this padding scheme.
If you are on an older version (5.1.2.2), you will need to add the padding as a separate layer before the convolution layer, which has two parameters: pre-padding and post-padding.
Also, everything is in NCHW layout in TensorRT.
Helpful sample:
TensorRT-5.1.2.2/samples/sampleMNISTAP
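For the separate padding layer, the pre- and post-padding values can be worked out with a few lines of Python. This is a sketch of the usual TensorFlow/Keras "same" padding rule for one spatial dimension; the function name is hypothetical and it is worth checking the result against your own model's output shapes.

```python
import math

def keras_same_padding(in_size, kernel, stride=1):
    """Return (pre, post) padding for one spatial dimension.

    When the total padding is odd, the extra element goes to the
    post side (bottom/right), as described above.
    """
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    pre = total // 2       # top / left
    post = total - pre     # bottom / right
    return pre, post

print(keras_same_padding(256, 3))            # 3x3 conv, stride 1
print(keras_same_padding(5, 2))              # even kernel: asymmetric padding
print(keras_same_padding(8, 3, stride=2))    # strided conv
```

The same function is called once for height and once for width, and the two results are passed as the pre-padding and post-padding of the padding layer.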
Step 3: Load the weights
TensorRT wants weights in the format [out_c, in_c, filter_h, filter_w], which is mentioned in the archived documentation. Keras stores weights in the format [filter_h, filter_w, c_in, c_out].
We got a pure weights file by calling model.save_weights('weight.h5') in Python. Then we read the weights into NumPy arrays using h5py, transposed them, and saved the transposed weights as a new file. We also figured out the Group and Dataset names using h5py. This information was used when loading the weights into the C++ code using the HDF5 C++ API.
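The transposition step looks roughly like this. It is a minimal sketch: the random array stands in for a kernel read from the .h5 file, and the variable names are made up for illustration.

```python
import numpy as np

# Stand-in for one convolution kernel read from the .h5 file,
# in Keras layout [filter_h, filter_w, c_in, c_out].
rng = np.random.default_rng(0)
keras_kernel = rng.standard_normal((3, 3, 64, 128)).astype(np.float32)

# Reorder to [out_c, in_c, filter_h, filter_w] and make the memory
# contiguous so it can be written out and read back as a flat buffer.
trt_kernel = np.ascontiguousarray(np.transpose(keras_kernel, (3, 2, 0, 1)))

print(trt_kernel.shape)
```

The ascontiguousarray call matters because np.transpose only returns a view with permuted strides, so without it the bytes on disk would still be in the Keras order.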
We compared the output layer by layer between C++ code and Python code. For our U-Net, all the activation maps are the same till maybe the third block (after 2 pooling). After that, there is a tiny difference between pixel values. The absolute percentage error is 10^-8 so we don't think it's that bad. We are still in the process of polishing the C++ implementation.
Again, thanks for all the suggestions and answers we got in this post. Hope our solution can be helpful as well!
Hey, I've done something similar. I'd say the best way to tackle the issue is to export your model to .onnx with a good tool like this one. If you check the support matrix for ONNX, upsample is supported.
Then you can use https://github.com/onnx/onnx-tensorrt to convert the ONNX model to TensorRT. I've used this to convert a network that I trained in PyTorch and that had upsample. The repo for onnx-tensorrt is a bit more active, and if you check the PR tab you can find other people writing custom layers and fork from there.
In the TensorFlow Python API, tf.metrics features a few metrics for information retrieval.
In particular:
tf.metrics.precision_at_k and tf.metrics.precision_at_top_k
tf.metrics.recall_at_k and tf.metrics.recall_at_top_k
What is the difference between the _at_k and _at_top_k metrics?
The API documentation does not seem to give information on this.
Looking at their implementation, precision_at_k is a simple wrapper around precision_at_top_k. The difference is actually mentioned in the API docs: precision_at_k expects a tensor of logits as predictions whereas precision_at_top_k expects the predictions to be the indices of the top k classes. In essence, precision_at_k simply performs tf.nn.top_k on the predictions and then calls precision_at_top_k.
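The relationship can be sketched in NumPy as follows. This is a simplified single-batch illustration with hypothetical helper names, not the TensorFlow implementation; the real metrics are streaming and accumulate counts across batches.

```python
import numpy as np

def precision_at_top_k(labels, top_k_indices):
    """labels: (batch, num_labels) class ids; top_k_indices: (batch, k)."""
    labels = np.asarray(labels)
    top_k_indices = np.asarray(top_k_indices)
    # A predicted index is a hit if it matches any label of its example.
    hits = (top_k_indices[:, :, None] == labels[:, None, :]).any(axis=2)
    return hits.sum() / hits.size

def precision_at_k(labels, logits, k):
    """Take the top-k indices of the logits, then defer to precision_at_top_k."""
    top_k_indices = np.argsort(-np.asarray(logits), axis=1)[:, :k]
    return precision_at_top_k(labels, top_k_indices)

logits = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2]]
labels = [[1], [2]]

p_from_logits = precision_at_k(labels, logits, k=2)
p_from_indices = precision_at_top_k(labels, [[1, 2], [0, 1]])
print(p_from_logits, p_from_indices)
```

Both calls give the same value because the second one is simply handed the indices that the first one computes internally.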
I notice there are two APIs in TensorFlow concerned with dropout: one is tf.nn.dropout, the other is tf.layers.dropout. What is the purpose of tf.nn.dropout?
According to https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf, there should be a parameter to distinguish between the training and testing stages. I see that tf.layers.dropout provides the proper behavior, so why is there another function, tf.nn.dropout? Does anyone have any idea? Thanks.
tf.layers.dropout uses tf.nn.dropout function internally.
tf.layers.dropout is useful if you just want a higher-level abstraction and do not want to control many facets of the dropout yourself; tf.nn.dropout is the lower-level op underneath it.
Look at the api docs:
1)https://www.tensorflow.org/api_docs/python/tf/layers/dropout
2)https://www.tensorflow.org/api_docs/python/tf/nn/dropout
tf.layers.dropout is a wrapper around tf.nn.dropout. There is a slight difference: tf.layers.dropout takes the rate of dropout, while tf.nn.dropout takes the probability of keeping the inputs, so the two are related by rate = 1 - keep_prob.
There is also an extra argument, training, in tf.layers.dropout, which controls whether to return the output in training mode (apply dropout) or in inference mode (return the input untouched).
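Here is a NumPy sketch of how the two functions relate. It is an illustration only, not the TensorFlow source; the function names and the rng argument are made up for the example.

```python
import numpy as np

def nn_dropout(x, keep_prob, rng):
    """Low-level op: always applies dropout and rescales kept units by 1/keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

def layers_dropout(x, rate, training, rng):
    """Wrapper: takes a drop rate and a training flag."""
    if not training:
        return x                      # inference mode: input returned untouched
    return nn_dropout(x, keep_prob=1.0 - rate, rng=rng)

rng = np.random.default_rng(0)
x = np.ones((2, 5))
train_out = layers_dropout(x, rate=0.2, training=True, rng=rng)
eval_out = layers_dropout(x, rate=0.2, training=False, rng=rng)
print(train_out)
print(eval_out)
```

In training mode each entry is either zeroed or scaled by 1 / 0.8, and in inference mode the input passes through unchanged, which is the train/test distinction the paper asks for.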