Using circular buffers inside a TF(2.0) model - python

I'm looking for a way to optimize my TensorFlow (2.0) U-Net model for streaming data. The goal is to store the activations of previous predictions in circular buffers, in order to reuse parts of those buffers in subsequent predictions.
I've found the TensorFlow op FIFOQueue, and conceptually it points in the right direction. However, it only seems to be used in the context of data loading, and I could not find any good leads in the .layers documentation.
Can this be tackled with the TensorFlow API? Or is this problem sufficiently low-level that it breaks the barrier into the C++ API? Currently I'm developing in Python.
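For what it's worth, a minimal sketch of the kind of buffer I have in mind, using a non-trainable tf.Variable as a ring buffer (the shapes and buffer length below are placeholders):

import tensorflow as tf

# Rough sketch: keep the last few activation maps of a layer in a ring buffer
# held in a non-trainable tf.Variable. Shapes and buffer length are placeholders.
BUFFER_LEN = 4
activation_buffer = tf.Variable(
    tf.zeros([BUFFER_LEN, 64, 64, 32]), trainable=False)

@tf.function
def push(activation):  # activation: [64, 64, 32]
    # Drop the oldest entry and append the newest one.
    activation_buffer.assign(
        tf.concat([activation_buffer[1:], activation[tf.newaxis]], axis=0))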

Related

Deploy Semantic Segmentation Network (U-Net) with TensorRT (no upsampling support)

I am trying to deploy a trained U-Net with TensorRT. The model was trained using Keras (with Tensorflow as backend). The code is very similar to this one: https://github.com/zhixuhao/unet/blob/master/model.py
When I converted the model to UFF format using code like this:
import os
import uff

uff_fname = os.path.join("./models/", "model_" + idx + ".uff")
uff_model = uff.from_tensorflow_frozen_model(
    frozen_file=os.path.join("./models", trt_fname),
    output_nodes=output_names,
    output_filename=uff_fname,
)
I get the following warnings:
Warning: No conversion function registered for layer: ResizeNearestNeighbor yet.
Converting up_sampling2d_32_12/ResizeNearestNeighbor as custom op: ResizeNearestNeighbor
Warning: No conversion function registered for layer: DataFormatVecPermute yet.
Converting up_sampling2d_32_12/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer as custom op: DataFormatVecPermute
I tried to avoid this by replacing the upsampling layer with upsampling (bilinear interpolation) and transposed convolution, but the converter threw similar errors. I checked https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html and it seemed that none of these operations are supported yet.
I am wondering if there is any workaround to this problem? Is there any other format/framework that TensorRT likes and has upsampling supported? Or is it possible to replace it with some other supported operations?
I also saw somewhere that one can add customized operations to replace the unsupported ones for TensorRT, though I am not sure what that workflow would look like. It would also be really helpful if someone could point out an example of custom layers.
Thank you in advance!
The warnings appear because these operations are not supported by TensorRT yet, as you already mentioned.
Unfortunately there is no easy way to fix this. You either have to modify the graph (even after training) to use only a combination of supported operations, or write these operations yourself as custom layers.
However, there is a better way to run inference on other devices in C++. You can use TensorFlow mixed with TensorRT. TensorRT will analyze the graph for ops that it supports and convert them to TensorRT nodes, and the rest of the graph will be handled by TensorFlow as usual. More information here. This solution is much faster than rewriting the operations yourself. The only complicated part is building TensorFlow from source on your target device and generating the dynamic library tensorflow_cc. Recently there have been many guides, and growing support, for TensorFlow ports to various architectures, e.g. ARM.
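For reference, the Python side of that mixed TF-TRT workflow looks roughly like this on recent TensorFlow versions (the SavedModel paths are placeholders and the exact converter class depends on your TF version):

# Sketch of the TF-TRT conversion step (TF 2.x). Supported subgraphs are replaced
# by TensorRT engines; everything else keeps running as ordinary TensorFlow ops.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(input_saved_model_dir="unet_saved_model")
converter.convert()
converter.save("unet_trt_saved_model")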
Update 09/28/2019
Nvidia released TensorRT 6.0.1 about two weeks ago and added a new API called "IResizeLayer". This layer supports "Nearest" interpolation and can thus be used to implement upsampling. No need to use custom layers/plugins any more!
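Through the TensorRT Python bindings this looks roughly as follows (a sketch; attribute names as I recall them from the 6.x API, and network/inputs are assumed to already exist):

import tensorrt as trt

# Sketch: 2x nearest-neighbor upsampling via IResizeLayer (TensorRT >= 6),
# assuming `network` is an INetworkDefinition and `inputs` is an NCHW tensor.
resize = network.add_resize(inputs)
resize.scales = [1.0, 1.0, 2.0, 2.0]          # per-dimension scale factors (N, C, H, W)
resize.resize_mode = trt.ResizeMode.NEAREST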
Original answer:
Thanks for all the answers and suggestions posted here!
In the end, we implemented the network directly in the TensorRT C++ API and loaded the weights from the .h5 model file. We haven't had the time to profile and polish the solution yet, but the inference seems to be working according to the test images we fed in.
Here's the workflow we've adopted:
Step 1: Code the upsampling layer.
In our U-Net model, all the upsampling layers have a scaling factor of (2, 2) and they all use ResizeNearestNeighbor interpolation. Essentially, the pixel value at (x, y) in the original tensor goes to four pixels in the new tensor: (2x, 2y), (2x+1, 2y), (2x, 2y+1) and (2x+1, 2y+1). This can easily be coded up as a CUDA kernel.
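As a reference for checking such a kernel, the same operation can be written in a couple of lines of NumPy (a sketch, not our actual test code):

import numpy as np

# Reference implementation of 2x nearest-neighbor upsampling on an NCHW tensor,
# handy for verifying the CUDA kernel's output.
def upsample2x_nearest(x):  # x: [N, C, H, W]
    return np.repeat(np.repeat(x, 2, axis=2), 2, axis=3)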
Once we had the upsampling kernel, we needed to wrap it with the TensorRT API, specifically the IPluginV2Ext class. The developer reference describes which functions need to be implemented. I'd say enqueue() is the most important function because the CUDA kernel gets executed there.
There are also examples in the TensorRT samples folder. For my version, these resources were helpful:
GitHub: Leaky ReLU as custom layer
TensorRT-5.1.2.2/samples/sampleUffSSD
TensorRT-5.1.2.2/samples/sampleSSD
Step 2: Code the rest of the network using TensorRT API
The rest of the network should be quite straightforward. Just call the various "addXxxLayer" functions on the TensorRT network definition.
One thing to keep in mind:
Depending on which version of TRT you are using, the way to add padding can be different. I think the newest version (5.1.5) allows developers to pass parameters to addConvolution() so that the proper padding mode can be selected.
My model was trained using Keras; its default padding mode is that the right and bottom get more padding if the total amount of padding needed is odd. Check this Stack Overflow link for details. There's a mode in 5.1.5 that represents this padding scheme.
If you are on an older version (5.1.2.2), you will need to add the padding as a separate layer before the convolution layer, which takes two parameters: pre-padding and post-padding.
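For reference, the asymmetric 'same' padding rule that TensorFlow/Keras uses can be computed per spatial dimension like this (a sketch; the function name is mine), which gives the pre- and post-padding amounts for the older-API route:

# Sketch of the TensorFlow/Keras 'same' padding rule for one spatial dimension:
# when the total padding is odd, the extra pixel goes to the bottom/right.
def same_padding(in_size, kernel, stride):
    out_size = -(-in_size // stride)          # ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    pre = total // 2
    post = total - pre                        # post >= pre
    return pre, post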
Also, everything is NCHW in TensorRT.
Helpful sample:
TensorRT-5.1.2.2/samples/sampleMNISTAP
Step 3: Load the weights
TensorRT wants weights in the format [out_c, in_c, filter_h, filter_w], which is mentioned in archived documentation. Keras stores weights in the format [filter_h, filter_w, in_c, out_c].
We got a pure weights file by calling model.save_weights('weight.h5') in Python. Then we read the weights into NumPy arrays using h5py, performed the transposition and saved the transposed weights as a new file. We also figured out the group and dataset names using h5py. This info was used when loading the weights into the C++ code with the HDF5 C++ API.
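A sketch of that transposition step (file names here are placeholders, not our exact ones):

import h5py
import numpy as np

# Convert every 4-D conv kernel from Keras' [filter_h, filter_w, in_c, out_c]
# layout to TensorRT's [out_c, in_c, filter_h, filter_w] layout.
with h5py.File("weight.h5", "r") as src, h5py.File("weight_trt.h5", "w") as dst:
    def copy(name, obj):
        if isinstance(obj, h5py.Dataset):
            w = obj[()]
            if w.ndim == 4:
                w = np.transpose(w, (3, 2, 0, 1))
            dst.create_dataset(name, data=w)
    src.visititems(copy)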
We compared the output layer by layer between the C++ code and the Python code. For our U-Net, all the activation maps are the same up to maybe the third block (after two pooling layers). After that, there is a tiny difference between pixel values. The absolute percentage error is around 10^-8, so we don't think it's that bad. We are still in the process of polishing the C++ implementation.
Again, thanks for all the suggestions and answers we got in this post. Hope our solution can be helpful as well!
Hey, I've done something similar. I'd say the best way to tackle the issue is to export your model to .onnx with a good tool like this one; if you check the support matrix for ONNX, upsample is supported:
Then you can use https://github.com/onnx/onnx-tensorrt to convert the ONNX model to TensorRT. I've used this to convert a network that I trained in PyTorch and that had upsample. The repo for onnx-tensorrt is a bit more active, and if you check the PR tab you can see other people writing custom layers and fork from there.

What's the best way to use distributed multi-GPU inferencing in tensorflow?

I am new to TensorFlow and I am working on distributing test images to multiple GPUs. I have read a lot of Stack Overflow answers and GitHub examples, and I think there might be two ways to do that.
1) Using tf.FIFOQueue() to feed images to each GPU; however, the queue is not recommended in a lot of answers (due to the new tf.data API), and it has some issues (https://github.com/tensorflow/tensorflow/issues/8061).
2) Using the tf.data API. I am not sure whether this API supports GPUs or not. In this issue (https://github.com/tensorflow/tensorflow/issues/13610), it seems that an input pipeline built with the tf.data API cannot feed GPUs yet.
Distributed TensorFlow is not under consideration (since our model and server scale are not that large).
I would appreciate it very much if someone could give me some advice.
Use tf.data. The tf.data API is meant to replace almost all of the queue functionality, and it makes everything easier and more performant.
It can also feed data to the GPU. The second issue you link to just says that the preprocessing will not happen on the GPU; the data is processed on the CPU and then sent to your multiple GPUs.
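A minimal sketch of that pattern (TF 2.x names; the file pattern, image size and device string are placeholders):

import tensorflow as tf

# Decode and preprocess on the CPU, then prefetch ready batches to a GPU.
def parse(path):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [256, 256]) / 255.0

dataset = (tf.data.Dataset.list_files("test_images/*.jpg")
           .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(32)
           .apply(tf.data.experimental.prefetch_to_device("/gpu:0")))

For several GPUs, the same CPU-side pipeline can be wrapped in a tf.distribute strategy (e.g. MirroredStrategy) so that each batch is split across the devices.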

Using Tensorflow in a low-latency high-throughput kinda way

Processed data is real-time video (a bunch of sequential frames) and it all needs to end up in a DX12 buffer.
I don't care too much if data gets copied to system memory during training, but during evaluation, it must stay on GPU.
I would train the network separately in Python, where high latency is allowed, but after it is trained I would use it entirely on the GPU (because my frames are already there). From my standpoint (experienced with GPGPU programming but not so much with TensorFlow) there are two ways of doing this:
Extracting the parameters from the trained model in Python (weights and biases) and uploading them to a C++ program that implements the same network topology on the GPU and runs it there. It should behave just like the trained TensorFlow network (see the sketch below).
Using TensorFlow in the C++ program as well and just passing the buffer handles for input and output (the way you would with GPGPU), and then interoperating with DX12 (because I need the evaluations to end up there).
I would like to know whether either of those options is possible and, if so, which one is better and why?
If I left anything unclear, let me know in the comments.
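For the first option, the Python side could be as simple as dumping the trained parameters to a portable file (a sketch, assuming an already trained tf.keras model object named model):

import numpy as np

# Dump every layer's weights and biases to an .npz file that a C++ program can
# load and upload to its own GPU implementation of the network.
arrays = {}
for layer in model.layers:
    for i, w in enumerate(layer.get_weights()):
        arrays["{}_{}".format(layer.name, i)] = w
np.savez("trained_parameters.npz", **arrays)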

Reading multiple images, process them to one image and feed through model

Is there a way to build the following graph in TensorFlow:
Load some N images (N can vary for each set) using TF queues and TF image readers.
Process these images to get fixed-size images and prepare batches.
Feed these batches through the CNN model
Some questions/info:
I am trying to build the data loading part in TF instead of using Python functions and feed_dict. I guess TF data loading can train the model faster compared to Python and feed_dict. Is that right?
Building the graph for small N (N<5) is easy: define exclusive nodes for each image in N and process them. (working)
Can I use TF's "while_loop" to build such functionality to read N images?
Does Keras support such functionality?
Thanks for your suggestions.
I just did this last week! It was awesome; I learned a ton about TensorFlow using things like tf.map_fn and tf.cond. And it worked.
This week I just refactored my code to eliminate it all, because it was a bad idea.
Issues I ran into:
Doing preprocessing in TensorFlow is messy to debug. Doing proper TDD will definitely benefit you here, but it's still not going to be particularly pretty or easy to debug.
You should be offloading the preprocessing to the CPU and leaving the GPU (assuming you're using one) to do the training. A better approach is to just have a queue and load it from a thread/class that's dedicated to your preprocessing task. And doing the work in numpy/scikit/scikit-image is going to be easier to configure and test.
I thought I was so smart, corralling all my code into a single model. But the complexity of the preprocessing meant my model was really hard to iterate on, and it became rigid code quickly. For example, when I added my test set evaluation, the preprocessing requirements were slightly different; suddenly I had to add large sections of conditional code to my model and it got ugly quickly.
That being said, my preprocessing steps were maybe more complex than yours. If you're sticking to simple things where you can just apply some basic image preprocessing steps, it might still be easier for you to go this route.
To answer your questions specifically:
Queues won't give any benefit over feed_dict that I know of. You still have the problem of moving data from a TF queue on the CPU to GPU memory each iteration, same as with feed_dict. Watch this thread if you care about that topic; GPU queues are coming: https://github.com/tensorflow/tensorflow/issues/7679
You should just dequeue_many from the queue and process them as a batch. If you need to do something to each individual image, use tf.map_fn, which removes the first dimension and passes individual 3D images to your specified function. But heed my warning above when you go this route: you'll probably be happier just doing this in a separate thread.
Already answered in #2: use tf.map_fn to iterate over multiple images in a batch. It's pretty easy to use, actually (see the sketch after this list).
I don't know Keras.
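A minimal sketch of the tf.map_fn pattern (TF 2.x names; the resize target is a placeholder):

import tensorflow as tf

# tf.map_fn strips the leading batch dimension and passes one [H, W, C] image at
# a time to the mapped function.
def preprocess_one(image):
    image = tf.image.resize(image, [224, 224])
    return tf.image.per_image_standardization(image)

batch = tf.random.uniform([8, 300, 400, 3])        # [N, H, W, C]
processed = tf.map_fn(preprocess_one, batch)       # -> [8, 224, 224, 3]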

Tensorflow examples for RNN

I am trying to implement basic NLP tasks in TensorFlow without using the built-in modules as much as possible (just for learning's sake).
I have been trying to implement a part-of-speech tagger using data from http://www.cnts.ua.ac.be/conll2000/chunking/
I am having a little difficulty implementing an RNN from scratch with an embedding layer in front, and I was wondering if there are examples and implementations of the same.
I have seen tons of examples using Theano and MNIST data but haven't been able to find a concrete implementation of RNNs from scratch in TensorFlow on textual data.
Any suggestions?
After searching for a long time, I decided to do it on my own.
Here are a couple of repos I created:
recurrent-neural-nets-tensorflow
lstm-tensorflow
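For anyone landing here, a minimal from-scratch vanilla RNN step with an embedding layer in front could look like this (TF 2.x names; sizes are placeholders, and this is a sketch rather than code from the repos above):

import tensorflow as tf

# One vanilla-RNN step with an embedding lookup in front.
V, E, H = 10000, 64, 128                               # vocab, embedding, hidden sizes
embedding = tf.Variable(tf.random.uniform([V, E], -0.1, 0.1))
W_xh = tf.Variable(tf.random.normal([E, H], stddev=0.1))
W_hh = tf.Variable(tf.random.normal([H, H], stddev=0.1))
b_h = tf.Variable(tf.zeros([H]))

def rnn_step(h_prev, token_ids):                       # token_ids: [batch]
    x = tf.nn.embedding_lookup(embedding, token_ids)   # [batch, E]
    return tf.tanh(tf.matmul(x, W_xh) + tf.matmul(h_prev, W_hh) + b_h)

# Unroll over a [batch, time] tensor of token ids.
tokens = tf.constant([[3, 17, 256, 9]])
h = tf.zeros([1, H])
for t in range(tokens.shape[1]):
    h = rnn_step(h, tokens[:, t])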
