Quantization-aware training in TensorFlow using the high-level Keras API

I built my first convnet using the process described in this colab. Now I would like to run the model on Google's shiny new Edge TPU.
But according to the Model Requirements described here, I need to use quantization-aware training (post-training quantization is not supported) to be able to convert the model into a format that I can use on the Edge TPU.
How do I modify the example colab to do quantization-aware training?

Because the Keras API does not currently support quantization, you are left with three options:
Wait for Keras to gain the required functionality.
Rewrite your model with a different API that already has this functionality.
Find a different TPU that does not require quantization.
None of these options is great, though.
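For context on what quantization-aware training adds, here is a minimal plain-Python sketch of the quantize-then-dequantize step that such training simulates in the forward pass. It assumes an asymmetric 8-bit scheme with a known value range; the function name `fake_quantize` and its parameters are hypothetical and not part of Keras or TensorFlow.

```python
def fake_quantize(x, x_min, x_max, num_bits=8):
    """Snap x to the integer grid a quantized model would use, then map it back to float.

    Assumption: asymmetric quantization over a known [x_min, x_max] range.
    """
    qmax = 2 ** num_bits - 1            # 255 for 8 bits
    scale = (x_max - x_min) / qmax      # float width of one integer step
    zero_point = round(-x_min / scale)  # integer that represents 0.0
    q = round(x / scale) + zero_point   # quantize
    q = max(0, min(qmax, q))            # clamp to the representable range
    return (q - zero_point) * scale     # dequantize


for value in (1.234, 5.0, -1.0):
    print(value, "->", fake_quantize(value, x_min=0.0, x_max=2.55))
```

Values inside the range come back with a small rounding error and values outside it are clamped. During quantization-aware training the network sees this kind of error and learns weights that tolerate it.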

Related

How to use custom Tensorflow Lite model

I am very new to machine learning. I have a Python file with a very simple TensorFlow model that I need to deploy on Android using Google's ML Kit (which means creating a .tflite file). I don't understand at all how my Python file should be structured, and Google's documentation doesn't make it any easier. Maybe someone has a good example of converting a custom model written from scratch and then using it in Java. I need to pass a string from an Android text field and get a predicted answer.

First, train your model on whatever dataset you have. The layers in the model must be ones that the TFLite library supports. Here is a list of layers that are supported and unsupported.
Once you have trained it, convert it to TFLite following this tutorial or other tutorials on this page. The exact steps depend on how you saved the model (let's say using Keras model.save).
Now you can use this .tflite model in Android Studio. For this you can follow this good tutorial.

Deploy Semantic Segmentation Network (U-Net) with TensorRT (no upsampling support)

I am trying to deploy a trained U-Net with TensorRT. The model was trained using Keras (with Tensorflow as backend). The code is very similar to this one: https://github.com/zhixuhao/unet/blob/master/model.py
When I converted the model to UFF format, using code like this:
import os
import uff
# idx, trt_fname and output_names are assumed to be defined earlier in the script.
uff_fname = os.path.join("./models/", "model_" + str(idx) + ".uff")
uff_model = uff.from_tensorflow_frozen_model(
    frozen_file=os.path.join("./models", trt_fname),
    output_nodes=output_names,
    output_filename=uff_fname,
)
I get the following warnings:
Warning: No conversion function registered for layer: ResizeNearestNeighbor yet.
Converting up_sampling2d_32_12/ResizeNearestNeighbor as custom op: ResizeNearestNeighbor
Warning: No conversion function registered for layer: DataFormatVecPermute yet.
Converting up_sampling2d_32_12/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer as custom op: DataFormatVecPermute
I tried to avoid this by replacing the upsampling layer with upsampling (bilinear interpolation) and with transpose convolution, but the converter threw similar warnings. I checked https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html and it seems that none of these operations are supported yet.
Is there any workaround for this problem? Is there another format or framework that TensorRT accepts and that supports upsampling? Or is it possible to replace upsampling with other, supported operations?
I also saw somewhere that one can add custom operations to replace the ones TensorRT does not support, though I am not sure what the workflow would be. It would be really helpful if someone could point me to an example of custom layers.
Thank you in advance!

The warnings appear because these operations are not yet supported by TensorRT, as you already mentioned.
Unfortunately there is no easy way to fix this. You either have to modify the graph (even after training) so that it uses only a combination of supported operations, or write these operations yourself as custom layers.
However, there is a better way to run inference on other devices in C++. You can use TensorFlow and TensorRT together. TensorRT will analyze the graph for ops that it supports and convert them to TensorRT nodes, and the rest of the graph will be handled by TensorFlow as usual. More information here. This solution is much faster than rewriting the operations yourself. The only complicated part is building TensorFlow from source on your target device and generating the dynamic library tensorflow_cc. Recently many guides have appeared, along with support for TensorFlow ports to various architectures, e.g. ARM.

Update 09/28/2019
Nvidia released TensorRT 6.0.1 about two weeks ago and added a new API called "IResizeLayer". This layer supports "Nearest" interpolation and can thus be used to implement upsampling. No need to use custom layers/plugins any more!
Original answer:
Thanks for all the answers and suggestions posted here!
In the end, we implemented the network in TensorRT C++ API directly and loaded the weights from the .h5 model file. We haven't got the time to profile and polish the solution yet, but the inference seems to be working according to the test images we fed in.
Here's the workflow we've adopted:
Step 1: Code the upsampling layer.
In our U-Net model, all the upsampling layers have a scaling factor of (2, 2) and they all use ResizeNearestNeighbor interpolation. Essentially, the pixel value at (x, y) in the original tensor goes to four pixels in the new tensor: (2x, 2y), (2x+1, 2y), (2x, 2y+1) and (2x+1, 2y+1). This can easily be coded up as a CUDA kernel function.
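To make the pixel mapping concrete, here is a plain-Python reference sketch for a single 2-D channel stored as a list of rows. It is not our CUDA kernel; the function name `upsample_nearest_2x` is made up for illustration.

```python
def upsample_nearest_2x(image):
    """Nearest-neighbour upsampling of a 2-D list of rows by a factor of (2, 2)."""
    height, width = len(image), len(image[0])
    out = [[0] * (2 * width) for _ in range(2 * height)]
    for y in range(height):
        for x in range(width):
            value = image[y][x]
            # One source pixel fills a 2x2 block in the output.
            out[2 * y][2 * x] = value
            out[2 * y][2 * x + 1] = value
            out[2 * y + 1][2 * x] = value
            out[2 * y + 1][2 * x + 1] = value
    return out


print(upsample_nearest_2x([[1, 2], [3, 4]]))
```

A GPU kernel would more likely assign one thread per output pixel and read the input at (x // 2, y // 2), which is the same mapping seen from the output side.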
Once we had the upsampling kernel, we needed to wrap it with the TensorRT API, specifically the IPluginV2Ext class. The developer reference describes which functions need to be implemented. I'd say enqueue() is the most important one, because the CUDA kernel gets executed there.
There are also examples in the TensorRT Samples folder. For my version, these resources are helpful:
Github: Leaky Relu as custom layer
TensorRT-5.1.2.2/samples/sampleUffSSD
TensorRT-5.1.2.2/samples/sampleSSD
Step 2: Code the rest of the network using TensorRT API
The rest of the network should be quite straightforward. Just find and call the appropriate "addxxxLayer" functions from the TensorRT network definition.
One thing to keep in mind:
Depending on which version of TRT you are using, the way to add padding can be different. I think the newest version (5.1.5) allows developers to add parameters in addConvolution() so that the proper padding mode can be selected.
My model was trained using Keras, where "same" padding gives the right and bottom sides the extra padding when the total amount of padding is odd. Check this Stack Overflow link for details. There's a mode in 5.1.5 that represents this padding scheme.
If you are on an older version (5.1.2.2), you will need to add the padding as a separate layer before the convolution layer. That layer has two parameters: pre-padding and post-padding.
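The pre-padding and post-padding values for one spatial dimension can be worked out as in the sketch below. It follows the "SAME" rule documented for TensorFlow convolutions and assumes a dilation of 1; the helper name `same_padding` is hypothetical.

```python
import math


def same_padding(in_size, kernel_size, stride=1):
    """Return (pre, post) padding for one dimension under TensorFlow-style SAME padding."""
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel_size - in_size, 0)
    pre = total // 2      # top or left
    post = total - pre    # bottom or right gets the extra pixel when total is odd
    return pre, post


print(same_padding(5, 3))            # symmetric case
print(same_padding(4, 3, stride=2))  # odd total, extra padding goes to the end
```

Calling it once for the height and once for the width gives the two pairs that a separate padding layer needs.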
Also, everything is NCHW in TensorRT.
Helpful sample:
TensorRT-5.1.2.2/samples/sampleMNISTAP
Step 3: Load the weights
TensorRT wants weights in the format [out_c, in_c, filter_h, filter_w], which is mentioned in archived documentation. Keras has weights in the format [filter_h, filter_w, c_in, c_out].
We got a pure weights file by calling model.save_weights('weight.h5') in Python. We then read the weights into a NumPy array using h5py, transposed them, and saved the transposed weights as a new file. We also figured out the Group and Dataset names using h5py. This information was used when loading the weights into the C++ code with the HDF5 C++ API.
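The reordering itself can be sketched on a flat, row-major buffer, which is how the weights end up looking on the C++ side. This is an illustration in plain Python rather than our actual script, and the function name `hwio_to_oihw` is made up. With NumPy the same permutation is np.transpose(kernel, (3, 2, 0, 1)).

```python
def hwio_to_oihw(flat, h, w, c_in, c_out):
    """Reorder a flat row-major [h, w, c_in, c_out] buffer into [c_out, c_in, h, w]."""
    if len(flat) != h * w * c_in * c_out:
        raise ValueError("buffer length does not match the given shape")
    out = [0.0] * len(flat)
    for i in range(h):
        for j in range(w):
            for c in range(c_in):
                for o in range(c_out):
                    src = ((i * w + j) * c_in + c) * c_out + o  # Keras layout
                    dst = ((o * c_in + c) * h + i) * w + j      # TensorRT layout
                    out[dst] = flat[src]
    return out


# A 1x2 filter with one input channel and two output channels.
print(hwio_to_oihw([0, 1, 2, 3], h=1, w=2, c_in=1, c_out=2))
```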
We compared the outputs layer by layer between the C++ code and the Python code. For our U-Net, all the activation maps are identical until roughly the third block (after two pooling layers). After that, there is a tiny difference between pixel values. The absolute percentage error is about 10^-8, so we don't think it's that bad. We are still in the process of polishing the C++ implementation.
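A layer-by-layer check of this kind can be as small as the sketch below, which takes two flattened activation maps and reports the largest relative difference. This is one plausible reading of the error measure mentioned above, not necessarily the exact one we computed; the name `max_relative_error` and the `eps` guard are assumptions.

```python
def max_relative_error(reference, candidate, eps=1e-12):
    """Largest |ref - cand| / |ref| over two equally long flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("activation maps differ in size")
    # eps avoids division by zero where the reference activation is exactly 0.
    return max(abs(r - c) / max(abs(r), eps) for r, c in zip(reference, candidate))


python_out = [0.5, 1.0, 2.0]
cpp_out = [0.5, 1.0, 2.00000002]
print(max_relative_error(python_out, cpp_out))
```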
Again, thanks for all the suggestions and answers we got in this post. Hope our solution can be helpful as well!

Hey, I've done something similar. I'd say the best way to tackle the issue is to export your model to .onnx with a good converter like this one. If you check the support matrix for ONNX, upsample is supported.
Then you can use https://github.com/onnx/onnx-tensorrt to convert the ONNX model to TensorRT. I've used this to convert a network that I trained in PyTorch and that had upsample layers. The onnx-tensorrt repo is a bit more active, and if you check the PR tab you can see other people writing custom layers and fork from there.

Is there an established way to convert a Tensorflow network architecture written in graph/session, to Keras API for TPU use?

In order to use a Google TPU, your code must use either the Estimator API or the Keras API.
Converting a graph to the Estimator API is pretty straightforward, as you mostly distribute the code among model_fn, features, input_fn, etc.
Converting a graph to Keras is not as straightforward, as Keras has its own functions and datatypes to handle various operations. However, judging from TensorFlow's blog post, they seem to recommend Keras over Estimator:
https://medium.com/tensorflow/standardizing-on-keras-guidance-on-high-level-apis-in-tensorflow-2-0-bad2b04c819a.
By establishing Keras as the high-level API for TensorFlow, we are
making it easier for developers new to machine learning to get started
with TensorFlow.
That said, if you are working on custom architectures, we suggest
using tf.keras to build your models instead of Estimator.
The TensorFlow Keras API is built on top of TensorFlow, though, so I'm guessing there's a way to express anything written in TensorFlow in Keras as well, especially since there seem to be functions that directly wrap a TensorFlow object for Keras, such as keras.optimizers.TFOptimizer(<tensorflow optimizer>).
Is there an established way to convert any architecture coded using graph/session to the Keras API?

How to use trained neural network in different platform/technology?

Suppose I trained a simple neural network using TensorFlow and Python on my laptop, and I want to use this model in a C++ app on my phone.
Is there any compatibility format I can use? What is the minimal framework needed to run neural networks (not to train them)?
UPD. I'm also interested in TensorFlow to non-TensorFlow compatibility. Do I need to build it from scratch, or are there any best practices?

Yes, if you are using iOS or Android. Depending on your specific needs, you have a choice between TensorFlow for Mobile and TensorFlow Lite:
https://www.tensorflow.org/mobile/
In particular, to load pre-trained models
https://www.tensorflow.org/mobile/prepare_models

Technically you don't need a framework at all. A conventional fully connected neural network is simple enough that you can implement it in straight C++. It's about 100 lines of code for the matrix multiplication and a dozen or so for the non-linear activation function.
The biggest part is figuring out how to parse a serialized Tensorflow model, especially given that there are quite a few ways to do so. You probably will want to freeze your TensorFlow model; this inserts the weights from the latest training into the model.
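To show how little is involved in the forward pass itself, here is a minimal sketch in plain Python of the same idea; a C++ version would have the same structure. The function names and the weight layout (one row per output unit) are my choices for illustration, not TensorFlow's, so weights exported from a real model may need transposing first.

```python
import math


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one weight row per output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Run inputs through a list of (weights, biases, activation) tuples."""
    for weights, biases, activation in layers:
        inputs = dense(inputs, weights, biases, activation)
    return inputs


# Toy network: 2 inputs -> 2 hidden units (ReLU) -> 1 output (sigmoid).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0], relu),
    ([[1.0, 1.0]], [0.0], sigmoid),
]
print(forward([2.0, 3.0], layers))
```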

Deploying Tensorflow app on the customer's infra

I'm currently developing a prediction model using TensorFlow. The model works well for a customer, so I'm trying to turn it into a real product.
The model needs to be retrained on the customer's input as time passes, and it has to be deployed on the customer's infrastructure (not SaaS or cloud). Moreover, I'd like to protect my code and models.
From my understanding of TensorFlow, a trained model can be exported as a protobuf and frozen so that only the nodes required for prediction are kept. I tried freeze_graph.py from the TensorFlow repo and successfully ran my prediction model using Golang and the libtensorflow.so runtime. (Alternatively, I could use TensorFlow Serving and C++.)
If I could train my model on our company's infra, I could say "Okay, let's get some beers". However, the model has to be trained on the customer's infra, and without Python code it seems I cannot train it.
https://www.tensorflow.org/versions/r0.12/how_tos/language_bindings/index.html
At this time, support for gradients, functions and control flow operations ("if" and "while") is not available in languages other than Python. This will be updated when the C API provides necessary support.
Is there any workaround for deploying a TF app without exposing the Python code or the model? Thanks in advance.

You can still use Python with a pre-trained model, without exposing all the code you needed to build it in the first place. As an example of this, have a look at the Inception retraining code, which loads a pretrained GraphDef and then retrains a new top layer:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py
