How to speed up TensorFlow RNN inference time

We've trained a tf-seq2seq model for question answering. The main framework is from google/seq2seq. We use a bidirectional RNN (GRU encoders/decoders, 128 units) with a soft attention mechanism.
We limit the maximum length to 100 words. The model mostly generates just 10~20 words.
For model inference, we tried two cases:
Normal (greedy algorithm): inference time is about 40ms~100ms.
Beam search: with beam width 5, inference time is about 400ms~1000ms.
So we want to try beam width 3. Inference time may decrease, but it may also affect the quality of the results.
Are there any suggestions for decreasing inference time in our case? Thanks.
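To make the greedy versus beam search cost gap concrete, here is a minimal sketch that counts decoder calls for both. Everything in it is an assumption for illustration: `toy_step` is a hypothetical stand-in for one decoder step, and the four-token vocabulary is made up. It is not how tf-seq2seq implements beam search, and real implementations usually batch the beams into one call, so the cost does not have to grow linearly with the beam width.

```python
import math

VOCAB = ["<eos>", "a", "b", "c"]   # made-up vocabulary for the sketch
calls = {"n": 0}                   # counts how often the "decoder" is invoked


def toy_step(prefix):
    """Hypothetical decoder step: log-probabilities over VOCAB for the next token."""
    calls["n"] += 1
    if len(prefix) >= 3:
        probs = [0.7, 0.1, 0.1, 0.1]          # strongly prefer <eos> after 3 tokens
    else:
        probs = [0.05, 0.05, 0.05, 0.05]
        probs[(len(prefix) % 3) + 1] = 0.85   # prefer a, then b, then c
    return [math.log(p) for p in probs]


def beam_search(step, beam_width, max_len):
    """Plain beam search; beam_width=1 is greedy decoding."""
    beams = [((), 0.0)]            # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for idx, log_prob in enumerate(step(tokens)):   # one call per live beam
                candidates.append((tokens + (VOCAB[idx],), score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)         # hypotheses that hit max_len without <eos>
    return max(finished, key=lambda c: c[1])


calls["n"] = 0
greedy_tokens, _ = beam_search(toy_step, beam_width=1, max_len=6)
greedy_calls = calls["n"]

calls["n"] = 0
beam_tokens, _ = beam_search(toy_step, beam_width=3, max_len=6)
beam_calls = calls["n"]

print(greedy_tokens, greedy_calls)
print(beam_tokens, beam_calls)
```

In this toy setup both searches return the same sequence, but the wider beam keeps decoding the surviving hypotheses and therefore invokes the decoder more often. Whether a narrower beam is acceptable for the real model can only be decided by measuring answer quality on held-out data.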

You can do network compression.
Cut the sentence into pieces with byte-pair encoding, a unigram language model, etc., and then try TreeLSTM.
You can try a faster softmax such as adaptive softmax (https://arxiv.org/pdf/1609.04309.pdf).
Try cudnnLSTM.
Try a dilated RNN.
Switch to a CNN such as a dilated CNN, or to BERT, for parallelization and more efficient GPU support.

If you require improved performance, I'd propose that you use OpenVINO. It reduces inference time by graph pruning and fusing some operations. Although OpenVINO is optimized for Intel hardware, it should work with any CPU.
Here are some performance benchmarks for NLP model (BERT) and various CPUs.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). The input shape below is for an image model; replace it with your own model's input shape. Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
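As a rough illustration of what switching data_type to FP16 does to stored weights, the sketch below rounds a few made-up weight values to IEEE 754 half precision using only the standard library. The weight values are assumptions for illustration. This shows only the per-value rounding error, not the end-to-end accuracy of a converted model, which has to be measured on your own data.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Made-up example weights, all within the normal FP16 range.
weights = [0.1234567, -0.9876543, 0.0004321, 3.1415926]
rounded = [to_fp16(w) for w in weights]

# Relative rounding error per weight.
rel_errors = [abs(r - w) / abs(w) for w, r in zip(weights, rounded)]
max_rel_error = max(rel_errors)

for w, r, err in zip(weights, rounded, rel_errors):
    print(f"{w:+.7f} -> {r:+.7f} (relative error {err:.2e})")
```

For values in the normal half-precision range, the relative rounding error is bounded by 2**-11 (about 0.05%), which is why FP16 often costs little accuracy. Very small or very large values behave worse.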
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input (named input_image here because the tutorial uses an image model)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Related

Does converting a seq2seq NLP model to the ONNX format negatively affect its performance?

I was looking at potentially converting an ml NLP model to the ONNX format in order to take advantage of its speed increase (ONNX Runtime). However, I don't really understand what is fundamentally changed in the new models compared to the old models. Also, I don't know if there are any drawbacks. Any thoughts on this would be very appreciated.

Model performance in terms of accuracy will be the same (considering only the output of the encoder and decoder). Inference performance may vary depending on the decoding method you use (e.g. greedy search, beam search, top-k and top-p). For more info on this, see this link.
For an ONNX seq2seq model, you need to implement the model.generate() method by hand. The onnxt5 library has done a good job of implementing greedy search for ONNX models. However, most NLP generative models yield good results with beam search (you can refer to the linked source for how Hugging Face implemented beam search for their models). Unfortunately, for ONNX models you have to implement it yourself.
Inference speed definitely increases with ONNX Runtime, as shown in this notebook (the example is on BERT).
You have to run the encoder and the decoder separately on ONNX Runtime to take advantage of it. If you want to know more about ONNX and its runtime, refer to this link.
Update: you can refer to the fastT5 library, which implements both greedy and beam search for T5. For BART, have a look at this issue.
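To show what "implementing generate() by hand" with a separately run encoder and decoder can look like, here is a minimal greedy-search sketch. The `toy_encode` and `toy_decode_step` functions are hypothetical stand-ins. In a real setup they would presumably wrap calls to two ONNX Runtime sessions, one for the exported encoder and one for the exported decoder. The token ids and vocabulary size are made up.

```python
def greedy_generate(encode, decode_step, input_ids, start_id, eos_id, max_len=20):
    """Run the encoder once, then call the decoder one token at a time."""
    encoder_state = encode(input_ids)            # computed once per input
    output_ids = [start_id]
    for _ in range(max_len):
        logits = decode_step(encoder_state, output_ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        output_ids.append(next_id)
        if next_id == eos_id:                    # stop early at end-of-sequence
            break
    return output_ids


# Toy stand-ins: the "model" copies its input and then emits EOS (id 0).
def toy_encode(input_ids):
    return list(input_ids)


def toy_decode_step(encoder_state, output_ids):
    vocab_size = 10
    position = len(output_ids) - 1               # tokens generated so far
    target = encoder_state[position] if position < len(encoder_state) else 0
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = greedy_generate(toy_encode, toy_decode_step, [5, 3, 7], start_id=1, eos_id=0)
print(generated)
```

The two points that carry over to a real model are that the encoder runs only once per input and that the loop stops as soon as the end-of-sequence token is produced. Beam search follows the same structure but tracks several hypotheses per step.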

Advantages of going from the PyTorch eager world to ONNX include:
ONNX Runtime is much lighter than PyTorch.
General and transformer-specific optimizations and quantization from ONNX Runtime can be leveraged
ONNX makes it easy to use many backends, first through the many execution providers supported in ONNX Runtime, from TensorRT to OpenVINO, to TVM. Some of them are top notch for inference speed on CPU/GPU.
For some specific seq2seq architectures (gpt2, bart, t5), ONNX Runtime supports native BeamSearch and GreedySearch operators: https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models . These let you avoid the PyTorch generate() method, but at the cost of less flexibility.
A decent compromise and alternative to fastT5, with more flexibility, could be to export the encoder and decoder parts of the model separately, run them with ONNX Runtime, and use PyTorch to handle generation. This is exactly what is implemented in ORTModelForSeq2SeqLM from the Optimum library:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# instead of: `model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")`
# the argument `from_transformers=True` handles the ONNX export on the fly.
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True, use_cache=True)
inputs = tokenizer("Translate English to German: Is this model actually good?", return_tensors="pt")
gen_tokens = model.generate(**inputs, use_cache=True)
outputs = tokenizer.batch_decode(gen_tokens)
print(outputs)
# prints: ['<pad> Ist dieses Modell tatsächlich gut?</s>']
As a side note, PyTorch will unveil official support for torchdynamo in PyTorch 2.0, which is, in my opinion, a strong competitor to the ONNX + ONNX Runtime deployment path. I personally believe that PyTorch XLA plus a good torchdynamo backend will rock it for generation.
Disclaimer: I am a contributor to the Optimum library

How to use keras model inside other model in TPU

I am trying to convert a Keras model to a TPU model in Google Colab, but this model has another model inside.
Take a look at the code:
https://colab.research.google.com/drive/1EmIrheKnrNYNNHPp0J7EBjw2WjsPXFVJ
This is a modified version of one of the examples in the google tpu documentation:
https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb
If the sub_model is converted and used directly, it works, but if the sub-model is inside another model, it does not work. I need this kind of nested network because I am trying to train a GAN that has 2 networks inside (gan = generator + discriminator), so if this test works, it will probably work with the GAN too.
I have tried several things:
Converting the model to TPU without converting the sub-model. In that case, an error related to the inputs of the sub-model is raised when training starts.
Converting both the model and the sub-model to TPU. In that case, an error is raised when converting the "parent" model; the exception only says "layers" at the end.
Converting only the sub-model to TPU. In that case no error is raised, but the training is not accelerated by the TPU and is extremely slow, as if no conversion to TPU had been made at all.
Using a fixed batch size or not. Both give the same result: the model does not work.
Any ideas? Thanks a lot.

Divide the problem into parts. First, use only the sub-model on the TPU. Then put something simple in place of the sub-model and use the parent model on the TPU. If this does not work, create something very simple with a similar structure out of models you are sure work, and then add things step by step until you reach the complex model you want to use on the TPU.
I am struggling with such things too. What I did at the very beginning, using MNIST, was train the model, extract the coefficients, rewrite the ReLU, dense, dropout, and NN matrix operations myself, and run the model using NumPy, then CuPy, then PyOpenCL. Then I replaced functions with my own raw CUDA C and OpenCL functions, so that by going deeper and simpler I could find what was wrong when something did not work. In the end I wrote my own genetic selective training algorithm and learned a lot.
Most importantly, it gave me the opportunity to try some crazy ideas for training, modelling, manipulating, and making sense of NN coefficients.
The problem, in my opinion, is that TF, Keras, etc. are too high level. With optimizers and solvers, there is too much that is unknown. Even the neural networks are not under control. GANs are problematic: training does not converge every time, and most of the time it takes days. Even if you manage to train one, you have no idea how it converges. Most of the tricks and techniques that protect you from vanishing gradients are not mathematically backed, yet they work amazingly well. (?!?)
Go simpler and deeper, and add complexity step by step. Follow a practice in which you comprehend as much as you can. It will cost some time and energy, but in my opinion you will benefit from it tremendously.

Keras model predict iteration getting slower.

Hi, I have a problem with Keras on Python 3.6.
My environment is Keras with Python, CPU only.
The problem is that when I repeatedly call the same Keras model to predict on different inputs, it gets slower and slower.
My code is as simple as this:
for i in range(100):
    model.predict(x)
The first run is fast; it takes maybe 2 seconds. But the second run takes 3 seconds, the third takes 5 seconds, and so on. It keeps getting slower even if I use the same input.
How can I call predict on a Keras model repeatedly and keep it fast? I don't want any slowdown; it is critical.
How can I fix it?

Try using the __call__ method directly. The documentation of the predict method states the following:
For small numbers of inputs that fit in one batch, directly use __call__() for faster execution, e.g., model(x).
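To check whether switching from predict to a direct call actually removes the slowdown, it helps to time each iteration separately. The helper below is a minimal sketch using only the standard library. The function names are made up for illustration, and the workload is a stand-in, since with Keras you would pass something like `lambda: model(x)` or `lambda: model.predict(x)` instead.

```python
import statistics
import time


def time_calls(fn, repeats=20, warmup=2):
    """Call fn repeatedly and return the duration of each timed call in seconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        fn()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Compare the first and second half of the run to spot a growing latency."""
    half = len(durations) // 2
    return {
        "median_s": statistics.median(durations),
        "first_half_mean_s": statistics.mean(durations[:half]),
        "second_half_mean_s": statistics.mean(durations[half:]),
    }


# Stand-in workload for the sketch; replace with the model call you want to measure.
durations = time_calls(lambda: sum(i * i for i in range(10_000)), repeats=10)
report = summarize(durations)
print(report)
```

If the second-half mean is clearly larger than the first-half mean, per-call latency is growing over the run, which matches the symptom described in the question.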
I see the performance is critical in this case. So, if it doesn't help, you could use OpenVINO which is optimized for Intel hardware but it should work with any CPU. Your performance should be much better than using Keras directly.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert an HDF5 model, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, use AUTO.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input (named input_image here because the tutorial uses an image model)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

If your model calls the fit function in batches, different samples in the same batch take slightly different times over the course of an iteration, and as you repeat the prediction again and again over more and more groups, the time the model takes will grow longer and longer.

Mixture usage of CPU and GPU in Keras

I am building a neural network on Keras, including multiple layers of LSTM, Permute and Dense.
It seems LSTM is GPU-unfriendly. So I did some research and used
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
But based on my understanding, with is a try...finally block that ensures clean-up code is executed. I don't know whether the following mixed CPU/GPU code works or not. Will it speed up training?
with tf.device('/cpu:0'):
    out = LSTM(cells)(inp)
with tf.device('/gpu:0'):
    out = Permute(some_shape)(out)
with tf.device('/cpu:0'):
    out = LSTM(cells)(out)
with tf.device('/gpu:0'):
    out = Dense(output_size)(out)

As you can read here, tf.device is a context manager that switches the default device to the one passed as its argument within the context (block) it creates. So this code should run everything under '/cpu:0' on the CPU and the rest on the GPU.
Whether it will speed up your training is really hard to answer because it depends on the machine you use. I don't expect the computations to be faster, though, because each change of device causes data to be copied between GPU RAM and machine RAM. This could even slow down your computations.
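The "with is just try...finally" concern can be cleared up with a small pure-Python sketch of a device-switching context manager. This is not TensorFlow's implementation. The `device`, `current_device`, and `make_layer` names are hypothetical, and the sketch only mimics the behavior described above, where everything created inside the block picks up the block's device and the previous default is restored on exit.

```python
from contextlib import contextmanager

_device_stack = ["/gpu:0"]          # hypothetical default device


@contextmanager
def device(name):
    """Make `name` the current device inside the block, then restore the old one."""
    _device_stack.append(name)
    try:
        yield
    finally:
        _device_stack.pop()         # runs even if the block raises


def current_device():
    return _device_stack[-1]


placements = []


def make_layer(layer_name):
    """Stand-in for building a layer: records which device it was placed on."""
    placements.append((layer_name, current_device()))


with device("/cpu:0"):
    make_layer("lstm_1")
make_layer("permute")               # outside any block: default device
with device("/cpu:0"):
    make_layer("lstm_2")
make_layer("dense")

print(placements)
```

The try...finally part only guarantees that the previous device is restored. The useful effect is the state that is active while the block runs, which is why alternating blocks place alternating layers on different devices.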

I created a model using 2 LSTM layers and 1 dense layer and trained it on my GPU (NVIDIA GTX 1050 Ti). Here are my observations.
Use the CUDA LSTM: https://keras.io/layers/recurrent/#cudnnlstm
Use a batch size that allows more GPU parallelism. If I use a very small batch size (2-10), the GPU's cores are not fully utilized, so I used a batch size of 100.
If I train my network on the GPU and try to use it for predictions on the CPU, it compiles and runs, but the predictions are weird. In my case I have the luxury of using a GPU for prediction as well.
for multi layer LSTM, need to use
here is some sample snippet
model = keras.Sequential()
model.add(keras.layers.cudnn_recurrent.CuDNNLSTM(
    neurons,
    batch_input_shape=(nbatch_size, reshapedX.shape[1], reshapedX.shape[2]),
    return_sequences=True,
    stateful=True))

Seq2seq pytorch Inference slow

I tried the seq2seq PyTorch implementation available here (seq2seq). After profiling the evaluation code (evaluate.py), the piece of code taking the longest was the decode_minibatch method:
def decode_minibatch(
    config,
    model,
    input_lines_src,
    input_lines_trg,
    output_lines_trg_gold
):
    """Decode a minibatch."""
    for i in xrange(config['data']['max_trg_length']):
        decoder_logit = model(input_lines_src, input_lines_trg)
        word_probs = model.decode(decoder_logit)
        decoder_argmax = word_probs.data.cpu().numpy().argmax(axis=-1)
        next_preds = Variable(
            torch.from_numpy(decoder_argmax[:, -1])
        ).cuda()
        input_lines_trg = torch.cat(
            (input_lines_trg, next_preds.unsqueeze(1)),
            1
        )
    return input_lines_trg
I trained the model on a GPU and loaded it in CPU mode for inference. Unfortunately, every sentence seems to take ~10 sec. Is slow prediction expected with PyTorch?
Any fixes or suggestions to speed this up would be much appreciated. Thanks.

One solution for slow performance may be to use a toolkit optimized for the inference, such as OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes the inference performance by e.g. graph pruning or fusing some operations together.
You can find a full tutorial on how to convert the PyTorch model here (FastSeg) and here (BERT). Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can convert an ONNX model. This sample code assumes the model is for computer vision.
import torch
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command line tool which comes from OpenVINO Development Package so be sure you have installed it. It converts the ONNX model to OV format (aka IR), which is a default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). Run in command line:
mo --input_model "model.onnx" --input_shape "[1, 3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input (named input_image here because the tutorial uses an image model)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Following up on dragon7's answer, I'd recommend using the ONNX export from Optimum, which handles the export for encoder/decoder models out of the box and makes use of past key values in the decoder:
optimum-cli export onnx --model gpt2 --task causal-lm-with-past --for-ort gpt2_onnx/
If you want to use OpenVINO, a good option can be the OVModel, which handles inference with OpenVINO (especially for seq2seq models) out of the box!
Disclaimer: I am a contributor to the Optimum library.
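To give a feel for why reusing past key values matters, here is a rough back-of-the-envelope sketch. It assumes that decoder work per step is proportional to the number of target positions fed in at that step, which is a simplification: attention over cached entries still costs something. The decode_minibatch loop in the question feeds the whole growing target sequence back into the model at every step, which corresponds to the no-cache case here.

```python
def positions_processed(num_tokens, use_cache):
    """Count target positions fed to the decoder while generating num_tokens tokens."""
    total = 0
    for step in range(1, num_tokens + 1):
        # With a cache only the newest token is fed in.
        # Without one, the whole prefix of length `step` is re-processed.
        total += 1 if use_cache else step
    return total


without_cache = positions_processed(20, use_cache=False)
with_cache = positions_processed(20, use_cache=True)
print(without_cache, with_cache)  # 210 20
```

Under this simplified model the uncached loop grows quadratically with output length, while the cached loop grows linearly. The gap widens quickly for longer outputs.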
