I tried the seq2seq PyTorch implementation available here: seq2seq. After profiling the evaluation code (evaluate.py), the piece of code taking the longest time was the decode_minibatch method:
def decode_minibatch(
    config,
    model,
    input_lines_src,
    input_lines_trg,
    output_lines_trg_gold
):
    """Decode a minibatch."""
    for i in xrange(config['data']['max_trg_length']):
        decoder_logit = model(input_lines_src, input_lines_trg)
        word_probs = model.decode(decoder_logit)
        decoder_argmax = word_probs.data.cpu().numpy().argmax(axis=-1)
        next_preds = Variable(
            torch.from_numpy(decoder_argmax[:, -1])
        ).cuda()
        input_lines_trg = torch.cat(
            (input_lines_trg, next_preds.unsqueeze(1)),
            1
        )
    return input_lines_trg
I trained the model on a GPU and loaded it in CPU mode to run inference. Unfortunately, every sentence seems to take ~10 seconds. Is such slow prediction expected with PyTorch?
Any fixes or suggestions to speed it up would be much appreciated. Thanks.
One solution for slow performance may be to use a toolkit optimized for inference, such as OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes inference performance by, for example, pruning the graph or fusing some operations together.
You can find a full tutorial on how to convert the PyTorch model here (FastSeg) and here (BERT). Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can convert an ONNX model. This sample code assumes the model is for computer vision.
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to OV format (aka IR), which is the default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). Run in the command line:
mo --input_model "model.onnx" --input_shape "[1, 3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
Following up on dragon7's answer, I'd recommend using the ONNX export from Optimum, which can handle the export of encoder/decoder models out of the box, as well as make use of past key values in the decoder:
optimum-cli export onnx --model gpt2 --task causal-lm-with-past --for-ort gpt2_onnx/
If you want to use OpenVINO, a good option can be the OVModel classes, which handle inference with OpenVINO (especially for seq2seq models) out of the box!
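A minimal sketch of what that could look like (the exact import path and export flag are assumptions here and may vary with your optimum-intel version):
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForSeq2SeqLM  # requires optimum-intel / optimum[openvino]

tokenizer = AutoTokenizer.from_pretrained("t5-small")
# from_transformers=True exports the PyTorch model to OpenVINO IR on the fly
# (assumed to mirror the ORTModelForSeq2SeqLM interface shown further down).
model = OVModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)

inputs = tokenizer("Translate English to German: Is this model actually good?", return_tensors="pt")
gen_tokens = model.generate(**inputs)
print(tokenizer.batch_decode(gen_tokens))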
Disclaimer: I am a contributor to Optimum library
Related
I was looking at potentially converting an ML NLP model to the ONNX format in order to take advantage of its speed increase (ONNX Runtime). However, I don't really understand what fundamentally changes in the new models compared to the old ones. Also, I don't know if there are any drawbacks. Any thoughts on this would be much appreciated.
The performance of the model in terms of accuracy will be the same (just considering the output of the encoder and decoder). Inference performance may vary based on the method you use for inferencing (e.g. greedy search, beam search, top-k & top-p); see this for more info.
For an ONNX seq2seq model, you need to implement the model.generate() method by hand. The onnxt5 library has done a good job of implementing greedy search (for ONNX models). However, most NLP generative models yield good results with beam search (you can refer to the linked source for how Hugging Face implemented beam search for their models). Unfortunately, for ONNX models you have to implement it yourself.
The inference speed definitely increases, as shown in this notebook by onnx-runtime (the example is on BERT).
You have to run both the encoder and the decoder separately on onnx-runtime, and you can take advantage of its optimizations. If you want to know more about ONNX and its runtime, refer to this link.
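A minimal sketch of hand-rolled greedy decoding over separately exported encoder and decoder ONNX files (the file names and the input/output names such as "input_ids" and "encoder_hidden_states" are assumptions here; they depend on how the model was exported):
import numpy as np
import onnxruntime as ort

encoder = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession("decoder.onnx", providers=["CPUExecutionProvider"])

def greedy_decode(input_ids, bos_id, eos_id, max_len=50):
    # Run the encoder once and reuse its hidden states at every decoding step.
    enc_out = encoder.run(None, {"input_ids": input_ids})[0]
    generated = np.array([[bos_id]], dtype=np.int64)
    for _ in range(max_len):
        logits = decoder.run(
            None,
            {"input_ids": generated, "encoder_hidden_states": enc_out},
        )[0]
        next_id = int(logits[0, -1].argmax())  # greedy: pick the most probable token
        generated = np.concatenate([generated, [[next_id]]], axis=1)
        if next_id == eos_id:
            break
    return generated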
Update: you can refer to the fastT5 library; it implements both greedy and beam search for T5. For BART, have a look at this issue.
Advantages of going from the PyTorch eager world to ONNX include:
ONNX Runtime is much lighter than PyTorch.
General and transformer-specific optimizations and quantization from ONNX Runtime can be leveraged.
ONNX makes it easy to use many backends, first through the many execution providers supported in ONNX Runtime, from TensorRT to OpenVINO, to TVM. Some of them are top notch for inference speed on CPU/GPU.
For some specific seq2seq architectures (gpt2, bart, t5), ONNX Runtime supports native BeamSearch and GreedySearch operators: https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models , allowing you to avoid PyTorch's generate() method, but at the cost of less flexibility.
A decent compromise / alternative to fastT5 with more flexibility could be to export the encoder and decoder parts of the model separately, run them with ONNX Runtime, but use PyTorch to handle generation. This is exactly what is implemented in ORTModelForSeq2SeqLM from the Optimum library:
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# instead of: `model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")`
# the argument `from_transformers=True` handles the ONNX export on the fly.
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True, use_cache=True)
inputs = tokenizer("Translate English to German: Is this model actually good?", return_tensors="pt")
gen_tokens = model.generate(**inputs, use_cache=True)
outputs = tokenizer.batch_decode(gen_tokens)
print(outputs)
# prints: ['<pad> Ist dieses Modell tatsächlich gut?</s>']
As a side note, PyTorch will unveil official support for torchdynamo in PyTorch 2.0, which is in my opinion a strong competitor to the ONNX + ONNX Runtime deployment path. I personally believe that PyTorch XLA + a good torchdynamo backend will rock it for generation.
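For reference, a minimal sketch of that path with torch.compile in PyTorch 2.0 (the tiny stand-in module below is just for illustration; in practice you would compile your own model):
import torch
import torch.nn as nn

# A stand-in module; in practice this would be your seq2seq model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128)).eval()

# torch.compile (PyTorch >= 2.0) traces the forward pass with torchdynamo and
# compiles it with the default "inductor" backend.
compiled_model = torch.compile(model)

x = torch.randn(1, 128)
with torch.no_grad():
    out = compiled_model(x)  # the first call compiles; later calls reuse the compiled graph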
Disclaimer: I am a contributor to the Optimum library
We have trained a Mask R-CNN model on an NVIDIA GPU to do object instance segmentation and tested it on some images with sufficient performance. Now we are looking into deploying the trained model on a Neural Compute Stick 2. I'm just getting started with the OpenVINO toolkit, and here is what I have done:
I downloaded mask_rcnn_inception_v2_coco.tar.gz from TensorFlow detection model zoo and decompressed it.
I used the Model Optimizer as follows to get an Intermediate Representation:
python3 mo_tf.py \
    --input_model ./frozen_inference_graph.pb \
    --tensorflow_use_custom_operations_config extensions/front/tf/mask_rcnn_support.json \
    --tensorflow_object_detection_api_pipeline_config ./pipeline.config \
    --data_type FP16
(I used a data type of FP16, as the default FP32 is not supported on the VPU.)
Then I used the Inference Engine in the mask_rcnn_demo as follows:
./mask_rcnn_demo -m ./frozen_graph.xml -i ./image.jpg -d MYRIAD
However, I got the following error:
[ ERROR ] [VPU] Softmax input or output
SecondStageBoxPredictor/ClassPredictor/BiasAdd/softmax has invalid batch
Could someone please point me to the source of this error?
I understand from the documentation that Mask R-CNN is currently only supported on CPU and GPU, but I would like to know whether there is anything I can do to get it to run on the VPU (such as writing custom layers for layers not supported by the Model Optimizer). I haven't found any explanation in the documentation of why Mask R-CNN is not supported on the VPU.
Thanks,
Give the HETERO plugin a try:
-d HETERO:MYRIAD,CPU
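For example, the demo command from the question would then become:
./mask_rcnn_demo -m ./frozen_graph.xml -i ./image.jpg -d HETERO:MYRIAD,CPU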
Hi, I have a problem with Keras on Python 3.6.
My environment is Keras with Python, CPU only.
The problem is that when I repeatedly call predict on the same Keras model with different inputs, it gets slower and slower.
My code is as simple as this:
for i in range(100):
    model.predict(x)
The first run is fast; it takes maybe 2 seconds. But the second run takes 3 seconds and the third takes 5 seconds... it keeps getting slower and slower even though I use the same input.
How can I keep repeated predict calls on the Keras model fast? I don't want them to keep getting slower; it is very critical.
How can I fix it?
Try using the __call__ method directly. The documentation of the predict method states the following:
For small numbers of inputs that fit in one batch, directly use __call__() for faster execution, e.g., model(x).
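A minimal sketch, assuming model and x are the Keras model and input from the question and that x fits in a single batch:
for i in range(100):
    preds = model(x, training=False)  # __call__ skips predict()'s per-call setup
Passing training=False makes layers like dropout and batch normalization run in inference mode.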
I see that performance is critical in this case. So, if that doesn't help, you could use OpenVINO, which is optimized for Intel hardware but should work with any CPU. Your performance should be much better than with Keras directly.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert an HDF5 model, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
If your model calls the fit function in batches, different samples in the same batch take slightly different amounts of time over the course of the iteration; as you then try again and again to get more and more batches of predictions, the model's prediction time gets longer and longer.
We've trained a tf-seq2seq model for question answering. The main framework is from google/seq2seq. We use a bidirectional RNN (GRU encoders/decoders, 128 units) with a soft attention mechanism.
We limit the maximum length to 100 words. It mostly generates only 10~20 words.
For model inference, we try two cases:
Normal (greedy) decoding. Its inference time is about 40ms~100ms.
Beam search. We try a beam width of 5, and its inference time is about 400ms~1000ms.
So we want to try a beam width of 3; its time may decrease, but it may also influence the final quality.
Are there any suggestions to decrease inference time for our case? Thanks.
You can do network compression.
Cut the sentence into pieces with byte-pair encoding, a unigram language model, etc., and then try TreeLSTM (see the sentencepiece sketch after this list).
You can try a faster softmax, like adaptive softmax (https://arxiv.org/pdf/1609.04309.pdf).
Try cudnnLSTM.
Try dilated RNN.
Switch to a CNN, like dilated CNN, or to BERT for parallelization and more efficient GPU support.
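To illustrate the subword point above, a minimal byte-pair-encoding sketch with sentencepiece (the library choice, file names, and vocabulary size are assumptions, not something from the original answer):
import sentencepiece as spm

# Train a small BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=8000, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("the inference time is about 40ms", out_type=str))
# prints the subword pieces, e.g. ['▁the', '▁infer', 'ence', ...]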
If you require improved performance, I'd propose that you use OpenVINO. It reduces inference time by graph pruning and fusing some operations. Although OpenVINO is optimized for Intel hardware, it should work with any CPU.
Here are some performance benchmarks for an NLP model (BERT) on various CPUs.
It's rather straightforward to convert a TensorFlow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
I've been walking through the RNN code in the TensorFlow tutorial: https://www.tensorflow.org/tutorials/recurrent
The original RNN code is here: https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
I saved the trained RNN model as 'train-model' through:
if FLAGS.save_path:
    print("Saving model to %s." % FLAGS.save_path)
    sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)
Now I'm trying to restore the saved model and run an additional test with it by:
with tf.name_scope("Test"):
    test_input = PTBInput(config=eval_config, data=test_data, name="TestInput")
    with tf.variable_scope("Model", reuse=None, initializer=initializer):
        mtest = PTBModel(is_training=False, config=eval_config,
                         input_=test_input)

save = tf.train.Saver()
with tf.Session() as session:
    save.restore(session, tf.train.latest_checkpoint("./"))
    test_perplexity = run_epoch(session, mtest)
It seems that the model is loaded correctly, but it hangs at the line
vals = session.run(fetches, feed_dict)
in the run_epoch function, when called to compute test_perplexity. CTRL-C is unable to quit the program, and the GPU utilization is at 0%, so it is most probably blocked on something.
Any help would be greatly appreciated!
Try installing TensorFlow from source. This is recommended because you can build the TensorFlow binary for your specific architecture (GPU, CUDA, cuDNN).
This is even mentioned as one of the best practices for improving TensorFlow performance. Check the TensorFlow Performance Guide. A small excerpt from it:
Building from source with compiler optimizations for the target hardware and ensuring the latest CUDA platform and cuDNN libraries are installed results in the highest performing installs.
The problem you mentioned typically occurs when the Compute Capability with which the TensorFlow binary was built is different from that of your GPU. Installing it from source gives you the option to configure the specific Compute Capability. Check the guide for installing from source.
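As a rough sketch of the source build flow on Linux (paths and flags vary with your setup; the ./configure step is where you pick CUDA, cuDNN, and your GPU's Compute Capability, as the linked guide describes):
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
./configure   # prompts for CUDA/cuDNN paths and the GPU compute capability
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl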