C++ TensorFlow Lite inference of TF Hub models - Python

I defined and fine-tuned a multilingual MobileBERT model built with the following Keras code:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/mobilebert_multi_cased_L-24_H-128_B-512_A-4_F-4_OPT/1", trainable=True)
i = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
x = bert_preprocess(i)
x = bert_encoder(x)
x = tf.keras.layers.Dropout(0.2, name="dropout")(x['pooled_output'])
x = tf.keras.layers.Dense(num_classes, activation='softmax', name="output")(x)
model = tf.keras.Model(i, x)
Eventually I saved the model as a TF SavedModel and then converted it to a TFLite model with the following supported ops:
tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
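For reference, a minimal sketch of that conversion step, assuming the fine-tuned model was exported to a SavedModel directory (the paths below are placeholders):
import tensorflow as tf

# Placeholder path; use wherever the fine-tuned SavedModel was exported.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
    tf.lite.OpsSet.SELECT_TF_OPS,    # enable TensorFlow ops.
]
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)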
The problem started when I tried to load the converted TFLite model using the minimal example from the TensorFlow GitHub repository:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/minimal
I get the following errors:
ERROR: Select TensorFlow op(s), included in the given model, is(are) not supported by this interpreter. Make sure you apply/link the Flex delegate before inference. For the Android, it can be resolved by adding "org.tensorflow:tensorflow-lite-select-tf-ops" dependency. See instructions: https://www.tensorflow.org/lite/guide/ops_select
ERROR: Node number 0 (FlexHashTableV2) failed to prepare.
Error at /home/nativ/dev/tflite_inference/minimal.cc:60
What can I do to fix those errors?

Your model contains some TensorFlow ops, so you need to link in the Select TF ops (also known as the Flex delegate), which lets you run TensorFlow kernels for the operations that are not available in TFLite as native operations.
Mainly, you will need to build the Flex delegate library and link it into your binary:
bazel build -c opt --config=monolithic tensorflow/lite/delegates/flex:tensorflowlite_flex
See the Select TF operators guide (https://www.tensorflow.org/lite/guide/ops_select) for more details.
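As a quick sanity check, the full tensorflow pip package already links the Flex delegate, so the same .tflite file should load and prepare from the Python interpreter; a minimal sketch (the model path is a placeholder):
import tensorflow as tf

# allocate_tensors() is where the C++ minimal example failed to prepare FlexHashTableV2;
# with the Flex delegate available it should succeed here.
interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()
print(interpreter.get_input_details())
print(interpreter.get_output_details())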

Related

PyTorch NLP model doesn’t use GPU when making inference

I have an NLP model trained with PyTorch that is to be run on a Jetson Xavier. I installed jetson-stats to monitor CPU and GPU usage. When I run the Python script, only the CPU cores are under load; the GPU bar does not move. I searched Google with queries like "How to check if pytorch is using the GPU?" and checked the results on stackoverflow.com, etc. Following the advice given to others with a similar issue, CUDA is available and there is a CUDA device on my Jetson Xavier. However, I don’t understand why the GPU bar does not change while the CPU core bars max out.
I don’t want to use the CPU; it takes far too long to compute. It looks to me like the model uses the CPU, not the GPU. How can I be sure, and if it does use the CPU, how can I switch it to the GPU?
Note: The model is taken from the Hugging Face transformers library. I have tried calling the cuda() method on the model (model.cuda()). In that case the GPU is used, but I cannot get an output from the model and an exception is raised.
Here is the code:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch
BERT_DIR = "savasy/bert-base-turkish-squad"
tokenizer = AutoTokenizer.from_pretrained(BERT_DIR)
model = AutoModelForQuestionAnswering.from_pretrained(BERT_DIR)
nlp=pipeline("question-answering", model=model, tokenizer=tokenizer)
def infer(question, corpus):
    try:
        ans = nlp(question=question, context=corpus)
        return ans["answer"], ans["score"]
    except:
        ans = None
        pass
    return None, 0
The problem was solved by loading the pipeline with the device parameter:
nlp = pipeline("question-answering", model=BERT_DIR, device=0)
For the model to work on the GPU, both the model and the data have to be on the GPU. With the pipeline API you can do this as follows (once the pipeline is given a CUDA device, it takes care of moving the tokenized inputs there):
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch
BERT_DIR = "savasy/bert-base-turkish-squad"
device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained(BERT_DIR)
model = AutoModelForQuestionAnswering.from_pretrained(BERT_DIR)
model.to(device)  ## model to GPU
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer, device=0)  ## run the pipeline on GPU 0
def infer(question, corpus):
    try:
        ans = nlp(question=question, context=corpus)  ## the pipeline moves the tokenized inputs to the GPU
        return ans["answer"], ans["score"]
    except:
        ans = None
        pass
    return None, 0
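As an optional quick check, you can confirm that CUDA is visible and that the model's weights actually live on the GPU; a minimal sketch, continuing from the code above:
import torch

print(torch.cuda.is_available())        # should print True on the Jetson Xavier
print(torch.cuda.get_device_name(0))    # name of the CUDA device
print(next(model.parameters()).device)  # should print something like cuda:0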

Running multiple TensorRT-optimized models in TensorFlow

My project uses multiple Keras models. Those models take inputs with batch sizes that vary from 1 to 24. I decided to optimize those models using TF-TRT.
I tried 2 conversion approaches:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
The first approach converts the model but does not build TensorRT engines for it:
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model_path,
    conversion_params=conversion_params)
converter.convert()
converter.save(output_saved_model_dir=trt_fp32_model_path)
The second approach converts the model and builds TensorRT engines for all possible input shapes:
def input_function():
    input_shapes = [(x, MODEL_INPUT_H, MODEL_INPUT_W, 3) for x in range(1, 25)]
    for shape in input_shapes:
        yield [np.random.normal(size=shape).astype(np.float32)]

conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32,
    maximum_cached_engines=100
)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model_path,
    conversion_params=conversion_params)
converter.convert()
converter.build(input_fn=input_function)
converter.save(output_saved_model_dir=trt_fp32_model_path)
In the script that uses my models, I run them consecutively:
some loop:
    model1.predict(model1_input)
    model2.predict(model2_input)
    model3.predict(model3_input)
When the first conversion approach is used to optimize the models, I am able to load all of them, but at runtime TensorFlow rebuilds the TensorRT engines every time the model execution context changes. This causes a large performance overhead, which I tried to overcome by caching the TensorRT engines for those models (the second conversion approach).
The problem is that when I try to load more than one TensorRT-optimized model with pre-built engines, TensorFlow throws the following error:
2020-04-01 09:11:44.820866: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: Expect engine cache to be empty, but got 24 entries.
[[{{node StatefulPartitionedCall/InitializeTRTResource}}]]
Error - Expect engine cache to be empty, but got 24 entries.
[[{{node StatefulPartitionedCall/InitializeTRTResource}}]] [Op:__inference_restored_function_body_64832]
Function call stack:
restored_function_body
The same error occurs when only one engine is saved for each model.
I use the following code to load TensorRT optimized SavedModel:
saved_model_loaded = tf.saved_model.load(
    trt_fp32_model_path,
    tags=[tag_constants.SERVING]
)
graph_func = saved_model_loaded.signatures['serving_default']
I also tried to convert graph_func to frozen_func, but this didn't make any difference:
graph_func = saved_model_loaded.signatures[signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
frozen_func = convert_to_constants.convert_variables_to_constants_v2(
    graph_func)
I am using nvcr.io/nvidia/tensorflow:19.12-tf2-py3 docker container to optimize/run the models.
Is it possible at all to run multiple TensorRT-optimized models with pre-built engines simultaneously using TensorFlow, or can this only be done with the TensorRT Inference Server?
In case this is a valid usage scenario, what am I missing in my workflow?
You can avoid the "Error - Expect engine cache to be empty, but got X entries." issue by performing the TensorRT optimization and engine building for both models in the same Python file, for example:
if __name__ == "__main__":
    optimize_and_build_engines_1()
    optimize_and_build_engines_2()
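A minimal sketch of what those two calls could look like, reusing the second conversion approach from the question; the model paths, output directories, and input sizes below are placeholders:
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def make_input_fn(input_h, input_w):
    # Yields one random input per batch size so an engine is built for each shape.
    def input_function():
        for batch_size in range(1, 25):
            yield [np.random.normal(size=(batch_size, input_h, input_w, 3)).astype(np.float32)]
    return input_function

def optimize_and_build_engines(saved_model_path, trt_output_path, input_function):
    # Same second conversion approach as in the question, wrapped for reuse.
    conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
        precision_mode=trt.TrtPrecisionMode.FP32,
        maximum_cached_engines=100)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=saved_model_path,
        conversion_params=conversion_params)
    converter.convert()
    converter.build(input_fn=input_function)
    converter.save(output_saved_model_dir=trt_output_path)

if __name__ == "__main__":
    # Placeholder paths and input sizes for the two models.
    optimize_and_build_engines("model1_saved_model", "model1_trt_fp32", make_input_fn(224, 224))
    optimize_and_build_engines("model2_saved_model", "model2_trt_fp32", make_input_fn(224, 224))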

Get summary of TensorFlow model

Model.summary() gives me this output:
Now how can I inspect the layers of sequential_1 and sequential_3?
I want the whole model summary, but it shows two Sequential sub-models, which means two models were combined. How can I get the summary of both of them?
I only have the model.h5 file, nothing else.
A model saved in the .h5 format includes everything about the model.
To inspect the summaries of the nested models (a Model inside a Model, as in your case), you can extract the layers and then call the summary method on each of them, i.e.:
layer_summary = [layer.summary() for layer in loaded_model.layers]
Here is the complete code I used to reproduce your scenario.
import tensorflow as tf
print('Running Tensorflow version {}'.format(tf.__version__)) # Tensorflow 2.1.0
model_path = '/content/keras_model.h5'
loaded_model = tf.keras.models.load_model(model_path)
loaded_model.summary()
inp = loaded_model.input
layer_summary = [layer.summary() for layer in loaded_model.layers]
I've also used the model.h5 file you uploaded.

Can I get metrics on a TensorFlow Lite model?

I have been working on a complicated Keras model with a custom metric, and I recently converted it to TensorFlow Lite. The models are not exactly the same and the outputs are different; however, it is difficult to evaluate this because the output is a tensor of size 128. Is there any way I can run my custom metric on the converted model? I have been using TF 1.14. Below is some relevant code.
# compile and train the model
model.save('model.h5')
# save the model in TFLite
converter = tf.lite.TFLiteConverter.from_keras_model_file('model.h5', custom_objects={'custom_metric': custom_metric})
tflite_model = converter.convert()
open('model.tflite', 'wb').write(tflite_model)
# run the model
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
input_dets = interpreter.get_input_details()
output_dets = interpreter.get_output_details()
input_shape = input_dets[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_dets[0]['index'], input_data)
interpreter.invoke()
The models are expected to differ because the converter applies graph transformations (such as fusing activations and folding batch norms), and the resulting graph targets inference-only scenarios.
To run your metrics: the interpreter provides an API to get the output value (as an array):
output = interpreter.tensor(interpreter.get_output_details()[0]["index"])
Then you apply your metric on the output.
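A minimal sketch of that last step, assuming your custom_metric accepts NumPy arrays and y_true is whatever reference output you evaluate against (both names are placeholders from your own code):
# After interpreter.invoke() from the code above:
output = interpreter.tensor(interpreter.get_output_details()[0]["index"])
output_array = output()  # interpreter.tensor() returns a callable; calling it yields a NumPy array

# Placeholder names: y_true and custom_metric come from your own training/evaluation code.
score = custom_metric(y_true, output_array)
print(score)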

Keras: Create MobileNet_V2 model "AttributeError"

I have successfully built several models based on MobileNet using Keras. I noticed that MobileNetV2 has been added in Keras 2.2.0, but I could not manage to make it work:
from keras.applications.mobilenet_v2 import mobilenet_v2
base_model = mobilenet_v2.MobileNetV2(weights='imagenet', include_top=False)
I get the following error: AttributeError: 'NoneType' object has no attribute 'image_data_format' on this line from mobilenet_v2.py: data_format=backend.image_data_format()
It seems to me that backend has a definition problem... I am using the TensorFlow backend; maybe it does not work with this one?
The problem comes from the import. The proper way to do this is the following:
from keras.applications import MobileNetV2
m = MobileNetV2(weights='imagenet', include_top=False)
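If you use the Keras bundled with TensorFlow rather than standalone Keras, the analogous import would be the following (a small sketch, assuming a TensorFlow version whose tf.keras ships MobileNetV2):
from tensorflow.keras.applications import MobileNetV2

# Same call as above, through tf.keras instead of standalone Keras.
m = MobileNetV2(weights='imagenet', include_top=False)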
