My project uses multiple Keras models. Each model can receive input with a batch size that varies from 1 to 24. I decided to optimize those models using TF-TRT.
I tried 2 conversion approaches:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
The first approach converts the model but does not pre-build TensorRT engines for it:
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model_path,
    conversion_params=conversion_params)
converter.convert()
converter.save(output_saved_model_dir=trt_fp32_model_path)
The second approach converts the model and builds TensorRT engines for all possible input shapes:
import numpy as np

def input_function():
    input_shapes = [(x, MODEL_INPUT_H, MODEL_INPUT_W, 3) for x in range(1, 25)]
    for shape in input_shapes:
        yield [np.random.normal(size=shape).astype(np.float32)]
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32,
    maximum_cached_engines=100
)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model_path,
    conversion_params=conversion_params)
converter.convert()
converter.build(input_fn=input_function)
converter.save(output_saved_model_dir=trt_fp32_model_path)
In the script that uses my models, I call them consecutively:
some loop:
    model1.predict(model1_input)
    model2.predict(model2_input)
    model3.predict(model3_input)
When the first conversion approach is used to optimize the models, I am able to load all models, but at runtime Tensorflow rebuilds TensorRT engines every time a model execution context changes. This causes a large performance overhead, which I was trying to overcome by caching TensorRT engines for those models (second conversion approach).
The problem is that when I am trying to load more than one TensorRT optimized model with pre-built engines, Tensorflow throws the following error:
2020-04-01 09:11:44.820866: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: Expect engine cache to be empty, but got 24 entries.
[[{{node StatefulPartitionedCall/InitializeTRTResource}}]]
Error - Expect engine cache to be empty, but got 24 entries.
[[{{node StatefulPartitionedCall/InitializeTRTResource}}]] [Op:__inference_restored_function_body_64832]
Function call stack:
restored_function_body
The same error occurs when only one engine is saved for each model.
I use the following code to load TensorRT optimized SavedModel:
saved_model_loaded = tf.saved_model.load(
    trt_fp32_model_path,
    tags=[tag_constants.SERVING]
)
graph_func = saved_model_loaded.signatures['serving_default']
I also tried to convert graph_func to frozen_func, but this didn't make any difference:
graph_func = saved_model_loaded.signatures[signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
frozen_func = convert_to_constants.convert_variables_to_constants_v2(
    graph_func)
I am using nvcr.io/nvidia/tensorflow:19.12-tf2-py3 docker container to optimize/run the models.
Is it possible at all to run multiple TensorRT-optimized models with pre-built engines simultaneously using Tensorflow? Or can this only be done using the TensorRT Inference Server?
In case it is a valid usage scenario, what am I missing in my workflow?
You can avoid the "Expect engine cache to be empty, but got X entries" error by performing the TensorRT optimization and building the engines for both models in the same Python file.
example:
if __name__ == "__main__":
    optimize_and_build_engines_1()
    optimize_and_build_engines_2()
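For illustration, a minimal sketch of such a combined script, reusing the second conversion approach from the question; the single parameterized helper, the model paths, and the input shape are placeholders, not the exact functions from the answer:

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def optimize_and_build_engines(saved_model_path, output_path, input_shape):
    # Convert one SavedModel and pre-build a TensorRT engine for the given input shape.
    def input_function():
        yield [np.random.normal(size=input_shape).astype(np.float32)]

    conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
        precision_mode=trt.TrtPrecisionMode.FP32,
        maximum_cached_engines=100)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=saved_model_path,
        conversion_params=conversion_params)
    converter.convert()
    converter.build(input_fn=input_function)
    converter.save(output_saved_model_dir=output_path)

if __name__ == "__main__":
    # Both conversions run in the same Python process.
    optimize_and_build_engines("model1_saved", "model1_trt_fp32", (1, 224, 224, 3))
    optimize_and_build_engines("model2_saved", "model2_trt_fp32", (1, 224, 224, 3))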
Related
I have deployed a model on real-time inference in a single gpu instance, it works fine.
Now I want to use multiple GPUs to decrease the inference time. What do I need to change in my inference.py to make it work?
Here is some of my code:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
logger.info("Loading first model...")
model = Model().to(DEVICE)
with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
model = model.eval()
logger.info("Loading second model...")
model_2 = Model_2()
model_2.to(DEVICE)
checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
model_2(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
model_2 = model_2()
logger.info('Done loading models')
return {'first_model': model, 'second_model': model_2}
def input_fn(request_body, request_content_type):
    assert request_content_type == 'application/json'
    url = json.loads(request_body)['url']
    save_name = json.loads(request_body)['save_name']
    logger.info(f'Image url: {url}')
    img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    w, h = img.size
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0).to(DEVICE)
    logger.info('Image ready to predict!')
    return {'tensor': input_batch, 'w': w, 'h': h, 'image': img, 'save_name': save_name}
def predict_fn(input_object, model):
    data = input_object['tensor']
    logger.info('Generating prediction based on the input image')
    model_1 = model['first_model']
    model_2 = model['second_model']
    d0, d1, d2, d3, d4, d5, d6 = model_1(data)
    torch.cuda.empty_cache()
    mask = torch.argmax(d0[0], axis=0).cpu().numpy()
    mask = np.where(mask == 2, 255, mask)
    mask = np.where(mask == 1, 128, mask)
    img = input_object['image']
    final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
    img = np.array(img)[:, :, ::-1]
    final_image = np.array(final_image)
    image_dict = to_dict(img, final_image)
    final_image = model_2_process(model_2, image_dict)
    torch.cuda.empty_cache()
    return {"final_ouput": final_image, 'image': input_object['image'], 'save_name': input_object['save_name']}
I was thinking that maybe torch multiprocessing could help; any tips?
The answer mentioning Torch DDP and DP is not exactly appropriate, since the value of those libraries is to conduct multi-GPU gradient descent (averaging the gradient inter-GPU in particular), which, as mentioned in point 1, does not happen at inference. Actually, a well-done, optimized inference ideally doesn't even use PyTorch or TensorFlow at all, but instead a prediction-only optimized runtime such as SageMaker Neo, ONNX Runtime or NVIDIA TensorRT, to reduce memory footprint and latency.
To infer a single model that fits on one GPU, multi-GPU instances are generally not advised: inference is a share-nothing task, so you can use N single-GPU instances and things are simpler and equally performant.
Inference on a multi-GPU host is useful in 2 cases: (1) if you do model-parallel inference (not your case) or (2) if your service's inference consists of a graph of models calling each other. In that case, the proximity of the various models called in the DAG can reduce latency. That seems to be your situation.
My recommendations are the following:
Try using NVIDIA Triton, which supports those DAG use cases well and is supported on SageMaker. https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/
If you want to do things custom, you could try assigning the 2 models to different CUDA device ids in PyTorch (see the sketch after this list). Because CUDA kernels are run asynchronously, this could be enough to get some parallelism and a bit of acceleration vs 1 GPU if your models can run in parallel.
I saw multiprocessing used once (with MXNet) to load-balance inference requests across GPUs (in this AWS blog post), but it was for share-nothing, map-style distribution of batches of inferences. In your case the models are connected to each other, so Triton is probably a better fit.
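A minimal sketch of the two-device idea from the second recommendation; the tiny modules below are placeholders standing in for the question's Model and Model_2:

import torch
from torch import nn

# Tiny placeholder modules standing in for the question's Model and Model_2.
class FirstStage(nn.Module):
    def forward(self, x):
        return x * 2

class SecondStage(nn.Module):
    def forward(self, x):
        return x + 1

model_1 = FirstStage().to("cuda:0").eval()
model_2 = SecondStage().to("cuda:1").eval()

def predict(batch):
    with torch.no_grad():
        out_1 = model_1(batch.to("cuda:0"))   # first stage runs on GPU 0
        out_2 = model_2(out_1.to("cuda:1"))   # move the intermediate result; second stage on GPU 1
    return out_2.cpu()

result = predict(torch.randn(1, 3, 224, 224))

Within a single request the two stages still run one after the other, but with each model on its own device the kernels of consecutive requests can overlap.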
Eventually, if your goal is to reduce latency, there are other ideas:
Fix any CPU bottleneck. Your code seems to have a lot of CPU work (pre-processing, numpy...). Are you sure the GPU is the bottleneck? If the CPU is at 80%+, try a large single-GPU G5 instance, such as g5.16xlarge. They are great for computer vision inference.
Use a better GPU. If you are using a P2, P3 or G4dn instance, try G5 instead.
Optimize the code. Two things to try, depending on the bottleneck:
If you do the inference in Torch, try to avoid doing algebra with Numpy, and do as much as possible with torch tensors on GPU.
If the GPU is the bottleneck, try to replace PyTorch with ONNX Runtime or NVIDIA TensorRT (a starting-point sketch follows this list).
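As a starting point for that last idea, a hedged sketch of exporting a PyTorch module to ONNX and running it with ONNX Runtime; the toy module and file name are placeholders:

import numpy as np
import torch
from torch import nn
import onnxruntime as ort

# Toy placeholder module; substitute your trained model.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 224, 224)

# One-time offline export to ONNX.
torch.onnx.export(model, dummy, "model.onnx", input_names=["input"], output_names=["output"])

# Inference with ONNX Runtime; use CUDAExecutionProvider if the GPU build is installed.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})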
You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").
You must call the function by passing at least these three parameters:
module (Module) – module to be parallelized (your model)
device_ids (list of python:int or torch.device) – CUDA devices. For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. For multi-device modules and CPU modules, device_ids must be None. When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)
output_device (int or torch.device) – Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)
for example:
from torch.nn.parallel import DistributedDataParallel
model = DistributedDataParallel(model, device_ids=[i], output_device=i)
I would like to create a simple loss for two sparse tensors in PyTorch.
def criterion_sparse(x, y):
    return torch.sparse.sum(torch.pow((x - y).coalesce(), 2))
It is giving me the error
NotImplementedError: Could not run 'aten::is_coalesced' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::is_coalesced' is only available for these backends: [SparseCPU, SparseCUDA, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
What can I do in this situation? It seems like a very simple problem.
I want to deploy a keras model with tensorflow-serving.
The model is converted from a keras .h5 model to a .pb file.
(the original model comes from https://github.com/shaoanlu/face_toolbox_keras)
When performing inference with keras on this model using only my CPU, all 12 cores are working and inference takes 0.7s on average.
When converting the model and using tensorflow serving, only one core is used and inference takes 2.7s on average.
I tried setting options like --tensorflow_session_parallelism, --tensorflow_intra_op_parallelism and --tensorflow_inter_op_parallelism to 12, but nothing changes, only one core working when looking at top from inside the tfserving container.
I also tried compiling tensorflow-serving for my machine's architecture, and I'm getting a slight improvement (2.7s to 2.5s), but I can't control the number of cores used per session.
I suppose it's nice that the other cores are available for concurrent requests, but I'd like to have more control.
The issue may be caused by the constant folding pass. Using tf.placeholder should resolve it.
if args.const_fold:
    A = tf.ones([size, size], name=("A%s" % i))
    B = tf.ones([size, size], name=("B%s" % i))
else:
    A_name = "A%s" % i
    B_name = "B%s" % i
    A = tf.placeholder(tf.float32, shape=[size, size], name=A_name)
    B = tf.placeholder(tf.float32, shape=[size, size], name=B_name)
    feed["%s:0" % A_name] = np.random.rand(size, size)
    feed["%s:0" % B_name] = np.random.rand(size, size)
As per my understanding, your code might look like the if block shown above. Changing it to the else block should resolve the issue.
For more information, please refer to this Stack Overflow link.
A couple of questions about this.
For occasions when I'd like to do something like the following in Tensorflow (assume I'm creating training examples by loading WAV files):
import tensorflow as tf
from tensorflow.python.ops import io_ops
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

def _some_audio_preprocessing_func(filename):
    # ... some logic here which mostly uses Tensorflow ops ...
    with tf.Session(graph=tf.Graph()) as sess:
        wav_filename_placeholder = tf.placeholder(tf.string, [])
        wav_loader = io_ops.read_file(wav_filename_placeholder)
        wav_decoder = contrib_audio.decode_wav(wav_loader, desired_channels=1)
        data = sess.run(
            [wav_decoder],
            feed_dict={wav_filename_placeholder: filename})
        return data

dataset = tf.data.Dataset.list_files('*.wav')
dataset = dataset.map(_some_audio_preprocessing_func)
If I have a parse_image() function that uses tensor ops - should this be part of the main Graph? Following the example set in Google's own audio TF tutorial, it looks like they create a separate graph! Doesn't this ruin the point of using Tensorflow to make things faster?
Do I use tf.py_func() any time any single line isn't from the tensorflow library? Again, I wonder what the performance implications are and when I should use this...
Thanks!
When you use Dataset.map(map_func), TensorFlow defines a subgraph for all the ops created in the function map_func, and arranges to execute it efficiently in the same session as the rest of your graph. There is almost never any need to create a tf.Graph or tf.Session inside map_func: if your parsing function is made up of TensorFlow ops, these ops can be embedded directly in the graph that defines the input pipeline.
The modified version of the code using tf.data would look like this:
import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio
def _some_audio_preprocessing_func(filename):
    wav_loader = tf.read_file(filename)
    return contrib_audio.decode_wav(wav_loader, desired_channels=1)

dataset = tf.data.Dataset.list_files('*.wav')
dataset = dataset.map(_some_audio_preprocessing_func)
If your map_func contains non-TensorFlow operations that you want to apply to each element, you should wrap them in a tf.py_func() (or Dataset.from_generator(), if the data generation process is defined in Python logic). The main performance implication is that any code running in a tf.py_func() is subject to the Global Interpreter Lock, so I would generally recommend trying to find a native TensorFlow implementation for anything that is performance critical.
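For illustration, a minimal sketch of the tf.py_func() pattern described above, with a hypothetical load_wav_with_librosa helper standing in for the non-TensorFlow parsing logic:

import numpy as np
import tensorflow as tf

def load_wav_with_librosa(filename):
    # Hypothetical pure-Python decoder; filename arrives as bytes inside tf.py_func.
    return np.zeros(16000, dtype=np.float32)

def _py_func_wrapper(filename):
    audio = tf.py_func(load_wav_with_librosa, [filename], tf.float32)
    audio.set_shape([None])  # tf.py_func drops static shape information
    return audio

dataset = tf.data.Dataset.list_files('*.wav')
dataset = dataset.map(_py_func_wrapper)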
I'm using Keras with tensorflow as backend.
I have one compiled/trained model.
My prediction loop is slow so I would like to find a way to parallelize the predict_proba calls to speed things up.
I would like to take a list of batches (of data) and then per available gpu, run model.predict_proba() over a subset of those batches.
Essentially:
data = [ batch_0, batch_1, ... , batch_N ]
on gpu_0 => return predict_proba(batch_0)
on gpu_1 => return predict_proba(batch_1)
...
on gpu_N => return predict_proba(batch_N)
I know that it's possible in pure Tensorflow to assign ops to a given gpu (https://www.tensorflow.org/tutorials/using_gpu). However, I don't know how this translates to my situation since I've built/compiled/trained my model using Keras' api.
I had thought that maybe I just needed to use python's multiprocessing module and start a process per gpu that would run predict_proba(batch_n). I know this is theoretically possible given another SO post of mine: Keras + Tensorflow and Multiprocessing in Python. However, this still leaves me with the dilemma of not knowing how to actually "choose" a gpu to operate the process on.
My question boils down to: how does one parallelize prediction for one model in Keras across multiple gpus when using Tensorflow as Keras' backend?
Additionally I am curious if similar parallelization for prediction is possible with only one gpu.
A high level description or code example would be greatly appreciated!
Thanks!
I created a simple example to show how to run a Keras model across multiple GPUs. Basically, multiple processes are created and each process owns a GPU. To specify the GPU id in each process, setting the environment variable CUDA_VISIBLE_DEVICES (os.environ["CUDA_VISIBLE_DEVICES"]) is a very straightforward way. Hope this git repo can help you.
https://github.com/yuanyuanli85/Keras-Multiple-Process-Prediction
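A minimal sketch of this per-process approach under stated assumptions: two GPUs, random placeholder batches, and a throwaway Keras model built inside each worker instead of your trained model:

import os
import numpy as np
from multiprocessing import Process, Queue

def worker(gpu_id, batches, out_queue):
    # Restrict this process to a single GPU; must be set before Keras/TensorFlow is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from keras.models import Sequential
    from keras.layers import Dense
    # Throwaway placeholder model; in practice each worker would load the trained model from disk.
    model = Sequential([Dense(2, activation="softmax", input_shape=(8,))])
    for idx, batch in batches:
        out_queue.put((idx, model.predict(batch)))

if __name__ == "__main__":
    data = [(i, np.random.rand(32, 8).astype("float32")) for i in range(4)]
    out_queue = Queue()
    # One process per GPU, each handling every other batch.
    procs = [Process(target=worker, args=(gpu, data[gpu::2], out_queue)) for gpu in (0, 1)]
    for p in procs:
        p.start()
    results = sorted(out_queue.get() for _ in data)
    for p in procs:
        p.join()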
You can use this function to parallelize a Keras model (credits to kuza55).
https://github.com/kuza55/keras-extras/blob/master/utils/multi_gpu.py
from keras.layers import merge
from keras.layers.core import Lambda
from keras.models import Model

import tensorflow as tf

def make_parallel(model, gpu_count):
    def get_slice(data, idx, parts):
        shape = tf.shape(data)
        size = tf.concat([ shape[:1] // parts, shape[1:] ], axis=0)
        stride = tf.concat([ shape[:1] // parts, shape[1:]*0 ], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = []
    for i in range(len(model.outputs)):
        outputs_all.append([])

    #Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:

                inputs = []
                #Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape, arguments={'idx': i, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)

                if not isinstance(outputs, list):
                    outputs = [outputs]

                #Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # merge outputs on CPU
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(merge(outputs, mode='concat', concat_axis=0))

        return Model(input=model.inputs, output=merged)
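For context, a hedged usage sketch of the function above; the toy single-input model is a placeholder, and the batch size passed to predict should be divisible by gpu_count so each slice is non-empty:

import numpy as np
from keras.layers import Input, Dense

# Assumes make_parallel (and the imports from the block above) are in scope.
inp = Input(shape=(8,))
out = Dense(2, activation='softmax')(inp)
single_gpu_model = Model(inp, out)  # placeholder model; substitute your compiled/trained model

parallel_model = make_parallel(single_gpu_model, gpu_count=2)
preds = parallel_model.predict(np.random.rand(64, 8))  # batch of 64 is split across the 2 GPUs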