I was trying to compile a .py file into a binary that simply reads a model from a JSON file and predicts the output with the imported model.
Now the issue is that when I compile the program with PyInstaller, the resulting file is around 290 MB because it bundles the whole TensorFlow package and its unwanted dependencies. Not to mention that it is very slow to start up as it extracts all those files.
As you can see, it is just a simple script that runs through a folder of images and classifies each one as either meme or non-meme content, to clean up my WhatsApp folder.
import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import model_from_json
import shutil
BATCH_SIZE = 8
SHAPE = (150,150,3)
p = input("Enter path of folder to extract memes from")
p = p.replace("'","")
p = p.replace('"','')
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
model.load_weights('weights.best.hdf5')
test_datagen = ImageDataGenerator(rescale=1./255)
generator = test_datagen.flow_from_directory(
    p,
    target_size=SHAPE[:-1],
    batch_size=BATCH_SIZE,
    class_mode=None,
    shuffle=False)
pred = model.predict_generator(generator)
# print(pred)
# print(model.summary())
pred[pred>0.5] = 1
pred[pred<=0.5] = 0
d = ["garbage","good"]
for i in range(len(pred)):
    filename = generator.filenames[i].split('\\')[1]
    if(pred[i][0] == 0):
        shutil.move(generator.filepaths[i], os.path.join(os.path.join('test', str(int(pred[i][0]))), filename))
So my question is: is there an alternative to model.predict that is much lighter than the one TensorFlow provides, since I do not want to ship the whole 600 MB TensorFlow package with my distribution?
Why don't you use quantization on your TensorFlow model? Model size can be reduced by up to 75%. This is the processing that enables TensorFlow Lite to make predictions on pictures in real time, on nothing more than a mobile-phone CPU.
Essentially, the weights can be converted to types with reduced precision, such as 16-bit floats or 8-bit integers. TensorFlow generally recommends 16-bit floats for GPU acceleration and 8-bit integers for CPU execution. Read the guide here.
Quantization brings improvements via model compression and latency reduction. With the API defaults, the model size shrinks by 4x, and we typically see between 1.5 - 4x improvements in CPU latency in the tested backends. Eventually, latency improvements can be seen on compatible machine learning accelerators, such as the EdgeTPU and NNAPI.
Post-training quantization includes general techniques to reduce CPU and hardware accelerator latency, processing, power, and model size with little degradation in model accuracy. These techniques can be performed on an already-trained float TensorFlow model and applied during TensorFlow Lite conversion.
Read more about post-training model quantization here.
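For reference, a minimal sketch of post-training (dynamic-range) quantization with the TensorFlow Lite converter could look like this, assuming the Keras model from the question is already loaded as model (file names are just placeholders):
import tensorflow as tf

# Convert the loaded Keras model; Optimize.DEFAULT enables post-training quantization,
# which stores the weights as 8-bit integers (the source of the ~4x size reduction).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_model)

# For deployment, the standalone tflite_runtime package offers the same Interpreter API
# without pulling in the full TensorFlow dependency.
interpreter = tf.lite.Interpreter(model_path='model_quant.tflite')
interpreter.allocate_tensors()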
Related
I've been training an efficientnetV2 network using this repository.
The training process goes well and I reach around 93-95% validation accuracy. After that I run inference over a test set containing new images, with acceptable accuracy, around 88% (for example).
After checking that the model works fine in PyTorch, I need to convert it to ONNX and then to a TensorRT engine. I have a script that runs inference with the ONNX model to check whether the conversion process introduces any problems.
I'm using this code to convert the model:
import torch
from timm.models import create_model
import os
# create model
base_model = create_model(
    model_arch,
    num_classes=num_classes,
    in_chans=3,
    checkpoint_path=model_path)

model = torch.nn.Sequential(
    base_model,
    torch.nn.Softmax(dim=1)
)
model.cpu()
model.eval()
dummy_input = torch.randn(1, 3, 224, 224, requires_grad=True)
torch.onnx.export(model,
                  dummy_input,
                  model_export,
                  verbose=False,
                  export_params=True,
                  do_constant_folding=True
                  )
I've tried several tutorials like this one but unfortunately I'm getting the same result.
I've tried different option combinations, with and without do_constant_folding; I've even trained another model with a parameter called 'exportable', a bool that tells the training script whether the model is exportable (an experimental feature, according to the repository's documentation).
Do you have any idea about this issue?
Thanks in advance.
Hard to guess which bug you are hitting; here are a few typical ones:
some layers are not converted properly, even after eval()
you may need to write timm.create_model(...scriptable=True, exportable=True)
different preprocessing, e.g. the timm model expects its input normalized to specific values, while the converted model does not apply that normalization
Do the two models output nearly the same values on model(dummy_input)?
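A quick way to answer that last question is to feed the same dummy input through both the PyTorch model and the exported ONNX file (via onnxruntime) and compare the outputs; model and dummy_input below are the ones from your export script, and "model.onnx" stands for whatever path you passed as model_export:
import numpy as np
import onnxruntime as ort
import torch

model.eval()
with torch.no_grad():
    torch_out = model(dummy_input).detach().numpy()

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
onnx_out = session.run(None, {input_name: dummy_input.detach().numpy()})[0]

# If these differ significantly, the problem is in the export (or preprocessing), not in TensorRT.
print("max abs diff:", np.abs(torch_out - onnx_out).max())
print("close:", np.allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5))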
I am training a model using transfer learning based on ResNet-152. Following the PyTorch tutorial, I have no problem saving a trained model and loading it for inference. However, loading the model is slow. I don't know if I did it correctly; here is my code:
To save the trained model as state dict:
torch.save(model.state_dict(), 'model.pkl')
To load it for inference:
model = models.resnet152()
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, len(classes))
st = torch.load('model.pkl', map_location='cuda:0' if torch.cuda.is_available() else 'cpu')
model.load_state_dict(st)
model.eval()
I timed the code and found that the first line, model = models.resnet152(), takes the longest. On CPU, it takes 10 seconds to test one image. So my thinking is that this might not be the proper way to load it?
If I save the entire model instead of the state_dict, like this:
torch.save(model, 'model_entire.pkl')
and test it like this:
model = torch.load('model_entire.pkl')
model.eval()
on the same machine it takes only 5 seconds to test one image.
So my question is: is it the proper way to load the state_dict for inference?
In the first code snippet, you are building the ResNet-152 architecture from TorchVision (initialized with random weights) and then loading your locally stored weights into it.
In the second example you are loading a locally stored model (architecture and weights together).
The former is slower because the architecture has to be constructed and initialized before your weights are loaded, but it is more reproducible since it does not depend on a pickled local model file. Also, the time difference should be a one-off initialisation cost; once the model is loaded, both approaches are equivalent and inference itself should take the same time.
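To see where the time actually goes, a small timing sketch like the one below (reusing the names from the question, e.g. classes and model.pkl) separates the one-off loading cost from the per-image inference cost:
import time
import torch
import torch.nn as nn
from torchvision import models

t0 = time.perf_counter()
model = models.resnet152()                    # building the architecture: one-off cost
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, len(classes))  # `classes` as defined in the question
model.load_state_dict(torch.load('model.pkl', map_location='cpu'))
model.eval()
print(f"load time: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))    # per-image inference cost
print(f"inference time: {time.perf_counter() - t0:.2f}s")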
Hi, I have a problem with Keras on Python 3.6.
My environment is Keras with Python, CPU only.
The problem is that when I repeatedly call predict on the same Keras model, even with different inputs, it gets slower and slower.
My code is as simple as this:
for i in range(100):
    model.predict(x)
The first run is fast, maybe 2 seconds, but the second run takes 3 seconds and the third takes 5 seconds... it keeps getting slower even if I use the same input.
How can I keep repeated predictions with the same Keras model fast? I don't want any slowdown; it is critical for my use case.
How can I fix it?
Try using the __call__ method directly. The documentation of the predict method states the following:
For small numbers of inputs that fit in one batch, directly use __call__() for faster execution, e.g., model(x).
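Applied to the loop from the question, a minimal sketch (assuming model and x are the ones you already have, with x as a single ready-made batch) would look like this:
import tensorflow as tf

x = tf.convert_to_tensor(x)          # avoid re-converting the input on every iteration

for i in range(100):
    y = model(x, training=False)     # __call__ skips predict()'s per-call setup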
I see that performance is critical in this case. So, if that doesn't help, you could use OpenVINO, which is optimized for Intel hardware but should work with any CPU. Performance should be much better than using Keras directly.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert an HDF5 model, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
If your model processes predictions batch by batch, different samples in the same batch can take slightly different times over the course of an iteration; as you run the loop again and again and accumulate more and more batches of predictions, the measured prediction time will get longer and longer.
We've trained a tf-seq2seq model for question answering. The main framework is from google/seq2seq. We use a bidirectional RNN (GRU encoders/decoders, 128 units) with a soft attention mechanism.
We limit the maximum length to 100 words; it mostly generates only 10-20 words.
For model inference, we try two cases:
normal (greedy) decoding: its inference time is about 40 ms to 100 ms
beam search: with beam width 5, its inference time is about 400 ms to 1000 ms
We could try beam width 3, which may decrease the time, but it may also hurt the final quality.
So, are there any suggestions to decrease inference time for our case? Thanks.
you can do network compression.
cut sentences into pieces with byte-pair encoding or a unigram language model, etc., and then try a TreeLSTM.
you can try a faster softmax such as adaptive softmax (https://arxiv.org/pdf/1609.04309.pdf); a minimal sketch follows this list.
try cudnnLSTM
try a dilated RNN
switch to a CNN (e.g. a dilated CNN) or BERT for parallelization and more efficient GPU support
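As an illustration of the adaptive-softmax point, here is a minimal PyTorch sketch (your model is in TensorFlow, so treat this only as a demonstration of the idea; the vocabulary size and cutoffs below are made-up example values):
import torch
import torch.nn as nn

hidden_size = 128          # decoder hidden size (matches the 128-unit GRU in the question)
vocab_size = 50000         # hypothetical vocabulary size

# Frequent words stay in the "head"; rare words are pushed into smaller tail clusters,
# so most decoding steps only evaluate a small softmax.
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size,
    n_classes=vocab_size,
    cutoffs=[2000, 10000],  # cluster boundaries, tuned to the word-frequency distribution
)

decoder_hidden = torch.randn(32, hidden_size)              # a batch of decoder states
predicted_ids = adaptive_softmax.predict(decoder_hidden)   # argmax token per example
log_probs = adaptive_softmax.log_prob(decoder_hidden)      # full log-probabilities if needed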
If you need better performance, I'd propose that you use OpenVINO. It reduces inference time by pruning the graph and fusing some operations. Although OpenVINO is optimized for Intel hardware, it should work with any CPU.
Here are some performance benchmarks for an NLP model (BERT) on various CPUs.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
I tried the seq2seq PyTorch implementation available here: seq2seq. After profiling the evaluation code (evaluate.py), the piece of code taking the longest time was the decode_minibatch method:
def decode_minibatch(
    config,
    model,
    input_lines_src,
    input_lines_trg,
    output_lines_trg_gold
):
    """Decode a minibatch."""
    for i in xrange(config['data']['max_trg_length']):
        decoder_logit = model(input_lines_src, input_lines_trg)
        word_probs = model.decode(decoder_logit)
        decoder_argmax = word_probs.data.cpu().numpy().argmax(axis=-1)
        next_preds = Variable(
            torch.from_numpy(decoder_argmax[:, -1])
        ).cuda()
        input_lines_trg = torch.cat(
            (input_lines_trg, next_preds.unsqueeze(1)),
            1
        )
    return input_lines_trg
I trained the model on GPU and loaded it in CPU mode for inference. Unfortunately, every sentence seems to take ~10 seconds. Is slow prediction expected with PyTorch?
Any fixes, suggestions to speed up would be much appreciated. Thanks.
One solution for slow performance may be to use a toolkit optimized for inference, such as OpenVINO. OpenVINO is optimized for Intel hardware but should work with any CPU. It optimizes inference performance by, e.g., pruning the graph or fusing some operations together.
You can find a full tutorial on how to convert the PyTorch model here (FastSeg) and here (BERT). Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can convert an ONNX model. This sample code assumes the model is for computer vision.
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command line tool which comes from OpenVINO Development Package so be sure you have installed it. It converts the ONNX model to OV format (aka IR), which is a default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). Run in command line:
mo --input_model "model.onnx" --input_shape "[1, 3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
Following up on dragon7's answer, I'd recommend using the ONNX export from Optimum, which handles the export of encoder/decoder models out of the box and can make use of past key values in the decoder:
optimum-cli export onnx --model gpt2 --task causal-lm-with-past --for-ort gpt2_onnx/
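As a rough sketch of what loading the exported model could look like with Optimum's ONNX Runtime classes (gpt2_onnx/ is the output directory from the command above; exact class and argument names may differ between Optimum versions, so take this as an illustration, not a reference):
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = ORTModelForCausalLM.from_pretrained("gpt2_onnx/")   # reuses the exported past key values

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)    # standard generate() API
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))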
If you want to use OpenVINO, a good option is the OVModel classes, which handle inference with OpenVINO (including for seq2seq models) out of the box!
Disclaimer: I am a contributor to Optimum library