I'm using a custom-trained YOLOv3 model with OpenCV 4.2.0 compiled with CUDA. When I test the code in Python, OpenCV runs on the GPU (GTX 1050 Ti), but detection on a single image (416px x 416px) takes 0.055 s (~20 FPS). My config file is set up for small-object detection because I need to detect ~10px x 10px objects in 2500px x 2000px images, so I split the original image into 30 smaller pieces. My goal is to reach 0.013 s (~80 FPS) per 416px x 416px image. Is that possible in Python with OpenCV? If not, what is the proper way to do it?
PS. Currently detection uses about 50% of the CPU, 5 GB of RAM and 6% of the GPU.
Some of the preferred ways to improve detection time with an already trained YOLOv3 model are:
Quantisation: Run inference with INT8 instead of FP32. You can use this repo for this purpose.
Use an inference accelerator such as TensorRT, since you're using an NVIDIA GPU. The tool includes a good number of inference-oriented optimisations, along with INT8 and FP16 quantisation, to reduce detection time. This thread talks about YOLOv3 inference with TensorRT5. Use this repo for YOLOv3 on TensorRT7.
Use an inference library such as tkDNN, a deep neural network library built with cuDNN and TensorRT primitives and designed specifically to work on NVIDIA Jetson boards.
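To make the quantisation idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantisation in plain Python. It only illustrates the principle and is not what TensorRT or the linked repo does internally; the function names and the example weights are made up for this sketch.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation: floats -> int8 codes plus one scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


weights = [0.50, -1.27, 0.03, 1.00]  # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding error is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half a step (`scale / 2`). Whether that error is acceptable for ~10px objects is something to check on a validation set.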
If you're open to retraining the model, there are a few more options besides the ones mentioned above:
You can train a tiny variant rather than the full YOLO model; of course, this comes at the cost of a drop in accuracy/mAP. You can train tiny-yolov4 (latest model) or tiny-yolov3.
Model Pruning - If you could rank the neurons in the network according to how much they contribute, you could then remove the low-ranking neurons from the network, resulting in a smaller and faster network. Pruned YOLOv3 research paper and its implementation. This is another pruned YOLOv3 implementation.
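The ranking idea can be sketched in a few lines. This assumes the L1 norm of each filter's weights as the importance score, which is one common choice and not necessarily the criterion used by the linked paper or repos; `prune_filters`, `keep_ratio` and the weights are hypothetical.

```python
def l1_norm(filter_weights):
    """Importance score for one filter: sum of absolute weights."""
    return sum(abs(w) for w in filter_weights)


def prune_filters(filters, keep_ratio):
    """Return the indices of the filters to keep, ranked by L1 norm."""
    n_keep = max(1, int(len(filters) * keep_ratio))  # always keep at least one
    ranked = sorted(range(len(filters)),
                    key=lambda i: l1_norm(filters[i]),
                    reverse=True)
    return sorted(ranked[:n_keep])


# Made-up weights for a layer with four filters.
filters = [
    [0.9, -0.8, 0.7],     # large weights, high score
    [0.01, 0.02, -0.01],  # nearly zero, low score
    [0.5, 0.4, -0.6],
    [0.0, 0.05, 0.0],
]
kept = prune_filters(filters, keep_ratio=0.5)
```

After pruning, the network normally has to be fine-tuned to recover the accuracy lost by removing filters.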
I work on medical imaging, so I need to train massive 3D CNNs that are difficult to fit into one GPU. I wonder if there is a way to split a massive Keras or TensorFlow graph among multiple GPUs such that each GPU only runs a small part of the graph during training and inference. Is this type of distributed training possible with either Keras or TensorFlow?
I have tried using with tf.device('/gpu:#') when building the graph, but I am experiencing memory overflow. The logs seem to indicate the entire graph is still being run on gpu:0.
My YOLO model works fine for detecting objects such as a bottle, person, cellphone, backpack, et cetera. But I want to make my model detect a ring, a bracelet or a helmet (objects which are not present in the current YOLO model). Can I make a custom object detection YOLO model without a GPU? What are the risks involved (if any)?
My System is Windows 10 Home single language with 8GB RAM.
Running a darknet.exe re-compiled for the CPU is terribly slow. I've tried it before. It's totally impractical.
I recommend you look into the Intel OpenVINO toolkit.
https://software.intel.com/en-us/openvino-toolkit
and
https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_convert_model_tf_specific_Convert_YOLO_From_Tensorflow.html
The OpenVINO toolkit can load and run models from several frameworks on Intel CPUs and integrated GPUs.
You can still use regular NVIDIA cards to train your custom objects with Darknet YOLO.
Then use third-party converter tools (which can easily be found on GitHub) to convert the YOLO weight files you trained to a TensorFlow PB file.
Then use Intel's Model Optimizer to transform the PB file and label file into their so-called "Intermediate Representation" files (*.bin, *.xml, *.labels, and *.mapping files), which can later be loaded and run on an Intel CPU or integrated GPU.
The Model Optimizer will automatically optimize the YOLO convolutional network and remove some unused nodes, which improves overall inference speed and is much faster than simply running the YOLO weights on the CPU with a re-compiled CPU version of darknet.exe.
Yes, you can do that.
Just change the following lines in the Makefile in the darknet folder:
GPU=1
CUDNN=1 (for GPU)
Change them to:
GPU=0
CUDNN=0 (for CPU)
Yes, you can train your YOLO model to detect custom objects too. Just follow this blog - Link
First of all: this question is connected to neural network inference and not training.
I have discovered that when running inference with a trained neural network on only one image over and over on a GPU (e.g. a P100), the utilization of the computing power with TensorFlow does not reach 100%, but stays at around 70%. This is also the case if the image does not have to be transferred to the GPU. Therefore, the issue has to be connected to constraints in the parallelization of the calculations. My best guesses for the reasons are:
TensorFlow can only utilize the parallelization capabilities of a GPU up to a certain level. (The higher utilization of the same model as a TensorRT model also suggests that.) In this case, the question is: What is the reason for that?
The inherent neural network structure, with several subsequent layers, prevents higher utilization. Therefore the problem is not the overhead of a framework but lies in the general design of neural networks. In this case, the question is: What are the restrictions?
Both of the above combined.
Thanks for your ideas on the issue!
Why do you expect the GPU utilization to go to 100% when you run the neural network prediction for one image?
GPU utilization is measured per time unit (e.g. 1 second). This means that when the neural network finishes before this time unit has elapsed (e.g. within 0.5 s), the GPU may be used by other programs for the rest of the time, or not be used at all. If no other program uses the GPU either, you will not reach 100%.
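As a rough illustration of that arithmetic (the helper names and the 7 ms / 3 ms figures are made up for the example, not measurements):

```python
def utilization_percent(busy_seconds, window_seconds):
    """Share of a sampling window during which the GPU was executing kernels."""
    busy = min(busy_seconds, window_seconds)
    return 100.0 * busy / window_seconds


def loop_utilization_percent(gpu_ms_per_image, idle_ms_per_image):
    """Utilization of a back-to-back inference loop with idle gaps between runs."""
    period = gpu_ms_per_image + idle_ms_per_image
    return utilization_percent(gpu_ms_per_image, period)


# The case from the answer: busy for 0.5 s of a 1 s sampling window.
half = utilization_percent(0.5, 1.0)

# Hypothetical loop: 7 ms of kernel time, then 3 ms where the GPU waits
# (kernel launches, host-side work) before the next image starts.
looped = loop_utilization_percent(7.0, 3.0)
```

Under these assumed numbers the loop reports 70%, so a reading of around 70% would be consistent with a fixed per-image gap during which the GPU has nothing to execute.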
We've trained a tf-seq2seq model for question answering. The main framework is from google/seq2seq. We use a bidirectional RNN (GRU encoders/decoders, 128 units) with a soft attention mechanism.
We limit the maximum length to 100 words. It mostly generates just 10~20 words.
For model inference, we try two cases:
Normal (greedy) decoding. Its inference time is about 40ms~100ms.
Beam search. We tried beam width 5, and its inference time is about 400ms~1000ms.
So we want to try beam width 3; the time may decrease, but it may also affect the final results.
Are there any suggestions for decreasing inference time in our case? Thanks.
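To see why beam width drives decoding cost, here is a minimal beam-search sketch over a toy next-token table. Everything in it (`TOY_MODEL`, `decoder_step`, the probabilities, the stopping rule) is hypothetical and stands in for the real seq2seq decoder. It counts decoder calls instead of measuring time, and a real implementation would batch the beams on the GPU, so wall-clock time will not scale exactly like the call count.

```python
import math

VOCAB = ["a", "b", "<eos>"]
# Hypothetical next-token probabilities, keyed by the prefix decoded so far.
TOY_MODEL = {
    (): [0.5, 0.4, 0.1],
    ("a",): [0.3, 0.3, 0.4],
    ("b",): [0.05, 0.05, 0.9],
}
DEFAULT_PROBS = [0.05, 0.05, 0.9]  # longer prefixes mostly end the sequence


def decoder_step(prefix):
    """Stand-in for one decoder call: log-probabilities over VOCAB."""
    probs = TOY_MODEL.get(prefix, DEFAULT_PROBS)
    return [math.log(p) for p in probs]


def beam_search(beam_width, max_len=10):
    beams = [((), 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    decoder_calls = 0
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            decoder_calls += 1  # one decoder call per live hypothesis
            for tok, lp in zip(VOCAB, decoder_step(tokens)):
                candidates.append((tokens + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        # Simplified stopping rule: enough finished hypotheses, or nothing left.
        if not beams or len(finished) >= beam_width:
            break
    best_tokens, best_score = max(finished + beams, key=lambda c: c[1])
    return best_tokens, best_score, decoder_calls


for width in (1, 3, 5):  # width 1 is greedy decoding
    tokens, score, calls = beam_search(width)
    print(width, tokens, round(math.exp(score), 3), calls)
```

In this toy setup greedy decoding (width 1) commits to the locally best first token and ends with a less probable sequence than width 3 finds, while each wider beam needs more decoder calls. That is the trade-off to measure on real data when going from width 5 to width 3.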
you can do network compression.
cut the sentence into pieces with byte-pair encoding, a unigram language model, etc., and then try TreeLSTM.
you can try a faster softmax like adaptive softmax (https://arxiv.org/pdf/1609.04309.pdf)
try cudnnLSTM
try dilated RNN
switch to CNN like dilated CNN, or BERT for parallelization and more efficient GPU support
If you require improved performance, I'd propose that you use OpenVINO. It reduces inference time by graph pruning and fusing some operations. Although OpenVINO is optimized for Intel hardware, it should work with any CPU.
Here are some performance benchmarks for NLP model (BERT) and various CPUs.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using pip. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
Is it possible to have bounding boxes prediction using TensorFlow?
I found TensorBox on github but I'm looking for a better supported or maybe official way to address this problem.
I need to retrain the model for my own classes.
It is unclear what exactly you mean. Do you need object detection? I assume so from the 'bounding boxes'. If so, Inception networks are not directly applicable to your task; they are classification networks.
You should look for object detection models, like Single Shot Detector (SSD) or You Only Look Once (YOLO). They often use pre-trained convolutional layers from classification networks, but have additional layers on top of them. If you want Inception (aka GoogLeNet), YOLO is based on that. Take a look at this implementation: https://github.com/thtrieu/darkflow or any other you can find on Google.
The COCO2016 winner for object detection was implemented in tensorflow. Some state of the art techniques are Faster R-CNN, R-FCN and SSD. Check the slides from http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf (Slide 14 has key tensorflow ops for you to recreate this pipeline).
Edit 6/19/2017:
Tensorflow released some techniques to predict bboxes:
https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html