Can we Train YOLOv3 on custom objects with no GPU? - python

My YOLO model works fine for detecting objects such as bottle, person, cellphone, backpack, et cetera. But I want the model to detect a ring, a bracelet, or a helmet (objects which are not present in the current YOLO model). Can I build a custom object detection YOLO model without a GPU? What are the risks involved (if any)?
My system is Windows 10 Home Single Language with 8 GB RAM.

Re-compiling darknet.exe to run on the CPU is terribly slow. I've tried it before; it's totally impractical.
I recommend you look into the Intel OpenVINO toolkit:
https://software.intel.com/en-us/openvino-toolkit
and
https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_convert_model_tf_specific_Convert_YOLO_From_Tensorflow.html
The OpenVINO toolkit can load and run models from most major frameworks on Intel CPUs and integrated GPUs.
You can still train your custom objects with Darknet YOLO on a regular NVIDIA card.
Then use third-party converter tools (which can easily be found on GitHub) to convert the YOLO weight files you trained into a TensorFlow PB file.
Then use Intel's Model Optimizer to transform the PB file and label file into their "Intermediate Representation" files (*.bin, *.xml, *.labels, and *.mapping), which can later be loaded and run on an Intel CPU or integrated GPU.
The Model Optimizer automatically optimizes the YOLO convolutional network and removes unused nodes, and the overall inference speed is much faster than simply running the YOLO weights with a CPU-only re-compiled darknet.exe.
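To give an idea of the last step, here is a minimal sketch of loading such an IR on the CPU with the current openvino.runtime Python API; the IR file name, input size, and normalisation are assumptions for a standard YOLOv3 conversion, and decoding the raw region outputs into boxes is not shown.

# Run a converted YOLO IR on an Intel CPU with the OpenVINO Python API (placeholder file names).
import cv2
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("frozen_darknet_yolov3_model.xml")
compiled = core.compile_model(model, device_name="CPU")  # or "GPU" for the integrated GPU

# YOLOv3 IRs typically expect a 416x416 RGB image in NCHW layout, scaled to [0, 1]
image = cv2.imread("test.jpg")
blob = cv2.resize(image, (416, 416))[:, :, ::-1].transpose(2, 0, 1)
blob = np.expand_dims(blob, 0).astype(np.float32) / 255.0

results = compiled([blob])  # raw YOLO region outputs; box decoding and NMS not shown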

Yes, you can do that.
Just change the following lines in the Makefile of the darknet folder:
GPU=1
CUDNN=1 (for GPU)
change it to -
GPU=0
CUDNN=0 (for CPU)
Yes, you can train your YOLO model to detect custom objects too. Just follow this blog - Link
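Training on a CPU-only build will be very slow for a custom dataset, so a common compromise is to train where a GPU is available (e.g. a free cloud GPU) and only run detection on your CPU machine. As an illustration, here is a minimal sketch of CPU-only inference with custom-trained YOLOv3 weights using OpenCV's DNN module; the cfg, weights, names, and image file names are placeholders.

# CPU-only inference with a custom-trained YOLOv3 model via OpenCV DNN (placeholder files).
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3-custom.cfg", "yolov3-custom_final.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)  # plain CPU backend
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

classes = open("obj.names").read().splitlines()
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

image = cv2.imread("ring.jpg")
class_ids, confidences, boxes = model.detect(image, confThreshold=0.5, nmsThreshold=0.4)
for cid, conf, box in zip(np.ravel(class_ids), np.ravel(confidences), boxes):
    print(classes[int(cid)], float(conf), box)  # box = (x, y, w, h)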

Related

How to use converted YOLOv5 models

YOLOv5 is an object detection model that can be exported to several different frameworks, including TensorFlow and Core ML.
https://github.com/ultralytics/yolov5
I have been able to train a YOLOv5 model and export it to TensorFlow (a TF1 GraphDef or a TF2 SavedModel), and I have also tried Apple Core ML.
I have not been able to find any examples for YOLOv5 showing how to use these models once exported,
i.e. how to take an image file and get the detected objects/labels/coordinates.
I tried Python code similar to the TF1 object detection tutorial, but the exported model does not seem compatible with it:
https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/tensorflow-1.14/
nor with TF2:
https://www.tensorflow.org/hub/tutorials/object_detection
nor with TFLite:
https://www.tensorflow.org/lite/examples/object_detection/overview
Use the weights that YOLOv5 produced after training. They are usually at this path:
yolov5/runs/train/your_yolo_model/weights/best.pt
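If you do not strictly need the TensorFlow or Core ML exports, the simplest way to consume best.pt is YOLOv5's own PyTorch route via torch.hub, which gives you labels, confidences, and box coordinates directly. A minimal sketch; the image path is a placeholder.

# Load the trained best.pt via torch.hub and read out the detections.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom",
                       path="yolov5/runs/train/your_yolo_model/weights/best.pt")

results = model("test.jpg")  # accepts a file path, URL, PIL image or numpy array
results.print()              # summary of detections

# One row per detected object: [x1, y1, x2, y2, confidence, class]
for *xyxy, conf, cls in results.xyxy[0].tolist():
    print(model.names[int(cls)], conf, xyxy)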

How to improve YOLOv3 detection time? (OpenCV + Python)

I'm using a custom-trained YOLOv3 model with OpenCV 4.2.0 compiled with CUDA. When I test the code in Python I run OpenCV on the GPU (GTX 1050 Ti), but detection on a single image (416 px x 416 px) takes 0.055 s (~20 FPS). My config file is set up for small-object detection, because I need to detect ~10 px x 10 px objects on 2500 px x 2000 px images, so I split the original image into 30 smaller pieces. My goal is to reach 0.013 s (~80 FPS) per 416 px x 416 px image. Is that possible in Python with OpenCV? If not, what is the proper way to do it?
P.S. Currently detection uses about 50% CPU, 5 GB RAM and 6% GPU.
Some of the preferred ways to improve detection time with an already trained YOLOv3 model are:
Quantisation: run inference with INT8 instead of FP32. You can use this repo for that purpose. (A related reduced-precision sketch for OpenCV's CUDA backend follows this list.)
Use an inference accelerator such as TensorRT, since you're using an NVIDIA GPU. The tool includes a good amount of inference-oriented optimisations along with INT8 and FP16 quantisation to reduce detection time. This thread discusses YOLOv3 inference with TensorRT 5; use this repo for YOLOv3 on TensorRT 7.
Use an inference library such as tkDNN, a deep neural network library built with cuDNN and TensorRT primitives, designed specifically for NVIDIA Jetson boards.
If you're open to retraining the model, there are a few more options besides the ones mentioned above:
You can train tiny versions rather than the full YOLO models; of course this comes at the cost of a drop in accuracy/mAP. You can train tiny-yolov4 (the latest model) or train tiny-yolov3.
Model pruning: if you can rank the neurons in the network by how much they contribute, you can remove the low-ranking neurons, resulting in a smaller and faster network. See the pruned YOLOv3 research paper and its implementation; this is another pruned YOLOv3 implementation.
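Related to the reduced-precision idea in the first point: since you already have OpenCV 4.2 built with CUDA, you can switch the DNN target to FP16 and batch the 30 tiles into a single forward pass. A minimal sketch with placeholder file names; measure the FP16 change, because Turing and newer GPUs benefit most and a GTX 1050 Ti may see little gain.

# Half-precision CUDA target plus batched tiles with OpenCV DNN (placeholder files).
import cv2

net = cv2.dnn.readNetFromDarknet("yolov3-custom.cfg", "yolov3-custom.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)  # FP32 target: cv2.dnn.DNN_TARGET_CUDA

# Batch all 30 tiles into one blob to amortise per-call overhead
tiles = [cv2.imread(f"tile_{i}.jpg") for i in range(30)]
blob = cv2.dnn.blobFromImages(tiles, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())  # raw YOLO outputs for all tiles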

Improve accuracy with Tensorflow Object detection pretrained model

I am working on building an object detection model which I would like to create with 22 new classes (most of them are not in COCO or PETS datasets)
What I've already done is:
Prepared images with multiple labels using LabelImg.
Halved the size of images larger than 500 KB.
Converted the XML annotations to a CSV file.
Converted the CSV and the images to TFRecord.
Trained from several pretrained checkpoints using the TensorFlow sample config files.
Results: SSD Mobilenet and SSD Inception found no classes (loss ~10.0), while Faster R-CNN Inception did succeed in detecting some of the objects (loss ~0.7).
My questions are:
What is the difference between train.py from Object Detection (which I used above), retrain.py from image_retraining, and train_image_classifier.py from Slim?
Which is better for my task? Or should I do it in a different way?
While running train.py with the Faster R-CNN Inception model I found the loss stayed around 0.7 and did not go lower even after 100k steps. Is there a target loss to aim for?
How do you suggest changing the config file to improve this?
I found other models, for instance Inception V4, which don't have sample config files (TF-Slim). Should I try them, and if so, how can I use them?
I am pretty new in this field and I need some support in understanding the terms and actions.
BTW: I am using a GTX 1060 (GPU) for training, but eval doesn't run in parallel so I can't get the mAP for validation. I tried to force eval onto the CPU but with no success.
Thanks.
1) What is the difference between train.py from Object Detection (which I used above), retrain.py from image_retraining, and train_image_classifier.py from Slim?
Ans: As far as I know, none, because train.py imports trainer.py, which uses slim.learning.train (the same function used in train_image_classifier.py) for training.
2) Which is better for my task? Or should I do it in a different way?
Ans: The answer above covers this question too.
3) While running train.py on Faster R-CNN Inception I found that the loss was around 0.7 and not going lower even after 100k steps. Is there any goal in terms of loss to achieve?
Ans: If you use TensorBoard to visualize your results, you will find that when your classification loss graph is no longer changing much (i.e. has converged), your model is trained. A loss of 0.7 is high after training for so many steps; check your pipeline config file parameters.
4) How do you suggest changing the config file to improve this?
Ans: The learning rate value is a good place to start.
5) I found other models, for instance Inception V4, which don't have sample config files (TF-Slim). Should I try them and if so how can I use them?
Ans: Currently I don't have an answer for this, but I will get back to you.
(Not a complete answer, but I hope it helps in some way!)
Are your annotated objects small relative to the image size?
I had the same problem of few or no detections with SSD, and found that the model is very sensitive to the settings that determine the size of the box proposals (the anchor generator). Here is a link with some details.
Further, having an active eval job running is very important when debugging and tuning a model. TotalLoss, or any of the other values returned by the train job, does not tell you about the performance of the actual model, only whether it is converging. The eval job gives you e.g. mAP, which is a real measure of performance.
A simple way to force an eval job on cpu is by doing the following:
a) install a virtual environment dedicated for the eval job, instructions here
b) activate the virtual environment and install tensorflow cpu in the virtual environment (yes, you should install tensorflow again, and without gpu support)
c) start the train job as usual on your tensorflow-gpu (in whatever way you have installed it)
d) run the eval job in the virtual environment (this will force it to run on the cpu and works great! I also run tensorboard from this installation to minimise risk of interference with the train job)
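As a lighter-weight alternative (not part of the steps above), you can also hide the GPU from the eval process so TensorFlow falls back to the CPU: put this at the top of the eval script before TensorFlow is imported, or set the variable in the shell before launching the eval job. A minimal sketch:

# Hide the GPU from this process so the train job keeps the GPU to itself.
# Must run before TensorFlow is imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
print(tf.test.is_gpu_available())  # should print False in this process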
Retraining is used to add a layer on top of a pretrained model; you can save time this way. It is useful for thousands of labelled pictures but pointless for millions, and it is less effective than training from scratch. There are template config files; if there is no config file for your model, create your own. Look at the explanations on the TensorFlow GitHub.

How to speed up TensorFlow RNN inference time

We've trained a tf-seq2seq model for question answering. The main framework is from google/seq2seq. We use a bidirectional RNN (GRU encoders/decoders, 128 units) with a soft attention mechanism.
We limit the maximum length to 100 words; it mostly generates only 10-20 words.
For model inference, we try two cases:
normal (greedy) decoding: inference time is about 40 ms-100 ms
beam search: with beam width 5, inference time is about 400 ms-1000 ms
So we want to try beam width 3; the time may decrease, but it may also affect the final quality.
So, are there any suggestions to decrease inference time for our case? Thanks.
You can do network compression.
Cut the sentence into pieces with byte-pair encoding or a unigram language model, etc., and then try a TreeLSTM.
You can try a faster softmax such as adaptive softmax (https://arxiv.org/pdf/1609.04309.pdf).
Try cuDNN-fused RNN kernels (cudnnLSTM/cudnnGRU); see the sketch after this list.
Try dilated RNNs.
Switch to a CNN, e.g. a dilated CNN, or to BERT, for better parallelization and more efficient GPU support.
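On the cuDNN suggestion above: in TF1 the fused kernels live in tf.contrib.cudnn_rnn, and in TF2 tf.keras.layers.GRU/LSTM dispatch to the fused cuDNN implementation automatically when the layer arguments stay at their defaults. Below is a minimal TF2 sketch of a bidirectional 128-unit GRU encoder like the one in the question; the vocabulary size and embedding dimension are assumptions.

# A bidirectional GRU encoder that TF2 can run on the fused cuDNN kernel
# (requires default activations, recurrent_dropout=0, unroll=False, use_bias=True).
import tensorflow as tf

vocab_size, embed_dim, units = 30000, 256, 128  # 128 units as in the question

encoder = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=False),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(units, return_sequences=True)  # defaults -> cuDNN path
    ),
])

tokens = tf.random.uniform((8, 100), maxval=vocab_size, dtype=tf.int32)  # batch of 8, length 100
encoded = encoder(tokens)  # shape (8, 100, 2 * units)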
If you require improved performance, I'd propose that you use OpenVINO. It reduces inference time by graph pruning and fusing some operations. Although OpenVINO is optimized for Intel hardware, it should work with any CPU.
Here are some performance benchmarks for NLP model (BERT) and various CPUs.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (the graphics integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice is, just use AUTO.
# Load the network (input_image is assumed to be a preprocessed NumPy array
# matching the input_shape passed to the Model Optimizer)
from openvino.runtime import Core

ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

How to deploy trained tensorflow network on e.g. Raspberry Pi

I'm trying to make a simple gesture recognition system to use with my Raspberry Pi equipped with a camera. I would like to train a neural network with TensorFlow on my more powerful laptop and then transfer it to the RPi for prediction (as part of a Magic Mirror). Is there a way to export the trained network and weights and use a lightweight version of TensorFlow for the linear algebra and prediction, without the overhead of all the symbolic graph machinery that is necessary for training? I have seen the tutorials on TensorFlow Serving, but I'd rather not set up a server and just have it run the prediction on the RPi.
Yes, it's possible, and the necessary pieces are available in the source repository. This allows you to deploy and run a model trained on your laptop. Note that it is the same model, which can be big.
To deal with size and efficiency, TF is currently moving towards a quantization approach. After your model is trained, a few extra steps let you "translate" it into a lighter model with similar accuracy. Currently the implementation is quite slow, though. There is a recent post that shows the whole process for iOS, which is pretty similar to the Raspberry Pi overall.
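For what it's worth, that quantisation path has since consolidated into TensorFlow Lite, which is now the usual way to run a trained model on a Raspberry Pi. A minimal sketch, assuming a TF2 SavedModel export; the model directory name is a placeholder.

# Convert a trained SavedModel to a quantised TFLite model that the lightweight
# tflite-runtime interpreter can run on a Raspberry Pi ("gesture_model" is a placeholder).
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("gesture_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantisation
tflite_model = converter.convert()

with open("gesture_model.tflite", "wb") as f:
    f.write(tflite_model)

# On the Pi: pip install tflite-runtime, then load the file with
# tflite_runtime.interpreter.Interpreter("gesture_model.tflite") and call invoke().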
The Makefile contribution is also quite relevant for tuning and extra configuration.
Beware that this code moves often and breaks. It is sometimes useful to check out an old "release" tag to get something that works end to end.
