I have the following problem: when I retrain the TF Object Detection API with my own dataset, the training is often killed and I don't know why. There is no error log; it is just killed.
Moreover, why are only a few model.ckpt-XXXX files saved in my MODEL_DIR?
Secondly, when I try to export the above model to a frozen graph with the provided script, the analysis reports an incomplete shape:
================== Model Analysis Report ======================
Incomplete shape.
I used a model.ckpt-XXXX saved after the training process got killed; is that the reason why the shape is incomplete?
The exported model can be used for inference, but I guess it is not optimal...
FYI, I have retrained the SSD MobileNet v2 model with 1 class and modified the pipeline config file as follows:
I changed the number of classes to 1.
In the train_config {} part, I changed the batch size to 12 and set the number of steps to 200.
In the train_input_reader and eval_input_reader {} parts, I added the paths to my TFRecord and labelmap.pbtxt files.
In the eval_config {} part, I changed the number of examples to 85 (the number of pictures in my eval images directory) and max evals to 5.
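For reference, the edits above end up looking roughly like the excerpt below (paths are shortened placeholders and the exact field layout depends on the model, so treat this as an illustrative sketch rather than my exact file):

model {
  ssd {
    num_classes: 1
    ...
  }
}
train_config {
  batch_size: 12
  num_steps: 200
  ...
}
train_input_reader {
  tf_record_input_reader { input_path: "path/to/train.record" }
  label_map_path: "path/to/labelmap.pbtxt"
}
eval_config {
  num_examples: 85
  max_evals: 5
}
eval_input_reader {
  tf_record_input_reader { input_path: "path/to/eval.record" }
  label_map_path: "path/to/labelmap.pbtxt"
}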
I use Ubuntu 16.04 with tensorflow-gpu 1.12.0 in a virtualenv with Python 2.7.
Thank you in advance.
If you are using tensorflow-gpu and you have a GPU, 200 is a really low number of steps that you reach in a few minutes (and your conv-net will learn nothing). Increase it to at least 100,000.
Moreover, because of the low number of training steps, you should expect training to save your model only at the start (step 0) and at the end of training (step 200), so you get only 2 checkpoints.
TensorFlow saves a checkpoint every 600 seconds unless you change save_interval_secs inside trainer.py.
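For illustration, checkpoint frequency and retention come down to slim.learning.train's save_interval_secs argument and the max_to_keep setting of the tf.train.Saver it uses. The toy example below is not the object detection code itself, just a minimal, self-contained sketch of those knobs (TF 1.x with tf.contrib.slim assumed):

import tensorflow as tf

slim = tf.contrib.slim

# Toy graph so slim.learning.train has something to optimize.
x = tf.Variable(5.0, name='x')
loss = tf.square(x)
optimizer = tf.train.GradientDescentOptimizer(0.1)
train_op = slim.learning.create_train_op(loss, optimizer)

# max_to_keep (default 5) is why only a handful of model.ckpt-XXXX files survive;
# save_interval_secs (default 600) is the "every 600 seconds" mentioned above.
saver = tf.train.Saver(max_to_keep=5)
slim.learning.train(
    train_op,
    logdir='/tmp/toy_train_dir',  # placeholder directory
    number_of_steps=200,
    save_interval_secs=600,
    saver=saver)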
I am using the TensorFlow Object Detection API to retrain a COCO-pretrained Faster RCNN Inception v2 model on my custom dataset, and I recently noticed that the BoxClassifierLoss of several of my models gets worse over the course of training (e.g. from a loss of 0.17 up to 0.38, down to 0.24 after 100 epochs, and thereafter getting worse again or fluctuating without improvement).
Therefore I am interested in freezing the BoxClassifier to preserve the initial weights that apparently work better.
I read that there is a 'freeze_variables' parameter in the train.proto, but I am unsure as to what variables to freeze exactly.
To the best of my understanding, Vinod's answer is not related to the question asked.
If you want to freeze your model to export it, then you can use export_inference_graph.
But I understand that what you wish is to freeze variables during training.
As you mentioned yourself, you can specify variables in update_trainable_variables or freeze_variables in order to choose which variables will be trained and which will not.
Essentially these are fed to the filter_variables function on your graph in order to choose which variables to include in and exclude from training. As can be seen from the description, it expects a pattern given as a regular expression. To know your variables' names, so that you can include or exclude them, you can inspect your graph; one way to do so is TensorBoard's Graph tab.
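If you prefer not to dig through TensorBoard, you can also list the variable names directly from a checkpoint and build your regular expression from them. A minimal sketch (the checkpoint path and the example pattern are placeholders, not taken from your setup):

import tensorflow as tf

# Print every variable stored in a checkpoint, so you can pick the name
# prefixes to use in freeze_variables / update_trainable_variables.
ckpt_path = 'training/model.ckpt-10000'  # placeholder checkpoint
for name, shape in tf.train.list_variables(ckpt_path):
    print(name, shape)

# A resulting train_config entry might then look like, for example:
#   freeze_variables: "SecondStageBoxPredictor"
# (the pattern is illustrative; use whatever prefixes your listing shows).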
On the other hand, I should say that this might not be the solution in your case. At the beginning of a training session it is natural to expect a high loss or a loss increase. However, if the loss still fluctuates after a full training session, you should inspect the magnitude of the fluctuation: a minor fluctuation is natural, but if the magnitude is large, something may be wrong in the training configuration. Further analysis of what is going wrong can only be done with more information, e.g. the config file, the loss graph, data examples, etc.
You can freeze the model.ckpt (checkpoint) files, which are stored in the following location:
C:\tensorflow1\models\research\object_detection\training
These checkpoint files are written frequently during training, so you can check at which checkpoint your error is lowest and then freeze that checkpoint as your final model.
To freeze the model, you can use the following command:
python export_inference_graph.py --input_type image_tensor --pipeline_config_path training/faster_rcnn_inception_v2_pets.config --trained_checkpoint_prefix training/model.ckpt-XXXX --output_directory inference_graph
where XXXX is the number in the file name model.ckpt-XXXX.meta.
In my case it is model.ckpt-1970.meta, so XXXX = 1970.
Check out my folder structure in the following image.
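Once the export has finished, you can sanity-check the frozen graph by loading it and running a dummy image through it. A minimal TF 1.x sketch, assuming the default input/output tensor names used by the Object Detection API exporter (adjust them if your graph differs):

import numpy as np
import tensorflow as tf

# Load the frozen graph produced by export_inference_graph.py.
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('inference_graph/frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

# Run a dummy image and print how many detections come back.
with tf.Session(graph=graph) as sess:
    dummy = np.zeros((1, 300, 300, 3), dtype=np.uint8)
    boxes, scores, classes, num = sess.run(
        ['detection_boxes:0', 'detection_scores:0',
         'detection_classes:0', 'num_detections:0'],
        feed_dict={'image_tensor:0': dummy})
    print('detections:', int(num[0]))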
I'm trying to train my YOLO model to identify fire extinguishers and label them as "Fire Safety". Currently I either get overfitted or underfitted results (see below).
My annotated sample image set is around ~1500 images.
yolo-new.cfg is configured with width=608 and height=608.
And I have trained using the following command:
python flow --model cfg/yolo-new.cfg --labels one_label.txt --train
--trainer adam --dataset "C://Users//G//Desktop//Development//ML//YOLO//BBox-Label-Tool//Images//002"
--annotation "C://Users//G//Desktop//Development//ML//YOLO//BBox-Label-Tool//AnnotationsXML//002"
--batch 4 --gpu 0.8
So after 13000 steps:
So I went to validate my results and this is what I get (checkpoint 13000):
So I thought this might be a case of severe overfitting, and I iterated through the checkpoints to see which gives the closest fit.
This is what I get using checkpoint 6500
This is what I get using checkpoint 6000
This is what I get using checkpoint 5500
So, as you can see, checkpoint 6000 gives the best result in my case, but it isn't good enough. How do I improve on this? Increase the batch size? (My GTX 1070 Ti can't handle it; CUDA out-of-memory errors occur.) Any ideas to solve this?
Using YOLOv3 to train my image set solved my issue. https://github.com/AlexeyAB/darknet
One thing to note is not to leave any blanks during annotation; perhaps this is one of the reasons why the detection did not work as planned.
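For anyone following the same route, training with that repository typically follows the pattern from its README, roughly like this (the .data, .cfg and pretrained-weights file names are the README's examples, not mine):

./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74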
When training a TensorFlow object detection model, I want to view the mean average precision (mAP) on my test data. Running train.py and eval.py in two consoles, I can view the losses on the training data, and even the objects detected in the test set images, through
tensorboard --logdir=model_dir/
however, no precision scalars are displayed for the test set.
I am using Python 3 on Windows 10, and successfully installed pycocotools using:
pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
Cheers
I thought I had this problem, hence the upvote, but I simply wasn't waiting long enough. Leave the eval running alongside the train job for at least a couple of hours or longer.
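For reference, a typical two-console setup with the legacy scripts looks something like the lines below (paths are placeholders); the eval job is the one that writes the precision/mAP summaries, and pointing TensorBoard at a directory that contains both the train and eval output folders lets both sets of scalars show up:

python train.py --logtostderr --pipeline_config_path=model_dir/pipeline.config --train_dir=model_dir/train
python eval.py --logtostderr --pipeline_config_path=model_dir/pipeline.config --checkpoint_dir=model_dir/train --eval_dir=model_dir/eval
tensorboard --logdir=model_dir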
@Graham Monkman is it a must for the eval and train jobs to run together so that I can get the evaluation metrics (mAP, etc.)?
I am working on building an object detection model with 22 new classes (most of them are not in the COCO or PETS datasets).
What I've already done is:
Prepared images with multiple labels using LabelImg.
Decreased the image size by a factor of 2 for images bigger than 500k.
Converted the XML annotations to a CSV file.
Converted the CSV file and images to TFRecords.
Trained with several pretrained checkpoints using the TensorFlow sample config files.
Results: SSD_Mobilenet and SSD_Inception resulted in no classes found (loss ~10.0), while Faster RCNN Inception did succeed in detecting some of the objects (loss ~0.7).
My questions are:
What is the difference between train.py from the object detection API (which I used above), retrain.py from image_retraining, and train_image_classifier.py from Slim?
Which is better for my task? Or should I do it in a different way?
While running train.py on the Faster RCNN Inception model, I found that the loss stayed around 0.7 and did not go lower even after 100k steps. Is there a target loss to aim for?
How do you suggest changing the config file to improve this?
I found other models, for instance Inception V4 etc., in TF-Slim which don't have sample config files. Should I try them, and if so, how can I use them?
I am pretty new in this field and I need some support in understanding the terms and actions.
BTW: I am using a GTX 1060 (GPU) for training, but eval does not work in parallel, so I can't get the mAP for validation. I tried to force eval onto the CPU, but with no success.
Thanks.
1) What is the difference between train.py from the object detection API (which I used above), retrain.py from image_retraining, and train_image_classifier.py from Slim?
Ans: As far as I know, none, because train.py imports trainer.py, which uses slim.learning.train (the same function used in train_image_classifier.py) to do the training.
2) Which is better for my task? Or should I do it in a different way?
Ans: The above answer answers this question too.
3) While running train.py on the Faster RCNN Inception model I found that the loss stayed around 0.7 and did not go lower even after 100k steps. Is there a target loss to aim for?
Ans: If you use TensorBoard to visualize your results, you will find that your model is trained once the classification loss graph stops changing much (has converged). Regarding the loss of 0.7: that's high after training for so many steps, so check your pipeline config file parameters.
4) How do you suggest changing the config file to improve this?
Ans: The learning rate value can be a good place to start.
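For orientation, the learning rate lives in the optimizer block of train_config. An illustrative excerpt in the style of the Faster R-CNN sample configs (the values are placeholders, not a recommendation):

train_config {
  optimizer {
    momentum_optimizer {
      learning_rate {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 900000
            learning_rate: 0.00002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
  }
}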
5) I found other models, for instance Inception V4 etc., in TF-Slim which don't have sample config files. Should I try them, and if so, how can I use them?
Ans: Currently I don't have an answer for this, but I will get back to you.
(Not a complete answer, but I hope it helps in some way!)
Are your annotated objects small relative to the image size?
I had the same problems with no or few detections with SSD and found that the model is very sensitive to the settings which determine the size of the box proposals (the anchor generator). Here is a link with some details.
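For context, the setting in question is the anchor_generator block of the SSD model config; if your objects are small relative to the image, the default min_scale can be too large for any anchor to match them well. An illustrative excerpt with the usual sample defaults (adjust the scales to your object sizes):

anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.2
    max_scale: 0.95
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
    aspect_ratios: 3.0
    aspect_ratios: 0.3333
  }
}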
Further, having an active eval job running is very important when debugging and tuning a model. TotalLoss, or any of the other values returned by the train job, does not tell you how the actual model performs, only whether it is converging. The eval job gives you e.g. mAP, which is a real measure of performance.
A simple way to force an eval job on cpu is by doing the following:
a) install a virtual environment dedicated to the eval job, instructions here
b) activate the virtual environment and install tensorflow cpu in the virtual environment (yes, you should install tensorflow again, and without gpu support)
c) start the train job as usual on your tensorflow-gpu (in whatever way you have installed it)
d) run the eval job in the virtual environment (this will force it to run on the cpu and works great! I also run tensorboard from this installation to minimise risk of interference with the train job)
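A rough sketch of steps a)-d) (the environment name, paths and version are placeholders):

# a) create a virtual environment dedicated to evaluation
virtualenv ~/venvs/tf-cpu-eval
# b) activate it and install the CPU-only TensorFlow build (note: no -gpu suffix)
source ~/venvs/tf-cpu-eval/bin/activate
pip install tensorflow==1.12.0
# c) start the train job as usual from your tensorflow-gpu installation, in a separate console
# d) from this environment, run eval.py (and optionally tensorboard) as usual;
#    with no GPU build installed here, the eval job is forced onto the CPU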
Retraining adds a layer on top of a pretrained model, so you can save time that way. It is useful for thousands of pictures, but useless for millions of labelled pictures, and it is less efficient than training from scratch. There are template config files; if there is no config file for your model, create your own. Look at the explanations on the TensorFlow GitHub.
I am trying to use the ssd_inception_v2_coco pre-trained model from the TensorFlow API, training it on a single-class dataset and applying transfer learning. I trained the net for around 20k steps (total loss around 1) and, using the checkpoint data, I created inference_graph.pb and used it in the detection code.
To my surprise, when I tested the net with the training data, the graph is not able to detect even 1 out of 11 cases (0/11). I am lost trying to find the issue.
What might be the possible mistake?
P.S.: I am not able to run train.py and eval.py at the same time due to memory issues, so I don't have precision info from TensorBoard.
Has anyone faced similar kind of issue?