When training a TensorFlow object detection model, I want to view the mean average precision (mAP) on my test data. Running train.py and eval.py in two consoles, I can view the loss on the training data, and even the objects detected in the test set images, through
tensorboard --logdir=model_dir/
however, no precision scalars are displayed for the test set.
I am using Python 3 on Windows 10, and successfully installed pycocotools using:
pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
Cheers
I thought I had this problem, hence the upvote, but I simply wasn't waiting long enough. Leave the eval job running for at least a couple of hours (or longer) alongside the train job.
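For reference, a minimal sketch of the two-console setup with the legacy TF1 Object Detection API scripts; the directory layout is illustrative, not taken from the question:
# console 1: training job
python train.py --logtostderr --pipeline_config_path=model_dir/pipeline.config --train_dir=model_dir/
# console 2: evaluation job; this is what writes the mAP scalars TensorBoard displays
python eval.py --logtostderr --pipeline_config_path=model_dir/pipeline.config --checkpoint_dir=model_dir/ --eval_dir=model_dir/eval/
# console 3: TensorBoard over both directories
tensorboard --logdir=model_dir/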
@Graham Monkman is it a must for the eval and train jobs to run together so that I can get the evaluation metrics (mAP, etc.)?
Related
I am using tensorflow-2 gpu with tf.data.Dataset.
Training on small models works.
When training a bigger model, everything works at first : gpu is used, the first epoch works with no trouble (but I am using most of my gpu memory).
At validation time, I run into a CUDA_ERROR_OUT_OF_MEMORY, with repeated allocation failures for smaller and smaller numbers of bytes (ranging from 922 MB down to 337 MB).
I currently have no metrics and no callbacks and am using tf.keras.Model.fit.
If I remove the validation data, the training continues.
What is my issue? How can I debug this?
In TF1, I could use RunOptions(report_tensor_allocations_upon_oom=True); does any equivalent exist in TF2?
This occurs with tensorflow==2.1.0.
This did not occur in TensorFlow 2.0 alpha, but it does in 2.0:
pip install tensorflow-gpu==2.0.0: leaks memory!
pip install tensorflow-gpu==2.0.0-alpha: it's all right!
Try it out.
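Independent of the version you end up on, one common way to make these failures easier to diagnose in TF2 is to enable GPU memory growth, so TensorFlow allocates memory incrementally instead of reserving almost the whole device up front. A minimal sketch (not from the original answer; it assumes at least one visible GPU and must run before any GPU op):
import tensorflow as tf

# Grow GPU memory usage on demand instead of pre-allocating it all.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)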
I am using the TensorFlow Object Detection API to retrain a COCO-pretrained Faster RCNN Inception v2 model on my custom dataset, and recently noticed that for several of my models the BoxClassifierLoss gets worse over the course of training (e.g. from a loss of 0.17 up to 0.38, down to 0.24 after 100 epochs, and thereafter worsening again or fluctuating without improvement).
Therefore I am interested in freezing the BoxClassifier to preserve the initial weights that apparently work better.
I read that there is a 'freeze_variables' parameter in the train.proto, but I am unsure as to what variables to freeze exactly.
To the best of my understanding, Vinod's answer is not related to the question asked.
If you want to freeze your model to export it, then you can use export_inference_graph.
But I understand that what you wish is to freeze variables during training.
As you mentioned yourself, you can specify variables in update_trainable_variables or freeze_variables in order to choose which variables will be trained and which will not.
Essentially these are fed to the filter_variables function on your graph in order to choose which variables to include in and exclude from training. As can be seen from the description, it expects a pattern given as a regular expression. To find out your variables' names, so you can include or exclude them, you can inspect your graph. One way to do so is the Graph tab in TensorBoard.
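For illustration only, such an entry could look roughly like this in the train_config of your pipeline config; the regular expression and the SecondStageBoxPredictor scope name are assumptions, so verify the actual variable names in your own graph first:
train_config: {
  # Hypothetical pattern: freeze every variable whose name matches this regex.
  # Check the real scope names in TensorBoard before relying on it.
  freeze_variables: ".*SecondStageBoxPredictor.*"
}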
On the other hand, I want to say that this might not be the solution in your case. At the beginning of a training session it is natural to see high loss or even a loss increase. However, if the loss still fluctuates after a full training session, you should inspect the magnitude of the fluctuation. If it's a minor fluctuation, that's natural; if the magnitude is large, then maybe something is wrong in the training configuration. Further analysis of what is going wrong can only be done with more information, e.g. the config file, the loss graph, data examples, etc.
You can freeze the model.ckpt (checkpoint) files, which are stored in the following location:
C:\tensorflow1\models\research\object_detection\training
These checkpoint files are saved regularly during training, so you can check which checkpoint corresponds to the point where your error decreased and then freeze that same checkpoint as your final model.
For freezing the model, you can use the following command:
python export_inference_graph.py --input_type image_tensor --pipeline_config_path training/faster_rcnn_inception_v2_pets.config --trained_checkpoint_prefix training/model.ckpt-XXXX --output_directory inference_graph
Where, XXXX is the number in file name model.ckpt-XXXX.meta.
In my case it is model.ckpt-1970.meta, XXXX = 1970.
Check out my folder structure in the following image.
I have the following problem: when I retrain the TF Object Detection API with my own dataset, the training often gets killed and I don't know the reason. There is no error log, just "killed".
Moreover, why are only a few model.ckpt-XXXX files saved in my MODEL_DIR?
Secondly, when I try to export the above model to a frozen graph with the provided script, I see in the analysis that there is an incomplete shape:
================== Model Analysis Report ======================
Incomplete shape.
I used a model.ckpt-XXXX checkpoint from after the training process got killed; is that the reason why the shape is incomplete?
The exported model can be used for inference, but I guess it is not optimal...
FYI, I have retrained mobileSSDv2 with 1 class and I have modified the pipeline config file as follows:
I changed the number of classes to 1.
In the train_config {} part, I changed the batch size to 12 and set the number of steps to 200.
In the train_input_reader and eval_input_reader {} parts, I added the paths to my TFRecord and labelmap.pbtxt.
In the eval_config {} part, I changed the number of examples to 85 (the number of pictures in my eval images directory) and max evals to 5.
I use Ubuntu 16.04 with tensorflow-gpu 1.12.0 in a virtualenv with Python 2.7.
Thank you in advance.
If you are using tensorflow-gpu and you have a GPU, 200 is a really low number of steps, one that you will reach within a few minutes (and your conv-net will learn nothing). Increase it to at least 100,000.
Moreover, due to the low number of training steps, you should expect training to save your model at the start (step 0) and at the end of training (step 200), so you only get 2 checkpoints.
TensorFlow saves a checkpoint every 600 seconds if you don't change save_interval_secs inside trainer.py.
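As a hedged illustration of the change suggested above, the relevant part of the train_config in the pipeline config could look roughly like this (the values are examples, not recommendations tuned to this dataset):
train_config: {
  batch_size: 12
  # 200 steps is far too few; give the network enough steps to actually learn.
  num_steps: 200000
}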
Hi, I have a problem with Keras on Python 3.6.
My environment is Keras with Python, CPU only.
The problem is that when I repeatedly call the same Keras model to predict on different inputs, it gets slower and slower.
My code is as simple as this:
for i in range(100):
model.predict(x)
The first run is fast, maybe 2 seconds, but the second run takes 3 seconds and the third takes 5 seconds... it keeps getting slower and slower even if I use the same input.
How can I keep repeated predict calls on a Keras model fast? I don't want any slowdown; this is very critical for my use case.
How can I fix it?
Try using the __call__ method directly. The documentation of the predict method states the following:
For small numbers of inputs that fit in one batch, directly use __call__() for faster execution, e.g., model(x).
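Applied to the loop from the question, that suggestion looks roughly like this (a minimal sketch; passing training=False is an assumption appropriate for inference):
for i in range(100):
    # Calling the model directly avoids predict()'s per-call overhead
    # for inputs that fit in a single batch.
    y = model(x, training=False)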
I see that performance is critical in this case. So, if that doesn't help, you could use OpenVINO, which is optimized for Intel hardware but should work with any CPU. Your performance should be much better than using Keras directly.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO cannot convert an HDF5 model directly, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (a GPU integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice is for you, use AUTO.
# Load the network (Core comes from the OpenVINO runtime package)
from openvino.runtime import Core
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
If your model runs prediction in batches, different samples in the same batch can take slightly different amounts of time over the course of the iterations, and as you run more and more batches through the model again and again, the measured prediction time will get longer and longer.
I am working on building an object detection model which I would like to train with 22 new classes (most of them are not in the COCO or PETS datasets).
What I've already done is:
Prepared images with multiple labels using LabelImg.
Decreased the image size by a factor of 2 for images bigger than 500 KB.
Converted the XML annotations to a CSV file.
Converted the CSV and images to TFRecords.
Using the TensorFlow sample config files, I've trained with several pretrained checkpoints.
Results: SSD_Mobilenet and SSD_Inception resulted in no classes found (loss ~10.0), while Faster RCNN Inception did succeed in detecting some of the objects (loss ~0.7).
My questions are:
What is the difference between train.py from Object Detection (which I used above), retrain.py from image_retraining, and train_image_classifier.py from Slim?
Which is better for my task? Or should I do it in a different way?
While running train.py on the Faster RCNN Inception model, I found that the loss was around 0.7 and not going lower even after 100k steps. Is there any goal in terms of loss to achieve?
How do you suggest changing the config file to improve this?
I found other models, for instance Inception V4, etc., which don't have sample config files in TF Slim. Should I try them, and if so, how can I use them?
I am pretty new to this field and I need some support in understanding the terms and actions.
BTW: I am using a GTX 1060 (GPU) for training, but eval does not run in parallel, so I can't get the mAP for validation. I tried to force eval onto the CPU but with no success.
Thanks.
1) What is the difference between train.py from Object Detection (which I used above), retrain.py from image_retraining, and train_image_classifier.py from Slim?
Ans: As far as I know, none, because train.py imports trainer.py, which uses slim.learning.train (the same function that is used in train_image_classifier.py) to train.
2) Which is better for my task? Or should I do it in a different way?
Ans: The above answer answers this question too.
3) While running train.py on the Faster RCNN Inception model, I found that the loss was around 0.7 and not going lower even after 100k steps. Is there any goal in terms of loss to achieve?
Ans: If you use TensorBoard to visualize your results, you will find that when your classification loss graph is not changing a lot (i.e. has converged), your model is trained. Regarding the loss of 0.7, that's high after training for so many steps. Just check your pipeline config file parameters.
4) How do you suggest changing the config file to improve this?
Ans: The learning rate value can be a good place to start.
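As a purely illustrative, hedged sketch, this is roughly where the learning rate lives in a Faster RCNN sample pipeline config (the numbers are placeholders, not tuned values):
optimizer {
  momentum_optimizer {
    learning_rate {
      manual_step_learning_rate {
        # Placeholder values: lower the initial rate and/or adjust the schedule
        # if the loss plateaus or oscillates.
        initial_learning_rate: 0.0002
        schedule { step: 90000 learning_rate: 0.00002 }
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}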
5) I found other models, for instance Inception V4, etc., which don't have sample config files in TF Slim. Should I try them, and if so, how can I use them?
Ans: Currently I don't have an answer for this, but I will get back to you.
(Not a complete answer, but I hope it helps in some way!)
Are your annotated objects small relative to the image size?
I had the same problem of no or few detections with SSD, and found that the model is very sensitive to the settings which determine the size of the box proposals (the anchor generator). Here is a link with some details.
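For illustration, the anchor settings for an SSD model live in the pipeline config roughly like this (a hedged sketch with placeholder values; tune min_scale/max_scale and aspect_ratios to the size of your annotated objects):
anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    # Placeholder scales: lower min_scale if your objects are small
    # relative to the image.
    min_scale: 0.2
    max_scale: 0.95
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
  }
}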
Further, having an active eval job running is very important when debugging and tuning a model. TotalLoss, or any of the values returned from the train job, does not tell you about the performance of the actual model, only whether it is converging. The eval job gives you e.g. mAP, which is a real measure of performance.
A simple way to force an eval job onto the CPU is the following:
a) install a virtual environment dedicated to the eval job, instructions here
b) activate the virtual environment and install TensorFlow CPU in it (yes, you should install TensorFlow again, and without GPU support)
c) start the train job as usual on your tensorflow-gpu installation (in whatever way you have installed it)
d) run the eval job in the virtual environment; this forces it onto the CPU and works great! (I also run TensorBoard from this installation to minimise the risk of interference with the train job.) A sketch of the whole setup follows below.
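Putting a) through d) together, a hedged sketch of the setup on Linux/macOS (the environment name, the use of venv, and the paths are illustrative):
# dedicated CPU-only environment for the eval job
python -m venv eval_env
source eval_env/bin/activate
pip install tensorflow        # CPU build, no GPU support
# train as usual from your tensorflow-gpu installation in another shell,
# then run eval.py (and optionally tensorboard) from this environment.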
Retraining is used to add a layer on top of a pretrained model. You can save time this way. It is useful for thousands of pictures, but useless for millions of labelled pictures, and less efficient than training from scratch. There are templates for the config file. If there is no config file, create your own. Look at the TensorFlow GitHub explanations.