I'm using Inception v3 and TensorFlow to identify some objects within an image.
However, it just produces a list of possible objects, and I need it to report their positions in the image.
I'm following the flowers tutorial: https://www.tensorflow.org/versions/r0.9/how_tos/image_retraining/index.html
bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir ~/flower_photos
Inception is a classification network, not a localization network.
You need another architecture to predict the bounding boxes, like R-CNN and its newer (and faster) variants (Fast R-CNN, Faster R-CNN).
Optionally, if you want to use Inception and you have a training set annotated with class and bounding box coordinates, you can add a regression head to Inception and make the network learn to regress the bounding box coordinates.
It's the same idea as transfer learning: you use the last convolutional layer's output as a feature extractor and train this new head to regress 4 coordinates + 1 class for every bounding box in your training set.
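As a rough sketch of that idea (the layer sizes, class count, and the one-box-per-image simplification are my own assumptions, not part of the original setup), you could freeze a pre-trained Inception v3 and train a small head that regresses one box plus a class label per image:

    import tensorflow as tf

    NUM_CLASSES = 5  # assumption: number of object classes in your training set

    # Pre-trained Inception v3 used as a frozen feature extractor (transfer learning).
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", pooling="avg",
        input_shape=(299, 299, 3))
    base.trainable = False

    # New regression head: 4 box coordinates + class scores for one box per image.
    x = tf.keras.layers.Dense(256, activation="relu")(base.output)
    boxes = tf.keras.layers.Dense(4, name="bbox")(x)        # [xmin, ymin, xmax, ymax]
    labels = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax", name="cls")(x)

    model = tf.keras.Model(base.input, [boxes, labels])
    model.compile(optimizer="adam",
                  loss={"bbox": "mse", "cls": "sparse_categorical_crossentropy"})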
By default, Inception does not output coordinates. There are specific tools for that, like Faster R-CNN, available for Caffe.
If you want to stick with TensorFlow, you can retrain Inception to output coordinates if you have human-annotated images.
Putting bounding boxes around objects is usually called detection in the lingo of the field, and there is a whole category of networks designed for it. There's a separate category in the PASCAL VOC competition for detection, and that's a good place to find strong detection networks.
My favorite detection network (which is currently the leader for the 2012 PASCAL VOC dataset) is YOLO, which starts with a typical classifier but then adds some extra layers to support bounding boxes. Instead of just returning a class, it produces a downsampled version of the original image, where each pixel has its own class. Then it has a regression layer that predicts the exact position and size of the bounding boxes. You can start with a pre-trained classifier, then modify it into a YOLO network and retrain it. The procedure is described in the original YOLO paper.
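To make the grid idea concrete, here is a rough sketch of decoding a YOLO-style output into boxes; the grid size, tensor layout, and threshold are illustrative assumptions rather than the exact layout from the paper:

    import numpy as np

    S, NUM_CLASSES = 7, 20                          # assumed grid size and class count
    preds = np.random.rand(S, S, 5 + NUM_CLASSES)   # stand-in for the network output

    # Each grid cell predicts (x, y) offsets inside the cell, a width/height
    # relative to the whole image, an objectness score, and class scores.
    boxes = []
    for row in range(S):
        for col in range(S):
            x_off, y_off, w, h, obj = preds[row, col, :5]
            if obj < 0.5:                           # keep only confident cells
                continue
            cx = (col + x_off) / S                  # box centre in [0, 1] image coords
            cy = (row + y_off) / S
            cls = int(np.argmax(preds[row, col, 5:]))
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, cls, obj))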
I like YOLO because it has a simple structure compared to other detection networks, it lets you use transfer learning from classification networks (which makes it easier to train), and its detection speed is very fast. It was actually developed for real-time detection in video.
There is an implementation of YOLO in TensorFlow, if you would like to avoid using the custom darknet framework used by the authors of YOLO.
I am new to computer vision, and I still haven't tried any kind of neural network detector such as YOLO; however, I want to do object tracking before entering the field of detection. I started reading about Deep SORT, and all the projects use deep learning detections that need training. My question is: can I give an ROI result to my Deep SORT tracker instead of detections from YOLO, and have it keep tracking the object selected with the ROI?
Here is a link where I found information about the code of DeepSORT: DeepSORT: Deep Learning to Track Custom Objects in a Video
In DeepSORT, you need detections in order to perform tracking; it is a tracking-by-detection method. The detection results are input to the Kalman filter component of DeepSORT, and the filter generates tracking predictions. The bounding boxes from detection are also used to extract RoI crops from the input image, and these crops are fed to a trained Siamese model for feature extraction. The features from the Siamese model help reduce ID switches.
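To answer the ROI part of the question: the tracker only needs detections in a fixed format, so a manually selected ROI can in principle be wrapped in the same structure a detector would produce. A minimal sketch, assuming the layout of the nwojke/deep_sort reference implementation; the appearance feature here is a zero placeholder, so re-identification after occlusion will not work well without a real embedding:

    import numpy as np
    from deep_sort import nn_matching
    from deep_sort.detection import Detection
    from deep_sort.tracker import Tracker

    # Assumed API of the nwojke/deep_sort reference code.
    metric = nn_matching.NearestNeighborDistanceMetric("cosine", matching_threshold=0.2)
    tracker = Tracker(metric)

    # A manually selected ROI in (top-left x, top-left y, width, height) format,
    # e.g. from cv2.selectROI, used in place of a YOLO detection.
    roi_tlwh = np.array([100.0, 150.0, 80.0, 120.0])
    dummy_feature = np.zeros(128, dtype=np.float32)  # placeholder appearance feature

    tracker.predict()
    tracker.update([Detection(roi_tlwh, 1.0, dummy_feature)])

    for track in tracker.tracks:
        print(track.track_id, track.to_tlbr())       # current track box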
If you are only interested in tracking, and ID switches in case of occlusion are not your concern, then you can have a look at CenterTrack. It does joint detection and tracking in a single model, so you can avoid training a model from scratch: the authors provide pre-trained models for tracking both pedestrians and vehicles. Compared to DeepSORT, CenterTrack is pretty fast.
[Sorry for the late reply] I think you should try a Siamese network for tracking by selecting the ROI region. You can find many variants at the link below:
https://github.com/HonglinChu/SiamTrackers.
I'm new to deep learning and CNNs. I understand how convolutional and pooling layers work, and I understand how and why feature maps are created. How can I localize, from the feature maps, the important areas in the original image? Is that possible?
I.e., I use my own data (medical images).
Feeding those images to my neural network, I get feature maps and can calculate from them the probability of each class.
But how, from those feature maps, can I find out where exactly those areas are?
I will use those areas for another application.
I'm building a CNN that will tell me if a person has brain damage. I'm planning to use the TF Inception v3 model and the build_image_data.py script to build TFRecords.
The dataset is composed of brain scans. Every scan has about 100 images (different head poses, angles). On some images the damage is visible, but on some it is not. I can't label all the images from a scan as damage-positive (or negative), because some of them would be labeled wrong (if a scan is positive for damage, but the damage is not visible in a specific image).
Is there a way to label the whole scan as positive/negative and in that way train the network?
And after training is done, pass a scan (not a single image) as input to the network and classify it.
It looks like multiple instance learning might be your approach. Check out these two papers:
Multiple Instance Learning Convolutional Neural Networks for Object Recognition
Classifying and segmenting microscopy images with deep multiple instance learning
The last one is implemented by @dancsalo (not sure if he has a Stack Overflow account) here.
It looks like the second paper deals with very large images and breaks them into sub-images, but labels the entire image. So it is like labeling a bag of images with one label instead of having to make a label for each sub-image. In your case, you might be able to construct a matrix of images, i.e., a 10 image x 10 image master image for each of the scans...
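As a rough illustration of that multiple-instance idea (not the exact model from either paper; the bag size, image size, and pooling choice are assumptions), you can run one shared CNN over every image in a scan and max-pool the per-image scores into a single scan-level prediction, so only scan-level labels are needed:

    import tensorflow as tf

    IMAGES_PER_SCAN, H, W = 100, 128, 128        # assumed bag size and image size

    # Shared CNN that scores a single image for visible damage.
    encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(H, W, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # per-image damage score
    ])

    # A scan is a bag of images with one label; the max over the bag gives the scan score.
    scan = tf.keras.Input(shape=(IMAGES_PER_SCAN, H, W, 1))
    per_image = tf.keras.layers.TimeDistributed(encoder)(scan)    # (batch, 100, 1)
    scan_score = tf.keras.layers.GlobalMaxPooling1D()(per_image)  # worst image wins
    model = tf.keras.Model(scan, scan_score)
    model.compile(optimizer="adam", loss="binary_crossentropy")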
Let us know if you do this and if it works well on your data set!
Is it possible to have bounding boxes prediction using TensorFlow?
I found TensorBox on GitHub, but I'm looking for a better-supported or maybe official way to address this problem.
I need to retrain the model for my own classes.
It is unclear what exactly you mean. Do you need object detection? I assume so from the 'bounding boxes'. If so, Inception networks are not directly applicable to your task; they are classification networks.
You should look at object detection models, like the Single Shot Detector (SSD) or You Only Look Once (YOLO). They often use pre-trained convolutional layers from classification networks, but have additional layers on top of them. If you want Inception (aka GoogLeNet), YOLO is based on that. Take a look at this implementation: https://github.com/thtrieu/darkflow, or any other you can find on Google.
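For reference, running darkflow typically looks roughly like the snippet below; the cfg/weights paths and the threshold are placeholders, so check the repo's README for the exact options it supports:

    import cv2
    from darkflow.net.build import TFNet

    # Rough usage sketch of darkflow (a TensorFlow port of YOLO); the model and
    # weight file paths are placeholders for files you download or train yourself.
    options = {"model": "cfg/yolo.cfg", "load": "bin/yolo.weights", "threshold": 0.4}
    tfnet = TFNet(options)

    image = cv2.imread("sample.jpg")
    predictions = tfnet.return_predict(image)

    # Each prediction is a dict with a label, a confidence, and the box corners.
    for p in predictions:
        print(p["label"], p["confidence"], p["topleft"], p["bottomright"])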
The COCO 2016 winner for object detection was implemented in TensorFlow. Some state-of-the-art techniques are Faster R-CNN, R-FCN, and SSD. Check the slides at http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf (slide 14 has the key TensorFlow ops for recreating this pipeline).
Edit 6/19/2017:
TensorFlow has released tools for predicting bounding boxes:
https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html
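Once you export a model with that Object Detection API, inference follows a pattern roughly like the sketch below (TF 1.x-style frozen graph; the tensor names and the graph path are assumptions based on the API's standard exports):

    import numpy as np
    import tensorflow as tf

    PATH_TO_GRAPH = "frozen_inference_graph.pb"   # placeholder path to your export

    # Load the frozen detection graph.
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(PATH_TO_GRAPH, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")

    # Run one image through it; the output tensors hold boxes, scores, and classes.
    with tf.compat.v1.Session(graph=graph) as sess:
        image = np.zeros((1, 300, 300, 3), dtype=np.uint8)   # stand-in input batch
        boxes, scores, classes = sess.run(
            ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
            feed_dict={"image_tensor:0": image})
        # Boxes are [ymin, xmin, ymax, xmax] in normalized image coordinates.
        print(boxes[0][:5], scores[0][:5], classes[0][:5])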
I am trying to make an end-to-end unified model that detects (localizes) an object in an image. The object itself can be of many types, like "text in the wild", but the surrounding features of the object should determine where the region of interest is.
Like detecting a human face without considering the features of the face itself, i.e., it sits at some distance above the neck.
I'm expecting the output to be the coordinates of the object, or bounding boxes in the ImageNet style: [xmin, ymin, xmax, ymax].
I have a dataset of 500 images. Are there any examples of object detection in TensorFlow based on surrounding features, i.e., the feature maps from conv1 or conv2?
There is a TensorFlow-based framework for object detection/localization that you can check out:
https://github.com/Russell91/TensorBox
Though I am not sure that 500 images would be enough to successfully retrain the provided model(s).
Object detection using deep learning is broadly classified into one-stage detectors (YOLO, SSD) and two-stage detectors like Faster R-CNN. Google's repo [1] contains pre-trained models for various detection architectures.
You could pick up a pre-trained model and then train it on your dataset. The two-stage models are modular, and you have a choice of different feature extractors depending on whether speed or accuracy is more crucial for you.
[1] Google's object detection repository