Object Detection API - Processing background images and objects labeled multiple times - python

I use the OD-API to train models. I have two questions about how background images, and images where the same object is labeled twice (or more) under different label names, are processed when using faster_rcnn_resnet101 and SSD_mobilenet_v2.
1- When an image has no ground truth boxes (a background image), do we still generate anchor boxes for it with Faster R-CNN (or default boxes for the SSD) even though there are no GT boxes? Or is the whole image treated as a negative example in that case?
2- When an image has two (or more) GT boxes with the same coordinates but different label names, does this cause issues when matching with anchor boxes (or default boxes for the SSD)? For example, will only one of the GT boxes be matched?
I would be glad for any help. I tried reading papers, tutorials and books, but couldn't find answers, or maybe I am missing something.
Regarding question 2, Prof. Andrew Ng says at 6:55 of this video about anchor boxes in YOLO that cases with multiple objects in the same grid cell can't be handled well. So maybe the same applies to my case, even though I don't know what actually happens as a result.
I also think the files target_assigner.py and argmax_matcher.py hold some clues, but I can't really confirm that.
Thank you in advance

1) Anchor boxes are independent of the ground truth boxes; they are generated from the image shape (and the anchor configuration). What is generated from the GT boxes and the anchors are the targets used to train the bounding box regression head. If there are no ground truth boxes, no regression targets are generated: the whole image is used as negative samples for the classification head, and the regression head is not affected (it only trains on positive samples).
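To make point 1 concrete, here is a minimal sketch (not the actual OD-API code, just an illustration) showing that anchor generation only needs the feature-map grid and the anchor configuration, never the ground truth:

```python
# Minimal sketch: anchors depend only on the grid shape, stride, scales and
# ratios -- no ground truth boxes are involved at this stage.
import numpy as np

def generate_anchors(grid_h, grid_w, stride, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return (grid_h * grid_w * len(scales) * len(ratios), 4) boxes as [y1, x1, y2, x2]."""
    anchors = []
    for y in range(grid_h):
        for x in range(grid_w):
            cy, cx = (y + 0.5) * stride, (x + 0.5) * stride  # cell centre in image coordinates
            for s in scales:
                for r in ratios:
                    h, w = s * np.sqrt(r), s / np.sqrt(r)    # aspect ratio r = h / w, area = s**2
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors)

anchors = generate_anchors(grid_h=38, grid_w=50, stride=16)  # e.g. a 600x800 image at stride 16
print(anchors.shape)  # (17100, 4) -- generated without ever looking at GT boxes
```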
2) I am not 100% sure on this one, but as far as I can tell, the bounding box regression won't have a problem (if the bounding boxes are identical, the IoU with anchors is identical and the target assigner will just pick one of the two), but classification might. IIRC there are ways to enable multi-label classification (although I have no experience in it), so that may help you out a bit. The best solution, though, would be not to have objects annotated multiple times.
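A toy illustration of the matching point (this is not the OD-API's argmax_matcher.py, just numpy): with two identical GT boxes, the columns of the IoU matrix are identical, so an argmax-style matcher simply keeps one of them.

```python
# Two identical GT boxes with different labels produce two identical IoU columns,
# so argmax matching assigns each anchor to only one of them (numpy breaks ties
# by taking the lower index).
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """Boxes as [y1, x1, y2, x2]; returns a (len(a), len(b)) IoU matrix."""
    y1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    x1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    y2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    x2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

anchors = np.array([[0, 0, 100, 100], [50, 50, 150, 150]], dtype=float)
gt = np.array([[10, 10, 110, 110], [10, 10, 110, 110]], dtype=float)  # same box, two labels

ious = pairwise_iou(anchors, gt)
print(ious)                  # both columns are identical
print(ious.argmax(axis=1))   # [0 0] -> each anchor is matched to only one of the duplicates
```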

Related

Do anchor box sizes get refined during training of object detection models like Faster R-CNN, YOLO and SSD?

I was learning how object detection models such as Faster R-CNN, YOLOv3 and SSD work, and I got confused about anchor box size refining.
Of course anchor box sizes (and positions) get refined during training. As you said in a comment, anchor boxes are nothing more than a set of reference boxes placed at fixed positions on the output grid. Each grid cell additionally predicts the objectness score as well as the label and the exact coordinates of the bounding box.
These last coordinates correspond to the box refining you are talking about. The implementation of this regression differs between networks (SSD, YOLO, Faster R-CNN, ...).
I encourage you to read the literature, especially the YOLO papers, which are very clear. In "YOLO9000: Better, Faster, Stronger" (available for free online), bounding box refining is explained in great detail on page 3.
Of course all of this is learnt during training; take a look at the loss function of YOLO on page 4 of the "You Only Look Once: Unified, Real-Time Object Detection" paper.
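As a sketch of what that refinement looks like, the YOLO9000 paper has the network predict offsets (tx, ty, tw, th) for an anchor (prior) of size (pw, ph) at grid cell (cx, cy), which are decoded roughly like this:

```python
# YOLOv2-style box decoding: the predicted offsets refine the anchor prior into
# the final box (centre constrained to the predicting cell, size scaled from the prior).
import math

def refine_anchor(tx, ty, tw, th, cx, cy, pw, ph):
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)    # refined centre x, stays inside grid cell cx
    by = cy + sigmoid(ty)    # refined centre y, stays inside grid cell cy
    bw = pw * math.exp(tw)   # refined width, scaled from the anchor width
    bh = ph * math.exp(th)   # refined height, scaled from the anchor height
    return bx, by, bw, bh

# a 3x3-cell anchor prior at grid cell (5, 7) with small predicted offsets:
print(refine_anchor(0.2, -0.1, 0.3, -0.2, cx=5, cy=7, pw=3.0, ph=3.0))
```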
Of course, the anchor boxes are refined during training. It's the only way the network can learn to predict accurate boxes and correct any localization errors. The network learns offsets that refine the anchor box in shape and size.
You can read more about how anchor boxes work here

Coordinates of framed text on an image

I would like to get the coordinates of framed text on an image. The paragraphs have thin black borders. The rest of the image contains ordinary paragraphs and sketches.
Here is an example:
Do you have any idea what kind of algorithm I should use in Python with an image library to achieve this? Thanks.
A few ideas to detect framed text, which largely comes down to searching for boxes/rectangles of substantial size:
find contours with OpenCV and analyze the shapes with the cv2.approxPolyDP() polygon approximation (the Ramer–Douglas–Peucker algorithm); see the sketch after this list. You could additionally check the aspect ratio of the bounding box to make sure the shape is a rectangle, as well as check it against the page width, since this seems to be a known metric in your case. PyImageSearch did this amazing article:
OpenCV shape detection
in a related question, there is also a suggestion to look into Hough lines to detect horizontal lines, then detect vertical lines the same way. Not 100% sure how reliable this approach would be.
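A rough sketch of the contour + cv2.approxPolyDP() idea (the threshold and size limits are assumptions you would need to tune for your scans; OpenCV 4.x signatures):

```python
# Find large, roughly rectangular contours -- candidate frames around paragraphs.
import cv2

img = cv2.imread("page.png")                      # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
frames = []
for c in contours:
    approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
    x, y, w, h = cv2.boundingRect(approx)
    # keep quadrilaterals large enough to plausibly be a framed paragraph
    if len(approx) == 4 and w > 100 and h > 40:
        frames.append((x, y, w, h))

print(frames)  # candidate frame coordinates (x, y, width, height)
```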
Once you find the box frames, the next step would be to check if there is any text inside them. Detecting text is a broader problem in general and there are many ways of doing it, here are a few examples:
apply EAST text detector
PixelLink
tesseract (e.g. via pytesseract), though I'm not sure whether it would produce too many false positives
if it is just a question of whether a box is empty or not, you could check the pixel values inside it, e.g. with cv2.countNonZero() (see the sketch after these examples). Examples:
How to identify empty rectangle using OpenCV
Count the black pixels using OpenCV
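A minimal sketch of that emptiness check (the margin and fill-ratio threshold are assumptions to tune):

```python
# Count "ink" pixels inside a detected frame, skipping the border itself.
import cv2

def box_is_empty(binary_img, x, y, w, h, margin=5, fill_thresh=0.01):
    """binary_img: thresholded image where ink is non-zero; (x, y, w, h): frame rectangle."""
    roi = binary_img[y + margin:y + h - margin, x + margin:x + w - margin]
    if roi.size == 0:
        return True
    fill_ratio = cv2.countNonZero(roi) / float(roi.size)
    return fill_ratio < fill_thresh
```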
Additional references:
ideas on quadrangle/rectangle detection using convolutional neural networks

object detection: is object in the photo, python

I am trying to detect plants in photos. I've already labeled the photos that contain plants (with labelImg), but I don't understand how to train the model with background-only photos, so that when there is no plant the model can tell me so.
Do I need to set a labeled box the size of the whole image?
p.s. new to ml so don't be rude, please)
I recently had a problem where all my training images were zoomed in on the object. This meant that the training images all had very little background information. Since object detection models use space outside bounding boxes as negative examples of these objects, this meant that the model had no background knowledge. So the model knew what objects were, but didn't know what they were not.
So I disagree with @Rika, since sometimes background images are useful. In my case, introducing background images helped.
As I already said, object detection models use the non-labeled space in an image as negative examples of a given object. So you have to save annotation files without bounding boxes for your background images. In the software you use here (labelImg), you can use "Verify Image" so that it saves an annotation file for the image without any boxes. The file says the image should be included in training but contains no bounding box information, and the model uses it as negative examples.
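For reference, a minimal sketch of what such an "empty" annotation boils down to in the Pascal VOC format that labelImg writes (the exact fields labelImg emits may differ slightly; the key point is that there are no <object> entries):

```python
# Write a minimal VOC-style annotation for a background image: header and size,
# but deliberately no <object> elements, so the whole image counts as background.
import xml.etree.ElementTree as ET

def write_background_annotation(image_name, width, height, out_path):
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = image_name
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    ET.ElementTree(ann).write(out_path)

write_background_annotation("background_001.jpg", 1024, 768, "background_001.xml")
```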
In your case, you don't need to do anything special in that regard. Just grab the detection data you created and train your network with it. When it comes to testing, you usually set a confidence threshold on the bounding boxes, because you may get lots of them and only want the ones with the highest confidence.
Then you keep/show the ones with the highest confidence, and there you go: you have your detection result and can do whatever you want with it, like cropping the objects using the bounding box coordinates you get.
If there are no plants, your network will likely produce only boxes with confidence below your threshold, and you simply ignore them.
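That thresholding step is just a filter on the detection scores, for example:

```python
# Keep only detections whose confidence clears the threshold; a photo without
# plants typically yields only low-confidence boxes, so nothing survives.
def filter_detections(boxes, scores, threshold=0.5):
    return [(box, score) for box, score in zip(boxes, scores) if score >= threshold]

print(filter_detections([[10, 20, 50, 80], [30, 40, 90, 120]], [0.12, 0.07]))  # -> []
```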

How are ground truth bounding boxes created for a deep learning training dataset?

I'm working on a project where I'd like to use Mask R-CNN to identify objects in a set of images. But I'm having a hard time understanding how bounding boxes (encoded pixels) are created for the ground truth data. Can anyone point me in the right direction or explain this to me further?
Bounding boxes are typically labeled by hand. Most deep-learning people use a separate application for tagging. I believe this package is popular:
https://github.com/AlexeyAB/Yolo_mark
I developed my own RoR solution for tagging, because it's helpful to distribute the work among several people. The repository is open-source if you want to take a look:
https://github.com/asfarley/imgclass
I think it's a bit misleading to call this 'encoded pixels'. Bounding boxes are a labelled rectangle data-type, which means they are entirely defined by the type (car, bus, truck) and the (x,y) coordinates of the rectangle's corners.
The software for defining bounding-boxes generally consists of an image-display element, plus features to allow the user to drag bounding-boxes on the UI. My application uses a radio-button list to select the object type (car, bus, etc); then the user draws a bounding-box.
The result of completely tagging an image is a text-file, where each row represents a single bounding-box. You should check the library documentation for your training algorithm to understand exactly what format you need to input the bounding boxes.
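For example, if I recall correctly the darknet-style format that Yolo_mark produces has one row per box, "<class_id> <x_center> <y_center> <width> <height>", with all four coordinates normalised to [0, 1] by the image width and height. A small parsing sketch (the class mapping and image size here are hypothetical):

```python
# Parse one darknet-style annotation row and convert it back to pixel corners.
line = "2 0.512 0.430 0.210 0.180"   # hypothetical box, e.g. class 2 = "truck"
class_id, x_c, y_c, w, h = line.split()
x_c, y_c, w, h = float(x_c), float(y_c), float(w), float(h)

img_w, img_h = 1920, 1080            # hypothetical image size
x1, y1 = (x_c - w / 2) * img_w, (y_c - h / 2) * img_h
x2, y2 = (x_c + w / 2) * img_w, (y_c + h / 2) * img_h
print(int(class_id), round(x1), round(y1), round(x2), round(y2))
```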
In my own application, I've developed some features to compare bounding-boxes from different users. In any large ML effort, you will probably encounter some mis-labelled images, so you really need a tool to identify this because it can severely degrade your results.

Ideal images for Viola-Jones Haar cascade in OpenCV

I am trying to detect specific objects in images using Haar cascade in OpenCV.
Let's say I am interested in detecting stop signs in landscape images. When defining positive image samples for my training set, which would be the best kind of image: (a) full images with my object, (b) a medium crop or (c) a tight crop?
Similarly, what's best for negative images? Does this influence overfitting? I would also appreciate any other general tips from those with experience. Thanks.
Image ref: http://kaitou-ace.deviantart.com/art/Stop-sign-on-a-country-road-Michigan-271990933
Your positive samples should contain only the features you want to detect, so image (c) would be correct for positive samples.
As for negative samples, you want EVERYTHING else. That is obviously unrealistic, but if you are using your detector in a specific environment, then training on that environment as negatives is the right way to go, i.e. lots of pictures of landscapes etc. (ones that don't have stop signs in them).
The best choice is (c) because (a) and (b) contain too many features, all around the border of the sign, that are not interesting for you.
Not only are they not useful, they can seriously compromise the performance of the algorithm.
In case (c), the detector's aim is simply to recognize when the current window contains the features you are looking for.
But what about (a) and (b)?
In those cases the algorithm has to detect the interesting features in just one part of the window (and unfortunately that part could be anywhere), while at the same time staying consistent with all the infinite possibilities of what could appear around it.
You would need a huge number of samples, and even if you finally managed to get an acceptable hit rate, the job of separating positives from negatives would be so difficult that the running time would be very high.
As for collecting negatives, ideally you should pick images that reproduce the kind of images your final detector will run against.
For example, if you think indoor images are not relevant, just discard them. If you think your detector will run on a certain kind of landscape, keep plenty of those.
But this is only theoretical; I feel the improvement would be negligible. Just collect as many images as you can: it is the number of different images that really matters.
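To make the window framing concrete, here is a minimal usage sketch (the cascade and image file names are placeholders): at detection time OpenCV slides a window over the whole landscape image, so each evaluated window should look like your tight-crop (c) positives.

```python
# Run a trained Haar cascade over a full landscape image.
import cv2

cascade = cv2.CascadeClassifier("stop_sign_cascade.xml")  # placeholder: your trained cascade
img = cv2.imread("country_road.jpg")                      # placeholder: a test image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

signs = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(24, 24))
for (x, y, w, h) in signs:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```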
