I am trying to detect plants in photos. I've already labeled the photos that contain plants (with labelImg), but I don't understand how to train the model with background-only photos, so that when there is no plant in a photo the model can tell me so.
Do I need to set the labeled box to the size of the whole image?
P.S. I'm new to ML, so please don't be rude :)
I recently had a problem where all my training images were zoomed in on the object. This meant that the training images had very little background information. Since object detection models use the space outside the bounding boxes as negative examples of those objects, my model had no background knowledge: it knew what the objects were, but not what they were not.
So I disagree with @Rika, since sometimes background images are useful. In my case, introducing background images worked.
As I already said, object detection models use the non-labeled space in an image as negative examples of a given object. So you have to save annotation files without bounding boxes for your background images. In the software you are using (labelImg), you can use "Verify Image" to save an annotation file for an image without any boxes. That file says the image should be included in training but contains no bounding-box information, and the model uses it as negative examples.
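For reference, here is a minimal sketch (not tied to any particular training framework) of what such an "empty" annotation amounts to: a Pascal VOC-style XML file that records the image but contains no object entries. The file names and field values are illustrative.

```python
# Hedged sketch: write a minimal Pascal VOC-style annotation with no <object>
# entries for a background image, similar in spirit to what labelImg's
# "Verify Image" produces. Adjust names/paths to your own dataset.
import xml.etree.ElementTree as ET

def write_background_annotation(image_name, width, height, out_path):
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = image_name
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    # Deliberately no <object> elements: every region of this image
    # becomes a negative example during training.
    ET.ElementTree(ann).write(out_path)

write_background_annotation("background_001.jpg", 1920, 1080, "background_001.xml")
```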
In your case, you don't need to do anything special in that regard. Just take the detection data you created and train your network on it. At test time, you usually set a confidence threshold on the predicted bounding boxes, because you may get lots of them and you only want the ones with the highest confidence.
Then you take/show the ones with the highest confidence, and there you go: you have your detection result and can do whatever you want with it, such as cropping the detections using the bounding-box coordinates you get.
If there are no plants, your network will most likely still produce boxes, but with confidence below your threshold, and you simply ignore them.
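As an illustration, here is a minimal sketch of that threshold-and-crop step. It assumes detections arrive as (x1, y1, x2, y2, score) tuples in pixel coordinates; the names and dummy values are illustrative, not any specific framework's API.

```python
# Minimal sketch: keep only confident detections and crop them out.
import numpy as np

CONF_THRESHOLD = 0.5  # tune this on a validation set

def keep_and_crop(image, detections, threshold=CONF_THRESHOLD):
    crops = []
    for (x1, y1, x2, y2, score) in detections:
        if score < threshold:
            continue  # low-confidence box -> treat as "no plant here"
        crops.append(image[int(y1):int(y2), int(x1):int(x2)])
    return crops

image = np.zeros((720, 1280, 3), dtype=np.uint8)                   # stand-in for a real photo
detections = [(120, 80, 400, 360, 0.91), (10, 10, 50, 50, 0.12)]   # dummy detector output
plant_crops = keep_and_crop(image, detections)
print(f"{len(plant_crops)} detection(s) above threshold")
```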
I made a tif image based on a 3D model of a wood sheet. (x, y, z) represents a point in 3D space; I simply map (x, y) to a pixel position in the image and z to the greyscale value of that pixel. It worked as I had imagined, but then I ran into a low-resolution problem when I tried to print it: the tif image gets badly pixelated as soon as it is zoomed out. My research suggests that I need to increase the resolution of the image, so I tried a few super-resolution algorithms found in online sources, including this one: https://learnopencv.com/super-resolution-in-opencv/
The final image did get a lot bigger in resolution (10+ times larger in either dimension), but the same problem persists: it gets pixelated as soon as it is zoomed out, about the same as the original image.
It looks like the quality of an image depends not only on its resolution but also on something else. By quality I mean how clear the wood texture is in the image, and how sharp/clear the texture remains when I enlarge it. Can anyone shed some light on this? Thank you.
original tif
The algorithm-generated tif is too large to include here (32 MB).
Gigapixel enhanced tif
Update: here is a result recently achieved with a GAN-based solution.
It has restored/invented some of the wood-grain detail, but the models need to be retrained.
In short, it is possible to do this via deep-learning reconstruction like the Super Resolution package you referred to, but you should understand what such a method is trying to do and whether it is fit for purpose.
Generic algorithms like that Super Resolution model are trained on a variety of images to "guess" at details that are not present in the original image, typically using generative training methods such as pairing the low- and high-resolution versions of the same image as training data.
As a contrived example, say you are trying to up-res a picture of someone's face (CSI zoom-and-enhance style!). From the algorithm's perspective, if a black circle is always present inside a white blob of a certain shape (i.e. a pupil in an eye), then the next time the algorithm sees the same shape it will guess that there should be a black circle and fill in a black pupil. However, this does not mean there are details in the original photo that suggest a black pupil.
In your case, you are trying to do a very specific type of up-resing, and algorithms trained on generic data will probably not be good at this type of work. They will try to "guess" what detail should be filled in, but based on a very generic and diverse set of source data.
If this is a long-term project, you should look into training the algorithm on your specific use case, which should yield much better results. Otherwise, simple techniques like smoothing will make your image less "blocky", but they will not be able to "guess" details that aren't present.
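For context, this is roughly how the generic OpenCV route from the linked article is used (the cv2.dnn_superres module, which ships with opencv-contrib-python). The EDSR weights file is downloaded separately, and the paths here are placeholders.

```python
# Sketch of generic single-image super-resolution with OpenCV's dnn_superres.
# The model is trained on generic photos, so it "guesses" detail that may not
# match wood grain; paths/filenames below are illustrative.
import cv2

sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x4.pb")        # pretrained generic weights, downloaded separately
sr.setModel("edsr", 4)            # algorithm name and scale must match the weights
low_res = cv2.imread("woodsheet.tif")
high_res = sr.upsample(low_res)   # larger image, but detail is inferred, not recovered
cv2.imwrite("woodsheet_x4.tif", high_res)
```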
I use the OD-API to train models. I have two questions, please, about the way background images and images that have the same object labeled twice (or more) with different label names are processed, when using faster_rcnn_resnet101 and SSD_mobilenet_v2.
1- When an image has no ground-truth boxes (a background image), do we still generate anchor boxes for it with Faster R-CNN (or default boxes for SSD) even though we have no GT boxes? Or will the whole image in such a case be a negative example?
2- When an image has two (or more) GT boxes that have the same coordinates but different label names, does this cause issues when matching with anchor boxes (or default boxes for SSD), e.g. only one of the GT boxes gets matched?
I would be glad for any help. I tried reading papers, tutorials, and books but couldn't find answers, or maybe I am missing something.
Regarding question 2, Prof. Andrew Ng says at 6:55 of this video about anchor boxes in YOLO that such cases, where we have multiple objects in the same grid cell, can't be handled well. So maybe the same applies to my case, even though I don't know what the result would be.
I also think the files target_assigner.py and argmax_matcher.py hold some clues, but I can't really confirm.
Thank you in advance
1) Anchor boxes are independent of the ground truth boxes and are generated based on the image shape (and the anchor configuration). The targets are what is generated, based on the GT boxes and generated anchors, to train the bounding box regression head. If there are no ground truth boxes, no targets are generated and the whole image is used as negative samples for the classification head, while not affecting the regression head (it only trains on positive samples).
2) I am not 100% sure on this one, but as far as I can tell, the bounding box regression won't have a problem (if the bounding boxes are identical, the IoU with anchors is identical and the target assigner will just pick one of the two), but classification might. IIRC there are ways to enable multi-label classification (although I have no experience in it), so that may help you out a bit. The best solution, though, would be not to have objects annotated multiple times.
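For intuition, here is a rough, framework-agnostic sketch of the argmax-style matching that target_assigner.py and argmax_matcher.py implement; the thresholds and box format are illustrative, not the exact OD-API defaults.

```python
# Conceptual sketch of argmax anchor matching with IoU thresholds.
import numpy as np

def iou(anchors, gt_boxes):
    # anchors: (A, 4), gt_boxes: (G, 4), boxes as [ymin, xmin, ymax, xmax]
    ymin = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    xmin = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    ymax = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    xmax = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(ymax - ymin, 0, None) * np.clip(xmax - xmin, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def assign_targets(anchors, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    if len(gt_boxes) == 0:
        # Background image: every anchor is a negative for the classifier,
        # and the box-regression head receives no targets at all.
        return np.full(len(anchors), -1, dtype=int)
    overlaps = iou(anchors, gt_boxes)   # (A, G)
    best_gt = overlaps.argmax(axis=1)   # argmax: two identical GT boxes tie,
    best_iou = overlaps.max(axis=1)     # so only one index is ever kept
    matches = np.where(best_iou >= pos_thresh, best_gt, -1)          # -1 = negative
    matches[(best_iou > neg_thresh) & (best_iou < pos_thresh)] = -2  # -2 = ignored
    return matches
```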
I'm working on a project where I'd like to use Mask R-CNN to identify objects in a set of images, but I'm having a hard time understanding how bounding boxes (encoded pixels) are created for the ground-truth data. Can anyone point me in the right direction or explain this to me further?
Bounding boxes are typically labeled by hand. Most deep-learning people use a separate application for tagging. I believe this package is popular:
https://github.com/AlexeyAB/Yolo_mark
I developed my own RoR solution for tagging, because it's helpful to distribute the work among several people. The repository is open-source if you want to take a look:
https://github.com/asfarley/imgclass
I think it's a bit misleading to call this 'encoded pixels'. A bounding box is a labelled-rectangle data type, which means it is entirely defined by its class (car, bus, truck) and the (x, y) coordinates of the rectangle's corners.
The software for defining bounding-boxes generally consists of an image-display element, plus features to allow the user to drag bounding-boxes on the UI. My application uses a radio-button list to select the object type (car, bus, etc); then the user draws a bounding-box.
The result of completely tagging an image is a text file where each row represents a single bounding box. You should check the documentation of your training library to understand exactly what format it expects the bounding boxes in.
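As an example of one such row-per-box format, here is a small parser for YOLO-style annotation lines ("class_id x_center y_center width height", all normalized to [0, 1]); other libraries expect Pascal VOC XML or COCO JSON instead, so treat this purely as an illustration.

```python
# Illustrative parser for a YOLO-style annotation file (one row per box).
def parse_yolo_annotation(path):
    boxes = []
    with open(path) as f:
        for line in f:
            class_id, xc, yc, w, h = line.split()
            boxes.append({
                "class_id": int(class_id),
                "x_center": float(xc),   # normalized to image width
                "y_center": float(yc),   # normalized to image height
                "width": float(w),
                "height": float(h),
            })
    return boxes

# Example file contents (one row per bounding box):
#   0 0.512 0.430 0.210 0.350
#   2 0.118 0.760 0.090 0.140
```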
In my own application, I've developed some features to compare bounding boxes from different users. In any large ML effort you will probably encounter some mislabelled images, so you really need a tool to identify them, because they can severely degrade your results.
I'm doing object detection for text in images and want to use YOLO to draw a bounding box where the text is in the image.
How do you do data augmentation in that case? Also, how does it differ from augmentation in ordinary image recognition (contrast adjustment, gamma conversion, smoothing, noise, inversion, scaling, etc.)?
If you have any useful website links, please share them :)
If you are asking what you should use: it is just a regular object detection task, so the common augmentations, like flips or crops, work fine.
If you are asking what the output images will look like, have a look at this repo: https://github.com/albumentations-team/albumentations
But if you are asking about the difference in model performance, there is probably no general answer; you can only try several approaches and see which works best.
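For a concrete starting point, here is a minimal sketch using albumentations with bbox_params, which keeps the text bounding boxes consistent with the augmented pixels. The transforms, box values, and label names are placeholders just to show the mechanics.

```python
# Sketch: bounding-box-aware augmentation for a text-detection dataset.
import albumentations as A
import numpy as np

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),             # mirrored text may or may not suit your task
        A.RandomBrightnessContrast(p=0.5),
        A.GaussNoise(p=0.3),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = np.zeros((640, 640, 3), dtype=np.uint8)   # stand-in image
bboxes = [(0.5, 0.5, 0.2, 0.1)]                   # one text box, YOLO format
out = transform(image=image, bboxes=bboxes, class_labels=["text"])
print(out["bboxes"])                              # boxes are updated along with the pixels
```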
I am working on machine learning for image classification and have managed to complete several projects successfully. All of those projects used images that always belonged to one class. Now I want to try images with multiple labels on each image. I read that I have to draw boxes (bounding boxes) around objects for training.
My questions are:
Do I have to crop those areas into single images and use them as before for training?
Are the drawn boxes only used for cropping?
Or do we really feed the original images and the box coordinates (top-left [x, y], width, and height) into training?
Any tutorials or materials related to this are appreciated.
Basically, you need to detect various objects in an image that belong to different classes. This is where object detection comes into the picture.
Object detection tries to classify labels for the various objects in an image and also predict their bounding boxes.
There are many algorithms for object detection. If you are a seasoned TensorFlow user, you can directly use the TensorFlow Object Detection API. You can select the architecture you need and feed the annotations along with the images.
To annotate the images (drawing bounding boxes around the objects and storing them separately), you can use the labelImg tool.
You can refer to these blogs:
Creating your own object detector
A Step-by-Step Introduction to the Basic Object Detection Algorithms
Instead of training a whole new object detector, you can use a pretrained one. The pretrained TensorFlow Object Detection models can detect 80 common object classes. If the objects you need to classify are among these, you get a ready-to-use model that draws a bounding box around the object of your interest.
You can crop this part of the image and build a classifier on it, according to your needs.
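To make that last step concrete, here is a small sketch of cropping detections above a confidence threshold. It assumes the detector returns normalized [ymin, xmin, ymax, xmax] boxes with scores, as TensorFlow Object Detection API models do; the image and boxes below are dummy values.

```python
# Sketch: crop confident detections so they can be fed to a separate classifier.
import numpy as np

def crop_detections(image, boxes, scores, score_threshold=0.5):
    h, w = image.shape[:2]
    crops = []
    for box, score in zip(boxes, scores):
        if score < score_threshold:
            continue
        ymin, xmin, ymax, xmax = box   # normalized coordinates
        crops.append(image[int(ymin * h):int(ymax * h), int(xmin * w):int(xmax * w)])
    return crops

image = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in for a real photo
boxes = np.array([[0.10, 0.20, 0.60, 0.55]])       # dummy detector output
scores = np.array([0.87])
crops = crop_detections(image, boxes, scores)      # feed these crops to your classifier
```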