I'm practicing with computer vision in general and specifically with the TensorFlow object detection API, and there are a few things I don't really understand yet.
I'm trying to re-train an SSD model to detect one class of custom objects (guitars).
I've been using ssd_mobilenet_v1_coco and ssd_mobilenet_v2_coco models, with a dataset of 1000K pre-labeled images downloaded from the OpenImage dataset. I used the standard configuration file, only modifying the necessary parts.
I'm getting slightly unsatisfactory detections on small objects, which is supposedly normal when using SSD models. Here on Stack Overflow I saw people suggesting cropping the image into smaller frames, but I'm having trouble understanding a few things:
According to the .config file and the SSD papers, images are resized to a fixed dimension of 300x300 pixels (I'm assuming this holds both when training the model and when using it for inference). So I guess the original size of training and test/evaluation images doesn't matter, because they're always resized to 300x300 anyway? Then I don't understand why many people suggest using images of the same size as the ones the model has been trained on... does it matter or not?
It's not really clear to me what "small objects" means in the first place.
Does it refer to the size ratio between the object and the whole image? So a small object would be one that covers, say, less than 5% of the total image?
Or does it refer to the number of pixels forming the object?
In the first case, cropping the image around the object would make sense. In the second case, it shouldn't work, because the number of useful pixels identifying the object stays the same.
Thanks!
I am not sure about the answer I am giving below, but it worked for me. As you correctly said, images are resized to 300x300 in the config file of ssd_mobilenet_v2. This resizing compresses the image to 300x300 and loses important features, which adversely affects objects that are small in size, since they have the most to lose. Depending on the GPU power you have, you can make some changes in the config file:
1st - change the image_resizer block as follows:
image_resizer {
  fixed_shape_resizer {
    height: 600
    width: 600
  }
}
thus now giving the model double the input resolution (in the config file).
2nd - the above change may throw your GPU out of memory, so you need to reduce the batch size from 24 to 12 or 8, which can lead to overfitting, so check the regularization parameters too.
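For reference, this is roughly what that change looks like in the pipeline config (8 is just one of the suggested values; the rest of your train_config stays as it is):

train_config {
  # reduced from the default 24 so that the larger 600x600 inputs fit in GPU memory;
  # 12 may also work depending on your card
  batch_size: 8
}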
3rd - an optional method is to comment out the following:
(screenshot from the original post, showing the data augmentation options in the config file)
This helps a lot and reduces the training time by almost half. The trade-off is that if a test image is not oriented the same way as your training data, the model's confidence will drop and it may completely fail to recognize, say, an inverted cat.
I do not see why one would get better results by keeping the image size the SSD model was trained on. SSD detectors are fully convolutional, and convolutions are not concerned with image sizes.
'Small objects' refers to the number of pixels containing information about the object. Here is how it makes sense to crop images to improve performance on small objects: the TensorFlow Object Detection API performs data augmentation before resizing images (check the inputs.transform_input_data docstrings), so cropping and then resizing the cropped image preserves more information than resizing the full image, because the downsizing factor is smaller for the crop than for the full image.
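A minimal sketch of that downsizing argument, using a synthetic image and a made-up box so the numbers are easy to follow:

import cv2
import numpy as np

# Synthetic stand-in for a 1200x1200 training photo with a small 150x150 object.
image = np.zeros((1200, 1200, 3), dtype=np.uint8)
x, y, w, h = 500, 400, 150, 150          # hypothetical box around the small object
cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), -1)

# Resizing the whole image to 300x300 shrinks everything by a factor of 4,
# so the 150x150 object ends up covering only ~37x37 pixels.
full_resized = cv2.resize(image, (300, 300), interpolation=cv2.INTER_AREA)

# Cropping a 600x600 window around the object first and then resizing to 300x300
# only halves it, so the object keeps ~75x75 pixels of information.
crop = image[y - 225:y + h + 225, x - 225:x + w + 225]
crop_resized = cv2.resize(crop, (300, 300), interpolation=cv2.INTER_AREA)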
I made a tif image based on a 3D model of a wood sheet. (x, y, z) represents a point in 3D space; I simply map (x, y) to a pixel position in the image and z to the greyscale value of that pixel. It worked as I had imagined. Then I ran into a low-resolution problem when I tried to print it: the tif image gets badly pixelated as soon as it is zoomed out. My research suggests that I need to increase the resolution of the image, so I tried a few super-resolution algorithms found online, including this one: https://learnopencv.com/super-resolution-in-opencv/
The final image did get much larger in resolution (10+ times larger in either dimension), but the same problem persists: it gets pixelated as soon as it is zoomed out, just about the same as the original image.
It looks like the quality of an image depends not only on its resolution but also on something else. By quality I mean how clear the wood texture is in the image, and how sharp/clear the texture remains when I enlarge it. Can anyone shed some light on this? Thank you.
(Attachments from the original post: the original tif and a Gigapixel-enhanced tif; the algorithm-generated tif is too large to include here at 32 MB.)
Update: here is a recently achieved result with a GAN-based solution. It has restored/invented some of the wood grain details, but the models need to be retrained.
In short, it is possible to do this via deep learning reconstruction like the Super Resolution package you referred to, but you should understand what something like this is trying to do and whether it is fit for purpose.
Generic algorithms like that Super Resolution model are trained on a variety of images to "guess" at details that are not present in the original image, typically using generative training methods such as pairing low- and high-resolution versions of the same image as training data.
Using a contrived example, let's say you are trying to up-res a picture of someone's face (CSI zoom-and-enhance style!). From the algorithm's perspective, if a black circle is always present inside a white blob of a certain shape (i.e. a pupil in an eye), then the next time the algorithm sees the same shape it will guess that there should be a black circle and fill in a black pupil. However, this does not mean that there are details in the original photo that suggest a black pupil.
In your case, you are trying to do a very specific type of up-resing, and algorithms trained on generic data will probably not be good for this type of work. They will be trying to "guess" what detail should be added, but based on a very generic and diverse set of source data.
If this is a long-term project, you should look at training your algorithm on your specific use case, which will definitely yield much better results. Otherwise, simple algorithms like smoothing will help make your image less "blocky", but they will not be able to "guess" details that aren't present.
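For completeness, this is roughly how the OpenCV super-resolution module from the linked tutorial is used; the model file, scale factor, and file names here are assumptions, and as argued above a generic pretrained model may not recover wood-grain detail:

import cv2

# The dnn_superres module requires opencv-contrib-python and a pretrained model
# file (here an assumed EDSR x4 model, downloaded separately as in the tutorial).
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("EDSR_x4.pb")            # assumed model path
sr.setModel("edsr", 4)

image = cv2.imread("woodsheet.tif")   # assumed input file
result = sr.upsample(image)           # 4x more pixels, but only "guessed" detail
cv2.imwrite("woodsheet_x4.tif", result)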
Let's assume I have a small dataset. I want to implement data augmentation. First I implement image segmentation (after this, the image will be a binary image) and then implement data augmentation. Is this a good way?
For image augmentation in segmentation and instance segmentation, you either have to leave the positions of the objects in the image unchanged (by manipulating colors, for example), or modify these positions by applying translations and rotations.
So yes, this way works, but you have to take into consideration the type of data you have and what you are looking to achieve. Data augmentation isn't a ready-to-go process with good results everywhere.
In case you have:
Semantic segmentation: each pixel of your image at row i and column j is labeled with its enclosing object. This means having your main image I and a label image L of the same size linking every pixel to its object label. In this case, your data augmentation is applied to both I and L, giving a pair of identically transformed images (see the sketch after this list).
Instance segmentation: here we generate a mask for every instance of the original image, and the augmentation is applied to all of them including the original; then from these transformed masks we get our new instances.
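A minimal sketch of the semantic-segmentation case with synthetic data, just to show that geometric transforms hit I and L together while color transforms hit only I:

import cv2
import numpy as np

# Synthetic stand-ins for the image I and its label mask L (same size).
I = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
L = np.zeros((256, 256), dtype=np.uint8)
L[60:140, 80:180] = 1                      # hypothetical object label

# Geometric augmentation: the exact same rotation is applied to I and L.
h, w = I.shape[:2]
M = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 1.0)
I_aug = cv2.warpAffine(I, M, (w, h))
L_aug = cv2.warpAffine(L, M, (w, h), flags=cv2.INTER_NEAREST)  # keep labels discrete

# Color augmentation: applied to I only, L is left untouched.
I_jitter = cv2.convertScaleAbs(I, alpha=1.2, beta=10)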
EDIT:
Take a look at CLoDSA (Classification, Localization, Detection and Segmentation Augmentor) it may help you implement your idea.
In case your dataset is small, you should add data augmentation during training. It is important to change the original image and the targets (masks) in the same way!
For example, if an image is rotated 90 degrees, then its mask should also be rotated 90 degrees. Since you are using the Keras library, you should check whether ImageDataGenerator also changes the target images (masks) along with the inputs. If it doesn't, you can implement the augmentations yourself. This repository shows how it is done in OpenCV:
https://github.com/kochlisGit/random-data-augmentations
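A rough sketch of the usual ImageDataGenerator trick: two generators with identical arguments and an identical seed, one for images and one for masks (the arrays here are placeholders for your real data):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder arrays standing in for your real images and masks.
x_train = np.random.rand(16, 128, 128, 3).astype("float32")
y_train = np.random.randint(0, 2, (16, 128, 128, 1)).astype("float32")

# Same augmentation arguments and same seed, so both generators apply the same
# random transform to an image and to its mask.
aug_args = dict(rotation_range=90, horizontal_flip=True, vertical_flip=True)
image_flow = ImageDataGenerator(**aug_args).flow(x_train, batch_size=8, seed=1)
mask_flow = ImageDataGenerator(**aug_args).flow(y_train, batch_size=8, seed=1)

train_flow = zip(image_flow, mask_flow)
# model.fit(train_flow, steps_per_epoch=len(x_train) // 8, epochs=10)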
I am trying to detect plants in photos. I've already labeled the photos containing plants (with labelImg), but I don't understand how to train the model with background-only photos, so that when there is no plant, the model can tell me so.
Do I need to set a labeled box the size of the whole image?
P.S. New to ML, so don't be rude, please :)
I recently had a problem where all my training images were zoomed in on the object, which meant the training images had very little background information. Since object detection models use the space outside bounding boxes as negative examples of these objects, the model had no background knowledge: it knew what objects were, but didn't know what they were not.
So I disagree with @Rika, since sometimes background images are useful. In my case, introducing background images worked.
As I already said, object detection models use non-labeled space in an image as negative examples of a given object, so you have to save annotation files without bounding boxes for your background images. In the software you use here (labelImg), you can use Verify Image to make it save an annotation file for the image without boxes. The file then says the image should be included in training but contains no bounding box information, and the model uses it as a negative example.
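If your tool doesn't produce such a file for you, here is a rough sketch of writing a Pascal VOC style annotation with no object entries for a background image (file names and sizes here are made up):

from pathlib import Path

def write_background_annotation(image_name, width, height, out_dir="annotations"):
    # A Pascal VOC style annotation with no <object> entries: the image is part
    # of the training set but contributes only negative (background) regions.
    xml = (
        "<annotation>\n"
        f"  <filename>{image_name}</filename>\n"
        "  <size>\n"
        f"    <width>{width}</width>\n"
        f"    <height>{height}</height>\n"
        "    <depth>3</depth>\n"
        "  </size>\n"
        "</annotation>\n"
    )
    Path(out_dir).mkdir(exist_ok=True)
    Path(out_dir, Path(image_name).stem + ".xml").write_text(xml)

write_background_annotation("background_001.jpg", 1280, 720)   # hypothetical image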
In your case, you don't need to do anything in that regard. Just grab the detection data that you created and train your network with it. When it comes to testing, you usually set a confidence threshold for bounding boxes, because you may get lots of them and you only want the ones with the highest confidence.
Then you get/show the ones with the highest bbox confidences, and there you go: you get your detection result and you can do whatever you want, like cropping the objects using the bounding box coordinates you get.
If there are no plants, your network will likely create bboxes with a confidence below your threshold, and you just ignore them.
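A minimal sketch of that thresholding step, with placeholder detections laid out the way the TF Object Detection API usually returns them (the 0.5 threshold is an assumption):

import numpy as np

# Placeholder detections: one score per box, boxes as [ymin, xmin, ymax, xmax]
# in relative coordinates.
detection_scores = np.array([0.91, 0.42, 0.07])
detection_boxes = np.array([[0.10, 0.20, 0.60, 0.70],
                            [0.05, 0.05, 0.30, 0.30],
                            [0.70, 0.70, 0.95, 0.95]])

threshold = 0.5                              # assumed confidence threshold
keep = detection_scores >= threshold
plants = detection_boxes[keep]

if len(plants) == 0:
    print("no plant in this image")          # background-only photo
else:
    print(len(plants), "plant(s) found:", plants)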
I am using cv2 to resize various images with different dimensions (e.g. 70x300, 800x500, 60x50) to a specific dimension of 200x200 pixels. Later, I feed the pictures to a CNN algorithm to classify the images (my understanding is that pictures must be the same size when fed into a CNN).
My questions:
1 - How are low-resolution pictures converted into higher-resolution ones, and how are high-resolution pictures converted into lower-resolution ones? Will this affect the information stored in the pictures?
2 - Is it good practice to use this approach with a CNN? Or is it better to pad zeros at the end of the image to get the desired resolution? I have seen many researchers pad the end of a file with zeros when trying to detect malware files, to give all files a common dimension. Does this mean that padding is more accurate than resizing?
Using interpolation. https://chadrick-kwag.net/cv2-resize-interpolation-methods/
Definitely, resizing is a lossy process and you'll lose information.
Both are okay and are used depending on the needs. Resizing is equally applicable: if your CNN can't cope with resized images as well as with the originals, it must be badly overfitted. Resizing also acts as a very light regularization; it's even advisable to apply more augmentation schemes to the images before CNN training.
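To make the interpolation point above concrete, a minimal sketch of the two directions with cv2.resize (the input arrays are synthetic stand-ins for the sizes mentioned in the question):

import cv2
import numpy as np

big = np.random.randint(0, 255, (500, 800, 3), dtype=np.uint8)    # 800x500 image
small = np.random.randint(0, 255, (50, 60, 3), dtype=np.uint8)    # 60x50 image

# Shrinking: INTER_AREA averages groups of source pixels, discarding detail
# as gracefully as possible.
down = cv2.resize(big, (200, 200), interpolation=cv2.INTER_AREA)

# Enlarging: INTER_CUBIC (or INTER_LINEAR) interpolates new pixels between
# existing ones -- no new information is created, the result is just smoother.
up = cv2.resize(small, (200, 200), interpolation=cv2.INTER_CUBIC)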
I want to make a CNN or FCN that takes grayscale images as input and outputs a color image. It is very important to me that the size of the images can vary. I heard that I can only do this if I build an FCN and feed it one batch with images of one size and another batch with images of another size. But I don't know how to implement this concept in TensorFlow Keras (the Python version), and I was wondering if you could provide some sample code or pseudocode? I'd appreciate that. Thanks!
I know you want to keep them all at their original size, but that's not possible. Don't worry, though: the resizing can take place while the images are being fed into the model (in memory), so the files on disk are never touched except to be read.
Here's a great example that I frequently reference!
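Separately from the linked example, here is a minimal sketch of resizing on the fly in a tf.data pipeline, where the files on disk are only read (the file pattern and target size are assumptions):

import tensorflow as tf

TARGET = (256, 256)   # assumed training size, pick whatever fits your model

def load_pair(path):
    # Decode a color image, resize it in memory while it is being fed to the
    # model, and derive the grayscale input from it.
    img = tf.io.decode_image(tf.io.read_file(path), channels=3,
                             expand_animations=False)
    img = tf.image.resize(img, TARGET) / 255.0
    gray = tf.image.rgb_to_grayscale(img)
    return gray, img      # (grayscale input, color target)

files = tf.data.Dataset.list_files("images/*.jpg")   # assumed file pattern
dataset = files.map(load_pair, num_parallel_calls=tf.data.AUTOTUNE).batch(8)
# model.fit(dataset, epochs=10)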