I'm creating an encoder-decoder CNN for some images. Each image has a geometric shape around the center - circle, ellipse, etc.
I want my CNN to ignore all the values that are in this shape. All my input data values have been normalized to be around 0-1. I set all the shape values to be 0.
I thought that setting them to zero would mean that they would not be updated; however, the output of my encoder-decoder CNN still changes the values inside the shape.
What can I do to ensure these values stay put and do not update?
Thank you!
I think you are looking for "partial convolution". This is work published by Guilin Liu and colleagues that extends convolution to take an input mask as well as an input feature map and applies the convolution only to the unmasked pixels. They also suggest how to compensate for pixels on the boundary of the mask, where the kernel "sees" both valid and masked-out pixels.
Please note that their implementation may have issues running with automatic mixed precision (AMP).
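To see the core idea, here is a minimal NumPy sketch of one output pixel of a partial convolution. This is a simplified illustration of the concept, not the authors' implementation: the convolution is computed over valid pixels only and re-scaled by the fraction of valid pixels under the kernel.

```python
import numpy as np

def partial_conv_at(patch, mask_patch, kernel):
    """One output pixel of a partial convolution.
    patch, mask_patch, kernel have the same shape; mask is 1 = valid, 0 = hole."""
    valid = mask_patch.sum()
    if valid == 0:
        return 0.0, 0                      # kernel sees only holes: output 0, mask stays 0
    scale = kernel.size / valid            # compensate for the missing pixels
    out = (patch * mask_patch * kernel).sum() * scale
    return out, 1                          # this output pixel is now valid

patch = np.ones((3, 3))                    # constant image region
mask = np.ones((3, 3))
mask[0, :] = 0                             # top row masked out
kernel = np.full((3, 3), 1.0 / 9)          # averaging kernel

out, new_mask = partial_conv_at(patch, mask, kernel)
# the re-scaling recovers the average of the valid pixels: out is ~1.0
```

Note how the re-scaling keeps the output magnitude consistent near the mask boundary; without it, outputs there would be systematically darker.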
I am trying to feed small patches of satellite image data (landsat-8 Surface Reflectance Bands) into neural networks for my project. However the downloaded image values range from 1 to 65535.
So I tried dividing the images by 65535 (the max value), but plotting them shows an almost all-black/brown image like this!
But most of the images do not have values anywhere near 65535.
Without any normalization the image looks all white.
Dividing the image with 30k looks like this.
If the images are too dark or too light my network may not perform as intended.
My question: Is dividing the image by the maximum possible value (65535) the only solution, or are there other ways to normalize images, especially satellite data?
Please help me with this.
To answer your question, though: there are other ways to normalize images. Standardization is the most common one (subtract the mean and divide by the standard deviation).
Using numpy...
import numpy as np

image = (image - np.mean(image)) / np.std(image)
As I mentioned in a clarifying comment, you want the normalization method to match the one used for the network's training set.
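Beyond global standardization, a common trick for satellite imagery with a few extreme pixels is percentile clipping before scaling, so outliers near 65535 don't crush the rest of the range into darkness. A NumPy sketch, assuming the typical 2nd/98th percentile choice:

```python
import numpy as np

def percentile_normalize(image, lo=2, hi=98):
    """Clip to the [lo, hi] percentile range, then scale to [0, 1]."""
    p_lo, p_hi = np.percentile(image, [lo, hi])
    image = np.clip(image, p_lo, p_hi)
    return (image - p_lo) / (p_hi - p_lo)

band = np.array([100.0, 200.0, 300.0, 65535.0])  # one outlier pixel
scaled = percentile_normalize(band)
# the three typical pixels now span most of [0, 1] instead of sitting near 0
```

For multi-band data such as Landsat-8, it usually makes sense to compute the percentiles per band (or over the whole training set per band) rather than per image.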
I'm building a CNN to identify facial keypoints. I want to make the net more robust, so I thought about applying some zoom-out transforms, because most pictures have roughly the same keypoint locations, so the net doesn't learn much.
My approach:
I want augmented images to keep the original image size, so I apply MaxPool2d and then random (unequal) padding until the original size is reached.
First question:
Is it going to work with simple average padding or zero padding? I'm sure it would be even better if I made the padding look more like a background, but is there a simple way to do that?
Second question:
The keypoints are the target vector; they come as a row vector of 30 values. I'm getting confused by the logic needed to transform them into the smaller space.
Generally, if an original point was at (x=5, y=7), it transforms to (x=2, y=3). I'm not sure about that, but so far I've checked manually and it's correct. But what do I do if two keypoints end up in the same new pixel? I can't feed the network fewer target values.
That's it. I'd be happy to hear your thoughts.
I suggest using torchvision.transforms.RandomResizedCrop as part of your Compose statement, which will give you random zooms AND resize the resulting images to some standard size. This avoids issues in both your questions.
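For the second question's coordinate bookkeeping: however the zoom is done, the keypoints just get rescaled by the same factors applied to the image, and keeping float coordinates sidesteps the "two keypoints in one pixel" worry, because the target vector keeps its length. A NumPy sketch (the sizes are assumptions for illustration):

```python
import numpy as np

def rescale_keypoints(keypoints, orig_size, new_size):
    """Map (x, y) keypoints from orig_size (W, H) to new_size (W, H).
    Returns float coordinates, so distinct points never collapse."""
    kp = np.asarray(keypoints, dtype=float).reshape(-1, 2)
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    return kp * np.array([sx, sy])

# the question's example point, under an assumed 2x downscale
pts = rescale_keypoints([(5, 7)], orig_size=(10, 10), new_size=(5, 5))
# -> [[2.5, 3.5]]: (5, 7) lands between pixels, but stays a distinct target
```

The network regresses these float values directly; rounding to integer pixels is only needed for visualization.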
I am using cv2 to resize various images with different dimensions (e.g. 70*300, 800*500, 60*50) to a specific dimension of 200*200 pixels. Later, I feed the pictures to a CNN algorithm to classify the images. (My understanding is that pictures must have the same size when fed into a CNN.)
My questions:
1- How are low-resolution pictures converted into higher-resolution ones, and how are high-resolution pictures converted into lower-resolution ones? Will this affect the information stored in the pictures?
2- Is it good practice to use this approach with a CNN? Or is it better to pad zeros to the end of the image to get the desired resolution? I have seen many researchers pad the end of a file with zeros when trying to detect malware files, to give all the files a common dimension. Does this mean that padding is more accurate than resizing?
Resizing is done using interpolation: https://chadrick-kwag.net/cv2-resize-interpolation-methods/
Definitely, resizing is a lossy process and you'll lose information.
Both are okay and are used depending on the need; resizing is equally applicable. If your CNN performs well on the original images but not on resized ones, it must be badly overfitted. Resizing even acts as a very light regularization, and it's advisable to apply more augmentation schemes to the images before CNN training.
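The zero-padding alternative from question 2 can be sketched with NumPy, placing the original image on a fixed-size canvas. This is a sketch using the 200*200 target from the question and an assumed centered placement:

```python
import numpy as np

def pad_to(image, target=(200, 200)):
    """Zero-pad a 2D image up to target size, keeping it centered.
    Assumes the image is no larger than the target."""
    h, w = image.shape
    top = (target[0] - h) // 2
    left = (target[1] - w) // 2
    return np.pad(image,
                  ((top, target[0] - h - top), (left, target[1] - w - left)))

padded = pad_to(np.ones((60, 50)))
# padded.shape is (200, 200); unlike resizing, every original pixel
# value is preserved exactly, only surrounded by zeros
```

The trade-off: padding keeps pixel values intact but wastes most of the input on zeros for small images, while resizing uses the full input but interpolates (and distorts aspect ratio unless you pad as well).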
Keras' Conv3D expects a 5D input tensor with shape (batch, conv_dim1, conv_dim2, conv_dim3, channels) (assuming data_format is "channels_last"). Now say my filter size is (3,3,3) and my input is (10,125,300,200,3): a dataset of 10 videos, each with 125 frames, spatial size 300x200, and 3 channels because the frames are RGB. The default stride value is (1, 1, 1). The picture in my head of how this convolution works is as shown here at 9:28.
What I can't figure out is whether a stride of 1 along the temporal dimension moves 1 frame at a time or 1 channel of a frame at a time. I tried to look up the code of Conv3D here and couldn't gather much. I tried training a deep network using 3D CNNs and RGB videos, and the resulting images have messed-up colours (almost grey), so I'm assuming there's some mix-up with the colour channels. I checked my input and that seems fine, so the problem is probably in the network.
TL;DR:
I need to figure out whether RGB videos need deliberate changes to the strides so that the channels of one frame are treated as in a 2D convolution. I'd also be grateful for pointers to code/papers dealing with RGB videos and 3D CNNs.
In all convolutions, the filter size encompasses all channels together. Channels do not participate in strides.
So, strides happen as if your video was a cube. Stride 1 step in each dimension (x,y,z) until the entire cube is swept. (The convolution has no idea of what the dimensions are, and will not treat the frames differently from how they treat pixels).
You have a little 3x3x3 cube sweeping a huge 125x300x200 parallelepiped, pixel by pixel, frame by frame. So: the stride moves one frame at a time, but considering only a 3x3 segment of the image.
This doesn't "seem" good for videos (but machine learning has its surprises), unless at some point you have a very tiny resolution, so that a filter starts seeing the whole picture in each frame.
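To make the sweep concrete, the output shape follows the usual "valid" (no padding) convolution arithmetic independently in each of the three swept dimensions. A pure-Python sketch using the sizes from the question:

```python
def conv3d_output_shape(in_shape, kernel, stride=(1, 1, 1)):
    """'Valid' convolution output size: out = (in - k) // s + 1 per dimension.
    Channels are consumed whole by the filter and don't appear here."""
    return tuple((i - k) // s + 1
                 for i, k, s in zip(in_shape, kernel, stride))

# the question's video: 125 frames of 300x200, kernel (3,3,3), stride (1,1,1)
shape = conv3d_output_shape((125, 300, 200), (3, 3, 3))
# -> (123, 298, 198): the temporal axis shrinks by 2 frames,
# exactly like the spatial axes shrink by 2 pixels
```

Note that the 3 RGB channels never appear in this arithmetic: the filter spans all of them at once, which is why the stride can only ever move in whole frames, never within a frame's channels.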
You can keep testing the 3D convs to see what happens, but a few suggestions that "seem" better are:
Use TimeDistributed(Conv2D(...)) and TimeDistributed(MaxPooling2D(...)) until you get a small resolution video in the middle of the model (or even a 1x1, if you're going extreme). Then start using:
Conv3D if there are still spatial dimensions
Conv1D if you eliminated the spatial dimensions
In both cases, it's a good idea to increase the kernel size in the frames dimension, 3 frames may be too little to interpret what is happening (unless you have a low frame rate)
Use TimeDistributed(Conv2D(...)), eliminate the spatial dimensions at some point and start using RNNs like LSTM
Use ConvLSTM2D layers.
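As a rough, framework-free sanity check on the first suggestion: TimeDistributed(Conv2D) leaves the frames axis untouched and convolves each frame on its own, so only the spatial dimensions shrink. A sketch of the shape bookkeeping, where the kernel size, stride, and filter count are assumptions for illustration:

```python
def td_conv2d_shape(shape, kernel=3, stride=2, filters=16):
    """Output shape of TimeDistributed(Conv2D) on (frames, H, W, C),
    'valid' padding: frames pass through, spatial dims shrink,
    channels are set by the layer's filter count."""
    frames, h, w, _ = shape
    out = lambda n: (n - kernel) // stride + 1   # standard 'valid' conv arithmetic
    return (frames, out(h), out(w), filters)

result = td_conv2d_shape((125, 300, 200, 3))
# -> (125, 149, 99, 16): all 125 frames survive, ready for Conv1D/RNNs
# over the frame axis once H and W have been shrunk far enough
```

Stacking a few of these until H and W are tiny is what lets the later Conv1D or LSTM stage reason purely over time.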
I was reading the research paper Fully Convolutional Networks for Semantic Segmentation, and the following is a quote from it:
The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions.
I didn't understand the bold part, but after some research on the internet I have come to the conclusion that if I remove the last layer (the fully connected layer) and then convolve the new last layer (which was second-to-last before removing the fully connected layer) with three 1x1 kernels, I will be doing the same thing the bold part describes. Am I correct?
Why three 1x1 kernels?
Because in the paper they create an RGB heatmap from the original input, and RGB means three channels; but the output of the convolutional network (without the fully connected layer) has many channels (it is high-dimensional), hence the convolution with three 1x1 kernels to turn it into an RGB image. Image from paper
Suppose you have a 200x200 feature map in the second-to-last layer.
Going to a fully connected layer means flattening that 200x200 matrix into a single one-dimensional array.
That means an array of size 40000, and that is what is meant by throwing away spatial coordinates: the layer no longer knows where each value came from. A 1x1 kernel, on the other hand, is applied at every spatial location independently, so it mixes the channels at each pixel while leaving the spatial layout unchanged; it is equivalent to applying the same fully connected layer separately at every pixel.
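In code, a 1x1 convolution is just a per-pixel matrix multiply over the channel axis, which makes it clear that the spatial layout survives. A NumPy sketch, where the sizes are assumptions for illustration:

```python
import numpy as np

h, w, c_in, c_out = 4, 4, 64, 3           # e.g. reduce 64 feature channels to RGB
features = np.random.rand(h, w, c_in)     # conv net output: many channels per pixel
kernels = np.random.rand(c_in, c_out)     # three 1x1 kernels, one per output channel

out = features @ kernels                  # the same multiply at every (y, x) location
# out.shape is (4, 4, 3): channels are mixed, spatial coordinates are kept,
# unlike flattening, which would destroy the (y, x) structure
```

This is exactly the "convolutionalized" fully connected layer of the quote, specialized to a kernel that covers a 1x1 input region.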