Does 3D convolution in Keras work with RGB videos? - python

Keras' Conv3D expects a 5D input tensor with shape (batch, conv_dim1, conv_dim2, conv_dim3, channels) (assuming data_format is "channels_last"). Now say my filter size is (3,3,3) and my input is (10,125,300,200,3): a dataset of 10 videos, each with 125 frames, spatial size 300x200, and 3 channels because the frames are RGB. The default stride value is (1, 1, 1). The picture in my head of how this convolution works is as shown here at 9:28.
What I can't figure out is whether a stride of 1 along the temporal dimension moves 1 frame at a time or 1 channel of a frame at a time. I tried to look up the code of Conv3D here and couldn't gather much. I tried training a deep learning network using 3D CNNs and RGB videos, and the resulting images have messed-up colours (almost grey), so I'm assuming there's some mix-up with the colour channels. I checked my input and that seems fine, so the network is probably the culprit.
TL;DR
I need to figure out whether RGB videos need deliberate changes to the strides so that the channels of one frame are treated with a 2D convolution, and I would also be grateful for pointers to code/papers dealing with RGB videos and 3D CNNs.

In all convolutions, the filter size encompasses all channels together. Channels do not participate in strides.
So the strides happen as if your video were a cube: the stride moves 1 step in each dimension (x, y, z) until the entire cube is swept. (The convolution has no idea what the dimensions mean, and will not treat the frames differently from how it treats pixels.)
You have a little 3x3x3 cube sweeping a huge 125x300x200 parallelepiped, pixel by pixel, frame by frame. So: the stride moves one frame at a time, but considering only a 3x3 segment of the image.
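A quick shape check (not from the original answer, just a sketch using the sizes from the question) makes the channel handling visible: the RGB axis is consumed entirely by each filter, and only the three spatio-temporal axes are swept by the stride.

from tensorflow import keras

x = keras.Input(shape=(125, 300, 200, 3))            # (frames, height, width, channels)
y = keras.layers.Conv3D(16, (3, 3, 3), strides=(1, 1, 1))(x)
print(y.shape)  # (None, 123, 298, 198, 16): no RGB axis left, just 16 output channels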
This doesn't "seem" good for videos (but machine learning has its surprises), unless at some point you have a very tiny resolution, so that a filter starts seeing the whole picture in each frame.
You can keep testing the 3D convs to see what happens, but a few suggestions that "seem" better are:
Use TimeDistributed(Conv2D(...)) and TimeDistributed(MaxPooling2D(...)) until you get a small-resolution video in the middle of the model (or even 1x1, if you're going extreme); see the sketch after this list. Then start using:
Conv3D if there are still spatial dimensions
Conv1D if you eliminated the spatial dimensions
In both cases, it's a good idea to increase the kernel size in the frames dimension; 3 frames may be too little to interpret what is happening (unless you have a low frame rate)
Use TimeDistributed(Conv2D(...)), eliminate the spatial dimensions at some point and start using RNNs like LSTM
Use ConvLSTM2D layers.
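As a rough sketch of the first suggestion (the layer sizes and the 10-class head are made up for illustration), the model below applies 2D convolutions per frame, removes the spatial dimensions with global pooling, and only then convolves along the frame axis:

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(125, 300, 200, 3)),                                  # (frames, H, W, C)
    keras.layers.TimeDistributed(keras.layers.Conv2D(32, 3, activation="relu")),
    keras.layers.TimeDistributed(keras.layers.MaxPooling2D(4)),
    keras.layers.TimeDistributed(keras.layers.Conv2D(64, 3, activation="relu")),
    keras.layers.TimeDistributed(keras.layers.GlobalAveragePooling2D()),
    # Spatial dimensions are gone; Conv1D (or an LSTM / ConvLSTM2D earlier)
    # now works purely along the frame axis, with a wider temporal kernel.
    keras.layers.Conv1D(64, 9, activation="relu"),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(10, activation="softmax"),
])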

Related

How to have consistent upsampling/downsampling when creating a Laplacian pyramid?

I'm trying to make a Laplacian pyramid for image reconstruction, but I've got a problem.
When downsampling I just keep the even rows/columns and discard the rest; when upsampling I duplicate each row/column.
But this creates a situation where the sizes of the images may not match, which prevents me from creating the LoG by subtraction. If I downsample an odd-sized image and then upsample it, the result will be even-sized.
How do I do this correctly so I can reconstruct the original image perfectly?
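Not from the original thread, but a minimal numpy sketch of one common fix: remember the shape before downsampling and crop the upsampled image back to exactly that shape before subtracting (the function names here are just for illustration).

import numpy as np

def downsample(img):
    # Keep only the even rows/columns, as described above.
    return img[::2, ::2]

def upsample(img, target_shape):
    # Duplicate each row/column, then crop to the stored original shape so
    # odd-sized levels line up again for the Laplacian subtraction.
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return up[:target_shape[0], :target_shape[1]]

img = np.random.rand(301, 200)           # odd height on purpose
small = downsample(img)                  # shape (151, 100)
restored = upsample(small, img.shape)    # shape (301, 200) again
laplacian = img - restored               # sizes now match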

Using K.tf.nn.max_pool_with_argmax() with a 3D input tensor

I would like to implement the SegNet for 3D images (width, height and depth, where depth is not channels). Thus in the decoder part of the network I need the pooling indices.
The function K.tf.nn.max_pool_with_argmax() only works for 2D images (width and height).
There is a function MaxPooling3D but this only returns the tensor after the pooling operation without the indices.
Does anyone know a solution to this?
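One workaround (not from this thread, just a sketch assuming the spatial sizes are divisible by the pool size) is to skip explicit argmax indices and instead record a binary mask of the max locations, which a SegNet-style decoder can use in place of the indices:

import tensorflow as tf

def max_pool3d_with_mask(x, pool=(2, 2, 2)):
    # Pool as usual, upsample the pooled result back to the input resolution,
    # and mark the positions that equal the pooled maximum. Ties produce more
    # than one 1 per window, which is usually acceptable for unpooling.
    pooled = tf.keras.layers.MaxPooling3D(pool_size=pool)(x)
    upsampled = tf.keras.layers.UpSampling3D(size=pool)(pooled)
    mask = tf.cast(tf.equal(x, upsampled), x.dtype)
    return pooled, mask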

How do I properly mask values in my Convolutional Neural Network

I'm creating an encoder-decoder CNN for some images. Each image has a geometric shape around the center - circle, ellipse, etc.
I want my CNN to ignore all the values that fall inside this shape. All my input data values have been normalized to the 0-1 range, and I set all the values inside the shape to 0.
I thought that setting them to zero would mean they would not be updated; however, the output of my encoder-decoder CNN changes the shape.
What can I do to ensure these values stay put and do not update?
Thank you!
I think you are looking for "partial convolution". This is work published by Guilin Liu and colleagues that extends convolution to take an input mask as well as an input feature map, applying the convolution only to the unmasked pixels. They also suggest how to compensate for pixels on the boundary of the mask, where the kernel "sees" both valid and masked-out pixels.
Please note that their implementation may have issues running with automatic mixed precision (AMP).
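As a very rough sketch of the idea (not the authors' implementation, and assuming a single-channel 0/1 mask with the masked pixels already zeroed in the input), a partial convolution can be built from two ordinary convolutions, one of which just counts the valid pixels in each window:

import tensorflow as tf

class PartialConv2D(tf.keras.layers.Layer):
    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.conv = tf.keras.layers.Conv2D(
            filters, kernel_size, padding="same", use_bias=False)
        # A fixed all-ones kernel counts the valid pixels under each window.
        self.mask_conv = tf.keras.layers.Conv2D(
            1, kernel_size, padding="same", use_bias=False,
            kernel_initializer="ones", trainable=False)

    def call(self, x, mask):
        masked = self.conv(x * mask)
        valid_count = self.mask_conv(mask)
        window_size = tf.cast(tf.size(self.mask_conv.kernel), x.dtype)
        # Re-normalize by the fraction of valid pixels; zero out windows that
        # contain no valid pixel at all.
        ratio = window_size / tf.maximum(valid_count, 1.0)
        out = masked * ratio * tf.cast(valid_count > 0, x.dtype)
        new_mask = tf.cast(valid_count > 0, mask.dtype)
        return out, new_mask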

CV2 resizing with CNN

I am using CV2 to resize various images with different dimensions (e.g. 70x300, 800x500, 60x50) to a specific 200x200-pixel dimension. Later, I feed the pictures to a CNN to classify the images (my understanding is that pictures must all have the same size when fed into a CNN).
My questions:
1- How are low picture resolutions converted into higher ones, and how are higher resolutions converted into lower ones? Will this affect the information stored in the pictures?
2- Is it good practice to use this approach with a CNN? Or is it better to pad zeros to the end of the image to get the desired resolution? I have seen many researchers pad the end of a file with zeros when trying to detect malware files, to give all the files a common dimension. Does this mean that padding is more accurate than resizing?
Using interpolation. https://chadrick-kwag.net/cv2-resize-interpolation-methods/
Definitely, resizing is a lossy process and you'll lose information.
Both are okay and are used depending on the need; resizing is equally applicable. If your CNN can't cope with resized images as well as with the originals, it is probably badly overfitted. Resizing also acts as a very light regularization; even so, it's advisable to apply more augmentation schemes to the images before CNN training.
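For concreteness, a short sketch of both options with OpenCV (the file name is just a placeholder): cv2.resize interpolates, INTER_AREA being the usual choice for shrinking and INTER_LINEAR or INTER_CUBIC for enlarging, while cv2.copyMakeBorder pads with zeros instead of altering the content.

import cv2

img = cv2.imread("some_image.png")

# Option 1: resize to 200x200; interpolation averages pixels when shrinking
# and invents in-between pixels when enlarging, so some information changes.
resized = cv2.resize(img, (200, 200), interpolation=cv2.INTER_AREA)

# Option 2: pad with zeros to 200x200 (assumes the image is no larger than
# 200x200), keeping the original pixels untouched.
h, w = img.shape[:2]
padded = cv2.copyMakeBorder(img, 0, max(0, 200 - h), 0, max(0, 200 - w),
                            borderType=cv2.BORDER_CONSTANT, value=0)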

creating fully convolution network

I was reading the research paper Fully Convolutional Networks for Semantic Segmentation, and the following is a quote from it:
The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions.
I didn't understand the bold part, but after some research on the internet I came to the conclusion that if I remove the last layer (the fully connected layer) and then convolve the now-last layer (which was second to last before removing the fully connected layer) with three 1x1 kernels, I will be doing the same thing the bold part describes. Am I correct here?
Why three 1x1 kernels?
Because in the paper they create an RGB heatmap from the original input, and RGB means three channels, but the result of the convolutional network (without the fully connected layer) has many channels (it is high-dimensional); hence the convolution with three 1x1 kernels to turn it into an RGB image (see the figure in the paper).
Suppose you have a 200x200 feature map in the second-to-last layer.
If you then go to a fully connected layer, you convert that 200x200 map into a single one-dimensional array.
That means an array of size 40000. That is what is meant by throwing away spatial coordinates. If you instead apply a 1x1 kernel, essentially the same thing happens: you get a similar result with no change in the pixel values.
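To make the quoted equivalence concrete, here is a small sketch (the 7x7x512 feature map and 4096 units are only illustrative numbers): a Dense layer on the flattened map computes the same function as a convolution whose kernel covers the entire input region, but the convolutional form keeps working when the input grows, producing a spatial map of scores instead of a single vector.

import tensorflow as tf

features = tf.keras.Input(shape=(7, 7, 512))

# "Fully connected" view: flatten, then Dense. Spatial coordinates are gone.
fc = tf.keras.layers.Dense(4096)(tf.keras.layers.Flatten()(features))

# "Convolutional" view: one 7x7 convolution with 4096 filters gives a
# 1x1x4096 output for a 7x7 input, and a larger spatial map for larger inputs.
conv = tf.keras.layers.Conv2D(4096, (7, 7))(features)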
