How do 1x1 convolutions preserve learned features?

Below, I use channels and feature maps interchangeably.
I'm trying to better understand how 1x1 convolution works with multiple input channels and have yet to find a good explanation of this. Before getting into 1x1, I'd like to ensure my understanding of 2D vs 3D convolution. Let's look at a simplistic example of 2D convolution in Keras API:
i = Input(shape=(64,64,3))
x = Conv2D(filters=32,kernel_size=(3,3),padding='same',activation='relu') (i)
In the above example, the input image has 3 channels and the convolutional layer will produce 32 feature maps. Will the 2D convolutional layer apply 3 different kernels to each of the 3 input channels to generate each feature map? If so, this means the number of kernels used in each 2D convolutional operation = #input channels * #feature maps. In this case, 96 different kernels would be used to produce 32 feature maps.
Now let's look at 3D convolution:
i = Input(shape=(1,64,64,3))
x = Conv3D(filters=32,kernel_size=(3,3,3),padding='same',activation='relu') (i)
In the above example, based on my current understanding, each kernel is convolved with all input channels simultaneously. Therefore, the # of kernels used in each 3D convolution operation = #input channels. In this case, 32 different kernels would be used to produce 32 feature maps.
I understand the purpose of downsampling channels before computations with bigger kernels (3x3, 5x5, 7x7). I'm asking because I'm confused as to how 1x1 convolutions preserve learned features. Let's look at a 1x1 convolution:
i = Input(shape=(64,64,3))
x = Conv2D(filters=32,kernel_size=(3,3),padding='same',activation='relu') (i)
x = Conv2D(filters=8,kernel_size=(1,1),padding='same',activation='relu') (x)
If my above understanding of 2D convolutions is correct, then the 1x1 convolutional layer will use 32 different kernels to generate each feature map. This operation would use a total of 256 kernels (32*8) to generate 8 feature maps. Each feature map computation essentially combines 32 pixels into one. How does this one pixel somehow retain all of the features from the previous 32 pixels?

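A 1x1 convolution is just a 2D convolution with a "kernel size" of 1: each output value is a learned weighted sum of the input channel values at that single pixel position. Since a 1x1 kernel, unlike a 3x3 kernel, has no sense of spatial neighborhoods, how the network is able to learn spatial features depends on the rest of the architecture.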
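To make the 1x1 case concrete, here is a minimal NumPy sketch (not Keras itself; the random arrays are stand-ins for real activations and trained weights). It assumes the 32-channel, 8-filter setup from the question and shows that the layer amounts to a small matrix product over the channel axis at every pixel, so each output value is a learned mix of the 32 inputs rather than a copy of all of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 32 feature maps produced by the 3x3 layer: (height, width, channels).
features = rng.standard_normal((64, 64, 32))

# A 1x1 layer with 8 filters holds one weight per (input channel, output channel) pair.
# Keras stores this as a (1, 1, 32, 8) kernel; the two leading 1s are dropped here.
weights = rng.standard_normal((32, 8))
bias = np.zeros(8)

# Applying the layer is a matrix product over the channel axis at every pixel, then ReLU.
output = np.maximum(features @ weights + bias, 0.0)

# The same computation for a single pixel: 32 channel values in, 8 values out.
pixel_out = np.maximum(features[10, 20] @ weights + bias, 0.0)

print(output.shape)                            # (64, 64, 8)
print(np.allclose(output[10, 20], pixel_out))  # True
```

Nothing forces the 8 outputs to retain everything in the 32 inputs; training picks the weights so that the combinations that matter for the loss survive.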
By the way, the difference between a 2D convolution and a 3D convolution is in the movement of the convolution. A 2D convolution correlates the filter along "x and y" and learns (kernel x kernel x input_channel) parameters per output channel. A 3D convolution correlates along "x, y, and z" and learns (kernel x kernel x kernel x input_channel) parameters per output channel. You could do a 3D convolution on an image with channels, but it doesn't really make sense because we already know the "depth" is correlated. 3D convolutions are generally used with geometric volumes, e.g. data from a CT scan.
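The parameter counts above can be checked with a few lines of plain Python. This is only a sketch: conv_param_count is a hypothetical helper, and it assumes a standard (non-grouped) convolution with one bias per output channel, applied to the layer sizes from the question.

```python
from math import prod

def conv_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Number of learnable parameters in a standard convolution layer."""
    weights = prod(kernel_size) * in_channels * filters
    return weights + (filters if use_bias else 0)

conv2d_3x3 = conv_param_count((3, 3), in_channels=3, filters=32)
conv3d_3x3x3 = conv_param_count((3, 3, 3), in_channels=3, filters=32)
conv2d_1x1 = conv_param_count((1, 1), in_channels=32, filters=8)

print(conv2d_3x3, conv3d_3x3x3, conv2d_1x1)  # 896 2624 264
```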
Maybe this link would be helpful
https://medium.com/analytics-vidhya/talented-mr-1x1-comprehensive-look-at-1x1-convolution-in-deep-learning-f6b355825578

Related

How to interpret this CNN architecture

How does this CNN architecture work from an input layer to the first convolution layer? hx98 are input matrix dimensions, is n the number of channels or the number of inputs?
It doesn't seem like n is the number of channels because 25 is the number of feature maps and their dimensions do not indicate they are two channels.
However, if n is the number of inputs and the matrices are single-channel, I haven't found a single CNN architecture anywhere that takes multiple input matrices and convolves them together. Most examples convolve them separately and then concatenate.
In my example, n is 2, one is matrix with BER values and another with connection line-rate values.
What mistake am I making? How does this CNN work?

In a CNN, the image pixels under the kernel window are multiplied by the kernel weights of the convolution layer and summed to create one entry of a feature map.
The kernel spans all the channels of the image (3 channels for RGB, 1 channel for greyscale) and moves across it according to the strides defined in the convolution layer.
Without padding, the output of the convolution is smaller than the input. To get the same output dimension as the input dimension, you need to add padding. Padding consists of adding the right number of rows and columns on each side of the matrix. For details, please refer to this documentation.
Thank You.
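The effect of padding on the output size follows the usual formula floor((n + 2p - k) / s) + 1. A small sketch, assuming square inputs and kernels and the same padding on every side (both function names are made up for illustration):

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def same_padding(kernel_size):
    """Padding per side that preserves the size for stride 1 and an odd kernel size."""
    return (kernel_size - 1) // 2

print(conv_output_size(28, 5))                           # 24: no padding shrinks the image
print(conv_output_size(28, 5, padding=same_padding(5)))  # 28: padding of 2 keeps the size
print(conv_output_size(28, 3, stride=2, padding=1))      # 14: a stride of 2 halves it
```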

Tensorflow CNN for different input size

I'm trying to make conv network for image regression.
As shown in below, one image [224 x 224] has one GT value {x}.
It's easy to make train [224 x 224] and valid/test with [224 x 224] images.
However, I'd like to apply CNN for different image sizes.
For example, [224 x 229] image, I want to get 5 regression values 'at once'.
Simply, I can do that by just sliding windows of [224 x 224] x 5 times, but apparently it is too slow.
I think using conv for different image size is possible. But FCL is not.
If I change image size to [455 x 256]
lhs shape= [4608,1024] rhs shape= [2048,1024]
error occurred. Is there any way to handle it?

Fully connected layers have a fixed size input. Thus, changing the input size will cause a wrong-size error.
One way to tackle this problem, and allow for different image sizes is to use a fully convolutional network.
An example with easy numbers:
Assuming, for example, the conv layer's output is of size 16x16, you can create a "classifier layer" with 4x4 filters and stride 4 that outputs, for each of the 16 non-overlapping 4x4 squares comprising the 16x16 feature map, a single value per output dimension. Each such filter is of size 4x4 times the number of input feature channels, and you use n_dim of them; in your case n_dim will be 5, and the final output would be of size 4x4x5, corresponding to 5 outputs (one for each regression value) for each 4x4 square.
You will notice you can play with the shape of the last conv filter to obtain different sizes for the final output, corresponding to different parts of the input image, but really, looking at all of it.
You can work out the numbers for your own example.
You would probably like to read about basic methods for semantic segmentation.
Also see basic fully conv nets.
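As a rough illustration of the 4x4, stride-4 "classifier layer" idea, here is a NumPy sketch. The function patch_head, the channel count of 64, and the random weights are all assumptions made for the example; a real network would learn the weights. The point is only that the same weights work on feature maps of different sizes, and the output grid grows with the input.

```python
import numpy as np

def patch_head(feature_map, weights, bias):
    """Apply a k x k convolution with stride k (non-overlapping patches).

    feature_map: (H, W, C), weights: (k, k, C, n_out), bias: (n_out,)
    Returns an array of shape (H // k, W // k, n_out).
    """
    k = weights.shape[0]
    h, w, c = feature_map.shape
    rows, cols = h // k, w // k
    # Drop any border that does not fill a whole patch, then split into patches.
    cropped = feature_map[: rows * k, : cols * k]
    patches = cropped.reshape(rows, k, cols, k, c).transpose(0, 2, 1, 3, 4)
    # For every patch, take a weighted sum over its k x k x C values per output.
    return np.einsum("ijabc,abcn->ijn", patches, weights) + bias

rng = np.random.default_rng(0)
channels, n_dim = 64, 5  # hypothetical channel count; 5 regression values
weights = rng.standard_normal((4, 4, channels, n_dim))
bias = np.zeros(n_dim)

small = patch_head(rng.standard_normal((16, 16, channels)), weights, bias)
large = patch_head(rng.standard_normal((16, 32, channels)), weights, bias)
print(small.shape, large.shape)  # (4, 4, 5) (4, 8, 5)
```

A fully connected layer, by contrast, would need a weight matrix whose size is tied to one specific flattened input length, which is where the shape mismatch error comes from.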

Kernel size change in convolutional neural networks

I have been working on creating a convolutional neural network from scratch, and am a little confused on how to treat kernel size for hidden convolutional layers. For example, say I have an MNIST image as input (28 x 28) and put it through the following layers.
Convolutional layer with kernel_size = (5,5) with 32 output channels
new dimension of throughput = (32, 28, 28)
Max Pooling layer with pool_size (2,2) and step (2,2)
new dimension of throughput = (32, 14, 14)
If I now want to create a second convolutional layer with kernel size = (5x5) and 64 output channels, how do I proceed? Does this mean that I only need two new filters (2 x 32 existing channels) or does the kernel size change to be (32 x 5 x 5) since there are already 32 input channels?
Since the initial input was a 2D image, I do not know how to conduct convolution for the hidden layer since the input is now 3 dimensional (32 x 14 x 14).

You need 64 kernels, each of size (32, 5, 5).
The depth (number of channels) of each kernel, 32 in this case, or 3 for an RGB image, 1 for grayscale, etc., must always match the input depth. In a trained network, each of the 32 slices has its own learned 5x5 weights.
E.g., if you have a 3x3 kernel like this: [-1 0 1; -2 0 2; -1 0 1] and you want to apply it to an input with depth (number of channels) N, the simplest case is to copy this 3x3 kernel N times along the third dimension. The math is then just like the 1-channel case: you multiply the kernel values with the input values under the kernel window in all N channels, sum all the products, and get the value of just one entry (pixel). So what you get as output in the end is a matrix with 1 channel. How much depth do you want the input to the next layer to have? That is the number of kernels you should apply. Hence, in your case the weights have size (64 x 32 x 5 x 5), which is 64 kernels with 32 channels each and a 5x5 slice per channel.
("I am not a very confident english speaker hope you get what I said, it would be nice if someone edit this :)")
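A minimal NumPy sketch of that second layer, under a few assumptions: channels-first layout as in the question, random values standing in for data and weights, no padding and stride 1 (so the 14x14 input becomes 10x10), and cross-correlation without flipping the kernel, as deep learning libraries usually do. conv_layer is a name made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_layer(x, kernels):
    """Cross-correlate x (C_in, H, W) with kernels (C_out, C_in, k, k), no padding."""
    k = kernels.shape[-1]
    # windows[c, i, j] is the k x k patch of channel c whose top-left corner is (i, j).
    windows = sliding_window_view(x, (k, k), axis=(1, 2))
    # Multiply each patch by the matching kernel slice, then sum over channels and patch.
    return np.einsum("chwij,ocij->ohw", windows, kernels)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 14, 14))          # output of the pooling layer
kernels = rng.standard_normal((64, 32, 5, 5))  # 64 kernels, each of size (32, 5, 5)

out = conv_layer(x, kernels)
print(out.shape)  # (64, 10, 10)
```

Each of the 64 output maps is a sum over all 32 input channels, which is why the output depth equals the number of kernels and not 32 times something.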

You essentially answered your own question. YOU are building the network solver. It seems like your convolutional layer output is [channels out] = [channels in] * [number of kernels]. I had to infer this from the wording of your question. In general, this is how it works: you specify the kernel size of the layer and how many kernels to use. Since you have one input channel you are essentially saying that there are 32 kernels in your first convolution layer. That is 32 unique 5x5 kernels. Each of these kernels will be applied to the one input channel. More in general, each of the layer kernels (32 in your example) is applied to each of the input channels. And that is the key. If you build code to implement the convolution layer according to these generalities, then your subsequent convolution layers are done. In the next layer you specify two kernels per channel. In your example there would be 32 input channels, the hidden layer has 2 kernels per channel, and the output would be 64 channels. Note that this is a per-channel scheme (what libraries usually call a depthwise convolution); in the more common formulation, each of the 64 kernels spans all 32 input channels and sums over them, as the other answer describes.
You could then down sample by applying a pooling layer, then flatten the 64 channels [turn a matrix into a vector by stacking the columns or rows], and pass it as a column vector into a fully connected network. That is the basic scheme of convolutional networks.
The work comes when you try to code up backpropagation through the convolutional layers. But the OP didn’t ask about that. I’ll just say this, you will come to a place where you have the stored input matrix (one channel), you have a gradient from a lower layer in the form of a matrix and is the size of the layer kernel, and you need to backpropagate it up to the next convolutional layer.
The simple approach is to rotate your stored channel matrix by 180 degrees and then convolve it with the gradient. The explanation for this is long and tedious, too much to write here, and not a lot on the internet explains it well.
A more sophisticated approach is to apply “correlation” between the input gradient and the stored channel matrix. Note I specifically said “correlation” as opposed to “convolution” and that is key. If you think they are “almost” the same thing, then I recommend you take some time and learn about the differences.
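For readers unsure about that difference, here is a small NumPy sketch. It is not the backpropagation code from this answer; it only illustrates, with made-up helper names and a toy 4x4 input, that convolution is correlation with the kernel rotated by 180 degrees.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Slide k over x without flipping it; output covers only full overlaps."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def convolve2d_valid(x, k):
    """True convolution: correlation with the kernel rotated by 180 degrees."""
    return correlate2d_valid(x, np.rot90(k, 2))

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, -1.0]])  # asymmetric, so the two results differ

corr = correlate2d_valid(x, k)
conv = convolve2d_valid(x, k)
print(corr[0, 0], conv[0, 0])  # -5.0 5.0
```

For a kernel that is symmetric under a 180 degree rotation the two operations give the same result, which is why the distinction is easy to overlook.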
If you would like to have a look at my CNN solver here's a link to the project. It's C++ and no documentation, sorry :) It's all in a header file called layer.h, find the class FilterLayer2D. I think the code is pretty readable (what programmer doesn't think his code is readable :) )
https://github.com/sraber/simplenet.git
I also wrote a paper on basic fully connected networks. I wrote it so that I would not forget what I learned in my self study. Maybe you'll get something out of it. It's at this link:
http://www.raberfamily.com/scottblog/scottblog.htm

Keras Conv2D: filters vs kernel_size

What's the difference between those two? It would also help to explain in the more general context of convolutional networks.
Also, as a side note, what is channels? In other words, please break down the 3 terms for me: channels vs filters vs kernel.

Each convolution layer consists of several convolution channels (aka. depth or filters). In practice, they are a number such as 64, 128, 256, 512 etc. This is equal to number of channels in the output of a convolutional layer. kernel_size, on the other hand, is the size of these convolution filters. In practice, they take values such as 3x3 or 1x1 or 5x5. To abbreviate, they can be written as 1 or 3 or 5 as they are mostly square in practice.
Edit
Following quote should make it more clear.
Discussion on vlfeat
Suppose X is an input with size W x H x D x N (where N is the size of the batch) to a convolutional layer containing filter F (with size FW x FH x FD x K) in a network.
The number of feature channels D is the third dimension of the input X here (for example, this is typically 3 at the first input to the network if the input consists of colour images).
The number of filters K is the fourth dimension of F.
The two concepts are closely linked because if the number of filters in a layer is K, it produces an output with K feature channels. So the input to the next layer will have K feature channels.
The FW x FH above is filter size you are looking for.
Added
You should be familiar with filters. You can consider each filter to be responsible for extracting some type of feature from a raw image. The CNNs try to learn such filters i.e. the filters parametrized in CNNs are learned during training of CNNs. You apply each filter in a Conv2D to each input channel and combine these to get output channels. So, the number of filters and the number of output channels are the same.
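One way to see how the three terms fit together is to follow the shapes through a couple of layers. The sketch below is plain Python, not Keras; trace_shapes is a hypothetical helper, and it assumes 'same' padding with stride 1 so that height and width stay fixed. The kernel shapes follow the (height, width, input channels, filters) ordering that Keras uses for Conv2D weights.

```python
def trace_shapes(input_shape, layers):
    """Follow (height, width, channels) through Conv2D layers, 'same' padding, stride 1.

    layers: list of (filters, kernel_size) pairs.
    Returns the kernel shapes (kh, kw, in_channels, filters) and the final shape.
    """
    height, width, channels = input_shape
    kernel_shapes = []
    for filters, (kh, kw) in layers:
        kernel_shapes.append((kh, kw, channels, filters))
        channels = filters  # the next layer sees one channel per filter
    return kernel_shapes, (height, width, channels)

kernel_shapes, output_shape = trace_shapes((64, 64, 3), [(32, (3, 3)), (8, (1, 1))])
print(kernel_shapes)  # [(3, 3, 3, 32), (1, 1, 32, 8)]
print(output_shape)   # (64, 64, 8)
```

So kernel_size sets the first two numbers of each weight shape, the incoming channels set the third, and filters sets the fourth, which becomes the channel count of the output.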

tensorflow - understanding tensor shapes for convolution

Currently trying to work my way through the Tensorflow MNIST tutorial for convolutional networks and I could use some help with understanding the dimensions of the darn tensors.
So we have images of 28x28 pixels in size.
The convolution will compute 32 features for each 5x5 patch.
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels.
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
If you say so ...
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
Alright, now I'm getting lost.
Judging by this last reshape, we have
"howevermany" 28x28x1 "blocks" of pixels that are our images.
I guess this makes sense because the images are in greyscale
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.
The x32 makes sense, I guess, if we want to infer 32 features per patch
The rest, though, I'm not terribly convinced by.
Why does the weight tensor look the way it apparently does?
(For completeness: we use them
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
where
def conv2d(x, W):
    '''
    2D convolution, expects 4D input x and filter matrix W
    '''
    return tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME')
def max_pool_2x2(x):
    '''
    max-pooling, using 2x2 patches
    '''
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')
)

Your input tensor has the shape [-1,28,28,1]. Like you mention, the last dimension is 1 because the images are in greyscale. The first index is the batchsize. The convolution will process every image in the batch independently, therefore the batchsize has no influence on the convolution-weight-tensor dimensions, or, in fact, no influence on any weight-tensor dimensions in the network. That is why the batchsize can be arbitrary (-1 signifies arbitrary size in tensorflow).
Now to the weight tensor; you don't have five of 5x1x32-blocks, you rather have 32 of 5x5x1-blocks. Each represents one feature. The 1 is the depth of the patch and is 1 due to the gray scale (it would be 5x5x3x32 for color images). The 5x5 is the size of the patch.
The ordering of dimensions in the data tensors is different from the ordering of dimensions in the convolution weight tensors.
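To check the shapes without TensorFlow, here is a NumPy stand-in for the first layer of the tutorial. It is a sketch under stated assumptions: conv2d_same only mimics tf.nn.conv2d for stride 1, 'SAME' padding and odd kernel sizes, the batch of 10 random images is made up, and the weights are random rather than trained.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, w):
    """Stride-1 'SAME' convolution: x is (N, H, W, C_in), w is (kh, kw, C_in, C_out)."""
    kh, kw = w.shape[:2]
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    # windows has shape (N, H, W, C_in, kh, kw): one patch per output position.
    windows = sliding_window_view(padded, (kh, kw), axis=(1, 2))
    return np.einsum("nhwcij,ijco->nhwo", windows, w)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 for even height and width."""
    n, height, width, c = x.shape
    return x.reshape(n, height // 2, 2, width // 2, 2, c).max(axis=(2, 4))

rng = np.random.default_rng(0)
x_image = rng.random((10, 28, 28, 1))        # batch of 10 greyscale images
W_conv1 = rng.standard_normal((5, 5, 1, 32))  # 32 filters, each a 5x5x1 block
b_conv1 = np.full(32, 0.1)

h_conv1 = np.maximum(conv2d_same(x_image, W_conv1) + b_conv1, 0.0)
h_pool1 = max_pool_2x2(h_conv1)
print(W_conv1[..., 0].shape)         # (5, 5, 1): one of the 32 blocks
print(h_conv1.shape, h_pool1.shape)  # (10, 28, 28, 32) (10, 14, 14, 32)
```

The batch dimension never appears in W_conv1, and the last axis of the weights becomes the last axis of the output.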

Besides the other answer, I would like to add some more points.
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
There is no specific reason why we choose 5x5 patches or 32 features; these parameters are chosen empirically (except in some cases), and you may use 3x3 patches or a larger number of features.
I said 'except in some cases' because we may use 3x3 patches to capture finer detail in the images, or a larger number of features to learn more about each image ('larger' and 'more detail' are relative terms here).
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.
Not exactly. The weight tensor is not a collection of such blocks; it is a set of filters with patch size 5x5, 1 input channel, and 32 output features (channels).
Why does the weight tensor look the way it apparently does?
The weight tensor weight_variable([5, 5, 1, 32]) says: I have a 5x5 patch size to apply to an image, 1 input feature (since the images are grayscale), and 32 output features (channels).
More Details:
The line tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME') takes input x with shape [-1,28,28,1]. The -1 means this dimension (the batch size) can be any size you want; 28,28 is the input size, which must be exactly 28x28; and the last 1 is the number of input channels, which is 1 because the MNIST images are grayscale. In other words, each input image is a 28x28 2D matrix in which each cell holds a grayscale intensity. If the input images were RGB, we would have 3 channels instead of 1, and each input image would be a 28x28x3 3D matrix: the first slice along the last dimension holds the intensity of red, the second the intensity of green, and the third the intensity of blue.
tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME') then takes x and applies W (the 5x5 patch) to the 28x28 image with step size 1 (since the stride is 1), and the resulting image is again of size 28x28 because we use padding='SAME'.
