I was reading a Machine Learning book and came across this in the CNN chapter.
The weights of a convolutional layer are represented as a 4D tensor of
shape [fh, fw, fn′, fn]. The bias terms of a convolutional layer are
simply represented as a 1D tensor of shape [fn]. Here fh is the height of the receptive field and fw is its width, fn′ is the number of feature maps in the previous layer, and fn is the number of feature maps in the current layer.
I am trying to understand what each number in the given order signifies. Is it creating a rank-4 tensor where each entry represents the weight connecting an output neuron of the previous layer (at a specified feature map and location in the receptive field) to the current output neuron?
fn':
It is the number of channels (feature maps) in the previous layer, which also determines the depth (number of channels) of each kernel in the current layer.
fn:
It is the number of feature maps in the current layer, i.e. the number of different kernels in the current layer, because each kernel outputs a single channel.
fw:
It represents kernel width.
fh:
It represents kernel height.
Suppose [fh, fw, fn′, fn] = [3, 3, 10, 20]; then the weight tensor will be of size 20x10x3x3 (in kernels-first order). Each kernel will be of size 10x3x3 (where 3x3 is spatial and 10 is depth) and there will be 20 such kernels. These kernels operate on the previous layer's 10 feature maps to output 20 feature maps.
And each entry in this 4D weight tensor is shared across spatial locations; the layer does not connect neurons one to one, because of convolution. Convolution's main advantages are exactly this parameter sharing and the local connectivity of receptive fields.
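For concreteness, here is a minimal TensorFlow sketch of the [3, 3, 10, 20] example (my own illustration; the 28x28 spatial size is just an assumed input size):

import tensorflow as tf

x = tf.zeros([1, 28, 28, 10])                     # previous layer: 10 feature maps (fn')
w = tf.zeros([3, 3, 10, 20])                      # weights: [fh, fw, fn', fn]
b = tf.zeros([20])                                # one bias per current feature map (fn)

y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME") + b
print(y.shape)                                    # (1, 28, 28, 20): 20 feature maps out
print(3 * 3 * 10 * 20, "shared weights +", 20, "biases")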
I am trying to train a ResNet for 32x32 images, and I came upon a tutorial: https://towardsdatascience.com/resnets-for-cifar-10-e63e900524e0, which applies to CIFAR-10 (a 32x32 image dataset), but I don't understand what it's saying.
This is from the site:
"The rest of the notes from the authors to construct the ResNet are:
Use a stack of 6n layers of 3x3 convolutions. The choice of n will determine the size of our ResNet.
The feature map sizes are {32, 16, 8} respectively with 2 convolutions for each feature map size. Also, the number of filters is {16, 32, 64} respectively.
The down sampling of the volumes through the ResNet is achieved by increasing the stride to 2 for the first convolution of each layer. Therefore, no pooling operations are used until right before the dense layer.
For the bypass connections, no projections will be used. In the cases where there is a difference in the shape of the volume, the input will simply be padded with zeros, so the output size matches the size of the volume before the addition.
This would leave Figure 4 as the representation of our first layer. In this case, our bypass connection is a regular identity shortcut because the dimensionality of the volume is constant throughout the layer operations. Since we chose n=1, 2 convolutions are applied within layer 1.
We still can check from Figure 2 that the output volume of Layer1 is indeed 32x32x16. Let’s go deeper!"
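As I understand it, the first layer the quote describes would look something like this (a tf.keras sketch of my own, not the tutorial's code; the 32x32x16 input volume is my assumption):

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 16))                          # volume entering layer 1
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(16, 3, padding="same")(x)                          # 2 convolutions since n=1
x = layers.Add()([inputs, x])                                        # identity shortcut, no projection
outputs = layers.ReLU()(x)
print(outputs.shape)                                                 # (None, 32, 32, 16)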
I am confused about why the number of channels in the output is 16, when the number of filters was {16, 32, 64}.
How does this CNN architecture work from the input layer to the first convolution layer? hx98 are the input matrix dimensions; is n the number of channels or the number of inputs?
It doesn't seem like n is the number of channels, because 25 is the number of feature maps and their dimensions do not indicate that they have two channels.
However, if n is the number of inputs and the matrices are single-channel, I haven't found a single CNN architecture anywhere that takes multiple input matrices and convolves them together. Most examples convolve them separately and then concatenate.
In my example, n is 2: one matrix holds BER values and the other holds connection line-rate values.
What mistake am I making? How does this CNN work?
In a CNN, the image pixels (across the height and width of the image) are multiplied by the kernel weights of the convolution layer and summed to create a feature map.
The kernel passes through all the channels of the image (3 channels for RGB, 1 channel for greyscale) based on the strides defined in the convolution layer.
After the convolution, the spatial size of the image is reduced.
To get the same output dimension as the input dimension, you need to add padding. Padding consists of adding the right number of rows and columns on each side of the matrix. For details, please refer to this documentation.
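For example (a small sketch of my own, using tf.keras for brevity, on a dummy 28x28 single-channel image):

import numpy as np
import tensorflow as tf

image = np.zeros((1, 28, 28, 1), dtype=np.float32)                 # batch of one 28x28x1 image

no_pad = tf.keras.layers.Conv2D(8, kernel_size=3, padding="valid")(image)
padded = tf.keras.layers.Conv2D(8, kernel_size=3, padding="same")(image)

print(no_pad.shape)   # (1, 26, 26, 8): the size is reduced without padding
print(padded.shape)   # (1, 28, 28, 8): padding keeps the output the same size as the input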
Thank You.
I have been working on creating a convolutional neural network from scratch, and am a little confused on how to treat kernel size for hidden convolutional layers. For example, say I have an MNIST image as input (28 x 28) and put it through the following layers.
Convolutional layer with kernel_size = (5,5) with 32 output channels
new dimension of throughput = (32, 28, 28)
Max Pooling layer with pool_size (2,2) and step (2,2)
new dimension of throughput = (32, 14, 14)
If I now want to create a second convolutional layer with kernel size = (5x5) and 64 output channels, how do I proceed? Does this mean that I only need two new filters (2 x 32 existing channels) or does the kernel size change to be (32 x 5 x 5) since there are already 32 input channels?
Since the initial input was a 2D image, I do not know how to conduct convolution for the hidden layer since the input is now 3 dimensional (32 x 14 x 14).
You need 64 kernels, each of size (32, 5, 5).
The depth (number of channels) of each kernel, 32 in this case (or 3 for an RGB image, 1 for greyscale, etc.), should always match the input depth. In a learned CNN, each of those 32 channel slices has its own weights.
For intuition with a hand-crafted filter: if you have a 3x3 kernel like [-1 0 1; -2 0 2; -1 0 1] and you want to convolve it with an input of depth (number of channels) N, you can copy this 3x3 kernel N times along the third dimension. The math is then just like the 1-channel case: you multiply the kernel values with the input values in the current window across all N channels and sum everything to get the value of a single output entry (pixel). So the output is a matrix with 1 channel. How much depth do you want the input to the next layer to have? That is the number of kernels you should apply. Hence, in your case the weight tensor would have size (64 x 32 x 5 x 5), which is 64 kernels with 32 channels each.
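For concreteness, here is a from-scratch sketch (my own illustration, not the answerer's code) of that second convolution with those shapes; padding is omitted, so the spatial size shrinks from 14 to 10:

import numpy as np

x = np.random.randn(32, 14, 14)            # output of the pooling layer: 32 channels
w = np.random.randn(64, 32, 5, 5)          # 64 kernels, each spanning all 32 input channels

out_h = out_w = 14 - 5 + 1                 # 10, with no padding and stride 1
out = np.zeros((64, out_h, out_w))

for k in range(64):                        # each kernel produces one output channel
    for i in range(out_h):
        for j in range(out_w):
            window = x[:, i:i + 5, j:j + 5]            # a (32, 5, 5) window across all channels
            out[k, i, j] = np.sum(window * w[k])       # multiply and sum over channels and space

print(out.shape)                           # (64, 10, 10)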
("I am not a very confident english speaker hope you get what I said, it would be nice if someone edit this :)")
You essentially answered your own question. YOU are building the network solver. It seems like your convolutional layer output is [channels out] = [channels in] * [number of kernels]; I had to infer this from the wording of your question.

In general, this is how it works: you specify the kernel size of the layer and how many kernels to use. Since you have one input channel, you are essentially saying that there are 32 kernels in your first convolution layer, that is, 32 unique 5x5 kernels. Each of these kernels will be applied to the one input channel. More generally, each of the layer's kernels (32 in your example) is applied to each of the input channels. And that is the key.

If you build code to implement the convolution layer according to these generalities, then your subsequent convolution layers are done. In the next layer you specify two kernels per channel: in your example there are 32 input channels, the hidden layer has 2 kernels per channel, and the output is 64 channels.
You could then down sample by applying a pooling layer, then flatten the 64 channels [turn a matrix into a vector by stacking the columns or rows], and pass it as a column vector into a fully connected network. That is the basic scheme of convolutional networks.
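As a shape-level sketch of that scheme (my own; the 5x5 spatial size and 10 output units are just assumed examples):

import numpy as np

features = np.random.randn(64, 5, 5)       # 64 channels after the last conv/pool stage
vector = features.reshape(-1)              # flatten: 64*5*5 = 1600 values in a column vector

fc_w = np.random.randn(10, vector.size)    # fully connected weights for, say, 10 output units
fc_b = np.random.randn(10)
output = fc_w @ vector + fc_b

print(vector.shape, output.shape)          # (1600,) (10,)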
The work comes when you try to code up backpropagation through the convolutional layers. But the OP didn't ask about that, so I'll just say this: you will come to a place where you have the stored input matrix (one channel), you have a gradient from a lower layer in the form of a matrix that is the size of the layer kernel, and you need to backpropagate it up to the next convolutional layer.
The simple approach is to rotate your stored channel matrix by 180 degrees and then convolve it with the gradient. The explanation for this is long and tedious, too much to write here, and not a lot on the internet explains it well.
A more sophisticated approach is to apply “correlation” between the input gradient and the stored channel matrix. Note I specifically said “correlation” as opposed to “convolution”, and that is key. If you think they are “almost” the same thing, then I recommend you take some time and learn about the differences.
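A quick way to see the difference (a sketch of my own, not the answerer's code): 2D convolution equals correlation with the kernel rotated by 180 degrees.

import numpy as np
from scipy.signal import convolve2d, correlate2d

x = np.random.randn(6, 6)
k = np.random.randn(3, 3)

conv = convolve2d(x, k, mode="valid")
corr = correlate2d(x, np.rot90(k, 2), mode="valid")   # rot90 twice = 180-degree rotation

print(np.allclose(conv, corr))                        # True: they match only after the flip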
If you would like to have a look at my CNN solver, here's a link to the project. It's C++ with no documentation, sorry :) It's all in a header file called layer.h; find the class FilterLayer2D. I think the code is pretty readable (what programmer doesn't think his code is readable :) )
https://github.com/sraber/simplenet.git
I also wrote a paper on basic fully connected networks. I wrote it so that I wouldn't forget what I learned in my self-study. Maybe you'll get something out of it. It's at this link:
http://www.raberfamily.com/scottblog/scottblog.htm
So let's assume that I have RGB images of shape [128, 128, 3], and I want to create a CNN with two Conv-ReLU-MaxPool layers as below.
def cnn(input_data):
    # conv1: weights are [filter_h, filter_w, in_channels, out_channels]
    conv1_weight = tf.Variable(tf.truncated_normal([4, 4, 3, 25], stddev=0.1), dtype=tf.float32)
    conv1_bias = tf.Variable(tf.zeros([25]), dtype=tf.float32)
    conv1 = tf.nn.conv2d(input_data, conv1_weight, [1, 1, 1, 1], 'SAME')
    relu1 = tf.nn.relu(tf.nn.bias_add(conv1, conv1_bias))
    max_pool1 = tf.nn.max_pool(relu1, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
    # conv2
    conv2_weight = tf.Variable(tf.truncated_normal([4, 4, 25, 50], stddev=0.1), dtype=tf.float32)
    conv2_bias = tf.Variable(tf.zeros([50]), dtype=tf.float32)
    conv2 = tf.nn.conv2d(max_pool1, conv2_weight, [1, 1, 1, 1], 'SAME')
    relu2 = tf.nn.relu(tf.nn.bias_add(conv2, conv2_bias))
    max_pool2 = tf.nn.max_pool(relu2, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
After this step, I need to transform the output into a 1xN layer for the next fully connected layer. However, I am not sure how I should determine N in 1xN. Is there a specific formula involving the layer sizes, strides, max pool size, image size, etc.? I am pretty lost in this phase of the problem even though I think I get the intuition behind a CNN.
I understand that you want to transform the multiple 2D feature maps that come out of the last convolutional/pooling layer to a vector that can be fed into a fully-connected layer. Or to be precise and include the batch dimension, go from shape [batch, width, height, feature_maps] to [batch, N].
The above already implies that N = width * height * feature_maps, since reshaping keeps the overall number of elements the same. width and height depend on the size of your inputs and the strides of your network layers (convolution and/or pooling).
A stride of x simply divides the size by x. You have inputs of size 128 in each dimension, and two pooling layers with stride 2. Thus after the first pooling layer your images are 64x64 and after the second they are 32x32, so width = height = 32. Normally we would have to account for padding as well but the point of SAME padding is precisely that we don't have to worry about that.
Finally, feature_maps is 50 since that is how many filters your last convolutional layer has (pooling doesn't modify this). So N = 32*32*50 = 51200.
Thus, you should be able to do tf.reshape(max_pool2, [-1, 51200]) (or tf.reshape(max_pool2, [-1, 32*32*50]) to keep it more interpretable) and feed the resulting 2D tensor through a fully-connected layer (i.e. tf.matmul).
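For instance, continuing in the question's TF 1.x style (a sketch of my own using the N = 32*32*50 computed above, which assumes the stride-2 pooling described here; the 10-unit output layer is just an example):

import tensorflow as tf   # TF 1.x style, matching the question

# max_pool2 comes from the question's cnn() code; shown here as a placeholder with that shape
max_pool2 = tf.placeholder(tf.float32, [None, 32, 32, 50])

flat = tf.reshape(max_pool2, [-1, 32 * 32 * 50])               # [batch, 51200]
fc_weight = tf.Variable(tf.truncated_normal([32 * 32 * 50, 10], stddev=0.1))
fc_bias = tf.Variable(tf.zeros([10]))
logits = tf.matmul(flat, fc_weight) + fc_bias                  # [batch, 10] for, say, 10 classes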
The simplest way would be to just use tf.layers.flatten(max_pool2). This function does all the above for you and just gives you the [batch, N] result.
First of all, since you are starting out, I would recommend Keras instead of pure TensorFlow. And to answer your question regarding the shape, refer to this blog by Andrej Karpathy.
Quote from the blog:
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
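For example, the formula can be wrapped in a small helper (my own, not from the quoted notes) to check the numbers:

def conv_output_size(W, F, S=1, P=0):
    # spatial output size for input size W, filter size F, stride S, zero padding P
    return (W - F + 2 * P) // S + 1

print(conv_output_size(7, 3, S=1, P=0))    # 5, as in the quoted example
print(conv_output_size(7, 3, S=2, P=0))    # 3, with stride 2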
Now coming to your TensorFlow implementation:
For the conv1 stage you have specified 4*4 filters with 25 output channels. Since you have used padding="SAME" (and strides of 1) for conv1 and maxpool1, your output 2D spatial dimensions stay the same as the input in both cases. That is, after conv1 your output size is 128*128*25, and for the same reason the output of your maxpool1 layer has the same size. Since you have also given padding="SAME" for conv2, its output shape is 128*128*50 (you changed the number of output channels). Thus after maxpool2 your dimensions are: batch_size, 128, 128, 50. So before adding a Dense layer you have 3 major options:
1) Flattening the tensor results in shape: batch_size, 128*128*50
2) Global average pooling results in shape: batch_size, 50
3) Global max pooling also results in shape: batch_size, 50
Note:
A global average pooling layer is similar to average pooling, but we average the entire feature map instead of a window, hence the name global. For example, in your case you have batch_size, 128, 128, 50 as your dimensions. This means you have 50 feature maps with spatial dimensions 128*128. What global average pooling does is average each 128*128 feature map to give a single number, so you will have 50 values in total. This is very useful for designing fully convolutional architectures like Inception, ResNet, etc., because it makes the network's input size generic, meaning you can feed any image size to the network. Global max pooling is very similar, but the slight difference is that it takes the max value of each feature map instead of the average.
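A quick shape check of these options (my own sketch, using tf.keras):

import numpy as np
import tensorflow as tf

volume = np.zeros((4, 128, 128, 50), dtype=np.float32)             # batch of 4 feature volumes

flat = tf.keras.layers.Flatten()(volume)                           # (4, 819200)
gap = tf.keras.layers.GlobalAveragePooling2D()(volume)             # (4, 50)
gmp = tf.keras.layers.GlobalMaxPooling2D()(volume)                 # (4, 50)

print(flat.shape, gap.shape, gmp.shape)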
Problems with this architecture:
Generally it is not recommended to use padding="SAME" in max-pooling layers. If you look at the source code of VGG16, you will see that after each block (conv, ReLU, and max pooling) the spatial size is halved. Thus the general structure is: you reduce the spatial dimensions while increasing the depth/channels.
Flattening the layer:
var_name = tf.layers.flatten(max_pool2)
Should work, and it's what almost every example of a TensorFlow CNN uses.
What's the difference between those two? It would also help to explain in the more general context of convolutional networks.
Also, as a side note, what are channels? In other words, please break down the 3 terms for me: channels vs. filters vs. kernel.
Each convolution layer consists of several convolution filters (this count is also called the depth, or the number of output channels). In practice, filters is a number such as 64, 128, 256, 512, etc., and it equals the number of channels in the output of the convolutional layer. kernel_size, on the other hand, is the spatial size of these convolution filters. In practice, it takes values such as 3x3, 1x1, or 5x5. To abbreviate, these can be written as 1, 3, or 5, as they are mostly square in practice.
Edit
The following quote should make it clearer.
Discussion on vlfeat
Suppose X is an input with size W x H x D x N (where N is the size of the batch) to a convolutional layer containing filter F (with size FW x FH x FD x K) in a network.
The number of feature channels D is the third dimension of the input X here (for example, this is typically 3 at the first input to the network if the input consists of colour images).
The number of filters K is the fourth dimension of F.
The two concepts are closely linked because if the number of filters in a layer is K, it produces an output with K feature channels. So the input to the next layer will have K feature channels.
The FW x FH above is the filter size you are looking for.
Added
You should be familiar with filters. You can consider each filter to be responsible for extracting some type of feature from a raw image. CNNs try to learn such filters, i.e. the filters parameterized in a CNN are learned during training. You apply each filter in a Conv2D to every input channel and combine the per-channel results to get one output channel per filter. So, the number of filters and the number of output channels are the same.
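Putting the three terms together in a small tf.keras sketch (my own example, not from the quoted discussion):

import tensorflow as tf

# RGB input: 3 channels in. filters=64 -> 64 channels out. kernel_size=3 -> 3x3 kernels.
inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding="same")(inputs)
model = tf.keras.Model(inputs, outputs)

kernel, bias = model.layers[1].get_weights()
print(kernel.shape)    # (3, 3, 3, 64): kernel height x kernel width x input channels x filters
print(outputs.shape)   # (None, 32, 32, 64): one output channel per filter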