tensorflow - understanding tensor shapes for convolution - python

Currently trying to work my way through the Tensorflow MNIST tutorial for convolutional networks and I could use some help with understanding the dimensions of the darn tensors.
So we have images of 28x28 pixels in size.
The convolution will compute 32 features for each 5x5 patch.
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels.
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
If you say so ...
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
Alright, now I'm getting lost.
Judging by this last reshape, we have "howevermany" 28x28x1 "blocks" of pixels that are our images. I guess this makes sense because the images are in greyscale.
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values. The x32 makes sense, I guess, if we want to infer 32 features per patch.
The rest, though, I'm not terribly convinced by.
Why does the weight tensor look the way it apparently does?
(For completeness: we use them
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
where
def conv2d(x, W):
    '''
    2D convolution, expects 4D input x and filter matrix W
    '''
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    '''
    max-pooling, using 2x2 patches
    '''
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
)

Your input tensor has the shape [-1,28,28,1]. Like you mention, the last dimension is 1 because the images are in greyscale. The first index is the batch size. The convolution processes every image in the batch independently, so the batch size has no influence on the convolution-weight-tensor dimensions, or, in fact, on any weight-tensor dimensions in the network. That is why the batch size can be arbitrary (-1 tells TensorFlow to infer that dimension, so it can be any size).
Now to the weight tensor: you don't have five 5x1x32 blocks; rather, you have 32 blocks of 5x5x1. Each one represents one feature. The 1 is the depth of the patch and is 1 because of the grayscale input (it would be 5x5x3x32 for color images). The 5x5 is the size of the patch.
The ordering of dimensions in the data tensors is different from the ordering of dimensions in the convolution weight tensors.
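As a quick sketch of both layouts (assuming TensorFlow 2.x eager execution; the batch size of 8 is just illustrative):
import tensorflow as tf

# Data tensor layout: [batch, height, width, channels]
x_image = tf.zeros([8, 28, 28, 1])   # 8 grayscale 28x28 images

# Weight tensor layout: [filter_height, filter_width, in_channels, out_channels]
W_conv1 = tf.zeros([5, 5, 1, 32])

# One of the 32 features is a single 5x5x1 filter:
one_filter = W_conv1[:, :, :, 0]     # shape (5, 5, 1)

h = tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME')
print(one_filter.shape, h.shape)     # (5, 5, 1) (8, 28, 28, 32)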

Besides the other answer, I would like to add a few more points.
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
There is no specific reason why we choose 5x5 patches or 32 features; these parameters are mostly chosen from experience (except in some cases). You may use 3x3 patches or a larger number of features.
I said 'except in some cases' because we may, for instance, use 3x3 patches to capture information from the images in more detail, or a larger number of features to learn each image in more detail ('larger' and 'more detail' are relative terms here).
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.
Not exactly: the weight tensor is not a collection; it is a single filter bank with patch size 5x5, 1 input channel, and 32 output features (channels).
Why does the weight tensor look the way it apparently does?
The weight tensor weight_variable([5, 5, 1, 32]) says: I have a 5x5 patch size to apply to an image, 1 input feature (since the images are grayscale), and 32 output features (channels).
More Details:
So the line tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME') takes an input x of shape [-1,28,28,1]. The -1 means you can put any size you want in this dimension (the batch size). The 28, 28 is the input size, and it must be exactly 28x28. The last 1 is the number of input channels; since the MNIST images are grayscale, it is 1. In more detail: the input image is a 28x28 2D matrix, and each cell of the matrix holds a value indicating the grayscale intensity. If the input images were RGB, we would have 3 channels instead of 1, and those 3 channels mean the input image is a 28x28x3 3D matrix, where the first slice of the 3 holds the intensity of red, the second the intensity of green, and the third the intensity of blue.
Now tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME') takes x and applies W (a 5x5 patch), sliding this patch over the 28x28 image with step size 1 (since the stride is 1), and gives a result image that is again 28x28 in size because we use padding='SAME'.
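To make the effect of padding='SAME' concrete, here is a small sketch (TensorFlow 2.x assumed; 'VALID' is shown only for contrast):
import tensorflow as tf

x = tf.zeros([1, 28, 28, 1])     # one grayscale 28x28 image
W = tf.zeros([5, 5, 1, 32])      # 5x5 patch, 1 input channel, 32 output features

same = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
valid = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='VALID')

print(same.shape)    # (1, 28, 28, 32) -- spatial size preserved by 'SAME'
print(valid.shape)   # (1, 24, 24, 32) -- shrinks by patch_size - 1 without padding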

Related

How to interpret this CNN architecture

How does this CNN architecture work from the input layer to the first convolution layer? hx98 are the input matrix dimensions; is n the number of channels or the number of inputs?
It doesn't seem like n is the number of channels because 25 is the number of feature maps and their dimensions do not indicate they are two channels.
However, if n is the number of inputs and the matrices are single-channel, I haven't found a single CNN architecture anywhere that takes multiple input matrices and convolves them together. Most examples convolve them separately and then concatenate.
In my example, n is 2, one is matrix with BER values and another with connection line-rate values.
What mistake am I making? How does this CNN work?
In a CNN, the image pixels across the height and width are multiplied with the kernel weights of the convolution layer and summed to create a feature map.
The kernel passes through all the channels of the image (3 channels for RGB, 1 channel for grayscale) based on the strides defined in the convolution layer.
After the convolution, the size of the image is reduced. To get the same output dimension as the input dimension, you need to add padding. Padding consists of adding the right number of rows and columns on each side of the matrix. For details, please refer to this documentation.
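For the two-channel case in the question, a hypothetical Keras sketch (the height of 64 and the 3x3 kernel size are stand-in assumptions) shows that a single Conv2D layer convolves both channels together:
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D

# Two h x 98 matrices (BER, line-rate) stacked as 2 channels; 64 stands in for h
i = Input(shape=(64, 98, 2))
x = Conv2D(filters=25, kernel_size=(3, 3), padding='same')(i)
model = tf.keras.Model(i, x)

print(x.shape)                          # (None, 64, 98, 25)
print(model.layers[1].kernel.shape)     # (3, 3, 2, 25) -- each kernel spans both channels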
Thank You.

How do 1x1 convolutions preserve learned features?

Below, I use channels and feature maps interchangeably.
I'm trying to better understand how 1x1 convolution works with multiple input channels and have yet to find a good explanation of this. Before getting into 1x1, I'd like to ensure my understanding of 2D vs 3D convolution. Let's look at a simplistic example of 2D convolution in Keras API:
i = Input(shape=(64,64,3))
x = Conv2D(filters=32,kernel_size=(3,3),padding='same',activation='relu') (i)
In the above example, the input image has 3 channels and the convolutional layer will produce 32 feature maps. Will the 2D convolutional layer apply 3 different kernels to each of the 3 input channels to generate each feature map? If so, this means the number of kernels used in each 2D convolutional operation = #input channels * #feature maps. In this case, 96 different kernels would be used to produce 32 feature maps.
Now let's look at 3D convolution:
i = Input(shape=(1,64,64,3))
x = Conv3D(filters=32,kernel_size=(3,3,3),padding='same',activation='relu') (i)
In the above example, based on my current understanding, each kernel is convolved with all input channels simultaneously. Therefore, the # of kernels used in each 3D convolution operation = #input channels. In this case, 32 different kernels would be used to produce 32 feature maps.
I understand the purpose of downsampling channels before computations with bigger kernels (3x3, 5x5, 7x7). I'm asking because I'm confused as to how 1x1 convolutions preserve learned features. Let's look at a 1x1 convolution:
i = Input(shape=(64,64,3))
x = Conv2D(filters=32,kernel_size=(3,3),padding='same',activation='relu') (i)
x = Conv2D(filters=8,kernel_size=(1,1),padding='same',activation='relu') (x)
If my above understanding of 2D convolutions is correct, then the 1x1 convolutional layer will use 32 different kernels to generate each feature map. This operation would use a total of 256 kernels (32*8) to generate 8 feature maps. Each feature map computation essentially combines 32 pixels into one. How does this one pixel somehow retain all of the features from the previous 32 pixels?
A 1x1 convolution is a 2D convolution just with a "kernel size" of 1. Since there is no sense of spatial neighborhoods, like in a 3x3 kernel, how they are able to learn spatial features depends on the architecture.
By the way, the difference between a 2D convolution and a 3D convolution is in the movement of the convolution. A 2D convolution correlates the filter along "x and y" and learns (kernel x kernel x input_channel) parameters per output channel. A 3D convolution correlates along "x, y, and z" and learns (kernel x kernel x kernel x input_channel) parameters per output channel. You could do a 3D convolution on an image with channels, but it doesn't really make sense because we already know the "depth" is correlated. 3D convolutions are generally used with geometric volumes, e.g. data from a CT scan.
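To check the parameter arithmetic from the question, here is a short Keras sketch (illustrative only):
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Conv2D

i = Input(shape=(64, 64, 3))
x = Conv2D(filters=32, kernel_size=(3, 3), padding='same', activation='relu')(i)
x = Conv2D(filters=8, kernel_size=(1, 1), padding='same', activation='relu')(x)
m = Model(i, x)
m.summary()
# Conv2D 3x3: 3*3*3*32 weights + 32 biases = 896 parameters
# Conv2D 1x1: 1*1*32*8 weights + 8 biases  = 264 parameters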
Maybe this link would be helpful
https://medium.com/analytics-vidhya/talented-mr-1x1-comprehensive-look-at-1x1-convolution-in-deep-learning-f6b355825578

Tensorflow CNN for different input size

I'm trying to make a conv network for image regression.
As shown below, one [224 x 224] image has one GT value {x}.
It's easy to train with [224 x 224] images and validate/test with [224 x 224] images.
However, I'd like to apply the CNN to different image sizes.
For example, for a [224 x 229] image, I want to get 5 regression values 'at once'.
I could simply do that by sliding a [224 x 224] window 5 times, but apparently that is too slow.
I think using conv layers for different image sizes is possible, but the fully connected layer (FCL) is not.
If I change the image size to [455 x 256], I get the error
lhs shape= [4608,1024] rhs shape= [2048,1024]
Is there any way to handle it?
Fully connected layers have a fixed size input. Thus, changing the input size will cause a wrong-size error.
One way to tackle this problem, and allow for different image sizes is to use a fully convolutional network.
An example with easy numbers:
Assuming, for example, that the conv layer's output is of size 16x16, you can create a "classifier layer" of size 4x4 with stride 4 that would output, for each of the 4x4 squares comprising the 16x16 feature map, a single value per dimension. Such a filter would be of size 4x4xn_dim; in your case n_dim would be 5, and the final output would be of size 4x4x5, corresponding to 5 outputs (one for each regression value) for each 4x4 square.
You will notice you can play with the shape of the last conv filter to obtain different sizes for the final output, corresponding to different parts of the input image, while still looking at all of it.
You can work out the numbers for your own example.
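Here is a minimal sketch of that 16x16 example (the 64 input channels are an arbitrary assumption):
import tensorflow as tf

features = tf.zeros([1, 16, 16, 64])   # hypothetical conv output: 16x16 map, 64 channels
W = tf.zeros([4, 4, 64, 5])            # 4x4 "classifier" filter, 5 regression outputs

out = tf.nn.conv2d(features, W, strides=[1, 4, 4, 1], padding='VALID')
print(out.shape)                       # (1, 4, 4, 5): 5 values for each 4x4 square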
You probably would like to read about basic methods for semantic segmentation.
Also see basic fully conv nets.

4D Tensor Shape

I was reading a Machine Learning book and came across this in the CNN chapter.
The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn′, fn]. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape [fn]. Here fh is the height of the receptive field, fw is the width of the receptive field, fn′ is the number of feature maps in the previous layer, and fn is the number of feature maps in the current layer.
I am trying to understand what each number in the given order signifies. Is it creating a rank 4 matrix where each entry represents the weight connecting an output neuron from the previous layer, with a specified feature map and location in the receptive field, to the current output neuron?
fn':
It represents the number of channels in the previous layer, which also indirectly specifies the depth (number of channels) of each kernel in the current layer.
fn:
It represents the number of feature maps in the current layer, i.e. the number of different kernels in the current layer, because each kernel outputs a single channel.
fw:
It represents the kernel width.
fh:
It represents the kernel height.
Suppose [fh, fw, fn′, fn] = [3, 3, 10, 20]; then the layer (weight) will be of size 20x10x3x3. Each kernel will be of size 10x3x3 (where 3x3 is spatial and 10 is the depth) and there will be 20 such kernels. These kernels operate on the previous 10 feature maps to output 20 feature maps.
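A quick way to check that example numerically, using TensorFlow's [fh, fw, fn′, fn] filter convention (the 32x32 spatial size is arbitrary):
import tensorflow as tf

# [fh, fw, fn', fn] = [3, 3, 10, 20]
W = tf.zeros([3, 3, 10, 20])       # 20 kernels, each 3x3 spatially over 10 input maps
prev = tf.zeros([1, 32, 32, 10])   # hypothetical 32x32 feature maps from the previous layer

curr = tf.nn.conv2d(prev, W, strides=[1, 1, 1, 1], padding='SAME')
print(curr.shape)                  # (1, 32, 32, 20)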
And each weight entry in this 4D matrix is shared. It doesn't connect neurons one to one, because of the convolution. Convolution's main advantages are precisely parameter sharing and receptive fields/local connectivity.

After going through convolution steps, what should be the shape of the tensor in fully connected layer?

So let's assume that I have RGB images of shape [128,128,3], and I want to create a CNN with two Conv-ReLU-MaxPool layers as below.
def cnn(input_data):
    # conv1
    conv1_weight = tf.Variable(tf.truncated_normal([4, 4, 3, 25], stddev=0.1), dtype=tf.float32)
    conv1_bias = tf.Variable(tf.zeros([25]), dtype=tf.float32)
    conv1 = tf.nn.conv2d(input_data, conv1_weight, [1, 1, 1, 1], 'SAME')
    relu1 = tf.nn.relu(tf.add(conv1, conv1_bias))
    max_pool1 = tf.nn.max_pool(relu1, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
    # conv2
    conv2_weight = tf.Variable(tf.truncated_normal([4, 4, 25, 50], stddev=0.1), dtype=tf.float32)
    conv2_bias = tf.Variable(tf.zeros([50]), dtype=tf.float32)
    conv2 = tf.nn.conv2d(max_pool1, conv2_weight, [1, 1, 1, 1], 'SAME')
    relu2 = tf.nn.relu(tf.add(conv2, conv2_bias))
    max_pool2 = tf.nn.max_pool(relu2, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
After this step, I need to transform the output into a 1xN layer for the next fully connected layer. However, I am not sure how I should determine what N is in 1xN. Is there a specific formula involving the layer size, strides, max pool size, image size, etc.? I am pretty lost in this phase of the problem, even though I think I get the intuition behind a CNN.
I understand that you want to transform the multiple 2D feature maps that come out of the last convolutional/pooling layer to a vector that can be fed into a fully-connected layer. Or to be precise and include the batch dimension, go from shape [batch, width, height, feature_maps] to [batch, N].
The above already implies that N = width * height * feature_maps, since reshaping keeps the overall number of elements the same. width and height depend on the size of your inputs and the strides of your network layers (convolution and/or pooling).
A stride of x simply divides the size by x. You have inputs of size 128 in each dimension, and two pooling layers with stride 2. Thus after the first pooling layer your images are 64x64 and after the second they are 32x32, so width = height = 32. Normally we would have to account for padding as well but the point of SAME padding is precisely that we don't have to worry about that.
Finally, feature_maps is 50 since that is how many filters your last convolutional layer has (pooling doesn't modify this). So N = 32*32*50 = 51200.
Thus, you should be able to do tf.reshape(max_pool2, [-1, 51200]) (or tf.reshape(max_pool2, [-1, 32*32*50]) to keep it more interpretable) and feed the resulting 2D tensor through a fully-connected layer (i.e. tf.matmul).
The simplest way would be to just use tf.layers.flatten(max_pool2). This function does all the above for you and just gives you the [batch, N] result.
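A minimal sketch of that last step, using the TF 1.x-style API from the question (the stand-in shapes and the 1024-unit layer are arbitrary assumptions):
import tensorflow as tf

max_pool2 = tf.zeros([8, 32, 32, 50])   # stand-in for the pooled output discussed above

flat = tf.reshape(max_pool2, [-1, 32 * 32 * 50])            # shape [8, 51200]
fc_weight = tf.Variable(tf.truncated_normal([32 * 32 * 50, 1024], stddev=0.1))
fc_bias = tf.Variable(tf.zeros([1024]))
fc = tf.nn.relu(tf.matmul(flat, fc_weight) + fc_bias)       # shape [8, 1024]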
First of all, since you are starting out, I would recommend Keras instead of pure TensorFlow. To answer your question regarding the shape, refer to this blog by Andrej Karpathy.
Quote from the blog:
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
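That formula is easy to encode in a tiny helper (hypothetical function, added here only to illustrate):
def conv_output_size(W, F, S, P):
    """Spatial output size: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(7, 3, 1, 0))   # 5 -- 7x7 input, 3x3 filter, stride 1, pad 0
print(conv_output_size(7, 3, 2, 0))   # 3 -- same but stride 2
print(conv_output_size(7, 3, 1, 1))   # 7 -- pad 1 preserves the spatial size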
Now coming to your TensorFlow implementation:
For the conv1 stage you have a 4x4 filter with 25 output channels. Since you have used padding="SAME" for conv1 and maxpooling1, your output 2D spatial dimensions are the same as the input in both cases. That is, after conv1 your output size is 128*128*25, and for the same reason the output of your maxpool1 layer keeps the same spatial size. Since you have also given padding="SAME" for conv2, its output shape is 128*128*50 (you changed the number of output channels). Thus after maxpool2 your dimensions are batch_size, 128, 128, 50. So before adding a Dense layer you have 3 major options:
1) Flattening the tensor results in a shape: batch_size, 128*128*50
2) Global average pooling results in a shape: batch_size, 50
3) Global max pooling also results in a shape: batch_size, 50
Note:
A global average pooling layer is similar to average pooling, but we average over the entire feature map instead of a window, hence the name global. For example, in your case you have batch_size, 128, 128, 50 as your dimensions. This means you have 50 feature maps with spatial dimensions 128*128. What global average pooling does is average each 128*128 feature map down to a single number, so you end up with 50 values in total. This is very useful in designing fully convolutional architectures like Inception, ResNet etc., because it makes the network's input generic, meaning you can send any image size as input to the network. Global max pooling is very similar, but the slight difference is that it takes the max value of the feature map instead of the average.
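A quick sketch of global average/max pooling on those dimensions (illustrative shapes only):
import tensorflow as tf

x = tf.zeros([8, 128, 128, 50])          # batch of 50 feature maps, each 128x128

gap = tf.reduce_mean(x, axis=[1, 2])     # global average pooling -> shape (8, 50)
gmp = tf.reduce_max(x, axis=[1, 2])      # global max pooling     -> shape (8, 50)
print(gap.shape, gmp.shape)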
Problems with this architecture:
Generally it is not recommended to use padding="SAME" in max pooling layers. If you look at the source code of VGG16, you will see that after each block (conv, ReLU and max pooling) the input size is halved. Thus the general structure is that you reduce the spatial dimensions while increasing the depth/channels.
Flattening the layer:
var_name = tf.layers.flatten(max_pool2)
Should work, and it's what almost every example of a Tensorflow CNN uses.
