Why Convolution function in MXnet have a kernel parameter

Why Convolution function in MXnet have a kernel parameter - python

I am new of mxnet, in the official doc, the generation of a convolution layer could be
conv = nd.Convolution(data=data, weight=W, bias=b, kernel=(3,3), num_filter=10)
But it is required that the weight parameter needs to take a 4-D tensor
W = [weight_num, stride, kernel_height, kernel_width]
So why we still need to set a kernel parameter in Convolution function?

kernel parameter sets up the kernel size, which can be either:
(width,) - for 1D convolution
(height, width) - for 2D convolution
(depth, height, width) - for 3D convolution
It only defines shapes.
The weight and bias parameters contain actual parameters that are going to be trained. The actual values are going to be here.
While you could probably figure out kernel (shapes) by provided weight, it is more defensive to ask to provide kernel shape explicitly instead of trying figuring it out based on parameters passed to weight.
Here is an example of 2D convolution:
# shape is batch_size x channels x height x width
x = mx.nd.random.uniform(shape=(100, 1, 9, 9))
# kernel is just 3 x 3,
# weight is num_filter x channels x kernel_height x kernel_width
# bias is num_filter
mx.nd.Convolution(data=x,
kernel=(3, 3),
num_filter=5,
weight=mx.nd.random.uniform(shape=(5, 1, 3, 3)),
bias=mx.nd.random.uniform(shape=(5,)))
The documentation explaining various shapes of parameters in case of 1D, 2D or 3D convolutions is quite good: https://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.Convolution

Related

Understanding output shape for 1D convolution

I'm trying to get my head around 1D convolution - specifically, how the padding comes into it.
Suppose I have an input sequence of shape (batch,128,1) and run it through the following Keras layer:
tf.keras.layers.Conv1D(32, 5, strides=2, padding="same")
I get an output of shape (batch,64,32), but I don't understand why the sequence length has reduced from 128 to 64... I thought the padding="same" parameter kept the output length the same as the input? I suppose that's only true if strides=1; so in this case I'm confused about what padding="same" actually means.

According to the TensorFlow documents in your case we have:
filters (Number of filters - output dimension) = 32
kernelSize (The filter size) = 5
strides (The unit to move in input data by the convolution layer in each dimensions after applying each convolution) = 2
So applying input in shape (batch, 128, 1) will be to apply 32 kernels (in shape 5) and jump two unit after each convolution - so we have 128 / 2 = 64 value corresponding to each filter and at the end output would be in shape (batch, 64, 32).
padding="same" is just determining the the convolution on borders. For more details you can check here.

How to calculate output dimensions in CNN if you specify number of outputs

I am having trouble figuring out what the dimensions of each CNN layer is.
Let's say my input is a vector which I then projected onto a 4x4x256 matrix using a fully-connected layer as so...
zP = slim.fully_connected(
z,
4*4*256,
normalizer_fn=slim.batch_norm,
activation_fn=tf.nn.relu,
scope='g_project',
weights_initializer=initializer
)
# Layer is reshaped to a 4x4x256 mapping.
zCon = tf.reshape(zP,[-1,4,4,256])
Where z was my original vector. I then take this 4x4x256 matrix and feed it into a CNN...
gen1 = slim.convolution2d_transpose(
zCon,
num_outputs=64,
kernel_size=[5,5],
stride=[2,2],
padding="SAME",
normalizer_fn=slim.batch_norm,
activation_fn=tf.nn.relu,
scope='g_conv1',
weights_initializer=initializer
)
As you can see I used a convolutional 2d transpose and I specified the output as 64, with a stride of 2 and a filter size of 5. This means that I know one of my dimension will be 64, however I do not know what the other 2 dimensions will be and I do not know how to calculate it.
I tried using the following formula but it is not working out for me...
How can I calculate the remaining dimensions?

The formula you have written is for the Convolution operation, since you need to calculate for the transposed convolution where the shapes are inverse of convolution, the formula can be derived from the above equation by re-arranging the terms as:
W = (Out-1)*S + F - 2P
W is your actual output and Out is your actual input to the transpose convolution.

After going through convolution steps, what should be the shape of the tensor in fully connected layer?

So let's assume that I have RGB images of shape [128,128,3], I want to create a CNN with two Conv-ReLu-MaxPool layers as below.
def cnn(input_data):
#conv1
conv1_weight = tf.Variable(tf.truncated_normal([4,4,3,25], stddev=0.1,),tf.float32)
conv1_bias = tf.Variable(tf.zeros([25]), tf.float32)
conv1 = tf.nn.conv2d(input_data, conv1_weight, [1,1,1,1], 'SAME')
relu1 = tf.nn.relu(tf.nn.add(conv1, conv1_bias))
max_pool1 = tf.nn.max_pool(relu1, [1,2,2,1], [1,1,1,1], 'SAME')
#conv2
conv2_weight = tf.Variable(tf.truncated_normal([4,4,25,50]),0.1,tf.float32)
conv2_bias = tf.Variable(tf.zeros([50]), tf.float32)
conv2 = tf.nn.conv2d(max_pool1, conv2_weight, [1,1,1,1], 'SAME')
relu2 = tf.nn.relu(tf.nn.add(conv2, conv2_bias))
max_pool2 = tf.nn.max_pool(relu2, [1,2,2,1], [1,1,1,1], 'SAME')
After this step, I need to transform the output into 1xN layer for the next fully connected layer. However, I am not sure how I should determine what N is in 1xN. Is there a specific formula including the layer size, strides, max pool size, image size etc? I am pretty lost in this phase of the problem even though I think that I get the intuition behind a CNN.

I understand that you want to transform the multiple 2D feature maps that come out of the last convolutional/pooling layer to a vector that can be fed into a fully-connected layer. Or to be precise and include the batch dimension, go from shape [batch, width, height, feature_maps] to [batch, N].
The above already implies that N = batch * width * height since reshaping keeps the overall number of elements the same. width and height depend on the size of your inputs and the strides of your network layers (convolution and/or pooling).
A stride of x simply divides the size by x. You have inputs of size 128 in each dimension, and two pooling layers with stride 2. Thus after the first pooling layer your images are 64x64 and after the second they are 32x32, so width = height = 32. Normally we would have to account for padding as well but the point of SAME padding is precisely that we don't have to worry about that.
Finally, feature_maps is 50 since that is how many filters your last convolutional layer has (pooling doesn't modify this). So N = 32*32*50 = 51200.
Thus, you should be able to do tf.reshape(max_pool2, [-1, 51200]) (or tf.reshape(max_pool2, [-1, 32*32*50]) to keep it more interpretable) and feed the resulting 2D tensor through a fully-connected layer (i.e. tf.matmul).
The simplest way would be to just use tf.layers.flatten(max_pool2). This function does all the above for you and just gives you the [batch, N] result.

First of all since you are starting out, I would recommend Keras instead of pure tensorflow. And to answer your question regarding the shape refer this blog by Andrej karpathy
Quote from the blog:
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
Now coming to your tensorflow's implementation:
For the conv1 stage you have given a 4*4 filter having a depth of 25. Since you have used padding="SAME" for conv1 and maxpooling1 your output 2D spatial dimensions will be same as input for both the cases. That is after conv1 your output size is: 128*128*25. For the same reason the output of your maxpool1 layer is also the same. Since you have given padding to be "SAME" for the second conv2 also your output shape is 128*128*50(you changed the output channels). Thus after maxpool2 your dimensions are: batch_size, 128*128*50. Thus before adding Dense layer you have 3 major options:
1) flatten the tensor results in a shape : batch_size, 128*128*50
2) global average pooling results in a shape : batch_size, 50
3) global max pooling also results in a shape : batch_size, 50.
Note:
global average pooling layer is similar to average pooling but, we average the entire feature map instead of a window. Hence the name global. For example: in your case you have batch_size, 128,128,50 as your dimensions. This means you have 50 feature maps with spatial dimensions 128*128. What global average pooling does is that, it
Averages the 128*128 feature map to give a single number. Thus you will have 50 values in total. This is very useful in designing fully convolutional architectures like inception, resnet etc. Because, this makes the network's input generic meaning you can send any image size as input to the network. Global max pooling is very similar to above but the slight difference is it finds the max value of the feature map instead of average.
Problems with this architecture:
Generally it is not recommended to use padding = "SAME" in maxpooling layers. If you see the source code of vgg16 you will see that after each block (conv relu and maxpooling) the input size is halved. Thus the general structure is you reduce the spatial dimension while increasing the depth/channels.

Flattening the layer:
var_name = tf.layers.flatten(max_pool2)
Should work, and it's what almost every example of a Tensorflow CNN uses.

Tensorflow convolution

I'm trying to perform a convolution (conv2d) on images of variable dimensions. I have those images in form of an 1-D array and I want to perform a convolution on them, but I have a lot of troubles with the shapes.
This is my code of the conv2d:
tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
where x is the input image.
The error is:
ValueError: Shape must be rank 4 but is rank 1 for 'Conv2D' (op: 'Conv2D') with input shapes: [1], [5,5,1,32].
I think I might reshape x, but I don't know the right dimensions. When I try this code:
x = tf.reshape(self.x, shape=[-1, 5, 5, 1]) # example
I get this:
ValueError: Dimension size must be evenly divisible by 25 but is 1 for 'Reshape' (op: 'Reshape') with input shapes: [1], [4] and with input tensors computed as partial shapes: input[1] = [?,5,5,1].

You can't use conv2d with a tensor of rank 1. Here's the description from the doc:
Computes a 2-D convolution given 4-D input and filter tensors.
These four dimensions are [batch, height, width, channels] (as Engineero already wrote).
If you don't know the dimensions of the image in advance, tensorflow allows to provide a dynamic shape:
x = tf.placeholder(tf.float32, shape=[None, None, None, 3], name='x')
with tf.Session() as session:
print session.run(x, feed_dict={x: data})
In this example, a 4-D tensor x is created, but only the number of channels is known statically (3), everything else is determined on runtime. So you can pass this x into conv2d, even if the size is dynamic.
But there's another problem. You didn't say your task, but if you're building a convolutional neural network, I'm afraid, you'll need to know the size of the input to determine the size of FC layer after all pooling operations - this size must be static. If this is the case, I think the best solution is actually to scale your inputs to a common size before passing it into a convolutional network.
UPD:
Since it wasn't clear, here's how you can reshape any image into 4-D array.
a = np.zeros([50, 178, 3])
shape = a.shape
print shape # prints (50, 178, 3)
a = a.reshape([1] + list(shape))
print a.shape # prints (1, 50, 178, 3)

tensorflow - understanding tensor shapes for convolution

Currently trying to work my way through the Tensorflow MNIST tutorial for convolutional networks and I could use some help with understanding the dimensions of the darn tensors.
So we have images of 28x28 pixels in size.
The convolution will compute 32 features for each 5x5 patch.
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels.
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
If you say so ...
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
Alright, now I'm getting lost.
Judging by this last reshape, we have
"howevermany" 28x28x1 "blocks" of pixels that are our images.
I guess this makes sense because the images are in greyscale
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.
The x32 makes sense, I guess, if we want to infer 32 features per patch
The rest, though, I'm not terribly convinced by.
Why does the weight tensor look the way it apparently does?
(For completeness: we use them
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
where
def conv2d(x,W):
'''
2D convolution, expects 4D input x and filter matrix W
'''
return tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME')
def max_pool_2x2(x):
'''
max-pooling, using 2x2 patches
'''
return tf.nn.max_pool(x,ksize=[1,2,2,1], strides=[1,2,2,1],padding='SAME')
)

Your input tensor has the shape [-1,28,28,1]. Like you mention, the last dimension is 1 because the images are in greyscale. The first index is the batchsize. The convolution will process every image in the batch independently, therefore the batchsize has no influence on the convolution-weight-tensor dimensions, or, in fact, no influence on any weight-tensor dimensions in the network. That is why the batchsize can be arbitrary (-1 signifies arbitrary size in tensorflow).
Now to the weight tensor; you don't have five of 5x1x32-blocks, you rather have 32 of 5x5x1-blocks. Each represents one feature. The 1 is the depth of the patch and is 1 due to the gray scale (it would be 5x5x3x32 for color images). The 5x5 is the size of the patch.
The ordering of dimensions in the data tensors is different from the ordering of dimensions in the convolution weight tensors.

Beside the other answer, I would like to add some more points,
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
There is no specific reason why we choose 5x5 patches or 32 features, all of this parameters are experienced (except in some cases), you may use 3x3 patches or larger feature size.
I said 'except in some cases', because may we use 3x3 patches to catch information from images in more details, or larger feature size to learn each image in more details ('larger' and 'more details' are relative terms in this case).
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.
Not exactly, but the weight tensor is not a collection it is only a filter with size 5x5 and input channel 1 and output feature (channel) 32
Why does the weight tensor look the way it apparently does?
The weight tensor weight_variable([5, 5, 1, 32]) tells I have 5x5 patch size to apply on an image, I have 1 input feature (since images are in grayscale) and 32 output feature (channel).
More Details:
So this line tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME') takes input x as [-1,28,28,1], -1 means you can put in this dimension any size you want (batch size), 28,28 shows input size, and it must be exactly 28x82, and the last 1 shows the number of input channel, since the mnist images are grayscale so it is 1, in more details it says input image is a 28x28 2D matrix and each cell of matrix shows a value which indicates the grayscale intensity. If input images were RGB so we should have 3 channel instead 1, and this 3 channel says input image is a 28x28x3 3D matrix, the cells in the first dimension of 3 shows the intensity of Red color, the second dimension of 3 shows the intensity of Green color and the other shows Blue color.
Now tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME') takes x and apply W ( which is a 3x3 patches and apply whis patch on 28x28 image with step size 1 (since stride is 1) and give the result image again in size 28x28 because we use padding='SAME'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.