What's the difference between conv2d(SAME) and tf.pad + conv2d(VALID)?

What's the difference between conv2d(SAME) and tf.pad + conv2d(VALID)? - python

I'm almost new to tensorflow, and when I learn tensorflow through some tutorials, i've read the following codes:
if stride == 1:
return slim.conv2d(inputs, num_outputs, kernel_size, stride=1, padding='SAME', scope=scope)
else:
pad_total = kernel_size - 1
pad_beg = pad_total // 2
pad_end = pad_total - pad_beg
inputs = tf.pad(inputs, [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]])
return slim.conv2d(inputs, num_outputs, kernel_size, stride=stride, padding='VALID', scope=scope)
However, i also learn that, "SAME" padding means the output data has the same size with the input data, while "VALID" means different, and the the method of tf.pad also pad zero manually, so is there any difference between these two methods? Or what's the purpose of this tf.pad?

In many real-word use-cases, there is no difference.
For instance, in some imagenet architectures, we often pad with 1, then do a 3x3 convolution. Behaviour of the network would be the same if you first zero-pad with 1, then convolve, or if you convolve with "same" padding.
However, behaviour will be different in non-standard situations. Remember that you can define kernel size AND stride AND dilation rate at a convolution layer.
A counterexample where there is a difference between conv2d(SAM) and a symmetric tf.pad +conv2d(VALID):
Input: (7,7,1)
Kernel: (4,4)
Stride: (2,2)
conv2d(SAME) here would be the same as tf.pad(0 pixel left/top, 1 pixel right/bottom), and would yield a (3,3,1) output.

Related

The input to the CNN of Conv1D

I'm working in the field of machine learning.
For the stronger Network, I'm going to adopt the techniques concerning Conv1D.
The input data is an one-dimension list data so I just would've thought that Conv1D is the best choice.
What would happen if the input size is (1, 740)? Would it be okay the input channel is 1?
I mean,I have a feeling that the (1, 740) tensor's conv1D output should be the same with that of a simple Linear networks.
Of course I'll also include other conv1d layer, like below.
self.conv1 = torch.nn.Conv1d(in_channels=1, out_channels=64, kernel_size=5)
self.conv2 = torch.nn.Conv1d(in_channels=64,out_channels=64, kernel_size=5)
self.conv3 = torch.nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5)
self.conv4 = torch.nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5)
Would it make sense when an input channel is 1?
Thanks in advance. :)

I think it's fine.
Note that the input of Conv1D should be (B, N, M), where B is the batch size, N is the number of channels (e.g. for RGB is 3) and M is the number of features.
The out_channels refers to the number of 5x5 filters to use. look at the output shape of the following code:
k = nn.Conv1d(1,64,kernel_size=5)
input = torch.randn(1, 1, 740)
print(k(input).shape) # -> torch.Size([1, 64, 736])
The 736 is the result of not using padding the dimension isn't kept.

The nn.Conv1d layer takes an input of shape (b, c, w) (where b is the batch size, c the number of channels, and w the input width). Its kernel size is one-dimensional. It performs a convolution operation over the input dimension (batch and channel axes aside). This means the kernel will apply the same operation over the whole input (wether 1D, 2D, or 3D). Like a 'sliding window'. As such, it only has kernel_size parameters. This is the main characteristic of a convolution layer.
Conv1d allows to extract features on the input regardless of where it's located in the input data: at the beginning or at the end of your w-width input. This would make sense if your input is temporal (input sequence over time) or spatial data (an image).
On the other hand, a nn.Linear takes a 1D tensor as input and returns another 1D tensor. You could consider w to be the number of neurons. You would end up having w*output_dim parameters. If your input contains components which are independant from one another (like a One/Multi-Hot-Encoding) then a fully connected layer as nn.Linear implements would be prefered.
These two behave differently. When using a nn.Linear - in scenarios where you should use a nn.Conv1d - your training would ideally result in having neurons of equal weights, if that makes sense... but you probably won't. Fully-densely-connected layers were used in the past in deep learning for computer vision. Today convolutions are used because there are much more efficient and suitable for these types of tasks.

Setting custom kernel for CNN in pytorch

Is there a way to specify our own custom kernel values for a convolution neural network in pytorch? Something like kernel_initialiser in tensorflow? Eg. I want a 3x3 kernel in nn.Conv2d with initialization so that it acts as a identity kernel -
0 0 0
0 1 0
0 0 0
(this will effectively return the same output as my input in the very first iteration)
My non-exhaustive research on the subject -
I could use nn.init but it only has some pre-defined kernel initialisaition values.
I tried to follow the discussion on their official thread but it doesn't suit my needs.
I might have missed something in my research please feel free to point out.

I think an easier solution is to :
deconv = nn.ConvTranspose2d(
in_channels=channel_dim, out_channels=channel_dim,
kernel_size=kernel_size, stride=stride,
bias=False, padding=1, output_padding=1
)
deconv.weight.data.copy_(
get_upsampling_weight(channel_dim, channel_dim, kernel_size)
)
in other words use copy_

Thanks to ptrblck I was able to solve it.
I can define a new convolution layer as conv and as per the example I can set the identity kernel using -
weights = ch.Tensor([[0, 0, 0], [0, 1, 0], [0, 0, 0]]).unsqueeze(0).unsqueeze(0)
weights.requires_grad = True
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1, bias=False)
with ch.no_grad():
conv.weight = nn.Parameter(weights)
I can then continue to use conv as my regular nn.Conv2d layer.

After going through convolution steps, what should be the shape of the tensor in fully connected layer?

So let's assume that I have RGB images of shape [128,128,3], I want to create a CNN with two Conv-ReLu-MaxPool layers as below.
def cnn(input_data):
#conv1
conv1_weight = tf.Variable(tf.truncated_normal([4,4,3,25], stddev=0.1,),tf.float32)
conv1_bias = tf.Variable(tf.zeros([25]), tf.float32)
conv1 = tf.nn.conv2d(input_data, conv1_weight, [1,1,1,1], 'SAME')
relu1 = tf.nn.relu(tf.nn.add(conv1, conv1_bias))
max_pool1 = tf.nn.max_pool(relu1, [1,2,2,1], [1,1,1,1], 'SAME')
#conv2
conv2_weight = tf.Variable(tf.truncated_normal([4,4,25,50]),0.1,tf.float32)
conv2_bias = tf.Variable(tf.zeros([50]), tf.float32)
conv2 = tf.nn.conv2d(max_pool1, conv2_weight, [1,1,1,1], 'SAME')
relu2 = tf.nn.relu(tf.nn.add(conv2, conv2_bias))
max_pool2 = tf.nn.max_pool(relu2, [1,2,2,1], [1,1,1,1], 'SAME')
After this step, I need to transform the output into 1xN layer for the next fully connected layer. However, I am not sure how I should determine what N is in 1xN. Is there a specific formula including the layer size, strides, max pool size, image size etc? I am pretty lost in this phase of the problem even though I think that I get the intuition behind a CNN.

I understand that you want to transform the multiple 2D feature maps that come out of the last convolutional/pooling layer to a vector that can be fed into a fully-connected layer. Or to be precise and include the batch dimension, go from shape [batch, width, height, feature_maps] to [batch, N].
The above already implies that N = batch * width * height since reshaping keeps the overall number of elements the same. width and height depend on the size of your inputs and the strides of your network layers (convolution and/or pooling).
A stride of x simply divides the size by x. You have inputs of size 128 in each dimension, and two pooling layers with stride 2. Thus after the first pooling layer your images are 64x64 and after the second they are 32x32, so width = height = 32. Normally we would have to account for padding as well but the point of SAME padding is precisely that we don't have to worry about that.
Finally, feature_maps is 50 since that is how many filters your last convolutional layer has (pooling doesn't modify this). So N = 32*32*50 = 51200.
Thus, you should be able to do tf.reshape(max_pool2, [-1, 51200]) (or tf.reshape(max_pool2, [-1, 32*32*50]) to keep it more interpretable) and feed the resulting 2D tensor through a fully-connected layer (i.e. tf.matmul).
The simplest way would be to just use tf.layers.flatten(max_pool2). This function does all the above for you and just gives you the [batch, N] result.

First of all since you are starting out, I would recommend Keras instead of pure tensorflow. And to answer your question regarding the shape refer this blog by Andrej karpathy
Quote from the blog:
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
Now coming to your tensorflow's implementation:
For the conv1 stage you have given a 4*4 filter having a depth of 25. Since you have used padding="SAME" for conv1 and maxpooling1 your output 2D spatial dimensions will be same as input for both the cases. That is after conv1 your output size is: 128*128*25. For the same reason the output of your maxpool1 layer is also the same. Since you have given padding to be "SAME" for the second conv2 also your output shape is 128*128*50(you changed the output channels). Thus after maxpool2 your dimensions are: batch_size, 128*128*50. Thus before adding Dense layer you have 3 major options:
1) flatten the tensor results in a shape : batch_size, 128*128*50
2) global average pooling results in a shape : batch_size, 50
3) global max pooling also results in a shape : batch_size, 50.
Note:
global average pooling layer is similar to average pooling but, we average the entire feature map instead of a window. Hence the name global. For example: in your case you have batch_size, 128,128,50 as your dimensions. This means you have 50 feature maps with spatial dimensions 128*128. What global average pooling does is that, it
Averages the 128*128 feature map to give a single number. Thus you will have 50 values in total. This is very useful in designing fully convolutional architectures like inception, resnet etc. Because, this makes the network's input generic meaning you can send any image size as input to the network. Global max pooling is very similar to above but the slight difference is it finds the max value of the feature map instead of average.
Problems with this architecture:
Generally it is not recommended to use padding = "SAME" in maxpooling layers. If you see the source code of vgg16 you will see that after each block (conv relu and maxpooling) the input size is halved. Thus the general structure is you reduce the spatial dimension while increasing the depth/channels.

Flattening the layer:
var_name = tf.layers.flatten(max_pool2)
Should work, and it's what almost every example of a Tensorflow CNN uses.

Batch normalization with 3D convolutions in TensorFlow

I'm implementing a model relying on 3D convolutions (for a task that is similar to action recognition) and I want to use batch normalization (see [Ioffe & Szegedy 2015]). I could not find any tutorial focusing on 3D convs, hence I'm making a short one here which I'd like to review with you.
The code below refers to TensorFlow r0.12 and it explicitly instances variables - I mean I'm not using tf.contrib.learn except for the tf.contrib.layers.batch_norm() function. I'm doing this both to better understand how things work under the hood and to have more implementation freedom (e.g., variable summaries).
I will get to the 3D convolution case smoothly by first writing the example for a fully-connected layer, then for a 2D convolution and finally for the 3D case. While going through the code, it would be great if you could check if everything is done correctly - the code runs, but I'm not 100% sure about the way I apply batch normalization. I end this post with a more detailed question.
import tensorflow as tf
# This flag is used to allow/prevent batch normalization params updates
# depending on whether the model is being trained or used for prediction.
training = tf.placeholder_with_default(True, shape=())
Fully-connected (FC) case
# Input.
INPUT_SIZE = 512
u = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))
# FC params: weights only, no bias as per [Ioffe & Szegedy 2015].
FC_OUTPUT_LAYER_SIZE = 1024
w = tf.Variable(tf.truncated_normal(
[INPUT_SIZE, FC_OUTPUT_LAYER_SIZE], dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
fc = tf.matmul(u, w)
# Batch normalization.
fc_bn = tf.contrib.layers.batch_norm(
fc,
center=True,
scale=True,
is_training=training,
scope='fc-batch_norm')
# Activation function.
fc_bn_relu = tf.nn.relu(fc_bn)
print(fc_bn_relu) # Tensor("Relu:0", shape=(?, 1024), dtype=float32)
2D convolutional (CNN) layer case
# Input: 640x480 RGB images (whitened input, hence tf.float32).
INPUT_HEIGHT = 480
INPUT_WIDTH = 640
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: wights only, no bias as per [Ioffe & Szegedy 2015].
CNN_FILTER_HEIGHT = 3 # Space dimension.
CNN_FILTER_WIDTH = 3 # Space dimension.
CNN_FILTERS = 128
w = tf.Variable(tf.truncated_normal(
[CNN_FILTER_HEIGHT, CNN_FILTER_WIDTH, INPUT_CHANNELS, CNN_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN_LAYER_STRIDE_VERTICAL = 1
CNN_LAYER_STRIDE_HORIZONTAL = 1
CNN_LAYER_PADDING = 'SAME'
cnn = tf.nn.conv2d(
input=u, filter=w,
strides=[1, CNN_LAYER_STRIDE_VERTICAL, CNN_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN_LAYER_PADDING)
# Batch normalization.
cnn_bn = tf.contrib.layers.batch_norm(
cnn,
data_format='NHWC', # Matching the "cnn" tensor which has shape (?, 480, 640, 128).
center=True,
scale=True,
is_training=training,
scope='cnn-batch_norm')
# Activation function.
cnn_bn_relu = tf.nn.relu(cnn_bn)
print(cnn_bn_relu) # Tensor("Relu_1:0", shape=(?, 480, 640, 128), dtype=float32)
3D convolutional (CNN3D) layer case
# Input: sequence of 9 160x120 RGB images (whitened input, hence tf.float32).
INPUT_SEQ_LENGTH = 9
INPUT_HEIGHT = 120
INPUT_WIDTH = 160
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_SEQ_LENGTH, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: wights only, no bias as per [Ioffe & Szegedy 2015].
CNN3D_FILTER_LENGHT = 3 # Time dimension.
CNN3D_FILTER_HEIGHT = 3 # Space dimension.
CNN3D_FILTER_WIDTH = 3 # Space dimension.
CNN3D_FILTERS = 96
w = tf.Variable(tf.truncated_normal(
[CNN3D_FILTER_LENGHT, CNN3D_FILTER_HEIGHT, CNN3D_FILTER_WIDTH, INPUT_CHANNELS, CNN3D_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN3D_LAYER_STRIDE_TEMPORAL = 1
CNN3D_LAYER_STRIDE_VERTICAL = 1
CNN3D_LAYER_STRIDE_HORIZONTAL = 1
CNN3D_LAYER_PADDING = 'SAME'
cnn3d = tf.nn.conv3d(
input=u, filter=w,
strides=[1, CNN3D_LAYER_STRIDE_TEMPORAL, CNN3D_LAYER_STRIDE_VERTICAL, CNN3D_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN3D_LAYER_PADDING)
# Batch normalization.
cnn3d_bn = tf.contrib.layers.batch_norm(
cnn3d,
data_format='NHWC', # Matching the "cnn" tensor which has shape (?, 9, 120, 160, 96).
center=True,
scale=True,
is_training=training,
scope='cnn3d-batch_norm')
# Activation function.
cnn3d_bn_relu = tf.nn.relu(cnn3d_bn)
print(cnn3d_bn_relu) # Tensor("Relu_2:0", shape=(?, 9, 120, 160, 96), dtype=float32)
What I would like to make sure is whether the code above exactly implements batch normalization as described in [Ioffe & Szegedy 2015] at the end of Sec. 3.2:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. [...] Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
UPDATE
I guess the code above is also correct for the 3D conv case. In fact, when I define my model if I print all the trainable variables, I also see the expected numbers of beta and gamma variables. For instance:
Tensor("conv3a/conv3d_weights/read:0", shape=(3, 3, 3, 128, 256), dtype=float32)
Tensor("BatchNorm_2/beta/read:0", shape=(256,), dtype=float32)
Tensor("BatchNorm_2/gamma/read:0", shape=(256,), dtype=float32)
This looks ok to me since due to BN, one pair of beta and gamma are learned for each feature map (256 in total).
[Ioffe & Szegedy 2015]: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

That is a great post about 3D batchnorm, it's often unnoticed that batchnorm can be applied to any tensor of rank greater than 1. Your code is correct, but I couldn't help but add a few important notes on this:
A "standard" 2D batchnorm (accepts a 4D tensor) can be significantly faster in tensorflow than 3D or higher, because it supports fused_batch_norm implementation, which applies one kernel operation:
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process
that for some models makes up a large percentage of the operation
time. Using fused batch norm can result in a 12%-30% speedup.
There is an issue on GitHub to support 3D filters as well, but there hasn't been any recent activity and at this point the issue is closed unresolved.
Although the original paper prescribes using batchnorm before ReLU activation (and that's what you did in the code above), there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu
before BN. It is still occasionally a topic of debate, though.
For anyone interested to apply the idea of normalization in practice, there's been recent research developments of this idea, namely weight normalization and layer normalization, which fix certain disadvantages of original batchnorm, for example they work better for LSTM and recurrent networks.

Implementing CNN with tensorflow

I'm new in convolutional neural networks and in Tensorflow and I need to implement a conv layer with further parameters:
Conv. layer1: filter=11, channel=64, stride=4, Relu.
The API is following:
tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, data_format=None, name=None)
I understand, what is stride and that it should be [1, 4, 4, 1] in my case. But I do not understand, how should I pass a filter parameter and padding.
Could someone help with it?

At first, you need to create a filter variable:
W = tf.Variable(tf.truncated_normal(shape = [11, 11, 3, 64], stddev = 0.1), tf.float32)
First two fields of shape parameter stand for filter size, third for the number of input channels (I guess your images have 3 channels) and fourth for the number of output channels.
Now output of convolutional layer could be computed as follows:
conv1 = tf.nn.conv2d(input, W, strides = [1, 4, 4, 1], padding = 'SAME'), where padding = 'SAME' stands for zero padding and therefore size of the image remains the same, input should have size [batch, size1, size2, 3].
ReLU application is pretty straightforward:
conv1 = tf.nn.relu(conv1)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.