I am trying to write my own CNN code for the CIFAR-10 dataset. I have completed the feed-forward pass and started on the back-propagation. I had no problem back-propagating through the fully connected layers or through unpooling, but I am now stuck at back-propagating through a convolution layer. My input is a (32, 32, 3) RGB image.
This is my network so far.
Conv Layer + Relu -> Maxpool -> Conv Layer + Relu -> Maxpool -> Dense + Relu -> Dense + Softmax
While propagating forward, I took a (3, 3, 3) kernel and convolved it with the image. I used 12 such filters, which gave me a (32, 32, 12) tensor as output. Max-pooling reduced the dimensions to (16, 16, 12), which was then convolved with a (3, 3, 12) kernel for 8 different filters to get (16, 16, 8).
While back-propagating, I have reached this part. I have a (16, 16, 8) tensor, with which I have to update a (3, 3, 12) filter matrix for each of the 8 filters. I also have to propagate the error backwards to compute the gradient with respect to the input of this layer, which has dimensions (16, 16, 12).
I understood how this works for a 1-channel image, but I'm confused when three channels are involved.
I also know that back-propagating through a convolution layer is itself a convolution, but I'm stuck here. How is it done?
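For reference, a minimal NumPy sketch of the shape bookkeeping involved for the second conv layer described above, assuming stride 1, "same" padding and channels-last layout (all variable names are just for illustration): the channel axis is simply carried along, exactly as in the 1-channel case.

import numpy as np

# Shapes from the second conv layer described above.
H, W, C_in, C_out, K = 16, 16, 12, 8, 3
x = np.random.randn(H, W, C_in)          # input to this conv layer: (16, 16, 12)
w = np.random.randn(K, K, C_in, C_out)   # the 8 filters, each (3, 3, 12)
d_out = np.random.randn(H, W, C_out)     # error arriving from above: (16, 16, 8)

pad = K // 2
x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
d_x_pad = np.zeros_like(x_pad)
d_w = np.zeros_like(w)

for i in range(H):
    for j in range(W):
        patch = x_pad[i:i + K, j:j + K, :]            # a (3, 3, 12) window of the input
        for f in range(C_out):
            # filter gradient: the window scaled by the error at (i, j, f)
            d_w[:, :, :, f] += patch * d_out[i, j, f]
            # input gradient: the error spread back through the filter weights
            d_x_pad[i:i + K, j:j + K, :] += w[:, :, :, f] * d_out[i, j, f]

d_x = d_x_pad[pad:-pad, pad:-pad, :]      # crop the padding -> (16, 16, 12)
print(d_w.shape, d_x.shape)               # (3, 3, 12, 8) (16, 16, 12)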
In each observation, I have 6 timesteps, each with 2 features, and I am trying to predict 1 timestep that has 2 parallel features. More specifically,
The shape of my input data is: (81, 6, 2)
The shape of my output data is: (81, 1, 2)
I wrote the following code to build an Encoder-Decoder LSTM:
from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

n_input = 6  # timesteps per input sample
model = Sequential()
model.add(LSTM(200, activation='relu', input_shape=(n_input, 2)))
model.add(RepeatVector(1))
model.add(LSTM(200, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(100, activation='relu')))
model.add(TimeDistributed(Dense(2)))
The network gives me back the shape (1, 1, 2) when I perform a single prediction.
I want to double-check whether this is correct and whether I am missing anything, because the predicted values are very bad (some are negative and others are very high).
Having bad predictions is a separate issue. The shape you're getting back corresponds to (samples, timesteps, features) => (1, 1, 2), with the 2 features coming from the final Dense(2) layer.
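As a quick sanity check, a single sample of the stated input shape should come back with exactly that output shape (a sketch that assumes the model defined above; x_single is a made-up name):

import numpy as np

x_single = np.random.randn(1, 6, 2)   # one sample, 6 timesteps, 2 features
y_pred = model.predict(x_single)
print(y_pred.shape)                   # expected: (1, 1, 2)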
I was trying to learn PyTorch and came across a tutorial where a CNN is defined as below:
from torch.nn import Module, Sequential, Conv2d, BatchNorm2d, ReLU, MaxPool2d, Linear

class Net(Module):
    def __init__(self):
        super(Net, self).__init__()

        self.cnn_layers = Sequential(
            # Defining a 2D convolution layer
            Conv2d(1, 4, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(4),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
            # Defining another 2D convolution layer
            Conv2d(4, 4, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(4),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
        )

        self.linear_layers = Sequential(
            Linear(4 * 7 * 7, 10)
        )

    # Defining the forward pass
    def forward(self, x):
        x = self.cnn_layers(x)
        x = x.view(x.size(0), -1)
        x = self.linear_layers(x)
        return x
I understood how the cnn_layers are made. After the cnn_layers, the data should be flattened and given to linear_layers.
I don't understand how the number of features to Linear is 4*7*7. I understand that 4 is the output dimension from the last Conv2d layer.
How does 7*7 come into the picture? Do stride or padding play any role in that?
Input image shape is [1, 28, 28]
The Conv2d layers have a kernel size of 3 with stride and padding of 1, so they don't change the spatial size of the image. There are two MaxPool2d layers, each of which halves the spatial dimensions from (H, W) to (H/2, W/2). So, for each batch, the output of the last convolution, which has 4 output channels, has shape (batch_size, 4, H/4, W/4). In the forward pass the feature tensor is flattened by x = x.view(x.size(0), -1), which turns it into shape (batch_size, 4 * (H/4) * (W/4)). Assuming H and W are 28, the linear layer therefore takes inputs of shape (batch_size, 4 * 7 * 7) = (batch_size, 196).
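One quick way to see the 4 * 7 * 7 figure empirically is to push a dummy input through the convolutional part of the network (a sketch assuming the Net class above and a 28x28 single-channel input):

import torch

net = Net()
dummy = torch.randn(1, 1, 28, 28)                  # (batch, channels, H, W)
features = net.cnn_layers(dummy)
print(features.shape)                              # torch.Size([1, 4, 7, 7])
print(features.view(features.size(0), -1).shape)   # torch.Size([1, 196])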
Actually, in the 2D convolution layers the features (values) live in matrices (2D tensors, one per channel). As usual, the network ends with a fully connected layer followed by the logits layer, and the features in a fully connected layer form a vector (a 1D tensor). Therefore each feature (value) in the last feature maps has to be mapped into the fully connected layer that follows. In PyTorch the fully connected layer is implemented by the Linear class, whose first parameter is the number of input features.
In this case:
input_image : (28, 28, 1)
after_Conv2d_1 : (28, 28, 4) <- padding of 1 keeps the spatial size; with padding = 0 it would be (26, 26, 4)
after_maxPool_1 : (14, 14, 4) <- the stride of 2 halves the spatial size
after_Conv2d_2 : (14, 14, 4) <- again "same"-style padding
after_maxPool_2 : (7, 7, 4)
In the end, the total number of features entering the fully connected layer is 4 * 7 * 7 = 196. This also illustrates why an odd kernel size is commonly used and why it is convenient to start from images with an even number of pixels.
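The same numbers fall out of the standard output-size formula out = floor((n + 2*p - k) / s) + 1, where k is the kernel size, s the stride and p the padding (a small sketch; out_size is just an illustrative helper):

def out_size(n, k, s, p):
    return (n + 2 * p - k) // s + 1

n = 28
n = out_size(n, k=3, s=1, p=1)   # Conv2d_1  -> 28
n = out_size(n, k=2, s=2, p=0)   # MaxPool_1 -> 14
n = out_size(n, k=3, s=1, p=1)   # Conv2d_2  -> 14
n = out_size(n, k=2, s=2, p=0)   # MaxPool_2 -> 7
print(4 * n * n)                 # 196 = 4 * 7 * 7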
I'm uncertain about the following question; everything I've found on the internet seemed vague and fuzzy.
Consider this CNN:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 10
model = Sequential()
# 1st conv layer
model.add(Conv2D(10, (4,4), activation="relu", input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
# 2nd conv layer
model.add(Conv2D(20, (4,4), activation="relu"))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
Now, when the input image is passed to the first conv layer, we end up with 10 feature maps, each of shape (25, 25, 1). Hence, we end up with the shape (25, 25, 1, 10), correct? Applying the pooling leads us to (12, 12, 1, 10).
My question appears when it comes to the second conv layer. A conv layer always takes one picture/matrix as input. Like the first layer took (28, 28, 1), which is one picture.
But conv layer 1 gave us 10 pictures (or feature maps). So, which of these 10 is used as the input? I'd assume every single one.
Suppose that is correct: then we have the input shape (12, 12, 1) for the second conv layer. Applying it results in (9, 9, 1), and the pooling layer then gives (4, 4, 1). Since we have 20 filters specified, we end up with (4, 4, 1, 20).
But that's only for one of the 10 possible inputs! Therefore, if we apply all of them, we'd have the final shape (4, 4, 1, 20, 10). Correct?
Edit:
The weight calculation makes me think it's correct, because it fits.
On the other hand, the flatten layer only has 320 = 4*4*20 neurons, not 3200 = 4*4*20*10 as I would expect. So that would make me think it's not correct.
This is the output of the model summary:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_13 (Conv2D)           (None, 25, 25, 10)        170
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 12, 12, 10)        0
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 9, 9, 20)          3220
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 4, 4, 20)          0
_________________________________________________________________
flatten_6 (Flatten)          (None, 320)               0
_________________________________________________________________
dense_12 (Dense)             (None, 128)               41088
_________________________________________________________________
dense_13 (Dense)             (None, 10)                1290
=================================================================
Total params: 45,768
Trainable params: 45,768
Non-trainable params: 0
And if the initial input had been an RGB picture (e.g. (28, 28, 3)), would we end up with (4, 4, 3, 20, 10)?
Your confusion comes from the fact that even though you provide 2 numbers to the filter (4 for width and 4 for height in your example), the filter is actually 3D. This 3rd dimension represents the number of input channels.
Let's go through the first convolution layer: Conv2D(10, (4,4), activation="relu", input_shape=(28,28,1)).
We have an input shape of (28, 28, 1), and filter shape of (4, 4, 1). Even though you specified the shape to be (4, 4) in that line above, remember that the third dimension will be the number of input channels, which for this first convolution layer, is 1. If you were feeding RGB images into your model, for example, both the input shape and filter would have the third dimension be 3 instead of 1.
Our output shape, given our input shape and filter shape, should be (input_shape[0] - filter_shape[0] + 1, input_shape[1] - filter_shape[1] + 1, output_channels) (assuming the stride is 1, which it is in your model). Substituting values, we get (28 - 4 + 1, 28 - 4 + 1, 10), or (25, 25, 10). This confirms what we see in model.summary().
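Plugging in the numbers from the formula above (a trivial arithmetic sketch):

input_shape = (28, 28, 1)
filter_shape = (4, 4, 1)
output_channels = 10
output_shape = (input_shape[0] - filter_shape[0] + 1,
                input_shape[1] - filter_shape[1] + 1,
                output_channels)
print(output_shape)   # (25, 25, 10)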
As for how we go from input to output under the hood, first we need to move the filter across the input, both horizontally and vertically. An input of shape (28, 28, 1), with a filter of shape (4, 4, 1), yields a chunked input of shape (25, 25, 4, 4, 1). In other words, we have 25 x 25 "views" of our original image, with each of these views having shape (4, 4, 1), representing the pixel values we see in that patch of the image.
We have 10 (4, 4, 1) filters (10 being the number of output channels). Let's take the first of these filters, and the first "view" of our original image (remember, we have 25 x 25 in total). We multiply the filter by this "view" element-wise, which works out nicely because both the filter and the "view" have the same shape (4, 4, 1). This multiplication gives an output of shape (4, 4, 1). We then add up all these values (4 x 4 x 1 = 16 values in total) to give our "signal". A larger sum means a stronger detection of whatever the filter is looking for. I've overlooked some things, like the bias, but that doesn't change the dimensionality of anything.
The above walk through only dealt with the first filter and first "view" of our image, and resulted in a single scalar "signal". We have 10 filters, and 25 x 25 views, yielding a final output shape of (25, 25, 10) (as expected).
Note how the entire process operated in 3D space. Both the filters and views were 3D, in this case with a last dimension of 1. It is able to operate in 3D space because the element-wise multiplication will work out, as long as both the filter and "view" have the same 3rd dimension (1 in this case).
If we went through the second convolution layer (Conv2D(20, (4,4), activation="relu")), that last dimension of both the filter and "view" would be 10 instead of 1. This is because the output channels of the previous convolution layer are the same as the input channels of the current one.
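Here is a NumPy sketch of that multiply-and-sum for the first layer's shapes (illustrative values only; the loop just shows how the full (25, 25, 10) output would be filled):

import numpy as np

image = np.random.randn(28, 28, 1)          # the (28, 28, 1) input
filters = np.random.randn(10, 4, 4, 1)      # 10 filters, each of shape (4, 4, 1)

output = np.zeros((25, 25, 10))
for f in range(10):
    for i in range(25):
        for j in range(25):
            view = image[i:i + 4, j:j + 4, :]            # one (4, 4, 1) "view"
            output[i, j, f] = np.sum(view * filters[f])  # element-wise multiply, then sum

print(output.shape)   # (25, 25, 10)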
I need to write my CNN model as a Theano function with my weights already set by Keras (Tensorflow as the backend), but I am unsure about how to add the bias values associated with each layer.
This solution, How can I get a 1D convolution in theano, works nicely for writing a single layer as a Theano function, but I need to stack my weights with the biases from each layer.
Simplified version of my code:
from keras.models import Sequential
from keras.layers import InputLayer, Convolution1D, MaxPooling1D, Flatten, Dense

# Keras 1-style API (nb_filter, border_mode, init), as in the original code
model = Sequential([
    InputLayer(batch_input_shape=(None, 100, 1)),
    Convolution1D(nb_filter=16, filter_length=8, activation='relu', border_mode='same', init='he_normal', input_shape=(None, 100, 1)),
    Convolution1D(nb_filter=32, filter_length=8, activation='relu', border_mode='same', init='he_normal'),
    MaxPooling1D(pool_length=4),
    Flatten(),
    Dense(output_dim=32, activation='relu', init='he_normal'),
    Dense(output_dim=1, input_dim=32, activation='linear'),
])
How do you add the bias weights to the CNN layer?
For instance, the weights of my first layer have the dimensions: (8, 1, 1, 16)
With a bias with dimensions: (16,)
Which is easy enough to concatenate together to get dimensions: (9, 1, 1, 16)
but for the next layer I have dimensions: (8, 1, 16, 32)
with a bias with dimensions: (32,)
How can I combine these into one weight matrix to pass to the Theano T.signal.conv.conv2d function?
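For reference, a small sketch of how the per-layer kernels and biases quoted above come out of Keras as separate arrays (using the model defined earlier):

for layer in model.layers:
    params = layer.get_weights()          # [] for layers without parameters
    if params:
        kernel, bias = params
        print(layer.name, kernel.shape, bias.shape)
# e.g. the first Convolution1D layer: kernel (8, 1, 1, 16), bias (16,)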
Below is a piece of example code from the Keras documentation. It looks like the first convolution accepts a 256x256 image with 3 color channels. It has 64 output filters (I think these are the same as the feature maps I have read about elsewhere; can someone confirm this for me?). What confuses me is that the output size is (None, 64, 256, 256). I would expect it to be (None, 64 * 3, 256, 256), since it would need to do convolutions for each of the color channels. What I am wondering is how Keras handles the color channels. Do the values get averaged together (converted to greyscale) before passing through the convolution?
from keras.models import Sequential
from keras.layers import Convolution2D

# apply a 3x3 convolution with 64 output filters on a 256x256 image:
model = Sequential()
model.add(Convolution2D(64, 3, 3, border_mode='same', input_shape=(3, 256, 256)))
# now model.output_shape == (None, 64, 256, 256)

# add a 3x3 convolution on top, with 32 output filters:
model.add(Convolution2D(32, 3, 3, border_mode='same'))
# now model.output_shape == (None, 32, 256, 256)
A filter of size 3*3 with 3 input channels consists of 3*3*3 parameters, so the convolution kernel has different weights for each channel.
The layer sums up the convolution results of the channels (probably together with a bias term) to get the output, so the output shape is independent of the number of input channels: for example, (None, 64, 256, 256) rather than (None, 64 * 3, 256, 256).
I'm not 100% sure, but I think a feature map refers to the output of applying one such filter to the input (for example a 256*256 matrix).
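A minimal NumPy sketch of that channel summation (illustrative only; no padding here, so the spatial size shrinks from 256 to 254, whereas border_mode='same' keeps it at 256):

import numpy as np

image = np.random.randn(256, 256, 3)      # height, width, 3 color channels
filt = np.random.randn(3, 3, 3)           # one filter: 3x3 spatially, 3 channels

out_h = out_w = 256 - 3 + 1               # 254 with no padding
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        # multiply the 3x3x3 window by the 3x3x3 filter and sum over height,
        # width AND channels -> one number per spatial position
        feature_map[i, j] = np.sum(image[i:i + 3, j:j + 3, :] * filt)

print(feature_map.shape)                  # (254, 254): one 2D map per filter, not per channel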