I was trying to learn PyTorch and came across a tutorial where a CNN is defined like below,
from torch.nn import Module, Sequential, Conv2d, MaxPool2d, Linear

class Net(Module):
    def __init__(self):
        super(Net, self).__init__()
        self.cnn_layers = Sequential(
            # Defining a 2D convolution layer
            Conv2d(1, 4, kernel_size=3, stride=1, padding=1),
            MaxPool2d(kernel_size=2, stride=2),
            # Defining another 2D convolution layer
            Conv2d(4, 4, kernel_size=3, stride=1, padding=1),
            MaxPool2d(kernel_size=2, stride=2),
        )
        self.linear_layers = Sequential(
            Linear(4 * 7 * 7, 10)
        )

    # Defining the forward pass
    def forward(self, x):
        x = self.cnn_layers(x)
        # Flatten (N, C, H, W) to (N, C*H*W) before the linear layer
        x = x.view(x.size(0), -1)
        x = self.linear_layers(x)
        return x
I understand how the cnn_layers are built. After the cnn_layers, the data should be flattened and passed to linear_layers.
I don't understand why the number of input features to Linear is 4*7*7. I understand that 4 is the number of output channels of the last Conv2d layer.
Where does the 7*7 come from? Do stride or padding play any role in that?
Input image shape is [1, 28, 28]
The Conv2d layers have a kernel size of 3 with stride and padding of 1, which means they don't change the spatial size of the image. Each of the two MaxPool2d layers halves the spatial dimensions, from (H, W) to (H/2, W/2). So the output of the last pooling layer, with 4 channels, has a shape of (batch_size, 4, H/4, W/4). In the forward pass the feature tensor is flattened by x = x.view(x.size(0), -1), which gives it the shape (batch_size, 4 * H/4 * W/4) = (batch_size, H*W/4). Assuming H and W are 28, the linear layer takes inputs of shape (batch_size, 4 * 7 * 7) = (batch_size, 196).
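The arithmetic in this answer can be checked with the output-size formula given in the PyTorch docs for Conv2d and MaxPool2d (with dilation 1): out = floor((in + 2 * padding - kernel_size) / stride) + 1. Below is a minimal sketch in plain Python; the helper name out_size is made up for illustration and is not a PyTorch function.

```python
def out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv/pool layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel_size) // stride + 1

h = 28                                               # input image is 1 x 28 x 28
h = out_size(h, kernel_size=3, stride=1, padding=1)  # Conv2d    -> 28
h = out_size(h, kernel_size=2, stride=2)             # MaxPool2d -> 14
h = out_size(h, kernel_size=3, stride=1, padding=1)  # Conv2d    -> 14
h = out_size(h, kernel_size=2, stride=2)             # MaxPool2d -> 7

channels = 4                      # out_channels of the last Conv2d
in_features = channels * h * h    # value to pass as Linear's first argument
print(h, in_features)             # 7 196
```

Since the image is square, the same numbers hold for the width, which is why Linear takes 4 * 7 * 7 inputs.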
In the 2D convolution layers, the features [values] live in a matrix [2D tensor].
As usual, the neural network ends with a fully connected layer followed by the logits layer,
so the features in the fully connected layer live in a vector [1D tensor].
Therefore we have to map each feature [value] of the last matrix into the fully connected layer that follows.
In PyTorch, the fully connected layer is implemented by the Linear class.
Its first parameter is the number of input features.
In this case:
input_image : (28,28,1)
after_Conv2d_1 : (28,28,4) <- because of the padding of 1; with padding = 0 it would be (26,26,4)
after_maxPool_1 : (14,14,4) <- 2x2 pooling with a stride of 2 halves each side
after_Conv2D_2 : (14,14,4) <- because this is "same" padding
after_maxPool_2 : (7,7,4)
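In the end, the total number of features before the fully connected layer is 4*7*7.
This also shows why we usually use an odd kernel size and start from images with an even number of pixels: an odd kernel can be padded symmetrically so the size is preserved, and an even size halves cleanly under 2x2 pooling.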
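The remark about odd kernel sizes and even image sizes can be illustrated with the same output-size formula, out = floor((in + 2 * padding - kernel_size) / stride) + 1. This is only a sketch; out_size is a hypothetical helper, not part of PyTorch.

```python
def out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv/pool layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Odd kernel: a symmetric padding of (k - 1) // 2 keeps the size unchanged.
same_odd = [out_size(28, k, padding=(k - 1) // 2) for k in (3, 5, 7)]

# Even kernel: no symmetric integer padding gives back exactly 28.
even_p1 = out_size(28, 4, padding=1)
even_p2 = out_size(28, 4, padding=2)

# An even input halves cleanly under 2x2 pooling; an odd one loses a row/column.
pooled_even = out_size(28, 2, stride=2)
pooled_odd = out_size(7, 2, stride=2)

print(same_odd, even_p1, even_p2, pooled_even, pooled_odd)
# [28, 28, 28] 27 29 14 3
```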
I am very new to Deep learning. I am working on the CIFAR10 dataset and created a CNN model which is as below.
import torch.nn as nn
import torch.nn.functional as F

class Net2(nn.Module):
    def __init__(self):
        super(Net2, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 5, 1)
        self.fc1 = nn.Linear(32 * 5 * 5, 512)
        self.fc2 = nn.Linear(512,10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.max_pool2d(F.relu(self.conv1(x)),(2,2))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

net2 = Net2().to(device)
My assignment requirements are to create a model with:
Convolutional layer with 32 filters, kernel size of 5x5 and stride of 1.
Max Pooling layer with kernel size of 2x2 and default stride.
ReLU Activation Layers.
Linear layer with output of 512.
ReLU Activation Layers.
A linear layer with output of 10.
I think that is what I wrote, but I suspect I am on the wrong path. Please help me write the correct model and explain the reasoning behind the arguments to the Conv2d and Linear layers.
The error which I am getting from my code is as below:
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [32, 3, 5, 5], but got 2-dimensional input of size [1024, 3072] instead
Please help me!
There are two problems with the code:
Flattening of input
x = x.view(x.size(0), -1)
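The convolutional layer expects a four-dimensional input of shape (N, C, H, W), where N is the batch size, C = 3 is the number of channels, and (H, W) is the size of the image. With the statement above, you are flattening your (1024, 3, 32, 32) input to (1024, 3072) before it reaches the convolution, which is what the error message reports. The flattening should happen after the max pooling, just before the first linear layer.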
Number of input features in the first linear layer
self.fc1 = nn.Linear(32 * 5 * 5, 512)
The output dimensions of the convolutional layer for a (1024, 3, 32, 32) input will be (1024, 32, 28, 28), and after applying the 2 x 2 maxpooling, it is (1024, 32, 14, 14). So the number of input features for the linear layer should be 32 x 14 x 14 = 6272.
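The numbers in this answer follow from the output-size formula in the PyTorch docs, out = floor((in + 2 * padding - kernel_size) / stride) + 1. Here is a short sketch in plain Python that reproduces them; out_size is a made-up helper name, and the padding of 0 is the Conv2d default since the question passes none.

```python
def out_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv/pool layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel_size) // stride + 1

side = 32                                       # CIFAR10 images are 3 x 32 x 32
side = out_size(side, kernel_size=5, stride=1)  # nn.Conv2d(3, 32, 5, 1) -> 28
side = out_size(side, kernel_size=2, stride=2)  # 2x2 max pooling        -> 14

out_channels = 32
fc1_in_features = out_channels * side * side
print(side, fc1_in_features)  # 14 6272
```

So the first linear layer would be declared as nn.Linear(32 * 14 * 14, 512), with the x.view(x.size(0), -1) call placed between the pooling and fc1.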
I have a simple convolution network:
import torch.nn as nn
import torch.nn.functional as F

class model(nn.Module):
    def __init__(self, ks=1):
        super(model, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=ks, stride=1)
        self.fc1 = nn.Linear(8*8*32*ks, 64)
        self.fc2 = nn.Linear(64, 64)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

cnn = model(1)
Since the kernel size is 1 and the number of output channels is 32, I assumed that there should be 32*1*1 weights in this layer. But when I ask PyTorch for the shape of the weight tensor with cnn.conv1.weight.shape, it returns torch.Size([32, 4, 1, 1]). Why should the number of input channels matter for the weights of a Conv2d layer?
Am I missing something?
It matters because a 2D convolution over an image runs across all of its channels, which means the depth of each filter (kernel) must equal in_channels (PyTorch sets this for you). So the true kernel size is [in_channels, 1, 1]. Put another way, out_channels is the number of kernels, so the number of weights = number of kernels * size of one kernel = out_channels * (in_channels * kernel_height * kernel_width). Here that is 32 * (4 * 1 * 1) = 128.
[Figure: 2D convolution with a 3D input]
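The weight count can be worked out by hand. The sketch below assumes the Conv2d default of groups=1 and one bias per output channel; conv2d_param_count is a hypothetical helper, not a PyTorch function.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Weight shape and parameter counts of a Conv2d layer with groups=1."""
    kh, kw = kernel_size
    # One kernel per output channel, each spanning every input channel.
    weight_shape = (out_channels, in_channels, kh, kw)
    n_weights = out_channels * in_channels * kh * kw
    n_bias = out_channels if bias else 0
    return weight_shape, n_weights, n_bias

shape, n_weights, n_bias = conv2d_param_count(4, 32, (1, 1))
print(shape, n_weights, n_bias)  # (32, 4, 1, 1) 128 32
```

The shape matches the torch.Size([32, 4, 1, 1]) reported in the question.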
I am trying to tie together a CNN layer with 2 LSTM layers and ctc_batch_cost for loss, but I'm encountering some problems. My model is supposed to work with grayscale images.
During my debugging I've figured out that if I use just a CNN layer that keeps the output size equal to the input size + LSTM and CTC, the model is able to train:
# === Without MaxPool2D ===
inp = Input(name='inp', shape=(128, 32, 1))
cnn = Conv2D(name='conv', filters=1, kernel_size=3, strides=1, padding='same')(inp)
# Go from Bx128x32x1 to Bx128x32 (B x TimeSteps x Features)
rnn_inp = Reshape((128, 32))(cnn)
blstm = Bidirectional(LSTM(256, return_sequences=True), name='blstm1')(rnn_inp)
blstm = Bidirectional(LSTM(256, return_sequences=True), name='blstm2')(blstm)
# Softmax.
dense = TimeDistributed(Dense(80, name='dense'), name='timedDense')(blstm)
rnn_outp = Activation('softmax', name='softmax')(dense)
# Model compiles, calling fit works!
But when I add a MaxPool2D layer that halves the dimensions, I get an error sequence_length(0) <= 64, similar to the one presented here.
# === With MaxPool2D ===
inp = Input(name='inp', shape=(128, 32, 1))
cnn = Conv2D(name='conv', filters=1, kernel_size=3, strides=1, padding='same')(inp)
maxp = MaxPool2D(name='maxp', pool_size=2, strides=2, padding='valid')(cnn) # -> 64x16x1
# Go from Bx64x16x1 to Bx64x16 (B x TimeSteps x Features)
rnn_inp = Reshape((64, 16))(maxp)
blstm = Bidirectional(LSTM(256, return_sequences=True), name='blstm1')(rnn_inp)
blstm = Bidirectional(LSTM(256, return_sequences=True), name='blstm2')(blstm)
# Softmax.
dense = TimeDistributed(Dense(80, name='dense'), name='timedDense')(blstm)
rnn_outp = Activation('softmax', name='softmax')(dense)
# Model compiles, but calling fit crashes with:
# InvalidArgumentError: sequence_length(0) <= 64
# [[{{node ctc_loss_1/CTCLoss}}]]
After struggling with this problem for about 3 days, I posted the above question here on StackOverflow. About 2 hours after posting the question, I finally figured it out.
TL;DR Solution:
If you're using ctc_batch_cost:
Make sure the input_length argument holds the lengths (numbers of timesteps) of the sequences that enter your RNNs as input.
If you're using ctc_loss:
Make sure the logit_length argument holds the lengths (numbers of timesteps) of the sequences that enter your RNNs as input.
The solution lies in the documentation, which is relatively sparse and can be cryptic for a machine learning newbie like myself.
The TensorFlow documentation for ctc_batch_cost reads:
y_true, y_pred, input_length, label_length
input_length tensor (samples, 1) containing the sequence length for
each batch item in y_pred.
input_length corresponds to logit_length from ctc_loss function's TensorFlow documentation:
labels, logits, label_length, logit_length, logits_time_major=True, unique=None,
blank_index=None, name=None
logit_length tensor of shape [batch_size] Length of input sequence in
That's where it clicked, at the word logit. So, the argument for input_length or logit_length is supposed to be a tensor/container (in my case, numpy array) of the lengths (i.e. number of timesteps) of the sequences entering the RNN (in my case LSTM) as input.
I was originally making the mistake of considering the required length to be the width of the grayscale images that act as input for the whole network (CNN + MaxPool2D + RNN), but because the MaxPool2D layer creates a tensor of different dimensions for the RNN's input, the ctc loss function crashes.
Now fit runs without crashing.
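As a rough sketch of the fix, the value to pass as input_length can be derived from the pooling settings rather than from the image width. The helper rnn_time_steps and the batch size of 4 are made up for illustration; only the 128 timesteps and the pool_size=2, strides=2, padding='valid' settings come from the model above.

```python
def rnn_time_steps(width, pool_size=2, stride=2):
    """Timesteps left after a 'valid' max pooling along the time axis."""
    return (width - pool_size) // stride + 1

batch_size = 4                     # hypothetical batch size
time_steps = rnn_time_steps(128)   # 128 columns -> 64 timesteps after MaxPool2D

# ctc_batch_cost expects input_length with shape (samples, 1).
input_length = [[time_steps] for _ in range(batch_size)]
print(input_length)  # [[64], [64], [64], [64]]
```

In practice this list would be wrapped in a NumPy array before being fed to the model, as described above.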
I understand Conv1d strides in one dimension. But my input is of shape [64, 20, 161], where 64 is the batches, 20 is the sequence length and 161 is the dimension of my vector.
I'm not sure how to set up my Conv1d to stride over the vector.
I'm trying:
self.conv1 = torch.nn.Conv1d(batch_size, 20, 161, stride=1)
but getting:
RuntimeError: Given groups=1, weight of size 20 64 161, expected input[64, 20, 161] to have 64 channels, but got 20 channels instead
According to the documentation:
torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')
in_channels is the number of channels in your input. "Channels" is mostly a computer vision term; in your case this number is 20.
out_channels is the number of channels in your output; it depends on how many output features you want.
For 1D convolution, you can think of the number of channels as the "number of input vectors" and the "number of output feature vectors". The size (not the number) of the output feature vectors is determined by the other parameters: kernel_size, stride, padding, and dilation.
An example usage:
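import torch
t = torch.randn((64, 20, 161))
conv = torch.nn.Conv1d(20, 100, kernel_size=3)  # kernel_size is required; 3 is an arbitrary choice
out = conv(t)  # out.shape is (64, 100, 159)
Note: You never specify the batch size in torch.nn modules; the first dimension is always assumed to be the batch size.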
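To see how the other parameters set the size of the output vectors, here is a sketch of the length formula from the Conv1d docs, out = floor((L + 2 * padding - dilation * (kernel_size - 1) - 1) / stride) + 1, in plain Python. The helper name conv1d_out_length is an assumption for illustration, and out_channels = 100 is an arbitrary choice.

```python
def conv1d_out_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Length of each output feature vector of a Conv1d layer."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

batch, in_channels, length = 64, 20, 161   # input shape from the question
out_channels = 100                         # arbitrary choice

# A small kernel slides over the 161 positions of each vector.
small_kernel = (batch, out_channels, conv1d_out_length(length, kernel_size=3))

# A kernel as long as the vector (161, as tried in the question) fits only once.
full_kernel = (batch, out_channels, conv1d_out_length(length, kernel_size=161))

print(small_kernel, full_kernel)  # (64, 100, 159) (64, 100, 1)
```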
The following code is from the Tensorflow Resnet example at
# Create the bottleneck groups, each of which contains `num_blocks`
# bottleneck groups.
for group_i, group in enumerate(groups):
    for block_i in range(group.num_blocks):
        name = 'group_%d/block_%d' % (group_i, block_i)

        # 1x1 convolution responsible for reducing dimension
        with tf.variable_scope(name + '/conv_in'):
            conv = tf.layers.conv2d(
            conv = tf.layers.batch_normalization(conv, training=training)

        with tf.variable_scope(name + '/conv_bottleneck'):
            conv = tf.layers.conv2d(
            conv = tf.layers.batch_normalization(conv, training=training)

        # 1x1 convolution responsible for restoring dimension
        with tf.variable_scope(name + '/conv_out'):
            input_dim = net.get_shape()[-1].value
            conv = tf.layers.conv2d(
            conv = tf.layers.batch_normalization(conv, training=training)

        # shortcut connections that turn the network into its counterpart
        # residual function (identity shortcut)
        net = conv + net
This piece of code runs for each block, with a given output and bottleneck dimension. These blocks are defined as:
# Configurations for each bottleneck group.
BottleneckGroup = namedtuple('BottleneckGroup',
                             ['num_blocks', 'num_filters', 'bottleneck_size'])
groups = [
    BottleneckGroup(3, 128, 32), BottleneckGroup(3, 256, 64),
    BottleneckGroup(3, 512, 128), BottleneckGroup(3, 1024, 256)
]
In common practice, as far as I know, these so-called bottleneck layers of ResNets first reduce the input channel count with 1x1 kernels, then apply higher-order (3x3) convolutions at that reduced channel size, and finally restore the input channel size with a last 1x1 convolutional layer, as described in the original ResNet paper.
But in the TensorFlow example, the first 1x1 layer uses the block output size, not the bottleneck size, hence no meaningful channel reduction is made. Is the TensorFlow example really wrong here, or am I missing something?
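To put a number on the difference, the sketch below counts convolution weights for one block of the first group (128 filters, bottleneck size 32) under both layouts. It relies on assumptions: the block input has num_filters channels (which the identity shortcut net = conv + net requires), biases and batch normalization are ignored, and the "as described" layout follows the wording of this question, not the elided conv2d arguments. The helper names are made up.

```python
def conv_weights(in_ch, out_ch, k):
    """Number of weights in a k x k convolution, ignoring bias."""
    return in_ch * out_ch * k * k

def block_weights(in_ch, first_out, bottleneck, out_ch):
    """Weights of a 1x1 -> 3x3 -> 1x1 block with the given channel counts."""
    return (conv_weights(in_ch, first_out, 1)
            + conv_weights(first_out, bottleneck, 3)
            + conv_weights(bottleneck, out_ch, 1))

num_filters, bottleneck_size = 128, 32   # first BottleneckGroup above

# Paper-style: the first 1x1 layer already reduces to the bottleneck size.
paper_style = block_weights(num_filters, bottleneck_size, bottleneck_size, num_filters)

# As described in the question: the first 1x1 layer keeps num_filters channels.
as_described = block_weights(num_filters, num_filters, bottleneck_size, num_filters)

print(paper_style, as_described)  # 17408 57344
```

Under these assumptions, the 3x3 convolution sees four times as many input channels in the second layout, which is where most of the extra weights come from.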