I am reading this research paper (https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf) and trying to follow along with the code on Github. I don't understand how the parameters for the nn.Conv2d() were determined. For the first Conv2d: Does 64#96*96 mean 64 channels with a 96 x 96 kernel size? And if so then why is the kernel size 10 in the function? I have googled the parameters and their meanings and from what I read I understand that its (input_channels, output_channels, kernel_size)
Here is the github post: https://github.com/fangpin/siamese-pytorch/blob/master/train.py
For reference page 4 of the research paper has the model schematic.
self.conv = nn.Sequential(
nn.Conv2d(1, 64, 10), # 64#96*96
nn.ReLU(inplace=True),
nn.MaxPool2d(2), # 64#48*48
nn.Conv2d(64, 128, 7),
nn.ReLU(), # 128#42*42
nn.MaxPool2d(2), # 128#21*21
nn.Conv2d(128, 128, 4),
nn.ReLU(), # 128#18*18
nn.MaxPool2d(2), # 128#9*9
nn.Conv2d(128, 256, 4),
nn.ReLU(), # 256#6*6
)
self.liner = nn.Sequential(nn.Linear(9216, 4096), nn.Sigmoid())
self.out = nn.Linear(4096, 1)
If you look at the model schematic, it's showing two things,
Parameters of the convolution kernel,
Parameters of the feature maps (output of the nn.Conv2D op)
For example first conv2d layer is 64#10x10, meaning 64 output channels and a 10x10 kernel.
Whereas the feature map is 64#96x96, which comes from applying 64#10x10 convolution op on 105x105x1 sized input. This way you get 64 output channels and a 105-10+1=96 sized width and height.
Related
I'm trying to find road lanes using PyTorch. I created dataset and my model. But when I try to train my model, I get mat1 and mat2 shapes cannot be multiplied (4x460800 and 80000x16) error. I've tried other topic's solutions but those solutions didn't help me very much.
My dataset is bunch of road images with their validation images. I have .csv file that contains names of images (such as 'image1.jpg, image2.jpg'). Original size of images and validation images is 1280x720. I convert them 200x200 in my dataset code.
Road image:
Validation image:
Here's my dataset:
import os
import pandas as pd
import random
import torch
import torchvision.transforms.functional as TF
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image
class Dataset(Dataset):
def __init__(self, csv_file, root_dir, val_dir, transform=None):
self.annotations = pd.read_csv(csv_file)
self.root_dir = root_dir
self.val_dir = val_dir
self.transform = transform
def __len__(self):
return len(self.annotations)
def __getitem__(self, index):
img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
image = Image.open(img_path).convert('RGB')
mask_path = os.path.join(self.val_dir, self.annotations.iloc[index, 0])
mask = Image.open(mask_path).convert('RGB')
transform = transforms.Compose([
transforms.Resize((200, 200)),
transforms.ToTensor()
])
if self.transform:
image = self.transform(image)
mask = self.transform(mask)
return image, mask
My model:
import torch
import torch.nn as nn
class Net(nn.Module):
def __init__(self):
super().__init__()
self.cnn_layers = nn.Sequential(
# Conv2d, 3 inputs, 128 outputs
# 200x200 image size
nn.Conv2d(3, 128, kernel_size=3, stride=1, padding=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
# Conv2d, 128 inputs, 64 outputs
# 100x100 image size
nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
# Conv2d, 64 inputs, 32 outputs
# 50x50 image size
nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2)
)
self.linear_layers = nn.Sequential(
# Linear, 32*50*50 inputs, 16 outputs
nn.Linear(32 * 50 * 50, 16),
# Linear, 16 inputs, 3 outputs
nn.Linear(16, 3)
)
def forward(self, x):
x = self.cnn_layers(x)
x = x.view(x.size(0), -1)
x = self.linear_layers(x)
return x
How to avoid this error and train my images on these validation images?
The answer: In your case, NN input has a shape (3, 1280, 720), not (3, 200, 200) as you want. Probably you have forgotten to modify transform argument in RNetDataset. It stays None, so transforms are not applied and the image is not resized. Another possibility is that it happens due to these lines:
transform = transforms.Compose([
transforms.Resize((200, 200)),
transforms.ToTensor()
])
if self.transform:
image = self.transform(image)
mask = self.transform(mask)
You have two variables named transform, but one with self. - maybe you messed them up. Verify it and the problem should go away.
How I came up with it: 460800 is clearly a tensor size after reshaping before linear layers. According to the architecture, tensor processed with self.cnn_layers should have 32 layers, so its height multiplied by width should give 460800 / 32 = 14400. Suppose that its height = H, width = W, so H x W = 14400. Let's understand, what was the original input size in this case? nn.MaxPool2d(kernel_size=2, stride=2) layer divides height and width by 2, and it happens three times. So, the original input size has been 8H x 8W = 64 x 14400 = 936000. Finally, notice that 936000 = 1280 * 720. This can't be a magical coincidence. Case closed!
Another suggestion: even if you apply transforms correctly, your code might not work. Suppose that you have an input of size (4, 3, 200, 200), where 4 is a batch size. Layers in your architecture will process this input as follows:
nn.Conv2d(3, 128, kernel_size=3, stride=1, padding=1) # -> (4, 128, 200, 200)
nn.MaxPool2d(kernel_size=2, stride=2) # -> (4, 128, 100, 100)
nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1) # -> (4, 64, 100, 100)
nn.MaxPool2d(kernel_size=2, stride=2) # -> (4, 64, 50, 50)
nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1) # -> (4, 32, 50, 50)
nn.MaxPool2d(kernel_size=2, stride=2) # -> (4, 32, 25, 25)
So, your first layer in self.linear_layers should be not nn.Linear(32 * 50 * 50, 16), but nn.Linear(32 * 25 * 25, 16). With this change, everything should be fine.
I am struggling to work out how to calculate the dimensions for the fully connected layer. I am inputing images which are (448x448) using a batch size (16). Below is the code for my convolutional layers:
class ConvolutionalNet(nn.Module):
def __init__(self, num_classes=182):
super().__init__()
self.layer1 = nn.Sequential(
nn.Conv2d(3, 16, kernal_size=5, stride=1, padding=2),
nn.BatchNorm2d(16),
nn.ReLU(),
nn.MaxPool2d(kernal_size=2, stride=2)
)
self.layer2 = nn.Sequential(
nn.Conv2d(16, 32, kernal_size=5, stride=1, padding=2),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(kernal_size=2, stride=2)
)
self.layer3 = nn.Sequential(
nn.Conv2d(32, 32, kernal_size=5, stride=1, padding=2),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(kernal_size=2, stride=2)
)
self.layer4 = nn.Sequential(
nn.Conv2d(32, 64, kernal_size=5, stride=1, padding=2),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(kernal_size=2, stride=2)
)
self.layer5 = nn.Sequential(
nn.Conv2d(64, 64, kernal_size=5, stride=1, padding=2),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(kernal_size=2, stride=2)
)
I want to add a fully connected layer:
self.fc = nn.Linear(?, num_classes)
Would anyone be able to explain the best way to go about calculating this? Also, if I have multiple fully connected layers e.g. (self.fc2, self.fc3), would the second parameter always equal the number of classes. I am new to coding and finding it hard to wrap my head around this.
The conv layers don't change the width/height of the features since you've set padding equal to (kernel_size - 1) / 2. Max pooling with kernel_size = stride = 2 will decrease the width/height by a factor of 2 (rounded down if input shape is not even).
Using 448 as input width/height, the output width/height will be 448 // 2 // 2 // 2 // 2 // 2 = 448/32 = 14 (where // is floor-divide operator).
The number of channels is fully determined by the last conv layer, which outputs 64 channels.
Therefore you will have a [B,64,14,14] shaped tensor, so the Linear layer should have in_features = 64*14*14 = 12544.
Note you'll need to flatten the input beforehand, something like.
self.layer6 = nn.Sequential(
nn.Flatten(),
nn.Linear(12544, num_classes)
)
Noob here, hard to elaborate my question without an example,
so I use a model on the MNIST data that classifies digits based on number images.
# Load data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
model = nn.Sequential(nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
nn.LogSoftmax(dim=1))
Why does model end up with 64 (row) x 10 (column) matrix?
I thought nn.Linear(64, 10) means a layer that has 64 input neurons to 10 neurons. Shouldn't it be an array of 10 probabilities?
and Why output activation function has dim=1 not dim=0?
Isn't each row of 10 columns for an epoch? Shouldn't LogSoftmax being used to calculate the possibility of each digit?
I'm ...lost.
I have spent 2 hr on this, still can't find the answer, sorry for the noob question!
We usually have our data in the form of (BATCH SIZE, INPUT SIZE) which here in your case would be (64, 784).
What this means is that in every batch you have 64 images and each image has 784 features.
Regarding your model this is what it outputs :
model = nn.Sequential(nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
nn.LogSoftmax(dim=1))
print(model)
# Sequential(
# (0): Linear(in_features=784, out_features=128, bias=True)
# (1): ReLU()
# (2): Linear(in_features=128, out_features=64, bias=True)
# (3): ReLU()
# (4): Linear(in_features=64, out_features=10, bias=True)
# (5): LogSoftmax(dim=1)
# )
Let's go through how the data will flow through this model.
You have input of shape (64, 784)
It passes through first Linear layer, where each image of 784 features is converted to have 128 features so output is of shape (64, 128)
ReLU does not change the shape just the values so shape is again (64, 128)
Next Linear layer converts 128 features to 64 so now output shape is (64, 64).
Again ReLU layer just changes values so shape is still (64, 64)
This last Linear layer maps 64 input features to 10 output ones so shape is now (64, 10).
Lastly we have the LogSoftmax layer. Here we provided dim=1 because we want to calculate output possibility for each of the possible 10 digits for each of the 64 images in out batch. The dim=0 is the batch and dim=1 is the outputs for digits that is we we provide dim=1. After this your output will have shape (64, 10).
Therefore at the end, each image in the batch will have possibility for each of the 10 digits.
I thought nn.Linear(64, 10) means a layer that has 64 input neurons to 10 neurons.
That is correct. Another point to remember is that batch dimension is not specified in the layers of the models. We define layers to operate on each image. Your second last Linear layer output 64 values for an image so last Linear layer converts it to 10 values and then applies LogSoftmax to it.
This operation is simply repeated for all 64 images in a batch efficiently using matrix operations.
You might be confusing your batch_size=64 with the input_features=64 of Linear layer which is entirely unrelated.
I am looking at an implementation of AlexNet with PyTorch. According to the formula, output height = (input_height + padding_top + padding_bottom - kernel_height) / stride_height + 1. so using the formula, with input of size 224, stride = 4, padding = 1, and kernel size =11, the output should be of size 54.75. But if you run a summary of the model, you would see that the output of this first layer to be 54. Does PyTorch clip down the output size? If so, does it consistently clip down (seem like it)? I would like to understand what is going behind the scene please .
Here is the code that I refer to:
net = nn.Sequential(
nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
nn.Linear(4096, 10))
The output size is an whole number of course! It's just that your formula is not correct, the expression is height = floor((input_height + padding_top + padding_bottom - kernel_height) / stride_height + 1). This wouldn't make any sense otherwise.
Yes, it's clipped down when you have length of output by decimal point.
Because when you look at the output variable, it'd be a matrix. Every element in a matrix worth a length, and there is no guarantee an element for 0.75 length, so it's either being clipped down to 54 or rounded up to 55. And because the kernel have to stop striding at where the next stride don't match the kernel size(you may draw to understand how funny it would be if the kernel didn't stop there), it has to be clipped down.
I am new to AI and python, I'm trying to build an architecture to train a set of images. and later to aim to overfit. but up till now, I couldn't understand how to get the inputs and outputs correctly. I keep seeing the error whenever I try to train the network:
mat1 and mat2 shapes cannot be multiplied (48x13456 and 16x64)
my network:
net2 = nn.Sequential(
nn.Conv2d(3,8, kernel_size=5, padding=0),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Conv2d(8,16, kernel_size=5, padding=0),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Flatten(),
nn.Linear(16,64),
nn.ReLU(),
nn.Linear(64,10)
)
this is a part of a task I'm working on and I really don't get why it's not running. any hints!
its because you have flattened your 2D cnn into 1D FC layers...
& you have to manually calculate your changed input shape from 128 size to your Maxpool layer just before flattening layer ...In your case its 29*29*16
So your code must be rewritten as
net2 = nn.Sequential(
nn.Conv2d(3,8, kernel_size=5, padding=0),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Conv2d(8,16, kernel_size=5, padding=0),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Flatten(),
nn.Linear(13456,64),
nn.ReLU(),
nn.Linear(64,10)
)
This should work
EDIT: This is a simple formula to calculate output size :
(((W - K + 2P)/S) + 1)
Here W = Input size
K = Filter size
S = Stride
P = Padding
So 1st conv block will make your output of size 124
Then you do Maxpool which will make it half i.e 62
2nd conv block will make your output of size 58
Then your last Maxpool will make it 29...
So final flattened output would be 29*29*16 where 16 is output channels