Shape inference for PyTorch layers - python

I'd like to be able to write things in PyTorch like "add another Conv2D on top of the last Conv2D, where the output has 128 channels and the input has the right number of channels to match the previous layer." I end up writing code like this:
from torch import nn

CONV_CHANNELS = [3, 64, 128, 256, 512, 512]
CONV_SIZE = 3
POOL_SIZE = 2

class CNN(nn.Module):
    def __init__(self):
        super().__init__()

        def make_conv_pool(in_channels, out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels=in_channels,
                          out_channels=out_channels,
                          kernel_size=CONV_SIZE),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=POOL_SIZE, stride=2))

        self.net = nn.Sequential(*[
            make_conv_pool(in_channels, out_channels)
            for in_channels, out_channels
            in zip(CONV_CHANNELS, CONV_CHANNELS[1:])
        ])
Now I want to add a fully connected layer at the end, and I want the input to be the result of "flattening" the output of the last conv layer across the height, width, and channel dimensions. I end up with some awkward calculation like:
def calculate_fc_input_features(side_length):
    # one conv/pool block per channel pair: the conv shrinks the side by
    # (CONV_SIZE - 1), then pooling halves it
    for _ in range(len(CONV_CHANNELS) - 1):
        side_length -= (CONV_SIZE - 1)
        side_length = side_length // POOL_SIZE
    # flattened features: channels * height * width of the last conv output
    return CONV_CHANNELS[-1] * side_length * side_length
so that I can write:
CONV_CHANNELS = [3, 64, 128, 256, 512, 512]
CONV_SIZE = 3
POOL_SIZE = 2
IMAGE_SIDE_LENGTH = 256
NUM_CLASSES = 3

class FlatFC(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.fc(x.view(x.shape[0], -1))

class CNN(nn.Module):
    def __init__(self):
        super().__init__()

        def make_conv_pool(in_channels, out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels=in_channels,
                          out_channels=out_channels,
                          kernel_size=CONV_SIZE),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=POOL_SIZE, stride=2))

        convs = nn.Sequential(*[
            make_conv_pool(in_channels, out_channels)
            for in_channels, out_channels
            in zip(CONV_CHANNELS, CONV_CHANNELS[1:])
        ])
        fc = FlatFC(in_features=calculate_fc_input_features(IMAGE_SIDE_LENGTH),
                    out_features=NUM_CLASSES)
        self.net = nn.Sequential(convs, fc)
It feels like I'm writing a lot of unnecessary boilerplate here: for each of the conv layers, the number of input channels is completely determined by the output channels of the previous layer, and for the fully connected layer I have to do a calculation that assumes things about the shape of the network, rather than asking the layers themselves something like "if your initial input has shape [B, W, H, C], what is your output shape?"
Is there a better way to do this? Does PyTorch provide a more concise way to say "put another Conv2d on top of this network, and figure out the number of input channels yourself, because there is only one value that works"? If not, are there libraries that fill this role? This feels like it should be a common task, so I'm surprised by how verbose my implementation needs to be.
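For reference, recent PyTorch releases ship lazy modules such as nn.LazyConv2d and nn.LazyLinear, which infer their input channels/features from the first batch they see; that is close to what is being asked for here. A minimal sketch of the conv stack above rewritten with them (assuming a PyTorch version that includes the lazy modules):

import torch
from torch import nn

CONV_CHANNELS_OUT = [64, 128, 256, 512, 512]  # output channels only
CONV_SIZE = 3
POOL_SIZE = 2
NUM_CLASSES = 3

def make_lazy_conv_pool(out_channels):
    # LazyConv2d infers in_channels on the first forward pass
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size=CONV_SIZE),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=POOL_SIZE, stride=2))

net = nn.Sequential(
    *[make_lazy_conv_pool(c) for c in CONV_CHANNELS_OUT],
    nn.Flatten(),
    nn.LazyLinear(NUM_CLASSES))  # in_features inferred at first call

# one dummy forward pass materializes all the lazy shapes
with torch.no_grad():
    net(torch.zeros(1, 3, 256, 256))

The lazy modules stay uninitialized until that first forward pass, which is what actually materializes the weights.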

Related

Translate Keras functional API to PyTorch nn.Module - Conv2d

I'm trying to translate the following Inception code from a tutorial written with the Keras functional API (link) to a PyTorch nn.Module:
def conv_module(x, K, kX, kY, stride, chanDim, padding="same"):
    # define a CONV => BN => RELU pattern
    x = Conv2D(K, (kX, kY), strides=stride, padding=padding)(x)
    x = BatchNormalization(axis=chanDim)(x)
    x = Activation("relu")(x)
    # return the block
    return x

def inception_module(x, numK1x1, numK3x3, chanDim):
    # define two CONV modules, then concatenate across the
    # channel dimension
    conv_1x1 = conv_module(x, numK1x1, 1, 1, (1, 1), chanDim)
    conv_3x3 = conv_module(x, numK3x3, 3, 3, (1, 1), chanDim)
    x = concatenate([conv_1x1, conv_3x3], axis=chanDim)
    # return the block
    return x
I'm having trouble translating the Conv2D. If I understand correctly:
There is no in_features in Keras - how should I represent it in PyTorch?
Keras filters is PyTorch out_features
kernel_size, stride and padding are the same (maybe a few options for padding are called differently)
Do I understand this correctly? If so, what should I do with in_features? My code so far:
import torch
from torch import nn, Tensor

class BasicConv2d(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        stride: int
    ) -> None:
        super().__init__()
        self.conv = nn.Conv2d(in_channels,
                              out_channels,
                              kernel_size=kernel_size,
                              stride=stride)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)
        self.relu = nn.ReLU()

    def forward(self, x: Tensor) -> Tensor:
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

class Inception(nn.Module):
    def __init__(
        self,
        in_channels: int,
        num_1x1_filters: int,
        num_3x3_filters: int,
    ) -> None:
        super().__init__()
        # how to fill this further?
        self.conv_1d = BasicConv2d(
            num_1x1_filters,
        )
You're correct for the most part. The in_channels parameter of Conv2d corresponds to the number of output channels from the previous layer. If Conv2d is the first layer, in_channels corresponds to the number of channels in your image: 1 for a grayscale image and 3 for an RGB image.
But I'm not sure how you could concatenate the two BasicConv2d outputs.
Fixing batch_size at 1, assume the image size is 256*256 and out_channels of the conv1x1 is 64. That layer outputs a tensor of shape torch.Size([1, 64, 256, 256]). Assuming out_channels of the conv3x3 is 32, that layer outputs a tensor of shape torch.Size([1, 32, 254, 254]). We cannot concatenate these two tensors without some trick, such as using padding=1 for the conv3x3 alone; that produces an output of shape torch.Size([1, 32, 256, 256]), which we can concatenate.
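To illustrate the shapes discussed above with a quick sketch (dummy tensors, not code from either post):

import torch
from torch import nn

x = torch.randn(1, 3, 256, 256)                        # dummy RGB image, batch of 1

conv1x1 = nn.Conv2d(3, 64, kernel_size=1)
conv3x3 = nn.Conv2d(3, 32, kernel_size=3)              # no padding
conv3x3_padded = nn.Conv2d(3, 32, kernel_size=3, padding=1)

print(conv1x1(x).shape)         # torch.Size([1, 64, 256, 256])
print(conv3x3(x).shape)         # torch.Size([1, 32, 254, 254]) -> cannot concat
print(conv3x3_padded(x).shape)  # torch.Size([1, 32, 256, 256]) -> can concat on dim=1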
Your implementation of BasicConv2d is fine; here is the code for the Inception module.
class Inception(nn.Module):
    def __init__(
        self,
        in_channels: int,
        num_1x1_filters: int,
        num_3x3_filters: int,
    ) -> None:
        super().__init__()
        self.conv1 = BasicConv2d(in_channels, num_1x1_filters, 1, 1)
        self.conv3 = BasicConv2d(in_channels, num_3x3_filters, 3, 1)

    def forward(self, x):
        conv1_out = self.conv1(x)
        conv3_out = self.conv3(x)
        # concatenate along the channel dimension; as noted above, the two
        # branches must have matching spatial sizes for this to work
        x = torch.cat([conv1_out, conv3_out], dim=1)
        return x
You need to define two basic conv layers and apply them to the same input separately in the forward pass.
As #planet_pluto pointed out, you can't concatenate two feature maps that have different sizes. You can choose a better stride/padding to construct two feature maps with the same size, or alternatively upsample or downsample one of them before you concatenate.
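If changing the stride/padding is not an option, one possible way to line the two branches up before concatenation (a sketch of the up/downsampling idea, not the answerer's code) is F.interpolate:

import torch
import torch.nn.functional as F

a = torch.randn(1, 64, 256, 256)   # conv1x1 branch
b = torch.randn(1, 32, 254, 254)   # conv3x3 branch, slightly smaller

# resize b to match a's spatial size, then concat on the channel dim
b_resized = F.interpolate(b, size=tuple(a.shape[-2:]), mode="bilinear", align_corners=False)
out = torch.cat([a, b_resized], dim=1)
print(out.shape)  # torch.Size([1, 96, 256, 256])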

What is the best way to represent spatio-temporal image data in a vanilla LSTM

I'm working on video frame segmentation prediction, and I want to start with a vanilla LSTM as a baseline; I know it won't get good results.
In my current approach, I have the original input image frame in the first channel and the segmentation frame in the second channel (see "Data representation"), then I flatten the input into a 1-D array. How should I represent my spatio-temporal data so that it works for a vanilla LSTM?
Here is the snippet of the vanilla LSTM that I'm using in PyTorch:
class ImageLSTM(nn.Module):
    def __init__(self, n_inputs: int = 49,
                 n_outputs: int = 4096,
                 n_hidden: int = 256,
                 n_layers: int = 1,
                 bidirectional: bool = False):
        """
        Takes a 1D flattened image.
        """
        super(ImageLSTM, self).__init__()
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_outputs = n_outputs
        self.n_layers = n_layers
        self.bidirectional = bidirectional
        self.lstm = nn.LSTM(input_size=self.n_inputs,
                            hidden_size=self.n_hidden,
                            batch_first=False,
                            num_layers=self.n_layers,
                            bidirectional=self.bidirectional)
        if self.bidirectional:
            self.FC = nn.Sequential(
                nn.Linear(self.n_hidden * 2, self.n_outputs),
                nn.Dropout(p=0.5),
                nn.Sigmoid()
            )
        else:
            self.FC = nn.Sequential(
                nn.Linear(self.n_hidden, self.n_outputs),
                nn.Dropout(p=0.5),
                nn.Sigmoid()
            )

    def init_hidden(self, x, device=None):  # input 4D tensor: (batch size, channels, width, height)
        # initialize the hidden and cell state to zero
        # vectors: (number of layers, sequence length, number of hidden nodes)
        if self.bidirectional:
            h0 = torch.zeros(2 * self.n_layers, 1, self.n_hidden)
            c0 = torch.zeros(2 * self.n_layers, 1, self.n_hidden)
        else:
            h0 = torch.zeros(self.n_layers, 1, self.n_hidden)
            c0 = torch.zeros(self.n_layers, 1, self.n_hidden)
        if device is not None:
            h0 = h0.to(device)
            c0 = c0.to(device)
        self.hidden = (h0, c0)

    def forward(self, X):  # X: tensor of shape (batch_size, channels, width, height)
        # forward propagate LSTM
        lstm_out, self.hidden = self.lstm(X, self.hidden)  # lstm_out: (seq_length, batch_size, hidden_size)
        out = self.FC(lstm_out[:, -1, :])
        return out
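As a sketch of the flattening described above (the concrete shapes are my own assumption, with one image/segmentation pair per timestep, not from the original post), the frames can be reshaped so that each frame becomes one LSTM timestep and its flattened pixels become the input features:

import torch
from torch import nn

B, T, C, H, W = 2, 10, 2, 7, 7   # batch, timesteps, channels (image + mask), height, width

frames = torch.randn(B, T, C, H, W)

# one timestep per frame; each frame flattened to a C*H*W feature vector
# nn.LSTM with batch_first=False expects (seq_len, batch, input_size)
seq = frames.flatten(start_dim=2)      # (B, T, C*H*W)
seq = seq.permute(1, 0, 2)             # (T, B, C*H*W)

lstm = nn.LSTM(input_size=C * H * W, hidden_size=256, batch_first=False)
out, (h, c) = lstm(seq)
print(out.shape)  # torch.Size([10, 2, 256])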

Why return self.head(x.view(x.size(0), -1)) in the nn.Module for the PyTorch reinforcement learning example

I understand that the pole-balancing example requires 2 outputs (see the Reinforcement Learning (DQN) Tutorial).
Here is the output for self.head
print ('x',self.head)
x = Linear(in_features=512, out_features=2, bias=True)
When I run the epochs, these are the outputs:
print (self.head(x.view(x.size(0), -1)))
return self.head(x.view(x.size(0), -1))
tensor([[-0.6945, -0.1930]])
tensor([[-0.0195, -0.1452]])
tensor([[-0.0906, -0.1816]])
tensor([[ 0.0631, -0.9051]])
tensor([[-0.0982, -0.5109]])
...
The size of x is:
x = torch.Size([121, 32, 2, 8])
So I am trying to understand what x.view(x.size(0), -1) is doing?
I understand from the comment in the code that it's returning:
Returns tensor([[left0exp,right0exp]...]).
But how does x, which is torch.Size([121, 32, 2, 8]), get reduced to a tensor of size 2?
Is there an alternative way of writing this that makes more sense? What if I had 4 outputs? How would I represent that? Why x.size(0)? Why -1?
So it appears to take self.head from 4 outputs to 2 outputs. Is that correct?
At the bottom is the class I am referring to:
class DQN(nn.Module):
    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Number of Linear input connections depends on output of conv2d layers
        # and therefore the input image size, so compute it.
        def conv2d_size_out(size, kernel_size=5, stride=2):
            return (size - (kernel_size - 1) - 1) // stride + 1
        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.head = nn.Linear(linear_input_size, outputs)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))
x.view(x.size(0), -1) is flattening the tensor; this is because the Linear layer only accepts a vector (a 1-D array) per sample. To break it down, x.view() reshapes the tensor to the specified shape (more info). x.size(0) returns the 1st dimension of the tensor (the batch size, which should remain constant). The -1 in x.view() is a filler; in other words, it stands for the dimension we don't specify, so PyTorch calculates it automatically. For example, if x = torch.tensor([1,2,3,4]), to reshape the tensor to 2x2 you could do x.view(2,2), x.view(2,-1) or x.view(-1,2).
The output shape is not a tensor of shape 2 but of shape [121, 2] (121 is the batch size, and the 2 comes from the Linear layer's output). So to change the output size from 2 to 4, you would have to change the outputs argument in the __init__ function to 4.
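To make that concrete with the shapes from the question (a small illustrative sketch; 512 = 32*2*8, matching the in_features=512 printed above):

import torch
from torch import nn

x = torch.randn(121, 32, 2, 8)       # batch of 121 feature maps from the conv stack
head = nn.Linear(512, 2)             # 512 = 32 * 2 * 8, 2 = number of actions

flat = x.view(x.size(0), -1)         # keep the batch dim, flatten the rest
print(flat.shape)                    # torch.Size([121, 512])
print(head(flat).shape)              # torch.Size([121, 2])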

Why do the images generated by a GAN get darker as the network trains more?

I created a simple DCGAN with 6 layers and trained it on the CelebA dataset (a portion of it containing 30K images).
I noticed my network's generated images look dimmed, and as the network trains more, the bright colors fade into dim ones!
Here are some examples:
This is how CelebA images look (real images used for training):
and these are the generated ones; the number shows the epoch number (they were trained for 30 epochs ultimately):
What is the cause of this phenomenon?
I tried all the general tricks concerning GANs, such as rescaling the input images between -1 and 1, not using BatchNorm in the first layer of the discriminator and in the last layer of the generator, and
using LeakyReLU(0.2) in the discriminator and ReLU in the generator. Yet I have no idea why the images are this dim/dark!
Is this caused simply by having fewer training images?
Or is it caused by deficiencies in the networks? If so, what is the source of such deficiencies?
Here is how these networks are implemented:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_batch(in_dim, out_dim, kernel_size, stride, padding, batch_norm=True):
    layers = nn.ModuleList()
    conv = nn.Conv2d(in_dim, out_dim, kernel_size, stride, padding, bias=False)
    layers.append(conv)
    if batch_norm:
        layers.append(nn.BatchNorm2d(out_dim))
    return nn.Sequential(*layers)

class Discriminator(nn.Module):
    def __init__(self, conv_dim=32, act=nn.ReLU()):
        super().__init__()
        self.conv_dim = conv_dim
        self.act = act
        self.conv1 = conv_batch(3, conv_dim, 4, 2, 1, False)
        self.conv2 = conv_batch(conv_dim, conv_dim*2, 4, 2, 1)
        self.conv3 = conv_batch(conv_dim*2, conv_dim*4, 4, 2, 1)
        self.conv4 = conv_batch(conv_dim*4, conv_dim*8, 4, 1, 1)
        self.conv5 = conv_batch(conv_dim*8, conv_dim*10, 4, 2, 1)
        self.conv6 = conv_batch(conv_dim*10, conv_dim*10, 3, 1, 1)
        self.drp = nn.Dropout(0.5)
        self.fc = nn.Linear(conv_dim*10*3*3, 1)

    def forward(self, input):
        batch = input.size(0)
        output = self.act(self.conv1(input))
        output = self.act(self.conv2(output))
        output = self.act(self.conv3(output))
        output = self.act(self.conv4(output))
        output = self.act(self.conv5(output))
        output = self.act(self.conv6(output))
        output = output.view(batch, self.fc.in_features)
        output = self.fc(output)
        output = self.drp(output)
        return output

def deconv_convtranspose(in_dim, out_dim, kernel_size, stride, padding, batchnorm=True):
    layers = []
    deconv = nn.ConvTranspose2d(in_dim, out_dim, kernel_size=kernel_size, stride=stride, padding=padding)
    layers.append(deconv)
    if batchnorm:
        layers.append(nn.BatchNorm2d(out_dim))
    return nn.Sequential(*layers)

class Generator(nn.Module):
    def __init__(self, z_size=100, conv_dim=32):
        super().__init__()
        self.conv_dim = conv_dim
        # make the 1d input into a 3d output of shape (conv_dim*4, 4, 4)
        self.fc = nn.Linear(z_size, conv_dim*4*4*4)  # 4x4
        # conv and deconv layers work on 3d volumes, so we now only need to pass
        # the number of fmaps and the input volume size (its h,w which is 4x4!)
        self.drp = nn.Dropout(0.5)
        self.deconv1 = deconv_convtranspose(conv_dim*4, conv_dim*3, kernel_size=4, stride=2, padding=1)
        self.deconv2 = deconv_convtranspose(conv_dim*3, conv_dim*2, kernel_size=4, stride=2, padding=1)
        self.deconv3 = deconv_convtranspose(conv_dim*2, conv_dim, kernel_size=4, stride=2, padding=1)
        self.deconv4 = deconv_convtranspose(conv_dim, conv_dim, kernel_size=3, stride=2, padding=1)
        self.deconv5 = deconv_convtranspose(conv_dim, 3, kernel_size=4, stride=1, padding=1, batchnorm=False)

    def forward(self, input):
        output = self.fc(input)
        output = self.drp(output)
        output = output.view(-1, self.conv_dim*4, 4, 4)
        output = F.relu(self.deconv1(output))
        output = F.relu(self.deconv2(output))
        output = F.relu(self.deconv3(output))
        output = F.relu(self.deconv4(output))
        # we create the image using tanh!
        output = torch.tanh(self.deconv5(output))
        return output

# testing nets
dd = Discriminator()
zd = np.random.rand(2, 3, 64, 64)
zd = torch.from_numpy(zd).float()
# print(dd)
print(dd(zd).shape)

gg = Generator()
z = np.random.uniform(-1, 1, size=(2, 100))
z = torch.from_numpy(z).float()
print(gg(z).shape)
I think the problem lies rather in the architecture itself, and I would first consider the overall quality of the generated images rather than their brightness or darkness. The generations clearly get better as you train for more epochs. I agree that the images get darker, but even in the early epochs the generated images are significantly darker than the ones in the training samples (at least compared to the ones you posted).
Now, coming back to your architecture: 30k samples are actually enough to obtain very convincing results, as achieved by state-of-the-art models in face generation. Your generations do get better, but they are still far from being "very good".
I think the generator is definitely not strong enough and is the problematic part. (The fact that your generator loss skyrockets can also be a hint for this.) In the generator, all you do is upsample and upsample. You should note that the transposed convolution is more like a heuristic and does not provide much learnability. This is related to the nature of the problem: when you are doing convolutions, you have all the information and you are trying to learn to encode, but in the decoder you are trying to recover information that was previously lost :). So, in a way, it is harder to learn because the information taken as input is limited and lacking.
In fact, deterministic bilinear interpolation methods perform similarly to or even better than transposed convolutions, and these are purely based on scaling/extending with zero learnability (https://arxiv.org/pdf/1707.05847.pdf).
To observe the limits of transposed convolutions, I suggest that you replace all the ConvTranspose2d layers with UpSampling2D (https://www.tensorflow.org/api_docs/python/tf/keras/layers/UpSampling2D), and I claim the results will not be much different (the PyTorch equivalent is nn.Upsample). UpSampling2D is one of those deterministic methods that I mentioned.
To improve your generator, you can try to insert convolutional layers between the upsampling layers. These layers would refine the features/images and correct some of the mistakes that occurred during upsampling; in addition, the next upsampling layer would take a more informative input. What I mean is to try a U-Net-like decoder, which you can find in this link (https://arxiv.org/pdf/1505.04597.pdf). Of course, that would be a primary step to explore. There are many more GAN architectures that you can try, which will probably perform better.
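As an illustration of that last suggestion (my own sketch, not code from the answer), a generator block could pair nn.Upsample with a regular convolution instead of a transposed convolution; the channel sizes below simply mirror the conv_dim values used above:

import torch
from torch import nn

def upsample_conv_block(in_dim, out_dim, batchnorm=True):
    # deterministic 2x upsampling followed by a learnable 3x3 conv that
    # refines the enlarged feature map (keeps the spatial size)
    layers = [
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_dim, out_dim, kernel_size=3, padding=1),
    ]
    if batchnorm:
        layers.append(nn.BatchNorm2d(out_dim))
    return nn.Sequential(*layers)

x = torch.randn(2, 128, 4, 4)            # e.g. conv_dim*4 = 128 feature maps at 4x4
block = upsample_conv_block(128, 96)     # conv_dim*3 = 96
print(block(x).shape)                    # torch.Size([2, 96, 8, 8])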

PyTorch and Convolutional Neural Networks

I have an image input of 340px*340px and I want to classify it into 2 classes.
I want to create a convolutional neural network (PyTorch framework). I have a problem with the input and output of a layer.
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 3 channels (RGB), kernel=5, but i don't understand why 6.
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        # why 16?
        self.conv2 = nn.Conv2d(6, 16, 5)
        # why 107584 = 328*328
        self.fc1 = nn.Linear(107584, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # i dont understand this line
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
Is this network correct?
# 3 channels (RGB), kernel=5, but i don't understand why 6.
The second parameter of Conv2d is out_channels. In a convolutional layer you can define an arbitrary number of output channels, so it's set to 6 simply because someone chose 6.
# why 16?
Same as above.
# why 107584 = 328*328
and
# i dont understand this line
Tensor.view() returns a new tensor with the same data as the self tensor but with a different shape.
x = x.view(x.size(0), -1): x.size(0) keeps the batch dimension, and -1 means "infer this dimension from the others", so the tensor is flattened from [B, 16, 82, 82] to [B, 16*82*82] = [B, 107584]. (Tracing the shapes: conv1 with kernel 5 turns 340x340 into 336x336, the first pool halves it to 168x168, conv2 gives 164x164, and the second pool gives 82x82 with 16 channels.)
So 107584 is indeed the correct value for self.fc1 = nn.Linear(107584, 120); it comes from 16*82*82, and the fact that it also happens to equal 328*328 is a coincidence.
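A quick way to verify that number (a small sketch, not part of the original answer) is to trace the shapes with a dummy input:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 340, 340)            # dummy 340x340 RGB image
conv1 = nn.Conv2d(3, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)
pool = nn.MaxPool2d(2, 2)

x = pool(F.relu(conv1(x)))
print(x.shape)                              # torch.Size([1, 6, 168, 168])
x = pool(F.relu(conv2(x)))
print(x.shape)                              # torch.Size([1, 16, 82, 82])
print(x.view(x.size(0), -1).shape)          # torch.Size([1, 107584]) = 16*82*82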
