The input image size in tf.gradients - python

I'm trying to calculate the gradient at some layer with respect to the input image. The gradient is defined as
feature = g.get_tensor_by_name('inception/conv2d0_pre_relu:0')
gradient = tf.gradients(tf.reduce_max(feature, 3), x)
and my input image has a spatial size of 299x299, which is the size Inception was trained on
print(img.shape)
# output (299,299,3)
Then the gradient with respect to the input can be calculated as
img_4d=img[np.newaxis]
res = sess.run(gradient, feed_dict={x: img_4d})[0]
print(res.shape)
# output (1,299,299,3)
We see that the gradient has the same shape as the input image, which is expected.
However, it appears that one can feed an image of any size and still get the gradient. For example, if I have an img_resized with shape (150,150,3), the gradient with respect to this input also has shape (150,150,3):
img_resized=skimage.transform.resize(img, [150,150], preserve_range=True)
img_4d=img_resized[np.newaxis]
res = sess.run(gradient, feed_dict={x: img_4d})[0]
res.shape
# output (1,150,150,3)
So why does this work? In my naive understanding, the dimension of the input image must be fixed at (299,299,3), and the gradient at some layer with respect to the input should always have the shape (299,299,3). Why can a gradient of a different size be produced?
In other words, what happens in the above code? When we feed an image with shape (150,150,3), does tensorflow resize the image to (299,299,3) and calculate the gradient with shape (299,299,3), and then resize the gradient back to (150,150,3)?

This is expected behaviour, especially in the case of the Inception net, whose convolutional layers can work with input of any size because they form a fully convolutional (sub)network. TensorFlow does not resize anything: a convolution simply slides over whatever spatial extent you feed in, so the activations, and therefore the gradient, keep the spatial shape of your input. Unlike AlexNet or VGG, which rely on fully connected layers in the later part of the network, fully convolutional networks can work on any sized input. Hope this answers your question.
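Here is a minimal, self-contained sketch of the same behaviour (a toy conv layer rather than the real Inception graph, written against the TF 1.x API used in the question): a convolution defined over a placeholder with unknown spatial dimensions yields gradients whose spatial shape simply follows whatever you feed in.
import numpy as np
import tensorflow as tf  # TF 1.x style, as in the question

x = tf.placeholder(tf.float32, [None, None, None, 3])           # spatial size left unknown
w = tf.Variable(tf.truncated_normal([3, 3, 3, 8], stddev=0.1))  # a 3x3 conv with 8 filters
feature = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
gradient = tf.gradients(tf.reduce_max(feature, 3), x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for size in (299, 150):
        img_4d = np.zeros((1, size, size, 3), np.float32)
        print(sess.run(gradient, feed_dict={x: img_4d})[0].shape)
# prints (1, 299, 299, 3) and then (1, 150, 150, 3)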

Related

CNN cropped image loss

I would like to train a CNN only on a cropped part of the CNN output. This means that the input image has a resolution of w x h, and the output image produced by my model is also w x h. The loss should then be computed by comparing the centre crop of the output image (w/2 x h/2) with the label.
Is PyTorch taking care of the cropping or will my model not learn properly because of the cropping?
I am aware that training a CNN with a variable input size is possible; however, I am not sure if the weights will be adjusted correctly since my loss operates on a different resolution than my model's output.
Thanks!
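A minimal PyTorch sketch of the setup being described (hypothetical sizes and a stand-in one-layer model): the loss is computed only on the w/2 x h/2 centre crop of the output, and autograd backpropagates through the slicing, so only that region contributes to the gradients.
import torch
import torch.nn as nn

w, h = 64, 64                                        # hypothetical input resolution
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)    # stand-in for the real w x h -> w x h model
x = torch.randn(1, 3, h, w)
label = torch.randn(1, 3, h // 2, w // 2)            # target already at the crop resolution

out = model(x)                                       # (1, 3, h, w)
top, left = h // 4, w // 4                           # centre crop of size h/2 x w/2
crop = out[:, :, top:top + h // 2, left:left + w // 2]
loss = nn.MSELoss()(crop, label)
loss.backward()                                      # gradients flow only from the cropped region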

how to get the output of a CNN with same dimension as the input

I have grayscale images whose pixel arrays are stored in x_train and x_test.
x_train has shape (2500, 21, 512) and x_test has shape (500, 21, 512).
I want to build a CNN whose outputs y_train and y_test also have shapes (2500, 21, 512) and (500, 21, 512), but hold the arrays of other images that I want the network to predict.
In the MNIST examples this is done by taking y_train and y_test as vectors of values and producing an output of shape (3000, 1). How could I do the same but for my images?
Hmmmm I don't fully understand your question, but I will take a stab. Please let me know if I misinterpreted your question.
Your model takes the following input:
x_train: the image.
And outputs:
x_hat = an image with the same dimensions as `x_train`
Judging by the described architecture, it seems like you are building a convolutional autoencoder. Am I correct?
If so, you have to do the following:
You need to add a channel dimension of one so that the CNN can receive the input, which can be done by reshaping the tensor. Convolutional neural network inputs have the shape (batch_size, channels, width, height). If you don't want to add a channel, you can use a simple feed-forward neural network (an MLP) instead. In that case you still have to flatten the inputs to the shape (batch_size, pixels). For a more concrete example, with the MNIST dataset and a batch_size of 32, your input dimension would be (32, 784), since MNIST images are 28 x 28; flattening each image gives an input size of 784.
You can create a convolutional autoencoder by doing strided convolutions to downsample the images in the encoder layers. Afterwards, you can take the intermediate representation and do an upsampling operation via transposed convolutions. If you want to train a model that can actually generate samples instead of reconstructing, I recommend looking up variational autoencoders and generative adversarial networks.
The implementation will vary depending on the framework (e.g. PyTorch, TensorFlow, etc.).
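For example, a rough PyTorch sketch of such a convolutional autoencoder (toy layer sizes, not a tuned architecture, assuming inputs shaped like your x_train) could look like this:
import torch
import torch.nn as nn

x_train = torch.randn(2500, 21, 512)            # stand-in for the grayscale images
x = x_train[:32].unsqueeze(1)                   # add a channel dim -> (32, 1, 21, 512)

autoencoder = nn.Sequential(
    # encoder: strided convolution downsamples (32, 1, 21, 512) -> (32, 16, 11, 256)
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    # decoder: transposed convolution upsamples back to (32, 1, 21, 512);
    # output_padding=(0, 1) accounts for the odd height / even width
    nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=(0, 1)),
)

x_hat = autoencoder(x)                          # reconstruction, same shape as x
loss = nn.functional.mse_loss(x_hat, x)         # train by minimising reconstruction error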

FCN with patches creates boundary

I am trying to train a U-Net model to do per-pixel regression predictions on images. To do this, I split my large image (1000x1000) into 200x200 pixel squares and use those to train an FCN model with a linear final layer. The loss function is MSE loss. In the prediction stage, I extract the same boxes and stitch them together to obtain a final output image. When I do that, the problem I am getting is that there are discontinuities at the boundaries between boxes (I can clearly see the boxes).
I've tried to deal with this by feeding 250x250 boxes to my FCN and calculating the loss only for the 200x200 centre region. I do the same in the prediction stage: extract 250x250 patches, crop the 200x200 centre region, and stitch the image back together. Please see some code below:
Loss Function:
criterion = nn.MSELoss()
optimizer = optim.Adam(self.model.parameters(), lr=LR)
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    output = model(inputs)
    output = output.squeeze()
    _, dimx, dimy = output.shape
    loss = criterion(output[:, 25:dimx-25, 25:dimy-25], labels[:, 25:dimx-25, 25:dimy-25])
    loss.backward()
    optimizer.step()
My code for predictions is as follows:
pred = np.zeros((height, width))
for i in range(25, height, 200):
    for j in range(25, width, 200):
        patch = img[:, i-25:i+225, j-25:j+225]
        patch = torch.from_numpy(patch)
        patch = patch.unsqueeze(dim=0).to(device)
        out = model(patch)
        out = out[0, 0, 25:225, 25:225]
        pred[i:i+200, j:j+200] = out.cpu().numpy()
I'm not sure if my problem makes complete sense. I can provide more clarification if necessary but I have been stuck on this for a while now.
It makes sense to have discontinuities near the boundaries, because nothing during training requires the network to produce smooth predictions across boxes.
I assume you have limited GPU memory, so you take only 200x200 pixels as input at a time; thus, I would suggest the following two possible workarounds.
First, you could use torchvision.transforms.RandomCrop to generate 200x200 cropped regions as training inputs. At test time, you directly feed the whole image to get the prediction. The intuition is that the model then sees images at the full resolution, the same as the test data, while consuming less GPU memory during training. In this case, you should also expect the model to need more time to learn all the training-data patterns, because it only sees part of the data at a time.
Second, you could simply downsample the training data, say 0.5x, and keep the output size, i.e. 1x. For example, in your case, after downsampling the input image to 200x200, the model takes it and predicts 1000x1000 pixel-level labels (you could use bilinear upsampling or deconv layers). This workaround has been used in some segmentation implementations (AdaptSeg, DISE).
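A minimal sketch of the first workaround (hypothetical tensor names; your real dataset, model and optimizer plug in where indicated): train on random 200x200 crops taken at the same position from image and label, then run the fully convolutional model on the whole 1000x1000 image at test time, so there is no stitching and hence no patch boundary.
import torch

img = torch.randn(1, 3, 1000, 1000)      # full-resolution input
lbl = torch.randn(1, 1, 1000, 1000)      # per-pixel regression target

# draw one random 200x200 window and crop image and label identically
i = torch.randint(0, 1000 - 200 + 1, (1,)).item()
j = torch.randint(0, 1000 - 200 + 1, (1,)).item()
img_crop = img[:, :, i:i + 200, j:j + 200]
lbl_crop = lbl[:, :, i:i + 200, j:j + 200]

# training step on the crop:
#   out = model(img_crop); loss = criterion(out, lbl_crop); ...

# test time: feed the whole image at once, no patches to stitch:
#   pred = model(img)     # (1, 1, 1000, 1000)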
After some troubleshooting I realized that I had this problem because I was performing batch normalization between each convolutional layer. Removing that step solved the discontinuity problem.

how to use RGB values in feedforward neural network?

I have a data set of colored images as an ndarray of shape (100, 20, 20, 3) and 100 corresponding labels. When passing them as input to a fully connected neural network (not a CNN), what should I do with the 3 RGB values? Averaging them would perhaps lose some information, but if I don't manipulate them, my main issue is the batch size, as demonstrated below in PyTorch.
for epoch in range(n_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # because of rgb values, now images is 3 times the length of labels
        images = Variable(images.view(-1, 400))
        labels = Variable(labels)
        optimizer.zero_grad()
        outputs = net(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
This returns 'ValueError: Expected input batch_size (300) to match target batch_size (100).' Should I have reshaped images into (1, 1200) dimension tensors? Thanks in advance for answers.
Since labels has size (100,), your batch data should have shape (100, H, W, C). I'm assuming your data loader returns a tensor of shape (100, 20, 20, 3). The error happens because you reshape that tensor to (300, 400).
Check whether your network architecture expects an input tensor of shape (20, 20, 3).
If your network can only accept single-channel images, you can first convert your RGB images to grayscale.
Or, modify your network architecture to accept 3-channel images. One convenient way is to add an extra layer that reduces 3 channels to 1 channel; then you don't need to change the other parts of the network.
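For reference, a minimal sketch (the 10-class head is an assumption) of flattening each RGB image to a 20*20*3 = 1200-long vector, so the batch dimension stays at 100 instead of becoming 300:
import torch
import torch.nn as nn

images = torch.randn(100, 20, 20, 3)          # one batch from the data loader
labels = torch.randint(0, 10, (100,))         # assumed 10 classes

net = nn.Sequential(nn.Linear(20 * 20 * 3, 64), nn.ReLU(), nn.Linear(64, 10))

flat = images.view(images.size(0), -1)        # (100, 1200), not (300, 400)
outputs = net(flat)                           # (100, 10)
loss = nn.CrossEntropyLoss()(outputs, labels)  # batch sizes now match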
Alternatively, use grayscale images, so that flattening with view(-1, 400) gives back a batch size of 100 instead of 300.

After going through convolution steps, what should be the shape of the tensor in fully connected layer?

So let's assume that I have RGB images of shape [128,128,3], and I want to create a CNN with two Conv-ReLU-MaxPool layers as below.
def cnn(input_data):
    # conv1
    conv1_weight = tf.Variable(tf.truncated_normal([4, 4, 3, 25], stddev=0.1), dtype=tf.float32)
    conv1_bias = tf.Variable(tf.zeros([25]), dtype=tf.float32)
    conv1 = tf.nn.conv2d(input_data, conv1_weight, [1, 1, 1, 1], 'SAME')
    relu1 = tf.nn.relu(tf.add(conv1, conv1_bias))
    max_pool1 = tf.nn.max_pool(relu1, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
    # conv2
    conv2_weight = tf.Variable(tf.truncated_normal([4, 4, 25, 50], stddev=0.1), dtype=tf.float32)
    conv2_bias = tf.Variable(tf.zeros([50]), dtype=tf.float32)
    conv2 = tf.nn.conv2d(max_pool1, conv2_weight, [1, 1, 1, 1], 'SAME')
    relu2 = tf.nn.relu(tf.add(conv2, conv2_bias))
    max_pool2 = tf.nn.max_pool(relu2, [1, 2, 2, 1], [1, 1, 1, 1], 'SAME')
After this step, I need to transform the output into a 1xN layer for the next fully connected layer. However, I am not sure how I should determine N. Is there a specific formula involving the layer sizes, strides, max-pool size, image size, etc.? I am pretty lost at this stage of the problem, even though I think I get the intuition behind a CNN.
I understand that you want to transform the multiple 2D feature maps that come out of the last convolutional/pooling layer to a vector that can be fed into a fully-connected layer. Or to be precise and include the batch dimension, go from shape [batch, width, height, feature_maps] to [batch, N].
The above already implies that N = width * height * feature_maps, since reshaping keeps the overall number of elements the same. width and height depend on the size of your inputs and the strides of your network layers (convolution and/or pooling).
A stride of x simply divides the size by x. You have inputs of size 128 in each dimension, and two pooling layers with stride 2. Thus after the first pooling layer your images are 64x64 and after the second they are 32x32, so width = height = 32. Normally we would have to account for padding as well but the point of SAME padding is precisely that we don't have to worry about that.
Finally, feature_maps is 50 since that is how many filters your last convolutional layer has (pooling doesn't modify this). So N = 32*32*50 = 51200.
Thus, you should be able to do tf.reshape(max_pool2, [-1, 51200]) (or tf.reshape(max_pool2, [-1, 32*32*50]) to keep it more interpretable) and feed the resulting 2D tensor through a fully-connected layer (i.e. tf.matmul).
The simplest way would be to just use tf.layers.flatten(max_pool2). This function does all the above for you and just gives you the [batch, N] result.
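For completeness, a short TF 1.x sketch of that reshape feeding a fully connected layer, using the N = 51200 figure derived above (the 256 hidden units are arbitrary):
N = 32 * 32 * 50                                    # = 51200, from the reasoning above
flat = tf.reshape(max_pool2, [-1, N])               # [batch, 51200]
fc_weight = tf.Variable(tf.truncated_normal([N, 256], stddev=0.1))
fc_bias = tf.Variable(tf.zeros([256]))
fc1 = tf.nn.relu(tf.matmul(flat, fc_weight) + fc_bias)   # [batch, 256]
# or equivalently: flat = tf.layers.flatten(max_pool2)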
First of all, since you are starting out, I would recommend Keras instead of pure TensorFlow. To answer your question regarding the shape, refer to this blog post by Andrej Karpathy.
Quote from the blog:
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output.
Now coming to your TensorFlow implementation:
For the conv1 stage you have a 4x4 filter with a depth of 25. Since you have used padding="SAME" (and stride 1) for both conv1 and maxpool1, the output's 2D spatial dimensions are the same as the input's in both cases. That is, after conv1 your output size is 128x128x25, and for the same reason the output of your maxpool1 layer has the same size. Since you also use padding "SAME" for conv2, its output shape is 128x128x50 (you changed the number of output channels). Thus after maxpool2 your dimensions are: batch_size, 128, 128, 50. Before adding a Dense layer you therefore have 3 major options:
1) flattening the tensor, which results in a shape of: batch_size, 128*128*50
2) global average pooling, which results in a shape of: batch_size, 50
3) global max pooling, which also results in a shape of: batch_size, 50.
Note:
A global average pooling layer is similar to average pooling, but we average the entire feature map instead of a window, hence the name "global". For example, in your case you have dimensions batch_size, 128, 128, 50, which means you have 50 feature maps with spatial dimensions 128x128. What global average pooling does is average each 128x128 feature map down to a single number, so you end up with 50 values in total. This is very useful in designing fully convolutional architectures like Inception, ResNet, etc., because it makes the network's input generic, meaning you can feed an image of any size into the network. Global max pooling is very similar; the slight difference is that it takes the max value of the feature map instead of the average.
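In TensorFlow both of these are one-liners (assuming max_pool2 has shape [batch, 128, 128, 50] as described above):
gap = tf.reduce_mean(max_pool2, axis=[1, 2])   # global average pooling -> [batch, 50]
gmp = tf.reduce_max(max_pool2, axis=[1, 2])    # global max pooling     -> [batch, 50]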
Problems with this architecture:
Generally it is not recommended to use padding="SAME" in max-pooling layers. If you look at the source code of VGG16, you will see that after each block (conv, ReLU and max pooling) the spatial size is halved. The general structure is thus to reduce the spatial dimensions while increasing the depth/channels.
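For example, a stride-2 max pool (the usual VGG-style choice) actually halves the spatial size:
halved = tf.nn.max_pool(relu2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
# [batch, 128, 128, 50] -> [batch, 64, 64, 50]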
Flattening the layer:
var_name = tf.layers.flatten(max_pool2)
This should work, and it's what almost every example of a TensorFlow CNN uses.
