I am trying to create a Faster R-CNN-like model. I get stuck when it comes to the ROI pooling from the feature map. I know bilinear sampling can be used here, but it may not help for end-to-end training. How do I implement this ROI pooling layer in TensorFlow?
Bilinear sampling - as the name suggests - can actually be used even with end-to-end training, as it's basically a linear operation. However, the disadvantage is that your local maxima (i.e. strong excitations of certain units) could vanish because your sampling points just happen to land close to the minima. To remedy this, you can instead apply a max_pool(features, kernel, stride) operation where kernel and stride are adjusted such that the final output of the max pool operation always has the same dimensions.
An example: your features have size 12x12 and you would like to pool to 4x4; then setting kernel=(3,3) and stride=(3,3) achieves exactly that, and for each 3x3 patch, the strongest excitation in the respective feature map is kept in the output.
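A minimal TF1-style sketch of that 12x12 -> 4x4 case (the batch size and channel count are just placeholders):

import tensorflow as tf

# One ROI cropped from the feature map: 12x12 spatial size, 256 channels,
# with a leading batch dimension of 1 (all numbers here are made up).
roi_features = tf.placeholder(tf.float32, [1, 12, 12, 256])

# kernel and stride are both input_size / output_size = 12 / 4 = 3,
# so the output is always 4x4 regardless of what the 3x3 patches contain.
pooled = tf.nn.max_pool(roi_features,
                        ksize=[1, 3, 3, 1],
                        strides=[1, 3, 3, 1],
                        padding='VALID')   # shape: [1, 4, 4, 256]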
I recently came across a method in PyTorch while trying to implement AlexNet.
I don't understand how it works. Please explain the idea behind it with some examples, and how it is different from max pooling or average pooling in terms of neural network functionality.
nn.AdaptiveAvgPool2d((6, 6))
In average pooling or max pooling, you essentially set the stride and kernel size yourself as hyper-parameters. You will have to re-configure them if you happen to change your input size.
In adaptive pooling, on the other hand, you specify the output size instead, and the stride and kernel size are automatically selected to match it. The following equations are used in the source code to calculate the values:
Stride = (input_size//output_size)
Kernel size = input_size - (output_size-1)*stride
Padding = 0
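A small sketch of the equivalence (the tensor sizes are made up; the two outputs match exactly when the input size divides evenly by the output size):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 12, 12)            # e.g. a 12x12 feature map with 64 channels

adaptive = nn.AdaptiveAvgPool2d((6, 6))   # only the output size is specified
out_a = adaptive(x)                       # shape: (1, 64, 6, 6)

# Same result with plain average pooling, using the formulas above:
# stride = 12 // 6 = 2, kernel = 12 - (6 - 1) * 2 = 2, padding = 0
manual = nn.AvgPool2d(kernel_size=2, stride=2, padding=0)
out_m = manual(x)                         # also (1, 64, 6, 6)

print(torch.allclose(out_a, out_m))       # True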
I am trying to train a model which classifies images.
The problem I have is that they have different sizes. How should I format my images and/or my model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
If that is your problem, here are some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; do scale and perspective even mean anything to the content?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a square size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take care of the bigger work for you.
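For instance, a TF1-style sketch (the 224x224 target size is just an example):

import tensorflow as tf

# An image of arbitrary height and width with 3 channels.
image = tf.placeholder(tf.float32, [None, None, 3])

# Center-crops images larger than 224x224 and zero-pads smaller ones,
# so every image comes out with the same spatial size.
fixed = tf.image.resize_image_with_crop_or_pad(image, 224, 224)   # [224, 224, 3]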
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there is actually a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.
Try making a spatial pyramid pooling layer and put it after your last convolutional layer, so that the FC layers always get constant-dimensional vectors as input (see the sketch below). During training, train on the images from the entire dataset using a particular image size for one epoch. Then for the next epoch, switch to a different image size and continue training.
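A minimal PyTorch sketch of such a layer, built from adaptive max pooling; the pyramid levels (4, 2, 1) are just an example choice:

import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    # Pools the feature map at several fixed output sizes and concatenates
    # the flattened results, so the FC layers always see the same vector length.
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(level) for level in levels])

    def forward(self, x):                       # x: (N, C, H, W), any H and W
        feats = [pool(x).flatten(start_dim=1) for pool in self.pools]
        return torch.cat(feats, dim=1)          # (N, C * sum(l * l for l in levels))

spp = SpatialPyramidPooling()
print(spp(torch.randn(2, 256, 13, 13)).shape)   # torch.Size([2, 5376])
print(spp(torch.randn(2, 256, 20, 27)).shape)   # torch.Size([2, 5376])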
I am self-studying with the Udacity PyTorch course.
Regarding the last paragraph:
Learning
In the code you've been working with, you've been setting the values of filter weights explicitly, but neural networks will actually learn the best filter weights as they train on a set of image data. You'll learn all about this type of neural network later in this section, but know that high-pass and low-pass filters are what define the behavior of a network like this, and you know how to code those from scratch!
In practice, you'll also find that many neural networks learn to detect the edges of images, because the edges of objects contain valuable information about the shape of an object.
I have studied all the way through the last (44th) section, but I couldn't answer the following questions:
What are the initial weights when I do torch.nn.Conv2d? And how can I define them myself?
How does PyTorch update weights in the convolutional layer?
When you declare nn.Conv2d, the weights are initialized via this code.
In particular, it uses the initialization proposed by Kaiming et al.: the weights are drawn from a uniform distribution on (-bound, bound), where bound = sqrt(6 / ((1 + a^2) * fan_in)) (see here).
You can initialize weight manually too. This has been answered elsewhere (See here) and I won't repeat it.
When you call optimizer.step() and the optimizer has the parameters of the convolutional filter registered, they are updated.
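For reference, a minimal sketch tying these pieces together (the layer sizes, the Xavier re-initialization, and the learning rate are arbitrary choices, not PyTorch defaults):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)      # weights get the default Kaiming-uniform init
print(conv.weight.shape)                    # torch.Size([16, 3, 3, 3])

# Manual re-initialization (Xavier here, purely as an example):
nn.init.xavier_uniform_(conv.weight)
nn.init.zeros_(conv.bias)

# The weights change as soon as optimizer.step() runs on the registered parameters:
opt = torch.optim.SGD(conv.parameters(), lr=0.1)
loss = conv(torch.randn(1, 3, 8, 8)).sum()
loss.backward()
opt.step()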
1. In PyTorch, Conv2d is designed to accept a 4D tensor of shape (N, C, H, W) as input for the forward pass, where N is the number of samples in the mini-batch, C is the number of input channels (for example the 3 color channels of an image), and H and W are the height and width of an image.
Your weights should reflect that and be a 4D tensor of shape (F, C, K_H, K_W), where F is the number of different kernels you would like to have in this layer, C is the number of input channels, and K_H and K_W are the height and width of the kernels. The exact values of the initialization can be computed using the formula in the PyTorch docs, in the nn.Conv2d definition.
Here is a great figure which helps to visualize the computation:
[Figure: cross-correlation computation with 2 input channels. Ref. http://www.d2l.ai/chapter_convolutional-neural-networks/channels.html, Fig. 6.4.1]
2. Weights are updated using the back-propagation algorithm by calculating gradients. This is executed under the hood in PyTorch. If you are initializing the weights yourself, you should set requires_grad=True on the weight tensor to specify that this tensor should be updated by back propagation.
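A quick sketch of that shape convention and of requires_grad, with made-up numbers:

import torch
import torch.nn.functional as F

# Manually created conv weights: F = 8 kernels, C = 3 input channels, K_H = K_W = 5.
# requires_grad=True so back propagation will compute gradients for this tensor.
weight = torch.randn(8, 3, 5, 5, requires_grad=True)

images = torch.randn(4, 3, 32, 32)          # (N, C, H, W) mini-batch
out = F.conv2d(images, weight, padding=2)   # (4, 8, 32, 32)

out.sum().backward()
print(weight.grad.shape)                    # torch.Size([8, 3, 5, 5])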
I am building a 1D Convolutional Neural Network (CNN). From many sources I have understood that the performance of a CNN increases if more layers are added. However, at each pooling layer, my output shape is 50% smaller than my input (because I use a pool size of 2). This means that I can't add more layers once my output has shape 1.
Are there ways to overcome this 'decreasing shape' problem, or is it just a matter of increasing my input shape?
I am building a 1D Convolutional Neural Network (CNN). From many sources I have understood that performance of the CNN increases if more layers are added.
This is not always true. It usually depends on the data you have and the task you are trying to solve.
Quoting https://www.quora.com/Why-do-we-use-pooling-layer-in-convolutional-neural-networks
Pooling allows features to shift relative to each other resulting in robust matching of features even in the presence of small distortions. There are also many other benefits of doing pooling, like:
Reduces the spatial dimension of the feature map.
And hence also reducing the number of parameters high up the processing hierarchy. This simplifies the overall model complexity.
So, depending on the stride, pooling size and padding, you may deliberately reduce your output shape.
Going back to your question: if you don't want your shape to decrease, consider using strides=1 and padding='same'.
(see https://keras.io/layers/pooling)
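For example, in Keras that would look like this (a minimal sketch; the input shape and filter counts are made up):

from tensorflow.keras import layers, models

# With strides=1 and padding='same', the pooling layers keep the sequence
# length, so stacking more blocks no longer shrinks the output towards 1.
model = models.Sequential([
    layers.Conv1D(32, kernel_size=3, padding='same', activation='relu',
                  input_shape=(100, 1)),
    layers.MaxPooling1D(pool_size=2, strides=1, padding='same'),   # length stays 100
    layers.Conv1D(64, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling1D(pool_size=2, strides=1, padding='same'),   # still 100
])
model.summary()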
I'm implementing a UNet for binary segmentation using Sigmoid and BCELoss. The problem is that after several iterations the network predicts very small values for every pixel, while for some regions (the ground-truth mask region) it should predict values close to one. Does this give any intuition about what is going wrong?
Besides, there also exists NLLLoss2d, which is used for pixel-wise loss. Currently I'm simply ignoring this and using MSELoss() directly. Should I use NLLLoss2d with a Sigmoid activation layer?
Thanks
It seems to me like your sigmoids are saturating the activation maps. Either the images are not properly normalised or some batch-normalisation layers are missing. If you have an implementation that works with other images, check the image loader and make sure it does not saturate the pixel values. This usually happens with 16-bit channels. Can you share some of the input images?
You might want to use torch.nn.BCEWithLogitsLoss(), replacing the Sigmoid and the BCELoss function.
An excerpt from the docs tells you why it's always better to use this loss function implementation:
This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
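In practice the swap is small; a minimal sketch (the tensor shapes are made up):

import torch
import torch.nn as nn

logits = torch.randn(4, 1, 64, 64)                       # raw UNet outputs, no Sigmoid applied
target = torch.randint(0, 2, (4, 1, 64, 64)).float()     # binary ground-truth masks

# Instead of nn.Sigmoid() followed by nn.BCELoss(), feed the raw logits directly:
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, target)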