I need to implement this network, which is similar to a Siamese network with a contrastive loss. My problem is with S1/F1. The paper says the following:
"F1 and S1 are neural networks that we use to learn the unit-normalized embeddings for the face and speech modalities, respectively. In Figure 1, we depict F1 and S1 in both training and testing routines. They are composed of 2D convolutional layers (purple), max-pooling layers (yellow), and fully connected layers (green). ReLU non-linearity is used between all layers. The last layer is a unit-normalization layer (blue). For both face and speech modalities, F1 and S1 return 250-dimensional unit-normalized embeddings".
My questions are:
How can I apply a 2D convolutional layer (purple) to an input with shape (number of videos, number of frames, features)?
What is the last layer? Batch norm? F.normalize?
I will answer your two questions without going too much into detail:
If you're working with a CNN, your input most likely carries spatial information, that is, your input is a two-dimensional multi-channel tensor (*, channels, height, width), not a feature vector (*, features). You simply won't be able to apply a convolution to your input (at least a 2D conv) if you don't retain that two-dimensionality.
The last layer is described as a "unit-normalization" layer. This is merely the operation of making the vector's norm equal to 1, which you can do by dividing the vector by its norm.
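As a minimal PyTorch sketch of both points (the layer sizes are made up, and treating each video's (frames, features) matrix as a single-channel 2D input is just one possible assumption):

import torch
import torch.nn.functional as F

x = torch.randn(8, 100, 40)          # (number of videos, number of frames, features)
x = x.unsqueeze(1)                   # add a channel dim -> (8, 1, 100, 40) so a 2D conv can be applied
conv = torch.nn.Conv2d(1, 16, 3)     # hypothetical 2D convolutional layer
h = F.relu(conv(x))

emb = torch.randn(8, 250)            # pretend these are the 250-d embeddings from the last FC layer
emb = F.normalize(emb, p=2, dim=1)   # unit-normalization: divide each vector by its L2 norm
# equivalently: emb = emb / emb.norm(dim=1, keepdim=True)
print(emb.norm(dim=1))               # all ones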
I am creating a CNN to predict the distributed strain applied to an optical fiber from the measured light spectrum (2D), which is ideally a Lorentzian curve. The label is a 1D array where only the strained section is non-zero (the label looks like a square wave).
My CNN has 10 alternating convolution and pooling layers, all activated by ReLU. This is followed by 3 fully-connected hidden layers with softmax activation, then an output layer activated by ReLU. Usually, CNNs and other neural networks use ReLU for the hidden layers and softmax for the output layer (in the case of classification problems). But in this case, I use softmax first to determine the positions of the optical fiber for which strain is applied (i.e. non-zero) and then use ReLU in the output for regression. My CNN is able to predict the labels rather accurately, but I cannot find any supporting publication where softmax is used in hidden layers followed by ReLU in the output layer, nor any explanation of why this approach is not recommended (or not mathematically sound) other than what I found on Quora/Stack Overflow. I would really appreciate it if anyone could enlighten me on this matter, as I am pretty new to deep learning and wish to learn from this. Thank you in advance!
If you look at the way a layer l sees the input from the previous layer l-1, it assumes that the dimensions of the feature vector are linearly independent.
If the model is building some kind of confidence using a set of neurons, then the neurons had better be linearly independent; otherwise it is simply exaggerating the value of one neuron.
If you apply softmax in the hidden layers, you are essentially combining multiple neurons and tampering with their independence. Also, one of the reasons ReLU is preferred is that it gives you better gradients than saturating activations such as sigmoid. Finally, if your goal is to add normalization to your layers, you'd be better off using an explicit batch normalization layer.
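As a rough sketch of that recommendation (ReLU in the hidden layers plus an explicit batch-norm layer instead of softmax between layers), in PyTorch it could look like the following; the layer sizes are arbitrary:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),   # explicit normalization instead of softmax in a hidden layer
    nn.ReLU(),            # ReLU keeps gradients healthy, unlike saturating activations
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 10),    # plain linear output for regression (add ReLU only if targets are non-negative)
)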
Hello, I am a bit new to the deep learning community and I have been struggling with how to feed data through a neural network. I was following the sentdex PyTorch series and learning about convnets. He was using Microsoft's cats and dogs dataset from Kaggle, resized the images to 50 by 50, and turned them into grayscale. If you want to see the video related to my question, here it is -
https://pythonprogramming.net/convnet-model-deep-learning-neural-network-pytorch/
A few thoughts came to my mind while watching the video. The input he passed is only the number of colour channels of the image -
As soon as I saw the input he entered, I wondered why he is only passing the number of channels of a grayscale image, when a Conv2d can take 3 input channels.
And it literally works. I tried researching a bit, but nowhere did I find a good explanation for the input shape that is being fed in here.
So I have 2 thoughts and questions about this -
Does that line mean that the convolutional neural network will only take in an image that is grayscale and of any height and width? If so, please tell me how to limit the dimensions so that our CNN only accepts an input shape of (50, 50, 1).
And if not, then please explain what it means and how we can make it accept any input.
Convolutional layers use the convolution operation, i.e. sliding a kernel (matrix) over the input and taking the sum of elementwise products at each position. Thus, the input dimensions affect the output dimensions; however, it is not necessary to fix the input dimensions.
Thus, the layer can be defined as nn.Conv2d(1, 32, 5), where 1 is the number of input channels, 32 is the number of output channels, and 5 is the size of the kernel (5x5 in this case, since it is 2D).
The 32 output channels mean that there will be 32 such 5x5 kernels applied to the input, and their outputs will be stacked to get an output of h x w x 32. Note that this h and w will be different from h_in and w_in if no padding is used, but the same if you use padding.
1 input channel mentioned in the layer means that the layer will accept only single channeled inputs (which are effectively grayscale images).
If you want to limit your CNN to use (50, 50, 1) inputs only, then you can resize the image before feeding it (you can do that using OpenCV).
Check this site for some animations of convolutions.
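A quick shape check of what is described above, assuming PyTorch and a made-up batch of grayscale images:

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 32, 5)          # 1 input channel, 32 output channels, 5x5 kernel
x = torch.randn(4, 1, 50, 50)       # batch of 4 grayscale 50x50 images
print(conv(x).shape)                # torch.Size([4, 32, 46, 46]) -- no padding, so 50 - 5 + 1 = 46
x_big = torch.randn(4, 1, 120, 80)  # a different spatial size is also accepted
print(conv(x_big).shape)            # torch.Size([4, 32, 116, 76])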
Update: Adding more things asked in the comments by the OP.
Yes, you can input images of any shape (I suppose they still have to be at least the size of the kernel). So, theoretically, you can input any image to a convolutional layer, but not necessarily to your CNN. That is because the CNN may have a flattening operation followed by fully connected layers (nn.Linear). This flattening + fully connected part expects certain dimensions (which you fix in the code), so you cannot give just any input image to your CNN: you have to ensure that the flattened output of the last convolutional layer has as many dimensions as your first fully connected layer expects.
Edit: You can actually give any sized input even for a CNN containing fully-connected layers by using a Global Average Pooling (GAP) layer to reduce the size to a fixed size irrespective of the input size. It is called Adaptive Average Pooling in PyTorch.
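A minimal sketch of that idea in PyTorch; the channel count 256 and output size 10 are arbitrary:

import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d((1, 1))       # global average pooling to a fixed 1x1 spatial size
fc = nn.Linear(256, 10)

for size in [(32, 32), (75, 60)]:        # any input size works
    feat = torch.randn(2, 256, *size)    # pretend these are the last conv layer's feature maps
    out = fc(gap(feat).flatten(1))       # (2, 256, 1, 1) -> (2, 256) -> (2, 10)
    print(out.shape)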
For example, consider this network (image attached)
In this network, the convolutional kernel sizes are mentioned below the arrows, and the blue cuboids represent the output after each convolutional layer. At the end, there are fully connected layers (boxes with circles) which have fixed dimensions. So, the last convolutional layer's output has 6 x 6 x 256 = 9216 dimensions, which is also the input dimension of the first fully connected layer.
So, basically, you design your network such that the flattened output of the last convolutional layer has the same dimensions as the first fully connected layer. Note that there are some networks called Fully Convolutional Networks (FCNs) which don't use these fully connected layers and are thus independent of input size. The network design and choice of layers depend on your application.
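As a sketch of that dimension match (only the 6 x 6 x 256 = 9216 figure comes from the description above; the 4096 output size is an arbitrary choice for illustration):

import torch
import torch.nn as nn

last_conv_out = torch.randn(1, 256, 6, 6)   # output of the last conv layer: 256 channels of 6x6
flat = last_conv_out.flatten(1)             # 6 * 6 * 256 = 9216 features
fc1 = nn.Linear(9216, 4096)                 # the first fully connected layer must expect 9216 inputs
print(fc1(flat).shape)                      # torch.Size([1, 4096])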
For learning purposes, I'm trying to code from stratch a simple multi layer perceptron (MLP) neural network, with:
2500 inputs in the input layer,
100 neurons in hidden layers #1 and #2,
and 10 outputs in the output layer
and backpropagation, without using TensorFlow or similar ready-to-use tools.
Each neuron in hidden layer #1 has to be connected to the 2500 inputs and therefore needs to store 2500 coefficients. The same applies to all neurons of all layers.
Question: which data structure is usually used to store all the coefficients from the neurons of layer n-1 to a specific neuron of layer n?
Is there a single data structure (for example in NumPy) that can store all these coefficients for the whole MLP?
Is a tensor (n-dimensional array) mandatory for such things?
Neural networks are mostly just a series of matrix multiplications and non-linear transformations, hence n-dimensional arrays are the natural storage method. Depending on the application, you could use a sparse matrix which stores the coefficients and the indices of those coefficients, but in general the storage is just matrices.
A good peek under the hood of libraries like TensorFlow is to look at/implement a neural network in NumPy.
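Concretely, for the 2500-100-100-10 network described above, a common layout is one NumPy weight matrix (plus a bias vector) per layer; a minimal sketch, with made-up initialization:

import numpy as np

# one weight matrix per layer: rows index layer n-1 units, columns index layer n units
W1 = np.random.randn(2500, 100) * 0.01   # input -> hidden 1
W2 = np.random.randn(100, 100) * 0.01    # hidden 1 -> hidden 2
W3 = np.random.randn(100, 10) * 0.01     # hidden 2 -> output
b1, b2, b3 = np.zeros(100), np.zeros(100), np.zeros(10)

x = np.random.randn(1, 2500)             # one example
h1 = np.maximum(0, x @ W1 + b1)          # forward pass is just matrix multiplications + non-linearities
h2 = np.maximum(0, h1 @ W2 + b2)
out = h2 @ W3 + b3                       # shape (1, 10)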
I am trying to implement a kind of an LSTM network. The LSTM needs to take feature maps from N images at multiple layers. Hence, these feature maps need to be computed in the same way for all N images. So the network would look something like this ideally:
The problem is that there doesn't seem to be a way in Caffe to do this. I can slice my data point (which consists of 3 images) into these 3 images, and I can run separate Conv+Pool layers on them to get my feature maps. But this is not what I want during training: all three images need to go through the same Conv + Pooling weights before being passed to the LSTM layers. How can this be implemented?
I cannot use the concept of batch size here, since I am training on a multi-frame sequence, so each batch consists of M data points, each of which has 3 images.
Make the slice after the conv layer, so that you will have applied the exact same weights to the 3 images. I quickly edited your image to show the idea.
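Caffe aside, the weight-sharing idea (one set of Conv+Pool weights applied identically to all 3 images before the LSTM) can be sketched in PyTorch terms as follows; this is only an illustration of the concept, not the Caffe prototxt, and all layer sizes are made up:

import torch
import torch.nn as nn

shared = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

frames = torch.randn(4, 3, 3, 64, 64)                     # (batch, 3 frames, channels, H, W)
feature_maps = [shared(frames[:, i]) for i in range(3)]   # the exact same weights are applied to each frame
print(feature_maps[0].shape)                              # torch.Size([4, 16, 32, 32])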
I would like to design a deep net with one (or more) convolutional layers (CNN) and one or more fully connected hidden layers on top.
For deep networks with fully connected layers, there are methods in Theano for unsupervised pre-training, e.g., using denoising autoencoders or RBMs.
My question is: How can I implement (in theano) an unsupervised pre-training stage for convolutional layers?
I do not expect a full implementation as an answer, but I would appreciate a link to a good tutorial or a reliable reference.
This paper describes an approach for building a stacked convolutional autoencoder. Based on that paper and some Google searches I was able to implement the described network. Basically, everything you need is described in the Theano convolutional network and denoising autoencoder tutorials with one crucial exception: how to reverse the max-pooling step in the convolutional network. I was able to work that out using a method from this discussion - the trickiest part is figuring out the right dimensions for W_prime as these will depend on the feed forward filter sizes and the pooling ratio. Here is my inverting function:
# needs: import numpy as np, import theano.tensor as T,
#        from theano.tensor.nnet import conv,
#        from theano.sandbox.neighbours import neibs2images
def get_reconstructed_input(self, hidden):
    """ Computes the reconstructed input given the values of the hidden layer """
    # full convolution with the decoder filters W_prime
    repeated_conv = conv.conv2d(input=hidden, filters=self.W_prime, border_mode='full')
    # replicate each convolved value poolsize times so the max-pooling can be undone
    multiple_conv_out = [repeated_conv.flatten()] * np.prod(self.poolsize)
    stacked_conv_neibs = T.stack(*multiple_conv_out).T
    # rearrange the replicated neighbourhoods back into image shape (un-pooling)
    stretch_unpooling_out = neibs2images(stacked_conv_neibs, self.pl, self.x.shape)
    rectified_linear_activation = lambda x: T.maximum(0.0, x)
    return rectified_linear_activation(stretch_unpooling_out + self.b_prime.dimshuffle('x', 0, 'x', 'x'))