A fairly easy one, but I'm just going crazy with it right now.
When applying dropout to regularize my neural network, where should it be applied?
For the example, let's imagine 2 convolutional layers followed by 1 fully connected layer, where "A2" denotes the activations of the second conv layer. Should I apply dropout to those activations, or should I apply it to the weights of the following fully connected layer? Or does it not really matter?
My intuition tells me that the right thing is to apply dropout to the weights of the fully connected layer rather than to the activations of the second conv layer, but I have seen the opposite in many places.
I have seen two similar questions, but neither of them has a satisfying answer.
Both are valid. Dropping activations is called dropout, and dropping weights is called DropConnect; DropConnect is a generalized version of dropout. This image from the DropConnect paper explains it well.
In the case of DropConnect, if all the weights into node u3 are zero (3/4 of them are dropped), that is the same as applying dropout to node r3. Another difference lies in the mask matrix of the weights.
The left one represents the mask matrix of DropConnect, while the right one shows the effective mask matrix if dropout is applied to two consecutive layers.
Notice the pattern in the dropout mask matrix.
The authors show that DropConnect beats dropout on benchmark datasets and produces state-of-the-art results.
Since DropConnect is the generalized version, I would go with it.
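To make the distinction concrete, here is a minimal NumPy sketch (the sizes and drop probability are arbitrary) of how the two masks act on a single dense layer:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # drop probability
a = rng.normal(size=(4,))                  # activations from the previous layer
W = rng.normal(size=(4, 3))                # weights of the next fully connected layer

# Dropout: mask whole activations, which effectively zeroes entire rows of W.
act_mask = (rng.random(a.shape) >= p).astype(float)
out_dropout = (a * act_mask / (1 - p)) @ W

# DropConnect: mask individual weights, a finer-grained mask on W itself.
conn_mask = (rng.random(W.shape) >= p).astype(float)
out_dropconnect = a @ (W * conn_mask / (1 - p))
```

Dropping one activation is equivalent to dropping an entire row of the weight mask at once, which is why DropConnect is the more general scheme.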
I must implement this network:
It is similar to a siamese network with a contrastive loss. My problem is with S1/F1. The paper says:
"F1 and S1 are neural networks that we use to learn the unit-normalized embeddings for the face and speech modalities, respectively. In Figure 1, we depict F1 and S1 in both training and testing routines. They are composed of 2D convolutional layers (purple), max-pooling layers (yellow), and fully connected layers (green). ReLU non-linearity is used between all layers. The last layer is a unit-normalization layer (blue). For both face and speech modalities, F1 and S1 return 250-dimensional unit-normalized embeddings".
My question is:
How can I apply a 2D convolutional layer (purple) to an input with shape (number of videos, number of frames, features)?
What is the last layer? Batch norm? F.normalize?
I will answer your two questions without going into too much detail:
If you're working with a CNN, your input most likely carries spatial information, i.e. it is a two-dimensional multi-channel tensor (*, channels, height, width), not a feature vector (*, features). You simply won't be able to apply a convolution to your input (at least a 2D conv) if you don't retain that two-dimensionality.
The last layer is described as a "unit-normalization" layer. This is merely the operation of making the vector's norm equal to 1, which you can do by dividing the vector by its norm.
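For the second point, a minimal PyTorch sketch (assuming the 250-dimensional embeddings mentioned in the paper) could look like this; F.normalize and the explicit division are equivalent:

```python
import torch
import torch.nn.functional as F

emb = torch.randn(8, 250)                          # a batch of 250-dimensional embeddings

unit = F.normalize(emb, p=2, dim=1)                # built-in L2 unit-normalization
manual = emb / emb.norm(p=2, dim=1, keepdim=True)  # the same operation done by hand

print(unit.norm(dim=1))                            # every row now has norm ~1
```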
I am creating a CNN to predict the distributed strain applied to an optical fiber from the measured light spectrum (2D), which is ideally a Lorentzian curve. The label is a 1D array where only the strained section is non-zero (the label looks like a square wave).
My CNN has 10 alternating convolution and pooling layers, all activated by ReLU. This is then followed by 3 fully connected hidden layers with softmax activation, then an output layer activated by ReLU. Usually, CNNs and other neural networks use ReLU for the hidden layers and softmax for the output layer (in the case of classification problems). In my case, however, I use softmax first to determine the positions along the optical fiber where strain is applied (i.e. where it is non-zero), and then ReLU in the output layer for regression.
My CNN is able to predict the labels rather accurately, but I cannot find any publication where softmax is used in hidden layers followed by ReLU in the output layer, nor any explanation of why this approach is not recommended (or whether it is even mathematically sound), other than what I found on Quora/Stack Overflow. I would really appreciate it if anyone could enlighten me on this matter, as I am fairly new to deep learning and wish to learn from this. Thank you in advance!
If you look at the way a layer l sees the input from the previous layer l-1, it assumes that the dimensions of the feature vector are linearly independent.
If the model is building some kind of confidence using a set of neurons, then those neurons had better be linearly independent; otherwise it is simply exaggerating the value of one neuron.
If you apply softmax in the hidden layers, you are essentially coupling multiple neurons and tampering with their independence. Also, ReLU is preferred because it gives you better gradients than other activations such as the sigmoid. Finally, if your goal is to add normalization to your layers, you'd be better off using an explicit batch normalization layer.
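As an illustration (all layer sizes here are made up, not taken from the question), the pattern this answer recommends, i.e. ReLU hidden layers with an explicit batch normalization layer and no softmax inside the network, looks roughly like this in Keras:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical regression head: ReLU in the hidden layers, explicit
# BatchNormalization for normalization, and ReLU on the output to keep
# the predicted strain values non-negative.
model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(256,)),
    layers.BatchNormalization(),
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(32, activation="relu"),
])
model.compile(optimizer="adam", loss="mse")
```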
In Keras, if there are two dense layers in a neural network, then all neurons of the first layer are connected to ALL neurons of the second layer. Can I delete a few connections from the dense layer based on some criterion on the weights, so that the result is a sparse layer in which not every neuron of the first layer is connected to every neuron of the second layer?
I tried setting the weights that were below a threshold to zero. But this did not serve the purpose of deleting/removing the connections from the network, because after I re-trained the network, the weights that had been forced to zero regained some value through gradient descent.
Have you tried adding dropout? This will randomly zero out a subset of a layer's outputs at each training update, which sounds like what you want. It is one of many decent methods for combating overfitting.
https://keras.io/api/layers/regularization_layers/dropout/
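A minimal sketch of what that would look like between two dense layers (the sizes and rate are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    # Randomly zeroes half of the previous layer's outputs at each
    # training update; at inference time all connections are active again.
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```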
I am trying to implement an LSTM neural network based on the variational RNN architecture defined in Yarin Gal and Zoubin Ghahramani's paper https://arxiv.org/pdf/1512.05287.pdf using Keras with Tensorflow backend in Python.
The idea is basically to apply the same dropout mask at each time step, both on the input/output connections and the recurrent connections, as shown in this figure:
Reading the Keras documentation, I see that we can apply dropout on LSTM cells with the arguments dropout and recurrent_dropout. My first question is:
Using only these arguments, is the same dropout mask applied at each time step? And if not, is there a way of doing it?
Then, I also see that we can create a Dropout layer after an LSTM cell and, using the noise_shape argument, force the layer to apply the same dropout mask at each time step. I have done this by setting noise_shape=(K.shape(x)[0], 1, K.shape(x)[2]). My second question is:
Does the Dropout layer apply to the recurrent connections if it is placed after an LSTM layer?
To summarize, I have the feeling that with the first method, we can apply dropout on the recurrent connections but we cannot apply the same dropout mask at each time step, and that with the second method, it's the opposite. Am I wrong?
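For reference, here is a minimal Keras sketch of the two setups being compared (the shapes and rates are placeholders; this is not a claim about which setup matches the paper):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(None, 32))      # (batch, time steps, features)

# Method 1: the built-in dropout arguments on the LSTM layer.
x = layers.LSTM(64, return_sequences=True,
                dropout=0.25, recurrent_dropout=0.25)(inputs)

# Method 2: a separate Dropout layer whose noise_shape broadcasts over the
# time dimension, so a single per-sample mask is reused at every time step.
x = layers.Dropout(0.25, noise_shape=(None, 1, 64))(x)

model = tf.keras.Model(inputs, x)
```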
I'd like to reproduce a recurrent neural network where each time layer is followed by a dropout layer, and these dropout layers share their masks. This structure was described in, among others, A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.
As far as I understand the code, the recurrent network models implemented in MXNet do not have any dropout layers applied between time layers; the dropout parameter of functions such as lstm (R API, Python API) actually defines dropout on the input. Therefore I'd need to reimplement these functions from scratch.
However, the Dropout layer does not seem to take a variable that defines the mask as a parameter.
Is it possible to make multiple dropout layers in different places of the computation graph, yet sharing their masks?
According to the discussion here, it is not possible to specify the mask, and setting a random seed does not affect dropout's random number generator.
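If that is indeed the case, one workaround is to sample a single Bernoulli mask yourself and reuse it at every step of a manually unrolled loop. A rough NDArray sketch (the weights and the rnn step below are made up for illustration, not part of the MXNet RNN API):

```python
import mxnet as mx

def shared_dropout_mask(shape, drop_prob):
    # One inverted-dropout mask, sampled once and reused at every time step.
    keep_prob = 1.0 - drop_prob
    mask = mx.nd.random.uniform(0, 1, shape=shape) < keep_prob
    return mask / keep_prob

# Toy manually unrolled RNN; W_xh and W_hh are illustrative placeholders.
batch, in_dim, hid, seq_len = 4, 8, 16, 5
W_xh = mx.nd.random.normal(shape=(in_dim, hid)) * 0.1
W_hh = mx.nd.random.normal(shape=(hid, hid)) * 0.1

x = mx.nd.random.normal(shape=(seq_len, batch, in_dim))
h = mx.nd.zeros((batch, hid))
h_mask = shared_dropout_mask((batch, hid), drop_prob=0.25)

for t in range(seq_len):
    # The same h_mask multiplies the recurrent state at every step,
    # which is the shared-mask behaviour described in the question.
    h = mx.nd.tanh(mx.nd.dot(x[t], W_xh) + mx.nd.dot(h * h_mask, W_hh))
```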