I am trying to implement an LSTM neural network based on the variational RNN architecture defined in Yarin Gal and Zoubin Ghahramani's paper https://arxiv.org/pdf/1512.05287.pdf using Keras with the TensorFlow backend in Python.
The idea is basically to apply the same dropout mask at each time step, both on the input/output connections and on the recurrent connections, as illustrated in the figure from the paper.
Reading the Keras documentation, I see that we can apply dropout on LSTM cells with the arguments dropout and recurrent_dropout. My first question is:
Using only these arguments, is the same dropout mask applied at each time step? And if not, is there a way to do it?
Then, I also see that we can create a Dropout layer after an LSTM cell and, using the noise_shape argument, force the layer to apply the same dropout mask at every time step. I have done this by setting noise_shape=(K.shape(x)[0], 1, K.shape(x)[2]). My second question is:
Does the Dropout layer apply to the recurrent connections if it is placed after an LSTM layer?
To summarize, I have the feeling that with the first method, we can apply dropout on the recurrent connections but we cannot apply the same dropout mask at each time step, and that with the second method, it's the opposite. Am I wrong?
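For concreteness, here is a minimal tf.keras sketch of the two approaches described above; the unit count, dropout rates, and feature dimension are illustrative, and the noise_shape assumes a (batch, time, features) layout:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Approach 1: the LSTM layer's own dropout / recurrent_dropout arguments.
inputs = keras.Input(shape=(None, 16))                      # (batch, time, features)
x1 = layers.LSTM(64, dropout=0.25, recurrent_dropout=0.25,
                 return_sequences=True)(inputs)
model_builtin = keras.Model(inputs, x1)

# Approach 2: a separate Dropout layer after the LSTM, with a noise_shape
# broadcast over the time axis so one mask is reused at every time step.
x2 = layers.LSTM(64, return_sequences=True)(inputs)
x2 = layers.Dropout(0.25, noise_shape=(None, 1, 64))(x2)    # (batch, 1, features)
model_separate = keras.Model(inputs, x2)
```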
I'm trying to apply a separate convolution to each layer of a 3-dimensional array, which brought me to the Keras TimeDistributed layer. But the documentation notes that:
"Because TimeDistributed applies the same instance of Conv2D to each of the
timestamps, the same set of weights are used at each timestamp."
However, I want to perform a separate convolution (with independently defined weights/filters) for each layer of the array, not using the same set of weights. Is there a built-in way to do this? Any help is appreciated!
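For reference, one possible workaround is sketched below: split the array along the slice axis, apply an independently weighted Conv2D to each slice, and re-stack the results. The shapes, filter counts, and the Lambda-based slicing are illustrative assumptions, not a built-in Keras feature:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Input: (batch, n_slices, height, width, channels); sizes are illustrative.
n_slices = 4
inputs = keras.Input(shape=(n_slices, 32, 32, 3))

# One independently weighted Conv2D per slice, re-stacked along the slice axis.
slice_outputs = []
for i in range(n_slices):
    slice_i = layers.Lambda(lambda t, i=i: t[:, i])(inputs)        # (batch, 32, 32, 3)
    conv_i = layers.Conv2D(8, 3, padding="same")(slice_i)          # separate weights per slice
    slice_outputs.append(layers.Lambda(lambda t: tf.expand_dims(t, axis=1))(conv_i))

outputs = layers.Concatenate(axis=1)(slice_outputs)                # (batch, n_slices, 32, 32, 8)
model = keras.Model(inputs, outputs)
```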
I designed a neural network model with a large number of outputs predicted by a softmax function. However, I want to group all of the outputs into 5 classes without modifying the architecture of the other layers. The model performs well in the first case, but when I decrease the number of outputs it loses accuracy and generalizes poorly. My question is: is there a method to make my model perform well even with just 5 outputs? For example: adding a dropout layer before the output layer, using a different activation function, etc.
If it is a plain neural network, then yes, definitely use the ReLU activation function in the hidden layers and add a dropout layer after each hidden layer. You can also normalize your data before feeding it to the network.
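As a rough illustration of that advice, here is a minimal tf.keras sketch; the layer widths, dropout rates, and input dimension are hypothetical, and the Normalization layer assumes TF 2.x:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Normalizes inputs; fit its statistics on the training data before training.
normalizer = layers.Normalization()
# normalizer.adapt(x_train)

model = keras.Sequential([
    keras.Input(shape=(20,)),                # illustrative input width
    normalizer,
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                     # dropout after each hidden layer
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(5, activation="softmax"),   # the 5-way output from the question
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```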
A fairly easy one, but it's driving me crazy right now.
When applying dropout to regularize my neural network, where should it be applied?
For the example, let's imagine 2 convolutional layers followed by 1 fully connected layer, where "A2" denotes the activations of the second conv layer. Should I apply dropout to those activations, or should I apply it to the weights of the following fully connected layer? Or does it not really matter?
My intuition tells me that the right thing is to apply dropout to the weights of the fully connected layer and not to the activations of the second conv layer, but I have seen the opposite in many places.
I have seen two similar questions but none of them with a satisfying answer.
Both are valid. When you drop activations it is called dropout, and when you drop weights it is called DropConnect. DropConnect is a generalized version of the dropout method. This image from the DropConnect paper explains it well.
In the case of DropConnect, if all of the weights for node u3 are zero (3/4 of them are zero here), it is the same as applying dropout to node r3. Another difference lies in the mask matrix of the weights.
The left one represents the mask matrix of DropConnect, while the right one shows the effective mask matrix if dropout is applied to two consecutive layers.
Notice the pattern in the dropout mask matrix.
The authors show that DropConnect beats dropout on benchmark datasets and produces state-of-the-art results.
Since DropConnect is the generalized version, I would go with it.
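To make the distinction concrete, here is a small NumPy sketch (shapes and keep probability are illustrative): dropout masks whole activations, which effectively silences entire columns of the weight matrix, while DropConnect masks individual weight entries:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5
x = rng.normal(size=4)         # activations of the previous layer
W = rng.normal(size=(3, 4))    # weights of the next layer

# Dropout: zero whole activations, silencing entire columns of W at once.
activation_mask = rng.binomial(1, keep_prob, size=x.shape)
y_dropout = W @ (x * activation_mask) / keep_prob

# DropConnect: zero individual weight entries independently.
weight_mask = rng.binomial(1, keep_prob, size=W.shape)
y_dropconnect = (W * weight_mask) @ x / keep_prob
```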
TensorFlow's DropoutWrapper allows dropout to be applied to the cell's inputs, outputs, or state. However, I haven't seen an option to do the same thing for the recurrent weights of the cell (4 out of the 8 different matrices used in the original LSTM formulation). I just wanted to check that this is the case before implementing a wrapper of my own.
EDIT:
Apparently this functionality has been added in newer versions (my original comment referred to v1.4): https://github.com/tensorflow/tensorflow/issues/13103
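For reference, here is a sketch of the wrapper options discussed above in the TF 1.x-style API; the keep probabilities and cell size are illustrative, and note that this drops activations/state rather than the recurrent weight matrices themselves:

```python
import tensorflow as tf  # TF 1.x-style API, matching the question

# Hypothetical cell size and keep probabilities.
cell = tf.nn.rnn_cell.LSTMCell(num_units=128)

# DropoutWrapper can drop the cell's inputs, outputs, and/or state;
# variational_recurrent=True reuses the same dropout mask at every time step.
cell = tf.nn.rnn_cell.DropoutWrapper(
    cell,
    output_keep_prob=0.8,
    state_keep_prob=0.8,
    variational_recurrent=True,
    dtype=tf.float32,
)
```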
It's because the standard LSTM dropout scheme only applies dropout to the input and output connections (only to the non-recurrent connections). This paper is considered a "textbook" reference for LSTM with dropout: https://arxiv.org/pdf/1409.2329.pdf
Recently, some people have tried applying dropout in the recurrent layers as well. If you want to look at the implementation and the math behind it, search for "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" by Yarin Gal. I'm not sure whether TensorFlow or Keras has implemented this approach yet, though.
I'd like to reproduce a recurrent neural network where each time layer is followed by a dropout layer, and these dropout layers share their masks. This structure was described in, among others, A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.
As far as I understand the code, the recurrent network models implemented in MXNet do not have any dropout layers applied between time layers; the dropout parameter of functions such as lstm (R API, Python API) actually defines dropout on the input. Therefore I'd need to reimplement these functions from scratch.
However, the Dropout layer does not seem to take a variable that defines the mask as a parameter.
Is it possible to make multiple dropout layers in different places of the computation graph, yet sharing their masks?
According to the discussion here, it is not possible to specify the mask, and using a random seed does not have an impact on dropout's random number generator.
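The shared-mask idea itself can still be sketched independently of MXNet; below is a minimal NumPy illustration (sizes and keep probability are hypothetical) in which one mask is sampled once and reused at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
hidden_size, n_steps = 16, 10

# Sample one mask up front and reuse it at every time step (and at every
# dropout site that should share it), instead of resampling per step.
shared_mask = rng.binomial(1, keep_prob, size=hidden_size) / keep_prob

X = rng.normal(size=(n_steps, hidden_size))    # toy input sequence
W = rng.normal(size=(hidden_size, hidden_size))
h = np.zeros(hidden_size)
for t in range(n_steps):
    h = np.tanh(W @ (h * shared_mask) + X[t])  # same recurrent mask at each step
```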