mxnet: multiple dropout layers with shared mask - python

I'd like to reproduce a recurrent neural network where each time layer is followed by a dropout layer, and these dropout layers share their masks. This structure was described in, among others, A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.
As far as I understand the code, the recurrent network models implemented in MXNet do not have any dropout layers applied between time layers; the dropout parameter of functions such as lstm (R API, Python API) actually defines dropout on the input. Therefore I'd need to reimplement these functions from scratch.
However, the Dropout layer does not seem to accept a parameter that specifies the mask.
Is it possible to make multiple dropout layers in different places of the computation graph, yet sharing their masks?

According to the discussion here, it is not possible to specify the mask, and setting a random seed has no effect on the random number generator used by dropout.
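To make the target behaviour concrete, here is a minimal NumPy sketch of what "shared mask" means: the mask is sampled once per sequence and then multiplied into the hidden state after every time step. This is not MXNet code and does not use the Dropout operator; the toy tanh cell and the names sample_mask and run_rnn are made up for illustration. Whether the same trick maps cleanly onto MXNet's computation graph is not something this sketch shows.

```python
import numpy as np

def sample_mask(shape, p, rng):
    """Inverted-dropout mask: 0 with probability p, else 1 / (1 - p)."""
    keep = rng.random(shape) >= p
    return keep.astype(np.float64) / (1.0 - p)

def run_rnn(xs, w_in, w_rec, p, rng):
    """Toy tanh RNN; one mask is drawn per sequence and reused at every step."""
    batch, steps, _ = xs.shape
    hidden = w_rec.shape[0]
    mask = sample_mask((batch, hidden), p, rng)  # drawn once, shared below
    h = np.zeros((batch, hidden))
    outputs = []
    for t in range(steps):
        h = np.tanh(xs[:, t] @ w_in + h @ w_rec)
        h = h * mask  # the same mask is applied after every time step
        outputs.append(h)
    return np.stack(outputs, axis=1), mask

rng = np.random.default_rng(0)
xs = rng.normal(size=(4, 5, 3))   # (batch, time, features), made-up data
w_in = rng.normal(size=(3, 8))
w_rec = rng.normal(size=(8, 8))
outs, mask = run_rnn(xs, w_in, w_rec, p=0.5, rng=rng)
dropped = mask == 0               # units that are off for the whole sequence
```

A unit that is dropped for a given sequence stays at zero for all of its time steps, which is the property independent per-step dropout layers would not give you.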

Related

Which layer should I use when I build a Neural Network with Tensorflow 2.x?

I'm currently studying TensorFlow 2.0 and Keras. I know that activation functions are used to calculate the output of each layer of a neural network, based on mathematical functions. However, when searching for information about layers, I can't find a concise and easy-to-read overview for a beginner in deep learning.
There is the Keras documentation, but I would like a concise summary of:
What are the most common layers used to create a model (Dense, Flatten, MaxPooling2D, Dropout, ...)?
In which case should each of them be used (classification, regression, other)?
What is the appropriate way to use each layer in each case?
Depending on the problem you want to solve, there are different activation functions and loss functions that you can use.
Regression problem: You want to predict the price of a building from N features. The price is a real number, so a common choice is mean_squared_error as the loss function and a linear activation for the last node. In this case, you can have a couple of Dense() layers with relu activation, while your last layer is a Dense(1,activation='linear').
In between the Dense() layers, you can add Dropout() to mitigate overfitting (if present).
Classification problem: You want to detect whether or not someone is diabetic, taking several factors/features into account. In this case, you can again use stacked Dense() layers, but your last layer will be a Dense(1,activation='sigmoid'), since you want a yes/no decision on whether a patient is diabetic. The loss function in this case is 'binary_crossentropy'. In between the Dense() layers, you can add Dropout() to mitigate overfitting (if present).
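The two output/loss pairings above can be written out by hand. The sketch below computes them with NumPy on made-up numbers; it only illustrates the formulas and is not how Keras implements them internally. The function names are chosen for the example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, prob, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(prob) + (1 - y_true) * np.log(1 - prob))

# Regression: linear output, real-valued target (made-up prices).
prices = np.array([250.0, 310.0])
predicted = np.array([240.0, 330.0])
mse = mean_squared_error(prices, predicted)  # (10**2 + 20**2) / 2

# Binary classification: sigmoid squashes the last layer's raw score into (0, 1).
labels = np.array([1.0, 0.0])
scores = np.array([2.0, -1.0])
bce = binary_crossentropy(labels, sigmoid(scores))
```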
Image processing problems: Here you will typically have stacks of [Conv2D(),MaxPool2D(),Dropout()]. MaxPooling2D is an operation typical of image processing and also of some natural language processing (not going to expand upon that here). In convolutional neural network architectures, a Flatten() layer is often used before the final Dense() layers. Its purpose is to reshape the feature maps into a 1D vector per sample whose length equals the total number of elements across all feature maps. For example, a single [28,28] matrix is flattened to (1,784), where 784=28*28; the batch dimension is left untouched.
Although the question is quite broad and some people may vote to close it, I tried to give you a short overview of what you asked. I recommend that you start by learning the basics behind neural networks and then delve deeper into using a framework such as TensorFlow or PyTorch.
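As a quick check of the flattening arithmetic, here is a NumPy reshape standing in for Flatten() (an illustration only, with a made-up batch of three inputs): the batch axis is kept and the remaining axes are merged.

```python
import numpy as np

batch = np.zeros((3, 28, 28))             # three made-up 28x28 inputs
flat = batch.reshape(batch.shape[0], -1)  # keep the batch axis, merge the rest
```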

Keras' Sequential vs Functional API for Multi-Task Learning Neural Network

I would like to design a neural network for a multi-task deep learning task. Within the Keras API we can either use the "Sequential" or "Functional" approach to build such a neural network. Underneath I provide the code I used to build a network using both approaches to build a network with two outputs:
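Sequential
# Imports assumed (omitted in the original post)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
seq_model = Sequential()
seq_model.add(LSTM(32, input_shape=(10,2)))
seq_model.add(Dense(8))
seq_model.add(Dense(2))
seq_model.summary()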
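Functional
# Imports assumed (omitted in the original post)
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
input1 = Input(shape=(10,2))
lay1 = LSTM(32)(input1)  # the Input layer already fixes the shape
lay2 = Dense(8)(lay1)
out1 = Dense(1)(lay2)
out2 = Dense(1)(lay2)
func_model = Model(inputs=input1, outputs=[out1, out2])
func_model.summary()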
When I look at the summary outputs of both models, each of them contains an identical number of trainable params:
Up until now, this looks fine. However, I start doubting myself when I plot both models (using keras.utils.plot_model), which results in the following graphs:
Personally, I do not know how to interpret these. When using a multi-task learning approach, I want all neurons (in my case 8) of the layer before the output layer to connect to both output neurons. For me this clearly shows in the Functional API (where I have two Dense(1) instances), but it is not very clear from the Sequential API. Nevertheless, the number of trainable params is identical, suggesting that in the Sequential API the last layer is also fully connected to both neurons in the Dense output layer.
Could anybody explain to me the differences between those two examples, or are those fully identical and result in the same neural network architecture? Also, which one would be preferred in this case?
Thank you a lot in advance.
The difference between Sequential and functional keras API:
The sequential API allows you to create models layer-by-layer for most
problems. It is limited in that it does not allow you to create models
that share layers or have multiple inputs or outputs.
The functional API allows you to create models that have a lot more
flexibility, as you can easily define models where layers connect to
more than just the previous and next layers. In fact, you can connect
layers to (literally) any other layer. As a result, creating complex
networks such as siamese networks and residual networks becomes
possible.
To answer your question:
No, these APIs are not the same, and it is normal that both models here report the same number of parameters.
Which one to use? It depends on the use you want to make of this network. What are you doing the training for? What do you want the output to be?
I recommend this link to make the most of the concept.
Sequential Models & Functional Models
I hope I helped you understand better.
Both models are (in theory) equivalent, as the two output nodes do not have any interaction between them.
It is just that the required outputs have a different shape
[(batch_size,2)]
vs
[(batch_size,),(batch_size,)]
and thus, the loss will be different.
The total loss is averaged for the sequential model in this example, whereas it is summed up for the functional model with two outputs (at least with a default loss such as MSE).
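The factor between the two totals can be checked by hand. The NumPy sketch below uses random made-up targets and predictions and assumes plain MSE with equal (default) loss weights; it mirrors the arithmetic only and does not call Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=(16, 2))
y_pred = rng.normal(size=(16, 2))

# One output of shape (batch, 2): MSE averages over both columns.
single_output_loss = np.mean((y_true - y_pred) ** 2)

# Two outputs of shape (batch, 1): one MSE per output, then added together.
per_output = [np.mean((y_true[:, i] - y_pred[:, i]) ** 2) for i in range(2)]
two_output_loss = sum(per_output)
```

Under these assumptions the summed two-output loss is exactly twice the single-output loss, so the gradients differ only by that constant factor.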
Of course, you can also adapt the functional model to be exactly equivalent to the sequential model:
out1 = Dense(2)(lay2)
#out2 = Dense(1)(lay2)
func_model = Model(inputs=input1, outputs=out1)
Maybe you will also need some activations after the Dense layers.
Both networks are functionally equivalent. Dense layers are fully connected by definition, which is considered to be the most basic and simple design that can be assumed for "normal" neural networks not otherwise specified. The exact learned parameters and behavior may vary slightly based on the implementation. The graph presented is ambiguous only because it does not show the connection of the neurons (which may number in the millions), but rather provides a symbolic representation of the connectivity with its name (Dense), in this case indicating a fully connected layer.
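To see why one Dense(2) and two Dense(1) heads compute the same thing, here is a small NumPy sketch with made-up weights (it does not use Keras): each column of the Dense(2) weight matrix plays the role of one Dense(1) head.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))   # activations of the 8-unit layer, batch of 4
W = rng.normal(size=(8, 2))   # weights of a single Dense(2)
b = rng.normal(size=(2,))

dense2 = h @ W + b            # one layer with two units

# Two Dense(1) heads that reuse the columns of W and the entries of b.
head1 = h @ W[:, :1] + b[:1]
head2 = h @ W[:, 1:] + b[1:]
two_heads = np.concatenate([head1, head2], axis=1)

params_dense2 = W.size + b.size       # 8*2 weights + 2 biases
params_two_heads = 2 * (8 * 1 + 1)    # two heads of 8 weights + 1 bias each
```

Both variants have 18 parameters in the output stage, which matches the identical totals reported by summary().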
I expect that the sequential model (or equivalent functional model using one dense layer with two neurons as the output) would be faster because it can use a simplified optimization path, but I have not tested this and I have no knowledge of the compile time optimizations performed by Tensorflow.

Variational Dropout in Keras

I am trying to implement an LSTM neural network based on the variational RNN architecture defined in Yarin Gal and Zoubin Ghahramani's paper https://arxiv.org/pdf/1512.05287.pdf using Keras with Tensorflow backend in Python.
The idea is basically to apply the same dropout mask at each time step, both on the input/output connections and the recurrent connections, as shown in this figure:
Reading the Keras documentation, I see that we can apply dropout on LSTM cells with the arguments dropout and recurrent_dropout. My first question is:
Using only these arguments, is the same dropout mask applied at each time step? And if not, is there a way of doing it?
Then, I also see that we can create a Dropout layer after an LSTM cell, and using the noise_shape argument, we can force the layer to apply the same dropout mask at each time. I have done this by setting noise_shape=(K.shape(x)[0], 1, K.shape(x)[2]). My second question is:
Does the Dropout layer apply to the recurrent connections if it is placed after an LSTM layer?
To summarize, I have the feeling that with the first method, we can apply dropout on the recurrent connections but we cannot apply the same dropout mask at each time step, and that with the second method, it's the opposite. Am I wrong?
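For reference, the effect of a noise_shape with a size-1 time axis can be mimicked in NumPy. This is only a sketch of the broadcasting on made-up data and does not call Keras: a mask of shape (batch, 1, features) is repeated along the time axis, so every time step of a sequence sees the same mask. Note that this multiplication acts on the layer's output tensor only; nothing inside the LSTM's recurrence is touched by it.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5
x = np.ones((2, 6, 4))  # (batch, time, features), made-up LSTM outputs

# Mask with a size-1 time axis, mirroring noise_shape=(batch, 1, features).
keep = rng.random((x.shape[0], 1, x.shape[2])) >= rate
mask = keep / (1.0 - rate)

y = x * mask  # broadcasting repeats the mask along the time axis
```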

Dropout for LSTM recurrent weights in tensorflow

TensorFlow's DropoutWrapper allows applying dropout to the cell's inputs, outputs, or states. However, I haven't seen an option to do the same thing for the recurrent weights of the cell (4 out of the 8 different matrices used in the original LSTM formulation). I just wanted to check that this is the case before implementing a wrapper of my own.
EDIT:
Apparently this functionality has been added in newer versions (my original comment referred to v1.4): https://github.com/tensorflow/tensorflow/issues/13103
That is because the original LSTM-with-dropout model applies dropout only to the input and output layers (that is, only to the non-recurrent connections). This paper is considered the "textbook" description of LSTM with dropout: https://arxiv.org/pdf/1409.2329.pdf
More recently, some people have tried applying dropout to the recurrent connections as well. If you want to look at the implementation and the math behind it, search for "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" by Yarin Gal. I'm not sure whether TensorFlow or Keras has already implemented this approach, though.

Unset trainable attributes for parameters in lasagne / nolearn neural networks

I'm implementing a convolutional neural network using lasagne and nolearn.
I'd like to fix some parameters that were pre-learned.
How can I make some layers untrainable?
Although I removed the 'trainable' attribute from some layers,
the number shown in the layer information before fitting, namely
Neural Network with *** learnable parameters, never changes.
Besides, I suspect that in the greeting function
in 'handlers.py'
def _get_greeting(nn):
    shapes = [param.get_value().shape for param in
              nn.get_all_params() if param]
the last line should be
              nn.get_all_params(trainable=True) if param]
but I'm not sure how that would affect training.
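The unchanged number is what one would expect if the count is taken over all parameters rather than only those tagged as trainable. The sketch below mimics tag-based filtering with plain Python data; it does not call lasagne or nolearn, and count_params and the parameter table are hypothetical stand-ins invented for the example.

```python
from functools import reduce
import operator

# Stand-in for a network's parameters: name -> (shape, tags).
# Hypothetical data; real lasagne params are Theano shared variables.
params = {
    "conv1.W": ((32, 1, 3, 3), {"trainable", "regularizable"}),
    "conv1.b": ((32,), {"trainable"}),
    "dense.W": ((128, 10), {"regularizable"}),  # 'trainable' tag removed
    "dense.b": ((10,), set()),                  # 'trainable' tag removed
}

def count_params(params, **tags):
    """Sum the element counts of params whose tags match the requested flags."""
    total = 0
    for shape, param_tags in params.values():
        if all((tag in param_tags) == wanted for tag, wanted in tags.items()):
            total += reduce(operator.mul, shape, 1)
    return total

all_count = count_params(params)                        # no filter
trainable_count = count_params(params, trainable=True)  # frozen layers excluded
```

With this made-up table the unfiltered count stays the same no matter which tags are removed, while the filtered count shrinks as layers are frozen.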
