TensorFlow: What does output_keep_prob in tf.nn.dropout actually mean? - python

I am trying to understand the concept of output_keep_prob.
My example is a simple RNN:

import tensorflow as tf
from tensorflow.contrib import rnn   # TF 1.x

with tf.variable_scope('encoder') as scope:
    cells = rnn.LSTMCell(num_units=500)
    cell = rnn.DropoutWrapper(cell=cells, output_keep_prob=0.5)
    model = tf.nn.bidirectional_dynamic_rnn(cell, cell, inputs=embedding_lookup,
                                            sequence_length=sequence_le,
                                            dtype=tf.float32)
My confusion is: if I give output_keep_prob=0.5, what does it actually mean? I know that dropout makes the model less prone to overfitting (i.e. it regularizes) by randomly turning off activations of neurons during training. I got that point, but I am confused about what happens when I set
output_keep_prob=0.5 and my no_of_nodes = 500. Does 0.5 mean it will randomly turn off 50% of the nodes at each iteration, or does it mean it will keep only those connections that have a probability of 0.5 or more, i.e.
keep the units whose probability >= 0.5
or
randomly turn off 50% of the node units at each iteration?
I tried to understand the concept from this Stack Overflow answer, but there is the same confusion there about what 0.5 actually means: should it drop 50% of the nodes at each iteration, or keep only those nodes whose probability is 0.5 or higher?
If the answer is the second one (keep only those nodes which have a probability of 0.5 or more):
then does it mean that if I have 500 node units and only 30 nodes have a probability of 0.5, it will turn off the remaining 470 nodes and use only those 30 nodes for incoming and outgoing connections?
Because this answer says:
Suppose you have 10 units in the layer and set the keep_prob to 0.1. Then the activation of 9 randomly chosen units out of 10 will be set to 0, and the remaining one will be scaled by a factor of 10. I think a more precise description is that you only keep the activation of 10 percent of the nodes.
While on the other side, this answer by @mrry says:
it means that each connection between layers (in this case between the last densely connected layer and the readout layer) will only be used with probability 0.5 when training.
Can anyone give a clear explanation of which one is correct, and what this keep_prob value actually represents?

keep_prob means the probability that any given neuron's output is preserved (as opposed to dropped, i.e. zeroed out). In other words, keep_prob = 1 - drop_prob.
The tf.nn.dropout() documentation states that
By default, each element is kept or dropped independently.
So if you think about it: if you have a large number of neurons in a layer, say 10,000, and keep_prob is, let's say, 0.3, then 3,000 is the expected value of the number of neurons kept. So it is more or less the same thing as saying that a keep_prob of 0.3 means keeping the values of 3,000 randomly chosen neurons out of the 10,000. But not exactly, because the actual number may vary a bit around 3,000.
Scaling comes into the picture because if you drop a certain number of neurons, the expected sum of the layer's outputs would be reduced. So the remaining ones are multiplied by 1/keep_prob to feed forward the same magnitude of values as they would otherwise. This is especially important if you load a pretrained network and want to continue training, but now with a different keep_prob value.
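To make the element-wise dropping and the 1/keep_prob scaling concrete, here is a minimal sketch of my own (using the TF 2.x API, where the argument is the drop rate, i.e. rate = 1 - keep_prob):

import tensorflow as tf

x = tf.ones([1, 10])                      # ten activations, all equal to 1.0
keep_prob = 0.5
y = tf.nn.dropout(x, rate=1 - keep_prob)  # in TF 1.x this was keep_prob=0.5

print(y.numpy())
# Roughly half of the entries are zeroed (the exact count varies from run to
# run), and each surviving entry is scaled by 1/keep_prob = 2.0, so the
# expected sum of the layer stays the same.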
(Please note that you can introduce non-independence into the drop probabilities with the noise_shape argument; see the tf.nn.dropout() documentation, but that is outside the scope of this question.)
The random decision to drop a neuron or not is recomputed for each invocation of the network, so you will have a different set of neurons dropped on every iteration. The idea behind dropout is that subsequent layers cannot overfit by learning to watch for arbitrary constellations of specific activations. You ruin the "secret plan of lazy neurons to overfit" by constantly changing which previous activations are available.

An intuitive explanation for why dropout works might be as follows.
Imagine that you have a team of workers and the overall goal is to learn how to erect a building. If each of the workers is overly specialized, then when one gets sick or makes a mistake the whole building will be severely affected. The solution proposed by the dropout technique is to randomly pick some of the workers every week and send them on a business trip. The hope is that the team as a whole still learns how to build the building, and thus becomes more resilient to noise or to workers being on vacation.
Because of its simplicity and effectiveness, dropout is used today in various architectures, usually immediately after fully connected layers.
Generalization in machine learning refers to how well the concepts learned by a model apply to examples that were not seen during training. The goal of most machine learning models is to generalize well from the training data in order to make good predictions on unseen data in the future. Overfitting happens when a model learns the details and noise of the training data too well but does not generalize, so its performance on test data is poor. It is a very common problem when the dataset is too small compared with the number of model parameters that need to be learned, and it is particularly acute in deep neural networks, where it is not uncommon to have millions of parameters.

Related

Do LSTMs remember previous windows or is the hidden state reset?

I am training an LSTM to forecast the next value of a time series. Let's say I have training data with shape (2345, 95) and a total of 15 files with this data. This means I have 2345 windows with 50% overlap between them (the time series was divided into windows), and each window has 95 timesteps. I use the following model:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input1 = Input(shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False,
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I am feeding this data using a generator that passes one whole file at a time, so one epoch has 15 steps. Now my question is: in a given epoch, for a given step, does the LSTM remember the previous window it saw, or is the memory of the LSTM reset after each window? If it remembers previous windows, is the memory reset only at the end of an epoch?
I have seen similar questions, like TensorFlow: Remember LSTM state for next batch (stateful LSTM) or https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm, but I either did not quite understand the explanation or was unsure whether what I wanted was explained. I am looking for a more technical explanation of where exactly in the LSTM architecture the memory/hidden state is reset.
EDIT:
From my understanding, there are two concepts we can call "memory" here: the weights that are updated through BPTT, and the hidden state of the LSTM cell. For a given window of timesteps the LSTM can remember what the previous timestep was; that is what the hidden state is for, I think. The weight update does not directly reflect memory, if I understand this correctly.
How much the LSTM remembers, in other words the reach of the hidden state, is determined by the batch size, which in this case is one whole file; but other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3 because they are in different windows, even though they belong to the same batch. So I am still unsure how exactly memory is maintained in the LSTM.
It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play?
What you are describing is called "backpropagation through time" (BPTT); you can google that term for tutorials that describe the process.
Your concern is justified in one respect and unjustified in another respect.
The LSTM is capable of learning across multiple training iterations (e.g. multiple 15-step intervals), because the LSTM state is passed forward from one iteration to the next. This feeds information forward across multiple training iterations.
Your concern is justified in that the model's weights are only updated with respect to those 15 steps (times whatever batch size you use). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize well beyond 15 steps. A good example of this is the character-level Shakespeare model described in Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks".
In summary, the model is learning to create a good hidden state for the next step, averaged over sets of 15 steps as you have defined them. It is common for an LSTM to produce a good, generalized solution by looking at data in these limited segments, akin to batch training, but sequentially over time.
I might note that about 100 is a more typical upper limit for the number of steps in an LSTM. At around 100 steps you start to see a vanishing gradient problem, in which the earliest steps contribute almost nothing to the gradient.
Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). If this step were missed, the model would certainly suffer.
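As a hedged sketch (not the asker's exact setup), this is one way to carry the LSTM state across consecutive windows in Keras, using stateful=True and resetting the state only at the end of each file/episode; data_generator is a hypothetical stand-in for the asker's generator:

import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

batch_size, timesteps, features = 1, 95, 1
inputs = Input(batch_shape=(batch_size, timesteps, features))
x = LSTM(100, stateful=True, activation="tanh")(inputs)   # keeps state across batches
outputs = Dense(1, activation="sigmoid")(x)
model = Model(inputs, outputs)
model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))

# Train file by file; within a file the state carries over from window to
# window, and it is reset explicitly at the end of each file (episode).
for windows, labels in data_generator():   # hypothetical generator
    model.fit(windows, labels, batch_size=batch_size, shuffle=False, verbose=0)
    model.reset_states()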

Keras - Hyper Tuning the initial state of the model

I've written an LSTM model that predicts sequential data.
import tensorflow as tf
from tensorflow.keras.layers import (Input, LSTM, Dense, Dropout,
                                     BatchNormalization)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# `get_deep` is a project-specific helper (defined elsewhere) that reads
# nested hyper-parameter values from `config`.
def get_model(config, num_features, output_size):
    opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'),
               beta_1=get_deep(config, 'hp.beta_1'))

    inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
    layers = LSTM(get_deep(config, 'hp.lstm_neurons'),
                  activation=get_deep(config, 'hp.lstm_activation'))(
        inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))
    layers = BatchNormalization()(layers)
    if 'dropout_rate' in config['hp']:
        layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)

    for layer in get_deep(config, 'hp.dense_layers'):
        layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
        layers = BatchNormalization()(layers)
        if 'dropout_rate' in layer:
            layers = Dropout(layer['dropout_rate'])(layers)

    layers = Dense(output_size, activation='sigmoid')(layers)
    model = Model(inputs, layers)
    model.compile(loss='mse', optimizer=opt, metrics=['mse'])
    model.summary()
    return model
I've tuned some of the layers' params using AWS SageMaker. While validating the model I ran a model with a specific configuration several times. Most of the time the results are similar; however, one run was much better than the others, which led me to think that the initial state of the model is probably crucial for getting the best performance.
As suggested in this video, weight initialization can provide some performance boost.
I've googled around and found layer weight initializers, but I'm not sure what ranges I should tune.
Update:
As suggested in some of the comments / answers, I'm using a fixed seed to "lock" the model results:

import random
import numpy as np
import tensorflow as tf

seed_value = 42  # any fixed value

# Set the `python` built-in pseudo-random generator at a fixed value
random.seed(seed_value)
# Set the `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set the `tensorflow` pseudo-random generator at a fixed value
tf.random.set_seed(seed_value)

The results replicate for each new training run; however, different seeds can produce much better results than others. So how do I find/tune the best seed?
... which led me to think that the initial state of the model is probably crucial in order to get the best performance.
... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges I should tune.
Firstly, in that video, apart from the state or weight initializer, all the other factors such as the learning rate, schedule, optimizer, batch size, loss function, model depth, etc. are things you should play with to find the best set (we will talk about the role of the seed later). Normally we don't need to tune the default weight or state initializer, as the defaults are already among the best choices; as usual, state initialization is a research problem in its own right.
Secondly, in keras, the default weight initializer for Convolution, Dense and RNN-GRU/LSTM layers is glorot_uniform, also known as the Xavier uniform initializer, and the default bias initializer is zeros. You can verify this by following the source code of LSTM (in your case). According to the doc, this initializer
Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
Now, you may have already noticed that this initializer inherits from VarianceScaling; and like GlorotUniform, others such as GlorotNormal, LecunNormal, LecunUniform, HeNormal and HeUniform also inherit from it. The supported parameters of VarianceScaling are listed in the documentation. For example, technically, the following two are the same.
# In case you want to try various initializers,
# use VarianceScaling by passing the proper parameters,
# i.e. tf.keras.layers.LSTM(..., kernel_initializer=initializer),
# but it is recommended to stick with glorot_uniform (the default).
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
                                                    mode='fan_avg', seed=101,
                                                    distribution='uniform')
print(initializer(shape=(2, 2)))

initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))

tf.Tensor(
[[-1.0027379  1.0746485]
 [-1.2234    -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379  1.0746485]
 [-1.2234    -1.1489409]], shape=(2, 2), dtype=float32)
In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:
import numpy as np
import tensorflow as tf

def conv_kernel_initializer(shape, dtype=None):
    kernel_height, kernel_width, _, out_filters = shape
    fan_out = int(kernel_height * kernel_width * out_filters)
    return tf.random.normal(
        shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)

def dense_kernel_initializer(shape, dtype=None):
    init_range = 1.0 / np.sqrt(shape[1])
    return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
Here is one good article about initializing the weights that you may enjoy reading. But again, it is better to go with the defaults.
Thirdly, for setting different seed values, different sets of hyper-parameters, etc., I'll leave one of my old answers here; the first diagram in it will probably come in handy for your experiments. One approach I follow is to keep the seed the same (say, for the first 5 experiments), change one other factor at a time, and log the results. After 5 iterations we hopefully end up with a good set and can take it further from there.
Update
Find/Tune Seed. Before searching for a method to find the best seed, one must understand that the seed is not a hyper-parameter to be tuned alongside other hyper-parameters such as the learning rate, scheduler, optimizer, etc.
Here is one scenario: let's say you split the data randomly into two parts with seed 42, a train set (70%) and a test set (30%); after training on the train set, you evaluate the model on the test set and get a score of 80. Then you change the seed to 101, do the same again, and now get a score of 50. This doesn't mean that picking seed 42 is better; it simply means your model is unstable and will most likely not do well on unseen data. This is actually a well-known issue when someone randomly splits their data set for training and testing. Why does it happen? Because when you split the data randomly, there can be a mismatch in the class distribution (see the sketch after the links below). Please check the following two closely related discussions on this:
Is random seed a hyper-parameter to tune in training deep neural network?
How to choose the random seed?
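As a small illustration of the class-distribution point above (my own sketch, not from the linked discussions), a stratified split keeps the class proportions identical in the train and test sets, which removes one source of seed-to-seed variance:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)               # placeholder features
y = np.random.randint(0, 2, size=1000)     # placeholder binary labels

# Plain random split: the class balance of the test set depends on the seed.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Stratified split: class proportions are preserved regardless of the seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)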
I don't think there is a "one size fits all" solution to this issue. The initial weights heavily depend on the kind of problem at hand and the data we are using to solve that problem. All we can do is point you towards a good resource where you can see which of the approaches mentioned fits your problem.
The following article is a good resource that not only gives you a detailed understanding of how and why to initialize weights, but also points towards peer-reviewed research that can help build an academic understanding.
Maybe you should look into an exponentially decaying learning rate. Let me explain.
For example, your first epoch sometimes has a loss of 3000 or 4000, and sometimes just 500.
If you run a model often enough, you probably recognize a "real barrier", beyond which you no longer say "that's just because of the initial state".
You want to get there fast, but you don't want to keep the bad side effects of a high learning rate (e.g. 1e-3); eventually you want something closer to 1e-5. That is where exponential decay comes in.
Create an instance myLr = tf.train.exponential_decay(...) and pass it, instead of a numerical learning rate, to your optimizer,
for example Adam(myLr).
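For reference, here is a minimal sketch of the same idea with the TF 2.x equivalent, tf.keras.optimizers.schedules.ExponentialDecay (tf.train.exponential_decay is the TF 1.x API the answer mentions); the exact numbers are just placeholders:

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,   # fast progress early on
    decay_steps=10_000,           # how often the rate is multiplied by decay_rate
    decay_rate=0.1)               # roughly 1e-5 after two decay periods

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)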
Indeed, the initial state of the model is crucial for getting the best performance. Deep learning works by optimizing a non-convex loss function in order to find a good local minimum.
The initial weights define the starting location of the optimization (the original answer illustrates this with a loss-surface figure). The starting point is defined by the initial weights, and training the model makes it descend towards a local minimum; some starting weight configurations allow reaching the global minimum.
It is sometimes possible to get a better weight initialization with transfer learning, which reuses the weights of a model trained on a related task (for example VGG-16 in image classification, or BERT in NLP).
In your case, you should not try to fine-tune the weight initialization, as it is meant to be random. Changing the architecture of your neural network or its hyper-parameters is more likely to lead to a real performance improvement.
Short answer: you can neither efficiently nor effectively tune the seed of a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time, so you would need to start over every time that happened.
If, for some reason, you are hell-bent on discovering this for yourself, I recommend using NumPy's default_rng object as the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.
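A minimal sketch of that suggestion (my own, under the assumption that one master seed drives everything):

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(12345)            # the single master seed
tf_seed = int(rng.integers(0, 2**31 - 1))     # a derived, deterministic seed
tf.random.set_seed(tf_seed)
# Any further seeds (e.g. for data splits) can be drawn from `rng` as well.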

How do neural network models learn different weights for each of the neuron in a single layer?

I've had an overview of how neural networks work and have come up with some interconnected questions, to which I am not able to find an answer.
Consider a one-hidden-layer feedforward neural network: if the function for each of the hidden-layer neurons is the same,
a1 = relu(w1*x1 + w2*x2), a2 = relu(w3*x1 + w4*x2), ...
how do we make the model learn different values of the weights?
I do understand the point of manually established connections between neurons (the question originally linked a figure of such connections): that way we define the possible functions of functions (e.g., house size and number of bedrooms taken together might represent the possible family size that the house would accommodate). But the fully connected network doesn't make sense to me.
I get that a fully connected neural network should somehow automatically determine which functions of functions make sense, but how does it do that?
Not being able to answer this question, I also don't understand why increasing the number of neurons should increase the accuracy of the model's predictions.
How do we make the model learn different values of weights?
By randomly initializing the parameters before training starts. In a fully connected neural network, if all parameters started with the same value, we would get the same update step on every parameter (that is where your confusion is coming from). Initialization, either purely random or more sophisticated (e.g. Glorot), breaks this symmetry.
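A small sketch of that symmetry problem (my own illustration, not from the original answer): if two hidden units start with identical weights, they receive identical gradients and can never become different.

import tensorflow as tf

x = tf.constant([[1.0, 2.0]])
y = tf.constant([[1.0]])

# Two hidden units initialized to exactly the same weights.
w_hidden = tf.Variable(tf.fill([2, 2], 0.5))
w_out = tf.Variable(tf.fill([2, 1], 0.5))

with tf.GradientTape() as tape:
    h = tf.nn.relu(x @ w_hidden)            # both hidden units output the same value
    y_hat = h @ w_out
    loss = tf.reduce_mean((y_hat - y) ** 2)

grad_hidden = tape.gradient(loss, w_hidden)
print(grad_hidden.numpy())   # the two columns (units) get identical gradients,
                             # so gradient descent keeps them identical forever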
Why should increasing the number of neurons increase the accuracy of the model prediction?
This is only partially true: increasing the number of neurons should improve your training accuracy (it is a different game for your validation and test performance). By adding units your model is able to store additional information or incorporate outliers, and hence improve the accuracy of its predictions. Think of a 2D problem (predicting house prices per sqm over the sqm of some property): with two parameters you can fit a line, with three a curve, and so on; the more parameters, the more complex your curve can get and the closer it can pass to each of your training points.
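A toy illustration of the line-versus-curve point (my own example with made-up numbers):

import numpy as np

sqm = np.array([30.0, 50.0, 70.0, 90.0, 120.0])
price_per_sqm = np.array([3.1, 2.8, 2.6, 2.5, 2.7])   # made-up values

line = np.polyfit(sqm, price_per_sqm, deg=1)    # 2 parameters: a straight line
curve = np.polyfit(sqm, price_per_sqm, deg=3)   # 4 parameters: a more flexible curve

# Residuals shrink as the number of parameters grows: the curve passes
# closer to every training point than the line does.
print(np.abs(np.polyval(line, sqm) - price_per_sqm).sum())
print(np.abs(np.polyval(curve, sqm) - price_per_sqm).sum())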
Great next step for a deep dive - Karpathy's lecture on Computer Vision at Stanford.

Keras - Using large numbers of features

I'm developing a Keras NN that predicts a label using 20,000 features. I can build the network, but I have to use system RAM since the model is too large to fit in my GPU, which means it has taken days to run the model on my machine. The input is currently (500, 20000, 1) and the output is (500, 1, 1).
- I'm using 5,000 nodes in the first fully connected (Dense) layer. Is this sufficient for the number of features?
- Is there a way of reducing the dimensionality so as to run it on my GPU?
I suppose each input entry has size (20000, 1) and you have 500 entries which make up your dataset?
In that case you can start by reducing the batch_size, but I suppose you also mean that even the network weights don't fit in your GPU memory. In that case the only thing (that I know of) you can do is dimensionality reduction.
You have 20,000 features, but it is highly unlikely that all of them are important for the output value. With PCA (Principal Component Analysis) you can check the importance of all your features, and you will probably see that only a small subset of them combined accounts for 90% or more of the relevant variance. In that case you can disregard the unimportant features and build a network that predicts the output based on, say, only 1,000 (or even fewer) features.
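A hedged sketch of that PCA suggestion using scikit-learn (placeholder data; n_components given as a variance fraction):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 20000)                   # placeholder for the real features

# Keep as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.90, svd_solver='full')
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                            # (500, k) with k far below 20000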
An important note: the only reason I can think of where you would really need that many features is if you are dealing with an image, a spectrum (you can see a spectrum as a 1D image), and so on. In that case I recommend looking into convolutional neural networks. They are not fully connected, which removes a lot of trainable parameters while probably performing even better.

Deep Q learning, LSTM and Q-values convergence

I am implementing a reinforcement learning agent that takes actions given a time series of prices. The actions are, classically, buy, sell or wait. The neural network gets one batch at a time as input; the window size is 96 steps and I have around 80 features, so the input is something like 1x96x80. The algorithm is online and, every 96 new observations, takes a random sample from a replay memory that stores the last 480 observations (s, a, r, s'). I give a reward signal for each action at each timestep, where the reward for buy is +1, the reward for sell is -1, and so on, so I do not have to bother about exploration. I am using the standard way of calculating the loss (as in DeepMind's original DQN paper), with two networks: one for estimating the Q values and another that acts as a target and gets a soft update every step.
The agent picks at each step the action with the highest estimated Q value, as shown in the graphs below. My issue is with how the Q values behave depending on the architecture of the model. In the standard case I have two dense 'elu' layers, two LSTM layers and a final dense 'linear' layer with 3 units. With this configuration the Q values fluctuate too heavily: I get a new maximum at almost every step, and the greedy policy picks different actions too frequently, incurring high transaction costs that destroy the performance (fig 1). In the other case (fig 2) I simply add another dense linear layer of 3 units before the last one; in this case the Q values evolve much more slowly, which increases performance since it does not incur high costs, but the tradeoff is a much less efficient learner that is slow to adapt to new conditions and keeps picking suboptimal actions for a longer time, thus harming performance (though still way better than before). For completeness, I tried the LSTM both returning the whole sequence (and updating the gradient on that) and returning only the last step. There was no real difference between the two.
Fig 1: without a second linear layer.
Fig 2: with a double linear layer. The blue line is sell (and keep the short position), the orange is wait (or close the position), and the green is buy (or keep the long position). A position is therefore kept for longer periods when one curve is consistently higher than the others.
Ideally, I would like to find a way to tune between these two extremes; however, the second behavior appears only when I add a second dense layer of 3 units before the last one (it can be linear again, or tanh), so I am not able to get all the possibilities in between. Tuning other parameters, like the discount factor, the learning rate or the bias of the LSTM, does not really help (increasing the LSTM bias to 5 does help, but only in the first iterations; then it goes back to the same behaviour). Also, using GRU instead of LSTM does not significantly change the dynamics of the Q values.
I am at a dead end; any suggestions? I also do not understand why adding a simple final linear layer slows down the estimation of the Q values so much.
EDIT: After enough iterations, even the second case (2 linear layers) slowly converges to the case where the Q values are way too volatile. So the desired behaviour only lasts a few tens of thousands of steps.
