I am training an LSTM to forecast the next value of a time series. Say my training data has shape (2345, 95), and I have a total of 15 files with this data: the time series was divided into 2345 windows with 50% overlap between them, and each window has 95 timesteps. If I use the following model:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input1 = Input(shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False,
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I am feeding this data using a generator that passes one whole file at a time, so one epoch has 15 steps. Now my question is: in a given epoch, at a given step, does the LSTM remember the previous window it saw, or is the memory of the LSTM reset after seeing each window? And if it remembers previous windows, is the memory reset only at the end of an epoch?
I have seen similar questions, like TensorFlow: Remember LSTM state for next batch (stateful LSTM) or https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm, but I either did not quite understand the explanation or was unsure whether it covered what I wanted. I'm looking for a more technical explanation of where in the LSTM architecture the memory/hidden state is reset.
EDIT:
So from my understanding there are two concepts we can call "memory" here: the weights, which are updated through BPTT, and the hidden state of the LSTM cell. For a given window of timesteps the LSTM can remember what the previous timestep was; this, I think, is what the hidden state is for. The weight updates, if I'm understanding this correctly, do not directly reflect memory.
The size of the hidden state, in other words how much the LSTM remembers, is determined by the batch size, which in this case is one whole file. But other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3, because they are in different windows, even though they belong to the same batch. So I'm still unsure how exactly memory is maintained in the LSTM.
It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play?
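A minimal way to check the default behaviour is a small experiment (a sketch with toy windows; the layer is untrained, which doesn't matter for checking how state is handled):

import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(4)
w1 = np.array([[[1.], [2.], [3.]]], dtype="float32")   # window [1,2,3], shape (1, 3, 1)
w2 = np.array([[[4.], [5.], [6.]]], dtype="float32")   # window [4,5,6]
batch = np.concatenate([w1, w2], axis=0)               # both windows in one batch, shape (2, 3, 1)

out_batch = lstm(batch)
out_separate = np.vstack([lstm(w1), lstm(w2)])
print(np.allclose(out_batch, out_separate))            # True for the default, stateless layer

With the default (stateless) layer this prints True: batching the windows together gives the same outputs as feeding them one by one, so no state leaks from [1,2,3] into [4,5,6].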
What you are describing is called "backpropagation through time" (BPTT); you can search for tutorials that describe the process.
Your concern is justified in one respect and unjustified in another respect.
The LSTM is capable of learning across multiple training iterations (e.g. multiple 15-step intervals), because the LSTM state is passed forward from one iteration to the next. This feeds information forward across multiple training iterations.
Your concern is justified in that the model's weights are only updated with respect to those 15 steps (and over whatever batch size you use). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize well beyond 15 steps. A good example of this is the Shakespeare character model described in Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks".
In summary, the model is learning to create a good hidden state for the next step, averaged over sets of 15 steps as you have defined them. It is common for an LSTM to produce a good, generalized solution by looking at data in these limited segments, akin to batch training, but sequential over time.
I might note that 100 is a more typical upper limit for the number of steps in an LSTM. At ~100 steps you start to see a vanishing gradient problem in which the earlier steps contribute nearly nothing to the gradient.
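As a toy illustration (my own numbers, not part of the original answer) of how quickly those early-step contributions shrink:

# If each step of backpropagation through time scales the gradient by roughly 0.9,
# a step ~100 positions back contributes almost nothing.
factor = 0.9
for steps in (10, 50, 100):
    print(steps, factor ** steps)   # 10 -> ~0.35, 50 -> ~0.005, 100 -> ~2.7e-05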
Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). If this step were missed, the model would certainly suffer.
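In Keras, this forward-passing of state is what the stateful flag controls. A minimal sketch, assuming each file is one contiguous episode and using a hypothetical load_files() generator (my illustration, not code from the question):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

def load_files():
    # stand-in for the question's generator: yields (windows, targets) for each of the 15 files
    for _ in range(15):
        yield np.random.rand(2345, 95, 1), np.random.rand(2345)

inp = Input(batch_shape=(1, 95, 1))                    # stateful layers need a fixed batch size
x = LSTM(100, stateful=True, activation="tanh")(inp)  # state is kept between consecutive batches
out = Dense(1)(x)
model = Model(inp, out)
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(1e-3))

for epoch in range(5):
    for windows, targets in load_files():
        for window, target in zip(windows, targets):   # window: (95, 1), target: scalar
            model.train_on_batch(window[np.newaxis], np.array([[target]]))
        model.reset_states()                           # forget the state at each file boundary

Without stateful=True (the Keras default), the state is re-initialized to zeros for every window, which is exactly the reset the question asks about.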
Related
I'm trying to use LSTM to predict information on timestep sequences.
My data looks like this: I have a few different samples of relatively long sequences (>100,000 timesteps) and I'm trying to solve an N-class classification problem where each sample is labeled with a different ID. Now I'm trying to understand how to properly prepare my data so the LSTM will train on each sample individually.
In the most basic case, I just feed each sample completely to the network:
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Embedding(FEATURES_NUMBER, 30))
model.add(layers.LSTM(32, return_sequences=True))
model.compile(optimizer="adam",
              loss='categorical_crossentropy', metrics=['acc'])
history = model.fit(train_data,
                    train_labels,
                    epochs=10,
                    batch_size=128,
                    validation_data=(validation_data, validation_labels))
Where train_data is of shape: (4, 100000, 1).
But I'm being told by many blog posts (like here) that training an LSTM on very long sequences might harm the training. So I don't understand how to properly split the data in correspondence with the LSTM's internal state.
I can split each 100,000-long sequence into 500-long sub-sequences, and then my data will be of shape (800, 500, 1). But can I tell the LSTM to still make sense of the larger sequences (keep the internal state between sub-sequences of the same larger sequence and re-initialize it when switching to a new sequence)?
I'd be happy if someone could shed some light over that process!
The problem has been here for a while; not sure if you are still interested in my humble two cents. Here is the thing about the LSTM: as elegantly designed as it is, it still suffers from vanishing and/or exploding gradients. Adding a forget gate for the cell state alleviates the problem, because when a forget gate outputs 0, the backpropagation process stops "flowing" backwards at that gate. However, this does not mean the LSTM is free from exploding/vanishing gradients when you feed an infinitely long sequence into it. Just imagine a case where the forget gate outputs 1 fifty times in a row: the gradients for the last output under BPTT would then look very similar to those of the original RNN, because the gates are effectively "turned off".
Why would that happen? Because your initial set of parameters is likely not optimal, and also because a long sequence requires BPTT through more time steps, which makes the chances of that happening higher. You can try training your LSTM on your segmented data subsets.
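A hedged Keras sketch of one way to do that (not from the original answer): split each long sequence into sub-sequences, train with stateful=True so the state carries over between consecutive chunks, and call reset_states() when switching to the next original sequence. The shapes, the Dense head, and names such as N_CLASSES are illustrative assumptions:

import numpy as np
from tensorflow.keras import layers, models

FEATURES_NUMBER, N_CLASSES, SUB_LEN = 1000, 4, 500                   # placeholders
train_data = np.random.randint(0, FEATURES_NUMBER, size=(4, 100000))
train_labels = np.eye(N_CLASSES)                                     # one one-hot label per sequence

model = models.Sequential([
    layers.Embedding(FEATURES_NUMBER, 30, batch_input_shape=(1, SUB_LEN)),
    layers.LSTM(32, stateful=True),              # state carries over between consecutive batches
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"])

for epoch in range(10):
    for seq, label in zip(train_data, train_labels):
        for start in range(0, len(seq), SUB_LEN):
            chunk = seq[start:start + SUB_LEN][np.newaxis]           # shape (1, SUB_LEN)
            model.train_on_batch(chunk, label[np.newaxis])
        model.reset_states()                     # re-initialize when switching to a new sequence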
I am trying to understand the concept of output_keep_prob. So, say my example is a simple RNN:
import tensorflow as tf
from tensorflow.contrib import rnn

with tf.variable_scope('encoder') as scope:
    cells = rnn.LSTMCell(num_units=500)
    cell = rnn.DropoutWrapper(cell=cells, output_keep_prob=0.5)
    model = tf.nn.bidirectional_dynamic_rnn(cell, cell, inputs=embedding_lookup,
                                            sequence_length=sequence_le,
                                            dtype=tf.float32)
My confusion is: if I set output_keep_prob=0.5, what does it actually mean? I know that dropout makes the model less prone to overfitting (this is called regularization) by randomly turning off activations of neurons during training; I got that point. But I am confused about what happens when I set output_keep_prob=0.5 with no_of_nodes = 500. Does 0.5 mean it will:
keep only those nodes whose probability is >= 0.5,
or
randomly turn off 50% of the node units at each iteration??
I tried to understand the concept from this Stack Overflow answer, but there is the same confusion there: what does 0.5 actually mean? Should it drop 50% of the nodes at each iteration, or keep only those nodes which have a probability of 0.5 or more?
If the answer is the second one (keep only those nodes which have a probability of 0.5 or more), then suppose I have 500 node units and only 30 nodes have a probability of 0.5; will it turn off the remaining 470 nodes and use only those 30 nodes for incoming and outgoing connections?
Because this answer says:
Suppose you have 10 units in the layer and set the keep_prob to 0.1,
Then the activation of 9 randomly chosen units out of 10 will be set
to 0, and the remaining one will be scaled by a factor of 10. I think
a more precise description is that you only keep the activation of 10
percent of the nodes.
While on the other side, this answer by @mrry says:
it means that each connection between layers (in this case between the
last densely connected layer and the readout layer) will only be used
with probability 0.5 when training.
Can anyone give a clear explanation of which one is correct, and what this value in keep_prob actually represents?
keep_prob means the probability that any given neuron's output is preserved (as opposed to dropped, that is, zeroed out). In other words, keep_prob = 1 - drop_prob.
The tf.nn.dropout() description states that
By default, each element is kept or dropped independently.
So if you think about it: if you have a large number of neurons in a layer, say 10,000, and keep_prob is, let's say, 0.3, then 3,000 is the expected value of the number of neurons kept. So it's more or less the same thing to say that a keep_prob of 0.3 means keeping the values of 3,000 randomly chosen neurons out of the 10,000. But not exactly, because the actual number might vary a bit from 3,000.
Scaling comes into the picture because if you drop a certain number of neurons, then the expected sum of the layer will be reduced. So the remaining ones are multiplied to feed forward the same magnitude of values as they would otherwise. This is especially important if you load a pretrained network and want to continue training but with a different keep_prob value now.
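A minimal NumPy sketch (my own illustration of the mechanism, not TensorFlow's actual implementation) of this "keep with probability keep_prob, then scale the survivors" behaviour:

import numpy as np

def dropout(activations, keep_prob):
    mask = np.random.rand(*activations.shape) < keep_prob   # each unit kept independently
    return np.where(mask, activations / keep_prob, 0.0)     # scale survivors, zero the rest

layer = np.ones(10_000)
out = dropout(layer, keep_prob=0.3)
print((out != 0).sum())   # roughly 3,000 kept, but the exact count varies per run
print(out.sum())          # roughly 10,000 in expectation, thanks to the 1/keep_prob scaling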
(Please note that you can introduce non-independence into the drop probabilities with the noise_shape argument; please see the tf.nn.dropout() description, but that is outside the scope of this question.)
The random decision to drop a neuron or not is recalculated for each invocation of the network, so you will have a different set of neurons dropped on every iteration. The idea behind dropout is that subsequent layers cannot overfit and learn to watch for arbitrary constellations of certain activations. You ruin the "secret plan of lazy neurons to overfit" by always changing which previous activations are available.
An intuitive explanation for why dropout works might be as follows.
Imagine that you have a team of workers, and the overall goal is to learn how to erect a building. When each worker is overly specialized, if one gets sick or makes a mistake, the whole building will be severely affected. The solution proposed by the "dropout" technique is to randomly pick some of the workers every week and send them on a business trip. The hope is that the team overall still learns how to build the building, and thus becomes more resilient to noise or to workers being on vacation.
Because of its simplicity and effectiveness, dropout is used today in various architectures, usually immediately after fully connected layers.
Generalization in machine learning refers to how well the concepts learned by the model apply to examples not seen during training. The goal of most machine learning models is to generalize well from the training data in order to make good predictions on unseen data in the future. Overfitting happens when the model learns the details and the noise of the training data too well but does not generalize, so performance on test data is poor. It is a very common problem when the dataset is too small compared to the number of model parameters that need to be learned, and it is particularly acute in deep neural networks, where it is not uncommon to have millions of parameters.
I am using the LSTM cell in Tensorflow.
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
I was wondering how the weights and states are initialized, or rather, what the default initializer is for LSTM cells (states and weights) in TensorFlow.
And is there an easy way to manually set an Initializer?
Note: For tf.get_variable() the glorot_uniform_initializer is used as far as I could find out from the documentation.
First of all, there is a difference between the weights of an LSTM (the usual parameter set of an ANN) and its state. The weights are by default initialized with the Glorot initializer, also known as the Xavier initializer (as mentioned in the question).
A different aspect is the cell state and the state of the initial recurrent input to the LSTM. Those are initialized by a matrix usually denoted as initial_state.
Leaving us with the question, how to initialize this initial_state:
Zero State Initialization is good practice if the impact of initialization is low
The default approach to initializing the state of an RNN is to use a zero state. This often works well, particularly for sequence-to-sequence tasks like language modeling where the proportion of outputs that are significantly impacted by the initial state is small.
Zero State Initialization in each batch can lead to overfitting
Zero Initialization for each batch will lead to the following: Losses at the early steps of a sequence-to-sequence model (i.e., those immediately after a state reset) will be larger than those at later steps, because there is less history. Thus, their contribution to the gradient during learning will be relatively higher. But if all state resets are associated with a zero-state, the model can (and will) learn how to compensate for precisely this. As the ratio of state resets to total observations increases, the model parameters will become increasingly tuned to this zero state, which may affect performance on later time steps.
Do we have other options?
One simple solution is to make the initial state noisy (to decrease the loss for the first time step). Look here for details and other ideas.
I don't think you can initialize an individual cell, but when you execute the LSTM with tf.nn.static_rnn or tf.nn.dynamic_rnn, you can set the initial_state argument to a tensor containing the LSTM's initial values.
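For example, a minimal TF 1.x-style sketch (the shapes and the trainable initial state are illustrative assumptions, not the question's code) of passing a custom initial_state to tf.nn.dynamic_rnn, and of choosing a weight initializer through LSTMCell's initializer argument:

import tensorflow as tf  # TF 1.x APIs

batch_size, num_steps, num_features, lstm_units = 32, 95, 1, 100
inputs = tf.placeholder(tf.float32, [batch_size, num_steps, num_features])

# The weight initializer can be set on LSTMCell (BasicLSTMCell exposes no such argument).
cell = tf.nn.rnn_cell.LSTMCell(lstm_units, initializer=tf.orthogonal_initializer())

# A custom (here: trainable, zero-initialized) initial state instead of the default zero state.
init_c = tf.get_variable("init_c", [1, lstm_units], initializer=tf.zeros_initializer())
init_h = tf.get_variable("init_h", [1, lstm_units], initializer=tf.zeros_initializer())
initial_state = tf.nn.rnn_cell.LSTMStateTuple(
    tf.tile(init_c, [batch_size, 1]),
    tf.tile(init_h, [batch_size, 1]))

outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, initial_state=initial_state, dtype=tf.float32)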
This is a follow up to the following question: Confused about how to implement time-distributed LSTM + LSTM
The current draft structure that is working well:
The basic idea is that there is a TimeDistributed deep LSTM input layer that works on each epoch of raw time series data and outputs a vector of features for each epoch. Then, the "outer" deep LSTM layer takes 7 of those sequential outputs and tries to classify the center epoch (the assumption being that 1 epoch does not have enough information to be classified by itself and needs the surrounding epochs). I say this is a draft because I haven't yet explored the feature space required for this to work well on many subjects.
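For concreteness, a minimal sketch of that structure (layer sizes, epoch length, and class count are hypothetical placeholders):

from tensorflow.keras.layers import Input, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model

epochs_per_example, steps_per_epoch, channels, n_classes = 7, 3000, 1, 5

inputs = Input(shape=(epochs_per_example, steps_per_epoch, channels))
# Inner LSTM applied to each epoch independently, producing one feature vector per epoch.
epoch_features = TimeDistributed(LSTM(64))(inputs)      # -> (batch, 7, 64)
# Outer LSTM reads the 7 feature vectors and classifies the center epoch.
summary = LSTM(32)(epoch_features)                      # -> (batch, 32)
outputs = Dense(n_classes, activation="softmax")(summary)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"])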
There are several issues that still need to be resolved, but the one I haven't found any clear-cut examples of online is how to train this model in two parts: 1) the TimeDistributed layer and 2) the "outer" layer. The reason is that as I increase the number of epochs needed to classify (currently 7, but I expect it may get up to 21 or higher), more and more duplicated data is loaded, and the training speed decreases quickly.
One may propose an autoencoder for the first layer. However, I don't think this is the best solution, because the features necessary to reproduce the input might very well be different from the features necessary to classify the center epoch in combination with the surrounding epochs. To expand: this is probable because the time series is semi-periodic, with most of each epoch providing little information other than the current period from one important feature to the next (and the number and location of these important features varies in each epoch).
I've got a model in Keras that I need to train, but this model invariably blows up my little 8GB memory and freezes my computer.
I've come to the limit of training just one single sample (batch size = 1) and still it blows up.
Please assume my model has no mistakes or bugs and this question is not about "what is wrong with my model". (Yes, smaller models work ok with the same data, but aren't good enough for the task).
How can I split my model in two and train each part separately, but propagating the gradients between them?
Is this possible? (There is no restriction on using Theano or TensorFlow.)
Using CPU only, no GPU.
You can do this, but it will cause your training time to grow to lengths where the results will only be useful for future generations.
Let's consider everything we have in memory when we train with a batch size of 1 (assuming you've only read that one sample into memory):
1) that sample
2) the weights of your model
3) the activations of each layer  # your model stores these for backpropagation
None of this is optional; it all has to be there for training. However, you could, theoretically, do a forward pass on the first half of the model, dump its weights and activations to disk, load the second half of the model, do a forward pass and then the backward pass on that, dump those weights and activations to disk, load back the weights and activations of the first half, and complete the backward pass on it. This process could be split up even further, to the point of doing one layer at a time.
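A hedged sketch of the gradient hand-off between the two halves (using TF 2.x tf.GradientTape; the dump-to-disk and reload steps are omitted, and model_a / model_b stand for the two hypothetical sub-models):

import tensorflow as tf

def train_step(model_a, model_b, optimizer, x, y, loss_fn):
    # Stage 1: forward through the first half only; keep the intermediate activation.
    with tf.GradientTape() as tape_a:
        h = model_a(x, training=True)
    # Stage 2: treat the intermediate activation as a fresh input to the second half.
    h_cut = tf.identity(h)
    with tf.GradientTape() as tape_b:
        tape_b.watch(h_cut)
        y_pred = model_b(h_cut, training=True)
        loss = loss_fn(y, y_pred)
    # Backward through the second half, also getting the gradient w.r.t. the cut point.
    grads_b, grad_h = tape_b.gradient(loss, [model_b.trainable_variables, h_cut])
    # Backward through the first half, chaining grad_h in as the upstream gradient.
    grads_a = tape_a.gradient(h, model_a.trainable_variables, output_gradients=grad_h)
    optimizer.apply_gradients(zip(grads_b, model_b.trainable_variables))
    optimizer.apply_gradients(zip(grads_a, model_a.trainable_variables))
    return loss

By itself this does not save memory (both tapes are alive at once); the savings only appear once you add the save/load steps described above, so that only one half's activations live in RAM at a time.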
OTOH, this is akin to what swap space does, without you having to think about it. If you want a slightly less optimized version of this (optimization is clearly moot at this point anyway), you can just increase your swap space to 500GB and call it a day.