I've got a model in Keras that I need to train, but this model invariably blows up my little 8GB memory and freezes my computer.
I've come to the limit of training just one single sample (batch size = 1) and still it blows up.
Please assume my model has no mistakes or bugs and this question is not about "what is wrong with my model". (Yes, smaller models work ok with the same data, but aren't good enough for the task).
How can I split my model in two and train each part separately, but propagating the gradients between them?
Is there a possibility? (There is no limitation about using theano or tensorflow)
Using CPU only, no GPU.
You can do this thing, but it will cause your training time to approach sizes that will only make the results useful for future generations.
Let's consider what all we have in our memory when we train with a batch size of 1 (assuming you've only read in that one sample into memory):
1) that sample
2) the weights of your model
3) the activations of each layer #your model stores these for backpropogation
None of this stuff is unnecessary for training. However, you could, theoretically, do a forward pass on the first half of the model, dump the weights and activations to disk, load the second half of the model, do a forward pass on that, then the backward pass on that, dump those weights and activations to disk, load back the weights and activations of the first half, then complete the backward pass on that. This process could be split up even more to the point of doing one layer at a time.
OTOH, this is akin to what swap space does, without you having to think about it. If you want a slightly less optimized version of this (which, optimization is clearly moot at this point), you can just increase your swap space to 500GB and call it a day.
Related
I am training an LSTM to forecast the next value of a timeseries. Let's say I have training data with the given shape (2345, 95) and a total of 15 files with this data, this means that I have 2345 window with 50% overlap between them (the timeseries was divided into windows). Each window has 95 timesteps. If I use the following model:
input1 = Input(shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False,
activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I am feeding this data using a generator where it passes a whole file each time, therefore one epoch will have 15 steps. Now my question is, in a given epoch, for a given step, does the LSTM remember the previous window that it saw or is the memory of the LSTM reset after seeing each window? If it remembers the previous windows, then is the memory reset only at the end of an epoch?
I have seen similar questions like this TensorFlow: Remember LSTM state for next batch (stateful LSTM) or https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm but I either did not quite understand the explanation or I was unsure if what I wanted was explained. I'm looking for more of a technical explanation as to where in the LSTM architecture is the whole memory/hidden state reset.
EDIT:
So from my understanding there are two concepts we can call "memory"
here. The weights that are updated through BPTT and the hidden state
of the LSTM cell. For a given window of timesteps the LSTM can
remember what the previous timestep was, this is what the hidden
state is for I think. Now the weight update does not directly
reflect memory if I'm understanding this correctly.
The size of the hidden state, in other words how much the LSTM
remembers is determined by the batch size, which in this case is one
whole file, but other question/answers
(https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we
have to windows for instance: [1,2,3] and [4,5,6] the LSTM does not
know that 4 comes after 3 because they are in different windows,
even though they belong to the same batch. So I'm still unsure how
exactly memory is maintained in the LSTM
It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play?
What you are describing is called "Back Propagation Through Time", you can google that for tutorials that describe the process.
Your concern is justified in one respect and unjustified in another respect.
The LSTM is capable of learning across multiple training iterations (e.g. multiple 15 step intervals). This is because the LSTM state is being passed forward from one iteration (e.g. multiple 15 step intervals) to the next iteration. This is feeding information forward across multiple training iterations.
Your concern is justified in that the model's weights are only updated with respect to the 15 steps (plus any batch size you have). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize well beyond 15 steps. A good example of this is the Shakespeare character recognition model described in Karpathy's, "The unreasonable effectiveness of RNNs".
In summary, the model is learning to create a good hidden state for the next step averaged over sets of 15 steps as you have defined. It is common that an LSTM will produce a good generalized solution by looking at data in these limited segments. Akin to batch training, but sequentially over time.
I might note that 100 is a more typical upper limit for the number of steps in an LSTM. At ~100 steps you start to see a vanishing gradient problem in which the earlier steps contribute nearly nothing to the gradient.
Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). If this step was missed the model would certainly suffer.
I'm trying to build a convolutional based model. I trained two different structures as following. As you can see for single layer there isn't any obvious change along number of epochs. Bi-layer Conv2D presents improving in accuracy and losses for train dataset, but validation characteristics are going to be a tragedy.
According to the fact that I can't increase my data-set what should I do to improve validation characteristics?
I've examined regularizer L1 & L2 but they didn't affect my model.
1) You can use adaptive learning rate (exponential decay or step dependent may work for you) Furthermore, you can try extreme high learning rates when your model goes into local minimum.
2) If you are training with images, you can flip, rotate or other stuff to increase your dataset size and maybe some other augmentation techniques might work for your case.
3) Try to change the model like deeper, shallower, wider, narrower.
4) If you are doing a classification model, please ensure that you are not using sigmoid as your activation function in the end unless you are doing binary classification.
5) Always check your dataset's situation before training session.
Your train-test split may not be suitable for your case.
There might be extreme noises in your data.
Some amount of your data might be corrupted.
Note: I will update them whenever a new idea comes to my mind. Furthermore, I didn't want to repeat the comments and other answers, both of them are having valuable information for your case.
The validation becomes a tragedy because model is overfitting on the training data you can try if any of this works,
1)Batch normalisation would be a good option to go with.
2)Try reducing batch size.
I tried a variety of models known to work well on small datasets, but as I suspected, and as is my ultimate verdict - it is a lost cause.
You don't have nearly enough data to train a good DL model, or even an ML model like SVM - as matter's exacerbated by having eight separate classes; your dataset would stand some chance with an SVM for binary classification, but none for 8-class. As a last resort, you can try XGBoost, but I wouldn't bet on it.
What can you do? Get more data. There's no way around it. I don't have an exact number, but for 8-class classification, I'd say you need anywhere from 50-200x your current data to get reasonable results. Mind also that your validation performance is bound to be much worse on a bigger validation set, accounted for in this number.
For readers, OP shared his dataset with me; shapes are: X = (1152, 1024, 1), y = (1152, 8)
I'm developing a Keras NN that predicts the label using 20,000 features. I can build the network, but have to use system RAM since the model is too large to fit in my GPU, which has meant it's taken days to run the model on my machine. The input is currently 500,20000,1 to an output of 500,1,1
-I'm using 5,000 nodes in the first fully connected (Dense) layer. Is this sufficient for the number of features?
-Is there a way of reducing the dimensionality so as to run it on my GPU?
I suppose each input entry has size (20000, 1) and you have 500 entries which make up your database?
In that case you can start by reducing the batch_size, but I also suppose that you mean that even the network weights don't fit in you GPU memory. In that case the only thing (that I know of) that you can do is dimensionality reduction.
You have 20000 features, but it is highly unlikely that all of them are important for the output value. With PCA (Principal Component Analysis) you can check the importance of all you parameters and you will probably see that only a few of them combined will be 90% or more important for the end result. In this case you can disregard the unimportant features and create a network that predicts the output based on let's say only 1000 (or even less) features.
An important note: The only reason I can think of where you would need that many features, is if you are dealing with an image, a spectrum (you can see a spectrum as a 1D image), ... In this case I recommend looking into convolutional neural networks. They are not fully-connected, which removes a lot of trainable parameters while probably performing even better.
I'm training an LSTM network with Tensorflow in Python and wanted to switch to tf.contrib.cudnn_rnn.CudnnLSTM for faster training. What I did is replaced
cells = tf.nn.rnn_cell.LSTMCell(self.num_hidden)
initial_state = cells.zero_state(self.batch_size, tf.float32)
rnn_outputs, _ = tf.nn.dynamic_rnn(cells, my_inputs, initial_state = initial_state)
with
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(1, self.num_hidden)
rnn_outputs, _ = lstm(my_inputs)
I'm experiencing significant training speedup (more than 10x times), but at the same time my performance metric goes down. AUC on a binary classification is 0.741 when using LSTMCell and 0.705 when using CudnnLSTM. I'm wondering if I'm doing something wrong or it's the difference in implementation between those two and it's that's the case how to get my performance back while keep using CudnnLSTM.
The training dataset has 15,337 sequences of varying length (up to few hundred elements) that are padded with zeros to be the same length in each batch. All the code is the same including the TF Dataset API pipeline and all evaluation metrics. I ran each version few times and in all cases it converges around those values.
Moreover, I have few datasets that can be plugged into exactly the same model and the problem persists on all of them.
In the tensorflow code for cudnn_rnn I found a sentence saying:
Cudnn LSTM and GRU are mathematically different from their tf
counterparts.
But there's no explanation what those differences really are...
It seems tf.contrib.cudnn_rnn.CudnnLSTM are time-major, so those should be provided with sequence of shape (seq_len, batch_size, embedding_size) instead of (batch_size, seq_len, embedding_size), so you would have to transpose it (I think, can't be sure when it comes to messy Tensorflow documentation, but you may want to test that. See links below if you wish to check it).
More informations on the topic here (in there is another link pointing towards math differences), except one thing seems to be wrong: not only GRU is time-major, LSTM is as well (as pointed by this issue).
I would advise against using tf.contrib, as it's even messier (and will be, finally, left out of Tensorflow 2.0 releases) and stick to keras if possible (as it will be the main front-end of the upcoming Tensorflow 2.0) or tf.nn, as it's gonna be a part of tf.Estimator API (though it's far less readable IMO).
... or consider using PyTorch to save yourself the hassle, where input shapes (and their meaning) are provided in the documentation at the very least.
I want to train an RNN with different input size of sentence X, without padding. The logic used for this is that I am using Global Variables and for every step, I take an example, write the forward propagation i.e. build the graph, run the optimizer and then repeat the step again with another example. The program is extremely slow as compared to the numpy implementation of the same thing where I have implemented forward and backward propagation and using the same logic as above. The numpy implementation takes a few seconds while Tensorflow is extremely slow. Can running the same thing on GPU will be useful or I am doing some logical mistake ?
As a general guideline, GPU boosts performance only if you have calculation intensive code and little data transfer. In other words, if you train your model one instance at a time (or on small batch sizes) the overhead for data transfer to/from GPU can even make your code run slower! But if you feed in a good chunk of samples, then GPU will definitely boost your code.