Keras - no good way to stop and resume training?

Keras - no good way to stop and resume training? - python

After a lot of research, it seems like there is no good way to properly stop and resume training using a Tensorflow 2 / Keras model. This is true whether you are using model.fit() or using a custom training loop.
There seem to be 2 supported ways to save a model while training:
Save just the weights of the model, using model.save_weights() or save_weights_only=True with tf.keras.callbacks.ModelCheckpoint. This seems to be preferred by most of the examples I've seen, however it has a number of major issues:
The optimizer state is not saved, meaning training resumption will not be correct.
Learning rate schedule is reset - this can be catastrophic for some models.
Tensorboard logs go back to step 0 - making logging essentually useless unless complex workarounds are implemented.
Save the entire model, optimizer, etc. using model.save() or save_weights_only=False. The optimizer state is saved (good) but the following issues remain:
Tensorboard logs still go back to step 0
Learning rate schedule is still reset (!!!)
It is impossible to use custom metrics.
This doesn't work at all when using a custom training loop - custom training loops use a non-compiled model, and saving/loading a non-compiled model doesn't seem to be supported.
The best workaround I've found is to use a custom training loop, manually saving the step. This fixes the tensorboard logging, and the learning rate schedule can be fixed by doing something like keras.backend.set_value(model.optimizer.iterations, step). However, since a full model save is off the table, the optimizer state is not preserved. I can see no way to save the state of the optimizer independently, at least without a lot of work. And messing with the LR schedule as I've done feels messy as well.
Am I missing something? How are people out there saving/resuming using this API?

tf.keras.callbacks.experimental.BackupAndRestore API for resuming training from interruptions has been added for tensorflow>=2.3. It works great in my experience.
Reference:
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/experimental/BackupAndRestore

You're right, there isn't builtin support for resumability - which is exactly what motivated me to create DeepTrain. It's like Pytorch Lightning (better and worse in different regards) for TensorFlow/Keras.
Why another library? Don't we have enough? You have nothing like this; if there was, I'd not build it. DeepTrain's tailored for the "babysitting approach" to training: train fewer models, but train them thoroughly. Closely monitor each stage to diagnose what's wrong and how to fix.
Inspiration came from my own use; I'd see "validation spikes" throughout a long epoch, and couldn't afford to pause as it'd restart the epoch or otherwise disrupt the train loop. And forget knowing which batch you were fitting, or how many remain.
How's it compare to Pytorch Lightning? Superior resumability and introspection, along unique train debug utilities - but Lightning fares better in other regards. I have a comprehensive list comparison in working, will post within a week.
Pytorch support coming? Maybe. If I convince the Lightning dev team to make up for its shortcomings relative to DeepTrain, then not - otherwise probably. In the meantime, you can explore the gallery of Examples.
Minimal example:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from deeptrain import TrainGenerator, DataGenerator
ipt = Input((16,))
out = Dense(10, 'softmax')(ipt)
model = Model(ipt, out)
model.compile('adam', 'categorical_crossentropy')
dg = DataGenerator(data_path="data/train", labels_path="data/train/labels.npy")
vdg = DataGenerator(data_path="data/val", labels_path="data/val/labels.npy")
tg = TrainGenerator(model, dg, vdg, epochs=3, logs_dir="logs/")
tg.train()
You can KeyboardInterrupt at any time, inspect the model, train state, data generator - and resume.

tf.keras.callbacks.BackupAndRestore can take care of this.

Just use the callback function as
callback = tf.keras.callbacks.experimental.BackupAndRestore(
backup_dir="backup_directory")

Related

Tensorflow - What does Training and Prediction mode mean when making a model?

So, I've googled prior to asking this, obviously, however, there doesn't seem to be much mention on these modes directly. Tensorflow documentation mentions "test" mode in passing which, upon further reading, didn't make very much sense to me.
From what I've gathered, my best shot at this is that to reduce ram, when your model is in prediction mode, you just use a pretrained model to make some predictions based on your input?
If someone could help with this and help me understand, I would be extremely grateful.

Training refers to the part where your neural network learns. By learning I mean how your model changes it's weights to improve it's performance on a task given a dataset. This is achieved using the backpropogation algorithm.
Predicting, on the other hand, does not involve any learning. It is only to see how well your model performs after it has been trained. There are no changes made to the model when it is in prediction mode.

Why does the saved model starts with initital loss and accuracy values after loading in Keras?

I am building a model for machine comprehension. It's a heavy model required to train on lots of data and this requires me more time. I have used keras callbacks to save model after every epoch and also save a history of loss and accuracy.
The problem is, when I am loading a trained model, and try to continue it's training using initial_epoch argument, the loss and accuracy values are same as untrained model.
Here is the code: https://github.com/ParikhKadam/bidaf-keras
The code used to save and load model is in /models/bidaf.py
The script I am using to load the model is:
from .models import BidirectionalAttentionFlow
from .scripts.data_generator import load_data_generators
import os
import numpy as np
def main():
emdim = 400
bidaf = BidirectionalAttentionFlow(emdim=emdim, num_highway_layers=2,
num_decoders=1, encoder_dropout=0.4, decoder_dropout=0.6)
bidaf.load_bidaf(os.path.join(os.path.dirname(__file__), 'saved_items', 'bidaf_29.h5'))
train_generator, validation_generator = load_data_generators(batch_size=16, emdim=emdim, shuffle=True)
model = bidaf.train_model(train_generator, epochs=50, validation_generator=validation_generator, initial_epoch=29,
save_history=False, save_model_per_epoch=False)
if __name__ == '__main__':
main()
The training history is quite good which is:
epoch,accuracy,loss,val_accuracy,val_loss
0,0.5021367247352657,5.479433422293752,0.502228641179383,5.451400522458351
1,0.5028450897193741,5.234336488338403,0.5037527732234647,5.0748545675049
2,0.5036885394022954,5.042028017280698,0.5039489093881276,5.0298488218407975
3,0.503893446146289,4.996997425685413,0.5040753162241299,4.976164487656699
4,0.5040576918224873,4.955544574118662,0.5041905890181151,4.931354981493792
5,0.5042372655790888,4.909940965651957,0.5043896965802341,4.881359395178988
6,0.504458428129642,4.8542871887472465,0.5045972716586732,4.815464454729135
7,0.50471843351102,4.791098495962496,0.5048680457262408,4.747811231472629
8,0.5050776754196002,4.713560494026321,0.5054184527602898,4.64730478015052
9,0.5058853749443502,4.580552254050073,0.5071290369370443,4.446513280167718
10,0.5081544614246304,4.341471499420364,0.5132941329030303,4.145318906086552
11,0.5123970410575613,4.081624463197288,0.5178775145611896,4.027316586998608
12,0.5149879128865782,3.9577423109634613,0.5187159608315838,3.950151870168726
13,0.5161411008840144,3.8964761709052578,0.5191430166876064,3.906301355196609
14,0.5168211272672539,3.8585826589385697,0.5191263493850466,3.865382308412537
15,0.5173216891201444,3.830764191839807,0.519219763635108,3.8341492204942607
16,0.5177805591697787,3.805340048675155,0.5197178382215892,3.8204319018292585
17,0.5181171635676399,3.7877712072310343,0.5193657963810704,3.798006804522368
18,0.5184295824699279,3.77086071548255,0.5193122694008523,3.7820449101377243
19,0.5187343664397653,3.7555085003534194,0.5203585262348183,3.776260506494833
20,0.519005008308583,3.7430062334375065,0.5195983755362352,3.7605361109533995
21,0.5192872482429703,3.731001830462149,0.5202017035842986,3.7515058917231405
22,0.5195097722222706,3.7194103983513553,0.5207148585133065,3.7446572377159795
23,0.5197511249107636,3.7101052441559905,0.5207420740297026,3.740088335181619
24,0.5199862479678652,3.701593302911729,0.5200187951731082,3.7254406861185188
25,0.5200847805044403,3.6944093077914464,0.520112738649039,3.7203616696860786
26,0.5203289568582412,3.6844954882274092,0.5217114634669081,3.7214983577364547
27,0.5205629846610852,3.6781935968943595,0.520915311442328,3.705435317731209
28,0.5206827641463226,3.6718110897539193,0.5214088439286978,3.7003081666703377
Also, I have already taken care of loading custom objects such as layers, loss function and accuracy.
I am kind of frustrated by now as I took me days to train this model upto epochs and now I can't resume training. I have referred various threads in keras issues and found many people are facing such issues but can't find a solution.
Someone in a thread said that "Keras will not save RNN states" (I ain't using stateful RNNs) and someone else said "Keras reinitializes all the weights before saving which we can handle using a flag." I mean, if such problems exist in Keras, what will be the use of functions like save().
I have also tried saving only weights after every epoch and then building model from scratch and then loading those weights into it. But that didn't work. You can find the old code I used to save weights only in the above listed github repo's older branches.
I have referred this issue with no help - #4875
That issue is open from past two years. Can't understand what all the developers are doing! Is anyone here who can help? Should I switch to tensorflow or I will face the same issues in that too?
Please help...
Edit1:
I haven't tried saving model using model.save() but I have seen people on other threads saying that the issue was solved with model.save() and models.save_model(). If it is actually solved, ModelCheckpoint should also save optimizer state to resume training but it doesn't (or can't) whatsoever the reason. I have verified the code of ModelCheckpoint callback which indirectly calls model.save() which leads a call to models.save_model(). So theoretically, if the issue in the base i.e. models.save_model() is solved, then it should also be solved in other functions.
Sorry, but I don't have a powerful machine to check this practically. If someone here has, I have shared my code on github and the link is provided in the issue. Please try the resume training on it and detect the cause of this problem.
I am using the computer provided by a national institute and hence, students here need to share this single computer for their projects. I can't use that computer for such tasks. Thank you..
Edit2:
Recently, I tried to check if the weights are saved correctly. For that, I evaluated the model with my validation generator. I saw that the output loss and accuracy remained the same as those in the beginning of model training. Seeing this, I reached a conclusion that its actually the issue with saving weights of the model. I might be wrong here..
BTW, I have also used multi_gpu_model() in my model code. Can it cause this issue? I can't try training model on CPU as it too heavy for that and will take a few days for only 1 epoch to complete. Can anyone help debugging?
I see no response in such issues these days. Just list current issues on the README.md in keras github so that users can be aware before trying keras out and wasting months behind it.

How to use keras model inside other model in TPU

I am trying to convert a keras model to tpu model in google colab, but this model has another model inside.
Take a look at the code:
https://colab.research.google.com/drive/1EmIrheKnrNYNNHPp0J7EBjw2WjsPXFVJ
This is a modified version of one of the examples in the google tpu documentation:
https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb
If the sub_model is converted and used directly it works, but if the sub model is inside another model it does not work. I need the sub model type of network because i am trying to train a GAN network that has 2 networks inside (gan=generator+discriminator) so if this test works probably it will work with the gan too.
I have tried several things:
Convert to tpu the model without converting the sub model, in that case when training starts an error is prompted related to the inputs of the sub model.
Convert both the model and sub model to tpu, in that case an error is prompted when converting the "parent" model, the exception only says at the end "layers".
Convert only the sub model to tpu, in that case no error is prompted but the training is not accelerated by the tpu and it is extremely slow like if no conversion to tpu was made at all.
Using fixed batch size or not, both have the same result, the model does not work.
Any ideas? Thanks a lot.

Divide into parts only use submodel at tpu first. Then put something simple instead of submodel and use the model in TPU. If this does not work , create something very simple which includes similar structure with models you are sure that are working and then step by step add things to converge your complex model which you want to use in TPU.
I am struggling with such things. What I did at the very beginning using MNIST is trained the model and get the coefficients outside rewrite relu dense dropout and NN matricies myself and run the model using numpy and then cupy and then pyopencl and then I replaced functions with my own raw cuda C and opencl functions so that getting deeper and simpler I can find what is wrong when something does not work. At last I write my genetic selective training algo and learned a lot.
And most important it gave me the opportunity to try some crazy ideas for training and modelling and manuplating and making sense of NN coffecients.
The problem in my opinion is TF - Keras etc are too high level. Optimizers - Solvers , there is too much unknown. Even neural networks are not under control. GAN is problematic while training it does not converge everytime takes days to train most of the time. Even if you train. You dont know any idea how it converges. Most of the tricks - techniques which protects you from vanishing gradient are not mathematically backed they are nevertheless works very amazingly. (?!?)
**Go simpler deeper and and complexity step by step. Follow a practicing on which you comprehend as much as you can ** It will cost some time and energy but you will benefit it tremendously in my opinion.

Alternate tensorflow hub module tags in runtime

I have a standard pipeline that evaluates the model after training an epoch. I need resnet50 to be finetunable while training, so I instantiate like so:
resnet50_module = hub.Module("https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/1",
trainable=True, name="resnet50_finetunable", tags={"train"})
However, I read here that I should unset the tags when evaluating.
I realize that I can save the model, close the session, reset the graph, rebuild the model with the tags=None and load the weights from a checkpoint to do the eval. This seems very wasteful specially since the size of the model is huge due to resnet50, and I need to do hundreds of epochs to get good results. Is there a way to alternate between tags without this?
Thanks!

I'm afraid there is no good way to do this without going through a checkpoint.
Variables are created when hub.Module() is called, so they are tied to a particular graph version (tags={"train"} for training or the empty tag set for inference). What you describe could be read as a feature request to set that separately for each application of the module, but that doesn't exist yet (and has some ramifications).
Is checkpointing to local disk really that expensive compared to the eval you want to run? Wouldn't you want to checkpoint at times anyways, to allow resuming after a crash?

Why doesn't Tensorflow automatically handle hidden states of recurrent cells?

I am going through couple of Tensorflow examples that use LSTM cells and trying to understand the purpose of initial_state variable that is used in one implementation but not in the other for some unknown reason.
For example PTB example uses it as:
self._initial_state = cell.zero_state(config.batch_size, data_type())
state = self._initial_state
where it represents hidden state transitions and used to keep the hidden state intact during batch training. This variable should be zeroed between the epochs naturally. And yet some recurrent Bi-LSTM models don't use initial_state at all which makes you think that either it is somehow done by Tensorflow behind-the-scenes or not necessary at all hence the confusion. So, why do some recurrent models use it and others don't? In Torch for example, same mechanism is as simple as:
local params, grad_params = model:getParameters()
-- start training loop
while epoch < max_epoch do
for mini_batch in training_data do
(...)
grad_params:zero()
end
end
The hidden state is handled by the framework no need for all that really clunky stuff or am I missing something here. Can you please explain how does it work in Tensorflow?

As I understood, it it appears to be specific setup for Tensorflow PTB model which is supposed to be running not only with single LSTM cells but with several ones (who would even try to train it on more than 2 cells I wonder). For that it needs to keep track of c and h tensors between the cells and thus the _initial_state variable. It also is supposed to be running in parallel over several GPUs as well, continue if interrupted etc. And that is why PTB example code looks ugly and overengineered to a newcomer.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.