Restore models in TensorFlow 1.0 from many steps - python

I'd like to train my model for many epochs using TensorFlow v1.0, and my idea is to save the model at every epoch. But I soon found that the current checkpoint replaces the previous one (i.e. the earlier one vanishes). So I want to know how to keep all of the models and restore them one by one. I think it's hard and haven't found a nice solution. Thanks for every suggestion!

tf.train.Saver().save() has an argument global_step.
From the documentation:
Savers can automatically number checkpoint filenames with a provided counter. This lets you keep multiple checkpoints at different steps while training a model.
So you should try something like:
saver = tf.train.Saver(...)
sess = tf.Session(...)
for epoch in range(num_epochs):
    # ... train model ...
    saver.save(sess, "MODEL_NAME", global_step=epoch)
Note that by default, TensorFlow keeps only the last 5 checkpoints. If you want to keep them all, you should initialize your Saver with something along the lines of:
saver = tf.train.Saver(max_to_keep=num_epochs)
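To address the other half of the question (restoring the saved models one by one), a minimal sketch could look like the following; the checkpoint directory and the evaluation step are placeholders, not part of the original answer:
import tensorflow as tf

saver = tf.train.Saver()
# Lists every checkpoint written into the given directory.
ckpt_state = tf.train.get_checkpoint_state("checkpoint_dir")
with tf.Session() as sess:
    for ckpt_path in ckpt_state.all_model_checkpoint_paths:
        saver.restore(sess, ckpt_path)   # e.g. "MODEL_NAME-0", "MODEL_NAME-1", ...
        # ... evaluate the restored model here ...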

Related

tf.keras.callbacks.ModelCheckpoint vs tf.train.Checkpoint

I am kind of new to the TensorFlow world but have written some programs in Keras. Since TensorFlow 2 is officially similar to Keras, I am quite confused about the difference between tf.keras.callbacks.ModelCheckpoint and tf.train.Checkpoint. If anybody can shed light on this, I would appreciate it.
It depends on whether a custom training loop is required. In most cases it isn't, and you can just call model.fit() and pass tf.keras.callbacks.ModelCheckpoint. If you do need to write your own training loop, then you have to use tf.train.Checkpoint (and tf.train.CheckpointManager), since there is no callback mechanism.
TensorFlow is a 'computation' library and Keras is a deep learning library which can work with TF, PyTorch, etc. So what TF provides is a more generic, not-so-customized-for-deep-learning version. If you just compare the docs you can see how much more comprehensive and customized ModelCheckpoint is. Checkpoint just reads and writes stuff from/to disk; ModelCheckpoint is much smarter!
Also, ModelCheckpoint is a callback. It means you can just make an instance of it and pass it to the fit function:
model_checkpoint = ModelCheckpoint(...)
model.fit(..., callbacks=[..., model_checkpoint, ...], ...)
I took a quick look at Keras's implementation of ModelCheckpoint: it calls either the save or save_weights method on Model, which in some cases uses TensorFlow's Checkpoint itself. So it is not a wrapper per se, but it certainly sits at a different level of abstraction -- more specialized for saving Keras models.
I also had a hard time differentiating between the checkpoint objects used when I looked at other people's code, so I wrote down some notes about when to use which one and how to use them in general.
Either way, I think it might be useful for other people having the same issue:
Saving model Checkpoints
These are 2 ways of saving your model's checkpoints, each is for a different use case:
1) Checkpoint & CheckpointManager
This is useful when you are managing the training loops yourself.
You use them like this:
1.1) Checkpoint
Definition from the docs:
"A Checkpoint object can be constructed to save either a single or group of trackable objects to a checkpoint file".
How to initialise it:
You can pass it key-value pairs for:
all the custom objects or callables that make up your model and that you want to keep track of,
like a generator, discriminator, loss function, optimizer, etc.
ckpt = Checkpoint(discr_opt=discr_opt,
                  genrt_opt=genrt_opt,
                  wgan=wgan,
                  d_model=d_model,
                  g_model=g_model)
1.2) CheckpointManager
This literally manages the checkpoints you have defined to be stored at a location, along with things like how many to keep.
Definition from the docs:
"Manages multiple checkpoints by keeping some and deleting unneeded ones"
How to initialise it:
Initialise it with the Checkpoint object you created as the first argument.
The directory where to save the checkpoint files.
And you probably want to define how many checkpoints to keep, since checkpoints of complex models can take up a lot of space:
manager = CheckpointManager(ckpt, "training_checkpoints_wgan", max_to_keep=3)
How to use it:
We have setup the manager object with our specified checkpoints, so it's ready to use.
Call this at the end of each training epoch
manager.save()
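Putting 1.1 and 1.2 together, a minimal custom-loop sketch might look like this (the model, optimizer, dataset and train_step below are illustrative placeholders, not part of the original notes):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # placeholder model
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "training_checkpoints_wgan", max_to_keep=3)

# Optionally resume from the latest checkpoint, if one exists.
ckpt.restore(manager.latest_checkpoint)

for epoch in range(num_epochs):
    for x_batch, y_batch in dataset:      # your tf.data pipeline
        train_step(x_batch, y_batch)      # your own training step
    manager.save()                        # keeps at most the 3 newest checkpoints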
2) ModelCheckpoint (callback)
You want to use this callback when you are not managing the epoch iterations yourself, for example when you have set up a relatively simple Sequential model and you call model.fit(), which manages the training process for you.
Definition from the docs:
"Callback to save the Keras model or model weights at some frequency."
How to initialise it:
Pass in the path where to save the model.
The option save_weights_only is set to False by default:
if you want to save only the weights, make sure to update this.
The option save_best_only is set to False by default:
if you want to save only the best model instead of all of them, you can set this to True.
verbose is set to 0 (False) by default; set it to 1 if you want a message confirming each save.
mc = ModelCheckpoint("training_checkpoints/cp.ckpt", save_best_only=True, save_weights_only=False)
How to use it:
The model checkpoint callback is now ready for training.
You pass the object into your callbacks list when you fit the model:
model.fit(X, y, epochs=100, callbacks=[mc])
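For completeness, a small self-contained sketch of the same flow (the toy model, data and checkpoint path are illustrative placeholders):
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint

# Toy data and model, purely illustrative.
X = np.random.rand(100, 4)
y = np.random.rand(100, 1)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# save_best_only needs a monitored metric (val_loss by default),
# hence the validation_split in fit() below.
mc = ModelCheckpoint("training_checkpoints/cp.ckpt",
                     save_best_only=True,
                     save_weights_only=False,
                     verbose=1)

model.fit(X, y, epochs=100, validation_split=0.2, callbacks=[mc])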

Tensorflow simple_save with checkpoints

I am trying to save my model at different steps while training. Let's say I would like to save after 5 epochs.
At this moment I am using:
tf.saved_model.simple_save(
    sess, model_folder, inputs, outputs
)
which works like a charm. Nevertheless, I realize it is saving the whole graph and weights on each iteration, which has a high computational cost.
I would like to update the weights of my model while keeping the graph from the previous save (since it is not changing during training).
I have read about tf.train.Saver, which seems to fit my intentions, but it forces me to specify all the variables I want to save, which is not as practical as the simple_save method. So I am wondering if there is any way of using simple_save in a checkpoint fashion.
I think you have a wrong understanding of tf.train.Saver. You can do something as simple as:
saver = tf.train.Saver()
with tf.Session() as sess:
    for e in range(epochs):
        ...
        if e % 5 == 0:
            saver.save(sess, "/path/where/to/save/model")
So there is no need to specify every single variable you want to save; by default the Saver handles all saveable variables in the graph.
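If you also want each of those periodic saves to be kept as a separate checkpoint instead of overwriting the previous one, a small variation (my addition, not part of the original answer) is to pass global_step and relax max_to_keep:
saver = tf.train.Saver(max_to_keep=None)   # keep every checkpoint
with tf.Session() as sess:
    for e in range(epochs):
        # ... training ...
        if e % 5 == 0:
            saver.save(sess, "/path/where/to/save/model", global_step=e)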

Trained neural network produces different predictions with same data (TensorFlow)

I have trained a neural network with TensorFlow. After training, I saved it and loaded it again in a new '.py' file to avoid retraining by accident. As I was testing it with some extra data, I found out that it predicts different things for the same data. Should it not theoretically compute the same thing for the same data?
Some information
feed forward net
4 hidden layers with 900 neurons each
5000 training epochs
reached accuracy of ~80%
data was normalized using normalize from sklearn.preprocessing
cost function: tensorflow.nn.softmax_cross_entropy_with_logits
optimizer: tf.train.AdamOptimizer
I am giving my network the data as a matrix, the same way I used for training (each row contains a data sample, with as many columns as there are input neurons).
Out of ten prediction cycles with the same data, my network produces different results in at least 2 cycles (maximum observed so far is 4).
How can this be? In theory, all that is happening are data-processing calculations of the form W_i*x_i + b_i. As my x_i, W_i and b_i no longer change, how come the prediction varies? Could there be a mistake in my model reloading routine?
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    result = sess.run(tf.argmax(prediction.eval(feed_dict={x: input_data}), 1))
    print(result)
So this was a really silly mistake on my part. Now it works fine when loading the model from a save. The problem was caused by the global variables initializer: if you leave it out, it will work fine. The previously found information may prove useful for someone, so I will leave it here. The solution is now:
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'path to your saved file C:x/y/z/model/model.ckpt')
After this you can go on as usual. I do not really know why the variables initializer prevents this from working. As I see it, it should be something like: initialize all variables so that they exist, with random values, and then go to the saved file and use the values from there, but apparently something else happens...
So I have been doing some testing and found out the following about this problem.
As I was trying to reuse my created model, I had to use tf.global_variables_initializer(). By doing so it overwrote my imported graph and all the values were random, which explains the different network outputs. This still left me with a problem to solve: how do I load my network? The workaround I am currently using is far from optimal, but it at least allows me to use my saved model. TensorFlow allows one to give unique names to the functions and tensors used. By doing so I could access them through the graph:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    graph = tf.get_default_graph()
    graph.get_tensor_by_name('name:0')
Using this method I could access all my saved values, but they were separated! It means that I had one weight and one bias per operation used, which led to a bunch of new variables. If you do not know the names, use the following:
print(graph.get_all_collection_keys())
This prints the collection names (our variables are stored in collections)
print(graph.get_collection('name'))
This allows us to access the collection and see what the names/keys of our variables are.
This led to another problem: I could no longer use my model, because the global variables initializer had overwritten everything. Thus I had to redefine the whole model manually with the weights and biases that I got previously.
Unfortunately, this is the only thing I could come up with. If anyone has a better idea, please let me know (see also the sketch below).
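One cleaner alternative (my suggestion, assuming the input placeholder and output op were given names at save time, e.g. 'input' and 'prediction') is to fetch those tensors by name and run them directly, without redefining the network:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name('input:0')                 # assumed placeholder name
    prediction = graph.get_tensor_by_name('prediction:0')   # assumed output name
    print(sess.run(tf.argmax(prediction, 1), feed_dict={x: input_data}))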
The whole thing with the mistake looked like this:
# imports ...
# placeholders for data ...

def my_network(data):
    # network definition with tf functions
    return output

def train_my_net():
    prediction = my_network(data)
    # cost function
    # optimizer
    with tf.Session() as sess:
        # for as many epochs as I want:
        #     training routine
        # save
        ...

def use_my_net():
    prediction = my_network(data)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())   # <-- this call was the mistake
        saver = tf.train.import_meta_graph('path to .meta')
        saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
        print(sess.run(prediction.eval(feed_dict={placeholder: data})))
        graph = tf.get_default_graph()
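For comparison, the working version of use_my_net (following the fix above) simply drops the initializer call and restores directly; the checkpoint path is a placeholder:
def use_my_net():
    prediction = my_network(data)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # No tf.global_variables_initializer() here --
        # saver.restore() supplies all variable values from the checkpoint.
        saver.restore(sess, 'path/to/model.ckpt')
        print(sess.run(prediction, feed_dict={placeholder: data}))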

Tensorflow how to modify a pre-trained Model saved as checkpoints

I trained an FCN model in TensorFlow following the implementation in the link and saved the complete model as a checkpoint. Now I want to use the saved (pre-trained) model for a different problem.
I tried to restore the model from the checkpoint by specifying the weights in the Saver as:
saver = tf.train.Saver({"weights": [w1_1, w1_2, w2_1, w2_2, w3_1, w3_2, w3_3, w3_4, w4_1, w4_2, w4_3, w4_4, w5_1, w5_2, w5_3, w6, w7]})
I am getting weights as:
w1_1 = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='inference/conv1_1_w')
and so on....
I am not able to restore it successfully (up to a specific layer).
TensorFlow version: r0.12
Either you can call init = tf.initialize_variables([list_of_vars]) followed by sess.run(init), which would reinitialize those variables for you, or you can recreate the graph with the same structure from the point where you want to freeze the weights, but keep different names for the variables. Further, in case you want to train only certain variables, you can pass just those variables to the optimizer:
tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=[wi, wj, ....])
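A rough sketch of that last suggestion combined with a partial restore (the scope names, the 'conv7' last-layer filter, and learning_rate/loss are illustrative assumptions, not taken from the question):
import tensorflow as tf

# Restore only the layers you want to reuse (here: everything under the
# 'inference' scope except an assumed last layer named 'conv7').
all_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='inference')
vars_to_restore = [v for v in all_vars if 'conv7' not in v.name]
saver = tf.train.Saver(vars_to_restore)

# Train only the remaining (new last-layer) variables.
vars_to_train = [v for v in all_vars if 'conv7' in v.name]
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=vars_to_train)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # initialize everything first
    saver.restore(sess, 'path/to/checkpoint')     # then overwrite the reused layers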

Fine-tune inception network 2 times (Tensorflow)

I want to run the flowers_train.py script from here: https://github.com/tensorflow/models/tree/master/inception/inception
to fine-tune the Inception network on the flowers dataset. The difference is that I want to save a checkpoint and then run the flowers_train.py script again, but this time restoring the previously saved checkpoint. I noticed that using this restorer again:
restorer = tf.train.Saver(variables_to_restore)
gives me a high loss in the first steps. So do I need to use restorer = tf.train.Saver()?
I also noticed that the provided checkpoint file is 434.9 MB, but the checkpoint I am saving is 389.9 MB.
What is variables_to_restore? If it does not include the last layer, then yes, you will see a higher loss and a smaller file size.
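In other words, the restorer you use for the second run should cover everything you saved in the first run. A minimal sketch of the distinction (the checkpoint paths are placeholders):
# First fine-tuning run: restore only the pre-trained backbone,
# excluding the freshly initialized last layer.
restorer = tf.train.Saver(variables_to_restore)
restorer.restore(sess, 'path/to/pretrained/checkpoint')

# Later runs: your own checkpoint already contains *all* variables
# (including the retrained last layer), so restore everything.
full_restorer = tf.train.Saver()
full_restorer.restore(sess, 'path/to/your/saved/checkpoint')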
