I am writing neural network code in TensorFlow, and I save the variables every 1000 epochs. So I expect the variables of the 1001st, 2001st, 3001st, ... epochs to be saved to different files.
The code below is the save function I made.
def save(self, epoch):
    model_name = "MODEL_save"
    checkpoint_dir = os.path.join(model_name)
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    self.saver.save(self.sess, checkpoint_dir + '/model', global_step=epoch)
    self.saver.save(self.sess, checkpoint_dir + '/model')
    print("path for saved %s" % checkpoint_dir)
I made it save twice each time the function is called, because I wanted to keep a history of the variables for every 1000 epochs (via global_step=epoch) and also to save the latest variables in a file without the epoch in its name.
I call this function whenever the epoch condition is met like below.
for epoch in xrange(self.m_total_epoch):
    # ... CODE FOR NEURAL NETWORK ...
    if epoch % 1000 == 1 and epoch != 1:
        self.save(epoch)
Assuming the current epoch is 29326, I expect the directory to contain saved files for 1001, 2001, 3001, ..., 29001. However, only some of the files are there: 26001, 27001, 28001, 29001. I checked and the same thing happens on other computers. This is different from what I expected. Why does it happen?
tf.train.Saver has a max_to_keep argument in its constructor that keeps only the most recent checkpoints. Somewhat surprisingly, this max_to_keep argument has a default value of 5, so by default you will only have the latest 5 checkpoints.
To keep all checkpoints, set this argument to None:
saver = tf.train.Saver(max_to_keep=None)
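Applied to the save() function in the question (a sketch only, assuming self.saver is created once, e.g. in the model's constructor), the only change needed is where the saver is built:

# Keep every epoch-numbered checkpoint instead of only the last 5.
self.saver = tf.train.Saver(max_to_keep=None)

With that change the numbered saves (model-1001, model-2001, ...) are never garbage-collected, while the second save without global_step keeps overwriting a single "latest" checkpoint, as intended.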
I defined a scalar _log_alpha and an optimizer to optimize it.
self._log_alpha = torch.log(torch.ones(1) * alpha).to(get_device()).requires_grad_(True)
self._log_alpha_optimizer = optim.Adam([self._log_alpha], lr=lr)
If I start training from the very beginning, it works just fine. The _log_alpha changes a little bit every time I call
self._log_alpha_optimizer.zero_grad()
log_alpha_loss.backward()
self._log_alpha_optimizer.step()
However, if I train several steps, and save the optimizer's state_dict and _log_alpha
ckpt = {'log_alpha_optimizer_state_dict': self._log_alpha_optimizer.state_dict(),
        'log_alpha': self._log_alpha}
torch.save(ckpt, save_dir)
and then load them to resume training
ckpt = torch.load(load_dir, map_location=torch.device(get_device()))
self._log_alpha_optimizer.load_state_dict(ckpt['log_alpha_optimizer_state_dict'])
self._log_alpha = ckpt['log_alpha']
self._log_alpha.requires_grad_(True)
the _log_alpha won't change anymore.
I also defined some nn.Modules, whose optimizers still work after saving and loading. I wonder which part of the _log_alpha_optimizer is wrong?
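One likely culprit (an assumption, since the question doesn't show the rest of the training loop): self._log_alpha = ckpt['log_alpha'] rebinds the attribute to a different tensor object, while the Adam optimizer built earlier still holds a reference to the original tensor, so its step() updates a parameter the loss no longer uses. A minimal sketch of one workaround, copying the restored value into the original tensor in place:

ckpt = torch.load(load_dir, map_location=torch.device(get_device()))
with torch.no_grad():
    # Keep the parameter object the optimizer already references;
    # only overwrite its value with the checkpointed one.
    self._log_alpha.copy_(ckpt['log_alpha'])
self._log_alpha_optimizer.load_state_dict(ckpt['log_alpha_optimizer_state_dict'])

Alternatively, recreate the optimizer around the restored tensor (optim.Adam([self._log_alpha], lr=lr)) before loading its state_dict.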
I'm trying to save iterative checkpoints of my models, but also the model that achieved the best score on an independent validation dataset. My checkpoints, however, overwrite my best model. Effectively, I'm using something like:
saver = tf.train.Saver()
with tf.Session() as sess:
    for epoch in range(20):
        # Train model [...]
        # and save a checkpoint
        saver.save(sess, "iter", global_step=epoch)
        if best_validation_acc < last_validation_acc:
            saver.save(sess, "best_model")
How do I get my best model to not be overwritten by my iterated saves?
The reason is that you're using the same tf.train.Saver for both, so it keeps only the last max_to_keep=5 checkpoint files, no matter how you name them.
The simplest solution is to set max_to_keep=None, which will force the saver to keep all checkpoints and not overwrite anything. However, you would probably prefer to overwrite at least the iteration checkpoints. The solution in this case is:
iter_saver = tf.train.Saver(max_to_keep=3)  # keep 3 last iterations
best_saver = tf.train.Saver(max_to_keep=5)  # keep 5 last best models

with tf.Session() as sess:
    for epoch in range(20):
        # Train model [...]
        # and save a checkpoint
        iter_saver.save(sess, "iter/model", global_step=epoch)
        if best_validation_acc < last_validation_acc:
            best_saver.save(sess, "best/model")
I'd also use different directories, so that the two savers don't clash over the same checkpoint state file.
I wanted to save multiple models for my experiment but I noticed that tf.train.Saver() constructor could not save more than 5 models. Here is a simple code:
import tensorflow as tf

x = tf.Variable(tf.zeros([1]))
saver = tf.train.Saver()
sess = tf.Session()

for i in range(10):
    sess.run(tf.initialize_all_variables())
    saver.save(sess, '/home/eneskocabey/Desktop/model' + str(i))
When I ran this code, I saw only 5 models on my Desktop. Why is this? How can I save more than 5 models with the same tf.train.Saver() constructor?
The tf.train.Saver() constructor takes an optional argument called max_to_keep, which defaults to keeping the 5 most recent checkpoints of your model. To save more models, simply specify a value for that argument:
import tensorflow as tf

x = tf.Variable(tf.zeros([1]))
saver = tf.train.Saver(max_to_keep=10)
sess = tf.Session()

for i in range(10):
    sess.run(tf.initialize_all_variables())
    saver.save(sess, '/home/eneskocabey/Desktop/model' + str(i))
To keep all checkpoints, pass the argument max_to_keep=None to the saver constructor.
If you use your own tf.Session() for the training:
In order to keep the intermediate checkpoints rather than only the last 5, there are two parameters of tf.train.Saver() to adjust:
max_to_keep - indicates the maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, no checkpoints are deleted from the filesystem but only the last one is kept in the checkpoint file. Defaults to 5 (that is, the 5 most recent checkpoint files are kept.)
keep_checkpoint_every_n_hours - In addition to keeping the most recent max_to_keep checkpoint files, you might want to keep one checkpoint file for every N hours of training. This can be useful if you want to later analyze how a model progressed during a long training session. For example, passing keep_checkpoint_every_n_hours=2 ensures that you keep one checkpoint file for every 2 hours of training. The default value of 10,000 hours effectively disables the feature.
So with the following, the saver keeps the 10 most recent checkpoints and, in addition, permanently keeps one checkpoint for every 2 hours of training:
saver = tf.train.Saver(max_to_keep=10, keep_checkpoint_every_n_hours=2)
If you use tf.estimator.Estimator() then the saving of the checkpoint is done by the Estimator itself. That's why you need to pass it a tf.estimator.RunConfig() with some of the following parameters:
keep_checkpoint_max - The maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept.)
save_checkpoints_steps - Save checkpoints every this many steps. Can not be specified with save_checkpoints_secs.
save_checkpoints_secs - Save checkpoints every this many seconds. Can not be specified with save_checkpoints_steps. Defaults to 600 seconds if both save_checkpoints_steps and save_checkpoints_secs are not set in constructor. If both save_checkpoints_steps and save_checkpoints_secs are None, then checkpoints are disabled.
So if you do the following, you will store a checkpoint every 100 iterations and if the total number of saved checkpoints reaches 10, then the oldest checkpoint will be deleted and a new one will replace it:
run_config = tf.estimator.RunConfig()
run_config = run_config.replace(keep_checkpoint_max=10,
                                save_checkpoints_steps=100)

classifier = tf.estimator.Estimator(
    model_fn=model_fn, model_dir=model_dir, config=run_config)
I am training neural nets with TensorFlow, and training works using a custom implementation of batch gradient descent. I have a logging function which records validation error, and it gets down to about 2.6%. I'm saving the model every 10 epochs using a tf.train.Saver.
However, when I load the variables back into memory using a tf.train.Saver with the same script, the model performs poorly, at about the level it reaches when the weights are randomly initialized. I have inspected the convolutional filters in the checkpoint, and they don't seem to be random.
I have not included all of my code, since it's around 400 lines long, but I've included what seem to be the important sections here and summarized the other functionality.
class ModelTrainer:

    def __init__(self, ...hyperparameters...):
        # Initialize datasets and hyperparameters
        for each gpu:
            # Create loss function and gradient assigned to this gpu using tf.device("/gpu:n")
        with tf.device("/cpu:0"):
            # Average and clip gradients from the GPUs
            # Create this batch gradient descent operation for each trainable variable
            variable.assign_sub(learning_rate * averaged_and_clipped_gradient).op

    def train(self, ...hyperparameters...):
        saver = tf.train.Saver(tf.all_variables(), max_to_keep=30)
        init = tf.initialize_all_variables()
        sess = tf.Session()
        if starting_point is not None:  # Used to evaluate existing models
            saver.restore(sess, starting_point)
        else:
            sess.run(init)
        for i in range(number_of_batches):
            # ... Get training batch ...
            gradients = sess.run(calculate_gradients, feeds=training_batch)
            # Average "gradients" variable across multiple batches
            # Must be done because of GPU memory limitations
            if i % meta_batch_size == 0:
                sess.run(apply_gradients_operators,
                         feeds=gradients_that_have_been_averaged_across_multiple_batches)
            # Log validation error
            if i % save_after_n_batches == 0:
                saver.save(sess, "some-filename", global_step=self.iter_num)
As expected, running these two functions creates a set of checkpoint files named "some-filename-40001", or whatever iteration number the training is at when each file is saved. Unfortunately, when I load these checkpoints back in using the starting_point parameter, they perform on par with random initialization.
Initially I assumed it was something to do with the way I'm training the model, since I haven't found anyone else with this issue, but the validation error behaves as expected.
Edit: More odd results. After more experimentation, I have found that when I load the saved model using the code:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("saved-checkpoint-40.meta")
    saver.restore(sess, "saved-checkpoint-40")
    # ... Use model in some way ...
I get different, but still incorrect results.
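One way to narrow this down (a diagnostic sketch, not part of the question; the variable name conv1/weights is made up) is to compare a variable's value stored in the checkpoint with its value in the live session after the restore:

import numpy as np
import tensorflow as tf

# Read the value straight from the checkpoint file.
reader = tf.train.NewCheckpointReader("saved-checkpoint-40")
print(reader.get_variable_to_shape_map())    # names actually stored in the file
stored = reader.get_tensor("conv1/weights")  # hypothetical variable name

with tf.Session() as sess:
    saver = tf.train.import_meta_graph("saved-checkpoint-40.meta")
    saver.restore(sess, "saved-checkpoint-40")
    live = sess.run("conv1/weights:0")       # same variable, fetched from the live graph
print(np.allclose(stored, live))             # False would mean the restore did not load it

If the values match, the weights are being restored correctly, and the problem more likely lies elsewhere, for example in preprocessing or graph construction differing between the training and evaluation runs.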
I have finished running a big model in TensorFlow (Python), but I did not save it inside the session. Now that the training is over, I want to save the variables. I am doing the following:
saver = tf.train.Saver()
with tf.Session(graph=graph) as sess:
    save_path = saver.save(sess, "86_model.ckpt")
    print("Model saved in file: %s" % save_path)
This returns: ValueError: No variables to save. According to their website, what is missing is initialize_all_variables(). The documentation says little about what exactly that does. The word "initialize" scares me; I do not want to reset all my trained values. Is there any way to save my model without re-running it?
It seems from the TensorFlow documentation that the session is the thing that holds the information from the trained model. So presumably somewhere you called sess.run() to train your model; what you want to do is call saver.save(sess, ...) with THAT session, not a new one you create together with this saver object.
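In code, the point is just that the save happens inside the session that ran the training (a sketch, not the asker's actual script):

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # ... all the sess.run(train_op, ...) calls that train the model ...
    save_path = saver.save(sess, "86_model.ckpt")  # same sess that holds the trained values
    print("Model saved in file: %s" % save_path)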
I believe it's because you are not initializing all of your variables in the saver. This should work:
with tf.Session() as sess:
    tf.initialize_all_variables().run()
    saver = tf.train.Saver(tf.all_variables())

    # ------- everything your session does -------------

    checkpoint_path = os.path.join(save_dir, 'model.ckpt')
    saver.save(sess, checkpoint_path, global_step=your_global_step)
How about using skflow? With skflow (now integrated into TensorFlow) you can specify the parameter model_dir on your constructor, and it will automatically save your model while training (it saves checkpoints, so if something goes wrong during training you can restart from the last checkpoint).
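skflow was folded into TensorFlow (its role is now played by tf.estimator), so the same idea looks roughly like this; the feature column, layer sizes, and input_fn below are made up for illustration:

import tensorflow as tf

# Sketch only: a canned estimator that writes its own checkpoints into model_dir.
feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[32, 16],
    n_classes=3,
    model_dir="/tmp/my_model")  # checkpoints are written here automatically during training
# classifier.train(input_fn=my_input_fn, steps=1000)  # my_input_fn is a hypothetical input pipeline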