I am working on a VAE project in TensorFlow where the encoder/decoder networks are built in functions. The idea is to be able to save and then load the trained model and do sampling, using the encoder function.
After restoring the model, I am having trouble getting the decoder function to run and give me back the restored, trained variables; instead I get an "Uninitialized value" error. I assume it is because the function is either creating new variables, overwriting the existing ones, or something similar. But I cannot figure out how to solve this. Here is some code:
class VAE(object):
    def __init__(self, restore=True):
        self.session = tf.Session()
        if restore:
            self.restore_model()
        self.build_decoder = tf.make_template('decoder', self._build_decoder)

    @staticmethod
    def _build_decoder(z, output_size=768, hidden_size=200,
                       hidden_activation=tf.nn.elu, output_activation=tf.nn.sigmoid):
        x = tf.layers.dense(z, hidden_size, activation=hidden_activation)
        x = tf.layers.dense(x, hidden_size, activation=hidden_activation)
        logits = tf.layers.dense(x, output_size, activation=output_activation)
        return distributions.Independent(distributions.Bernoulli(logits), 2)

    def sample_decoder(self, n_samples):
        prior = self.build_prior(self.latent_dim)
        samples = self.build_decoder(prior.sample(n_samples), self.input_size).mean()
        return self.session.run([samples])

    def restore_model(self):
        print("Restoring")
        self.saver = tf.train.import_meta_graph(os.path.join(self.save_dir, "turbolearn.meta"))
        self.saver.restore(self.session, tf.train.latest_checkpoint(self.save_dir))
        self._restored = True
I want to run samples = vae.sample_decoder(5).
In my training routine, I run:
if self.checkpoint:
    self.saver.save(self.session, os.path.join(self.save_dir, "myvae"), write_meta_graph=True)
UPDATE
Based on the suggested answer below, I changed the restore method to:
self.saver = tf.train.Saver()
self.saver.restore(self.session, tf.train.latest_checkpoint(self.save_dir))
But now I get a ValueError when it creates the Saver() object:
ValueError: No variables to save
tf.train.import_meta_graph restores the graph, meaning it rebuilds the network architecture that was stored in the file. The call to tf.train.Saver.restore, on the other hand, only restores the variable values from the file into the current graph in the session (this naturally fails if some of the values in the file belong to variables that do not exist in the currently active graph).
So if you already build the network layers in code, you don't need to call tf.train.import_meta_graph. Calling it anyway might be what is causing your problems.
I am not sure what the rest of your code looks like, but here are some suggestions. First build the graph, then create the session, and finally restore if applicable. Your __init__ might then look like this:
def __init__(self, restore=True):
    self.build_decoder = tf.make_template('decoder', self._build_decoder)
    self.session = tf.Session()
    if restore:
        self.restore_model()
However, if you are only restoring the encoder and building the decoder anew, you might build the decoder last. In that case, don't forget to initialize its variables before use.
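To make the ordering concrete, here is a minimal sketch of "build the graph, create the Saver, open the session, then restore". It assumes TensorFlow 2.x with the tf.compat.v1 API, and the one-layer build_decoder and the run helper are hypothetical stand-ins, not the asker's VAE. Because the Saver is only created after the variables exist, it also avoids the "No variables to save" error.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()


def build_decoder(z, output_size=4):
    # Hypothetical one-layer decoder; only the variable names matter for restoring.
    with tf1.variable_scope("decoder"):
        w = tf1.get_variable("w", shape=[2, output_size])
        b = tf1.get_variable("b", shape=[output_size],
                             initializer=tf1.zeros_initializer())
    return tf.sigmoid(tf.matmul(z, w) + b)


def run(save_dir, restore):
    graph = tf.Graph()
    with graph.as_default():
        z = tf1.placeholder(tf.float32, [None, 2])
        build_decoder(z)                         # 1. build the graph first
        saver = tf1.train.Saver()                # 2. the Saver now sees the variables
        with tf1.Session(graph=graph) as sess:   # 3. create the session
            if restore:                          # 4. restore, or initialize and save
                saver.restore(sess, tf1.train.latest_checkpoint(save_dir))
            else:
                sess.run(tf1.global_variables_initializer())
                saver.save(sess, os.path.join(save_dir, "myvae"))
            weights = {v.op.name: sess.run(v) for v in tf1.global_variables()}
    return weights


save_dir = tempfile.mkdtemp()
trained = run(save_dir, restore=False)
restored = run(save_dir, restore=True)
print(np.allclose(trained["decoder/w"], restored["decoder/w"]))
```

The restore branch never runs an initializer: every variable in the graph gets its value from the checkpoint, which is why the names created by build_decoder have to match between the two runs.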
Related
Sorry about the vague title but I'm not sure exactly how to describe it.
I am currently running tests on a model written in tensorflow.compat.v1. When it is used for inference, it must be restored as follows:
class Model:
    def __init__(self, filepath):
        ...
        self.sess, self.saver = self.setup_tf()
        self.merge = tf.compat.v1.summary.merge([tf.compat.v1.summary.scalar('loss', self.loss)])

    def setup_tf(self):
        sess = tf.compat.v1.Session()  # TF session
        saver = tf.compat.v1.train.Saver(max_to_keep=1)
        latest_snapshot = tf.train.latest_checkpoint(join("../", self.model_dir))
        saver.restore(sess, latest_snapshot)
        return sess, saver
I have 2 of these tests, involving restoring a model and performing some inferences with it. After successful loading and inference of the first model, the second test fails to restore the model completely. The specific error here is
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
class TestInference(unittest.TestCase):
    def setUp(self):
        pass

    def tearDown(self):
        pass

    def test1(self):
        model = Model(filepath)
        model.infer()

    def test2(self):
        model = Model(filepath)
        model.infer()
However, when I run each test individually (commenting one out), there is no problem restoring the model. Even when I swap the order in which the models are loaded, the first one always succeeds and the second one fails.
I figure that I should be using setUp and tearDown for each test, but I'm confused about what exactly I should be 'tearing down'. Is it garbage collection? I have tried gc.collect, but the same error occurs.
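One possible explanation, offered as an assumption since the Model code is only partly shown: both tests build their variables in the same default graph, so the second Model gets uniquified variable names that do not exist in the checkpoint. The sketch below uses a hypothetical build_model stand-in to show the renaming, and how tf.compat.v1.reset_default_graph() (which could be called from setUp or tearDown) avoids it.

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()


def build_model():
    # Hypothetical stand-in for Model.__init__: one variable in the default graph.
    return tf1.Variable(tf.zeros([2]), name="weights")


# Two models in the same default graph: the second variable is renamed.
tf1.reset_default_graph()
first = build_model()
second = build_model()
names_without_reset = [first.op.name, second.op.name]

# Resetting the default graph between builds keeps the names stable.
tf1.reset_default_graph()
first = build_model()
tf1.reset_default_graph()
second = build_model()
names_with_reset = [first.op.name, second.op.name]

print(names_without_reset)
print(names_with_reset)
```

If this is indeed the cause, gc.collect would not help, because the default graph is still referenced by TensorFlow itself and keeps the variables from the first test.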
The testing class is shown above, if it helps.
As PyTorch Lightning provides automatic saving of model checkpoints, I use it to save the top-k best models. Specifically, in the Trainer setup:
checkpoint_callback = ModelCheckpoint(
    monitor='val_acc',
    dirpath='checkpoints/',
    filename='{epoch:02d}-{val_acc:.2f}',
    save_top_k=5,
    mode='max',
)
This works well, but it does not save some attributes of the model object. My model stores a tensor at the end of every training epoch, like this:
class SampleNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(100, 1)
        self.loss = torch.nn.CrossEntropyLoss()
        self.some_data = None  # Initialize as None

    def training_step(self, batch, batch_idx):
        x, t = batch
        out = self.layer(x)
        loss = self.loss(out, t)
        results = {'loss': loss}
        return results

    def training_epoch_end(self, outputs):
        self.some_data = some_tensor_object  # placeholder for a tensor computed during the epoch
This is a simplified example, but I want the checkpoint file made by the checkpoint_callback above to remember the attribute self.some_data. When I load the model from a checkpoint, however, it is always reset to None. I confirmed that it is successfully updated during training.
I tried not initializing it as None in __init__, but then the attribute disappears when loading the model.
Saving the attribute as a separate .pt file is something I want to avoid: it is associated with the model configuration, so I would need to manually match the file with the corresponding checkpoint file later.
Would it be possible to include such a tensor attribute in the checkpoint file?
Simply use the model class hooks on_save_checkpoint() and on_load_checkpoint() for all sorts of objects that you want to save alongside the default attributes.
def on_save_checkpoint(self, checkpoint) -> None:
    "Objects to include in checkpoint file"
    checkpoint["some_data"] = self.some_data

def on_load_checkpoint(self, checkpoint) -> None:
    "Objects to retrieve from checkpoint file"
    self.some_data = checkpoint["some_data"]
See module docs
It doesn't seem to be possible directly, since nn.Module.state_dict() is most likely what is used to extract the parameters.
This method only extracts tensors that are registered on the module as parameters (or persistent buffers); a plain attribute is skipped. So in this case a workaround would be to save your data as a parameter (see docs):
self.some_data = torch.nn.parameter.Parameter(your_data)
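As a rough illustration of why the plain attribute is lost, here is a sketch in plain PyTorch (no Lightning). It uses register_buffer rather than Parameter, which is my substitution and not part of the answer above: a buffer ends up in state_dict() like a parameter does, but is not touched by the optimizer. The attribute plain_attr is a hypothetical name added for contrast.

```python
import torch


class SampleNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(100, 1)
        # Plain tensor attribute: not part of state_dict().
        self.plain_attr = torch.zeros(3)
        # Registered buffer: saved in state_dict(), but not trained.
        self.register_buffer("some_data", torch.zeros(3))


model = SampleNet()
# Assigning a tensor to a registered buffer name updates the buffer.
model.some_data = torch.tensor([1.0, 2.0, 3.0])

state = model.state_dict()
print(sorted(state.keys()))

# A fresh model picks the buffer up again from the state dict.
# Note: load_state_dict expects the same shape as the registered buffer.
restored = SampleNet()
restored.load_state_dict(state)
print(restored.some_data)
```

The shape restriction is the main cost of this approach; if the tensor's shape is not known when the module is constructed, storing it through the checkpoint hooks is likely the simpler route.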
I've been trying to research model/weight saving for a while, but I still can't fully grasp it. I feel what I'd like to do should be simple enough, but I've not found a solution.
The final goal is to do transfer learning with a collection of pretrained networks. I write my models/layers as classes, so class methods for saving and restoring the weights would be ideal.
Example:
If I have a graph, features > A > B > labels, where A and B are sub-networks, I'd like to save and/or restore weights for these sections. Say I already have trained weights for A, but the variable scope is now different: how would I restore the weights I trained for A in a different training session? At the end of training this new graph, I'd like one directory for my new A weights, one directory for my new B weights, and one directory for the full graph (I can handle the full graph bit).
It's very possible I keep overlooking the solution, but model saving is so poorly documented.
Hope I've explained the scenario well.
You can do this with tf.train.init_from_checkpoint
Define your model
def model_fn():
    with tf.variable_scope('One'):
        layer = any_tf_layer
    with tf.variable_scope('Two'):
        layer = any_tf_layer
Output variable names in checkpoint file
vars = [i[0] for i in tf.train.list_variables(ckpt_file)]
Then you can create an assignment map to load only the variables defined in your model.
You can also assign new names to restored variables
map = {variable.op.name: variable for variable in tf.global_variables() if variable.op.name in vars}
This line is placed before the session is created, or outside the model function when using the Estimator API:
tf.train.init_from_checkpoint(ckpt_file, map)
https://www.tensorflow.org/api_docs/python/tf/train/init_from_checkpoint
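For the "variable scope is now different" case, here is a small sketch of a scope-to-scope assignment map. It assumes TensorFlow 2.x with the tf.compat.v1 API; the scope names "A" and "new_model/A" and both helper functions are made up for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()


def save_subnetwork(ckpt_dir):
    # Stand-in for an earlier training session that produced weights under scope "A".
    with tf.Graph().as_default():
        with tf1.variable_scope("A"):
            w = tf1.get_variable("w", shape=[3, 2])
        saver = tf1.train.Saver()
        with tf1.Session() as sess:
            sess.run(tf1.global_variables_initializer())
            saver.save(sess, os.path.join(ckpt_dir, "a_only"))
            return sess.run(w)


def load_into_new_scope(ckpt_dir):
    with tf.Graph().as_default():
        with tf1.variable_scope("new_model/A"):
            w = tf1.get_variable("w", shape=[3, 2])
        # Map checkpoint scope "A/" onto the current scope "new_model/A/".
        # This replaces the variables' initializers, so it must come before
        # the initializer op is run.
        tf1.train.init_from_checkpoint(ckpt_dir, {"A/": "new_model/A/"})
        with tf1.Session() as sess:
            sess.run(tf1.global_variables_initializer())
            return sess.run(w)


ckpt_dir = tempfile.mkdtemp()
saved = save_subnetwork(ckpt_dir)
loaded = load_into_new_scope(ckpt_dir)
print(np.allclose(saved, loaded))
```

Variables outside the mapped scope keep their normal initializers, so the same initializer run also sets up any new layers.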
You also can do it with tf.train.Saver
First you need to know the names of variables
vars_dict = {}
for var_current in tf.global_variables():
    print(var_current)
    print(var_current.op.name)  # this gets only the name
for var_ckpt in tf.train.list_variables(ckpt):
    print(var_ckpt[0])  # this gets only the name
When you know the exact names of all variables, you can assign whatever value you need, provided the variables have the same shape and dtype. So, to build the dictionary:
vars_dict[var_ckpt[0]] = tf.get_variable(var_current.op.name, shape)  # remember to specify shape; you can always get it from var_current
saver = tf.train.Saver(vars_dict)
Take a look at my other answer to a similar question:
How to restore pretrained checkpoint for current model in Tensorflow?
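Putting the Saver variant together for the original question (one directory per sub-network, then restoring A under a different scope), a sketch might look like the following. It assumes TensorFlow 2.x with the tf.compat.v1 API; the scope names, shapes, and helper functions are hypothetical.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()


def scope_vars(scope):
    # Global variables whose names start with the given scope prefix.
    return tf1.get_collection(tf1.GraphKeys.GLOBAL_VARIABLES, scope=scope)


def train_and_save(dir_a, dir_b):
    with tf.Graph().as_default():
        with tf1.variable_scope("A"):
            w_a = tf1.get_variable("w", shape=[3, 2])
        with tf1.variable_scope("B"):
            tf1.get_variable("w", shape=[2, 1])
        saver_a = tf1.train.Saver(scope_vars("A/"))  # one Saver per sub-network
        saver_b = tf1.train.Saver(scope_vars("B/"))
        with tf1.Session() as sess:
            sess.run(tf1.global_variables_initializer())
            saver_a.save(sess, os.path.join(dir_a, "weights"))
            saver_b.save(sess, os.path.join(dir_b, "weights"))
            return sess.run(w_a)


def restore_a_under_new_scope(dir_a):
    ckpt = tf1.train.latest_checkpoint(dir_a)
    with tf.Graph().as_default():
        with tf1.variable_scope("new_graph/A"):
            w_a = tf1.get_variable("w", shape=[3, 2])
        # Keys are names in the checkpoint, values are variables in this graph.
        saver = tf1.train.Saver({"A/w": w_a})
        with tf1.Session() as sess:
            saver.restore(sess, ckpt)
            return sess.run(w_a)


dir_a, dir_b = tempfile.mkdtemp(), tempfile.mkdtemp()
saved = train_and_save(dir_a, dir_b)
names_in_a = [name for name, _ in tf1.train.list_variables(dir_a)]
restored = restore_a_under_new_scope(dir_a)
print(names_in_a, np.allclose(saved, restored))
```

Listing the checkpoint contents first, as the answer suggests, is what tells you which keys to put in the dictionary.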
What is the best way to store a trainer and all necessary components?
1. Storing:
Store a checkpoint of the trainer: use its trainer.save_checkpoint(filename, external_state={}) function.
Additionally, store the model separately: use the z.save(filename) method, which every CNTK operation has. You can also get z = trainer.model.
2. Reloading:
Restore the model: use C.load_model(...). (Don't get confused by the deprecated persist namespace from CNTK 1.)
Get the inputs from the restored model.
Restore the trainer itself: use trainer.restore_from_checkpoint, as e.g. shown here. The problem is that this function already needs a trainer object, which probably has to be initialized in the same way as the trainer used to create the checkpoint!?
How do I now restore the label inputs that go into the error function used by the trainer? In the following code I marked (with **) the variables which I think I have to restore after storing them.
z = C.layers.Dense(.... )
loss = error = C.squared_error(z, **l**)
**trainer** = C.Trainer(**z**, (loss, error), [mylearner], my_tensorboard_writer)
You can restore your trainer, but I actually prefer to just load my model m. The simple reason is that it is much easier to create a whole new trainer, because then you can change all the other parameters of the trainer more easily.
Then you can get the input variable from the loaded model (if your network has only one input):
input_var = m.arguments[0]
then you need the output of your model:
output = m(input_var)
and define the loss function using your target output target_output:
C.squared_error(output, target_output)
Using your model and the loss function, you can recreate your trainer from there, setting the learning rate etc. as you like.
Meaning to say if I have the following operations for training purposes in my graph initially:
with tf.Graph().as_default() as g:
    images, labels = load_batch(...)
    with slim.arg_scope(...):
        logits, end_points = inceptionResnetV2(images, num_classes..., is_training = True)
    loss = slim.losses.softmax_cross_entropy(logits, labels)
    optimizer = tf.train.AdamOptimizer(learning_rate = 0.002)
    train_op = slim.learning.create_train_op(loss, optimizer)
    sv = tf.train.Supervisor(...)
    with sv.managed_session() as sess:
        # perform your regular training loop here with sess.run(train_op)
This allows me to train my model just fine, but I would like to run a small validation dataset that evaluates my model every once in a while inside my sess. Would it take too much memory to build a nearly exact replica within the same graph, like this:
images_val, labels_val = load_batch(...)
with slim.arg_scope(...):
    logits_val, end_points_val = inceptionResnetV2(images_val, num_classes..., is_training = False)
predictions = end_points_val['Predictions']
acc, acc_updates = tf.contrib.metrics.streaming_accuracy(predictions, labels_val)
# and then following this, we can run acc_updates in a session to update the accuracy, which we can then print to monitor
My concern is that to evaluate my validation dataset, I need to set the is_training argument to False so that I can disable dropout. But will creating an entire inception-resnet-v2 model from scratch just for validation inside the same graph consume too much memory? Or should I just create an entirely new file that runs the validation on its own?
Ideally, I wanted to have 3 kinds of dataset: a training one, a small validation dataset to test during training, and a final evaluation dataset. This small validation dataset will help me see if my model is overfitting to the training data. If my proposed idea consumes too much memory, however, would it be equivalent to just occasionally monitoring the training data score? Is there a better way to test the validation set while training?
TensorFlow's developers thought about this and made variables shareable.
You can see the doc here.
Using scopes the right way makes it possible to reuse variables.
One very good example (the context is a language model, but never mind) is the TensorFlow PTB Word LM.
The overall pseudo-code of this approach is something like:
class Model:
    def __init__(self, params, train=True):
        """ Build the model """
        tf.placeholder( ... )
        tf.get_variable( ...)

def main(_):
    with tf.Graph().as_default() as g:
        with tf.name_scope("Train"):
            with tf.variable_scope("Model", reuse=None):
                train = Model(params, train=True)
        with tf.name_scope("Valid"):
            # Now reuse variables = no memory cost
            with tf.variable_scope("Model", reuse=True):
                # But you can set different parameters
                valid = Model(params, train=False)
        session = tf.Session()
        ...
Thus you can share variables without having the exact same model, since the parameters may change the model itself.
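A runnable version of that idea, as a minimal sketch: it assumes TensorFlow 2.x with the tf.compat.v1 API, and the tiny Model class with a single weight matrix is a hypothetical stand-in for a real network. The training and validation models are built under the same variable scope, the second time with reuse=True, so only one set of variables exists.

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()


class Model:
    def __init__(self, inputs, is_training):
        # get_variable either creates "w" or, under reuse=True, returns the existing one.
        w = tf1.get_variable("w", shape=[4, 2])
        hidden = tf.matmul(inputs, w)
        # Dropout is only added to the training branch.
        self.output = tf.nn.dropout(hidden, rate=0.5) if is_training else hidden


graph = tf.Graph()
with graph.as_default():
    train_in = tf1.placeholder(tf.float32, [None, 4])
    valid_in = tf1.placeholder(tf.float32, [None, 4])
    with tf1.name_scope("Train"), tf1.variable_scope("Model", reuse=None):
        train_model = Model(train_in, is_training=True)
    with tf1.name_scope("Valid"), tf1.variable_scope("Model", reuse=True):
        valid_model = Model(valid_in, is_training=False)
    variable_names = [v.name for v in tf1.global_variables()]

print(variable_names)
```

The extra cost of the validation branch is then the additional ops and their activations when that branch is actually run, not a second copy of the weights.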
Hope this helps
pltrdy