Is loading and predicting a model on different threads supported and correct in Tensorflow?
Some background:
When trying to load a model in thread A and then predicting it in thread B, we are given the following error message:
ValueError: Tensor SOME_TENSOR is not an element of this graph.
I've found this TF GitHub thread, proposing to solve it by storing the graph when loading the model and using it as default when predicting. Sort of like this:
# thread A
global graph
graph = tf.get_default_graph()
...
# thread B
with graph.as_default():
    preds = model.predict(image)
I've tried doing that, yet I also got errors due to sessions being different and variables uninitialised:
tensorflow.python.framework.errors_impl.FailedPreconditionError:
Failed precondition: Error while reading resource variable lstm_2_3/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/lstm_2_3/bias)
I started by fixing it with keras.backend.get_session().run(tf.compat.v1.global_variables_initializer()), yet that didn't quite work and produced wrong predictions. Instead, I decided to treat the session the same way as the graph and pass it over from where I load the model.
So the solution I have is as follows:
# thread A
global graph
global sess
graph = tf.get_default_graph()
sess = K.get_session()
...
# thread B
with graph.as_default():
    try:
        preds = model.predict(image)
    except FailedPreconditionError:
        K.set_session(sess)
        preds = model.predict(image)
Not gonna lie, it feels hacky. Is this the right way to handle model loading / prediction on separate threads? Is there anything wrong with this approach?
It seems that Keras/TensorFlow simply isn't thread-safe; however, it is possible to make it work in this case. This is hardly the correct way to fix the issue, but what helped me was changing the way I load the model.
def load_threadsafe():
    model = load()  # your usual model loading
    model._make_predict_function()
    return model
Note the call to the protected _make_predict_function method, which is still hacky. The solution was shared by @fgerard on this issue. As far as I can tell, _make_predict_function is called internally on the first prediction, and calling it from a non-main thread causes issues. Hence the solution is to call it explicitly on the thread that loads the model, before any predictions are made.
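To make the whole flow concrete, here is a minimal sketch of loading on one thread and predicting on another; it assumes a TF1.x/Keras setup, a hypothetical load() helper for your usual model loading, and an image array that is already prepared:
import threading
import tensorflow as tf
from keras import backend as K

def load_threadsafe():
    model = load()                  # your usual model loading (hypothetical helper)
    model._make_predict_function()  # build the predict function on the loading thread
    return model

# Thread A (here: the main thread) loads the model and captures graph + session.
model = load_threadsafe()
graph = tf.get_default_graph()
sess = K.get_session()

def predict_in_thread(image, results):
    # Thread B reuses the captured graph and session for prediction.
    with graph.as_default():
        K.set_session(sess)
        results.append(model.predict(image))

results = []
t = threading.Thread(target=predict_in_thread, args=(image, results))
t.start()
t.join()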
Related
I need to have a frozen graph (GraphDef file) while using TensorFlow 2.X.
That is because I use a tool which expects a frozen graph; however, my training needed to be done with TF2.X and Keras.
I tried many different ways to save my TF2 model. The variant with which I was able to get the most useful formats is the following:
sess = tf.compat.v1.Session()
saver = tf.compat.v1.train.Saver(var_list=cnn.trainable_variables)
save_path = saver.save(sess, os.path.join(CHKPT_DIR, CHKPT_FILE))
tf.compat.v1.train.write_graph(sess.graph_def, CHKPT_DIR, TRAIN_GRAPH, as_text=False)
That way I was able to get the following files:
float_model.ckpt.data-00000-of-00001
float_model.ckpt.index
checkpoint
training_model.pb
Of these files I need the *.ckpt files and training_model.pb to freeze my model. However, when using the freeze_graph.sh script (with TF1.X, in a different virtual environment), it throws the error
ValueError: No variables to save
This happens even though I pass the variables as a list via var_list=cnn.trainable_variables, and cnn.trainable_variables is not empty and seems to contain all the variables used by my model.
Thus, I tried using the following method, according to TF2.X standards (assuming cnn is my model):
cnn.save(CHKPT_PATH)
checkpoint = tf.train.Checkpoint(cnn)
save_path = checkpoint.save(CHKPT_PATH)
Here I get the following files:
float_model.ckpt-1.data-00000-of-00001
float_model.ckpt-1.index
checkpoint
floating_model.ckpt/keras_metadata.pb
floating_model.ckpt/saved_model.pb
floating_model.ckpt/assets
floating_model.ckpt/variables
But here is where I get confused. Is there some kind of frozen graph available already, or some kind of equivalent in here? And if not, how can I get one with TF2.X, if possible? I found the sentence
The .save() method is already saving a *.pb ready for inference.
in this post. So the frozen graph is ready for inference, and thus one of these files must be equivalent to a frozen graph, right?
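In case it helps, one widely used way to turn a TF2/Keras model into an actual frozen GraphDef (not from the original post; a sketch assuming cnn is the Keras model, reusing CHKPT_DIR from above, with frozen_model.pb as an arbitrary output name) is convert_variables_to_constants_v2:
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

# Wrap the Keras model in a concrete function with a fixed input signature.
full_model = tf.function(lambda x: cnn(x))
concrete_func = full_model.get_concrete_function(
    tf.TensorSpec(cnn.inputs[0].shape, cnn.inputs[0].dtype))

# Fold the variables into constants, yielding a frozen graph.
frozen_func = convert_variables_to_constants_v2(concrete_func)
tf.io.write_graph(frozen_func.graph.as_graph_def(),
                  CHKPT_DIR, "frozen_model.pb", as_text=False)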
I am kinda new to the TensorFlow world but have written some programs in Keras. Since TensorFlow 2 officially integrates Keras, I am quite confused about the difference between tf.keras.callbacks.ModelCheckpoint and tf.train.Checkpoint. If anybody can shed light on this, I would appreciate it.
It depends on whether a custom training loop is required. In most cases, it's not and you can just call model.fit() and pass tf.keras.callbacks.ModelCheckpoint. If you do need to write your custom training loop, then you have to use tf.train.Checkpoint (and tf.train.CheckpointManager) since there's no callback mechanism.
TensorFlow is a 'computation' library and Keras is a deep learning library which can work with TF or PyTorch, etc. So what TF provides is a more generic, not-so-customized-for-deep-learning version. If you just compare the docs you can see how much more comprehensive and customized ModelCheckpoint is. Checkpoint just reads and writes stuff from/to disk. ModelCheckpoint is much smarter!
Also, ModelCheckpoint is a callback. It means you can just make an instance of it and pass it to the fit function:
model_checkpoint = ModelCheckpoint(...)
model.fit(..., callbacks=[..., model_checkpoint, ...], ...)
I took a quick look at Keras's implementation of ModelCheckpoint: it calls either the save or save_weights method on Model, which in some cases uses TensorFlow's Checkpoint itself. So it is not a wrapper per se, but it certainly sits at a lower level of abstraction -- more specialized for saving Keras models.
I also had a hard time differentiating between the checkpoint objects when I looked at other people's code, so I wrote down some notes about when to use which one and how to use them in general.
Either way, I think it might be useful for other people having the same issue:
Saving model Checkpoints
These are two ways of saving your model's checkpoints; each is for a different use case:
1) Checkpoint & CheckpointManager
This is useful when you are managing the training loop yourself.
You use them like this:
1.1) Checkpoint
Definition from the docs:
"A Checkpoint object can be constructed to save either a single or group of trackable objects to a checkpoint file".
How to initialise it:
You can pass it key-value pairs for:
All the custom objects that make up your model and that you want to keep track of,
like a generator, discriminator, loss function, optimizer, etc.
ckpt = Checkpoint(discr_opt=discr_opt,
                  genrt_opt=genrt_opt,
                  wgan=wgan,
                  d_model=d_model,
                  g_model=g_model)
1.2) CheckpointManager
This literally manages the checkpoints you have defined to be stored at a location, including things like how many of them to keep.
Definition from the docs:
"Manages multiple checkpoints by keeping some and deleting unneeded ones"
How to initialise it:
Initialise it with the Checkpoint object you created as the first argument.
Pass the directory where the checkpoint files should be saved.
You probably also want to define how many checkpoints to keep, since these can be large for complex models.
manager = CheckpointManager(ckpt, "training_checkpoints_wgan", max_to_keep=3)
How to use it:
We have set up the manager object with our specified checkpoints, so it's ready to use.
Call this at the end of each training epoch:
manager.save()
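Putting 1.1 and 1.2 together, a minimal custom-loop sketch (epochs, dataset and train_step are placeholders, not from the snippets above):
# Restore the newest checkpoint if one exists, otherwise start from scratch.
ckpt.restore(manager.latest_checkpoint)
if manager.latest_checkpoint:
    print("Restored from", manager.latest_checkpoint)
else:
    print("Initializing from scratch")

for epoch in range(epochs):
    for batch in dataset:
        train_step(batch)   # your own training step
    manager.save()          # one checkpoint per epoch, keeping only the last 3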
2) ModelCheckpoint (callback)
You want to use this callback when you are not managing the epoch iterations yourself. For example, when you have set up a relatively simple Sequential model and you call model.fit(), which manages the training process for you.
Definition from the docs:
"Callback to save the Keras model or model weights at some frequency."
How to initialise it:
Pass in the path where the model should be saved.
The option save_weights_only is set to False by default:
If you want to save only the weights, make sure to update this.
The option save_best_only is set to False by default:
If you want to save only the best model instead of all of them, set this to True.
verbose is set to 0 (False); set it to 1 to get a log message whenever a checkpoint is saved.
mc = ModelCheckpoint("training_checkpoints/cp.ckpt", save_best_only=True, save_weights_only=False)
How to use it:
The model checkpoint callback is now ready for training.
Pass the object into your callbacks list when you fit the model:
model.fit(X, y, epochs=100, callbacks=[mc])
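Afterwards the saved model can be loaded back; a small sketch, assuming TF 2.x so that the path above holds a full SavedModel (because save_weights_only=False):
import tensorflow as tf

# Reload the full model that ModelCheckpoint wrote during training.
restored = tf.keras.models.load_model("training_checkpoints/cp.ckpt")
preds = restored.predict(X)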
I want to load a TensorFlow model (checkpoint) and use it in a while loop.
Loading the model takes some time, so I want to do that before the while loop.
If I use:
with tf.Graph().as_default():
    with tf.Session() as sess:
        print("loading checkpoint ...")
        saver = tf.train.import_meta_graph(str(modelpath / 'mfn.ckpt.meta'))
        saver.restore(sess, str(modelpath / 'mfn.ckpt'))
        while True:
            ...
the problem is that the session is closed after the end of the while loop.
I have now seen this post about a similar problem.
The answer seemed to be to use TensorFlow Serving. Unfortunately, that requires the model to be in the SavedModel format. I do not have a SavedModel, only the checkpoints.
I tried saving the loaded checkpoint with the tf.saved_model.builder.SavedModelBuilder()
but ran into some issues. I made a post about those issues separately here.
Is there another way of running a loaded model (as in the code above) outside of a session?
From your illustration code, I guess you're using TensorFlow 1.x (with tf.Graph, tf.Session, ...). Is that right?
So, about your question: "Is there another way of running a loaded model (as in the code above) outside of a session?",
I have a suggestion: have you ever tried to convert your code to TensorFlow version 2.x?
If you can do this, after that:
You can easily save-load your TF model using tf.saved_model.save() and tf.saved_model.load() methods,
Then, use the loaded model easily in a while loop.
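For illustration, a rough sketch of that TF2 workflow; model, the export directory and get_next_image are placeholders, and the serving_default signature assumes the saved object is a Keras model (or a module exported with a serving signature):
import tensorflow as tf

# One-time export after training (or after converting the TF1 checkpoint).
tf.saved_model.save(model, "export_dir")

# Load once, before the loop; TF2 needs no explicit session handling.
loaded = tf.saved_model.load("export_dir")
infer = loaded.signatures["serving_default"]

while True:
    image = get_next_image()             # placeholder for your input source
    outputs = infer(tf.constant(image))  # dict of output tensors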
I'm using models I didn't create but modified (from this repo https://github.com/GeorgeSeif/Semantic-Segmentation-Suite)
I have trained models and can use them to predict well enough, but I want to run entire folders of images through and split the work between multiple GPUs. I don't fully understand how tf.device() works, and what I have tried didn't work at all.
I assumed I could do something like so:
for i, d in enumerate(['/gpu:0', '/gpu:1']):
    with tf.device(d):
        output = sess.run(network, feed_dict={net_input: image_batch[i]})
But this doesn't actually allocate the tasks to the different GPUs, and it doesn't raise an error either.
My question is: is it possible to allocate different images to different instances of the session on separate GPUs without explicitly modifying the network code before training? I would like to avoid running two different Python scripts with CUDA_VISIBLE_DEVICES=...
Is there a simple way to do this?
From what I understand, the definitions of the operations have to be nested in a "with tf.device()" block; however, when inferencing, the only operation is loading the model and weights, and if I put that in a "with tf.device()" block I get an error saying the graph already exists and cannot be defined twice.
tf.device only applies when building the graph, not executing it, so wrapping session.run calls in a device context does nothing.
Instead, I recommend you use TF Replicator or a TF distribution strategy (tf.distribute / tf.contrib.distribute, depending on the TF version), specifically MirroredStrategy.
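For reference, a rough TF2-style sketch of the MirroredStrategy suggestion (build_model, images and batch_size are placeholders, and the TF1.x tf.contrib.distribute API differs in the details):
import tensorflow as tf

# MirroredStrategy replicates the model on all visible GPUs and splits each batch.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()                # stand-in for building/loading the network

dataset = tf.data.Dataset.from_tensor_slices(images).batch(batch_size)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def predict_step(batch):
    return model(batch, training=False)

for batch in dist_dataset:
    per_replica_preds = strategy.run(predict_step, args=(batch,))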
I'm training a model where the input vector is the output of another model. This involves restoring the first model from a checkpoint file while initializing the second model from scratch (using tf.initialize_variables()) in the same process.
There is a substantial amount of code and abstraction, so I'm just pasting the relevant sections here.
The following is the restoring code:
self.variables = [var for var in all_vars if var.name.startswith(self.name)]
self.saver = tf.train.Saver(self.variables, max_to_keep=3)
self.save_path = tf.train.latest_checkpoint(os.path.dirname(self.checkpoint_path))
if should_restore:
    self.saver.restore(self.sess, self.save_path)
else:
    self.sess.run(tf.initialize_variables(self.variables))
Each model is scoped within its own graph and session, like this:
self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)
with self.sess.graph.as_default():
    # Create variables and ops.
All the variables within each model are created within the variable_scope context manager.
The feeding works as follows:
A background thread calls sess.run(inference_op) on input = scipy.misc.imread(X) and puts the result in a blocking thread-safe queue.
The main training loop reads from the queue and calls sess.run(train_op) on the second model.
PROBLEM:
I am observing that the loss values, even in the very first iteration of training the second model, keep changing drastically across runs (and become nan within a few iterations). I confirmed that the output of the first model is exactly the same every time. Commenting out the sess.run of the first model and replacing it with identical input from a pickled file does not show this behaviour.
This is the train_op:
loss_op = tf.nn.sparse_softmax_cross_entropy(network.feedforward())
# Apply gradients.
with tf.control_dependencies([loss_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(loss_op)
    apply_gradient_op = opt.apply_gradients(grads)
return apply_gradient_op
I know this is vague, but I'm happy to provide more details. Any help is appreciated!
The issue is most certainly happening due to concurrent execution of different session objects. I moved the first model's session from the background thread to the main thread, repeated the controlled experiment several times (running for over 24 hours and reaching convergence) and never observed NaN. On the other hand, concurrent execution diverges the model within a few minutes.
I've restructured my code to use a common session object for all models.
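For reference, a rough sketch of that restructuring (build_model_a, build_model_b and checkpoint_dir are placeholders): both models live in one graph, driven by a single session, and only the first model's variables are restored:
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    with tf.variable_scope("model_a"):
        inference_op = build_model_a()   # the pretrained model producing the inputs
    with tf.variable_scope("model_b"):
        train_op = build_model_b()       # the model being trained from scratch

    model_a_vars = [v for v in tf.global_variables() if v.name.startswith("model_a")]
    model_b_vars = [v for v in tf.global_variables() if v.name.startswith("model_b")]
    saver = tf.train.Saver(model_a_vars, max_to_keep=3)
    init_b = tf.variables_initializer(model_b_vars)

sess = tf.Session(graph=graph)
saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))  # restore model A
sess.run(init_b)                                                 # init model B from scratch

# Both the feeding thread and the training loop now share this single `sess`.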