Basically I have two models to run in sequence. However, the first is an object-based model trained in TF2, while the second was trained in TF1.x and saved as a name-based ckpt.
The fundamental conflict here is that in tf.compat.v1 mode I have to call disable_eager_execution() to run the model, while the other model needs eager execution (otherwise it's ~2.5 times slower).
I tried to find a way to convert the TF1 ckpt to an object-based TF2 model, but I don't think there's an easy way... maybe I have to rebuild the model and copy the weights variable by variable (nightmare).
So does anyone know if there's a way to just temporarily turn off eager execution? That would solve everything here... I would really appreciate it!
I regretfully have to inform you that, in my experience, this is not possible. I had the same issue. I believe the TensorFlow documentation actually states that once eager execution is disabled it stays disabled for the remainder of the session; you cannot turn it back on even if you try. This applies any time you disable eager execution, and the state persists as long as the TensorFlow module is loaded in a particular Python instance.
My suggestion for transferring the model weights and biases is to dump them to a pickle file as numpy arrays. I know it's possible because the TensorFlow 1.X model I was using did this in its code (I didn't write that model). I ended up loading that pickle file and reconstructing a new TensorFlow 2.X model via a for loop. This works well for sequential models. If any branching occurs, the looping method won't work super well, or it will be hard to implement successfully.
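For the dump step, one route that avoids rebuilding the TF1 graph entirely is to read the name-based checkpoint with TF's checkpoint reader; a rough sketch (the path is illustrative):

    import pickle
    import tensorflow as tf

    # Read the name-based TF1 checkpoint directly; no graph, no session,
    # no need to touch eager execution (the checkpoint path is illustrative).
    ckpt_path = "path/to/tf1_model.ckpt"
    reader = tf.train.load_checkpoint(ckpt_path)
    weights = {name: reader.get_tensor(name)
               for name, _ in tf.train.list_variables(ckpt_path)}

    # Dump plain numpy arrays keyed by variable name.
    with open("weights.pkl", "wb") as f:
        pickle.dump(weights, f)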
As a heads up, unless you want to train the model further, the best way to initialize those weights is to use tf.constant_initializer (or something along those lines). When I converted the model to TensorFlow 2.X, I ended up creating a custom initializer, but apparently you can just use a regular initializer and then set the weights and biases via model or layer attributes or methods.
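A rough sketch of both routes on the TF2 side (the layer sizes and the "fc1/..." keys are made up and just have to match whatever was dumped):

    import pickle
    import tensorflow as tf

    with open("weights.pkl", "rb") as f:
        weights = pickle.load(f)  # numpy arrays keyed by whatever names you dumped

    # Option 1: bake the values in at construction time via constant initializers.
    dense1 = tf.keras.layers.Dense(
        128,
        kernel_initializer=tf.constant_initializer(weights["fc1/kernel"]),
        bias_initializer=tf.constant_initializer(weights["fc1/bias"]),
    )

    # Option 2: build the layer first, then overwrite its weights directly.
    dense2 = tf.keras.layers.Dense(128)
    dense2.build((None, 64))  # input dim must match the saved kernel's first dimension
    dense2.set_weights([weights["fc2/kernel"], weights["fc2/bias"]])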
I ultimately had to convert the Tensorflow 1.x + compat code to Tensorflow 2.X, so I could train the models natively in Tensorflow 2.X.
I wish I could offer better news and information, but this was my experience and solution to the same problem.
I would like to integrate an attentional component into the LSTM model I'm creating. Unfortunately, with the TensorFlow 2.3.1 I'm using, it appears that if you subclass LSTMCell, you have to run the model on the CPU. From the TensorFlow documentation:
CuDNN is only available at the layer level, and not at the cell level.
Which means I'm relegated to the CPU if I try something like this:
output = keras.layers.RNN(AttentionLSTMCell(400), return_sequences=True, stateful=False)(input_layer)
Where AttentionLSTMCell is a custom class I made that takes in some additional constants (generally an output of the previous timestep and some new input) that condition the output of the LSTM. In fact, the documentation seems to suggest that only certain kinds of conditioning are allowed. I am about to dig into creating a full custom Layer (perhaps copy the existing one and see if I can add my new inputs in call), but is there a better way? It makes prototyping quite difficult. Large recurrent networks are slow to train, especially in my case where I integrate image data as input.
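To spell out the trade-off (shapes are made up): the layer-level LSTM is eligible for the fused cuDNN kernel, while anything wrapped in keras.layers.RNN, even the stock LSTMCell, falls back to the generic implementation, which is the bucket a custom cell lands in:

    import tensorflow as tf
    from tensorflow import keras

    input_layer = keras.Input(shape=(None, 64))  # (timesteps, features) are illustrative

    # Layer level: can dispatch to the cuDNN kernel (with default activations etc.).
    fast = keras.layers.LSTM(400, return_sequences=True)(input_layer)

    # Cell level: always the generic implementation, never cuDNN.
    slow = keras.layers.RNN(keras.layers.LSTMCell(400), return_sequences=True)(input_layer)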
I'm using models I didn't create but modified (from this repo https://github.com/GeorgeSeif/Semantic-Segmentation-Suite)
I have trained models and can use them to predict well enough, but I want to run entire folders of images through and split the work between multiple GPUs. I don't fully understand how tf.device() works, and what I have tried didn't work at all.
I assumed I could do something like so:
for i, d in enumerate(['/gpu:0', '/gpu:1']):
    with tf.device(d):
        output = sess.run(network, feed_dict={net_input: image_batch[i]})
But this doesn't actually allocate the tasks to the different GPUs; it doesn't raise an error either.
My question is: is it possible to allocate different images to different instances of the session on separate GPUs without explicitly modifying the network code before training? I would like to avoid running two different Python scripts with CUDA_VISIBLE_DEVICES = ...
Is there a simple way to do this?
From what I understand, the definitions of the operations have to be nested in a "with tf.device()" block. However, when inferencing, the only operation is loading the model and weights, and if I put that in a "with tf.device()" block I get an error saying the graph already exists and cannot be defined twice.
tf.device only applies when building the graph, not executing it, so wrapping session.run calls in a device context does nothing.
Instead, I recommend you use TF Replicator or a TF distribution strategy (tf.distribute / tf.contrib.distribute depending on the TF version), specifically MirroredStrategy.
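In TF2-style code (rather than the session/feed_dict setup above), a minimal MirroredStrategy sketch for batched inference looks roughly like this; the tiny model and the zero-filled dataset are just placeholders for the real segmentation network and image folder:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # picks up all visible GPUs
    print("Number of replicas:", strategy.num_replicas_in_sync)

    with strategy.scope():
        # Placeholder model; build or load the real network here instead.
        model = tf.keras.Sequential(
            [tf.keras.layers.Conv2D(8, 3, padding="same", input_shape=(256, 256, 3))])

    # Placeholder dataset; each global batch gets split across the replicas.
    dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([16, 256, 256, 3])).batch(8)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    @tf.function
    def predict_step(images):
        return model(images, training=False)

    for batch in dist_dataset:
        per_replica_out = strategy.run(predict_step, args=(batch,))
        # One result per GPU; strategy.experimental_local_results(per_replica_out)
        # returns them as a tuple of tensors.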
I have a standard pipeline that evaluates the model after training an epoch. I need resnet50 to be finetunable while training, so I instantiate like so:
resnet50_module = hub.Module("https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/1",
                             trainable=True, name="resnet50_finetunable", tags={"train"})
However, I read here that I should unset the tags when evaluating.
I realize that I can save the model, close the session, reset the graph, rebuild the model with tags=None, and load the weights from a checkpoint to do the eval. This seems very wasteful, especially since the model is huge due to resnet50, and I need to run hundreds of epochs to get good results. Is there a way to alternate between tags without this?
Thanks!
I'm afraid there is no good way to do this without going through a checkpoint.
Variables are created when hub.Module() is called, so they are tied to a particular graph version (tags={"train"} for training or the empty tag set for inference). What you describe could be read as a feature request to set that separately for each application of the module, but that doesn't exist yet (and has some ramifications).
Is checkpointing to local disk really that expensive compared to the eval you want to run? Wouldn't you want to checkpoint at times anyways, to allow resuming after a crash?
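For what it's worth, the round trip you describe is roughly this (TF1-style; the placeholder shape, the paths, and the elided training/eval code are illustrative):

    import tensorflow as tf
    import tensorflow_hub as hub

    MODULE_URL = "https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/1"
    CKPT_PATH = "/tmp/resnet50_finetune.ckpt"  # illustrative path

    def build(training):
        tags = {"train"} if training else None
        module = hub.Module(MODULE_URL, trainable=training,
                            name="resnet50_finetunable", tags=tags)
        images = tf.placeholder(tf.float32, [None, 224, 224, 3])
        features = module(images)
        # ... the rest of the model / loss goes on top of `features` ...
        return images, features

    # Training graph, tags={"train"}.
    images, features = build(training=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... train one epoch ...
        saver.save(sess, CKPT_PATH)

    # Fresh graph with the inference tag set, restored from the checkpoint.
    tf.reset_default_graph()
    images, features = build(training=False)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, CKPT_PATH)
        # ... run the eval ...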
Is it possible to share TensorFlow checkpoint files with other users (platform- and CPU/GPU-independent)? I have shared a TensorFlow implementation of DeconvNet and now I want to provide the trained weights. Can I simply upload the saved model, or is there another TF way? I'm asking because I read a tutorial where the weights were stored using numpy.savetxt and then restored during the weight initialization. But this method was used for the MNIST example, which uses a very small net...
Thanks!
You could save metagraph + provide code to restore and run your model --
http://tensorflow.org/how_tos/meta_graph
One downside of this is that it doesn't provide annotations of which tensors to feed/fetch, so you need to provide some code showing how to use it.
SavedModel is the next iteration of the TensorFlow checkpoint format, which takes care of that, but it doesn't have much documentation yet.
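A rough sketch of that round trip (old-style API; the paths, the stand-in variable, and the tensor name are illustrative):

    import tensorflow as tf

    # Author's side: a stand-in variable for the real DeconvNet weights.
    weights = tf.get_variable("weights", shape=[3, 3, 64, 64])
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, "/tmp/deconvnet")  # writes /tmp/deconvnet.meta plus the weights

    # User's side: rebuild the graph from the .meta file, then restore the weights.
    tf.reset_default_graph()
    with tf.Session() as sess:
        saver = tf.train.import_meta_graph("/tmp/deconvnet.meta")
        saver.restore(sess, "/tmp/deconvnet")
        # The missing piece: the user still needs to know which tensors to feed/fetch,
        # e.g. by name:
        # inputs = tf.get_default_graph().get_tensor_by_name("inputs:0")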
I use pickle, in binary mode, to dump and load big numpy matrices, and it works quite well.
I am using Tensorflow + Python.
I am curious if I can release a saved TensorFlow model (architecture + trained variables) without detailed source code. I'm aware of tf.train.Saver(), but it appears to save only variables, and in order to restore/run them, a user needs to "define" the same architecture.
For the testing/running purpose only, is there a way to release a saved {architecture+trained variables} without source code, so that a user can just cast a query and get a result?
The TensorFlow Serving project is intended to make this use case straightforward (assuming that the end user is only using the model for inference, not training). TensorFlow Serving includes an Exporter class that takes your tf.train.Saver, the tf.GraphDef that defines your overall model, and a "signature" that describes the inputs to and output from your model.
The basics tutorial has a good introduction to exporting your model.
You can build a Saver from the MetaGraphDef (saved with checkpoints by default: those .meta files) and then use that Saver to restore your model, so users don't have to re-define your graph in their code. But they still need to figure out the model signature (input and output variables). I solve this using tf.Collection (but I am interested in finding better ways to do it as well).
You can take a look at my example implementation (the eval.py evaluates a model without re-defining it):
reconstruct saver from meta graph https://github.com/falcondai/cifar10/blob/master/eval.py#L18
get input variables from collections https://github.com/falcondai/cifar10/blob/master/eval.py#L58
how to define your model https://github.com/falcondai/cifar10/blob/master/models/cp2f3d.py
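Roughly, the tf.Collection trick looks like this (a sketch, not the code from the repo above; the stand-in model and names are made up):

    import numpy as np
    import tensorflow as tf

    # Export side: tag the feed/fetch tensors so users can find them after restoring.
    inputs = tf.placeholder(tf.float32, [None, 32, 32, 3], name="inputs")
    flat = tf.reshape(inputs, [-1, 32 * 32 * 3])
    w = tf.get_variable("w", [32 * 32 * 3, 10])
    logits = tf.matmul(flat, w)  # stand-in for the real model
    tf.add_to_collection("inputs", inputs)
    tf.add_to_collection("outputs", logits)

    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, "/tmp/cifar_model")

    # Import side: no model definition needed, just the .meta file and the collections.
    tf.reset_default_graph()
    with tf.Session() as sess:
        saver = tf.train.import_meta_graph("/tmp/cifar_model.meta")
        saver.restore(sess, "/tmp/cifar_model")
        inputs = tf.get_collection("inputs")[0]
        outputs = tf.get_collection("outputs")[0]
        preds = sess.run(outputs, feed_dict={inputs: np.zeros((1, 32, 32, 3))})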