Tensorflow graph with multiple inputs without tf.placeholder for validation - python

I use the tf.data API for my models. For now I define the output of the tf.data iterator as input for my network. After I got rid of the feed_dict method, my performance increased significantly.
Now I want to implement a validation run that happens at least once after each training epoch. Is there a way to implement a validation run for tf.data, or do I have to set a placeholder, manually switch the tf.datasets, and use feed_dicts again? What is the recommended way to run validation?

The hack-ish way - node replacement
The most trivial way, although definitely not the most beautiful, would just be to use the node created by the tf.data API as a key in the feed_dict - this works because in TensorFlow you can replace the value of any node in the computation graph by feeding its value directly through the feed_dict.
So this would be something like
batch_input = tf_train_data_foo()
validation_input = tf_validation_data_foo()
model = build_model(batch_input)
optimization_step = some_optimization_foo(model)
# Regular train
session.run(optimization_step)
# Validation run
validation_data = session.run(validation_input)
session.run(model, {batch_input: validation_data})
The better way - variable reuse
If all the construction uses tf.get_variable instead of creating new variables, and the scopes are all set to enable obtaining an existing variable, you can just call your build_model function twice - once with the train data (from tf.data) and once with your validation data. You can see more details on variable re-use in this answer.
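A minimal sketch of that pattern (assuming build_model creates all of its variables via tf.get_variable, and reusing the batch_input / validation_input nodes from the snippet above):
with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
    train_output = build_model(batch_input)             # first call creates the variables
with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
    validation_output = build_model(validation_input)   # second call reuses the same weights

# both graphs share parameters, so this evaluates the current weights on the validation data
session.run(validation_output)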

Related

Validate on entire validation set when using ddp backend with PyTorch Lightning

I'm training an image classification model with PyTorch Lightning and running on a machine with more than one GPU, so I use the recommended distributed backend for best performance, ddp (DistributedDataParallel). This naturally splits up the dataset, so each GPU will only ever see one part of the data.
However, for validation, I would like to compute metrics like accuracy on the entire validation set and not just on a part. How would I do that? I found some hints in the official documentation, but they do not work as expected or are confusing to me. What's happening is that validation_epoch_end is called num_gpus times with 1/num_gpus of the validation data each. I would like to aggregate all results and only run the validation_epoch_end once.
In this section they state that when using dp/ddp2 you can add an additional function, called like this:
def validation_step(self, batch, batch_idx):
    loss, x, y, y_hat = self.step(batch)
    return {"val_loss": loss, 'y': y, 'y_hat': y_hat}

def validation_step_end(self, *args, **kwargs):
    # do something here, I'm not sure what,
    # as it gets called in ddp directly after validation_step with the exact same values
    return args[0]
However, the results are not being aggregated and validation_epoch_end is still called num_gpu times. Is this kind of behavior not available for ddp? Is there some other way to achieve this aggregation behavior?
training_epoch_end() and validation_epoch_end() receive data that is aggregated from all training / validation batches of the particular process. They simply receive a list of what you returned in each training or validation step.
When using the DDP backend, there's a separate process running for every GPU. There's no simple way to access the data that another process is processing, but there's a mechanism for synchronizing a particular tensor between the processes.
The easiest approach for computing a metric on the entire validation set is to calculate the metric in pieces and then synchronize the resulting tensor, for example by taking the average. self.log() calls will automatically synchronize the value between GPUs when you use sync_dist=True. How the value is synchronized is determined by the reduce_fx argument, which by default is torch.mean.
If you're happy with averaging the metric over batches too, you don't need to override training_epoch_end() or validation_epoch_end() — self.log() will do the averaging for you.
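A minimal sketch of that approach (the metric computation and names here are just placeholders, reusing the self.step helper from the question):
def validation_step(self, batch, batch_idx):
    loss, x, y, y_hat = self.step(batch)
    acc = (y_hat.argmax(dim=-1) == y).float().mean()
    # sync_dist=True averages each value across all DDP processes;
    # reduce_fx (default: torch.mean) controls how they are combined
    self.log("val_loss", loss, sync_dist=True)
    self.log("val_acc", acc, sync_dist=True)
    return loss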
If the metric cannot be calculated separately for each GPU and then averaged, it can get a bit more challenging. It's possible to update some state variables at each step, and then synchronize the state variables at the end of an epoch and calculate the metric. The recommended way is to create a class that derives from the Metric class from the TorchMetrics project. Add the state variables in the constructor using add_state() and override the update() and compute() methods. The API will take care of synchronizing the state variables between the GPU processes.
There's already an accuracy metric in TorchMetrics and the source code is a good example of how to use the API.
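As a sketch of the stateful route described above (the class is illustrative, not the library's built-in accuracy):
import torch
from torchmetrics import Metric

class MyAccuracy(Metric):
    def __init__(self):
        super().__init__()
        # state variables; TorchMetrics synchronizes them across processes using dist_reduce_fx
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds, target):
        preds = preds.argmax(dim=-1)
        self.correct += (preds == target).sum()
        self.total += target.numel()

    def compute(self):
        return self.correct.float() / self.total
You would instantiate it in the LightningModule's __init__, call it in validation_step and log it; the synchronization of correct and total between GPU processes is handled for you.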
I think you are looking for training_step_end/validation_step_end.
...So, when Lightning calls any of the training_step, validation_step, test_step you will only be operating on one of those pieces. (...) For most metrics, this doesn’t really matter. However, if you want to add something to your computational graph (like softmax) using all batch parts you can use the training_step_end step.

tf.keras.callbacks.ModelCheckpoint vs tf.train.Checkpoint

I am kinda new to TensorFlow world but have written some programs in Keras. Since TensorFlow 2 is officially similar to Keras, I am quite confused about what is the difference between tf.keras.callbacks.ModelCheckpoint and tf.train.Checkpoint. If anybody can shed light on this, I would appreciate it.
It depends on whether a custom training loop is required. In most cases, it's not and you can just call model.fit() and pass tf.keras.callbacks.ModelCheckpoint. If you do need to write your custom training loop, then you have to use tf.train.Checkpoint (and tf.train.CheckpointManager) since there's no callback mechanism.
TensorFlow is a 'computation' library and Keras is a Deep Learning library that can work with TF, PyTorch, etc. So what TF provides is a more generic, not-so-customized-for-deep-learning version. If you just compare the docs you can see how much more comprehensive and customized ModelCheckpoint is. Checkpoint just reads and writes stuff from/to disk; ModelCheckpoint is much smarter!
Also, ModelCheckpoint is a callback. It means you can just make an instance of it and pass it to the fit function:
model_checkpoint = ModelCheckpoint(...)
model.fit(..., callbacks=[..., model_checkpoint, ...], ...)
I took a quick look at Keras's implementation of ModelCheckpoint: it calls either the save or save_weights method on Model, which in some cases uses TensorFlow's Checkpoint itself. So it is not a wrapper per se, but it certainly sits at a lower level of abstraction -- more specialized for saving Keras models.
I also had a hard time differentiating between the checkpoint objects used when I looked at other people's code, so I wrote down some notes about when to use which one and how to use them in general.
Either-way, I think it might be useful for other people having the same issue:
Saving model Checkpoints
These are 2 ways of saving your model's checkpoints, each is for a different use case:
1) Checkpoint & CheckpointManager
This is useful when you are managing the training loop yourself.
You use them like this:
1.1) Checkpoint
Definition from the docs:
"A Checkpoint object can be constructed to save either a single or group of trackable objects to a checkpoint file".
How to initialise it:
You can pass it key-value pairs for:
All the custom function calls or objects that make up your model and that you want to keep track of:
Like a generator, discriminator, loss function, optimizer, etc.
ckpt = Checkpoint(discr_opt=discr_opt,
                  genrt_opt=genrt_opt,
                  wgan=wgan,
                  d_model=d_model,
                  g_model=g_model)
1.2) CheckpointManager
This literally manages the checkpoints you have defined to be stored at a location, and things like how many to keep.
Definition from the docs:
"Manages multiple checkpoints by keeping some and deleting unneeded ones"
How to initialise it:
Initialise it with the Checkpoint object you created as the first argument.
The directory where to save the checkpoint files.
And you probably want to define how many you want to keep, since checkpoints of complex models can take up a lot of space.
manager = CheckpointManager(ckpt, "training_checkpoints_wgan", max_to_keep=3)
How to use it:
We have setup the manager object with our specified checkpoints, so it's ready to use.
Call this at the end of each training epoch
manager.save()
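To resume training later (a sketch, assuming the ckpt and manager objects created above), you can restore the newest checkpoint before entering the loop again:
# restores all tracked objects (models, optimizers) in place, if a checkpoint exists
ckpt.restore(manager.latest_checkpoint)
if manager.latest_checkpoint:
    print("Restored from {}".format(manager.latest_checkpoint))
else:
    print("Starting from scratch")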
2) ModelCheckpoint (callback)
You want to use this callback when you are not managing epoch iterations yourself. For example, when you have set up a relatively simple Sequential model and you call model.fit(), which manages the training process for you.
Definition from the docs:
"Callback to save the Keras model or model weights at some frequency."
How to initialise it:
Pass in the path where to save the model
The option save_weights_only is set to False by default:
If you want to only save the weights make sure to update this
The option save_best_only is set to False by default:
If you want to only save the best model instead of all of them, you can set this to True.
verbose is set to 0 (False), so you can set this to 1 to confirm that checkpoints are being written
mc = ModelCheckpoint("training_checkpoints/cp.ckpt", save_best_only=True, save_weights_only=False)
How to use it:
The model checkpoint callback is now ready for training.
You pass the object into your callbacks list when you fit the model:
model.fit(X, y, epochs=100, callbacks=[mc])

Tensorflow v1.10+ why is an input serving receiver function needed when checkpoints are made without it?

I'm in the process of adapting my model to TensorFlow's estimator API.
I recently asked a question regarding early stopping based on validation data where in addition to early stopping, the best model at this point should be exported.
It seems that my understanding of what a model export is and what a checkpoint is is not complete.
Checkpoints are made automatically. From my understanding, the checkpoints are sufficient for the estimator to start "warm" - either using pre-trained weights or weights saved prior to an error (e.g. if you experienced a power outage).
What is nice about checkpoints is that I do not have to write any code besides what is necessary for a custom estimator (namely, input_fn and model_fn).
While, given an initialized estimator, one can just call its train method to train the model, in practice this method is rather lackluster. Often one would like to do several things:
compare the network periodically to a validation dataset to ensure you are not over-fitting
stop the training early if over-fitting occurs
save the best model whenever the network finishes (either by hitting the specified number of training steps or by the early stopping criteria).
To someone new to the "high level" estimator API, a lot of low-level expertise seems to be required (e.g. for the input_fn), as how to get the estimator to do this is not straightforward.
By some light code reworking #1 can be achieved by using tf.estimator.TrainSpec and tf.estimator.EvalSpec with tf.estimator.train_and_evaluate.
In the previous question user #GPhilo clarifies how #2 can be achieved by using a semi-unintuitive function from the tf.contrib:
tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator,'my_metric_to_monitor', 10000)
(unintuitive as "the early stopping is not triggered according to the number of non-improving evaluations, but to the number of non-improving evals in a certain step range").
#GPhilo - noting that it is unrelated to #2 - also answered how to do #3 (as requested in the original post). Yet, I do not understand what an input_serving_fn is, why it is needed, or how to make it.
This is further confusing to me as no such function is needed to make checkpoints, or for the estimator to start "warm" from the checkpoint.
So my questions are:
what is the difference between a checkpoint and an exported best model?
what exactly is a serving input receiver function and how to write one? (I have spent a bit of time reading over the tensorflow docs and do not find it sufficient to understand how I should write one, and why I even have to).
how can I train my estimator, save the best model, and then later load it.
To aid in answering my question I am providing this Colab document.
This self-contained notebook produces some dummy data, saves it in TF Records, has a very simple custom estimator via model_fn and trains this model with an input_fn that uses the TF Record files. Thus it should be sufficient for someone to explain to me what placeholders I need to make for the input serving receiver function and how I can accomplish #3.
Update
#GPhilo, foremost, I cannot overstate my appreciation for your thoughtful consideration and care in helping me (and hopefully others) understand this matter.
My “goal” (motivating me to ask this question) is to try and build a reusable framework for training networks so I can just pass a different build_fn and go (plus have the quality of life features of exported model, early stopping, etc).
An updated (based off your answers) Colab can be found here.
After several readings of your answer, I have found now some more confusion:
1.
the way you provide input to the inference model is different than the one you use for the training
Why? To my understanding the data input pipeline is not:
load raw —> process —> feed to model
But rather:
Load raw —> pre process —> store (perhaps as tf records)
# data processing has nothing to do with feeding data to the model?
Load processed —> feed to model
In other words, it is my understanding (perhaps wrongly) that the point of a tf Example / SequenceExample is to store a complete singular datum entity ready to go - no other processing needed other than reading from the TFRecord file.

Thus there can be a difference between the training / evaluation input_fn and the inference one (e.g. reading from file vs eager / interactive evaluation of in memory), but the data format is the same (except for inference you might want to feed only 1 example rather than a batch…)
I agree that the “input pipeline is not part of the model itself”. However, in my mind, and I am apparently wrong in thinking so, with the estimator I should be able to feed it a batch for training and a single example (or batch) for inference.
An aside: “When evaluating, you don't need the gradients and you need a different input function.” Is the only difference (at least in my case) the files from which you are reading?
I am familiar with that TF Guide, but I have not found it useful because it is unclear to me what placeholders I need to add and what additional ops need to be added to convert the data.
What if I train my model with records and want to inference with just the dense tensors?
Tangentially, I find the example in the linked guide subpar, given the tf record interface requires the user to define multiple times how to write to / extract features from a tf record file in different contexts. Further, given that the TF team has explicitly stated they have little interest in documenting tf records, any documentation built on top of it, to me, is therefore equally unenlightening.
Regarding tf.estimator.export.build_raw_serving_input_receiver_fn.
What is the placeholder called? Input? Could you perhaps show the analog of tf.estimator.export.build_raw_serving_input_receiver_fn by writing the equivalent serving_input_receiver_fn?
Regarding your example serving_input_receiver_fn with the input images. How do you know to call features ‘images’ and the receiver tensor ‘input_data’ ? Is that (the latter) standard?
How to name an export with signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY.
What is the difference between a checkpoint and an exported best model?
A checkpoint is, at its minimum, a file containing the values of all the variables of a specific graph taken at a specific time point.
By specific graph I mean that when loading back your checkpoint, what TensorFlow does is loop through all the variables defined in your graph (the one in the session you're running) and search for a variable in the checkpoint file that has the same name as the one in the graph. For resuming training, this is ideal because your graph will always look the same between restarts.
An exported model serves a different purpose. The idea of an exported model is that, once you're done training, you want to get something you can use for inference that doesn't contain all the (heavy) parts that are specific to training (some examples: gradient computation, global step variable, input pipeline, ...).
Moreover, and this is the key point, typically the way you provide input to the inference model is different from the one you use for training. For training, you have an input pipeline that loads, preprocesses and feeds data to your network. This input pipeline is not part of the model itself and may have to be altered for inference. This is a key point when operating with Estimators.
Why do I need a serving input receiver function?
To answer this, I'll first take a step back. Why do we need input functions at all, and what are they? TF's Estimators, while perhaps not as intuitive as other ways to model networks, have a great advantage: they clearly separate model logic from input-processing logic by means of input functions and model functions.
A model lives in 3 different phases: Training, Evaluation and Inference. For the most common use-cases (or at least, all I can think of at the moment), the graph running in TF will be different in all these phases. The graph is the combination of input preprocessing, model and all the machinery necessary to run the model in the current phase.
A few examples to hopefully clarify further: When training, you need gradients to update the weights, an optimizer that runs the training step, metrics of all kinds to monitor how things are going, an input pipeline that grabs data from the training set, etc. When evaluating, you don't need the gradients and you need a different input function. When you are inferencing, all you need is the forward part of the model and again the input function will be different (no tf.data.* stuff but typically just a placeholder).
Each of these phases in Estimators has its own input function. You're familiar with the training and evaluation ones, the inference one is simply your serving input receiver function. In TF lingo, "serving" is the process of packing a trained model and using it for inference (there's a whole TensorFlow serving system for large-scale operation but that's beyond this question and you most likely won't need it anyhow).
Time to quote a TF guide on the topic:
During training, an input_fn() ingests data and prepares it for use by the model. At serving time, similarly, a serving_input_receiver_fn() accepts inference requests and prepares them for the model. This function has the following purposes:
To add placeholders to the graph that the serving system will feed with inference requests.
To add any additional ops needed to convert data from the input format into the feature Tensors expected by the model.
Now, the serving input function specification depends on how you plan on sending input to your graph.
If you're going to pack the data in a (serialized) tf.Example (which is similar to one of the records in your TFRecord files), your serving input function will have a string placeholder (that's for the serialized bytes of the example) and will need a specification of how to interpret the example in order to extract its data. If this is the way you want to go, I invite you to have a look at the example in the linked guide above; it essentially shows how you set up the specification of how to interpret the example and parse it to obtain the input data.
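For this route TF also provides a ready-made helper, tf.estimator.export.build_parsing_serving_input_receiver_fn, which builds the string placeholder and the parsing ops from a feature spec; a sketch (the feature name and shape are made up for illustration):
# the spec must mirror how the tf.Examples were written
feature_spec = {
    'input_data': tf.FixedLenFeature([28, 28, 1], tf.float32),
}
serving_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)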
If, instead, you're planning on directly feeding input to the first layer of your network you still need to define a serving input function, but this time it will only contain a placeholder that will be plugged directly into the network. TF offers a function that does just that: tf.estimator.export.build_raw_serving_input_receiver_fn.
So, do you actually need to write your own input function? If all you need is a placeholder, no. Just use build_raw_serving_input_receiver_fn with the appropriate parameters. If you need fancier preprocessing, then yes, you might need to write your own. In that case, it would look something like this:
def serving_input_receiver_fn():
    """For the sake of the example, let's assume your input to the network will be a 28x28 grayscale image that you'll then preprocess as needed"""
    input_images = tf.placeholder(dtype=tf.uint8,
                                  shape=[None, 28, 28, 1],
                                  name='input_images')
    # here you do all the operations you need on the images before they can be fed to the net
    # (e.g., normalizing, reshaping, etc). Let's assume "images" is the resulting tensor.
    features = {'input_data': images}  # this is the dict that is then passed as "features" parameter to your model_fn
    receiver_tensors = {'input_data': input_images}  # As far as I understand this is needed to map the input to a name you can retrieve later
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)
How can I train my estimator, save the best model, and then later load it?
Your model_fn takes the mode parameter in order for you to build the model conditionally. In your colab, you always have an optimizer, for example. This is wrong, as it should only be there for mode == tf.estimator.ModeKeys.TRAIN.
Secondly, your build_fn has an "outputs" parameter that is meaningless. This function should represent your inference graph, take as input only the tensors you'll feed to it at inference time, and return the logits/predictions.
I'll thus assume the outputs parameter is not there, as the build_fn signature should be def build_fn(inputs, params).
Moreover, you define your model_fn to take features as a tensor. While this can be done, it both limits you to having exactly one input and complicates things for the serving_fn (you can't use the canned build_raw_... but need to write your own and return a TensorServingInputReceiver instead). I'll choose the more generic solution and assume your model_fn is as follows (I omit the variable scope for brevity, add it as necessary):
def model_fn(features, labels, mode, params):
    my_input = features["input_data"]
    my_input.set_shape(I_SHAPE(params['batch_size']))

    # output of the network
    onet = build_fn(my_input, params)  # pass the input tensor, matching build_fn(inputs, params)
    predicted_labels = tf.nn.sigmoid(onet)
    predictions = {'labels': predicted_labels, 'logits': onet}
    export_outputs = {  # see EstimatorSpec's docs to understand what this is and why it's necessary.
        'labels': tf.estimator.export.PredictOutput(predicted_labels),
        'logits': tf.estimator.export.PredictOutput(onet)
    }
    # NOTE: export_outputs can also be used to save models as "SavedModel"s during evaluation.

    # HERE is where the common part of the graph between training, inference and evaluation stops.
    if mode == tf.estimator.ModeKeys.PREDICT:
        # return early and avoid adding the rest of the graph that has nothing to do with inference.
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions,
                                          export_outputs=export_outputs)

    labels.set_shape(O_SHAPE(params['batch_size']))

    # calculate loss
    loss = loss_fn(onet, labels)

    # add optimizer only if we're training
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdagradOptimizer(learning_rate=params['learning_rate'])

    # some metrics used both in training and eval
    mae = tf.metrics.mean_absolute_error(labels=labels, predictions=predicted_labels, name='mae_op')
    mse = tf.metrics.mean_squared_error(labels=labels, predictions=predicted_labels, name='mse_op')
    metrics = {'mae': mae, 'mse': mse}
    tf.summary.scalar('mae', mae[1])
    tf.summary.scalar('mse', mse[1])

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics,
                                          predictions=predictions, export_outputs=export_outputs)

    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op, eval_metric_ops=metrics,
                                          predictions=predictions, export_outputs=export_outputs)
Now, to set up the exporting part, after your call to train_and_evaluate finished:
1) Define your serving input function:
serving_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
    {'input_data': tf.placeholder(tf.float32, [None, #YOUR_INPUT_SHAPE_HERE (without batch size)#])})
2) Export the model to some folder
est.export_savedmodel('my_directory_for_saved_models', serving_fn)
This will save the current state of the estimator to wherever you specified. If you want a specific checkpoint, load it before calling export_savedmodel.
This will save in "my_directory_for_saved_models" a prediction graph with the trained parameters that the estimator had when you called the export function.
Finally, you might want to freeze the graph (look up freeze_graph.py) and optimize it for inference (look up optimize_for_inference.py and/or transform_graph), obtaining a frozen *.pb file you can then load and use for inference as you wish.
Edit: Adding answers to the new questions in the update
Sidenote:
My “goal” (motivating me to ask this question) is to try and build a
reusable framework for training networks so I can just pass a
different build_fn and go (plus have the quality of life features of
exported model, early stopping, etc).
By all means, if you manage, please post it on GitHub somewhere and link it to me. I've been trying to get just the same thing up and running for a while now and the results are not quite as good as I'd like them to be.
Question 1:
In other words, it is my understanding (perhaps wrongly) that the
point of a tf Example / SequenceExample is to store a complete
singular datum entity ready to go - no other processing needed other
than reading from the TFRecord file.
Actually, this is typically not the case (although your way is, in theory, perfectly fine too).
You can see TFRecords as an (awfully documented) way to store a dataset in a compact way. For image datasets, for example, a record typically contains the compressed image data (as in, the bytes composing a jpeg/png file), its label and some meta information. Then the input pipeline reads a record, decodes it, preprocesses it as needed and feeds it to the network. Of course, you can move the decoding and preprocessing before the generation of the TFRecord dataset and store in the examples the ready-to-feed data, but the size blowup of your dataset will be huge.
The specific preprocessing pipeline is one example of what changes between phases (for example, you might have data augmentation in the training pipeline, but not in the others). Of course, there are cases in which these pipelines are the same, but in general this is not true.
About the aside:
“When evaluating, you don't need the gradients and you need a
different input function. “, the only difference (at least in my case)
is the files from which you reading?
In your case that may be. But again, assume you're using data augmentation: You need to disable it (or, better, don't have it at all) during eval and this alters your pipeline.
Question 2: What if I train my model with records and want to inference with just the dense tensors?
This is precisely why you separate the pipeline from the model.
The model takes as input a tensor and operates on it. Whether that tensor is a placeholder or is the output of a subgraph that converts it from an Example to a tensor, that's a detail that belongs to the framework, not to the model itself.
The splitting point is the model input. The model expects a tensor (or, in the more generic case, a dict of name:tensor items) as input and uses that to build its computation graph. Where that input comes from is decided by the input functions, but as long as the output of all input functions has the same interface, one can swap inputs as needed and the model will simply take whatever it gets and use it.
So, to recap, assuming you train/eval with Examples and predict with dense tensors, your train and eval input functions will set up a pipeline that reads examples from somewhere, decodes them into tensors and returns those to the model to use as inputs. Your predict input function, on the other hand, just sets up one placeholder per input of your model and returns them to the model, because it assumes you'll put in the placeholders the data ready to be fed to the network.
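A condensed sketch of that split (file names, feature names and shapes are placeholders for illustration):
def parse_example(serialized):
    spec = {'input_data': tf.FixedLenFeature([28, 28, 1], tf.float32),
            'label': tf.FixedLenFeature([], tf.int64)}
    parsed = tf.parse_single_example(serialized, spec)
    return {'input_data': parsed['input_data']}, parsed['label']

def train_input_fn():
    # train/eval side: read serialized tf.Examples and decode them into tensors
    dataset = tf.data.TFRecordDataset(['train.tfrecord'])
    return dataset.map(parse_example).shuffle(1000).batch(32).repeat()

# predict side: no TFRecords, just a placeholder the caller fills with dense tensors
serving_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
    {'input_data': tf.placeholder(tf.float32, [None, 28, 28, 1])})
Both sides hand the model the same interface: a dict containing an 'input_data' tensor.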
Question 3:
You pass the placeholder as a parameter of build_raw_serving_input_receiver_fn, so you choose its name:
tf.estimator.export.build_raw_serving_input_receiver_fn(
    {'images': tf.placeholder(tf.float32, [None, 28, 28, 1], name='input_images')})
Question 4:
There was a mistake in the code (I had mixed up two lines), the dict's key should have been input_data (I amended the code above).
The key in the dict has to be the key you use to retrieve the tensor from features in your model_fn. In model_fn the first line is:
my_input = features["input_data"]
hence the key is 'input_data'.
As for the key in receiver_tensors, I'm still not quite sure what role that one has, so my suggestion is to try setting a different name than the key in features and check where the name shows up.
Question 5:
I'm not sure I understand, I'll edit this after some clarification

TensorFlow Custom Estimator predict throwing value error

Note: this question has an accompanying, documented Colab notebook.
TensorFlow's documentation can, at times, leave a lot to be desired. Some of the older docs for lower level apis seem to have been expunged, and most newer documents point towards using higher level apis such as TensorFlow's subset of keras or estimators. This would not be so problematic if the higher level apis did not so often rely closely on their lower levels. Case in point, estimators (especially the input_fn when using TensorFlow Records).
Over the following Stack Overflow posts:
Tensorflow v1.10: store images as byte strings or per channel?
Tensorflow 1.10 TFRecordDataset - recovering TFRecords
Tensorflow v1.10+ why is an input serving receiver function needed when checkpoints are made without it?
TensorFlow 1.10+ custom estimator early stopping with train_and_evaluate
TensorFlow custom estimator stuck when calling evaluate after training
and with the gracious assistance of the TensorFlow / StackOverflow community, we have moved closer to doing what the TensorFlow "Creating Custom Estimators" guide has not, demonstrating how to make an estimator one might actually use in practice (rather than toy example) e.g. one which:
has a validation set for early stopping if performance worsens,
reads from TF Records because many datasets are larger than TensorFlow's recommended 1 GB for in-memory use, and
that saves its best version whilst training
While I still have many questions regarding this (from the best way to encode data into a TF Record, to what exactly the serving_input_fn expects), there is one question that stands out more prominently than the rest:
How to predict with the custom estimator we just made?
Under the documentation for predict, it states:
input_fn: A function that constructs the features. Prediction continues until input_fn raises an end-of-input exception (tf.errors.OutOfRangeError or StopIteration). See Premade Estimators for more information. The function should construct and return one of the following:
A tf.data.Dataset object: Outputs of Dataset object must have same constraints as below.
features: A tf.Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn. They should satisfy the expectation of model_fn from inputs.
A tuple, in which case the first item is extracted as features.
(perhaps) Most likely, if one is using estimator.predict, they are using data in memory such as a dense tensor (because a held out test set would likely go through evaluate).
So I, in the accompanying Colab, create a single dense example, wrap it up in a tf.data.Dataset, and call predict to get a ValueError.
I would greatly appreciate it if someone could explain to me how I can:
load my saved estimator
given a dense, in memory example, predict the output with the estimator
to_predict = random_onehot((1, SEQUENCE_LENGTH, SEQUENCE_CHANNELS))\
                .astype(tf_type_string(I_DTYPE))
pred_features = {'input_tensors': to_predict}

pred_ds = tf.data.Dataset.from_tensor_slices(pred_features)
predicted = est.predict(lambda: pred_ds, yield_single_examples=True)
next(predicted)
ValueError: Tensor("IteratorV2:0", shape=(), dtype=resource) must be from the same graph as Tensor("TensorSliceDataset:0", shape=(), dtype=variant).
When you use the tf.data.Dataset module, it actually defines an input graph which is independent from the model graph. What happens here is that you first created a small graph by calling tf.data.Dataset.from_tensor_slices(), then the estimator API created a second graph by calling dataset.make_one_shot_iterator() automatically. These two graphs can't communicate, so it throws an error.
To circumvent this, you should never create a dataset outside of estimator.train/evaluate/predict. This is why everything data related is wrapped inside input functions.
def predict_input_fn(data, batch_size=1):
    dataset = tf.data.Dataset.from_tensor_slices(data)
    return dataset.batch(batch_size).prefetch(None)

predicted = est.predict(lambda: predict_input_fn(pred_features), yield_single_examples=True)
next(predicted)
Now, the graph is not created outside of the predict call.
I also added dataset.batch() because the rest of your code expects batched data and it was throwing a shape error. Prefetch just speeds things up.

In `tf.estimator`, how to `tf.assign` a variable at end of training (not at every iteration)?

I am using the tf.estimator API to train models.
As I understand, the model_fn defines the computation graph, which returns a different tf.estimator.EstimatorSpec according to the mode.
In mode==tf.estimator.ModeKeys.TRAIN, one can specify a train_op to be called at each training iteration, which in turn changes trainable instances of tf.Variable to optimise a certain loss.
Let's call the train_op optimizer, and the variables A and B.
In order to speed up prediction and evaluation, I would like to have an auxiliary non-trainable tf.Variable Tensor C, exclusively dependent on the already trained variables. The values of this tensor would thus be exportable. This Tensor does not affect training loss. Let's assume we want:
C = tf.Variable(tf.matmul(A,B))
update_op = tf.assign(C, tf.matmul(A,B))
What I tried:
Passing tf.group(optimizer, update_op) as train_op in the EstimatorSpec works, but slows down training a lot, since the train_op now updates C at each iteration.
Because C is only needed at eval/predict time, one call of update_op at the end of training is enough.
Is it possible to assign a Variable at the end of training a tf.estimator.Estimator?
In general a single iteration of the model function is not aware of whether training is going to end after it runs, so I doubt that this can be done straightforwardly. I see two options:
If you need the auxiliary variable only after training, you could use tf.estimator.Estimator.get_variable_value (see here) to extract the values of variables A and B after training as numpy arrays and do your computations to get C. However then C would not be part of the model.
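A sketch of that first option (the variable names 'A' and 'B' stand in for whatever your graph actually calls them):
import numpy as np

# after estimator.train(...) has finished
A = estimator.get_variable_value('A')  # returned as a numpy array
B = estimator.get_variable_value('B')
C = np.matmul(A, B)                    # computed outside the graph, so not part of the exported model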
Use a hook (see here). You can write a hook with an end method that will be called at the end of the session (i.e. when training stops). You would probably need to look into how hooks are defined/used -- e.g. here you can find implementations of most "basic" hooks already in Tensorflow. A rough skeleton could look something like this:
class UpdateHook(SessionRunHook):
    def __init__(self, update_variable, other_variables):
        self.update_op = tf.assign(update_variable, some_fn(other_variables))

    def end(self, session):
        session.run(self.update_op)
Since the hook needs access to the variables, you would need to define the hook inside the model function. You can pass such hooks to the training process in the EstimatorSpec (see here).
I haven't tested this! I'm not sure if you can define ops inside a hook. If not, it should hopefully work to define the update op inside the model function and pass that to the hook directly.
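One way to side-step the question of creating ops inside a hook (a sketch; RunOpAtEndHook and the surrounding names are made up) is to build the assign op in the model function and hand the hook only the finished op:
class RunOpAtEndHook(tf.train.SessionRunHook):
    """Runs a pre-built op once, when the session is about to close."""
    def __init__(self, op):
        self._op = op

    def end(self, session):
        session.run(self._op)

# inside model_fn, for mode == tf.estimator.ModeKeys.TRAIN:
#   update_op = tf.assign(C, tf.matmul(A, B))
#   return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op,
#                                     training_hooks=[RunOpAtEndHook(update_op)])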
Using a hook is a solution.
But note that if you want to change the value of a variable, you must not do it in the end() function, since a change made there cannot be stored in the checkpoint file. If you change the value in, for example, the after_run function, then the result will be stored in the checkpoint.
