How to access results from BestExporter while using train_and_evaluate? - python

When I use tf.estimator.train_and_evaluate with a BestExporter in my EvalSpec, the return value at the end might not include an export_result, since the final evaluation call won't necessarily lead to an export. This happens, for instance, if your last checkpoint doesn't lead to a lower loss on your evaluation set.
How do you access the last export_result that led to an export from the BestExporter? Ideally I would like to have a list of each (metrics, export_results) at the end of train_and_evaluate instead of just the last one.
For anyone desperate for a workaround, you can find the export directory using Python built-ins like this.
import os

estimator = tf.estimator.Estimator(...)
best_exporter = tf.estimator.BestExporter(...)
# Add best_exporter to your eval_spec
# Make train_spec
metrics, export_results = tf.estimator.train_and_evaluate(...)

# BestExporter writes each export to a timestamped subdirectory under
# <model_dir>/export/<exporter_name>; since it only exports on improvement,
# the newest subdirectory is the best model so far.
best_export_dir = os.path.join(estimator.model_dir, 'export', best_exporter.name)
savedmodels = sorted(os.listdir(best_export_dir))
best_model = savedmodels[-1]
Obviously a better method would be preferred. The particular issue I'm describing is that export_results might just be [None] because the last checkpoint didn't result in an export, even though an earlier checkpoint did.
For anyone who cares, these are the relevant bits of code from tensorflow r1.13, tracing the life of export_results from call to value:
tf.estimator.train_and_evaluate 471
_TrainingExecutor.run 611
_TrainingExecutor.run_local 703
_NewCheckpointListenerForEvaluate.after_save 517
_NewCheckpointListenerForEvaluate._evaluate 536
_Evaluator.evaluate_and_export 924
_Evaluator._export_eval_result 948

I might have found a solution if you are willing to (slightly) change the source code, specifically the _SavedModelExporter class implementation, in tensorflow_estimator\python\estimator\exporter.py.
First, I am using the package tensorflow_estimator instead of getting the estimator from tf.estimator. If the solution doesn't work in your case, consider using tensorflow_estimator - you should not lose anything by doing that.
Basically, _SavedModelExporter has a method called export, which, in my case (tensorflow 1.13.2, tensorflow_estimator 1.13.0), starts at line 116 and has the following implementation:
def export(self, estimator, export_path, checkpoint_path, eval_result,
           is_the_final_export):
  del is_the_final_export

  export_result = estimator.export_savedmodel(
      export_path,
      self._serving_input_receiver_fn,
      assets_extra=self._assets_extra,
      as_text=self._as_text,
      checkpoint_path=checkpoint_path,
      strip_default_attrs=self._strip_default_attrs)

  ################
  ### I ADDED THIS
  ################
  results_file = os.path.join(export_result, b"model_eval.txt")
  with open(results_file, mode="w") as f:
    for result in eval_result:
      f.write(result + ": " + str(eval_result[result]) + "\n")
  ################
  ### END OF I ADDED THIS
  ################

  return export_result
In the code above, as marked, I added code that loops through the dictionary of evaluation results (the eval_result variable, already available to us but not used here!) and saves it line by line to a file. This file is saved inside the same folder that contains the exported model, that is, something like export\best_exporter\1565348723\.
Some points:
1) You asked for a returned value, and I am not giving you that. I am instead saving it to a file, since I think this is the solution with the fewest changes to the source code. Do let me know if you cannot work with that.
2) You can build on this solution. For example, you could save all entries to a single file instead of one file per exported model.
3) All three implemented exporters (LatestExporter, FinalExporter and BestExporter) call into _SavedModelExporter, which we just changed. So you can either live with this behavior for all exporters, or add a flag, defaulting to False, that controls whether the file is written, and expose it through the call to BestExporter; a sketch of a wrapper along those lines follows below.
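As a variant of point 3 that avoids editing the installed package at all, you could subclass BestExporter and write the metrics only when an export actually happened. This is an untested sketch; EvalResultSavingExporter is a made-up name and the JSON file is just one possible format:
import json
import os

import tensorflow as tf


class EvalResultSavingExporter(tf.estimator.BestExporter):
    """BestExporter that also drops the eval metrics next to each export."""

    def export(self, estimator, export_path, checkpoint_path, eval_result,
               is_the_final_export):
        export_result = super(EvalResultSavingExporter, self).export(
            estimator, export_path, checkpoint_path, eval_result,
            is_the_final_export)
        # BestExporter returns None when this checkpoint was not better than
        # the previous best, so only write metrics when an export happened.
        if export_result is not None:
            results_file = os.path.join(
                tf.compat.as_str_any(export_result), "model_eval.json")
            with open(results_file, "w") as f:
                json.dump({k: float(v) for k, v in eval_result.items()},
                          f, indent=2, sort_keys=True)
        return export_result
Pass an instance of this class in your EvalSpec wherever you currently pass BestExporter.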
Hope I could help with something.

Related

Problem in tqdm function in a Doc2Vec model

I am using this article https://actsusanli.medium.com/ to implement the Doc2Vec model and I have a problem in the training step.
model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs = 40)
As you can see, I am using the tqdm function. When I run the code, tqdm reaches 100% after a few minutes, but the algorithm keeps running in the same shell for a long time afterwards.
Do you have any idea whether this is a problem with the tqdm function or something else?
By using the "list comprehension" ([..])...
[x for x in tqdm(train_tagged.values)]
...you are having tqdm iterate once over your train_tagged.values sequence, materializing it into an actual in-memory Python list. This shows the tqdm progress rather quickly and then completely finishes any involvement with tqdm.
Then, you're passing that plain result list (without any tqdm features) into Doc2Vec.train(), where Doc2Vec does its epochs=40 training passes. tqdm is no longer involved, so there'll be no incremental progress-bar output.
You might be tempted to try (or have already tried) something that skips the extra list creation, passing the tqdm-wrapped sequence directly in like:
corpus = utils.shuffle(train_tagged.values)
model_dbow.train(tqdm(corpus), total_examples=len(corpus), epochs = 40)
But this has a different problem: the tqdm-wrapper is only designed to allow (& report the progress of) one iteration over the wrapped sequence. So this will show that one iteration's incremental progress.
But when .train() tries its next necessary 39 re-iterations, to complete its epochs=40 training-runs, the single-pass tqdm object will be exhausted, preventing full & proper training.
Note that there is an option for progress-logging within Gensim, by setting the Python logging level (globally, or just for the class Doc2Vec) to INFO. Doc2Vec will then emit a log-line showing progress, within each epoch and between epochs, about every 1 second. But: you can also make such logging less-frequent by supplying a different seconds value to the optional report_delay argument of .train(), for example report_delay=60 (for a log line every minute instead of every second).
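For example, a minimal sketch of that logging approach; scoping the level to the "gensim" logger (an assumption about its logger naming) and the exact log format are my choices, not something prescribed by Gensim:
import logging

# Show Gensim's INFO-level progress lines; narrowing to the "gensim" logger
# keeps other libraries quiet.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s")
logging.getLogger("gensim").setLevel(logging.INFO)

# corpus as defined above; report_delay=60 spaces the in-epoch progress
# lines about a minute apart instead of every second.
model_dbow.train(corpus, total_examples=len(corpus), epochs=40, report_delay=60)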
If you really want a progress-bar, it should be possible to use tqdm - but you will have to work around its assumption that the iterable you're wrapping with tqdm() will only be iterated over once.
I believe there'd be two possible approaches, each with different tradeoffs:
(1) Instead of letting .train() repeat the corpus N times, do it yourself - adjusting the other .train() parameters accordingly. Roughly, that'd mean changing a line like...
model.train(corpus, total_examples=len(corpus), epochs=40)
...into something that turns your desired 40 epochs into something that looks like just one iteration to both tqdm & Gensim's .train(), like...
import itertools

repeated_corpus = itertools.chain(*[corpus] * 40)
repeated_len = 40 * len(corpus)
model.train(tqdm(repeated_corpus, total=repeated_len), total_examples=repeated_len, epochs=1)
(Note that you now have to give tqdm a hint as to the sequence's length, because the one-time chained-iterator from itertools.chain() doesn't report its own length.)
Then you'll get one progress-bar across the whole training run - which the model now sees as a single pass over a larger corpus, but which ultimately involves the same 40 passes.
You'll want to reinterpret any remaining log lines with this change in mind, and you'll lose a chance to install your own per-epoch callbacks via the model's end-of-epoch callback mechanism. (But, that's a seldom-used feature, anyway.)
(2) Instead of wrapping the corpus with a single tqdm() (which can only show a progress-bar for one iteration), wrap the corpus as a new fully-re-iterable object that itself will start a new tqdm() each time. For example, something like:
class TqdmEveryIteration(object):
    def __init__(self, inner_iterable):
        self.inner_iterable = inner_iterable

    def __iter__(self):
        return iter(tqdm(self.inner_iterable))
Then, using this new extra tqdm-adding wrapper, you should be able to do:
corpus = utils.shuffle(train_tagged.values)
model_dbow.train(TqdmEveryIteration(corpus), total_examples=len(corpus), epochs = 40)
In this case, you should get one progress bar per epoch, because a new tqdm() wrapper will be started each training pass.
(If you try either of these approaches & they work well, please let me know! They should be roughly correct, but I haven't tested them yet.)
Separately: if the article from the author at actsusanli.medium.com that you're modeling your work on is...
https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4
...note that it's using an overly-complex & fragile anti-pattern, calling .train() multiple times in a loop with manual alpha management. That has problems as described in this other answer. But that approach would also have the side-effect of re-wrapping the corpus each time in a new tqdm (like the TqdmEveryIteration class above), so despite its other issues, it would achieve one actual progress-bar per call to .train().
(I sent the author a private note via Medium about a month ago about this problem.)

How do you use Tensorflow Keras Custom Objects with tf.saved_model.Asset?

I have a custom Keras Layer that reads from a pickle file to initialize some weights, and I'd like to be able to use tf.keras.utils.register_keras_serializable() on it. The issue is that my __init__ function takes the path to the pickle file, which might not be available when the layer is deserialized again. Keras Assets should theoretically make the layer more portable, but I can't figure out how to get it to work with the layer's get_config().
Barebones version of my code:
@tf.keras.utils.register_keras_serializable()
class AssetLayer(tf.keras.layers.Layer):
    def __init__(self, asset_path, **kwargs):
        super().__init__(**kwargs)
        self.asset_path = asset_path
        self.asset = tf.saved_model.Asset(asset_path)
        data = tf.io.read_file(self.asset)
        # do something with data

    def get_config(self):
        return {
            **super().get_config(),
            "asset_path": self.asset_path,
        }

    def call(self, arg):
        # arbitrary call function
        return arg
If a model using this layer is loaded using tf.keras.models.load_model(), Keras will call get_config() to reinitialize the layer using the saved asset_path which might not be pointing to the right place at deserialization time. Ideally it would point to the path of the saved asset, but I don't know how to make it do that.
For instance, I've tried this code
!echo abcd > file.txt
model = tf.keras.Sequential([AssetLayer("file.txt")])
model(tf.ones(3))
model.save("test")
# reloading
!rm file.txt
reloaded_model = tf.keras.models.load_model("test")
which gives me an error saying file.txt is not found.
I've also tried removing the get_config() function entirely. This makes it so the layer can be successfully reloaded while retaining access to the asset variable, but other attributes in the layer such as self.asset_path aren't accessible. This isn't ideal for debugging purposes, so I'm wondering if there's a better way.
I'm currently using TensorFlow 2.5.0.
Edited code:
The code is fine up to this point. The issue is reproduced because of
!rm file.txt
(so I moved it to the end)
!echo abcd > file.txt
model = tf.keras.Sequential([AssetLayer("file.txt")])
model(tf.ones(3))
model.save("./content/sample_data/test.h5")
# reloading
reloaded_model = tf.keras.models.load_model("/content/content/sample_data/test.h5")
reloaded_model.summary()
!rm file.txt
Reference: https://www.tensorflow.org/guide/keras/save_and_serialize
It seems "tf.saved_model.Asset" do not support "tf.keras.models.load_model"
Try use tf.saved_model.save / tf.saved_model.load instead
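For example, a minimal untested sketch of that workaround, reusing the AssetLayer from the question (the directory name saved_asset_model is arbitrary):
import tensorflow as tf

model = tf.keras.Sequential([AssetLayer("file.txt")])
model(tf.ones(3))

# tf.saved_model.save copies the tracked asset into the SavedModel's assets/ directory.
tf.saved_model.save(model, "saved_asset_model")

# The original file.txt can now be removed; the restored object reads the
# copy stored under saved_asset_model/assets/ instead.
restored = tf.saved_model.load("saved_asset_model")
print(restored(tf.ones(3)))
The trade-off is that tf.saved_model.load returns a generic restored object rather than a Keras model, so Keras-specific conveniences like summary() are not available on it.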

TensorFlow: restoring model in a MonitoredSession

I have a model that contains multiple variables, including a global step. I've been able to successfully use a MonitoredSession to save checkpoints and summaries every 100 steps. I was expecting the MonitoredSession to automatically restore all my variables when the session is run in multiple passes (based on this documentation); however, this does not happen. If I take a look at the global step after running the training session again, I find that it starts back from zero. This is a simplified version of my code without the actual model. Let me know if more code is needed to solve this problem.
train_graph = tf.Graph()
with train_graph.as_default():
    # I create some datasets using the Dataset API
    # ...
    global_step = tf.train.create_global_step()

    # Create all the other variables and the model here
    # ...

    saver_hook = tf.train.CheckpointSaverHook(
        checkpoint_dir='checkpoint/',
        save_secs=None,
        save_steps=100,
        saver=tf.train.Saver(),
        checkpoint_basename='model.ckpt',
        scaffold=None)

    summary_hook = tf.train.SummarySaverHook(
        save_steps=100,
        save_secs=None,
        output_dir='summaries/',
        summary_writer=None,
        scaffold=None,
        summary_op=train_step_summary)

    num_steps_hook = tf.train.StopAtStepHook(num_steps=500)  # Just for testing

    with tf.train.MonitoredSession(
            hooks=[saver_hook, summary_hook, num_steps_hook]) as sess:
        while not sess.should_stop():
            step = sess.run(global_step)
            if step % 100 == 0:
                print(step)
            sess.run(optimizer)
When I run this code the first time, I get this output
0
100
200
300
400
The checkpoint folder at this point has checkpoints for every hundredth step up to 500. If I run the program again, I would expect to see the counter start at 500 and then increase up to 900, but instead I just get the same thing and all of my checkpoints get overwritten. Any ideas?
Alright, I figured it out. It was actually really simple. First, it's easier to use a MonitoredTrainingSession() instead of a MonitoredSession(). This wrapper session takes 'checkpoint_dir' as an argument. I thought that the saver_hook would take care of restoring, but that's not the case. In order to fix my problem I just had to change the line where I define the session like so:
with tf.train.MonitoredTrainingSession(hooks=[saver_hook, summary_hook], checkpoint_dir='checkpoint/') as sess:
It can also be done with the MonitoredSession directly, but you need to set up a session_creator instead.
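For completeness, a rough, untested sketch of that MonitoredSession variant, reusing the hooks and 'checkpoint/' directory from the question:
# Restoring is handled by the session creator rather than by the hooks.
session_creator = tf.train.ChiefSessionCreator(checkpoint_dir='checkpoint/')
with tf.train.MonitoredSession(
        session_creator=session_creator,
        hooks=[saver_hook, summary_hook, num_steps_hook]) as sess:
    while not sess.should_stop():
        sess.run(optimizer)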

Is there a way to dynamically fetch on the graph?

I'd like to write one decoder for both training (it should pass gradients down to the encoder) and beam-search mode (single steps from Python, sadly, so not linked to the encoder directly).
Ideally, something like this would work:
def decoder(beamSearchFlag_boolPlaceholder, initalState_fromEncoder,
            initialState_placeholder, input):
    initialState = tf.cond(beamSearchFlag_boolPlaceholder,
                           lambda: initialState_placeholder,
                           lambda: initalState_fromEncoder)
    ... = cell(input, initialState)
But with cond() TF still needs to resolve the dependencies of both branches. The _fromEncoder branch is executed when beamSearchFlag==False, even without effect, and that's a big part of unnecessary graph. Is there a way around this?

TensorFlow - how to evaluate all test set with every example once and only once

I am running the cifar10 example from TensorFlow, but there is a problem with evaluation.
I have a test set and I want to evaluate every example from it once and only once, but the code (line 121) currently only takes examples from a queue (line 126), which cannot guarantee that. I have also modified the input to be a '.tfrecords' file. Are there any suggestions?
Thank you in advance.
The function tf.train.string_input_producer that creates the queue of filenames here can accept a num_epochs argument. You can specify that you want it to run for only 1 epoch.
# Create a queue that produces the filenames to read.
filename_queue = tf.train.string_input_producer(filenames, num_epochs=1)
I have figured out a solution, though a rather imperfect one. The key is to exclude the epoch counter from the variables to restore and then initialize limit_epochs yourself. Here are the detailed steps:
1) Add del variables_to_restore['input_producer/limit_epochs/epochs'] after variables_to_restore = variable_averages.variables_to_restore(). This stops the saver from trying to load input_producer/limit_epochs into the model.
2) Next, add sess.run(tf.initialize_variables([v for v in tf.all_variables() if v.name.startswith("input_producer")])) inside the session to initialize that variable yourself.
3) Finally, create the queue with filename_queue = tf.train.string_input_producer(filenames, num_epochs=1), and try to save the output files before shutting down the threads.
The imperfection is that you have to make every thread read only one example at a time if you want it to fit an arbitrary number of test examples.
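Putting the pieces together, here is a rough, untested sketch of how those steps fit into an eval script shaped like the cifar10 example (names like variable_averages, filenames, and checkpoint_path are assumed to come from that script):
# Build the input pipeline with a hard one-epoch limit.
filename_queue = tf.train.string_input_producer(filenames, num_epochs=1)
# ... build the rest of the evaluation graph from filename_queue ...

variable_averages = tf.train.ExponentialMovingAverage(cifar10.MOVING_AVERAGE_DECAY)
variables_to_restore = variable_averages.variables_to_restore()
# Drop the epoch counter so the Saver does not try to restore it.
del variables_to_restore['input_producer/limit_epochs/epochs']
saver = tf.train.Saver(variables_to_restore)

with tf.Session() as sess:
    saver.restore(sess, checkpoint_path)
    # num_epochs adds a counter that is not in the checkpoint, so it has to
    # be initialized by hand before the queue runners start.
    sess.run(tf.initialize_variables(
        [v for v in tf.all_variables() if v.name.startswith("input_producer")]))
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... run the evaluation loop until tf.errors.OutOfRangeError ...
    coord.request_stop()
    coord.join(threads)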
