Why doesn't Tensorflow automatically handle hidden states of recurrent cells?


I am going through a couple of Tensorflow examples that use LSTM cells, trying to understand the purpose of the initial_state variable, which is used in one implementation but not in the other for some unknown reason.
For example PTB example uses it as:
self._initial_state = cell.zero_state(config.batch_size, data_type())
state = self._initial_state
where it represents the hidden state transitions and is used to keep the hidden state intact during batch training. Naturally, this variable should be zeroed between epochs. And yet some recurrent Bi-LSTM models don't use initial_state at all, which makes you think that it is either handled by Tensorflow behind the scenes or not necessary at all; hence the confusion. So why do some recurrent models use it and others don't? In Torch, for example, the same mechanism is as simple as:
local params, grad_params = model:getParameters()
-- start training loop
while epoch < max_epoch do
  for mini_batch in training_data do
    (...)
    grad_params:zero()
  end
end
The hidden state is handled by the framework, with no need for all that really clunky stuff, or am I missing something here? Can you please explain how it works in Tensorflow?

As I understand it, this appears to be a setup specific to the Tensorflow PTB model, which is supposed to run not only with a single LSTM cell but with several stacked ones (who would even try to train it with more than 2 cells, I wonder). For that it needs to keep track of the c and h tensors between the cells, hence the _initial_state variable. It is also supposed to run in parallel over several GPUs, continue if interrupted, etc. And that is why the PTB example code looks ugly and overengineered to a newcomer.
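
The behaviour the question attributes to initial_state (zeroed at the start of an epoch, then carried from one batch to the next) can be sketched without Tensorflow. This is only a toy under stated assumptions: toy_cell, zero_state, run_batch and run_epoch are hypothetical names, and the "cell" is a single scalar recurrence rather than an LSTM. It shows the pattern, not the PTB code itself.

```python
import math

def zero_state(batch_size):
    """All-zero hidden state, one scalar per sequence in the batch."""
    return [0.0] * batch_size

def toy_cell(x, h, w_x=0.5, w_h=0.9):
    """One step of a scalar recurrent cell (hypothetical stand-in for an LSTM cell)."""
    return math.tanh(w_x * x + w_h * h)

def run_batch(batch, state):
    """Unroll the cell over every sequence in the batch; return the final states."""
    new_state = []
    for sequence, h in zip(batch, state):
        for x in sequence:
            h = toy_cell(x, h)
        new_state.append(h)
    return new_state

def run_epoch(batches, batch_size):
    state = zero_state(batch_size)        # reset once, at the start of the epoch
    for batch in batches:
        state = run_batch(batch, state)   # final state becomes the next initial state
    return state

# Two long sequences, each cut into two consecutive chunks.
batches = [
    [[1.0, 0.5], [0.2, 0.1]],
    [[0.3, 0.7], [0.9, 0.4]],
]
carried = run_epoch(batches, batch_size=2)

# Reference: the same sequences processed uncut, starting from zero.
whole = run_batch([[1.0, 0.5, 0.3, 0.7], [0.2, 0.1, 0.9, 0.4]], zero_state(2))

# What happens if the state is (wrongly) reset before the second chunk.
reset_each_batch = run_batch(batches[1], zero_state(2))
```

With the state carried over, the chunked run ends in the same state as the uncut run. A model that sees each sequence whole within a single batch has nothing to carry, which is probably why such models can leave the initial state at its zero default.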

Related

Keras - no good way to stop and resume training?

After a lot of research, it seems like there is no good way to properly stop and resume training using a Tensorflow 2 / Keras model. This is true whether you are using model.fit() or using a custom training loop.
There seem to be 2 supported ways to save a model while training:
Save just the weights of the model, using model.save_weights() or save_weights_only=True with tf.keras.callbacks.ModelCheckpoint. This seems to be preferred by most of the examples I've seen, however it has a number of major issues:
The optimizer state is not saved, meaning training resumption will not be correct.
Learning rate schedule is reset - this can be catastrophic for some models.
Tensorboard logs go back to step 0 - making logging essentially useless unless complex workarounds are implemented.
Save the entire model, optimizer, etc. using model.save() or save_weights_only=False. The optimizer state is saved (good) but the following issues remain:
Tensorboard logs still go back to step 0
Learning rate schedule is still reset (!!!)
It is impossible to use custom metrics.
This doesn't work at all when using a custom training loop - custom training loops use a non-compiled model, and saving/loading a non-compiled model doesn't seem to be supported.
The best workaround I've found is to use a custom training loop, manually saving the step. This fixes the tensorboard logging, and the learning rate schedule can be fixed by doing something like keras.backend.set_value(model.optimizer.iterations, step). However, since a full model save is off the table, the optimizer state is not preserved. I can see no way to save the state of the optimizer independently, at least without a lot of work. And messing with the LR schedule as I've done feels messy as well.
Am I missing something? How are people out there saving/resuming using this API?
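
The workaround described above (saving the step together with the weights and the optimizer state) can be sketched independently of Keras. Everything here is an assumption made for illustration: save_state, load_state, learning_rate and train are hypothetical names, the "model" is a single float, and the "optimizer state" is one momentum value stored as JSON. The point is only that a step-based schedule and the optimizer state resume correctly once they are part of what gets saved.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Write the full training state atomically (temp file, then rename)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0, "momentum": 0.0}

def learning_rate(step):
    """Step-based schedule: halve the rate every 10 steps."""
    return 0.1 * 0.5 ** (step // 10)

def train(path, total_steps, stop_after=None):
    state = load_state(path)
    while state["step"] < total_steps:
        grad = 2.0 * (state["weight"] - 3.0)            # d/dw of (w - 3)^2
        state["momentum"] = 0.9 * state["momentum"] + grad
        state["weight"] -= learning_rate(state["step"]) * state["momentum"]
        state["step"] += 1
        save_state(path, state)
        if state["step"] == stop_after:
            break                                        # simulate an interruption
    return state

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "state.json")
    train(ckpt, total_steps=30, stop_after=12)           # interrupted run
    resumed = train(ckpt, total_steps=30)                # picks up at step 12
    straight = train(os.path.join(tmp_dir, "ref.json"), total_steps=30)
```

Because the step, the weight and the momentum are all restored, the resumed run ends in the same state as the uninterrupted one.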

tf.keras.callbacks.experimental.BackupAndRestore API for resuming training from interruptions has been added for tensorflow>=2.3. It works great in my experience.
Reference:
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/experimental/BackupAndRestore

You're right, there isn't builtin support for resumability - which is exactly what motivated me to create DeepTrain. It's like Pytorch Lightning (better and worse in different regards) for TensorFlow/Keras.
Why another library? Don't we have enough? You have nothing like this; if there was, I'd not build it. DeepTrain's tailored for the "babysitting approach" to training: train fewer models, but train them thoroughly. Closely monitor each stage to diagnose what's wrong and how to fix.
Inspiration came from my own use; I'd see "validation spikes" throughout a long epoch, and couldn't afford to pause as it'd restart the epoch or otherwise disrupt the train loop. And forget knowing which batch you were fitting, or how many remain.
How does it compare to Pytorch Lightning? Superior resumability and introspection, along with unique train debug utilities - but Lightning fares better in other regards. I have a comprehensive comparison list in the works, will post within a week.
Pytorch support coming? Maybe. If I convince the Lightning dev team to make up for its shortcomings relative to DeepTrain, then not - otherwise probably. In the meantime, you can explore the gallery of Examples.
Minimal example:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from deeptrain import TrainGenerator, DataGenerator
ipt = Input((16,))
out = Dense(10, 'softmax')(ipt)
model = Model(ipt, out)
model.compile('adam', 'categorical_crossentropy')
dg = DataGenerator(data_path="data/train", labels_path="data/train/labels.npy")
vdg = DataGenerator(data_path="data/val", labels_path="data/val/labels.npy")
tg = TrainGenerator(model, dg, vdg, epochs=3, logs_dir="logs/")
tg.train()
You can KeyboardInterrupt at any time, inspect the model, train state, data generator - and resume.

tf.keras.callbacks.BackupAndRestore can take care of this.

Just use the callback function as
callback = tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir="backup_directory")

How does PyTorch's loss.backward() work when "retain_graph=True" is specified?

I'm a newbie with PyTorch and adversarial networks. I've tried to look for an answer on the PyTorch documentation and from previous discussions both in the PyTorch and StackOverflow forums, but I couldn't find anything useful.
I'm trying to train a GAN with a Generator and a Discriminator, but I cannot tell whether the whole process is working or not. As far as I understand, I should train the Generator first and then update the Discriminator's weights (similar to this). My code for updating the weights of both models is:
# computing loss_g and loss_d...
optim_g.zero_grad()
loss_g.backward()
optim_g.step()
optim_d.zero_grad()
loss_d.backward()
optim_d.step()
where loss_g is the generator loss, loss_d is the discriminator loss, optim_g is the optimizer referring to the generator's parameters and optim_d is the discriminator optimizer.
If I run the code like this, I get an error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
So I specify loss_g.backward(retain_graph=True), and here comes my doubt: why should I specify retain_graph=True if there are two networks with two different graphs? Am I getting something wrong?

Having two different networks doesn't necessarily mean that the computational graph is different. The computational graph only tracks the operations that were performed from the input to the output and it doesn't matter where the operation takes place. In other words, if you use the output of the first model in the second model (e.g. model2(model1(input))), you have the same sequential operations as if they were part of the same model. In fact, that is no different from having different parts of the model, such as multiple convolutions, that you apply one after the other.
The error you get indicates that you are trying to backpropagate from the discriminator through the generator, which would mean that the discriminator's output directly adapts the generator's parameters for the discriminator to be successful. In an adversarial setting that is precisely what you want to avoid; they should be independent of each other. By setting retain_graph=True you incorrectly hide this bug. In nearly all cases retain_graph=True is not the solution and should be avoided.
To resolve that issue, the two models need to be made independent of each other. The crossover between the two models happens when you use the generator's output for the discriminator, since it should decide whether that was real or fake. Something along these lines:
fake = generator(noise)
real_prediction = discriminator(real)
# Using the output of the generator, continues the graph.
fake_prediction = discriminator(fake)
Even though fake comes from the generator, as far as the discriminator is concerned, it's merely another input, just like real. Therefore fake should be treated the same as real, where it is not attached to any computational graph. That can easily be done with torch.Tensor.detach, which decouples the tensor from the graph.
fake = generator(noise)
real_prediction = discriminator(real)
# Detach to make it independent of the generator
fake_prediction = discriminator(fake.detach())
That is also done in the code you referenced, from erikqu/EnhanceNet-PyTorch - train.py:
hr_imgs = torch.cat([discriminator(hr), discriminator(generated_hr.detach())], dim=0)
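
A minimal sketch to check the effect of detach, under stated assumptions: the two nn.Linear layers are hypothetical stand-ins for a real generator and discriminator, and the losses are placeholders rather than a proper GAN objective. It only shows that the discriminator's backward pass leaves the generator untouched, and that the generator's own backward pass then works without retain_graph=True.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two networks.
generator = nn.Linear(4, 8)
discriminator = nn.Linear(8, 1)

noise = torch.randn(16, 4)
real = torch.randn(16, 8)

fake = generator(noise)

# Discriminator step: fake is detached, so this graph stops at the discriminator input.
loss_d = discriminator(fake.detach()).mean() - discriminator(real).mean()
loss_d.backward()
generator_grads_after_d = [p.grad for p in generator.parameters()]

# Generator step: a fresh forward pass through the discriminator, this time
# keeping the connection to the generator. No retain_graph=True is needed,
# because the generator's graph has not been backpropagated through yet.
loss_g = -discriminator(fake).mean()
loss_g.backward()
generator_grads_after_g = [p.grad for p in generator.parameters()]
```

After the discriminator step every generator gradient is still None; only the generator step populates them.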

Tensorflow v1.10+ why is an input serving receiver function needed when checkpoints are made without it?

I'm in the process of adapting my model to TensorFlow's estimator API.
I recently asked a question regarding early stopping based on validation data where in addition to early stopping, the best model at this point should be exported.
It seems that my understanding of what a model export is and what a checkpoint is is not complete.
Checkpoints are made automatically. From my understanding, the checkpoints are sufficient for the estimator to start "warm" - either using pre-trained weights or weights prior to an error (e.g. if you experienced a power outage).
What is nice about checkpoints is that I do not have to write any code besides what is necessary for a custom estimator (namely, input_fn and model_fn).
While, given an initialized estimator, one can just call its train method to train the model, in practice this method is rather lackluster. Often one would like to do several things:
compare the network periodically to a validation dataset to ensure you are not over-fitting
stop the training early if over-fitting occurs
save the best model whenever the network finishes (either by hitting the specified number of training steps or by the early stopping criteria).
To someone new to the "high level" estimator API, a lot of low level expertise seems to be required (e.g. for the input_fn), as it is not straightforward how one could get the estimator to do this.
By some light code reworking #1 can be achieved by using tf.estimator.TrainSpec and tf.estimator.EvalSpec with tf.estimator.train_and_evaluate.
In the previous question user @GPhilo clarifies how #2 can be achieved by using a semi-unintuitive function from tf.contrib:
tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator,'my_metric_to_monitor', 10000)
(unintuitive as "the early stopping is not triggered according to the number of non-improving evaluations, but to the number of non-improving evals in a certain step range").
@GPhilo - noting that it is unrelated to #2 - also answered how to do #3 (as requested in the original post). Yet, I do not understand what an input_serving_fn is, why it is needed, or how to make it.
This is further confusing to me as no such function is needed to make checkpoints, or for the estimator to start "warm" from the checkpoint.
So my questions are:
what is the difference between a checkpoint and an exported best model?
what exactly is a serving input receiver function and how to write one? (I have spent a bit of time reading over the tensorflow docs and do not find it sufficient to understand how I should write one, and why I even have to).
how can I train my estimator, save the best model, and then later load it.
To aid in answering my question I am providing this Colab document.
This self-contained notebook produces some dummy data, saves it in TF Records, has a very simple custom estimator via model_fn and trains this model with an input_fn that uses the TF Record files. Thus it should be sufficient for someone to explain to me what placeholders I need to make for the input serving receiver function and how I can accomplish #3.
Update
@GPhilo, first and foremost, I cannot overstate my appreciation for your thoughtful consideration and care in helping me (and hopefully others) understand this matter.
My “goal” (motivating me to ask this question) is to try and build a reusable framework for training networks so I can just pass a different build_fn and go (plus have the quality of life features of exported model, early stopping, etc).
An updated (based off your answers) Colab can be found here.
After several readings of your answer, I have found now some more confusion:
1.
the way you provide input to the inference model is different than the one you use for the training
Why? To my understanding the data input pipeline is not:
load raw —> process —> feed to model
But rather:
Load raw —> pre process —> store (perhaps as tf records)
# data processing has nothing to do with feeding data to the model?
Load processed —> feed to model
In other words, it is my understanding (perhaps wrongly) that the point of a tf Example / SequenceExample is to store a complete singular datum entity ready to go - no other processing needed other than reading from the TFRecord file.
Thus there can be a difference between the training / evaluation input_fn and the inference one (e.g. reading from file vs eager / interactive evaluation of in memory), but the data format is the same (except for inference you might want to feed only 1 example rather than a batch…)
I agree that the “input pipeline is not part of the model itself”. However, in my mind, and I am apparently wrong in thinking so, with the estimator I should be able to feed it a batch for training and a single example (or batch) for inference.
An aside: “When evaluating, you don't need the gradients and you need a different input function. “, the only difference (at least in my case) is the files from which you reading?
I am familiar with that TF Guide, but I have not found it useful because it is unclear to me what placeholders I need to add and what additional ops needed to be added to convert the data.
What if I train my model with records and want to inference with just the dense tensors?
Tangentially, I find the example in the linked guide subpar, given the tf record interface requires the user to define multiple times how to write to / extract features from a tf record file in different contexts. Further, given that the TF team has explicitly stated they have little interest in documenting tf records, any documentation built on top of it, to me, is therefore equally unenlightening.
Regarding tf.estimator.export.build_raw_serving_input_receiver_fn.
What is the placeholder called? Input? Could you perhaps show the analog of tf.estimator.export.build_raw_serving_input_receiver_fn by writing the equivalent serving_input_receiver_fn
Regarding your example serving_input_receiver_fn with the input images. How do you know to call features ‘images’ and the receiver tensor ‘input_data’ ? Is that (the latter) standard?
How to name an export with signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY.

What is the difference between a checkpoint and an exported best model?
A checkpoint is, at its minimum, a file containing the values of all the variables of a specific graph taken at a specific time point.
By specific graph I mean that when loading back your checkpoint, what TensorFlow does is loop through all the variables defined in your graph (the one in the session you're running) and search for a variable in the checkpoint file that has the same name as the one in the graph. For resuming training, this is ideal because your graph will always look the same between restarts.
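
The name-matching behaviour described above can be sketched with plain dictionaries. This is an illustration under loud assumptions: restore is a hypothetical helper, the variable names are made up, and a real checkpoint also stores shapes and dtypes and is restored by far more involved machinery. The sketch mimics only the lookup by name.

```python
def restore(graph_variables, checkpoint):
    """Give every variable in the graph the value stored under the same name."""
    missing = sorted(name for name in graph_variables if name not in checkpoint)
    if missing:
        raise KeyError("variables not found in checkpoint: %s" % missing)
    # Checkpoint entries that the graph does not define are simply ignored.
    return {name: checkpoint[name] for name in graph_variables}

# Hypothetical checkpoint written by a training graph.
checkpoint = {
    "dense/kernel": [[0.1, 0.2]],
    "dense/bias": [0.3],
    "global_step": 1000,
    "dense/kernel/Adagrad": [[0.01, 0.02]],   # optimizer slot, training only
}

training_graph = dict.fromkeys(checkpoint)                    # same graph as before
inference_graph = {"dense/kernel": None, "dense/bias": None}  # forward pass only
renamed_graph = {"fc/kernel": None, "fc/bias": None}          # names no longer match

restored_training = restore(training_graph, checkpoint)
restored_inference = restore(inference_graph, checkpoint)

try:
    restore(renamed_graph, checkpoint)
    renamed_graph_restored = True
except KeyError:
    renamed_graph_restored = False
```

The training graph gets everything back, the smaller inference graph picks out just the weights it defines, and a graph whose variable names have changed cannot be restored at all.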
An exported model serves a different purpose. The idea of an exported model is that, once you're done training, you want to get something you can use for inference that doesn't contain all the (heavy) parts that are specific to training (some examples: gradient computation, global step variable, input pipeline, ...).
Moreover, and this is the key point, typically the way you provide input to the inference model is different than the one you use for the training. For training, you have an input pipeline that loads, preprocesses and feeds data to your network. This input pipeline is not part of the model itself and may have to be altered for inference. This is a key point when operating with Estimators.
Why do I need a serving input receiver function?
To answer this I'll first take a step back. Why do we need input functions at all, and what are they? TF's Estimators, while perhaps not as intuitive as other ways to model networks, have a great advantage: they clearly separate between model logic and input processing logic by means of input functions and model functions.
A model lives in 3 different phases: Training, Evaluation and Inference. For the most common use-cases (or at least, all I can think of at the moment), the graph running in TF will be different in all these phases. The graph is the combination of input preprocessing, model and all the machinery necessary to run the model in the current phase.
A few examples to hopefully clarify further: When training, you need gradients to update the weights, an optimizer that runs the training step, metrics of all kinds to monitor how things are going, an input pipeline that grabs data from the training set, etc. When evaluating, you don't need the gradients and you need a different input function. When you are inferencing, all you need is the forward part of the model and again the input function will be different (no tf.data.* stuff but typically just a placeholder).
Each of these phases in Estimators has its own input function. You're familiar with the training and evaluation ones, the inference one is simply your serving input receiver function. In TF lingo, "serving" is the process of packing a trained model and using it for inference (there's a whole TensorFlow serving system for large-scale operation but that's beyond this question and you most likely won't need it anyhow).
Time to quote a TF guide on the topic:
During training, an input_fn() ingests data and prepares it for use by the model. At serving time, similarly, a serving_input_receiver_fn() accepts inference requests and prepares them for the model. This function has the following purposes:
To add placeholders to the graph that the serving system will feed with inference requests.
To add any additional ops needed to convert data from the input format into the feature Tensors expected by the model.
Now, the serving input function specification depends on how you plan on sending input to your graph.
If you're going to pack the data in a (serialized) tf.Example (which is similar to one of the records in your TFRecord files), your serving input function will have a string placeholder (that's for the serialized bytes for the example) and will need a specification of how to interpret the example in order to extract its data. If this is the way you want to go I invite you to have a look at the example in the linked guide above, it essentially shows how you setup the specification of how to interpret the example and parse it to obtain the input data.
If, instead, you're planning on directly feeding input to the first layer of your network you still need to define a serving input function, but this time it will only contain a placeholder that will be plugged directly into the network. TF offers a function that does just that: tf.estimator.export.build_raw_serving_input_receiver_fn.
So, do you actually need to write your own input function? If all you need is a placeholder, no. Just use build_raw_serving_input_receiver_fn with the appropriate parameters. If you need fancier preprocessing, then yes, you might need to write your own. In that case, it would look something like this:
def serving_input_receiver_fn():
    """For the sake of the example, let's assume your input to the network will be a 28x28 grayscale image that you'll then preprocess as needed"""
    input_images = tf.placeholder(dtype=tf.uint8,
                                  shape=[None, 28, 28, 1],
                                  name='input_images')
    # here you do all the operations you need on the images before they can be fed to the net (e.g., normalizing, reshaping, etc). Let's assume "images" is the resulting tensor.
    # As an illustrative assumption, scale the uint8 pixels to floats in [0, 1]:
    images = tf.cast(input_images, tf.float32) / 255.0
    features = {'input_data' : images} # this is the dict that is then passed as "features" parameter to your model_fn
    receiver_tensors = {'input_data': input_images} # As far as I understand this is needed to map the input to a name you can retrieve later
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)
How can I train my estimator, save the best model, and then later load it?
Your model_fn takes the mode parameter so that you can build the model conditionally. In your colab, you always have an optimizer, for example. This is wrong, as it should only be there for mode == tf.estimator.ModeKeys.TRAIN.
Secondly, your build_fn has an "outputs" parameter that is meaningless. This function should represent your inference graph, take as input only the tensors you'll feed to it at inference time, and return the logits/predictions.
I'll thus assume the outputs parameter is not there, as the build_fn signature should be def build_fn(inputs, params).
Moreover, you define your model_fn to take features as a tensor. While this can be done, it both limits you to having exactly one input and complicates things for the serving_fn (you can't use the canned build_raw_... but need to write your own and return a TensorServingInputReceiver instead). I'll choose the more generic solution and assume your model_fn is as follows (I omit the variable scope for brevity, add it as necessary):
def model_fn(features, labels, mode, params):
    my_input = features["input_data"]
    my_input.set_shape(I_SHAPE(params['batch_size']))
    # output of the network
    onet = build_fn(features, params)
    predicted_labels = tf.nn.sigmoid(onet)
    predictions = {'labels': predicted_labels, 'logits': onet}
    export_outputs = { # see EstimatorSpec's docs to understand what this is and why it's necessary.
        'labels': tf.estimator.export.PredictOutput(predicted_labels),
        'logits': tf.estimator.export.PredictOutput(onet)
    }
    # NOTE: export_outputs can also be used to save models as "SavedModel"s during evaluation.
    # HERE is where the common part of the graph between training, inference and evaluation stops.
    if mode == tf.estimator.ModeKeys.PREDICT:
        # return early and avoid adding the rest of the graph that has nothing to do with inference.
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions,
                                          export_outputs=export_outputs)
    labels.set_shape(O_SHAPE(params['batch_size']))
    # calculate loss
    loss = loss_fn(onet, labels)
    # add optimizer only if we're training
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdagradOptimizer(learning_rate=params['learning_rate'])
    # some metrics used both in training and eval
    mae = tf.metrics.mean_absolute_error(labels=labels, predictions=predicted_labels, name='mae_op')
    mse = tf.metrics.mean_squared_error(labels=labels, predictions=predicted_labels, name='mse_op')
    metrics = {'mae': mae, 'mse': mse}
    tf.summary.scalar('mae', mae[1])
    tf.summary.scalar('mse', mse[1])
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics, predictions=predictions, export_outputs=export_outputs)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op, eval_metric_ops=metrics, predictions=predictions, export_outputs=export_outputs)
Now, to set up the exporting part, after your call to train_and_evaluate finished:
1) Define your serving input function:
serving_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
{'input_data':tf.placeholder(tf.float32, [None,#YOUR_INPUT_SHAPE_HERE (without batch size)#])})
2) Export the model to some folder
est.export_savedmodel('my_directory_for_saved_models', serving_fn)
This will save the current state of the estimator to wherever you specified. If you want a specific checkpoint, load it before calling export_savedmodel.
This will save in "my_directory_for_saved_models" a prediction graph with the trained parameters that the estimator had when you called the export function.
Finally, you might want to freeze the graph (look up freeze_graph.py) and optimize it for inference (look up optimize_for_inference.py and/or transform_graph), obtaining a frozen *.pb file you can then load and use for inference as you wish.
Edit: Adding answers to the new questions in the update
Sidenote:
My “goal” (motivating me to ask this question) is to try and build a
reusable framework for training networks so I can just pass a
different build_fn and go (plus have the quality of life features of
exported model, early stopping, etc).
By all means, if you manage, please post it on GitHub somewhere and link it to me. I've been trying to get just the same thing up and running for a while now and the results are not quite as good as I'd like them to be.
Question 1:
In other words, it is my understanding (perhaps wrongly) that the
point of a tf Example / SequenceExample is to store a complete
singular datum entity ready to go - no other processing needed other
than reading from the TFRecord file.
Actually, this is typically not the case (although, your way is in theory perfectly fine too).
You can see TFRecords as an (awfully documented) way to store a dataset in a compact way. For image datasets for example, a record typically contains the compressed image data (as in, the bytes composing a jpeg/png file), its label and some meta information. Then the input pipeline reads a record, decodes it, preprocesses it as needed and feeds it to the network. Of course, you can move the decoding and preprocessing before the generation of the TFRecord dataset and store in the examples the ready-to-feed data, but the size blowup of your dataset will be huge.
The specific preprocessing pipeline is one example of what changes between phases (for example, you might have data augmentation in the training pipeline, but not in the others). Of course, there are cases in which these pipelines are the same, but in general this is not true.
About the aside:
“When evaluating, you don't need the gradients and you need a
different input function. “, the only difference (at least in my case)
is the files from which you reading?
In your case that may be. But again, assume you're using data augmentation: You need to disable it (or, better, don't have it at all) during eval and this alters your pipeline.
Question 2: What if I train my model with records and want to inference with just the dense tensors?
This is precisely why you separate the pipeline from the model.
The model takes as input a tensor and operates on it. Whether that tensor is a placeholder or is the output of a subgraph that converts it from an Example to a tensor, that's a detail that belongs to the framework, not to the model itself.
The splitting point is the model input. The model expects a tensor (or, in the more generic case, a dict of name:tensor items) as input and uses that to build its computation graph. Where that input comes from is decided by the input functions, but as long as the output of all input functions has the same interface, one can swap inputs as needed and the model will simply take whatever it gets and use it.
So, to recap, assuming you train/eval with Examples and predict with dense tensors, your train and eval input functions will set up a pipeline that reads examples from somewhere, decodes them into tensors and returns those to the model to use as inputs. Your predict input function, on the other hand, just sets up one placeholder per input of your model and returns them to the model, because it assumes you'll put in the placeholders the data ready to be fed to the network.
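
The recap can be sketched in a few lines of plain Python. This is only an illustration of the interface idea and does not use the Estimator API: JSON strings stand in for serialized Examples, and model, train_input_fn and predict_input_fn are hypothetical names. What matters is that both input functions hand the model the same dict of features under the same key.

```python
import json

def model(features):
    """Toy 'model': only cares that features['input_data'] is a list of floats."""
    values = features["input_data"]
    return sum(values) / len(values)

def train_input_fn(serialized_records):
    """Decode stored records into (features, label) pairs."""
    for record in serialized_records:
        example = json.loads(record)      # JSON stands in for a serialized Example
        yield {"input_data": example["x"]}, example["y"]

def predict_input_fn(dense_inputs):
    """Pass ready-to-use values straight through, with no decoding step."""
    for values in dense_inputs:
        yield {"input_data": values}

records = [
    json.dumps({"x": [1.0, 2.0, 3.0], "y": 0}),
    json.dumps({"x": [4.0, 6.0], "y": 1}),
]
dense = [[1.0, 2.0, 3.0], [4.0, 6.0]]

outputs_from_records = [model(features) for features, _ in train_input_fn(records)]
outputs_from_dense = [model(features) for features in predict_input_fn(dense)]
```

The model produces the same outputs either way, because it never sees where its input came from.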
Question 3:
You pass the placeholder as a parameter of build_raw_serving_input_receiver_fn, so you choose its name:
tf.estimator.export.build_raw_serving_input_receiver_fn(
{'images':tf.placeholder(tf.float32, [None,28,28,1], name='input_images')})
Question 4:
There was a mistake in the code (I had mixed up two lines), the dict's key should have been input_data (I amended the code above).
The key in the dict has to be the key you use to retrieve the tensor from features in your model_fn. In model_fn the first line is:
my_input = features["input_data"]
hence the key is 'input_data'.
As for the key in receiver_tensors, I'm still not quite sure what role that one has, so my suggestion is to try setting a different name than the key in features and check where the name shows up.
Question 5:
I'm not sure I understand, I'll edit this after some clarification

How to perform finetuning on a Pytorch net

I'm using this implementation of SegNet in Pytorch, and I want to finetune it.
I've read online and I've found this method (basically freezing all layers except the last one in your net). My problem is that SegNet has more than 100 layers and I'm looking for a simpler way to do it, rather than writing 100 lines of code.
Do you think this could work? Or is this utter nonsense?
import torch.optim as optim
model = SegNet()
for name, param in model.named_modules():
    if name != 'conv11d': # the last layer should remain active
        param.requires_grad = False
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
def train():
    ...
How can I check if this is working as intended?

This process is called finetuning and setting requires_grad to False is a good way to do this. From the pytorch docs:
Every Tensor has a flag: requires_grad that allows for fine grained exclusion of subgraphs from gradient computation and can increase efficiency.
...
If there’s a single input to an operation that requires gradient, its output will also require gradient. Conversely, only if all inputs don’t require gradient, the output also won’t require it. Backward computation is never performed in the subgraphs, where all Tensors didn’t require gradients.
See this pytorch tutorial for a relevant example.
One simple way of checking to see this is working is looking at the initial error rates. Assuming the task is similar to the task the net was originally trained on, they should be much lower than for a randomly initialized net.
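
A more direct check is to take one optimizer step and compare weights before and after. One caveat about the snippet in the question: named_modules() yields modules rather than parameters, and setting requires_grad on a module object only creates a plain attribute, so it freezes nothing; named_parameters() is the iterator to use. The sketch below rests on assumptions: a three-layer nn.Sequential is a hypothetical stand-in for SegNet, with its last layer named conv11d as in the question, and the input and target are random.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical stand-in for SegNet; only the last layer's name matches the question.
model = nn.Sequential()
model.add_module("conv1", nn.Conv2d(3, 8, kernel_size=3, padding=1))
model.add_module("relu", nn.ReLU())
model.add_module("conv11d", nn.Conv2d(8, 2, kernel_size=3, padding=1))

# Parameter names look like "conv11d.weight", so match on the layer prefix.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("conv11d.")

trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]

# Hand the optimizer only the parameters that are still trainable.
optimizer = optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01, momentum=0.5
)

frozen_before = model.conv1.weight.detach().clone()
last_before = model.conv11d.weight.detach().clone()

# One training step on random data (4 images, 2 classes per pixel).
inputs = torch.randn(4, 3, 16, 16)
target = torch.randint(0, 2, (4, 16, 16))
loss = nn.functional.cross_entropy(model(inputs), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(model.conv1.weight, frozen_before)
last_layer_changed = not torch.equal(model.conv11d.weight, last_before)
```

If freezing works as intended, trainable_names lists only the conv11d parameters, the frozen layer's weights are bit-for-bit unchanged, and only the last layer moves.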

what does the optional argument "constants" do in the keras recurrent layers?

I'm learning to build a customized sequence-to-sequence model with keras, and have been reading some code that other people wrote, for example here. I got confused in the call method regarding constants. There is the keras "Note on passing external constants to RNNs", however I'm having trouble understanding what the constants are doing to the model.
I did go through the attention model and the pointer network papers, but maybe I've missed something.
Any reference to the modeling details would be appreciated! Thanks in advance.

Okay just as a reference in case someone else stumbles across this question: I went through the code in the recurrent.py file, I think the get_constants is getting the dropout mask and the recurrent dropout mask, then concatenating it with the [h,c] states (the order of these four elements is required in the LSTM step method). After that it doesn't matter anymore to the original LSTM cell, but you can add your own 'constants' (in the sense that it won't be learned) to pass from one timestep to the next. All constants will be added to the returned [h,c] states implicitly. In Keon's example the fifth position of the returned state is the input sequence, and it can be referenced in every timestep by calling states[-1].
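
The mechanism described above can be sketched as a plain Python loop. This is a toy, not the Keras implementation: step and run_rnn are hypothetical names and the state is a single float. It only shows the pattern in which constants are appended to the states handed to every timestep, while only the real states are updated and carried forward.

```python
def step(x, states):
    """One timestep. states = [h, context]; context is the constant."""
    h, context = states
    new_h = 0.5 * h + x + context
    # Only the true recurrent state is returned; the constant is not.
    return new_h, [new_h]

def run_rnn(inputs, initial_states, constants):
    states = list(initial_states)
    outputs = []
    for x in inputs:
        # The same constants are appended at every timestep, so the step
        # function can always read them from the end of the list (states[-1]).
        output, states = step(x, states + constants)
        outputs.append(output)
    return outputs

outputs = run_rnn([1.0, 2.0, 3.0], initial_states=[0.0], constants=[10.0])
```

Here the constant 10.0 is visible at every step but never changes, which matches the role of the input sequence read through states[-1] in the example mentioned above.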
