Context
I'm reading through Part II of Hands-On ML and am looking for some clarity on when to use "outputs" and when to use "state" in the loss calculation for an RNN.
In the book (p. 396 for those who have the book), the author says, "Note that the fully connected layer is connected to the states tensor, which contains only the final states of the RNN," referring to a sequence classifier that is unrolled over 28 time steps. Since the states variable will have len(states) == <number_of_hidden_layers>, when building a deep RNN I have been using states[-1] to connect only to the final state of the final layer. For example:
# hidden_layer_architecture = list of ints defining n_neurons in each layer
# example: hidden_layer_architecture = [100 for _ in range(5)]
layers = []
for layer_id, n_neurons in enumerate(hidden_layer_architecture):
    hidden_layer = tf.contrib.rnn.BasicRNNCell(n_neurons,
                                               activation=tf.nn.tanh,
                                               name=f'hidden_layer_{layer_id}')
    layers.append(hidden_layer)

recurrent_hidden_layers = tf.contrib.rnn.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(recurrent_hidden_layers,
                                    X_, dtype=tf.float32)

logits = tf.layers.dense(states[-1], n_outputs, name='outputs')
This works as expected given the author's statement. However, I don't understand when one would use the outputs variable (the first return value of tf.nn.dynamic_rnn()).
I have looked at this question, which does a pretty good job of answering the minutiae, and it mentions that, "If you are only interested in the last output of the cell, you can just slice the time dimension to pick just the last element (e.g. outputs[:, -1, :])." I inferred this to mean something along the lines of states[-1] == outputs[:, -1, :], which when tested was false. Why would this not be the case? If the outputs are the outputs of the cell at each time step, why wouldn't this be the case? In general...
Question
When does one use the outputs variable from tf.nn.dynamic_rnn() in the loss function and when would one use the states variable? How does this change the abstracted architecture of the network?
Any clarity would be greatly appreciated.
This basically breaks it down:
outputs: Full sequence of outputs of the top-level of the RNN. This means that, should you be using MultiRNNCell, this will only be the top cell; nothing from the lower cells is in here.
In general, with custom RNNCell implementations, this could be pretty much anything; however, pretty much all the standard cells will return the sequence of states here. You could also write a custom cell yourself that does something to the state sequence (e.g. a linear transformation) before returning it as outputs.
state (note that this is what the docs call it, not states) is the full state of the last time step. One important difference is that, in the case of MultiRNNCell, this will contain the final states of all cells in the stack, not just the top one! Also, the precise format/type of this output varies heavily depending on the RNNCell used (e.g. it could be a tensor, or a tuple of tensors...).
As such, if all you care about is the top-most state of the last time step in a MultiRNNCell, you really have two options that should be identical, coming down to personal preference/"clarity":
outputs[:, -1, :] (assuming batch-major format) extracts only the last time-step from the sequence of top-level states.
state[-1] extracts only the top-level state from the tuple of final states for all layers.
There are other scenarios where you might not have this choice:
If you actually need the full sequence output, you need to use outputs.
If you need the final states from lower layers in a MultiRNNCell, you need to use state.
As for why the equality check fails: If you actually used ==, I believe this checks for equality of the tensor objects which are obviously different. You could instead try to inspect the values of the two objects for some simple toy scenario (tiny state size/sequence length) -- they should be the same.
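For example, a minimal toy check (a sketch using the same TF 1.x API as the question; the sizes are arbitrary) could look like this:
import numpy as np
import tensorflow as tf

n_steps, n_inputs, n_neurons = 5, 3, 4
X_ = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

cells = [tf.contrib.rnn.BasicRNNCell(n_neurons) for _ in range(2)]
multi_cell = tf.contrib.rnn.MultiRNNCell(cells)
outputs, states = tf.nn.dynamic_rnn(multi_cell, X_, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out_val, st_val = sess.run(
        [outputs, states],
        feed_dict={X_: np.random.rand(2, n_steps, n_inputs).astype(np.float32)})
    # For BasicRNNCell the output equals the state, so the last time step of
    # the top layer's outputs should match the top layer's final state.
    print(np.allclose(out_val[:, -1, :], st_val[-1]))  # True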
I want to extract features of an optical image and save them into a NumPy array. I've seen similar questions, and also the Keras FAQ here: https://keras.io/getting_started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer-feature-extraction , but I don't know how to go about it.
The Keras documentation specifies exactly how to do that. If you have defined your model model_full, you can create another one that is just a part of it: from the input layer to the one you're interested in.
model_part = Model(
    inputs=model_full.input,
    outputs=model_full.get_layer("intermed_layer").output)
Then you should be able to obtain the output of the intermediate layer using:
intermed_output = model_part(data)
In order to do that, you just need a model_full defined, which I assume you already have.
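For completeness, a minimal sketch of what such a model_full could look like (the architecture and the layer name "intermed_layer" are purely illustrative assumptions):
from keras.models import Model
from keras.layers import Dense, Input

inp = Input(shape=(64,))
hidden = Dense(32, activation="relu", name="intermed_layer")(inp)
out = Dense(10, activation="softmax")(hidden)
model_full = Model(inputs=inp, outputs=out)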
2nd approach
You can also use a built-in Keras function, which I guess you already saw in the documentation as well. It may look kind of complicated at first, but it's just creating a function with bound values, i.e.
from keras import backend as K

get_3rd_layer_output = K.function(
    [model.layers[0].input],    # the function's input: the model's input tensor
    [model.layers[3].output])   # the function's output: the output of model.layers[3]

# here X is the input data; the function returns the output of model.layers[3]
output = get_3rd_layer_output([X])[0]
Clearly, again model has to be defined. Not sure if there are any other requirements apart from that.
How can I prune the weights of a CNN (convolutional neural network) model that are below a threshold value (say, prune all weights which are <= 1)?
How can we achieve that for a weight file saved in .pth format in PyTorch?
PyTorch since 1.4.0 provides model pruning out of the box, see the official tutorial.
As there is currently no threshold-based pruning method in PyTorch, you have to implement it yourself, though it's fairly easy once you get the overall idea.
Threshold Pruning method
Below is a code performing pruning:
import torch
from torch.nn.utils import prune


class ThresholdPruning(prune.BasePruningMethod):
    PRUNING_TYPE = "unstructured"

    def __init__(self, threshold):
        self.threshold = threshold

    def compute_mask(self, tensor, default_mask):
        # Keep (mask = True) only entries whose magnitude exceeds the threshold
        return torch.abs(tensor) > self.threshold
Explanation:
PRUNING_TYPE can be one of global, structured, unstructured. global acts across the whole module (e.g. remove the 20% of weights with the smallest values), structured acts on whole channels/modules. We need unstructured as we would like to modify each connection in a specific parameter tensor (say weight or bias)
__init__ - pass here whatever you want or need to make it work, normal stuff
compute_mask - the mask to be used to prune the specific tensor. In our case all parameters below the threshold should be zeroed. I did it with the absolute value as it makes more sense. default_mask is not needed here, but is left as a named parameter as that's what the API requires at the moment.
Moreover, inheriting from prune.BasePruningMethod defines methods to apply the mask to each parameter, make pruning permanent etc. See base class docs for more info.
Example module
Nothing too fancy, you can put anything you want here:
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.first = torch.nn.Linear(50, 30)
        self.second = torch.nn.Linear(30, 10)

    def forward(self, inputs):
        return self.second(torch.relu(self.first(inputs)))


module = MyModule()
You can also load your module via module = torch.load('checkpoint.pth') if you need to; it doesn't matter here.
Prune module's parameters
We should define which parameter of our module (and whether it's weight or bias) should be pruned, like this:
parameters_to_prune = ((module.first, "weight"), (module.second, "weight"))
Now, we can apply globally our unstructured pruning to all defined parameters (threshold is passed as kwarg to __init__ of ThresholdPruning):
prune.global_unstructured(
    parameters_to_prune, pruning_method=ThresholdPruning, threshold=0.1
)
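As a side note, if you only wanted to prune a single parameter tensor instead of going through prune.global_unstructured, the BasePruningMethod base class also exposes an apply classmethod; a small sketch using the same ThresholdPruning defined above:
ThresholdPruning.apply(module.first, "weight", threshold=0.1)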
Results
weight attribute
To see the effect, check weights of first submodule simply with:
print(module.first.weight)
It is the weight with our pruning technique applied, but please notice it's not a torch.nn.Parameter anymore! It is now simply an attribute of our module, recomputed from the mask and the original weights, so it is not what the optimizer updates directly.
weight_mask
We can check the created mask via module.first.weight_mask to see that everything is done correctly (it will be binary in this case).
weight_orig
Applying pruning creates a new torch.nn.Parameter holding the original weights, named name + "_orig", in this case weight_orig. Let's see:
print(module.first.weight_orig)
This parameter is what is actually used during training and evaluation! After applying pruning via the methods described above, forward_pre_hooks are added which recompute the weight attribute from weight_orig and weight_mask before each forward pass.
Thanks to this approach you can define and apply your pruning at any point of training or inference without "destroying" the original weights.
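To make this reparametrization concrete, here is a quick illustrative check that the weight attribute is just weight_orig multiplied by weight_mask (assuming module was pruned as above):
with torch.no_grad():
    recomputed = module.first.weight_orig * module.first.weight_mask
    print(torch.equal(module.first.weight, recomputed))  # True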
Applying pruning permanently
If you wish to apply pruning permanently simply issue:
prune.remove(module.first, "weight")
And now our module.first.weight is once again a parameter with its entries appropriately pruned, while module.first.weight_mask and module.first.weight_orig are removed. This is probably what you are after.
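A quick illustrative check that the change is now permanent:
print(isinstance(module.first.weight, torch.nn.Parameter))  # True again
print(hasattr(module.first, "weight_orig"))                 # False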
You can iterate over children to make it permanent:
for child in module.children():
    prune.remove(child, "weight")
You could define parameters_to_prune using the same logic:
parameters_to_prune = [(child, "weight") for child in module.children()]
Or if you want only convolution layers to be pruned (or anything else really):
parameters_to_prune = [
    (child, "weight")
    for child in module.children()
    if isinstance(child, torch.nn.Conv2d)
]
Advantages
uses "PyTorch way of pruning" so it's easier to communicate your intent to other programmers
define pruning on a per-tensor basis, single responsibility instead of going through everything
confine to predefined ways
pruning is not permanent hence you can recover from it if needed. Module can be saved with pruning masks and original weights so it leaves you some space to revert eventual mistake (e.g. threshold was too high and now all your weights are zero rendering results meaningless)
works with original weights during forward calls unless you want to finally change to pruned version (simple call to remove)
Disadvantages
IMO pruning API could be clearer
You can do it shorter (as provided by Shai)
might be confusing for those who do not know such a thing is "defined" by PyTorch (still, there are tutorials and docs, so I don't think it's a major problem)
You can work directly on the values saved in the state_dict:
import torch

thr = 1.0  # threshold below which weights are zeroed (as in the question)

sd = torch.load('saved_weights.pth')  # load the state dict
for k in sd.keys():
    if 'weight' not in k:
        continue  # skip biases and other saved parameters
    w = sd[k]
    sd[k] = w * (w > thr)  # set to zero weights smaller than thr

torch.save(sd, 'pruned_weights.pth')
I am trying to implement an Optimizer in TensorFlow and have been looking at the optimizer code from an old version of TensorFlow. I want to understand what the function _get_variable_for does; it is the first function in the optimizer file.
Any help would be appreciated.
Thanking You.
I see that this function checks two conditions.
ResourceVariable and VarHandleOp
Here is what a ResourceVariable is, according to the comments in the code:
"For example, if there is more than one assignment to a ResourceVariable in
a single session.run call there is a well-defined value for each operation
which uses the variable's value if the assignments and the read are connected
by edges in the graph. Consider the following example, in which two writes
can cause tf.Variable and tf.ResourceVariable to behave differently:"
a = tf.Variable(1.0, use_resource=True)
a.initializer.run()

assign = a.assign(2.0)
with tf.control_dependencies([assign]):
    b = a.read_value()
with tf.control_dependencies([b]):
    other_assign = a.assign(3.0)
with tf.control_dependencies([other_assign]):
    # Will print 2.0 because the value was read before other_assign ran. If
    # `a` was a tf.Variable instead, 2.0 or 3.0 could be printed.
    tf.Print(b, [b]).eval()
VarHandleOp seems to have deeper semantics as per this
"A common approach to managing where variables are placed, is to create a method to determine where each Op is to be placed and use that method in place of a specific device name when calling with tf.device(): Consider a scenario where a model is being trained on 2 GPUs and the variables are to be placed on the CPU. There would be a loop for creating and placing the "towers" on each of the 2 GPUs. A custom device placement method would be created that watches for Ops of type Variable, VariableV2, and VarHandleOp and indicates that they are to be placed on the CPU. All other Ops would be placed on the target GPU."
It explains this scenario further with sample code.
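A minimal sketch (TF 1.x style; the function name and the device strings are illustrative assumptions) of such a placement method could look like this:
VARIABLE_OP_TYPES = ("Variable", "VariableV2", "VarHandleOp")

def assign_to_device(gpu_device, cpu_device="/cpu:0"):
    def _assign(op):
        # Keep all variable ops (including VarHandleOp) on the CPU;
        # everything else goes to the target GPU.
        if op.type in VARIABLE_OP_TYPES:
            return cpu_device
        return gpu_device
    return _assign

# Usage when building each tower:
# with tf.device(assign_to_device("/gpu:0")):
#     ...  # build the ops for this tower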
I want to use a conditional variational autoencoder to generate cocktail recipes. I modified the code from this repo so it can read my own data. The input is an array of all the possible ingredients, so most of the entries have the value 0. If an ingredient is present, it gets a value which is the amount normalized by 250 ml. The last index is what is 'left over' to make sure a cocktail always adds up to 1.
Example:
0,0.0,0.0,0.24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0120000000000000
The output with a softmax activation function looks a bit like this:
[5.8228267e-10 6.7397465e-10 1.9761790e-08 2.3713847e-01 3.1315527e-11
4.9592632e-11 4.2637563e-05 7.6098106e-10 2.9357905e-05 1.3291576e-08
2.6885323e-09 4.2986945e-10 3.0274603e-09 8.6994453e-11 3.2391853e-10
3.3694150e-10 4.9642315e-11 2.2861177e-10 2.5966980e-11 3.3872125e-10
4.8175470e-12 1.1207919e-09 7.8108942e-10 1.0438563e-09 4.7190268e-12
2.2692757e-09 3.3177341e-10 4.7493649e-09 1.6603904e-08 2.7854623e-11
1.1586791e-07 2.3917833e-08 1.0172608e-09 2.2049740e-06 4.0200213e-10
4.8334226e-05 1.9393491e-09 4.0731374e-10 4.5671125e-10 8.5878060e-10
1.3625046e-10 1.7755342e-09 2.4927729e-09 3.8919952e-09 1.6791472e-10
1.5160178e-09 9.0631114e-10 1.2043951e-08 2.1420650e-01 1.4531254e-10
3.9913628e-10 4.6368896e-06 6.8399265e-11 2.4654754e-09 6.5392605e-12
5.8443012e-10 2.7861690e-11 4.7215394e-08 5.1503157e-09 5.4484850e-10
1.9266211e-10 7.2835156e-09 6.4243433e-10 4.2432866e-09 4.2630177e-08
1.1281617e-12 1.8015703e-08 3.5657147e-10 3.4241193e-11 4.8394988e-10
9.6064046e-11 2.9857121e-02 3.8048144e-11 1.1893182e-10 5.1867032e-01]
How can I make sure that the values are only distributed among a couple of ingredients and the rest of the ingredients get 0, similar to the input?
Is this a matter of changing the activation functions?
Thanks :)
I'm not sure you want to use probabilities here. It seems you're doing a regression to some specific values. Hence, it would make sense to not use a softmax, and use a simple mean-squared-error loss.
Note that if certain values are always biased in your loss, you can just use an extra weight on your loss, or use some abstraction (e.g. Keras's class_weight).
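For instance, a minimal sketch of a decoder head that regresses the ingredient amounts with MSE instead of a softmax (the layer sizes and names are only illustrative assumptions, not the repo's actual architecture):
from keras.models import Model
from keras.layers import Dense, Input

latent_dim, n_ingredients = 8, 75
z = Input(shape=(latent_dim,))
h = Dense(64, activation="relu")(z)
# A ReLU output keeps amounts non-negative without forcing them to sum to 1.
amounts = Dense(n_ingredients, activation="relu")(h)

decoder = Model(z, amounts)
decoder.compile(optimizer="adam", loss="mse")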
For this task you could consider using Keras, especially for this task it would make sense. There is an example checked into master: https://github.com/keras-team/keras/blob/master/examples/variational_autoencoder.py
For this task it might actually make sense to use a GAN: https://github.com/keras-team/keras/blob/master/examples/mnist_acgan.py . The discriminator learns to tell a random cocktail from a 'real' cocktail, and in the process the generator's weights are trained so that it can generate cocktails for you!
I'm finally using my LSTM model to predict things. However, I've run into a new problem that I don't quite understand. If I try to predict something using
sess.run(pred, feed_dict={x: xs})
it works great for the first prediction, but any subsequent prediction throws the error:
ValueError: Variable weight_def/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?
Now, there are a TON of topics on this - and most of them are easily solved by doing what it asks - just create a variable scope around the offending line and make variable reuse true. Now, if I do that I get the following error:
ValueError: Variable rnn_def/RNN/BasicLSTMCell/Linear/Matrix does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?
This is causing me quite the headache. I've read the TensorFlow variable-sharing documentation over and over, and I can't for the life of me figure out what I am doing wrong. Here are the offending lines:
with tf.variable_scope("rnn_def"):
outputs, states = rnn.rnn(self.__lstm_cell,
self.__x,
dtype=tf.float32)
self.__outputs = outputs
self.__states = states
I have this code nested in a larger class that contains the remainder of the graph. To train it, I just call my "train" method over and over again, which seems to work fine; the problem ends up being prediction.
So my question is twofold:
Why do I require some sort of variable sharing only after the first prediction but the first call doesn't fail? What do I need to fix this code so I can predict more than once without causing an error?
When is variable sharing useful, and why is Tensorflow creating new variables each time I run it? How can I prevent this (do I want to prevent it?)?
Thank you!
Add a print statement to that block of code. I suspect it is being called multiple times. Or maybe you are creating multiple instances of the class, in which case each instance should have its own scope name.
To answer your questions.
Why do I require some sort of variable sharing only after the first
prediction but the first call doesn't fail? What do I need to fix this
code so I can predict more than once without causing an error?
No, you don't. That block of code creating the RNN is probably being accidentally called multiple times.
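A sketch of the kind of structure that avoids this (the names are illustrative, and tf/rnn are the same modules you already import): build the graph once in __init__ and only call sess.run per prediction.
class LSTMPredictor:
    def __init__(self, lstm_cell, x):
        self.__x = x
        with tf.variable_scope("rnn_def"):
            outputs, states = rnn.rnn(lstm_cell, x, dtype=tf.float32)
        # Illustrative stand-in for whatever "pred" really is in your graph.
        self.pred = outputs[-1]

    def predict(self, sess, xs):
        # No graph construction here; just run the already-built ops,
        # so no duplicate variables can ever be created.
        return sess.run(self.pred, feed_dict={self.__x: xs})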
When is variable sharing useful, and why is Tensorflow creating new
variables each time I run it? How can I prevent this (do I want to
prevent it?)?
It is useful in the following case, where I have different input sources for part of my graph depending on whether it is training or predicting.
# Training path: the sharpen_cell variables are created here.
x_train, h_train = ops.sharpen_cell(x_train, h_train, 2, n_features, conv_size, n_convs, conv_activation, 'upsampler')
self.cost += tf.reduce_mean((x_train - y_train) ** 2)

# Reuse the same variables for the generation path.
level_scope.reuse_variables()
x_gen, h_gen = ops.sharpen_cell(x_gen, h_gen, 2, n_features, conv_size, n_convs, conv_activation, 'upsampler')
self.generator_outputs.append(tf.clip_by_value(x_gen, -1, 1))
In this example it reuses the variables for the generator, which were trained with the trainer. It is also useful if you want to unroll an RNN in a loop, such as in this case:
y = ...      # initial value
state = ...  # initial state
rnn = ...    # some sort of RNN cell

with tf.variable_scope("rnn") as scope:
    for t in range(10):
        y, state = rnn(y, state)
        # After the first step the variables exist, so mark them for reuse.
        scope.reuse_variables()
In this case it will reuse the RNN weights between time steps, which is the desired behavior for an RNN.