I want to use the external optimizer interface within TensorFlow in order to use Newton-type optimizers, since tf.train only offers first-order gradient-descent optimizers. At the same time, I want to build my network using tf.keras.layers, as it is much easier than managing tf.Variables by hand when building large, complex networks. I will illustrate my issue with the following simple 1D linear regression example:
import tensorflow as tf
from tensorflow.keras import backend as K
import numpy as np
#generate data
no = 100
data_x = np.linspace(0,1,no)
data_y = 2 * data_x + 2 + np.random.uniform(-0.5,0.5,no)
data_y = data_y.reshape(no,1)
data_x = data_x.reshape(no,1)
# Make model using keras layers and train
x = tf.placeholder(dtype=tf.float32, shape=[None,1])
y = tf.placeholder(dtype=tf.float32, shape=[None,1])
output = tf.keras.layers.Dense(1, activation=None)(x)
loss = tf.losses.mean_squared_error(data_y, output)
optimizer = tf.contrib.opt.ScipyOptimizerInterface(loss, method="L-BFGS-B")
sess = K.get_session()
sess.run(tf.global_variables_initializer())
tf_dict = {x : data_x, y : data_y}
optimizer.minimize(sess, feed_dict = tf_dict, fetches=[loss], loss_callback=lambda x: print("Loss:", x))
When running this, the loss does not change at all. With any optimizer from tf.train it works fine, and it also works with the ScipyOptimizerInterface when I use tf.layers.Dense() instead of tf.keras.layers.Dense(). So the real question is: what is the difference between tf.keras.layers.Dense() and tf.layers.Dense()? I saw that the variables created by tf.layers.Dense() are of type tf.float32_ref, while the variables created by tf.keras.layers.Dense() are of type tf.float32. As far as I know, _ref indicates that the tensor is mutable, so maybe that's the issue? But then again, any other optimizer from tf.train works fine with the Keras layers.
Thanks
After a lot of digging I was able to find a possible explanation.
ScipyOptimizerInterface uses feed_dicts to simulate the updates of your variables during the optimization process; it only performs an assign operation at the very end. In contrast, the tf.train optimizers always perform assign operations. The code of ScipyOptimizerInterface is not that complex, so you can verify this easily.
Now the problem is that assigning variables with a feed_dict works mostly by accident. Here is a link where I learnt about this. In other words, assigning variables via feed dict, which is what ScipyOptimizerInterface does, is a hacky way of doing updates.
Now this hack mostly works, except when it does not. tf.keras.layers.Dense uses ResourceVariables to model the weights of the model. This is an improved version of plain Variables with cleaner read/write semantics. The problem is that under the new semantics the feed-dict update happens after the loss calculation. The link above gives some explanations.
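You can check which kind of variable your layers created with a quick diagnostic (my own sketch, TF 1.x; it assumes the graph from the question has already been built):
for v in tf.trainable_variables():
    # tf.keras.layers.Dense creates ResourceVariable, tf.layers.Dense a plain (ref) Variable
    print(v.name, type(v).__name__)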
Now tf.layers is currently a thin wrapper around tf.keras.layers, so I am not sure why it would work. Maybe there is some compatibility check somewhere in the code.
The solutions to address this are somewhat simple.
Either avoid using components that use ResourceVariables. This can be kind of difficult (a sketch of this option follows below).
Or patch ScipyOptimizerInterface to always perform assignments for variables. This is relatively easy since all the required code is in one file.
There was also some effort to make the interface work with eager mode (which uses ResourceVariables by default). Check out this link.
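For completeness, a minimal sketch of the first option, based purely on the observation in the question that tf.layers.Dense (TF 1.x) already works with the interface; only the layer construction changes, the rest of the script stays the same:
# tf.layers.Dense creates plain (ref) variables, which ScipyOptimizerInterface updates correctly
output = tf.layers.Dense(1, activation=None)(x)
loss = tf.losses.mean_squared_error(data_y, output)
optimizer = tf.contrib.opt.ScipyOptimizerInterface(loss, method="L-BFGS-B")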
I think the problem is with the line
output = tf.keras.layers.Dense(1, activation=None)(x)
In this format, output is not a layer but rather the output of a layer, which might be preventing the wrapper from collecting the weights and biases of the layer and feeding them to the optimizer. Try writing it in two lines, e.g.
output = tf.keras.layers.Dense(1, activation=None)
res = output(x)
If you want to keep the original format, then you might have to collect all trainables manually and feed them to the optimizer via the var_list option, for example:
optimizer = tf.contrib.opt.ScipyOptimizerInterface(loss, var_list=tf.trainable_variables(), method="L-BFGS-B")
Hope this helps.
Related
I'm using the functional API of TensorFlow 2 and tensorflow.keras.layers to build the model.
I have an input tensor (in_1) with shape [batch_size, length, dim] and I would like to compute the mean along the length dimension and obtain an output tensor (out_1) with shape [batch_size, dim].
Which of these should I use to do it? (All of these options work, in terms of output shape and training.)
out_1 = Lambda(lambda x: tf.math.reduce_mean(x, axis=1))(in_1)
out_1 = Lambda(lambda x: tf.keras.backend.mean(x, axis=1))(in_1)
out_1 = tf.math.reduce_mean(in_1, axis=1)
This last one automatically creates a TensorFlowOpLayer; is this something that should be avoided?
Are there other ways to do this?
What's the difference between tf.math.reduce_mean and tf.keras.backend.mean, and which should I use?
I know that custom functions should be called inside a Lambda layer, but is that also true for TensorFlow functions such as tf.math.reduce_mean, which can process the tensor in "one fell swoop"? How should I call them if I need to specify a parameter (e.g. axis)?
First, for the difference between tf.keras.backend.mean and tf.math.reduce_mean: there is none. You can check the source code for the Keras backend version, which simply uses reduce_mean (from math_ops, but internally that's the same one that's exposed in tf.math). IMHO this is a bit of a failure in the TF re-design where they incorporated Keras: Keras is now contained in TF, but Keras also uses TF as its "backend", so you basically have every operation twice: once the TF version, and once the Keras version, which in turn just calls the TF version.
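A quick sanity check of that claim in TF 2 eager mode (my own sketch, not from the original answer):
import tensorflow as tf

x = tf.random.normal((2, 10, 4))
# both calls reduce to the same underlying op, so the results are identical
print(tf.reduce_all(tf.equal(tf.math.reduce_mean(x, axis=1),
                             tf.keras.backend.mean(x, axis=1))).numpy())  # True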
Anyway, for the difference between using Lambda or not: It also doesn't (really) matter. Here is a minimal example:
inp = tf.keras.Input((10,))
layer = tf.reduce_mean(inp, axis=-1)
model = tf.keras.Model(inp, layer)
print(model.layers)
gives the output
[<tensorflow.python.keras.engine.input_layer.InputLayer at 0x7f1a651500b8>,
<tensorflow.python.keras.engine.base_layer.TensorFlowOpLayer at 0x7f1a9912d8d0>]
We can see that the reduce_mean operation was automatically converted to a TensorFlowOpLayer. Now, this may be technically different from a Lambda layer, but I doubt that this makes any practical difference. I suppose this would not work for a Sequential model, where you need to supply a list of layers, so there Lambda would likely be needed.
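As a hedged side note on that last point, here is a minimal sketch of how the Lambda wrapper would look in a Sequential model (my own example, not from the original answer):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(10, 4)),
    # wrap the op in a Lambda because Sequential expects a list of layers
    tf.keras.layers.Lambda(lambda t: tf.math.reduce_mean(t, axis=1)),
])
print(model.output_shape)  # (None, 4)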
I am working on a GAN and I'm trying to diagnose how and why mode collapse occurs. I want to be able to look "under the hood" and see what the outputs of various layers in the network look like for the last minibatch. I saw you can do something like model.layers[5].output, but this produces a tensor of shape [None, 64, 64, 512], which looks like an empty tensor and not the actual output from the previous run. My only other idea is to recompile a model that is missing all the layers after the one I'm interested in and then run a minibatch through it, but this seems like an extremely inefficient way to do it. I'm wondering if there's an easier way. I want to run some statistics on layer outputs during the training process to see where things might be going wrong.
I did this for a GAN I was training myself. The method I used extends to both the generator (G) and discriminator (D) of a GAN.
The idea is to make a model with the same input as D or G, but with outputs according to each layer in the model that you require.
For me, I found it useful to check the activations. In Keras, given some model model (which will be D or G for you and me):
import numpy as np
# imports assume tf.keras; adjust the prefix if you use standalone Keras
from tensorflow.keras.layers import Activation, LeakyReLU
from tensorflow.keras.models import Model

activation_layers = []
activation_names = []

# obtain the layers in the given model, but skip the first 6,
# as these generally are the input / non-convolutional layers
model_layers = [layer for layer in model.layers][6:]
# print the names of what we are looking at
print("MODEL LAYERS:", model_layers)

for layer in model_layers:
    # check if the layer is an activation
    if isinstance(layer, (Activation, LeakyReLU)):
        # append the output of this layer for later
        activation_layers.append(layer.output)
        # name it with a signature of its output shape for clarity
        activation_names.append(layer.name + str(layer.output_shape[1]))

# now create a model which outputs every activation
activation_model = Model(inputs=model.inputs, outputs=activation_layers)

# this outputs a list of arrays when given an input, so for G:
noise = np.random.normal(size=(1, 32, 32, 1))  # random image shape (change for yourself)
model_activations = activation_model.predict(noise)
Now the rest is quite model-specific. This is the basic method for checking the outputs of the layers in a given model.
Note it can be done before, during or after training. It also does not need re-compiling.
The plotting of activation maps in this case is relatively straightforward and, as you mentioned, you will probably have something specific you want to do. Still, I have to link this beautiful example here.
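Building on the snippet above, a hypothetical sketch of what "running some statistics" on those activations might look like (my own addition, not part of the original answer; activation_names and model_activations come from the code above):
import numpy as np

for name, act in zip(activation_names, model_activations):
    # crude per-layer statistics, e.g. to spot dead or saturated units
    print(name,
          "mean:", np.mean(act),
          "std:", np.std(act),
          "frac zero:", np.mean(act == 0))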
I report here a useful two-line code block that I extrapolated from the answer of @Homer, which I used to inspect a single layer of the neural network.
ablation_model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
preds = ablation_model.predict(np.random.normal(size=(20,2))) # adapt size
I'm self-learning from Geron's "Hands-On Machine Learning" and I'm a little confused about how this function (in box [114] of the following notebook) creates a deep neural network.
https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb
he_init = tf.variance_scaling_initializer()

def dnn(inputs, n_hidden_layers=5, n_neurons=100, name=None,
        activation=tf.nn.elu, initializer=he_init):
    with tf.variable_scope(name, "dnn"):
        for layer in range(n_hidden_layers):
            inputs = tf.layers.dense(inputs, n_neurons, activation=activation,
                                     kernel_initializer=initializer,
                                     name="hidden%d" % (layer + 1))
        return inputs
It just looks like it resets the same input each time with a different name. Can someone explain how this is supposed to create a deep neural network?
There is a strong misconception about model construction in TensorFlow. You are advised to read more about TensorFlow's computational graph and other low-level details of this API in the official guide.
Operations built using TensorFlow are not bound to a Python variable
(assume that we are not in Eager mode for this answer). When calling one of the layer construction functions in tf.layers (or other basic functions such as the ones in tf.nn), that will add new operations to the currently active graph and return the Tensor corresponding to the output of that layer. The operations do not disappear when removing or altering the contents of the Python variables that used to hold these tensors.
What the function dnn does is iteratively create a sequence of dense layers. At each step, the variable inputs is changed to point to the output of the most recently created layer, allowing it to be "fed" into the next one. Whether to use the same variable as the original inputs or a new one for this is a matter of opinion (I often use a new variable net myself). By default, this will result in a sequence of 5 fully connected layers. Only the graph was constructed in all this; no network training or weight initialization procedures were actually applied here.
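To make the looping concrete, here is the same construction unrolled by hand for n_hidden_layers=3 (my own sketch; x is the network input, and the names follow the "hidden%d" pattern used in the function):
# each call adds a new dense layer with its own weights to the graph
h1 = tf.layers.dense(x,  n_neurons, activation=tf.nn.elu, kernel_initializer=he_init, name="hidden1")
h2 = tf.layers.dense(h1, n_neurons, activation=tf.nn.elu, kernel_initializer=he_init, name="hidden2")
h3 = tf.layers.dense(h2, n_neurons, activation=tf.nn.elu, kernel_initializer=he_init, name="hidden3")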
This can also be validated visually. The following code will write the graph's signature to a TensorFlow summary file:
he_init = tf.variance_scaling_initializer()

def dnn(inputs, n_hidden_layers=5, n_neurons=100, name=None,
        activation=tf.nn.elu, initializer=he_init):
    with tf.variable_scope(name, "dnn"):
        for layer in range(n_hidden_layers):
            inputs = tf.layers.dense(inputs, n_neurons, activation=activation,
                                     kernel_initializer=initializer,
                                     name="hidden%d" % (layer + 1))
        return inputs
x = tf.placeholder(tf.float32, [32, 128])
y = dnn(x)
writer = tf.summary.FileWriter(logdir='mydnn', graph=tf.get_default_graph())
writer.flush()
By opening the same log directory with TensorBoard, we get the following graph:
Problem
I'm running a deep neural network on MNIST, where the loss is defined as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, label))
The program seems to run correctly until I get a NaN loss somewhere after the 10,000th minibatch. Sometimes the program runs correctly until it finishes. I think tf.nn.softmax_cross_entropy_with_logits is giving me this error.
This is strange, because the code just contains mul and add operations.
Possible Solution
Maybe I can use:
if cost == "nan":
optimizer = an empty optimizer
else:
...
optimizer = real optimizer
But I cannot find the type of NaN. How can I check whether a variable is NaN or not?
How else can I solve this problem?
I found a similar problem here: TensorFlow cross_entropy NaN problem
Thanks to the author user1111929
Writing tf.nn.softmax_cross_entropy_with_logits out by hand as -tf.reduce_sum(y_*tf.log(y_conv))
is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
Replacing it with
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
Or
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
solved the NaN problem.
The reason you are getting NaNs is most likely that somewhere in your cost function or softmax you are trying to take the log of zero, which is not a number. But to answer your specific question about detecting NaN, Python has a built-in capability to test for NaN in the math module. For example:
import math

val = float('nan')
print(val)  # nan
if math.isnan(val):
    print('Detected NaN')
    import pdb; pdb.set_trace()  # break into the debugger to look around
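Complementing the Python-level check, here is a hedged sketch of a graph-level check with the TF 1.x API (my own addition; it assumes cost is the loss tensor from the question):
# tf.check_numerics raises an InvalidArgumentError as soon as the wrapped
# tensor contains a NaN or Inf, stopping training where the bad value appears
cost = tf.check_numerics(cost, message="cost is NaN or Inf")

# or fetch a boolean alongside the loss and branch in Python
cost_has_nan = tf.reduce_any(tf.is_nan(cost))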
Check your learning rate. The bigger your network, the more parameters there are to learn, which means you also need to decrease the learning rate.
I don't have your code or data, but tf.nn.softmax_cross_entropy_with_logits should be stable as long as the labels form a valid probability distribution (more info here). I assume your data does not meet this requirement. An analogous problem was also discussed here. This would lead you to either:
Implement your own softmax_cross_entropy_with_logits function, e.g. try (source):
# `shape` is assumed to be the shape of `logits`
epsilon = tf.constant(value=0.00001, shape=shape)
logits = logits + epsilon
softmax = tf.nn.softmax(logits)
cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])
Update your data so that it does have a valid probability distribution
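A hedged sketch of that second option (my own addition; it assumes the raw labels are non-negative and that pred and label are the tensors from the question):
# normalize each label row so it sums to 1, i.e. forms a valid probability
# distribution (keepdims requires TF 1.5+; older versions use keep_dims)
label = label / tf.reduce_sum(label, axis=1, keepdims=True)
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=label, logits=pred))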
I want to implement an autoencoder (to be exact, a stacked convolutional autoencoder).
Here I'd like to pretrain each layer first and then fine-tune.
So I created variables for the weights of each layer,
e.g. W_1 = tf.Variable(initial_value, name, trainable=True, etc.) for the first layer,
and I pretrained W_1 of the first layer.
Then I want to pretrain the weights of the second layer (W_2).
Here I should use W_1 for calculating the input of the second layer.
However, W_1 is trainable, so if I use W_1 directly then TensorFlow may train W_1 as well.
So I should create W_1_out, which keeps the value of W_1 but is not trainable.
To be honest, I tried to modify the code from this repository:
https://github.com/cmgreen210/TensorFlowDeepAutoencoder/blob/master/code/ae/autoencoder.py
At line 102 it creates a variable with the following code:
self[name_w + "_fixed"] = tf.Variable(tf.identity(self[name_w]),
name=name_w + "_fixed",
trainable=False)
However, this raises an error because it uses an uninitialized value.
What should I do to copy the variable but make it non-trainable, so I can pretrain the next layers?
Not sure if still relevant, but I'll try anyway.
Generally, what I do in a situation like that is the following:
Populate the (default) graph according to the model you are building, e.g. for the first training step just create the first convolutional layer W1 you mention. When you train the first layer you can store the saved model once training is finished, then reload it and add the ops required for the second layer W2. Or you can just build the whole graph for W1 from scratch again directly in the code and then add the ops for W2.
If you are using the restore mechanism provided by Tensorflow, you will have the advantage that the weights for W1 are already the pre-trained ones. If you don't use the restore mechanism, you will have to set the W1 weights manually, e.g. by doing something shown in the snippet further below.
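A minimal sketch of that save/restore flow with the TF 1.x API (my own illustration; layer1_vars is assumed to be the list holding W_1 and its bias, and sess the active session):
saver_w1 = tf.train.Saver(var_list=layer1_vars)

# ... after pretraining the first layer:
saver_w1.save(sess, "checkpoints/layer1.ckpt")

# ... later, once the graph has been rebuilt with the ops for W_2 added:
sess.run(tf.global_variables_initializer())        # initialize everything first
saver_w1.restore(sess, "checkpoints/layer1.ckpt")  # then load the pretrained W_1 values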
Then when you set up the training op, you can pass a list of variables as var_list to the optimizer, which explicitly tells the optimizer which parameters are updated in order to minimize the loss. If this is set to None (the default), it just uses what it can find in tf.trainable_variables(), which in turn is a collection of all tf.Variables that are trainable. Maybe check this answer, too, which basically says the same thing.
When using the var_list argument, graph collections come in handy. E.g. you could create a separate graph collection for every layer you want to train. The collection would contain the trainable variables for each layer and then you could very easily just retrieve the required collection and pass it as the var_list argument (see example below and/or the remark in the above linked documentation).
How to override the value of a variable: name is the name of the variable to be overridden, value is an array of the appropriate size and type, and sess is the session:
variable = tf.get_default_graph().get_tensor_by_name(name)
sess.run(tf.assign(variable, value))
Note that the name needs an additional :0 in the end, so e.g. if the weights of your layer are called 'weights1' the name in the example should be 'weights1:0'.
To add a tensor to a custom collection: Use something along the following lines:
tf.add_to_collection('layer1_tensors', weights1)
tf.add_to_collection('layer1_tensors', some_other_trainable_variable)
Note that the first line creates the collection because it does not yet exist and the second line adds the given tensor to the existing collection.
How to use the custom collection: Now you can do something like this:
# loss = some tensorflow op computing the loss
var_list = tf.get_collection_ref('layer1_tensors')
optim = tf.train.AdamOptimizer().minimize(loss=loss, var_list=var_list)
You could also use tf.get_collection('layer1_tensors'), which would return you a copy of the collection.
Of course, if you don't want to do any of this, you could just use trainable=False when creating the graph for all variables you don't want to be trainable, as you hinted at in your question. However, I don't like that option too much, because it requires you to pass booleans into the functions that populate your graph, which is very easily overlooked and thus error-prone. Also, even if you decide to do it like that, you would still have to restore the non-trainable variables manually.
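If you nevertheless want to keep the copy-the-variable approach from the question, here is a hedged sketch (TF 1.x, with a made-up shape) of one way to avoid the uninitialized-value error:
# initialized_value() makes the copy's initializer depend on W_1's initializer,
# so the graph can be initialized in one pass, unlike tf.identity(W_1)
W_1 = tf.Variable(tf.random_normal([784, 256]), name="W_1", trainable=True)
W_1_fixed = tf.Variable(W_1.initialized_value(), name="W_1_fixed",
                        trainable=False)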