Tensorflow NaN before softmax - python

I'm using Tensorflow to train a neural based reinforcement learning agent. The idea is the following, first I train an LSTM (on real data, to begin with a human bias) to predict what may be the the future actions of the human. Then I use the cell state of my LSTM as a state for reinforcement learning.
However, I've noticed a tendency for the internal state to shift with time, and therefore, the internal state is not directly suitable for Reinforcement Learning. It is apparently a common problem which is addressed with the normalized layer LSTM. I tried the tensorflow implementation of the NL LSTM but now, I cannot train it since TF puts NaN weights at its output.
I use the ADAM optimize with and tried different weights, nothing changes, I also tried to increase the number of units in my LSTM with no improvement.
Does anyone see the problem in my code ?
with tf.variable_scope("system_lstm") as scope:
no_units_system=128
_seq_system = tf.placeholder(tf.float32, [batch_size, max_length_system, system_inputShapeLen], name='seq_')
_seq_length_system = tf.placeholder(tf.int32, [batch_size], name='seq_length_')
cell_system = tf.contrib.rnn.LayerNormBasicLSTMCell(
no_units_system)
output_system, hidden_states_system = tf.nn.dynamic_rnn(
cell_system,
_seq_system,
sequence_length=_seq_length_system,
dtype=tf.float32
)
out2_system = tf.reshape(output_system, shape=[-1, no_units_system])
out2_system = tf.layers.dense(out2_system, system_outputShapeLen)
out_final_system = tf.reshape(out2_system, shape=[-1, max_length_system, system_outputShapeLen])
y_system_ = tf.placeholder(tf.float32, [None, max_length_system, system_outputShapeLen])
softmax_system = tf.nn.softmax(out_final_system, dim=-1)
loss_system = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=out_final_system, labels=y_system_))
optimizer = tf.train.AdamOptimizer(learning_rate=1e-10)
minimize_system = optimizer.minimize(loss_system)

Related

Inconsistency between GRU and RNN implementation

I'm trying to implement some custom GRU cells using Tensorflow. I need to stack those cells, and I wanted to inherit from tensorflow.keras.layers.GRU. However, when looking at the source code, I noticed that you can only pass a units argument to the __init__ of GRU, while RNN has an argument that is a list of RNNcell, and leverages it to stack those cells calling StackedRNNCells. Meanwhile, GRU only create one GRUCell.
For the paper I'm trying to implement, I actually need to stack GRUCell. Why are the implementation of RNN and GRU different?
While searching for the documentation for these classes to add links, I noticed something that may be tripping you up: there are (currently, just before the official TF 2.0 release) two GRUCell implementations in TensorFlow! There is a tf.nn.rnn_cell.GRUCell and a tf.keras.layers.GRUCell. It looks like the one from tf.nn.rnn_cell is deprecated, and the Keras one is the one you should use.
From what I can tell, the GRUCell has the same __call__() method signature as tf.keras.layers.LSTMCell and tf.keras.layers.SimpleRNNCell, and they all inherit from Layer. The RNN documentation gives some requirements on what the __call__() method of the objects you pass to its cell argument must do, but my guess is that all three of these should meet those requirements. You should be able to just use the same RNN framework and pass it a list of GRUCell objects instead of LSTMCell or SimpleRNNCell.
I can't test this right now, so I'm not sure if you pass a list of GRUCell objects or just GRU objects into RNN, but I think one of those should work.
train_graph = tf.Graph()
with train_graph.as_default():
# Initialize input placeholders
input_text = tf.placeholder(tf.int32, [None, None], name='input')
targets = tf.placeholder(tf.int32, [None, None], name='targets')
lr = tf.placeholder(tf.float32, name='learning_rate')
# Calculate text attributes
vocab_size = len(int_to_vocab)
input_text_shape = tf.shape(input_text)
# Build the RNN cell
lstm = tf.contrib.rnn.BasicLSTMCell(num_units=rnn_size)
drop_cell = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
cell = tf.contrib.rnn.MultiRNNCell([drop_cell] * num_layers)
# Set the initial state
initial_state = cell.zero_state(input_text_shape[0], tf.float32)
initial_state = tf.identity(initial_state, name='initial_state')
# Create word embedding as input to RNN
embed = tf.contrib.layers.embed_sequence(input_text, vocab_size, embed_dim)
# Build RNN
outputs, final_state = tf.nn.dynamic_rnn(cell, embed, dtype=tf.float32)
final_state = tf.identity(final_state, name='final_state')
# Take RNN output and make logits
logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn=None)
# Calculate the probability of generating each word
probs = tf.nn.softmax(logits, name='probs')
# Define loss function
cost = tf.contrib.seq2seq.sequence_loss(
logits,
targets,
tf.ones([input_text_shape[0], input_text_shape[1]])
)
# Learning rate optimizer
optimizer = tf.train.AdamOptimizer(learning_rate)
# Gradient clipping to avoid exploding gradients
gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
train_op = optimizer.apply_gradients(capped_gradients)

Back-propagation with complex parameters in Tensorflow

Currently I'm trying to train a complex data generated from telecom engineering models. The weights and biases are also complex. I have used the relu activation for the hidden layers as follows at the l-th layer:
A_l = tf.complex(tf.nn.relu(tf.real(Z_l)), tf.nn.relu(tf.imag(Z_l)))
But how to do it for the cost and optimizer, please? I am really confused because I'm a beginner in machine learning. I have gone through some papers about non-analytic functions, but none of them helped to use the Tensorflow API. For example: how do I rewrite the functions below?
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = Z_out, labels = y))
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
I have seen a recommendation to split the cost as real and imaginary parts as:
cost_R = .., cost_I = ...
but I didn't try it because I think the optimizer will be split and the optimization not be working. I used the relu activation for the hidden layers as follows at the l-th layer:
A_l = tf.complex(tf.nn.relu(tf.real(Z_l)), tf.nn.relu(tf.imag(Z_l)))
But how to the cost and optimizer?
Any help is much appreciated.

Understanding Regularization in Keras

I am trying to understand why regularization syntax in Keras looks the way that it does.
Roughly speaking, regularization is way to reduce overfitting by adding a penalty term to the loss function proportional to some function of the model weights. Therefore, I would expect that regularization would be defined as part of the specification of the model's loss function.
However, in Keras the regularization is defined on a per-layer basis. For instance, consider this regularized DNN model:
input = Input(name='the_input', shape=(None, input_shape))
x = Dense(units = 250, activation='tanh', name='dense_1', kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(x)
x = Dense(units = 28, name='dense_2',kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(x)
y_pred = Activation('softmax', name='softmax')(x)
mymodel= Model(inputs=input, outputs=y_pred)
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
I would have expected that the regularization arguments in the Dense layer were not needed and I could just write the last line more like:
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'], regularization='l2')
This is obviously wrong syntax, but I was hoping someone could elaborate for me a bit on why the regularizes are defined this way and what is actually happening when I use layer-level regularization.
The other thing I don't understand is under what circumstances would I use each or all of the three regularization options: (kernel_regularizer, activity_regularizer, bias_regularizer)?
Let's break down the components of your question:
Your expectation of regularisation is probably in line with a feed-forward network where yes the penalty term is applied to the weights of the overall network. But this is not necessarily the case when you have RNNs mixed with CNNs etc so Keras opts give fine grain control. Perhaps for easy setup, a regularisation at model level could be added to the API for all weights.
When you use layer regularisation, the base Layer class actually adds the regularising term to the loss which at training time penalises the corresponding layer's weights etc.
Now in Keras you can often apply regularisation to 3 different things as in Dense layer. Every layer has different kernels such recurrent etc, so for the question let's look at the ones you are interested in but the same roughly applies to all layers:
kernel: this applies to actual weights of the layer, in Dense it is the W of Wx+b.
bias: this is the bias vector of the weights, so you can apply a different regulariser for it, the b in Wx+b.
activity: is applied to the output vector, the y in y = f(Wx + b).

Image retraining in tensorflow, changing the simple softmax layer to multilayer CNN

Tensorflow has released a tutorial for transfer learning named Image retraining that can be found in here:
https://www.tensorflow.org/tutorials/image_retraining
What they are doing is using a pre-trained model on inception v3 and then they change only the very last layer (softmax regression layer) and train it on the new dataset. This is very understandable and in fact a common practice in transfer learning.
I have tried their method on my dataset (which is a small dataset) and I have applied all the suggestion to get a better result from the data augmentation to change the number of steps but I did not modify their code by any means. The accuracy I got is relatively bad ~70%.
I am thinking of the possibility of training a small neural network on top of the given model, namely, changing the last layer from a simple regression to a more sophisticated network.
Here is the part of their code where they modify the softmax layer:
def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor):
"""Adds a new softmax and fully-connected layer for training.
We need to retrain the top layer to identify our new classes, so this function
adds the right operations to the graph, along with some variables to hold the
weights, and then sets up all the gradients for the backward pass.
The set up for the softmax and fully-connected layers is based on:
https://tensorflow.org/versions/master/tutorials/mnist/beginners/index.html
Args:
class_count: Integer of how many categories of things we're trying to
recognize.
final_tensor_name: Name string for the new final node that produces results.
bottleneck_tensor: The output of the main CNN graph.
Returns:
The tensors for the training and cross entropy results, and tensors for the
bottleneck input and ground truth input.
"""
with tf.name_scope('input'):
bottleneck_input = tf.placeholder_with_default(
bottleneck_tensor, shape=[None, BOTTLENECK_TENSOR_SIZE],
name='BottleneckInputPlaceholder')
ground_truth_input = tf.placeholder(tf.float32,
[None, class_count],
name='GroundTruthInput')
# Organizing the following ops as `final_training_ops` so they're easier
# to see in TensorBoard
layer_name = 'final_training_ops'
with tf.name_scope(layer_name):
with tf.name_scope('weights'):
layer_weights = tf.Variable(tf.truncated_normal([BOTTLENECK_TENSOR_SIZE, class_count], stddev=0.001), name='final_weights')
variable_summaries(layer_weights)
with tf.name_scope('biases'):
layer_biases = tf.Variable(tf.zeros([class_count]), name='final_biases')
variable_summaries(layer_biases)
with tf.name_scope('Wx_plus_b'):
logits = tf.matmul(bottleneck_input, layer_weights) + layer_biases
tf.summary.histogram('pre_activations', logits)
final_tensor = tf.nn.softmax(logits, name=final_tensor_name)
tf.summary.histogram('activations', final_tensor)
with tf.name_scope('cross_entropy'):
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
labels=ground_truth_input, logits=logits)
with tf.name_scope('total'):
cross_entropy_mean = tf.reduce_mean(cross_entropy)
tf.summary.scalar('cross_entropy', cross_entropy_mean)
with tf.name_scope('train'):
train_step = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(
cross_entropy_mean)
return (train_step, cross_entropy_mean, bottleneck_input, ground_truth_input,
final_tensor)
def add_evaluation_step(result_tensor, ground_truth_tensor):
"""Inserts the operations we need to evaluate the accuracy of our results.
Args:
result_tensor: The new final node that produces results.
ground_truth_tensor: The node we feed ground truth data
into.
Returns:
Tuple of (evaluation step, prediction).
"""
with tf.name_scope('accuracy'):
with tf.name_scope('correct_prediction'):
prediction = tf.argmax(result_tensor, 1)
correct_prediction = tf.equal(
prediction, tf.argmax(ground_truth_tensor, 1))
with tf.name_scope('accuracy'):
evaluation_step = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
tf.summary.scalar('accuracy', evaluation_step)
return evaluation_step, prediction
However, I am facing two main problems. First, I am not if this a good idea or not? would I be just wasting in my effort in doing something useless? Second, they are using the simple MNIST tutorial as a model for the last layer, say that I would use their Expert MNIST tutorial (https://www.tensorflow.org/get_started/mnist/pros) I am lost on what to do or how to configure it?.
Any suggestions on what can I do?

Simple TensorFlow Neural Network minimizes cost function yet all results are close to 1

So I tried implementing the neural network from:
http://iamtrask.github.io/2015/07/12/basic-python-network/
but using TensorFlow instead. I printed out the cost function twice during training and the error is appears to be getting smaller according yet all the values in the output layer are close to 1 when only two of them should be. I imagine it might be something wrong with my maths but I'm not sure. There is no difference when I try with a hidden layer or use Error Squared as cost function. Here is my code:
import tensorflow as tf
import numpy as np
input_layer_size = 3
output_layer_size = 1
x = tf.placeholder(tf.float32, [None, input_layer_size]) #holds input values
y = tf.placeholder(tf.float32, [None, output_layer_size]) # holds true y values
tf.set_random_seed(1)
input_weights = tf.Variable(tf.random_normal([input_layer_size, output_layer_size]))
input_bias = tf.Variable(tf.random_normal([1, output_layer_size]))
output_layer_vals = tf.nn.sigmoid(tf.matmul(x, input_weights) + input_bias)
cross_entropy = -tf.reduce_sum(y * tf.log(output_layer_vals))
training = tf.train.AdamOptimizer(0.1).minimize(cross_entropy)
x_data = np.array(
[[0,0,1],
[0,1,1],
[1,0,1],
[1,1,1]])
y_data = np.reshape(np.array([0,0,1,1]).T, (4, 1))
with tf.Session() as ses:
init = tf.initialize_all_variables()
ses.run(init)
for _ in range(1000):
ses.run(training, feed_dict={x: x_data, y:y_data})
if _ % 500 == 0:
print(ses.run(output_layer_vals, feed_dict={x: x_data}))
print(ses.run(cross_entropy, feed_dict={x: x_data, y:y_data}))
print('\n\n')
And this is what it outputs:
[[ 0.82036656]
[ 0.96750367]
[ 0.87607527]
[ 0.97876281]]
0.21947 #first cross_entropy error
[[ 0.99937409]
[ 0.99998224]
[ 0.99992537]
[ 0.99999785]]
0.00062825 #second cross_entropy error, as you can see, it's smaller
First of all: you have no hidden layer. As far as I remember basic perceptrons could possibly model the XOR problem, but it needed some adjustments. However, AI is just invented by biology, but it does not model real neural networks exactly. Thus, you have to at least build an MLP (Multilayer perceptron), which consits of at least one input, one hidden and one output layer. The XOR problem needs at least two neurons + bias in the hidden layer to be solved correctly (with a high precision).
Additionally your learning rate is too high. 0.1 is a very high learning rate. To put it simply: it basically means that you update/adapt your current state by 10% of one single learning step. This lets your network forget about already learned invariants quickly. Usually the learning rate is something in between 1e-2 to 1e-6, depending on your problem, network size and general architecture.
Moreover you implemented the "simplified/short" version of cross-entropy. See wikipedia for the full version: cross-entropy. However, to avoid some edge cases TensorFlow already has its own version of cross-entropy: for example tf.nn.softmax_cross_entropy_with_logits.
Finally you should remember that the cross-entropy error is a logistic loss function that operates on probabilities of your classes. Although your sigmoid function squashes the output layer into an interval of [0, 1], this does only work in your case because you have one single output neuron. As soon as you have more than one output neuron, you also need the sum of the output layer to be exactly 1,0 in order to really describes probabilities for every class on the output layer.

Categories