Tensorflow: CNN training converges at a vector of zeros

I'm a beginner in deep learning and have taken a few courses on Udacity. Recently I've been trying to build a deep network that detects hand joints in input depth images, but it doesn't seem to be working well. (My dataset is the ICVL Hand Posture Dataset.)
The network structure is shown here.
① A batch of input images, 240x320;
② An 8-channel convolutional layer with a 5x5 kernel;
③ A max pooling layer, ksize = stride = 2;
④ A fully-connected layer, weight.shape = [38400, 1024];
⑤ A fully-connected layer, weight.shape = [1024, 48].
After several epochs of training, the output of the last layer converges to a (0, 0, ..., 0) vector. I chose mean squared error as the loss function; its value stays above 40000 and doesn't seem to decrease.
The network is already about as simple as it can get, yet the problem remains. Could anyone offer any suggestions?
My main code is posted below:
image = tf.placeholder(tf.float32, [None, 240, 320, 1])
annotations = tf.placeholder(tf.float32, [None, 48])

W_convolution_layer1 = tf.Variable(tf.truncated_normal([5, 5, 1, 8], stddev=0.1))
b_convolution_layer1 = tf.Variable(tf.constant(0.1, shape=[8]))
h_convolution_layer1 = tf.nn.relu(
    tf.nn.conv2d(image, W_convolution_layer1, [1, 1, 1, 1], 'SAME') + b_convolution_layer1)
h_pooling_layer1 = tf.nn.max_pool(h_convolution_layer1, [1, 2, 2, 1], [1, 2, 2, 1], 'SAME')

W_fully_connected_layer1 = tf.Variable(tf.truncated_normal([120 * 160 * 8, 1024], stddev=0.1))
b_fully_connected_layer1 = tf.Variable(tf.constant(0.1, shape=[1024]))
h_pooling_flat = tf.reshape(h_pooling_layer1, [-1, 120 * 160 * 8])
h_fully_connected_layer1 = tf.nn.relu(
    tf.matmul(h_pooling_flat, W_fully_connected_layer1) + b_fully_connected_layer1)

W_fully_connected_layer2 = tf.Variable(tf.truncated_normal([1024, 48], stddev=0.1))
b_fully_connected_layer2 = tf.Variable(tf.constant(0.1, shape=[48]))
detection = tf.nn.relu(
    tf.matmul(h_fully_connected_layer1, W_fully_connected_layer2) + b_fully_connected_layer2)

mean_squared_error = tf.reduce_sum(tf.losses.mean_squared_error(annotations, detection))
training = tf.train.AdamOptimizer(1e-4).minimize(mean_squared_error)

# This data loader reads images and annotations and converts them into batches of numbers.
loader = ICVLDataLoader('../data/')

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in range(1000):
        # batch_images: a list with shape = [BATCH_SIZE, 240, 320, 1]
        # batch_annotations: a list with shape = [BATCH_SIZE, 48]
        [batch_images, batch_annotations] = loader.get_batch(100).to_1d_list()
        [x_, t_, l_, p_] = session.run([image, training, mean_squared_error, detection],
                                       feed_dict={image: batch_images, annotations: batch_annotations})
And it runs like this.

The main issue is likely the relu activation in the output layer. You should remove it, i.e. let detection simply be the result of a matrix multiplication. If you want to force the outputs to be positive, consider something like the exponential function instead.
While relu is a popular hidden activation, I see one major problem with using it as an output activation: as is well known, relu maps negative inputs to 0 -- crucially, the gradients there are also 0. In the output layer this means your network cannot learn from its mistakes whenever it produces outputs < 0 (which is likely to happen with random initialization). This can heavily impair the overall learning process.
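For concreteness, a minimal sketch of that change, reusing the variable names from the question's code; the softplus variant is only an option if strictly positive outputs are required:

# Output layer without relu: a plain affine map, so outputs below zero still produce gradients.
detection = tf.matmul(h_fully_connected_layer1, W_fully_connected_layer2) + b_fully_connected_layer2

# Optional (only if outputs must be positive): a smooth positive transform such as softplus.
# detection = tf.nn.softplus(
#     tf.matmul(h_fully_connected_layer1, W_fully_connected_layer2) + b_fully_connected_layer2)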

Related

Is there a way to add constraints to a neural network's output while still using a softmax activation function?

I am not a deep learning expert; I am learning to do this for my homework.
How can I make my neural network output a list of positive floats that sum to 1, while at the same time each element of the list is smaller than a threshold (0.4, for example)?
I tried adding some hidden layers before the output layer, but that didn't improve the results.
Here is where I started from:
def build_net(inputs, predictor, scope, trainable):
    with tf.name_scope(scope):
        if predictor == 'CNN':
            L = int(inputs.shape[2])
            N = int(inputs.shape[3])
            conv1_W = tf.Variable(tf.truncated_normal([1, L, N, 32], stddev=0.15), trainable=trainable)
            layer = tf.nn.conv2d(inputs, filter=conv1_W, padding='VALID', strides=[1, 1, 1, 1])
            norm1 = tf.layers.batch_normalization(layer)
            x = tf.nn.relu(norm1)
            conv3_W = tf.Variable(tf.random_normal([1, 1, 32, 1], stddev=0.15), trainable=trainable)
            conv3 = tf.nn.conv2d(x, filter=conv3_W, strides=[1, 1, 1, 1], padding='VALID')
            norm3 = tf.layers.batch_normalization(conv3)
            net = tf.nn.relu(norm3)
            net = tf.layers.flatten(net)
            return net

x = build_net(inputs, predictor, scope, trainable=trainable)
y = tf.placeholder(tf.float32, shape=[None] + [self.M])
network = tf.add(x, y)
w_init = tf.random_uniform_initializer(-0.0005, 0.0005)
outputs = tf.layers.dense(network, self.M, activation=tf.nn.softmax, kernel_initializer=w_init)
I am expecting that the outputs would still sum to 1, but that each element would be smaller than the specific threshold I set.
Thank you so much in advance for your help.
What you want to do is add a penalty in case any of the outputs is larger than the specified threshold; you can do this with the max function:
thresh = 0.4
strength = 10.0
reg_output = strength * tf.reduce_sum(tf.math.maximum(0.0, outputs - thresh), axis=-1)
Then you need to add reg_output to your loss so it is optimized together with the rest of the loss. strength is a tunable parameter that defines how strong the penalty for exceeding the threshold is; you have to tune it to your needs.
This penalty works by summing max(0, output - thresh) over the last dimension, which activates the penalty whenever an output is bigger than thresh. If it is smaller, the penalty is zero and does nothing.
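For completeness, a minimal sketch of wiring the penalty into training; base_loss stands in for whatever task loss is already being minimized (a hypothetical name, not from the question):

# reg_output has shape [batch]; average it over the batch and add it to the existing task loss.
total_loss = base_loss + tf.reduce_mean(reg_output)
train_op = tf.train.AdamOptimizer(1e-3).minimize(total_loss)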

Does bias in the convolutional layer really make a difference to the test accuracy?

I understand that biases are required in small networks to shift the activation function. But in the case of a deep network that has multiple CNN layers, pooling, dropout and other non-linear activations, does bias really make a difference? The convolutional filter learns local features, and the same bias is used for a given conv output channel.
This is not a dupe of this link. The linked question only explains the role of bias in a small neural network and does not attempt to explain its role in deep networks containing multiple CNN layers, dropout, pooling and non-linear activation functions.
I ran a simple experiment, and the results indicated that removing the bias from the conv layers made no difference to the final test accuracy.
Two models were trained, and their test accuracy is almost the same (slightly better in the one without bias):
model_with_bias,
model_without_bias (bias not added in the conv layers)
Are biases used only for historical reasons?
If using bias provides no gain in accuracy, shouldn't we omit them? There would be fewer parameters to learn.
I would be thankful if someone with deeper knowledge than mine could explain the significance (if any) of these biases in deep networks.
Here is the complete code and the experiment result: bias-VS-no_bias experiment
batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():
    # Input data.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size, image_size, num_channels))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables.
    layer1_weights = tf.Variable(tf.truncated_normal(
        [patch_size, patch_size, num_channels, depth], stddev=0.1))
    layer1_biases = tf.Variable(tf.zeros([depth]))
    layer2_weights = tf.Variable(tf.truncated_normal(
        [patch_size, patch_size, depth, depth], stddev=0.1))
    layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
    layer3_weights = tf.Variable(tf.truncated_normal(
        [image_size // 4 * image_size // 4 * depth, num_hidden], stddev=0.1))
    layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
    layer4_weights = tf.Variable(tf.truncated_normal(
        [num_hidden, num_labels], stddev=0.1))
    layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

    # Model with bias.
    def model_with_bias(data):
        conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv + layer1_biases)
        conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv + layer2_biases)
        shape = hidden.get_shape().as_list()
        reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
        hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        return tf.matmul(hidden, layer4_weights) + layer4_biases

    # Model without bias in the convolutional layers.
    def model_without_bias(data):
        conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv)  # layer1_biases is not added
        conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv)  # layer2_biases is not added
        shape = hidden.get_shape().as_list()
        reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
        # Biases are added only in the fully connected layers (layer 3 and layer 4).
        hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        return tf.matmul(hidden, layer4_weights) + layer4_biases

    # Training computation.
    logits_with_bias = model_with_bias(tf_train_dataset)
    loss_with_bias = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_with_bias))
    logits_without_bias = model_without_bias(tf_train_dataset)
    loss_without_bias = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_without_bias))

    # Optimizers.
    optimizer_with_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_with_bias)
    optimizer_without_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_without_bias)

    # Predictions for the training, validation, and test data.
    train_prediction_with_bias = tf.nn.softmax(logits_with_bias)
    valid_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_valid_dataset))
    test_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_test_dataset))

    # Predictions for the model without bias.
    train_prediction_without_bias = tf.nn.softmax(logits_without_bias)
    valid_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_valid_dataset))
    test_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_test_dataset))

num_steps = 1001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        session.run(optimizer_with_bias, feed_dict=feed_dict)
        session.run(optimizer_without_bias, feed_dict=feed_dict)
    print('Test accuracy(with bias): %.1f%%' % accuracy(test_prediction_with_bias.eval(), test_labels))
    print('Test accuracy(without bias): %.1f%%' % accuracy(test_prediction_without_bias.eval(), test_labels))
Output:
Initialized
Test accuracy(with bias): 90.5%
Test accuracy(without bias): 90.6%
Biases are tuned alongside weights by learning algorithms such as gradient descent. Biases differ from weights in that they are independent of the output from previous layers. Conceptually, a bias is caused by input from a neuron with a fixed activation of 1, and so it is updated by subtracting just the product of the delta value and the learning rate.
In a large model, removing the bias inputs makes very little difference, because each node can make a bias node out of the average activation of all of its inputs, which by the law of large numbers will be roughly normal. At the first layer, the ability for this to happen depends on your input distribution. In a small network you of course need a bias input, but in a large network, removing it makes almost no difference.
Although in a large network it makes little difference, it still depends on the network architecture. For instance, in an LSTM:
Most applications of LSTMs simply initialize the LSTMs with small random weights, which works well on many problems. But this initialization effectively sets the forget gate to 0.5. This introduces a vanishing gradient with a factor of 0.5 per timestep, which can cause problems whenever the long-term dependencies are particularly severe. This problem is addressed by simply initializing the forget gate's bias to a large value such as 1 or 2. By doing so, the forget gate will be initialized to a value that is close to 1, enabling gradient flow.
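As an aside (my own illustration, not part of the quoted paper), TensorFlow's LSTM cells expose exactly this knob through the forget_bias argument:

# forget_bias is added to the forget gate's pre-activation; the TF default of 1.0
# keeps the gate close to open early in training, helping gradients flow.
cell = tf.nn.rnn_cell.LSTMCell(num_units=128, forget_bias=1.0)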
See also:
The role of bias in neural networks
What is bias in a neural network
An Empirical Exploration of Recurrent Network Architectures
In most networks you have a batchnorm layer after the conv layer, which has its own bias (the learned beta shift). So if you have a batchnorm layer, there is no gain from a conv bias. See:
Can not use both bias and batch normalization in convolution layers
Otherwise, from a math perspective you are learning different functions. However, it turns out that, in particular if you have a very complex network for a simple problem, you might achieve almost the same thing without biases as with biases, just spending more parameters on it. In my experience, using a factor of 2-4 more parameters than needed rarely hurts performance in deep learning, in particular if you regularize. So it is hard to notice any difference. However, you might try using fewer channels (I don't think the depth of the network matters as much as the number of channels of the convolution) and see whether bias makes a difference. I would guess so.
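To make the batch normalization point concrete, here is a small sketch of the usual pattern (my own example using tf.layers names, not code from the question); the conv bias is disabled because batchnorm's learned beta shift plays the same role:

# inputs and is_training are assumed to exist in the surrounding model.
x = tf.layers.conv2d(inputs, filters=32, kernel_size=3, padding='same', use_bias=False)
x = tf.layers.batch_normalization(x, training=is_training)
x = tf.nn.relu(x)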

tf.get_variable() not returning changed weights

I have a program where I set up a neural network in Tensorflow that has convolutional layers and I'm trying to periodically output the filter weights as an image. I know my network is updating correctly based on the performance I'm tracking in Tensorboard (and I've validated that the weights are changing by printing them directly), but the weights image is always the same (seemingly random) values. I initialize my layers using
self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
self.image_in = tf.reshape(self.inputs, shape=[-1, int(input_pixels / view_width), view_width, 1])
self.conv1 = slim.conv2d(activation_fn=tf.nn.elu, inputs=self.image_in, num_outputs=16, kernel_size=[8, 8], stride=[4, 4], padding='VALID', scope="conv1")
self.conv2 = slim.conv2d(activation_fn=tf.nn.elu, inputs=self.conv1, num_outputs=32, kernel_size=[4, 4], stride=[2, 2], padding='VALID', scope="conv2")
hidden = slim.fully_connected(slim.flatten(self.conv2), 256, activation_fn=tf.nn.elu)
And once the training is running, every 100 iterations I save the weights using the functions provided here and
filters = ["conv1", "conv2"]
for filter in filters:
with tf.variable_scope(self.name + "/" + filter, reuse=True):
weights = tf.get_variable("weights")
grid = put_kernels_on_grid(weights)
scipy.misc.imsave('filters/' + filter + "_" + str(episode_count) + ".jpg", grid.eval()[0, :, :, 0])
Given that the weights in the filters are correctly updating, why would the weights returned by tf.get_variable() not update as well?

Tensorflow: Recurrent Neural Network Batch Training

I am trying to implement an RNN in Tensorflow. I am writing my own functions instead of using RNN cells, to practice.
The problem is sequence tagging. The input size is [32, 48, 900], where 32 is the batch size, 48 is the number of time steps and 900 is the vocab size (one-hot encoded vectors). The output is [32, 48, 145], where the first two dimensions are the same as the input, but the last dimension is the output vocabulary size (one-hot). Basically this is an NLP tagging problem.
I am getting the following error:
InvalidArgumentError (see above for traceback): logits and labels must
be same size: logits_size=[48,145] labels_size=[1536,145]
The actual labels_size is [32, 48, 145], but it merges the first two dimensions without my control. FYI 32 * 48 = 1536.
If I run my RNN with batch size 1, it works as expected. I could not figure out how to solve the issue. The problem occurs in the last line of the code.
I pasted the relevant part of the code:
inputs = tf.placeholder(shape=[None, self.seq_length, self.vocab_size], dtype=tf.float32, name="inputs")
targets = tf.placeholder(shape=[None, self.seq_length, self.output_vocab_size], dtype=tf.float32, name="targets")
init_state = tf.placeholder(shape=[1, self.hidden_size], dtype=tf.float32, name="state")
initializer = tf.random_normal_initializer(stddev=0.1)

with tf.variable_scope("RNN") as scope:
    hs_t = init_state
    ys = []
    for t, xs_t in enumerate(tf.split(inputs[0], self.seq_length, axis=0)):
        if t > 0: scope.reuse_variables()
        Wxh = tf.get_variable("Wxh", [self.vocab_size, self.hidden_size], initializer=initializer)
        Whh = tf.get_variable("Whh", [self.hidden_size, self.hidden_size], initializer=initializer)
        Why = tf.get_variable("Why", [self.hidden_size, self.output_vocab_size], initializer=initializer)
        bh = tf.get_variable("bh", [self.hidden_size], initializer=initializer)
        by = tf.get_variable("by", [self.output_vocab_size], initializer=initializer)

        hs_t = tf.tanh(tf.matmul(xs_t, Wxh) + tf.matmul(hs_t, Whh) + bh)
        ys_t = tf.matmul(hs_t, Why) + by
        ys.append(ys_t)

hprev = hs_t
output_softmax = tf.nn.softmax(ys)  # Get softmax for sampling
# outputs = tf.concat(ys, axis=0)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=ys))
The problem lies in the size of ys: it should be [32, 48, 145], but as written it only has size [48, 145], because the loop only ever reads inputs[0], i.e. the first batch element. With batch size 1 the target size is [1, 48, 145], which matches [48, 145] once the batch dimension is squeezed away, which is why that case works.
To solve the problem you can add a loop over the batch dimension instead of hard-coding inputs[0], for example (sketched in full below):
for i in range(inputs.get_shape()[0]):
    for t, xs_t in enumerate(tf.split(inputs[i], self.seq_length, axis=0)):
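Fleshing that out, a minimal sketch of the batched version, under the assumption that the batch size is statically known (a Python loop cannot iterate over a None dimension) and reusing the variable names from the question:

batch_size = 32  # must be statically known for a Python-level loop
ys = []
with tf.variable_scope("RNN") as scope:
    for i in range(batch_size):
        hs_t = init_state
        ys_i = []
        for t, xs_t in enumerate(tf.split(inputs[i], self.seq_length, axis=0)):
            if i > 0 or t > 0:
                scope.reuse_variables()
            Wxh = tf.get_variable("Wxh", [self.vocab_size, self.hidden_size], initializer=initializer)
            Whh = tf.get_variable("Whh", [self.hidden_size, self.hidden_size], initializer=initializer)
            Why = tf.get_variable("Why", [self.hidden_size, self.output_vocab_size], initializer=initializer)
            bh = tf.get_variable("bh", [self.hidden_size], initializer=initializer)
            by = tf.get_variable("by", [self.output_vocab_size], initializer=initializer)
            hs_t = tf.tanh(tf.matmul(xs_t, Wxh) + tf.matmul(hs_t, Whh) + bh)
            ys_i.append(tf.matmul(hs_t, Why) + by)
        ys.append(tf.concat(ys_i, axis=0))  # [48, 145] for this batch element

logits = tf.stack(ys, axis=0)  # [32, 48, 145], matching the targets placeholder
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=logits))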

Multiple Layer RNN Tensorflow

I have a two-layer LSTM network. (config.n_input is 3, config.n_steps is 5.)
I think this may be related to the shape of my inputs, but I'm not sure how to fix it. I tried changing the projection of the LSTM so that the inputs would be the same size, but that didn't work.
self.input_data = tf.placeholder(tf.float32, [None, config.n_steps, config.n_input], name='input')
# Tensorflow LSTM cell requires 2x n_hidden length (state & cell)
self.initial_state = tf.placeholder(tf.float32, [None, 2*config.n_hidden], name='state')
self.targets = tf.placeholder(tf.float32, [None, config.n_classes], name='target')
_X = tf.transpose(self.input_data, [1, 0, 2]) # permute n_steps and batch_size
_X = tf.reshape(_X, [-1, config.n_input]) # (n_steps*batch_size, n_input)
input_cell = rnn_cell.LSTMCell(num_units=config.n_hidden, input_size=3, num_proj=300, forget_bias=1.0)
print(input_cell.output_size)
inner_cell = rnn_cell.LSTMCell(num_units=config.n_hidden, input_size=300)
cells = [input_cell, inner_cell]
cell = rnn.rnn_cell.MultiRNNCell(cells)
It returns the following error when I attempt to run it:
tensorflow.python.pywrap_tensorflow.StatusNotOK: Invalid argument: Expected size[1] in [0, 0], but got 600
[[Node: RNN/MultiRNNCell/Cell1/Slice = Slice[Index=DT_INT32, T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_state_0/_3, RNN/MultiRNNCell/Cell1/Slice/begin, RNN/MultiRNNCell/Cell1/Slice/size)]]
Can anyone shed light on the error message, or suggest a way to fix this easily?
Add num_proj to your initial state:
# Tensorflow LSTM cell requires 2x n_hidden length (state & cell)
self.initial_state = tf.placeholder(tf.float32, [None, 2*config.n_hidden + 300], name='state')
This is quite an opaque error, and it might be a good idea for you to raise it on the TF GitHub issues page!
