Using the softmax layer within the objective function itself

Using the softmax layer within the objective function itself - python

This is going to be long and hard to describe so apologies in advance.
I have a regular CNN like network with standard MLP layers on top of it. On top of the MLP, I have a softmax layer too, however, unlike conventional networks, this is NOT fully connected to the MLP below and it consists of subgroups.
To further describe the softmax, it looks like this:
Neur1A Neur2A ... NeurNA Neur1B Neur2B ... NeurNB Neur1C Neur2C ...NeurNC
Group A Group B Group C
There are many more groups. Each group has a softmax that is independent from the other groups. So it is in a way, several independent classifications (even though it actually is not).
What I need is for the index of the activated neuron to be monotonically increasing between groups. For example, if I have Neuron5 in Group A activated, I want the activated neuron in group B to be >=5. Same with Group B and Group C and so on..
This softmax layer containing all the neurons for all groups is actually NOT my last layer and it is interestingly an intermediate one.
To achieve this monotonicity, I add another term to my loss function that penalizes non monotonic activated neuron indices. Here is some of the code:
The code for softmax layer and its output:
def compute_image_estimate(layer2_input):
estimated_yps= tf.zeros([FLAGS.batch_size,0],dtype=tf.int64)
for pix in xrange(NUM_CLASSES):
pixrow= int( pix/width)
rowdata= image_pixels[:, pixrow*width:(pixrow+1)*width]
with tf.variable_scope('layer2_'+'_'+str(pix)) as scope:
weights = _variable_with_weight_decay('weights', shape=[layer2_input.get_shape()[1], width], stddev=0.04, wd=0.0000000)
biases = _variable_on_cpu('biases', [width], tf.constant_initializer(0.1))
y = tf.nn.softmax(tf.matmul(layer2_input,weights) + biases)
argyp=width-1-tf.argmax(y,1)
argyp= tf.reshape(argyp,[FLAGS.batch_size,1])
estimated_yps=tf.concat(1,[estimated_yps,argyp])
return estimated_yps
The estimated_yps are passed onto a function that quantifies monotonicity:
def compute_monotonicity(yp):
sm= tf.zeros([FLAGS.batch_size])
for curr_row in xrange(height):
for curr_col in xrange(width-1):
pix= curr_row *width + curr_col
sm=sm+alpha * tf.to_float(tf.square(tf.minimum(0,tf.to_int32(yp[:,pix]-yp[:,pix+1]))))
return sm
and the loss function is:
def loss(estimated_yp, SOME_OTHER_THINGS):
tf.add_to_collection('losses', SOME_OTHER_THINGS)
monotonicity_metric= tf.reduce_mean( compute_monotonocity(estimated_yp) )
tf.add_to_collection('losses', monotonicity_metric)
return tf.add_n(tf.get_collection('losses'), name='total_loss')
Now my problem is, when I do not use SOME_OTHER_THINGS that are conventional metrics, I get ValueError: No gradients provided for any variable for the monotonocity metric.
Seems like gradients are not defined when the softmax layer outputs are used like this.
Am I doing something wrong? Any help would be appreciated.

Apologies.. I realized that the problem is that tf.argmax function obviously does not have a gradient defined.

Related

when do we not need activation function?

I wrote a very basic tensorflow model where I want to predict a number:
import tensorflow as tf
import numpy as np
def HW_numbers(x):
y = (2 * x) + 1
return y
x = np.array([1.0,2.0,3.0,4.0,5.0,6.0,7.0], dtype=float)
y = np.array(HW_numbers(x))
model = tf.keras.models.Sequential([tf.keras.layers.Dense(units=1,input_shape=[1])])
model.compile(optimizer='sgd',loss='mean_squared_error')
model.fit(x,y,epochs = 30)
print(model.predict([10.0]))
This above code works fine. But if I add an activation function in Dense layer, the prediction becomes weird. I have tried 'relu','sigmoid','tanh' etc.
My question is, why is that? What exactly is activation function doing in that single layer that messes up the prediction?
I have used Tensorflow 2.0

Currently, you are learning a linear function. As it can be described by a single neuron, you just need a single neuron to learn the function. On the other hand activation function is:
to learn and make sense of something really complicated and Non-linear complex functional mappings between the inputs and response variable. It introduces non-linear properties to our Network. Their main purpose is to convert an input signal of a node in an A-NN to an output signal. That output signal now is used as an input in the next layer in the stack.
Hence, as you have just a single neuron here (a specific case), you do not need to pass the value to the next layer. In other words, all hidden, input, and output layers are merged together. Hence, the activation function is not helpful for your case. Unless you want to make a decision base on the output of the neuron.

Your network consists of just one neuron. So what it does with with no activation function is to multiply your input with the neurons weight. This weight will eventually converge to something around 2.1.
But with relu as an activation function, only positive numbers are propagated through your network. So if your neuron's weight is initialized with a negative number, you will always get zero as an output. So with relu, you have a 50:50 chance to get good results.
With the activation functions tanh and sigmoid, the output of the neuron is limited to [-1,1] and [0, 1] respectively, so your output can't be more than one.
So for such a small neuronal network, these activation functions don't match the problem.

from_logits=True and from_logits=False get different training result for tf.losses.CategoricalCrossentropy for UNet

I am doing the image semantic segmentation job with unet, if I set the Softmax Activation for last layer like this:
...
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
conv10 = (Activation('softmax'))(conv9)
model = Model(inputs, conv10)
return model
...
and then using loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
The training will not converge even for only one training image.
But if I do not set the Softmax Activation for last layer like this:
...
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
model = Model(inputs, conv9)
return model
...
and then using loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
The training will converge for one training image.
My groundtruth dataset is generated like this:
X = []
Y = []
im = cv2.imread(impath)
X.append(im)
seg_labels = np.zeros((height, width, n_classes))
for spath in segpaths:
mask = cv2.imread(spath, 0)
seg_labels[:, :, c] += mask
Y.append(seg_labels.reshape(width*height, n_classes))
Why? Is there something wrong for my usage?
This is my experiment code of git: https://github.com/honeytidy/unet
You can checkout and run (can run on cpu). You can change the Activation layer and from_logits of CategoricalCrossentropy and see what i said.

Pushing the "softmax" activation into the cross-entropy loss layer significantly simplifies the loss computation and makes it more numerically stable.
It might be the case that in your example the numerical issues are significant enough to render the training process ineffective for the from_logits=False option.
You can find a derivation of the cross entropy loss (a special case of "info gain" loss) in this post. This derivation illustrates the numerical issues that are averted when combining softmax with cross entropy loss.

from_logits = True signifies the values of the loss obtained by the model are not normalized and is basically used when we don't have any softmax function in our model. For e.g. https://www.tensorflow.org/tutorials/generative/dcgan in this model they have not used a softmax activation function or in other words we can say it helps in numerical stability.

By default, all of the loss function implemented in Tensorflow for classification problem uses from_logits=False. Remember in case of classification problem, at the end of the prediction, usually one wants to produce output in terms of probabilities.
Just look at the image below, the last layer of the network(just before softmax function)
So the sequence is Neural Network ⇒ Last layer output ⇒ Softmax or Sigmoid function ⇒ Probability of each class.
For example in the case of a multi-class classification problem, where output can be y1, y2, ....... yn one wants to produce each output with some probability. (see the output layer). Now, this output layer will get compared in cross-entropy loss function with the true label.
Let us take an example where our network produced the output for the classification task. Assume your Neural Network is producing output, then you convert that output into probabilities using softmax function and calculate loss using a cross-entropy loss function
# output produced by the last layer of NN
nn_output_before_softmax = [3.2, 1.3, 0.2, 0.8]
# converting output of last layer of NN into probabilities by applying softmax
nn_output_after_softmax = tf.nn.softmax(nn_output_before_softmax)
# output converted into softmax after appling softmax
print(nn_output_after_softmax.numpy())
[0.77514964 0.11593805 0.03859243 0.07031998]
y_true = [1.0, 0.0, 0.0, 0.0]
Now there are two scenarios:
One is explicitly using the softmax (or sigmoid) function
One is not using softmax function separately and wants to include in the calculation of loss function
1) One is explicitly using the softmax (or sigmoid) function
When one is explicitly using softmax (or sigmoid) function, then, for the classification task, then there is a default option in TensorFlow loss function i.e. from_logits=False. So here TensorFlow is assuming that whatever the input that you will be feeding to the loss function are the probabilities, so no need to apply the softmax function.
# By default from_logits=False
loss_taking_prob = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
loss_1 = loss_taking_prob(y_true, nn_output_after_softmax)
print(loss_1)
tf.Tensor(0.25469932, shape=(), dtype=float32)
2) One is not using the softmax function separately and wants to include it in the calculation of the loss function. This means that whatever inputs you are providing to the loss function is not scaled (means inputs are just the number from -inf to +inf and not the probabilities). Here you are letting TensorFlow perform the softmax operation for you.
loss_taking_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss_2 = loss_taking_logits(y_true, nn_output_before_softmax)
print(loss_2)
tf.Tensor(0.2546992, shape=(), dtype=float32)
Please do remember that you using from_logits=False when it should be True leads to taking softmax of probabilities and producing incorrect model

I guess the problem comes from the softmax activation function. Looking at the doc I found that sotmax is applied to the last axis by default. Can you look at model.summary() and check if that is what you want ?

For softmax to work properly, you must make sure that:
You are using 'channels_last' as Keras default channel config.
This means the shapes in the model will be like (None, height, width, channels)
This seems to be your case because you are putting n_classes in the last axis. But it's also strange because you are using Conv2D and your output Y should be (1, height, width, n_classes) and not that strange shape you are using.
Your Y has only zeros and ones (not 0 and 255 as usually happens to images)
Check that Y.max() == 1 and Y.min() == 0
You may need to have Y = Y / 255.
Only one class is correct (your data does not have more than one path/channel with value = 1).
Check that (Y.sum(axis=-1) == 1).all() is True

Why prediction on activation values (Softmax) gives incorrect results?

I've implemented a basic neural network from scratch using Tensorflow and trained it on MNIST fashion dataset. It's trained correctly and outputs testing accuracy around ~88-90% over 10 classes.
Now I've written predict() function which predicts the class of given image using trained weights. Here is the code:
def predict(images, trained_parameters):
Ws, bs = [], []
parameters = {}
for param in trained_parameters.keys():
parameters[param] = tf.convert_to_tensor(trained_parameters[param])
X = tf.placeholder(tf.float32, [images.shape[0], None], name = 'X')
Z_L = forward_propagation(X, trained_parameters)
p = tf.argmax(Z_L) # Working fine
# p = tf.argmax(tf.nn.softmax(Z_L)) # not working if softmax is applied
with tf.Session() as session:
prediction = session.run(p, feed_dict={X: images})
return prediction
This uses forward_propagation() function which returns the weighted sum of the last layer (Z) and not the activitions (A) because of TensorFlows tf.nn.softmax_cross_entropy_with_logits() requires Z instead of A as it will calculate A by applying softmax Refer this link for details.
Now in predict() function, when I make predictions using Z instead of A (activations) it's working correctly. By if I calculate softmax on Z (which is activations A of the last layer) it's giving incorrect predictions.
Why it's giving correct predictions on weighted sums Z? We are not supposed to first apply softmax activation (and calculate A) and then make predictions?
Here is the link to my colab notebook if anyone wants to look at my entire code: Link to Notebook Gist
So what am I missing here?

Most TF functions, such as tf.nn.softmax, assume by default that the batch dimension is the first one - that is a common practice. Now, I noticed in your code that your batch dimension is the second, i.e. your output shape is (output_dim=10, batch_size=?), and as a result, tf.nn.softmax is computing the softmax activation along the batch dimension.
There is nothing wrong in not following the conventions - one just needs to be aware of them. Computing the argmax of the softmax along the first axis should yield the desired results (it is equivalent to taking the argmax of the logits):
p = tf.argmax(tf.nn.softmax(Z_L, axis=0))
Also, I would also recommend computing the argmax along the first axis in case more than one image is fed into the network.

How to reuse RNN in TensorFlow

I want to implement a model like DSSM (Deep Semantic Similarity Model).
I want to train one RNN model and use this model to get three hidden vector for three different inputs, and use these hidden vector to compute loss function.
I try to code in a variable scope with reuse=None like:
gru_cell = tf.nn.rnn_cell.GRUCell(size)
gru_cell = tf.nn.rnn_cell.DropoutWrapper(gru_cell,output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell([gru_cell] * 2, state_is_tuple=True)
embedding = tf.get_variable("embedding", [vocab_size, wordvec_size])
inputs = tf.nn.embedding_lookup(embedding, self._input_data)
inputs = tf.nn.dropout(inputs, 0.5)
with tf.variable_scope("rnn"):
_, self._states_2 = rnn_states_2[config.num_layers-1] = tf.nn.dynamic_rnn(cell, inputs, sequence_length=self.lengths, dtype=tf.float32)
self._states_1 = rnn_states_1[config.num_layers-1]
with tf.variable_scope("rnn", reuse=True):
_, rnn_states_2 = tf.nn.dynamic_rnn(cell,inputs,sequence_length=self.lengths,dtype=tf.float32)
self._states_2 = rnn_states_2[config.num_layers-1]
I use the same inputs and reuse the RNN model, but when I print 'self_states_1' and 'self_states_2', these two vectors are different.
I use with tf.variable_scope("rnn", reuse=True): to compute 'rnn_states_2' because I want to use the same RNN model like 'rnn_states_1'.
But why I get different hidden vectors with the same inputs and the same model?
Where did i go wrong?
Thanks for your answering.
Update:
I find the reason may be the 'tf.nn.rnn_cell.DropoutWrapper' , when I remove the drop out wrapper, the hidden vectors are same, when I add the drop out wrapper, these vector become different.
So, the new question is :
How to fix the part of vector which be 'dropped out' ? By setting the 'seed' parameter ?
When training a DSSM, should I fix the drop out action ?

If you structure your code to use tf.contrib.rnn.DropoutWrapper, you can set variational_recurrent=True in your wrapper, which causes the same dropout mask to be used at all steps, i.e. the dropout mask will be constant. Is that what you want?
Setting the seed parameter in tf.nn.dropout will just make sure that you get the same sequence of dropout masks every time you run with that seed. That does not mean the dropout mask will be constant, just that you'll always see the same dropout mask at a particular iteration. The mask will be different for every iteration.

Choosing from different cost function and activation function of a neural network

Recently I started toying with neural networks. I was trying to implement an AND gate with Tensorflow. I am having trouble understanding when to use different cost and activation functions. This is a basic neural network with only input and output layers, no hidden layers.
First I tried to implement it in this way. As you can see this is a poor implementation, but I think it gets the job done, at least in some way. So, I tried only the real outputs, no one hot true outputs. For activation functions, I used a sigmoid function and for cost function I used squared error cost function (I think its called that, correct me if I'm wrong).
I've tried using ReLU and Softmax as activation functions (with the same cost function) and it doesn't work. I figured out why they don't work. I also tried the sigmoid function with Cross Entropy cost function, it also doesn't work.
import tensorflow as tf
import numpy
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])
x = tf.placeholder("float",[None, 2])
y = tf.placeholder("float",[None, 1])
W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))
activation = tf.nn.sigmoid(tf.matmul(x, W)+b)
cost = tf.reduce_sum(tf.square(activation - y))/4
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(cost)
init = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init)
for i in range(5000):
train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
result = sess.run(activation, feed_dict={x:train_X})
print(result)
after 5000 iterations:
[[ 0.0031316 ]
[ 0.12012422]
[ 0.12012422]
[ 0.85576665]]
Question 1 - Is there any other activation function and cost function, that can work(learn) for the above network, without changing the parameters(meaning without changing W, x, b).
Question 2 - I read from a StackOverflow post here:
[Activation Function] selection depends on the problem.
So there are no cost functions that can be used anywhere? I mean there is no standard cost function that can be used on any neural network. Right? Please correct me on this.
I also implemented the AND gate with a different approach, with the output as one-hot true. As you can see the train_Y [1,0] means that the 0th index is 1, so the answer is 0. I hope you get it.
Here I have used a softmax activation function, with cross entropy as cost function. Sigmoid function as activation function fails miserably.
import tensorflow as tf
import numpy
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[1,0],[1,0],[1,0],[0,1]])
x = tf.placeholder("float",[None, 2])
y = tf.placeholder("float",[None, 2])
W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))
activation = tf.nn.softmax(tf.matmul(x, W)+b)
cost = -tf.reduce_sum(y*tf.log(activation))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cost)
init = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init)
for i in range(5000):
train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
result = sess.run(activation, feed_dict={x:train_X})
print(result)
after 5000 iteration
[[ 1.00000000e+00 1.41971401e-09]
[ 9.98996437e-01 1.00352429e-03]
[ 9.98996437e-01 1.00352429e-03]
[ 1.40495342e-03 9.98595059e-01]]
Question 3 So in this case what cost function and activation function can I use? How do I understand what type of cost and activation functions I should use? Is there a standard way or rule, or just experience only? Should I have to try every cost and activation function in a brute force manner? I found an answer here. But I am hoping for a more elaborate explanation.
Question 4 I have noticed that it takes many iterations to converge to a near accurate prediction. I think the convergance rate depends on the learning rate (using too large of will miss the solution) and the cost function (correct me if I'm wrong). So, is there any optimal way (meaning the fastest) or cost function for converging to a correct solution?

I will answer your questions a little bit out of order, starting with more general answers, and finishing with those specific to your particular experiment.
Activation functions Different activation functions, in fact, do have different properties. Let's first consider an activation function between two layers of a neural network. The only purpose of an activation function there is to serve as an nonlinearity. If you do not put an activation function between two layers, then two layers together will serve no better than one, because their effect will still be just a linear transformation. For a long while people were using sigmoid function and tanh, choosing pretty much arbitrarily, with sigmoid being more popular, until recently, when ReLU became the dominant nonleniarity. The reason why people use ReLU between layers is because it is non-saturating (and is also faster to compute). Think about the graph of a sigmoid function. If the absolute value of x is large, then the derivative of the sigmoid function is small, which means that as we propagate the error backwards, the gradient of the error will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired will not be changed by the activation unit at all and will not slow down the gradient descent.
For the last layer of the network the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all others zeros, but there's no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it.
Your example. Now let's look at your example. Your first example tries to compute the output of AND in a following form:
sigmoid(W1 * x1 + W2 * x2 + B)
Note that W1 and W2 will always converge to the same value, because the output for (x1, x2) should be equal to the output of (x2, x1). Therefore, the model that you are fitting is:
sigmoid(W * (x1 + x2) + B)
x1 + x2 can only take one of three values (0, 1 or 2) and you want to return 0 for the case when x1 + x2 < 2 and 1 for the case when x1 + x2 = 2. Since the sigmoid function is rather smooth, it will take very large values of W and B to make the output close to the desired, but because of a small learning rate they can't get to those large values fast. Increasing the learning rate in your first example will increase the speed of convergence.
Your second example converges better because the softmax function is good at making precisely one output be equal to 1 and all others to 0. Since this is precisely your case, it does converge quickly. Note that sigmoid would also eventually converge to good values, but it will take significantly more iterations (or higher learning rate).
What to use. Now to the last question, how does one choose which activation and cost functions to use. These advices will work for majority of cases:
If you do classification, use softmax for the last layer's nonlinearity and cross entropy as a cost function.
If you do regression, use sigmoid or tanh for the last layer's nonlinearity and squared error as a cost function.
Use ReLU as a nonlienearity between layers.
Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum for faster convergence,

Cost function and activation function play an important role in the learning phase of a neural network.
The activation function, as explained in the first answer, gives the possibility to the network to learn non-linear functions, besides assuring to have small change in the output in response of small change in the input. A sigmoid activation function works well for these assumptions. Other activation functions do the same but may be less computational expensive, see activation functions for completeness. But, in general Sigmoid activation function should be avoid because the vanishing gradient problem.
The cost function C plays a crucial role in the speed of learning of the neural network. Gradient-based neural networks learn in an iterative way by minimising the cost function, so computing the gradient of the cost function, and changing the weights in according to it. If a quadratic cost function is used, this means that its gradient with respect the weights, is proportional to the activation function first derivate. Now, if a sigmoid activation function is used this implies that when the output is close to 1 the derivate is very small, as you can see from the image, and so the neurons learns slow.
The cross-entropy cost function allows to avoid this problem. Even if you are using a sigmoid function, using a cross-entropy function as cost function, implies that its derivates with respect to the weights are not more proportional to the first derivate of the activation function, as happened with the quadratic function , but instead they are proportional to the output error. This implies that when the prediction output is far away to the target your network learns more quickly, and viceversa.
Cross-entropy cost function should be used always instead of using a quadratic cost function, for classification problem, for the above explained.
Note that, in neural networks the cross-entropy function has not always the same meaning as the cross-entropy function you meet in probability, there it is used to compare two probability distribution. In neural networks this can be true if you have a unique sigmoid output to the final layer and want to think about it as a probability distribution. But this losses meaning if you have multi-sigmoid neurons at the final layer.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using the softmax layer within the objective function itself - python

Apologies.. I realized that the problem is that tf.argmax function obviously does not have a gradient defined.

Related

when do we not need activation function?

from_logits=True and from_logits=False get different training result for tf.losses.CategoricalCrossentropy for UNet

Why prediction on activation values (Softmax) gives incorrect results?

How to reuse RNN in TensorFlow

Choosing from different cost function and activation function of a neural network

Categories

Resources