I have a doubt about fitting a neural network to a regression problem. I preprocessed the predictors (features) of my train and test data using the Imputer and scale methods from sklearn.preprocessing, but I did not preprocess the class or target of my train or test data.
In my network architecture, all the layers have relu as the activation function except the last layer, which has the sigmoid function. I chose sigmoid for the last layer because the values of the predictions are between 0 and 1.
tl;dr: In summary, my question is: should I deprocess the output of my neural net? If I don't use the sigmoid function, some of my output values are < 0 or > 1. In this case, how should I do it?
Thanks
Usually, if you are doing regression you should use a 'linear' activation in the last layer. A sigmoid function will 'favor' values closer to 0 and 1, so it would be harder for your model to output intermediate values.
If the distribution of your targets is Gaussian or uniform I would go with a linear output layer. De-processing shouldn't be necessary unless you have very large targets.
I am currently building a neural network to predict features such as temperature. So the output for this could be a positive or negative value. I am normalizing my input data and using the tanh activation function in each hidden layer.
Should I use a linear activation function for the output layer to get an unbounded continuous output, or should I use tanh for the output layer and then inverse normalize the output? Could someone explain this? I don't think my understanding is correct.
You are on the right track.
Option 1:
Normalize the temperatures first and then feed them to the model. Say your temperature ranges over [-100, 100]: convert it to a scale of [-1, 1], then use this scaled version of the temperature as your target variable.
At prediction time, just inverse transform the output and you will get your desired result.
Option 2:
Create a regression-style neural network and don't apply any activation function to the output layer (meaning no bounds on the values; they can be positive or negative).
In this case you are not required to normalize the values.
Sample NN spec:
Input layer ==> one neuron per feature
Hidden layer ==> relu/selu as activation function | number of neurons and hidden layers as per your requirement
Output layer ==> 1 neuron, no activation function required
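As a rough illustration of Option 1, here is a minimal sketch of scaling targets to [-1, 1] and inverting the scaling at prediction time. The helper name make_target_scaler and the sample temperatures are made up for this example; the [-100, 100] range is the one assumed in the answer above.

```python
def make_target_scaler(t_min, t_max):
    """Return (scale, unscale) functions mapping [t_min, t_max] <-> [-1, 1]."""
    span = t_max - t_min

    def scale(t):
        # Linear map: t_min -> -1, t_max -> +1
        return 2.0 * (t - t_min) / span - 1.0

    def unscale(s):
        # Exact inverse of scale()
        return (s + 1.0) * span / 2.0 + t_min

    return scale, unscale


# Assumed temperature range from the example above (hypothetical data).
scale, unscale = make_target_scaler(-100.0, 100.0)

temperatures = [-100.0, 0.0, 25.0, 100.0]
scaled = [scale(t) for t in temperatures]      # train the network on these
restored = [unscale(s) for s in scaled]        # apply to the network's predictions

print(scaled)
print(restored)
```

With Option 2 (linear output, unscaled targets) neither step is needed, at the cost of the network having to learn the target's scale itself.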
Python 3.7 tensorflow
I am experimenting with time series forecasting with TensorFlow.
I understand the second line creates an LSTM RNN, i.e. a Recurrent Neural Network of type Long Short-Term Memory.
Why do we need to add a Dense(1) layer at the end?
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
The documentation for Dense() says:
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Could you rephrase or elaborate on the need for Dense() here?
The following line
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
creates an LSTM layer which transforms the input sequence, whose steps each have size #features, into a latent representation of size 32 (by default only the output of the last time step is returned). You want to predict a single value, so you need to convert this latent representation of size 32 into a single value. Hence, you add the following line
single_step_model.add(tf.keras.layers.Dense(1))
which adds a Dense Layer (Fully-Connected Neural Network) with one neuron in the output which, obviously, produces a single value. Look at it as a way to transform an intermediate result of higher dimensionality into the final result.
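To make the "32 values in, 1 value out" step concrete, here is a small sketch of what a single-unit dense layer computes. It is plain Python rather than Keras, and the random vectors are stand-ins for the LSTM output and for trained weights, not values from the tutorial.

```python
import random

random.seed(0)


def dense_single_unit(inputs, kernel, bias):
    """output = dot(inputs, kernel) + bias, with no activation (linear)."""
    return sum(x * w for x, w in zip(inputs, kernel)) + bias


# Stand-in for the 32-dimensional vector the LSTM returns for one sequence.
lstm_output = [random.uniform(-1.0, 1.0) for _ in range(32)]
# Stand-ins for the kernel and bias a trained Dense(1) layer would hold.
kernel = [random.uniform(-0.1, 0.1) for _ in range(32)]
bias = 0.5

prediction = dense_single_unit(lstm_output, kernel, bias)
print(len(lstm_output), "values ->", prediction)
```

Because no activation is applied, the result is unbounded, which is what you want when the target is a real-valued quantity such as temperature.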
In the Time series forecasting tutorial you are following, they are trying to forecast temperature (6 hrs ahead), for which they use an LSTM followed by a Dense layer.
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
A Dense layer is nothing but a regular fully-connected NN layer. In this case it brings the output dimensionality down to 1, which should represent some proportionality (need not be linear) to the temperature you are trying to predict. There are other layers you could use as well; check out Keras Layers.
If you are confused about the input and output shape of LSTM, check out I/O Shape.
I am doing image semantic segmentation with U-Net. If I set a softmax activation for the last layer like this:
...
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
conv10 = (Activation('softmax'))(conv9)
model = Model(inputs, conv10)
return model
...
and then use loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False),
the training does not converge, even for only one training image.
But if I do not set a softmax activation for the last layer, like this:
...
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
model = Model(inputs, conv9)
return model
...
and then use loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True),
the training converges for one training image.
My ground-truth dataset is generated like this:
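X = []
Y = []
im = cv2.imread(impath)
X.append(im)
seg_labels = np.zeros((height, width, n_classes))
for spath in segpaths:
    # c is the class index of this mask (it is not defined in the snippet as posted)
    mask = cv2.imread(spath, 0)  # flag 0 reads the mask as grayscale
    seg_labels[:, :, c] += mask
Y.append(seg_labels.reshape(width*height, n_classes))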
Why? Is there something wrong with my usage?
This is the git repository with my experiment code: https://github.com/honeytidy/unet
You can check it out and run it (it runs on CPU). You can change the Activation layer and the from_logits argument of CategoricalCrossentropy to see what I described.
Pushing the "softmax" activation into the cross-entropy loss layer significantly simplifies the loss computation and makes it more numerically stable.
It might be the case that in your example the numerical issues are significant enough to render the training process ineffective for the from_logits=False option.
You can find a derivation of the cross entropy loss (a special case of "info gain" loss) in this post. This derivation illustrates the numerical issues that are averted when combining softmax with cross entropy loss.
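The stability argument can be sketched in a few lines of plain Python. This is an illustration of the general log-sum-exp idea only, not the actual Keras or TensorFlow implementation; the function names and the extreme logits are invented for the example.

```python
import math


def naive_softmax(logits):
    # Exponentiates the raw logits, so a large logit overflows.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_probs(y_true, probs):
    # Loss computed from probabilities (softmax applied beforehand).
    return -sum(t * math.log(p) for t, p in zip(y_true, probs) if t > 0)


def cross_entropy_from_logits(y_true, logits):
    # Fused softmax + cross-entropy using the log-sum-exp trick:
    # log softmax(z)_i = z_i - logsumexp(z), with the max subtracted first.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_sum_exp) for t, z in zip(y_true, logits) if t > 0)


# Moderate logits: both routes give the same loss.
logits = [3.2, 1.3, 0.2, 0.8]
y_true = [1.0, 0.0, 0.0, 0.0]
two_step = cross_entropy_from_probs(y_true, naive_softmax(logits))
fused = cross_entropy_from_logits(y_true, logits)
print(two_step, fused)

# Extreme logits: the naive softmax overflows, the fused form stays finite.
extreme_logits = [1000.0, 0.0, -1000.0]
extreme_true = [0.0, 1.0, 0.0]
try:
    naive_softmax(extreme_logits)
    overflowed = False
except OverflowError:
    overflowed = True
fused_extreme = cross_entropy_from_logits(extreme_true, extreme_logits)
print(overflowed, fused_extreme)
```

Real frameworks guard the probability route in their own ways (for example by clipping), so the failure mode in practice may be saturated or vanishing gradients rather than an exception.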
from_logits=True signifies that the values the model passes to the loss are not normalized; it is basically used when the model has no softmax function at its output. For example, the model in https://www.tensorflow.org/tutorials/generative/dcgan does not use a softmax activation function. In other words, it helps with numerical stability.
By default, the Keras cross-entropy loss classes in TensorFlow use from_logits=False. Remember that in a classification problem, at the end of the prediction, one usually wants to produce output in terms of probabilities.
[Image: the last layer of the network, just before the softmax function]
So the sequence is Neural Network ⇒ Last layer output ⇒ Softmax or Sigmoid function ⇒ Probability of each class.
For example, in a multi-class classification problem where the output can be y1, y2, ..., yn, one wants to produce each output with some probability (see the output layer). This output layer is then compared with the true label in the cross-entropy loss function.
Let us take an example where our network produces the output for a classification task. Assume your neural network produces an output; you then convert that output into probabilities using the softmax function and calculate the loss using a cross-entropy loss function.
# output produced by the last layer of NN
nn_output_before_softmax = [3.2, 1.3, 0.2, 0.8]
# converting output of last layer of NN into probabilities by applying softmax
nn_output_after_softmax = tf.nn.softmax(nn_output_before_softmax)
# output converted into probabilities after applying softmax
print(nn_output_after_softmax.numpy())
[0.77514964 0.11593805 0.03859243 0.07031998]
y_true = [1.0, 0.0, 0.0, 0.0]
Now there are two scenarios:
One is explicitly using the softmax (or sigmoid) function
One is not using the softmax function separately and wants to include it in the calculation of the loss function
1) One is explicitly using the softmax (or sigmoid) function
When one explicitly applies the softmax (or sigmoid) function for a classification task, the default option of the TensorFlow loss function, from_logits=False, is the right one. TensorFlow then assumes that whatever you feed to the loss function is already a set of probabilities, so it does not apply the softmax function again.
# By default from_logits=False
loss_taking_prob = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
loss_1 = loss_taking_prob(y_true, nn_output_after_softmax)
print(loss_1)
tf.Tensor(0.25469932, shape=(), dtype=float32)
2) One is not using the softmax function separately and wants to include it in the calculation of the loss function. This means that the inputs you provide to the loss function are not scaled (they are just numbers from -inf to +inf, not probabilities). Here you are letting TensorFlow perform the softmax operation for you.
loss_taking_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss_2 = loss_taking_logits(y_true, nn_output_before_softmax)
print(loss_2)
tf.Tensor(0.2546992, shape=(), dtype=float32)
Please remember that using from_logits=True when it should be False (that is, when the model already applies softmax) leads to taking the softmax of probabilities and produces an incorrect model.
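To see what goes wrong when softmax ends up applied twice, here is a small plain-Python sketch using the same numbers as above. It mimics the situation by hand rather than calling TensorFlow, and the helper names are made up for the example.

```python
import math


def softmax(values):
    m = max(values)                      # subtract the max for stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, probs):
    return -sum(t * math.log(p) for t, p in zip(y_true, probs) if t > 0)


logits = [3.2, 1.3, 0.2, 0.8]
y_true = [1.0, 0.0, 0.0, 0.0]

# Correct: softmax applied exactly once.
probs = softmax(logits)
loss_correct = cross_entropy(y_true, probs)

# Mistake: the probabilities are treated as logits and softmax is applied again.
probs_twice = softmax(probs)
loss_double = cross_entropy(y_true, probs_twice)

print(probs, loss_correct)
print(probs_twice, loss_double)

# Even a perfect one-hot prediction is flattened by the second softmax,
# so with 4 classes the top probability cannot exceed e / (e + 3).
best_possible = softmax([1.0, 0.0, 0.0, 0.0])[0]
print(best_possible)
```

The second softmax squeezes every prediction toward the uniform distribution, so the loss has a floor well above zero and the gradients it produces no longer reflect the model's real confidence.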
I guess the problem comes from the softmax activation function. Looking at the doc, I found that softmax is applied to the last axis by default. Can you look at model.summary() and check whether that is what you want?
For softmax to work properly, you must make sure that:
You are using 'channels_last' as Keras default channel config.
This means the shapes in the model will be like (None, height, width, channels)
This seems to be your case because you are putting n_classes in the last axis. But it's also strange because you are using Conv2D and your output Y should be (1, height, width, n_classes) and not that strange shape you are using.
Your Y has only zeros and ones (not 0 and 255 as usually happens to images)
Check that Y.max() == 1 and Y.min() == 0
You may need to have Y = Y / 255.
Only one class is correct (your data does not have more than one path/channel with value = 1).
Check that (Y.sum(axis=-1) == 1).all() is True
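The label checks listed above can be sketched without any image library. The tiny 3-pixel, 2-class label list below is hypothetical, and check_one_hot is a helper invented for this example; with NumPy arrays the same checks are the Y.max(), Y.min() and Y.sum(axis=-1) expressions given above.

```python
def check_one_hot(labels):
    """labels: list of pixels, each a list of per-class values."""
    flat = [v for pixel in labels for v in pixel]
    return {
        "min_is_0": min(flat) == 0,
        "max_is_1": max(flat) == 1,
        "one_class_per_pixel": all(sum(pixel) == 1 for pixel in labels),
    }


# Hypothetical labels as they come out of 0/255 mask images.
raw = [[255, 0], [0, 255], [255, 0]]
raw_report = check_one_hot(raw)
print(raw_report)

# Dividing by 255 turns the masks into proper one-hot targets.
scaled = [[v / 255 for v in pixel] for pixel in raw]
scaled_report = check_one_hot(scaled)
print(scaled_report)
```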
While measuring the accuracy of a CNN, I understand that I should compare the output of the softmax layer (the predicted label) with the target label. But even if I compare the logits (which, as I understand it, are the output of the last fully connected layer) with the target labels, I get almost the same accuracy. Here is the relevant part of my code:
matches = tf.equal(tf.argmax(y_pred,1),tf.argmax(y,1))
acc = tf.reduce_mean(tf.cast(matches,tf.float32))
where y_pred is the output of the final plain fully connected layer without any activation function (only matrix multiplication and bias addition, w*x+b):
y_pred = normal_full_layer(second_hidden_layer,6)
6 because I have 6 classes.
Here is the accuracy graph using y_pred:
Accuracy is around 96%
Now if I do the same (calculate accuracy) after applying softmax activation to y_pred, let's call it pred_softmax, I get almost the same accuracy:
pred_softmax = tf.nn.softmax(y_pred)
Accuracy Graph using softmax:
In fact the accuracy should be exactly equal. Taking the argmax of an array of logits should return the same as taking the argmax of the softmax of that array. This is because softmax preserves the ordering of the logits: a larger logit always maps to a larger probability.
The softmax function takes a set of outputs (an array) y and maps it to exp(y)/sum(exp(y)). The larger y[i] is, the larger softmax(y)[i] is, so it must be that argmax(y) == argmax(softmax(y)).
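This is easy to check numerically. The sketch below uses plain Python and a few made-up logit rows instead of the network from the question.

```python
import math


def softmax(values):
    m = max(values)                      # subtract the max for stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    # Index of the largest entry (first one wins on ties).
    return max(range(len(values)), key=lambda i: values[i])


# Made-up logits for three samples and four classes.
batch = [
    [3.2, 1.3, 0.2, 0.8],
    [-5.0, -1.0, -3.0, -2.0],
    [0.1, 0.2, 0.9, 0.3],
]

from_logits = [argmax(row) for row in batch]
from_softmax = [argmax(softmax(row)) for row in batch]
print(from_logits, from_softmax)
```

Since both lists of predicted classes are identical, any accuracy computed from them is identical too.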
I am referring to this sample code,
in the code snippet below:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
Now the NCE loss function is nothing but a single hidden layer neural network with softmax at the output layer [knowing it takes only a few negative samples].
This part of the graph will only update the weights of the network; it is not doing anything to the "embeddings" matrix/tensor.
So ideally, once the network is trained, we must pass the input once more through the embeddings matrix and then multiply by the transpose of "nce_weights" [considering it as a tied-weight auto-encoder, with the same weights at the input and output layers] to reach the hidden layer representation of each word, which we are calling word2vec (?)
But if you look at the later part of the code, the value of the embeddings matrix is being used as the word representation.
Even the TensorFlow doc for NCE loss mentions inputs (to which we are passing embed, which uses embeddings) as just the first-layer input activation values:
inputs: A Tensor of shape [batch_size, dim]. The forward activations of the input network.
Normal back-propagation stops at the first layer of the network.
Does this implementation of NCE loss go beyond that and propagate the loss to the input values (and hence to the embeddings)?
This seems an extra step?
Refer to this for why I am calling it an extra step; he has the same explanation.
What I have figured out from reading and going through TensorFlow is that,
although the entire thing is a single hidden layer neural network, indeed an auto-encoder, the weights are not tied, as I had assumed.
The encoder is made of the weight matrix embeddings and the decoder is made of nce_weights. And embed is nothing but the hidden layer output, given by multiplying the input with embeddings.
So with this, both embeddings and nce_weights are updated in the graph, and we can choose either of the two weight matrices; embeddings is preferred here.
Edit1:
Actually, for both tf.nn.nce_loss and tf.nn.sampled_softmax_loss, the weights and biases parameters define the input, Weights(transpose) X + bias, to the objective function, which can be logistic regression or a softmax function [refer].
But back-propagation/gradient descent goes all the way to the base of the graph you are building and does not stop at the weights and biases of the function. Hence the inputs parameter of both tf.nn.nce_loss and tf.nn.sampled_softmax_loss is also updated, and it is in turn built from the embeddings matrix.
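A toy example may help show why the embeddings receive gradients too. The sketch below is not tf.nn.nce_loss: it is a hand-written logistic loss on one embedding vector and one output weight vector, with made-up numbers, and it only illustrates that the chain rule reaches whatever produced the inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_fn(embed, weights, bias, label):
    # Logistic loss on the score weights . embed + bias
    z = sum(e * w for e, w in zip(embed, weights)) + bias
    p = sigmoid(z)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def gradients(embed, weights, bias, label):
    z = sum(e * w for e, w in zip(embed, weights)) + bias
    delta = sigmoid(z) - label                 # dLoss/dz
    grad_embed = [delta * w for w in weights]  # dLoss/d(embedding row)
    grad_weights = [delta * e for e in embed]  # dLoss/d(output weights)
    return grad_embed, grad_weights


# Made-up values: one embedding row, one output weight row, a positive label.
embed = [0.5, -0.3]
weights = [0.2, 0.4]
bias = 0.0
label = 1.0

grad_embed, grad_weights = gradients(embed, weights, bias, label)
print(grad_embed, grad_weights)

# Finite-difference check of the gradient with respect to embed[0].
eps = 1e-6
bumped = [embed[0] + eps, embed[1]]
numeric = (loss_fn(bumped, weights, bias, label)
           - loss_fn(embed, weights, bias, label)) / eps
print(numeric)
```

Both gradients are non-zero, so an optimizer step changes the embedding row as well as the output weights, which is what happens to the embeddings variable in the graph above.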