softmax and sigmoid are giving same results in multiclass classification - python

I am building an LSTM model. I tested my model with both the softmax and the sigmoid activation function. According to the documentation, sigmoid is used for binary classification and softmax for multiclass classification, but in my case both give the same results. Why is that?
Here is my code:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Bidirectional, LSTM, Dense

embedding_vector_length = 128
max_length = 700

model = Sequential()
model.add(Embedding(len(tokenizer.word_index) + 1, embedding_vector_length, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=16, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Here are the predicted results:
[[2.72062905e-02 1.47979835e-03 4.44446778e-04 1.60833297e-05
4.15672457e-06 3.20438482e-02 9.38653767e-01 1.41544719e-04
5.55426550e-06 4.47654566e-06]
[2.31099591e-01 1.71699154e-03 1.32052042e-02 4.70457249e-04
8.86382014e-02 2.65704724e-03 6.54215395e-01 7.50611164e-03
4.89178114e-04 1.89376965e-06]
[1.24909900e-01 8.73659015e-01 9.71468398e-06 1.66079029e-04
1.05203628e-06 4.14116839e-05 3.97000113e-05 6.98190925e-05
1.10231712e-03 9.84829512e-07]

The sigmoid allows you to have high probability for all of your classes, some of them, or none of them. Example: classifying diseases in a chest x-ray image. The image might contain pneumonia, emphysema, and/or cancer, or none of those findings.
The softmax enforces that the sum of the probabilities of your output classes are equal to one, so in order to increase the probability of a particular class, your model must correspondingly decrease the probability of at least one of the other classes. Example: classifying images from the MNIST data set of handwritten digits. A single picture of a digit has only one true identity - the picture cannot be a 7 and an 8 at the same time.
So in your case, if the model is well trained, the predictions will not differ much whether you use sigmoid or softmax; the difference is that softmax forces the predictions to sum to 1, while sigmoid does not.
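A quick way to tell which activation actually produced a set of predictions is to sum each row: softmax rows sum to (almost exactly) 1, while sigmoid rows generally do not. A minimal sketch, using the first prediction row printed above:

import numpy as np

# First prediction row from the output above.
row = np.array([2.72062905e-02, 1.47979835e-03, 4.44446778e-04, 1.60833297e-05,
                4.15672457e-06, 3.20438482e-02, 9.38653767e-01, 1.41544719e-04,
                5.55426550e-06, 4.47654566e-06])

print(row.sum())                               # ~1.0, which is what softmax guarantees
print(np.isclose(row.sum(), 1.0, atol=1e-3))   # True -> consistent with a softmax head

A sigmoid head, by contrast, scores each class independently, so for a 10-class output the row sum could be anywhere between 0 and 10.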

Related

Sigmoid as a last layer in LSTM

I have a Keras classification model with an LSTM for a dataset with 4 attributes labeled into 2 classes (safe and unsafe). With sigmoid in the last layer I got a better accuracy, 98%, than with softmax.
My questions are:
1) If I use softmax in the last layer:
With softmax I have 2 output neurons, so I can compare the two scores and say which class the data belongs to.
For example, if score_safe = 1.2945 and score_unsafe = -9.0, then I can say this row of the dataset belongs to the safe class.
2) If I use sigmoid in the last layer:
Then I have to use just one output neuron, so how can I compare scores and say which class this row of the dataset belongs to?
model = Sequential()
model.add(LSTM(256, input_shape=(x_train.shape[1:]), activation='tanh', return_sequences=True))
#model.add(BatchNormalization())
model.add(Dense(128, activation='tanh'))
#model.add(BatchNormalization())
model.add(Dense(128, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
The output of a sigmoid is a single float between 0. and 1.
Typically, it is set such that if the output is below 0.5 the model is classifying as the first class (whichever class is represented as a 0 in your dataset). If the output is above 0.5 the model is classifying as the second class (represented as a 1 in your dataset).
The 0.5 threshold can be varied to introduce a bias toward one or the other class.
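As a minimal sketch of that decision rule (assuming model is the sigmoid network above and x_test is a hypothetical test array prepared the same way as x_train):

import numpy as np

# x_test is a hypothetical test array prepared like x_train.
probs = model.predict(x_test)              # shape (n_samples, 1), each value in (0, 1)

threshold = 0.5                            # raise/lower this to bias the decision
pred = (probs.ravel() >= threshold).astype(int)

# Map back to the class names (assumes 0 = 'safe' and 1 = 'unsafe';
# swap if your labels are encoded the other way).
class_names = np.array(['safe', 'unsafe'])
print(class_names[pred])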

Is there a way to use multi-label classification but count a prediction as correct when the model predicts only one of the labels, in Keras?

I have a dataset of weather forecasts and am trying to make a model that predicts which forecast will be more accurate the next day.
In order to do so, my y output is of the form y = [1, 0, 1, 0], because I have forecasts from 4 different organizations. A 1 means that this is the best forecast for the current record, and multiple ones mean that several forecasts were tied for best.
My problem is that I want the model to train on these data but also learn that correctly predicting just one of the best labels counts as a 100% correct answer, since I only need one of the equally good forecasts. I believe that the way I am doing it now 'shaves' accuracy off my evaluation. Is there a way to implement this in Keras? The architecture of the neural network is totally experimental and there is no specific reason why I chose it. This is the code I wrote. My training dataset consists of 6463 rows × 505 columns.
model = Sequential()
model.add(LSTM(150, activation='relu',activity_regularizer=regularizers.l2(l=0.0001)))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(24, activation='relu'))
model.add(Dense(4, activation='softmax'))
#LSTM
# reshape input to be 3D [samples, timesteps, features]
X_train_sc =X_train_sc.reshape((X_train_sc.shape[0], 1, X_train_sc.shape[1]))
X_test_sc = X_test_sc.reshape((X_test_sc.shape[0], 1,X_test_sc.shape[1]))
#validation set
x_val=X_train.iloc[-2000:-1300,0:505]
y_val=y_train[-2000:-1300]
x_val_sc=scaler.transform(x_val)
# reshape input to be 3D for LSTM[samples, timesteps, features]
x_val_sc =x_val_sc.reshape((x_val_sc.shape[0], 1, x_val_sc.shape[1]))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])
history= model.fit(x=X_train_sc, y=y_train ,validation_data=(x_val_sc,y_val), epochs=300, batch_size=24)
print(model.evaluate(X_test_sc,y_test))
yhat= model.predict(X_test_sc)
My accuracy is ~44%
If you want to make predictions of the form [1, 0, 1, 0], i.e. the model should predict the probability of belonging to each of the 4 classes independently, then it is called multi-label classification. What you have coded is multi-class classification.
Multi-label classification
Your last layer will be a dense layer of size 4 (one unit per class) with sigmoid activation, and you will use a binary_crossentropy loss.
import numpy as np
import keras

x = np.random.randn(100, 10, 1)
y = np.random.randint(0, 2, (100, 4))

model = keras.models.Sequential()
model.add(keras.layers.LSTM(16, activation='relu', input_shape=(10, 1), return_sequences=False))
model.add(keras.layers.Dense(8, activation='relu'))
model.add(keras.layers.Dense(4, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, y)
Check
print (model.predict(x))
Output
array([[0.5196002 , 0.52978194, 0.5009601 , 0.5036485 ],
[0.508756 , 0.5189857 , 0.5022978 , 0.50169533],
[0.5213044 , 0.5254892 , 0.51159555, 0.49724004],
[0.5144601 , 0.5264933 , 0.505496 , 0.5008205 ],
[0.50524575, 0.5147699 , 0.50287664, 0.5021702 ],
[0.521035 , 0.53326863, 0.49642274, 0.50102305],
.........
As you can see, the probabilities for each prediction do not sum to one; rather, each value is the probability of the sample belonging to the corresponding class. So if a probability is > 0.5 you can say that the sample belongs to that class.
On the other hand, if you use softmax the probabilities sum up to 1, so you assign the sample to the single class with the highest probability (argmax).
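Concretely, turning both kinds of output into class decisions could look like this (a sketch reusing the model and x from the snippet above; the softmax variant is hypothetical):

import numpy as np

# Multi-label (sigmoid head): threshold each class independently.
sigmoid_probs = model.predict(x)                   # shape (100, 4); rows need not sum to 1
multi_hot = (sigmoid_probs > 0.5).astype(int)      # e.g. [1, 0, 1, 0] is a valid prediction

# Multi-class (softmax head): rows sum to 1, so pick the single best class.
# With several classes the winning probability can be below 0.5, so argmax is
# the safer rule than a fixed threshold.
# softmax_probs = softmax_model.predict(x)         # hypothetical softmax model
# single_class = softmax_probs.argmax(axis=1)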

Loss Returns NaN When Using Regularizers On LSTM

When I train my LSTM model, it returns nan for the loss
I am using a single layer LSTM with a Dense softmax layer at the end for classification output
Adam optimizer
Categorical crossentropy loss function
Relu activation
For some reason, when I use regularizers of any kind on the LSTM layer I get NaN for the loss.
Also, this only happens when the LSTM layer has more than 128 units, so I can avoid it by dropping the regularization or keeping the layer at 128 units or fewer.
I have already confirmed that there are no NaNs in the input.
I'm wondering why this happens, and how it is possible to regularize larger LSTM layers.
Here's my code:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras import regularizers

def build_model():
    model = Sequential()
    model.add(LSTM(130, batch_input_shape=(None, 90, 5), return_sequences=False,
                   recurrent_dropout=0.1, kernel_regularizer=regularizers.l2(0.01)))
    model.add(Activation("relu"))
    model.add(Dense(2))
    model.add(Activation("softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["categorical_accuracy"])
    return model
Thanks

Training CNN with transfer learning in Keras - image input doesn't work but vector input does

I'm trying to do transfer learning in Keras. I set up a ResNet50 network set to not trainable with some extra layers:
# Image input
model = Sequential()
model.add(ResNet50(include_top=False, pooling='avg')) # output is 2048
model.add(Dropout(0.05))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.15))
model.add(Dense(512, activation='relu'))
model.add(Dense(7, activation='softmax'))
model.layers[0].trainable = False
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
Then I create the input data x_batch using ResNet50's preprocess_input function, along with the one-hot encoded labels y_batch, and do the fitting like so:
model.fit(x_batch,
          y_batch,
          epochs=nb_epochs,
          batch_size=64,
          shuffle=True,
          validation_split=0.2,
          callbacks=[lrate])
Training accuracy gets close to 100% after ten or so epochs, but validation accuracy actually decreases from around 50% to 30% with validation loss steadily increasing.
However if I instead create a network with just the last layers:
# Vector input
model2 = Sequential()
model2.add(Dropout(0.05, input_shape=(2048,)))
model2.add(Dense(512, activation='relu'))
model2.add(Dropout(0.15))
model2.add(Dense(512, activation='relu'))
model2.add(Dense(7, activation='softmax'))
model2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model2.summary()
and feed in the output of the ResNet50 prediction:
resnet = ResNet50(include_top=False, pooling='avg')
x_batch = resnet.predict(x_batch)
Then validation accuracy gets up to around 85%... What is going on? Why won't the image input method work?
Update:
This problem is really bizarre. If I change ResNet50 to VGG19 it seems to work ok.
After a lot of googling I found that the problem is to do with the Batch Normalisation layers in ResNet. There are no batch normalisation layers in VGGNet which is why it works for that topology.
There is a pull request to fix this in Keras here, which explains in more detail:
Assume we use one of the pre-trained CNNs of Keras and we want to fine-tune it. Unfortunately, we get no guarantees that the mean and variance of our new dataset inside the BN layers will be similar to the ones of the original dataset. As a result, if we fine-tune the top layers, their weights will be adjusted to the mean/variance of the new dataset. Nevertheless, during inference the top layers will receive data which are scaled using the mean/variance of the original dataset. This discrepancy can lead to reduced accuracy.
This means that the BN layers are adjusting to the training data, however when validation is performed, the original parameters of the BN layers are used. From what I can tell, the fix is to allow the frozen BN layers to use the updated mean and variance from training.
A work around is to pre-compute the ResNet output. In fact, this decreases training time considerably, as we are not repeating that part of the calculation.
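For reference, that pre-computation work-around could look roughly like this (a sketch; x_batch, y_batch, nb_epochs, and model2 are the names from the question, and x_batch is assumed to be already preprocessed with ResNet50's preprocess_input):

from keras.applications.resnet50 import ResNet50

# Run the frozen ResNet once, up front, instead of on every training epoch.
resnet = ResNet50(include_top=False, pooling='avg', weights='imagenet')
features = resnet.predict(x_batch, batch_size=64)   # shape (n_samples, 2048)

# Train only the small dense head on the cached 2048-dim feature vectors.
model2.fit(features, y_batch,
           epochs=nb_epochs,
           batch_size=64,
           shuffle=True,
           validation_split=0.2)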
Alternatively, you can try:
import keras
from keras.models import Sequential
from keras.layers import Flatten, Dropout, Dense

# IMG_SIZE is your input image size (e.g. 224)
Res = keras.applications.resnet.ResNet50(include_top=False, weights='imagenet',
                                         input_shape=(IMG_SIZE, IMG_SIZE, 3))

# Freeze all of the ResNet layers
for layer in Res.layers:
    layer.trainable = False

# Check the trainable status of the individual layers
for layer in Res.layers:
    print(layer, layer.trainable)

# Frozen ResNet50 base followed by the same dense head
model2 = Sequential()
model2.add(Res)
model2.add(Flatten())
model2.add(Dropout(0.05))
model2.add(Dense(512, activation='relu'))
model2.add(Dropout(0.15))
model2.add(Dense(512, activation='relu'))
model2.add(Dense(7, activation='softmax'))
model2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model2.summary()

Usage of sigmoid activation function in Keras

I have a big dataset composed of 18260 input fields with 4 outputs. I am using Keras and TensorFlow to build a neural network that can detect the possible output.
However, I have tried many solutions and the accuracy does not get above 55% unless I use the sigmoid activation function in all model layers except the first one, as below:
def baseline_model(optimizer='adam', init='random_uniform'):
    # create model
    model = Sequential()
    model.add(Dense(40, input_dim=18260, activation="relu", kernel_initializer=init))
    model.add(Dense(40, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(40, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(10, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(4, activation="sigmoid", kernel_initializer=init))
    model.summary()
    # Compile model
    model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model
Is using sigmoid for activation correct in all layers? The accuracy is reaching 99.9% when using sigmoid as shown above. So I was wondering if there is something wrong in the model implementation.
The sigmoid might work. But I suggest using relu activation for the hidden layers. The problem is that your output layer's activation is sigmoid, but it should be softmax (because you are using the sparse_categorical_crossentropy loss).
model.add(Dense(4, activation="softmax", kernel_initializer=init))
Edit after discussion on comments
Your outputs are integer class labels. The sigmoid (logistic) function outputs values in the range (0, 1). The output of softmax is also in the range (0, 1), but the softmax function adds another constraint on the outputs: their sum must be 1. Therefore the outputs of softmax can be interpreted as the probability of the input belonging to each class.
E.g
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def softmax(a):
    return np.exp(a - max(a)) / np.sum(np.exp(a - max(a)))

a = np.array([0.6, 10, -5, 4, 7])
print(sigmoid(a))
# [0.64565631, 0.9999546 , 0.00669285, 0.98201379, 0.99908895]
print(softmax(a))
# [7.86089760e-05, 9.50255231e-01, 2.90685280e-07, 2.35544722e-03, 4.73104222e-02]
print(sum(softmax(a)))
# 1.0
You have to use one activation or the other, since activations are what bring non-linearity into the model; without any activation, the model basically behaves like a single-layer network. (See the Keras documentation for more on why activations are needed and which activations are available.)
That said, it looks like your model is overfitting when using sigmoid, so try techniques to counter it, such as proper train/dev/test splits, reducing the complexity of the model, dropout, etc.
Neural networks require non-linearity at each layer to work. Without non-linear activation no matter how many layers you have, you could write the same thing with only one layer.
Linear functions are limited in complexity: if g and f are linear functions, then g(f(x)) can be written as z(x), where z is also a linear function, so it is pointless to stack them without adding non-linearity.
That is why we use non-linear activation functions: sigmoid(g(f(x))) cannot be written as a linear function.
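As a quick numerical check of that claim, composing two linear (affine) layers collapses to a single one (a minimal sketch):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5,))                   # a random input vector

# Two stacked linear layers: f(x) = W1 @ x + b1, g(h) = W2 @ h + b2
W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=(8,))
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=(3,))
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer z(x) = W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: stacking adds no expressive power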
