Sigmoid as a last layer in LSTM - python

I have a Keras classification model with an LSTM for a dataset with 4 attributes labeled into 2 classes (safe and unsafe). With sigmoid in the last layer I get a better accuracy (98%) than with softmax.
My question is:
1) If I use softmax in the last layer:
With softmax there are 2 output neurons, so later, in other code, I can compare the two scores and say which class the data belongs to.
For example, score_safe = 1.2945 and score_unsafe = -9.0 means this row of the dataset belongs to the safe class.
2) If I use sigmoid in the last layer:
Then I have to use just one output neuron, so how can I compare the scores and say which class this row of the dataset belongs to?
model = Sequential()
model.add(LSTM(256, input_shape=(x_train.shape[1:]), activation='tanh', return_sequences=True))
#model.add(BatchNormalization())
model.add(Dense(128, activation='tanh'))
#model.add(BatchNormalization())
model.add(Dense(128, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))

The output of a sigmoid is a single float between 0. and 1.
Typically, it is set such that if the output is below 0.5 the model is classifying as the first class (whichever class is represented as a 0 in your dataset). If the output is above 0.5 the model is classifying as the second class (represented as a 1 in your dataset).
The 0.5 threshold can be varied to introduce a bias toward one or the other class.
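For example, a minimal sketch of turning the sigmoid output into class labels (assuming x_test is your held-out data and model is the network above):
probs = model.predict(x_test)                   # shape (n_samples, 1), values between 0 and 1
pred_class = (probs > 0.5).astype(int).ravel()  # 0 -> first class, 1 -> second class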

Related

Tensorflow: Predict 1 of 4 labels for text classification

I'm currently working on a text classification problem where text has to be classified into one of four labels. After encoding, the y-value should be one of [0, 1, 2, 3], which should be the predicted label.
However, the predictions this model makes seem to range in (0, 1) and I'm a bit confused. Moreover, can anyone clarify whether this is an ANN or an RNN? I have zero experience with TensorFlow and am still struggling...
model = Sequential()
model.add(Dense(16, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
from sklearn.preprocessing import LabelEncoder
#encode the label
label_encoder = LabelEncoder()
y_train=np.array(label_encoder.fit_transform(train_labels))
x_train=np.array(train_features)
y_true=np.array(label_encoder.fit_transform(dev_label))
#fit the model
model.fit(x_train,y_train,epochs=1)
y_pred=model.predict(dev_features)
and the error message: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
Let's say that the target column has 4 unique values: red, blue, green, yellow and the corpus is converted to TF-IDF values. The first 3 rows look like this:
word_1    word_2    target
0.567     0.897     red
0.098     0.238     blue
0.66      0.786     green
One-Hot Encoding
After one-hot encoding the target, your target looks like an array of the form:
array([[1., 0., 0., 0.],   <- category 'red'
       [0., 1., 0., 0.],   <- category 'blue'
       [0., 0., 1., 0.],   <- category 'green'
       ...])
Here, the target is of size (n_samples, n_classes), i.e. (n, 4). In such a case the final activation has to be sigmoid or softmax, and you train your model with the categorical_crossentropy loss. The code answering your question will be:
model.add(Dense(4, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
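If you want to produce such one-hot targets from integer labels, Keras has a helper for this; a small sketch (the label values here are made up):
import numpy as np
from tensorflow.keras.utils import to_categorical

y_int = np.array([0, 1, 2, 0])                  # e.g. red, blue, green, red after LabelEncoder
y_onehot = to_categorical(y_int, num_classes=4)
print(y_onehot.shape)                           # (4, 4): one column per class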
Label-Encoding
After label-encoding the target, your target looks like an array of the form:
array([1, 2, 3 ...])
i.e. a 1D array of size (n_samples,). Here the code will be:
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Prediction
These numbers you see are the probabilities of each class for the given input sample. For example, [[0.4846592 0.5153408]] means that the given sample belongs to class 0 with a probability of around 0.48 and to class 1 with a probability of around 0.51. Since you want to take the class with the highest probability, you can use np.argmax to find which index (i.e. 0 or 1) is the maximum one:
import numpy as np
pred_class = np.argmax(y_pred, axis=-1)
Further, this has nothing to do with the loss function of the model. These probabilities are produced by the last layer of your model, which very likely uses softmax as the activation function to normalize the output into a probability distribution.
Conclusion
The error you are getting is because of the loss function being used incorrectly.
If you have a 1D integer-encoded (LabelEncoded) target, you should use sparse_categorical_crossentropy as the loss function.
If you have one-hot encoded your target so it has the 2D shape (n_samples, n_classes), you should use categorical_crossentropy.
The final Dense layer should have dimension 4, and the activation function should be "softmax" instead of "sigmoid", since we are performing multi-class (more than 2 classes) classification.
Also, change the loss function to "categorical_crossentropy".
Your code sample will look like this:
model.add(Dense(16, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Is there a way to use multilabel classification but take as correct when the model predicts only one label in keras?

I have a dataset of weather forecasts and am trying to make a model that predicts which forecast will be more accurate the next day.
In order to do so, my y output is of the form y = [1,0,1,0] because I have forecasts from 4 different organizations. A 1 means that this is the best forecast for the current record, and more than one 1 means that multiple forecasts tied for the best prediction.
My problem is that I want the model to train on these data but also learn that predicting just one of the correct values counts as a 100% correct answer, since I only need one of the equally best forecasts as a result. I believe the way I am doing it now 'shaves' accuracy off my evaluation. Is there a way to implement this in Keras? The architecture of the neural network is totally experimental and there is no specific reason why I chose it. This is the code I wrote. My training dataset consists of 6463 rows × 505 columns.
model = Sequential()
model.add(LSTM(150, activation='relu',activity_regularizer=regularizers.l2(l=0.0001)))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(24, activation='relu'))
model.add(Dense(4, activation='softmax'))
#LSTM
# reshape input to be 3D [samples, timesteps, features]
X_train_sc =X_train_sc.reshape((X_train_sc.shape[0], 1, X_train_sc.shape[1]))
X_test_sc = X_test_sc.reshape((X_test_sc.shape[0], 1,X_test_sc.shape[1]))
#validation set
x_val=X_train.iloc[-2000:-1300,0:505]
y_val=y_train[-2000:-1300]
x_val_sc=scaler.transform(x_val)
# reshape input to be 3D for LSTM[samples, timesteps, features]
x_val_sc =x_val_sc.reshape((x_val_sc.shape[0], 1, x_val_sc.shape[1]))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])
history = model.fit(x=X_train_sc, y=y_train, validation_data=(x_val_sc, y_val), epochs=300, batch_size=24)
print(model.evaluate(X_test_sc,y_test))
yhat= model.predict(X_test_sc)
My accuracy is ~44%
If you want to make predictions of the form [1,0,1,0], i.e. the model should predict the probability of belonging to each of the 4 classes, then it is called multi-label classification. What you have coded is multi-class classification.
Multi-label classification
Your last layer will be a Dense layer of size 4 (one unit per class) with sigmoid activation, and you will use a binary_crossentropy loss.
x = np.random.randn(100,10,1)
y = np.random.randint(0,2,(100,4))
model = keras.models.Sequential()
model.add(keras.layers.LSTM(16, activation='relu', input_shape=(10,1), return_sequences=False))
model.add(keras.layers.Dense(8, activation='relu'))
model.add(keras.layers.Dense(4, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x,y)
Check
print (model.predict(x))
Output
array([[0.5196002 , 0.52978194, 0.5009601 , 0.5036485 ],
[0.508756 , 0.5189857 , 0.5022978 , 0.50169533],
[0.5213044 , 0.5254892 , 0.51159555, 0.49724004],
[0.5144601 , 0.5264933 , 0.505496 , 0.5008205 ],
[0.50524575, 0.5147699 , 0.50287664, 0.5021702 ],
[0.521035 , 0.53326863, 0.49642274, 0.50102305],
.........
As you can see, the probabilities for each prediction do not sum up to one; rather, each value is the probability of the sample belonging to the corresponding class. So if the probability is > 0.5 you can say that it belongs to that class.
On the other hand, if you use softmax, the probabilities sum up to 1, i.e. the sample is assigned to the single class with the highest probability.
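For instance, a small sketch of how such sigmoid outputs would be turned into multi-label predictions (the numbers are hypothetical):
import numpy as np

y_prob = np.array([[0.52, 0.53, 0.50, 0.50],
                   [0.48, 0.52, 0.51, 0.49]])   # hypothetical sigmoid outputs for 4 labels

y_pred = (y_prob > 0.5).astype(int)             # each label thresholded independently
print(y_pred)                                   # [[1 1 0 0]
                                                #  [0 1 1 0]] -- labels can co-occur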

softmax and sigmoid are giving same results in multiclass classification

I am building an LSTM model. I tested my model using the softmax and sigmoid activation functions. In the documentation, sigmoid is used for binary classification and softmax is used for multiclass classification. But in my case, both are giving the same results. Why is that?
Here is my code:
embedding_vecor_length = 128
max_length = 700
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, embedding_vecor_length, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=16, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Here are the predicted results:
[[2.72062905e-02 1.47979835e-03 4.44446778e-04 1.60833297e-05
4.15672457e-06 3.20438482e-02 9.38653767e-01 1.41544719e-04
5.55426550e-06 4.47654566e-06]
[2.31099591e-01 1.71699154e-03 1.32052042e-02 4.70457249e-04
8.86382014e-02 2.65704724e-03 6.54215395e-01 7.50611164e-03
4.89178114e-04 1.89376965e-06]
[1.24909900e-01 8.73659015e-01 9.71468398e-06 1.66079029e-04
1.05203628e-06 4.14116839e-05 3.97000113e-05 6.98190925e-05
1.10231712e-03 9.84829512e-07]
The sigmoid allows you to have high probability for all of your classes, some of them, or none of them. Example: classifying diseases in a chest x-ray image. The image might contain pneumonia, emphysema, and/or cancer, or none of those findings.
The softmax enforces that the sum of the probabilities of your output classes are equal to one, so in order to increase the probability of a particular class, your model must correspondingly decrease the probability of at least one of the other classes. Example: classifying images from the MNIST data set of handwritten digits. A single picture of a digit has only one true identity - the picture cannot be a 7 and an 8 at the same time.
So in your case, if the model is good, the predictions will not differ a lot between sigmoid and softmax; softmax forces the predictions to sum to 1, sigmoid doesn't.
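To make the difference concrete, here is a tiny NumPy sketch (the scores are made up):
import numpy as np

logits = np.array([2.0, -1.0, 0.5])              # hypothetical raw scores for 3 classes

softmax = np.exp(logits) / np.exp(logits).sum()  # normalized: always sums to 1
sigmoid = 1 / (1 + np.exp(-logits))              # per-class: no constraint on the sum

print(softmax, softmax.sum())                    # approx. [0.79 0.04 0.18], sum = 1.0
print(sigmoid, sigmoid.sum())                    # approx. [0.88 0.27 0.62], sum = 1.77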

Keras LSTM fit underfitting

I have time series training data of about 5000 numbers. For each 100 numbers, I am trying to predict the 101st. At the end of the series, I feed the predicted numbers back into the model to predict beyond the end of the time series.
The attached graph shows the training data, the test data and the prediction output. Currently, the model seems to be under-fitting. I would like to know what hyperparameters should be changed, or if I need to re-structure my input and output data.
I am using the following LSTM network.
model = Sequential()
model.add(LSTM(128, input_shape=([bl,1]), activation='relu', return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(20,activation='relu'))
model.add(Dense(1))
model.compile(optimizer=adam(lr=0.0001), loss='mean_squared_error', metrics=['accuracy'])
model.fit(y_ba_tr_in, y_ba_tr_out,
          epochs=20,
          batch_size=5, shuffle=False, verbose=2)
y_ba_tr_in.shape = (4961, 100, 1)
y_ba_tr_out.shape = (4961, 1)
Something you could try is taking return_sequences=True out of your last LSTM layer. I believe this is generally the approach when you intend to predict for the next timestep.
After that modification, you also shouldn't need the subsequent Flatten() and Dense() layers.
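Something like this is what I mean (a sketch only; bl is the window length of 100 from your setup):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

bl = 100  # window length

model = Sequential()
model.add(LSTM(128, input_shape=(bl, 1), activation='relu', return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(128))                  # return_sequences defaults to False
model.add(Dropout(0.1))
model.add(Dense(20, activation='relu'))
model.add(Dense(1))
model.compile(optimizer=Adam(learning_rate=0.0001), loss='mean_squared_error')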

Understanding this DNN model and why it does not work on multi-label classification

I am fairly new to Keras and DNNs in general, and starting from some tutorials I have managed to create a model for classifying sentences. The model is shown below. To be honest, I do not know for sure what the intuition behind it is or why it works. So this is my question.
def create_model():
    embedding_layer = Embedding(input_dim=100, output_dim=300,
                                input_length=100)
    model = Sequential()
    model.add(embedding_layer)
    model.add(Dropout(0.2))
    model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=4))
    model.add(LSTM(units=100, return_sequences=True))
    model.add(GlobalMaxPooling1D())
    #model.add(Dense(1, activation='sigmoid'))
    ###### multiclassification #########
    model.add(Dense(3, activation='sigmoid'))  # I want to replace the above line with this for multi-classification but this didn't work
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
And here is my understanding: The model starts by training word embeddings on the corpus (of sentences), and represents each sentence as a sequence of word vectors (embedding_layer). The dropout layer then forces the model not to rely on specific words. Convolution has a similar effect of identifying phrases/n-grams as opposed to just individual words; then an LSTM follows to learn sequences of phrases/n-grams that may be useful features; the GlobalMaxPooling1D layer then 'flattens' the LSTM output into features for the final classification (dense layer).
Does this make any sense? I also do not quite understand the interaction between the maxpooling1D layer and the lstm layer. What's the input_shape to lstm and what does the output look like?
So, your intuition is right. Everything you said holds. About MaxPooling1D: it's a way to downsample the output from Conv1D. The output of this layer will be 4 times shorter than the original output from Conv1D (so the input to the LSTM will have a length of 25 with the same number of features). Just to show you how it works:
output from Conv1D :
0, 1, 1, 0, -1, 2, 3, 5, 1, 2, 1, -1
input to LSTM :
1 (max from 0, 1, 1, 0), 5 (max from -1, 2, 3, 5), 2 (max from 1, 2, 1, -1)
Edit
I haven't noticed categorical_crossentropy and activations. So:
If your output is exactly one out of 3 classes, you could use categorical_crossentropy and sigmoid, but then your output cannot be interpreted as a probability distribution, only as class scores (the prediction is the class with the highest score). A better option is to use softmax, which produces a probability distribution over the classes.
In the case of 3-class prediction where the classes are not mutually exclusive, due to the Keras implementation you should use binary_crossentropy, even though it's mathematically equivalent to categorical_crossentropy. That's because with categorical_crossentropy Keras normalizes the outputs of the last layer so that they sum up to 1, which might seriously harm your training.
Multiclass models:
The multi-classification model ending with Dense(3, activation='sigmoid') is OK for a problem with 3 possible classes.
But it should only use 'categorical_crossentropy' if there is exactly one correct class among the 3; in that case, the activation function should be 'softmax'.
A 'softmax' guarantees that all the classes sum to 1. It's good when you want exactly one correct class.
A 'sigmoid' does not care about the relation between the 3 classes; they can coexist, all ones or all zeros. In this case, use 'binary_crossentropy'.
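Put as code, the two pairings look like this (a minimal sketch; the input size of 100 is arbitrary):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# exactly one correct class out of 3: softmax + categorical_crossentropy
single_label = Sequential([Dense(3, activation='softmax', input_shape=(100,))])
single_label.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# classes may coexist (multi-label): sigmoid + binary_crossentropy
multi_label = Sequential([Dense(3, activation='sigmoid', input_shape=(100,))])
multi_label.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])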
LSTM and GlobalMaxPooling:
The LSTM input is (batchSize, timeSteps, featuresOrDimension).
The output can have one of two shapes:
With return_sequences=True: (batchSize, timeSteps, units)
With return_sequences=False: (batchSize, units)
Since you chose the True case, there is the timeSteps dimension, and the GlobalMaxPooling1D will take the highest value in that dimension and discard the others, resulting in (batchSize,units).
It's pretty much like using only LSTM(units, return_sequences=False). But that one takes the last step of the sequence, while the max pooling takes the maximum over all steps.
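A tiny sketch of that shape difference (the sizes are made up, just to mirror the model above):
import numpy as np
from tensorflow.keras.layers import LSTM, GlobalMaxPooling1D

x = np.random.randn(8, 25, 100).astype("float32")   # (batchSize, timeSteps, features)

seq_out = LSTM(100, return_sequences=True)(x)        # (8, 25, 100)
pooled = GlobalMaxPooling1D()(seq_out)               # (8, 100): max over the 25 time steps
last = LSTM(100, return_sequences=False)(x)          # (8, 100): only the last time step
print(pooled.shape, last.shape)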
