How to match predicted class and probability with actual labels - python

I have trained a deep learning model based on a Bidirectional LSTM and a dense output layer. It is quite confusing which output probability obtained with model.predict(x) matches which of my actual labels (one-hot encoded labels). Moreover, the model.predict_classes(x) outputs (0, 1, 2) are also confusing. How can I relate these outputs to my original labels? Below is my code snippet for reference:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.5))
model.add(Bidirectional(tf.keras.layers.LSTM(250, return_sequences=True,activation='tanh')))
model.add(Bidirectional(tf.keras.layers.LSTM(250)))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
print(model.summary())
history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), batch_size=32, epochs=10)
model.predict_classes(test_doc)
model.predict(test_doc)
Could someone please help me relate these outputs to the actual labels in this sequential model?

Assuming that you have one-hot encoded your labels (0, 1, 2), your model will output vectors.
So, for example, if you have an instance with class 0, your target vector will be:
[1, 0, 0]
if you have an instance with class 1, your target vector will be:
[0, 1, 0]
if you have an instance with class 2, your target vector will be:
[0, 0, 1]
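As a minimal sketch (assuming your labels are the integers 0, 1, 2), this is how such target vectors are typically produced, for example with Keras' to_categorical; pd.get_dummies yields the same columns when the categories are ordered this way:
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 1, 2])
print(to_categorical(labels, num_classes=3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]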
The method .predict will give you a probability for every class in your target. Since you have 3 classes (0, 1, 2), you'll get a vector of size three with three probabilities:
model.predict(x) # vector of size 3 with 3 probabilities
Something like this:
#class0, class1, class2
[0.31, 0.4, 0.29]
And these probabilities will sum to 1 because you have used the softmax activation function.
The method .predict_classes will select the class with the highest probability from that vector and decode it back to a class index.
So if you have a probabilities vector of:
#class0, class1, class2
[0.31, 0.4, 0.29]
you'll get 1, because the maximum value in the vector is at index 1, which represents class 1.
PS: You can skip the one-hot encoding step entirely by changing the loss to "sparse_categorical_crossentropy" and passing the integer labels directly:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
without needing pd.get_dummies(df['Sentiment']).values.
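A minimal decoding sketch, assuming the targets were built with pd.get_dummies(df['Sentiment']).values, so the get_dummies column order defines which index corresponds to which label (note that predict_classes has been removed in recent TensorFlow versions, and np.argmax over model.predict is the usual replacement):
import numpy as np
import pandas as pd

class_names = pd.get_dummies(df['Sentiment']).columns   # column order = class order

probs = model.predict(test_doc)               # shape (n_samples, 3), rows sum to 1
pred_idx = np.argmax(probs, axis=-1)          # same integers predict_classes returns
pred_labels = class_names[pred_idx]           # map back to the original string labels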

Related

Tensorflow: Predict 1 of 4 labels for text classification

I'm currently working on a text classification problem that requires classifying text into one of four labels. After encoding, the y-value should be one of [0, 1, 2, 3], which should be the predicted label.
However, the predictions this model makes seem to lie in (0, 1), and I'm a bit confused. Moreover, can anyone clarify whether this is an ANN or an RNN? I have zero experience with TensorFlow and am still struggling...
model = Sequential()
model.add(Dense(16, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
from sklearn.preprocessing import LabelEncoder
#encode the label
label_encoder = LabelEncoder()
y_train=np.array(label_encoder.fit_transform(train_labels))
x_train=np.array(train_features)
y_true=np.array(label_encoder.fit_transform(dev_label))
#fit the model
model.fit(x_train,y_train,epochs=1)
y_pred=model.predict(dev_features)
and the error message: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
Let's say that the target column has 4 unique values: red, blue, green, yellow and the corpus is converted to TF-IDF values. The first 3 rows look like this:
word_1    word_2    target
0.567     0.897     red
0.098     0.238     blue
0.66      0.786     green
One-Hot Encoding
After one-hot encoding the target, your target looks like an array of the form:
array([[1. 0. 0. 0.],    <- category 'red'
       [0. 1. 0. 0.],    <- category 'blue'
       [0. 0. 1. 0.], ...])    <- category 'green'
Here the target has shape (n_samples, n_classes), which is (n, 4). In this case the final activation should be softmax (sigmoid is only appropriate when the classes are not mutually exclusive, i.e. multi-label), and you train your model with the categorical_crossentropy loss. The code answering your question would be:
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Label-Encoding
After label-encoding the target, your target looks like an array of the form:
array([1, 2, 3 ...])
i.e. a 1D array of shape (n_samples,). Here the code will be:
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Prediction
These numbers you see are the probability of each class for the given input sample. For example, [[0.4846592 0.5153408]] means that the given sample belongs to class 0 with probability of around 0.48 and it belongs to class 1 with probability of around 0.51. So you want to take the class with the highest probability and therefore you can use np.argmax to find which index (i.e. 0 or 1) is the maximum one:
import numpy as np
pred_class = np.argmax(y_pred, axis=-1)
Further, this has nothing to do with the loss function of the model. These probabilities are given by the last layer in your model, which very likely uses softmax as the activation function to normalize the output into a probability distribution.
Conclusion
The error you are getting is because of the loss function being used incorrectly.
If you have a 1D integer-encoded or label-encoded target, you should use sparse_categorical_crossentropy as the loss function.
If you have one-hot encoded your target so that it has the 2D shape (n_samples, n_classes), you should use categorical_crossentropy.
The dense layer should have dimension 4 and the activation function should be "softmax" rather than "sigmoid", since we are performing multi-class (more than 2 classes) classification.
Also, change the loss function to "categorical_crossentropy".
Your code sample will look like this:
model.add(Dense(16, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
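As a hedged side note, that sklearn error usually appears when the raw softmax probabilities are compared against integer labels; a minimal sketch (assuming you evaluate with accuracy_score on dev_label) is:
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# use transform (not fit_transform) so the dev labels keep the training mapping
y_true = label_encoder.transform(dev_label)

y_pred_probs = model.predict(dev_features)     # shape (n_samples, 4) of probabilities
y_pred = np.argmax(y_pred_probs, axis=-1)      # integer classes 0..3

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=[str(c) for c in label_encoder.classes_]))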

Keras Dense Neural Net Predicting in Narrow Range

I've been playing with Numer.ai data, mostly as a way to improve my understanding of neural nets, but I'm running into a problem that I can't seem to get past. No matter the configuration of my dense neural net, the output comes out in a tight range.
The input is 300 scaled feature columns (0 to 1) and the target is between 0 and 1 (values of 0, 0.25, 0.5, 0.75, and 1).
Here is my fully reproducible code:
import pandas as pd
# load data
training_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz")
tournament_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz")
feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]
# select those columns out of the training dataset
X_train = training_data[feature_cols].to_numpy()
# select target variables
y_train = training_data.loc[:,'target'].to_numpy()
#same thing on validation data
val_data = tournament_data[tournament_data.data_type=='validation']
X_val = val_data[feature_cols]
y_val= val_data.loc[:,'target']
I've tried a number of different configurations in my neural network:
different optimizers: adam and sgd
different learning rates: 0.01 down to 0.0001
different neuron sizes
adding dropout (although I didn't expect this to work, because it seems to be a problem with bias, not variance)
linear, softmax, and sigmoid final-layer activation functions (softmax produces negative values, so that was an immediate non-starter)
different batch sizes: as small as 16 and as large as 256
adding or removing batch normalization
shuffling the input data
training for different numbers of epochs
Ultimately, the results are one of two things:
Predicted values are all the same number, usually somewhere in the 0.45 to 0.55 area
Predicted values fall in a very narrow range, usually spanning no more than 0.05, e.g. 0.45 to 0.55
I can't figure out what configuration changes I need to make to get this neural network to output predictions across a broader area of the 0 to 1 range.
from tensorflow.keras import models, layers
dropout_rate = 0.15
model = models.Sequential()
model.add(layers.Dense(512, input_shape=(X_train.shape[1],)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(dropout_rate))
model.add(layers.Dense(1028, activation = 'relu', kernel_regularizer='l2'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(dropout_rate))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy', metrics=['mae', 'mse'])
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    batch_size=64,
                    epochs=200,
                    verbose=1)
# Prediction output
predictions_df = model.predict(X_val)
predictions_df = predictions_df.reshape(len(predictions_df))
pred_max = predictions_df.max()
pred_min = predictions_df.min()
pred_range = pred_max - pred_min
print(pred_max, pred_min, pred_range)
# example output: 0.51895267 0.47968164 0.039271027
EDIT:
There is an impact on the predictions when the following changes are made (tests run with a batch size of 512 and 5 epochs; the results below are on training data only):
Loss set to mse instead of binary_crossentropy
Batch size 512 (for quick prototyping)
Epochs set to 5 (loss flattens after that)
Remove l2 regularization, and increase dropout
Set output activation -
With sigmoid -> Max: 0.60, Min: 0.36
Without activation -> Max: 0.69, Min: 0.29
With relu -> Max: 0.73, Min: 0.10
Here is the code for testing purposes -
from tensorflow.keras import models, layers
dropout_rate = 0.50
model = models.Sequential()
model.add(layers.Dense(512, input_shape=(X_train.shape[1],)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(dropout_rate))
model.add(layers.Dense(1024, activation = 'relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(dropout_rate))
model.add(layers.Dense(1, activation='relu'))
model.compile(optimizer='adam',
              loss='mse', metrics=['mae'])
history = model.fit(X_train, y_train,
                    #validation_data=(X_val, y_val),
                    batch_size=512,
                    epochs=5,
                    verbose=1)
# Prediction output
predictions_df = model.predict(X_train)
predictions_df = predictions_df.reshape(len(predictions_df))
pred_max = predictions_df.max()
pred_min = predictions_df.min()
pred_range = pred_max - pred_min
print(pred_max, pred_min, pred_range)
0.73566914 0.1063129 0.62935627
Proposed solutions
You are trying to solve a regression problem, predicting a value between 0 and 1 (with the discrete values 0, 0.25, 0.5, 0.75, and 1), but you are treating it as a binary classification problem by using a sigmoid activation and a binary_crossentropy loss.
What you may want to try is using mse and/or removing the output activation (or better, use relu as suggested by @desertnaut). You could simply be underfitting, as suggested by @xdurch0. Try with and without the regularization as well.
model = models.Sequential()
model.add(layers.Dense(512, input_shape=(X_train.shape[1],)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(dropout_rate))
model.add(layers.Dense(1028, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(dropout_rate))
model.add(layers.Dense(1))
model.compile(optimizer='adam', loss='mse')
Check this table to help you with how to use losses and activations for different types of problem settings.
On a side note, given the discrete nature of the values in your dependent variable y, you could also consider reframing the problem as a multi-class, single-label classification problem, if the downstream task allows it.
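A minimal sketch of that reframing, assuming the targets take exactly the five values 0, 0.25, 0.5, 0.75, 1 (the layer sizes here are just placeholders):
import numpy as np
from tensorflow.keras import models, layers

y_train_cls = (y_train * 4).astype(int)        # 0, 0.25, ..., 1.0 -> classes 0..4

clf = models.Sequential([
    layers.Dense(512, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(5, activation='softmax'),     # one unit per discrete target value
])
clf.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
clf.fit(X_train, y_train_cls, batch_size=512, epochs=5)

# expected value over the class probabilities maps predictions back to [0, 1]
probs = clf.predict(X_val)
preds = probs @ np.array([0.0, 0.25, 0.5, 0.75, 1.0])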

Keras model with sigmoid not predicting the right class

I'm trying to fit a model that predicts a target class that can be 0, 1, 2, or 3.
During fitting its val_accuracy is 1.0,
but its prediction looks like:
array([[1.2150223e-09]], dtype=float32)
X_train.shape
#(1992, 1, 68)
model = Sequential()
model.add(LSTM(128, input_shape=(1,X_train.shape[2])))
model.add(Dense(128, activation="relu",kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(Dropout(0.4))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer='adam', loss='mae', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=16, validation_split=0.1, shuffle=True)
X_test = np.expand_dims(X_test,1)
y_test = np.expand_dims(y_test,1)
model.evaluate(X_test,y_test)
#[0.0010176461655646563, 1.0]
data = np.expand_dims(data, 1)
model.predict(data) #array([[1.2150223e-09]], dtype=float32) <---- here expected was 0, 1, 2 or 3
data.shape #(1, 1, 68)
I can't understand what is wrong.
Your model has only one output, but you have four classes, so you need to change the last Dense layer to model.add(Dense(4, activation="softmax")). Sigmoid is usually for binary classification; in your case there are 4 classes, so softmax needs to be used. You can then get the probabilities with probabilities = model.predict(data), and
CATEGORIES[np.argmax(probabilities)] gives the predicted class. Also, when computing the loss, use the multi-class cross-entropy loss: model.compile(..., loss='categorical_crossentropy', ...)
model.add(Dense(4, activation="softmax"))
probabilities = model.predict(data)
print(CATEGORIES[np.argmax(probabilities)])
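A hedged end-to-end sketch of that fix, assuming y_train holds the integer classes 0..3 and using a hypothetical CATEGORIES list for the class names (sparse_categorical_crossentropy works directly on integer labels; use categorical_crossentropy if you one-hot encode them instead):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

CATEGORIES = ['class_0', 'class_1', 'class_2', 'class_3']   # hypothetical names

model = Sequential([
    LSTM(128, input_shape=(1, X_train.shape[2])),
    Dense(128, activation='relu'),
    Dropout(0.4),
    Dense(4, activation='softmax'),            # one output per class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=16, validation_split=0.1)

probabilities = model.predict(data)            # shape (1, 4), rows sum to 1
print(CATEGORIES[np.argmax(probabilities)])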

Poor accuracy of CNN model with Keras

I need advice. I got a very poor result (10% accuracy) when building a CNN model with Keras using only a subset of the CIFAR10 dataset (only 10000 images, 1000 per class). How can I increase the accuracy? I tried changing/increasing the number of epochs, but the result is still the same. Here is my CNN architecture:
cnn = models.Sequential()
cnn.add(layers.Conv2D(25, (3, 3), input_shape=(32, 32, 3)))
cnn.add(layers.MaxPooling2D((2, 2)))
cnn.add(layers.Activation('relu'))
cnn.add(layers.Conv2D(50, (3, 3)))
cnn.add(layers.MaxPooling2D((2, 2)))
cnn.add(layers.Activation('relu'))
cnn.add(layers.Conv2D(100, (3, 3)))
cnn.add(layers.MaxPooling2D((2, 2)))
cnn.add(layers.Activation('relu'))
cnn.add(layers.Flatten())
cnn.add(layers.Dense(100))
cnn.add(layers.Activation('relu'))
cnn.add(layers.Dense(10))
cnn.add(layers.Activation('softmax'))
compile and fit:
EPOCHS = 200
BATCH_SIZE = 10
LEARNING_RATE = 0.1
cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
            loss='binary_crossentropy',
            metrics=['accuracy'])
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
mc = ModelCheckpoint(filepath=checkpoint_path, monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
history_cnn = cnn.fit(train_images, train_labels, epochs=EPOCHS, batch_size=BATCH_SIZE,
                      validation_data=(test_images, test_labels), callbacks=[es, mc], verbose=0)
The data I use is CIFAR10, but I only take 1000 images per class, so the total is only 10000 images. I use normalization to preprocess the data.
First of all, the problem is the loss. Your dataset is a multi-class problem, not a binary or multi-label one.
As stated here:
The classes are completely mutually exclusive. There is no overlap
between automobiles and trucks. "Automobile" includes sedans, SUVs,
things of that sort. "Truck" includes only big trucks. Neither
includes pickup trucks.
In this situation, the use of categorical crossentropy is suggested. Keep in mind that if your labels are sparse (encoded as integers between 0 and 9) rather than as one-hot encoded vectors ([0, 0, 0, ..., 1, 0, 0]), you should use sparse categorical crossentropy.
not sparse (labels one-hot encoded as vectors [0, 0, 1, ..., 0]):
cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
            loss='categorical_crossentropy',
            metrics=['accuracy'])
sparse (labels encoded as integers in 0, ..., 9):
cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
Also, the learning rate is quite high (0.1). I'd suggest starting with something lower, for example 0.001.
This post is also relevant to your problem.
Edit: my bad regarding the number of filters, having an increasing number of filters is a common approach.
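A small sketch (assuming train_labels / test_labels are integers 0..9) of how you would one-hot encode the labels for the plain categorical loss, together with the lower learning rate suggested above:
import tensorflow as tf

train_labels_oh = tf.keras.utils.to_categorical(train_labels, num_classes=10)
test_labels_oh = tf.keras.utils.to_categorical(test_labels, num_classes=10)

cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss='categorical_crossentropy',
            metrics=['accuracy'])
history_cnn = cnn.fit(train_images, train_labels_oh, epochs=EPOCHS, batch_size=BATCH_SIZE,
                      validation_data=(test_images, test_labels_oh))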

Understanding this DNN model and why it does not work on multi-label classification

I am fairly new to Keras and DNNs in general and, starting from some tutorials, I have managed to create a model for classifying sentences. The model is shown below. To be honest, I do not know for sure what the intuition behind it is or why it works. So this is my question.
def create_model():
    embedding_layer = Embedding(input_dim=100, output_dim=300,
                                input_length=100)
    model = Sequential()
    model.add(embedding_layer)
    model.add(Dropout(0.2))
    model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=4))
    model.add(LSTM(units=100, return_sequences=True))
    model.add(GlobalMaxPooling1D())
    #model.add(Dense(1, activation='sigmoid'))
    ###### multiclassification #########
    model.add(Dense(3, activation='sigmoid'))  # I want to replace the above line with this for multi-classification but this didn't work
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
And here is my understanding: The model starts with training word embeddings on the corpus (of sentences), and represent each sentence as a vector of word vectors (embedding_layer). The dropout layer then forces the model to not rely on specific words. Convolution has a similar effect of identifying phrases/n-grams as opposed to just individual words; then an LSTM follows to learn sequences of phrases/n-grams that may be useful features; The Globalmaxpooling1D layer then 'flattens' the LSTM output as features for the final classification (dense layer).
Does this make any sense? I also do not quite understand the interaction between the MaxPooling1D layer and the LSTM layer. What is the input shape to the LSTM, and what does its output look like?
So, your intuition is right. Everything you said holds. About MaxPooling1D: it's a way to downsample the output from Conv1D. The output of this layer will be 4 times smaller than the original output from Conv1D (so the input to the LSTM will have a length of 25, with the same number of features). Just to show you how it works:
output from Conv1D :
0, 1, 1, 0, -1, 2, 3, 5, 1, 2, 1, -1
input to LSTM :
1 (max from 0, 1, 1, 0), 5 (max from -1, 2, 3, 5), 2 (max from 1, 2, 1, -1)
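A quick numeric check of this toy example (assuming a single feature channel):
import numpy as np
import tensorflow as tf

x = np.array([0, 1, 1, 0, -1, 2, 3, 5, 1, 2, 1, -1], dtype='float32')
x = x.reshape(1, 12, 1)                              # (batch, timesteps, features)
pooled = tf.keras.layers.MaxPooling1D(pool_size=4)(x)
print(pooled.numpy().ravel())                        # [1. 5. 2.]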
Edit
I haven't noticed categorical_crossentropy and activations. So:
If your output is one out of 3 classes, you could use categorical_crossentropy with sigmoid, but then your output cannot be interpreted as a probability distribution, only as per-class scores (the prediction is the class with the highest score). A better option is to use softmax, which produces a probability distribution over the classes.
In the case of a 3-class prediction where the classes are not mutually exclusive, due to the Keras implementation you should use binary_crossentropy, even though it is mathematically equivalent to categorical_crossentropy. This is because Keras normalizes the outputs from the last layer so that they sum up to 1, which might seriously harm your training.
Multiclass models:
The multi-classification model ending with Dense(3, activation='sigmoid') is OK for a multi-class problem with 3 possible classes.
But it should only use 'categorical_crossentropy' if only one of the 3 classes can be correct at a time. In that case, the activation function should be 'softmax'.
A 'softmax' guarantees that the class probabilities sum to 1. It's good when you want exactly one correct class.
A 'sigmoid' does not care about the relation between the 3 classes; they can coexist as all ones or all zeros. In that case, use 'binary_crossentropy'.
LSTM and GlobalMaxPooling:
The LSTM input is (batchSize, timeSteps, featuresOrDimension).
The output can be two:
With return_sequences = True: (batchSize, timeSteps, units)
With return_sequences = False: (batchSize, units).
Since you chose the True case, there is the timeSteps dimension, and the GlobalMaxPooling1D will take the highest value in that dimension and discard the others, resulting in (batchSize,units).
It's pretty much like using only LSTM(units,return_sequences=False). But this one takes the last step in the sequence, while the maxpooling will take the maximum step.
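A small sketch (reusing the layer sizes from the question) that makes the shapes around MaxPooling1D, the LSTM and GlobalMaxPooling1D visible:
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, GlobalMaxPooling1D

m = Sequential([
    Embedding(input_dim=100, output_dim=300, input_length=100),            # (batch, 100, 300)
    Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'), # (batch, 100, 100)
    MaxPooling1D(pool_size=4),                                             # (batch, 25, 100)
    LSTM(units=100, return_sequences=True),                                # (batch, 25, 100)
    GlobalMaxPooling1D(),                                                  # (batch, 100)
])
dummy = np.random.randint(0, 100, size=(2, 100))   # two fake tokenized sentences
print(m(dummy).shape)                              # (2, 100)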
