Selecting validation metric for `categorical_crossentropy` in Keras

I am looking at these two questions and documentation:
Whats the output for Keras categorical_accuracy metrics?
Categorical crossentropy need to use categorical_accuracy or accuracy as the metrics in keras?
https://keras.io/api/metrics/probabilistic_metrics/#categoricalcrossentropy-class
To classify X-ray images into 15 classes, I do:
# Compile a model
model1.compile(optimizer = 'adam', loss = 'categorical_crossentropy',
               metrics = ['accuracy'])
# Fit the model
history1 = model1.fit_generator(train_generator, epochs = 10,
                                steps_per_epoch = 10, verbose = 1, validation_data = valid_generator)
My model works and I have an output:
But I am not sure how to add validation accuracy here to compare results and avoid over/underfitting.

I hope the following can help you:
The use of "categorical_crossentropy" tells me that your labels are one-hot encoded over the different classes.
Let's say you have 15 classes: the correct prediction is a vector with 14 zeros and a one at the index of the true class. If accuracy is computed element-wise over that vector (which is what binary accuracy does), it will be very high, because the model correctly predicts zero almost everywhere: even when it picks the wrong class, 13 of the 15 entries still match, so the score is at least 13/15 ≈ 0.87. Note that Keras resolves the string "accuracy" based on the loss and output shape, so with "categorical_crossentropy" it is normally computed as categorical accuracy already; naming the metric explicitly removes the ambiguity.
A more suitable metric would be "categorical_accuracy", which gives you 1 if the model predicts the correct index and 0 otherwise.
If your validation "categorical_accuracy" is better than 1/15 ≈ 0.067 (assuming your classes are balanced), your model is better than random.
You can find a list of metrics at keras metrics.
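To make the difference between the two ways of counting concrete, here is a minimal NumPy sketch. It is not the Keras implementation: the function names, the random labels, and the "always predict class 0" model are all assumptions made up for illustration.

```python
import numpy as np

NUM_CLASSES = 15  # matches the 15 classes in the question


def elementwise_accuracy(y_true, y_pred):
    """Fraction of individual vector entries that match after rounding."""
    return float(np.mean(y_true == np.round(y_pred)))


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose highest-scoring index is the true class."""
    return float(np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1)))


# Hypothetical data: 1000 random labels, one-hot encoded.
rng = np.random.default_rng(0)
labels = rng.integers(0, NUM_CLASSES, size=1000)
y_true = np.eye(NUM_CLASSES)[labels]

# A useless "model" that always predicts class 0 with full confidence.
y_pred = np.zeros_like(y_true)
y_pred[:, 0] = 1.0

elem_acc = elementwise_accuracy(y_true, y_pred)
cat_acc = categorical_accuracy(y_true, y_pred)
print(f"element-wise accuracy: {elem_acc:.3f}")  # never below 13/15
print(f"categorical accuracy:  {cat_acc:.3f}")   # close to chance level
```

The element-wise score stays above 13/15 even though this model has learned nothing, while the categorical score sits near the 1/15 chance level.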

Related

Why do I need a very high learning rate for this model to converge?

I have a simple model in TensorFlow which is being trained on the first 1000 images in the MNIST dataset. In my previous experience, the learning rates I used were on the order of 0.001; however, for my model to converge the learning rate needs to be far higher, at least larger than 1. The model is shown below.
def gen_model():
    return tf.keras.models.Sequential([
        tf.keras.Input(shape=(28,28,)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='sigmoid'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
model = gen_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5), loss='mean_squared_error')
model.summary()
model.fit(x_train, y_train, batch_size=1000, epochs=10000)
Is it expected for models of this form to require an extremely high learning rate, or is there something I have missed? When I use a learning rate of around 0.001 the loss changes incredibly slowly.
The dataset was created with the following code:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_train = x_train.reshape(60000,28,28)[:1000]
y_train = y_train[:1000]
y_train = tf.one_hot(y_train, 10)

Generally speaking, models that require learning rates larger than 1 raise a red flag for me. It seems like your model is a vanilla multilayer perceptron, so there's nothing overly complicated about that, but there are a couple things about your setup that stand out:
The output from your model uses a softmax, which is normally used to represent values from a categorical distribution (i.e., 1-of-k) -- this is typical for a classification model. But the loss you're using is typically used for optimizing Gaussian or regression outputs. You might want to try using a cross-entropy loss to see if that helps.
The output from your model is in probability space, so the values you get out from your model are in [0, 1]. The loss you're using averages the squared differences between the model output and the target one-hot vector (whose values are in {0, 1}). The value of this loss is never larger than 1, and its gradients are correspondingly small, so with a learning rate below 1 the delta applied to your model weights is always going to be small. Sometimes that's a good thing, but my guess is that in this case (particularly at the start of training, when the model weights aren't near their optimal values) this is going to be quite slow.
Related to the above point, you might try initializing your model weights with a larger range of values than the default. This would help make the gradient values larger, but could also make the model more likely to diverge.
You could also try to replace your softmax output activation with a plain linear activation, in effect converting your model's output to (unnormalized) log-probability space. Then you'd need to change your dataset labels to also represent target log-probability values, which isn't possible exactly (the log of 0 is negative infinity), but you could get close with something like -1e8 * (1 - one_hot). But if you wanted to go this route, you'd effectively be implementing a cross-entropy loss yourself; see the first point.
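The first two points can be checked numerically. The following is a rough NumPy sketch, not TensorFlow code: it compares the gradient with respect to the logits for cross-entropy and for mean squared error on top of a softmax, for one sample with 10 classes and all-zero logits (a stand-in for an untrained model). The function names and the chosen class index are made up for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()


def grad_cross_entropy(z, y):
    """Gradient of -sum(y * log(softmax(z))) with respect to the logits z."""
    return softmax(z) - y


def grad_mse(z, y):
    """Gradient of mean((softmax(z) - y)**2) with respect to the logits z."""
    p = softmax(z)
    jacobian = np.diag(p) - np.outer(p, p)  # d softmax / d z (symmetric)
    return (2.0 / len(z)) * jacobian @ (p - y)


z = np.zeros(10)   # untrained model: uniform prediction over 10 classes
y = np.eye(10)[3]  # one-hot target, class 3 chosen arbitrarily

ce_norm = np.linalg.norm(grad_cross_entropy(z, y))
mse_norm = np.linalg.norm(grad_mse(z, y))
print(f"cross-entropy gradient norm: {ce_norm:.4f}")
print(f"MSE gradient norm:           {mse_norm:.4f}")
print(f"ratio:                       {ce_norm / mse_norm:.1f}")  # about 50 here
```

In this setup the MSE gradient is about 50 times smaller than the cross-entropy gradient, which is consistent with needing a much larger learning rate to make comparable progress.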

Bert prediction shape not equal to num_samples

I am trying to do text classification using BERT. Below is the code I am using. The model training code (below) works fine, but I am facing an issue with the prediction part.
from transformers import TFBertForSequenceClassification
import tensorflow as tf
# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 5e-5
nlabels = 26
# we will do just 1 epoch for illustration, though multiple epochs might be better as long as we will not overfit the model
number_of_epochs = 1
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=nlabels,
                                                        output_attentions=False,
                                                        output_hidden_states=False)
# optimizer Adam
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# we do not have one-hot vectors, so we can use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
bert_history = model.fit(ds_tr_encoded, epochs=number_of_epochs)
I am getting the output using the following
preds = model.predict(ds_te_encoded)
pred_labels_idx = np.argmax(preds['logits'], axis=1)
The issue I am facing is that the shape of pred_labels_idx is not the same as ds_te_encoded
len(pred_labels_idx) #426820
tf.data.experimental.cardinality(ds_te_encoded) #<tf.Tensor: shape=(), dtype=int64, numpy=21341>
Not sure why this is happening.

Since ds_te_encoded is of type tf.data.Dataset and you call cardinality(...), the cardinality in your case is simply the rounded number of batches and not the number of samples. So I am assuming you are using a batch size of 20, because 426820/20 = 21341. That is probably what is causing the confusion.
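The relationship between the two numbers can be sketched in plain Python. The batch size of 20 is the assumption made above, not something the question states, and `num_batches` is a hypothetical helper, not a TensorFlow function.

```python
def num_batches(num_samples, batch_size, drop_remainder=False):
    """Number of batches a dataset of num_samples splits into."""
    if drop_remainder:
        return num_samples // batch_size
    # Ceiling division: a final partial batch still counts as one batch.
    return (num_samples + batch_size - 1) // batch_size


num_predictions = 426820  # len(pred_labels_idx) from the question
assumed_batch_size = 20   # assumption: inferred from 426820 / 21341

print(num_batches(num_predictions, assumed_batch_size))  # 21341
```

So the prediction array has one entry per sample, while the cardinality of the batched dataset counts batches.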

Tensorflow multi-output classification accuracy decrease

I am currently building a multi-output classification model. The model has 2 outputs and I compile it as follows:
model.compile(RMSprop(lr = 0.0003, decay = 1e-6),
              loss = ["categorical_crossentropy", "categorical_crossentropy"],
              metrics = ["accuracy"])
The problem is if I train 2 models separately, each model gained accuracy over 80%. However, when I combined them together, the accuracy is always around 50-60%. I tried to use the loss_weights as well but it hasn't improved.
How can I improve that?

Accuracy Difference in Deep Learning - epochs and final accuracy

I am trying to train a deep learning model for my project. The code for the same is as given below:
model = Sequential()
# Add an input layer
model.add(Dense(22, activation='relu', input_shape=(22,)))
# Add one hidden layer
model.add(Dense(12, activation='relu'))
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, gender_train, epochs=90, batch_size=1, verbose=1)
gender_pred = model.predict(x_test)
print(confusion_matrix(gender_test,gender_pred.round()))
print(classification_report(gender_test,gender_pred.round()))
print(accuracy_score(gender_pred.round(),gender_test))
The accuracy reported for each epoch is in the 90s (e.g. 0.97). But when I predict the gender and print the accuracy with accuracy_score, it is only in the 80s. How are these accuracies related? I am new to machine learning, so any help would be appreciated.

The training and test accuracy scores are used to identify over-fitting. A training accuracy that is far higher than test accuracy indicates over-fitting.
Accuracy can be misleading because of the nature of how it is calculated. In your case you are predicting the gender which has two possible values 0 and 1. If the data is imbalanced, e.g. if a large portion of the test data was of one class then predicting that class would provide a high accuracy score but it would not necessarily be true when the classifier is applied to other data.
Accuracy = ((True Positive) + (True Negative)) / (Total Population)
The Confusion Matrix enables an accurate assessment of the model performance.
True Positives (TP) − The case when both actual class & predicted class of data point is 1.
True Negatives (TN) − The case when both actual class & predicted class of data point is 0.
False Positives (FP) − The case when actual class of data point is 0 & predicted class of data point is 1.
False Negatives (FN) − The case when actual class of data point is 1 & predicted class of data point is 0.
Looking at your console output:
TP = 93,
FP = 27,
FN = 24,
TN = 101
Accuracy score : (93 + 101) / (93 + 27 + 24 + 101) = 0.7918367347
Precision : (TP) / (TP + FP) : 93 / (93 + 27) = 0.775
Recall : (TP) / (TP + FN) : 93 / (93 + 24) = 0.7949
Support is the number of samples of the true label in each class of target values. (120 are class 0, 125 are class 1)
F1-score : 2 * (Precision * Recall) / (Precision + Recall)
F1 score is the harmonic mean of the precision and recall. Its best value would be 1 and the worst would be 0.
Being able to interpret these variables will help you to better understand your model.
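As a small sketch, the formulas above can be reproduced in plain Python from the four counts quoted in this answer. The helper name `summarize` is made up for illustration; scikit-learn's classification_report computes the same quantities per class.

```python
def summarize(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts taken from the answer above.
accuracy, precision, recall, f1 = summarize(tp=93, fp=27, fn=24, tn=101)
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```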
If you are only beginning your Machine Learning adventure, I'd highly recommend Andrew Ng's Machine Learning course on Coursera. He explains everything you need to understand quite simply without too much mathematics.

You are mixing up two things. There are generally two sets: the training set, which is used for training and on which you got 90+% accuracy, and the test set, on which you got about 79%. That gap, by the way, is a clear indication of overfitting.
All your confusion matrix, classification report, and lastly accuracy measurements are done using the test set. In your code,
model = Sequential()
# Add an input layer
model.add(Dense(22, activation='relu', input_shape=(22,)))
# Add one hidden layer
model.add(Dense(12, activation='relu'))
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(x_train, gender_train,epochs=90, batch_size=1, verbose=1)
This is the training part, and it properly fits the model to the training pairs x_train -> gender_train. After training for some number of epochs, we test the model on unseen data that was never used for training, and that reveals the model's true performance. In your case, this unseen data is x_test.
gender_pred = model.predict(x_test)
print(confusion_matrix(gender_test,gender_pred.round()))
print(classification_report(gender_test,gender_pred.round()))
print(accuracy_score(gender_pred.round(),gender_test))
If performance on x_test (the unseen data) is much lower than on the training data (x_train), that is called overfitting, and it is not what you want. In your case, your model is suffering from overfitting.
How to overcome
To avoid overfitting, people generally do the following:
hold out a validation set for choosing hyperparameters during training
use strong data augmentation
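For the first item, one possible way to hold out a validation set is sketched below with NumPy. The helper `train_val_split`, the 20% fraction, and the placeholder arrays are assumptions for illustration; they are not from the question.

```python
import numpy as np


def train_val_split(x, y, val_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]


# Placeholder data with the same feature width (22) as the model in the question.
x_all = np.zeros((100, 22), dtype="float32")
y_all = np.zeros(100, dtype="int64")

x_tr, y_tr, x_val, y_val = train_val_split(x_all, y_all)
print(x_tr.shape, x_val.shape)  # (80, 22) (20, 22)
```

The held-out pair could then be passed to Keras as `validation_data=(x_val, y_val)` in `model.fit`, so that a validation accuracy is reported next to the training accuracy after every epoch.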
I would recommend reading these two articles to get a better idea. Hope that helps.
Overfitting and Underfitting With Machine Learning Algorithms
Underfitting vs. Overfitting (vs. Best Fitting) in Machine Learning

What Does Accuracy Metrics Mean in Keras' Sample Denoising Autoencoder?

I am working with Keras' sample denoising autoencoder;
https://keras.io/examples/mnist_denoising_autoencoder/
As I compile it, I use the following options:
autoencoder.compile(loss='mse', optimizer= Adadelta, metrics=['accuracy'])
Then I train. I deliberately trained WITHOUT the noisy training data (x_train_noisy) and merely tried to recover x_train.
autoencoder.fit(x_train, x_train, epochs=30, batch_size=128)
After training on the 60,000 MNIST digits, it gives me an accuracy of 81.25%. Does that mean 60000*81.25% of the images are PERFECTLY recovered (equal to the original input pixel by pixel), that is, 81.25% of the autoencoder's output images are IDENTICAL to their inputs, or does it mean something else?
Furthermore, I also conducted a manual check by comparing output and the original data (60000 28X28 matrices) pixel by pixel--counting non-zeros elements from their differences:
x_decoded = autoencoder.predict(x_train)
temp = x_train*255
x_train_uint8 = temp.astype('uint8')
temp = x_decoded*255
x_decoded_uint8 = temp.astype('uint8')
c = np.count_nonzero(x_train_uint8 - x_decoded_uint8)
cp = 1-c /60000/28/28
Yet cp is only about 71%. Could anyone tell me why there is a difference?

Accuracy doesn't make sense for a regression problem, hence the keras sample doesn't use that metric during autoencoder.compile.
In this case, Keras calculates the accuracy using the binary_accuracy metric:
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
Using this numpy implementation, you should get the same value as output by Keras for validation accuracy at the end of training.
x_decoded = autoencoder.predict(x_test_noisy)
acc = np.mean(np.equal(x_test, np.round(x_decoded)))
print(acc)
Refer to this answer for more details:
What function defines accuracy in Keras when the loss is mean squared error (MSE)?
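To see why this number is hard to interpret for an autoencoder, here is a tiny NumPy sketch. It assumes, as stated above, that the accuracy is computed like binary_accuracy; the five pixel values are made up for illustration.

```python
import numpy as np


def binary_accuracy(y_true, y_pred):
    """NumPy version of the metric above: round predictions, compare exactly."""
    return float(np.mean(np.equal(y_true, np.round(y_pred))))


# Made-up pixel intensities in [0, 1]; the last two are reconstructed perfectly.
y_true = np.array([0.0, 0.0, 1.0, 0.6, 0.25])
y_pred = np.array([0.1, 0.4, 0.9, 0.6, 0.25])

acc = binary_accuracy(y_true, y_pred)
print(acc)  # 0.6
```

The first three pixels count as correct although they are not reconstructed exactly, and the last two count as wrong although they are, because a rounded prediction can only ever equal a target that is exactly 0 or 1. The manual check in the question instead compares 8-bit pixel values for exact equality, which is a different criterion, so the two percentages are not expected to agree.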
