Val_Acc > Training_Acc, yet Val_Loss < Training_Loss

I am training a neural network with one hidden layer and I get this strange behavior all the time: Val_Accuracy is higher than Training_Acc, yet Val_Loss is lower than Training_Loss. I do not understand what this might indicate.
I am doing classification on a 5-class target, with sparse_categorical_crossentropy as the loss and sparse_categorical_accuracy as the additional metric. (Yes, my target comes as an integer, it is not one-hot encoded; "sparse" is correct.)
I tried a lot of hyperparameter tweaking, especially the learning rate and the number of nodes.
Because I am using Dropout, I am aware that testing accuracy/loss can be better than training. The question is about the two metrics differing in this way.
opti = tf.keras.optimizers.Adam(lr=0.00001)
batch = 100
mindelta = 0.0001
lr_red_factor = 0.4
acti = 'relu'
actilast = 'softmax'
kern = 'he_normal'
bias = 'zeros'
droprate = 0.4
target = 5
loss = 'sparse_categorical_crossentropy'#'mean_squared_error'
metrics = ['sparse_categorical_accuracy']
nodesperlayer = 30
howmanyhidden = 1
patience = (4000/(nodesperlayer*howmanyhidden))/batch
model = models.Sequential()
first = True
for _ in range(howmanyhidden):
    if first:
        model.add(layers.Dense(nodesperlayer, use_bias=True,
                               kernel_initializer=kern,
                               bias_initializer=bias,
                               input_dim=feats.columns.shape[0]))
        first = False
    else:
        model.add(layers.Dense(nodesperlayer, use_bias=True,
                               kernel_initializer=kern,
                               bias_initializer=bias))
    model.add(layers.BatchNormalization(axis=-1, momentum=0.99,
                                        epsilon=0.001, center=True,
                                        scale=True, beta_initializer='zeros',
                                        gamma_initializer='ones',
                                        moving_mean_initializer='zeros',
                                        moving_variance_initializer='ones',
                                        beta_regularizer=None,
                                        gamma_regularizer=None,
                                        beta_constraint=None,
                                        gamma_constraint=None))
    model.add(layers.Activation(acti))
    model.add(layers.Dropout(droprate))  # todo seed?
model.add(layers.Dense(target, activation = actilast))
model.compile(loss=loss, metrics=metrics, optimizer=opti)
Just one of many examples to clarify what I mean:
loss: 1.1345 - sparse_categorical_accuracy: 0.5586 - val_loss: 1.0571 - val_sparse_categorical_accuracy: 0.6028

I think you have a conceptual misunderstanding. The idea of the loss is that you minimize it during training, because smaller losses are better; accuracy behaves the other way around, and higher accuracies are better.
The dynamics of training are such that as the loss decreases, accuracy usually increases, so the behavior you see is correct and completely expected.

Well, loss and accuracy are inversely related to each other (not strictly proportional, but they usually move in opposite directions). "Loss" is a measurement of the error that your model makes, while accuracy is a measurement of how well your model performs. Since the validation error/loss is smaller than the training error, it only makes sense that the validation accuracy is better than the training accuracy.
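As a rough illustration of how the two metrics relate, here is a minimal pure-Python sketch that computes sparse categorical cross-entropy and accuracy for two sets of predictions. The probability tables are made up for illustration and are not taken from the question. The set with the lower loss also has the higher accuracy, which matches the pattern in the question, although no fixed formula ties the two metrics together.

```python
import math

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true (integer) class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def sparse_categorical_accuracy(probs, labels):
    """Fraction of samples whose highest-probability class is the true class."""
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)

labels = [0, 1, 2]

# Hypothetical "training-like" predictions: less confident, one mistake.
train_probs = [[0.5, 0.3, 0.2], [0.4, 0.35, 0.25], [0.2, 0.3, 0.5]]
# Hypothetical "validation-like" predictions: more confident, no mistakes.
val_probs = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]

train_loss = sparse_categorical_crossentropy(train_probs, labels)
val_loss = sparse_categorical_crossentropy(val_probs, labels)
train_acc = sparse_categorical_accuracy(train_probs, labels)
val_acc = sparse_categorical_accuracy(val_probs, labels)

print(f"train: loss={train_loss:.4f} acc={train_acc:.4f}")
print(f"val:   loss={val_loss:.4f} acc={val_acc:.4f}")
```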

Related

Low accuracy of Transformer model for 1D Data

My dataset (a network traffic dataset on which we do binary classification):
The number of features in the data is 25.
This is the Transformer model:
embed_dim = 25 # Embedding size for each token
num_heads = 2 # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer
inputs = layers.Input(shape=(25,1,))
transformer_block = TransformerBlock(25, num_heads, ff_dim)
x = transformer_block(inputs)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
outputs = layers.Dense(1, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
history = model.fit(
x_train, y_train, batch_size=32, epochs=50, validation_data=(x_test, y_test))
But the accuracy is extremely poor and does not change across epochs:
Epoch 1/50
1421/1421 [==============================] - 9s 6ms/step - loss: 0.5215 - accuracy: 0.1192 - val_loss: 0.4167 - val_accuracy: 0.1173

Overall, one should be able to get to 100% (train) accuracy, as long as the data is not contradictory. Arguably the best strategy is to get there first, before worrying about generalisation (test error). For this specific case:
final activation should be sigmoid (otherwise, with softmax over a single unit, we have f(x) = exp(x) / exp(x) = 1)
there is no need to dropout (it will only make training accuracy lower)
global pooling can remove important information - replace it with a Dense layer for the time being
normalise your data, your features are in quite wide ranges, this can cause training to struggle to converge
consider lowering your learning rate, as it will make it easier to overfit to training data
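The first point in the list can be checked numerically. The following is a minimal pure-Python sketch (the helper functions are written here for illustration, they are not Keras code): a softmax over a single logit always returns 1 regardless of the input, whereas a sigmoid still varies with it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function, the usual output activation for one binary unit."""
    return 1.0 / (1.0 + math.exp(-z))

test_logits = (-3.0, 0.0, 2.5)

# Softmax over one unit: the output is constant, so the model cannot learn.
single_outputs = [softmax([z])[0] for z in test_logits]
# Sigmoid over one unit: the output depends on the logit.
sigmoid_outputs = [sigmoid(z) for z in test_logits]

for z, s, g in zip(test_logits, single_outputs, sigmoid_outputs):
    print(f"logit={z:+.1f}  softmax={s}  sigmoid={g:.4f}")
```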
If all the above fail, just increase size of the model, as "20-25" range of your features might just not be big enough. Neural networks need quite a lot of redundancy to learn properly.
Personally I would also replace the whole model with just an MLP and verify that everything works. I am not sure why a transformer would be the model of choice here, and an MLP will let you verify whether the issue is with the chosen model or with the code.
Finally, make sure that 100% accuracy is indeed obtainable: take your training data and check whether there are any 2 datapoints that have exactly the same features and different labels. If there are none, you should be able to get to 100% accuracy; it is just a matter of getting the hyperparameters right.
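The final check could be sketched as follows. This is only an illustration on toy data; `find_contradictions` is a hypothetical helper name, and real feature rows would first need to be converted to plain Python lists or tuples.

```python
from collections import defaultdict

def find_contradictions(features, labels):
    """Return feature tuples that appear with more than one distinct label."""
    seen = defaultdict(set)
    for row, label in zip(features, labels):
        seen[tuple(row)].add(label)
    return {row: lbls for row, lbls in seen.items() if len(lbls) > 1}

# Toy data, purely illustrative: rows 0 and 2 share features but not labels.
features = [[0.1, 2.0], [0.3, 1.5], [0.1, 2.0], [0.9, 0.2]]
labels = [0, 1, 1, 0]

conflicts = find_contradictions(features, labels)
print(conflicts)
```

If the returned dictionary is empty, no two identical feature rows carry different labels, so 100% training accuracy is at least not ruled out by the data.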

Keras neural network predicting the same output

I need to develop a neural network with Keras to predict a disease using genetic data. It is known, that predicting this disease is possible even with logistic regression (however the predictions, in this case, are of very poor quality). It's worth mentioning that my data is imbalanced, so I introduced class weights later.
I decided to start with the simplest way to predict it - with a network, analogous to a logistic regression - one hidden layer with one neuron and achieved a bad, yet at least some result - 0.12-0.14 F1 score. Then I tried with 2 hidden and 1 output layers with different amount of neurons in the first hidden layer - from 1 to 8.
It turns out that in some cases it learns something, and in some is predicting the same output for every sample. I displayed the accuracy and loss function over the epochs and this is what I get:
Network loss function by epoch. It's clear that the loss function has roughly the same value, for the training data.
Network accuracy by epoch. It's clear that the accuracy is not improving, but fluctuates from 0 to 1
I searched for similar questions and the suggestions were the following:
Make more neurons - I just have to make it work with 1, 2 or more neurons in the first layer, so I can't add neurons to this one. I increased the number of neurons in the second hidden layer up to 20, but it then stopped predicting anything with any number of neurons in the first layer.
Make more layers - I tried adding one more layer, but still have the same problem
To introduce dropout and increase it - what dropout are we talking about if it can learn with just one layer and one neuron in it
Reduce learning rate - decreased it from the default 10^(-3) to 10^(-4)
Reduce batch size - varied it from 500 samples in a minibatch to 1 (stochastic gradient descent)
More epochs - isn't 20 to 50 epochs on a 500'000 sample dataset enough?
Here's the model:
def run_nn_class_weights(data, labels, model):
    n_iter = 20
    predicted = None
    true = None
    print('Splitting the data')
    x_train, x_valid, y_train, y_valid = train_test_split(data, labels, test_size = 0.05)
    #model = create_model()
    early_stopping_monitor=EarlyStopping(patience=240)
    class_weights = class_weight.compute_class_weight('balanced',
                                                      np.unique(labels),
                                                      labels)
    class_weights = dict(enumerate(class_weights))
    hist = model.fit(x_train, y_train, validation_data=[x_valid, y_valid], class_weight=class_weights,
                     epochs=n_iter, batch_size=500, shuffle=True, callbacks=[early_stopping_monitor],verbose=1)
    proba = model.predict(data)
    predicted = proba.flatten()
    true = labels
    return(model, proba, hist)
def old_model_n_pred(n_neurons_1st = 1):
    model = Sequential()
    model.add(Dense(n_neurons_1st, activation='relu', input_shape=(7516,), kernel_initializer='glorot_normal'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    #model.add(Flatten())
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

This is a small network that should be able to converge to something that's not an attractor (getting stuck on a single value).
I suggest taking a look at the weights of all the neurons with ReLU activation.
ReLUs are great because they are quick to compute, but half of the ReLU's domain has a derivative of zero, which doesn't help gradient descent. This might be your case.
I guess in your case the enemy would be the first neuron.
To overcome this problem, I would try to normalize the inputs (to have all samples centered around 0.5 and scaled by the standard deviation). If you do this with a ReLU, it will ignore anything in the range [-inf, sd].
If that does not fix part of the problem, switch to a different activation function in the first layer. A sigmoid will work very well and it's not too expensive for just one neuron.
Also, take a close look at your input distribution. What your network actually does is a sigmoid-like classification, then it uses 4 to 8 neurons to "zoom in on"/correct the important parts of the function that the first transformation didn't account for.
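To make the "dead ReLU" point concrete, here is a minimal pure-Python sketch of a single-neuron first layer. The weight, bias and inputs are hypothetical, and the sketch uses plain zero-mean standardization as a stand-in, which is an assumption rather than the exact scaling described above. With unscaled, all-positive inputs and a negative weight, the gradient through the ReLU is zero for every sample; after scaling, some samples pass gradient again.

```python
import statistics

def relu_grad(z):
    """Derivative of ReLU with respect to its input (taken as 0 at z == 0)."""
    return 1.0 if z > 0 else 0.0

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical single-neuron first layer: z = w * x + b
w, b = -0.5, 0.0
raw_inputs = [3.0, 7.0, 12.0, 20.0]  # all positive, unscaled features

# Every pre-activation is negative, so no gradient reaches w or b.
raw_grads = [relu_grad(w * x + b) for x in raw_inputs]
# After standardization the inputs straddle zero, so some gradients survive.
scaled_grads = [relu_grad(w * x + b) for x in standardize(raw_inputs)]

print("gradients on raw inputs:   ", raw_grads)
print("gradients on scaled inputs:", scaled_grads)
```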

Why neural network predicts wrong on its own training data?

I made an LSTM (RNN) neural network with supervised learning for stock data prediction. The problem is: why does it predict wrong on its own training data? (note: reproducible example below)
I created a simple model to predict the stock price for the next 5 days:
model = Sequential()
model.add(LSTM(32, activation='sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(Dense(y_train.shape[1]))
model.compile(optimizer='adam', loss='mse')
es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(x_train, y_train, batch_size=64, epochs=25, validation_data=(x_test, y_test), callbacks=[es])
The correct results are in y_test (5 values), so the model trains, looking back over the 90 previous days, and then restores the weights from the best result (val_loss=0.0030) with patience=3:
Train on 396 samples, validate on 1 samples
Epoch 1/25
396/396 [==============================] - 1s 2ms/step - loss: 0.1322 - val_loss: 0.0299
Epoch 2/25
396/396 [==============================] - 0s 402us/step - loss: 0.0478 - val_loss: 0.0129
Epoch 3/25
396/396 [==============================] - 0s 397us/step - loss: 0.0385 - val_loss: 0.0178
Epoch 4/25
396/396 [==============================] - 0s 399us/step - loss: 0.0398 - val_loss: 0.0078
Epoch 5/25
396/396 [==============================] - 0s 391us/step - loss: 0.0343 - val_loss: 0.0030
Epoch 6/25
396/396 [==============================] - 0s 391us/step - loss: 0.0318 - val_loss: 0.0047
Epoch 7/25
396/396 [==============================] - 0s 389us/step - loss: 0.0308 - val_loss: 0.0043
Epoch 8/25
396/396 [==============================] - 0s 393us/step - loss: 0.0292 - val_loss: 0.0056
Prediction result is pretty awesome, isn't it?
That's because the algorithm restored the best weights from epoch #5. Okay, let's now save this model to an .h5 file, move back 10 days and predict the last 5 days (in the first example we built the model and validated on 17-23 April, including weekend days off; now let's test on 2-8 April). Result:
It shows an absolutely wrong direction. As we can see, that's because the model was trained and took epoch #5 as the best for the validation set on 17-23 April, but not on 2-8 April. If I try to train more, playing with which epoch to choose, whatever I do, there are always a lot of time intervals in the past that have wrong predictions.
Why does model show wrong results on its own trained data? I trained data, it must remember how to predict data on this piece of set, but predicts wrong. What I also tried:
Use large data sets with 50k+ rows, 20 years stock prices, adding more or less features
Create different types of model, like adding more hidden layers, different batch_sizes, different layers activations, dropouts, batchnormalization
Create custom EarlyStopping callback, get average val_loss from many validation data sets and choose the best
Maybe I am missing something? What can I improve?
Here is a very simple and reproducible example. yfinance downloads S&P 500 stock data.
"""python 3.7.7
tensorflow 2.1.0
keras 2.3.1"""
import numpy as np
import pandas as pd
from keras.callbacks import EarlyStopping, Callback
from keras.models import Model, Sequential, load_model
from keras.layers import Dense, Dropout, LSTM, BatchNormalization
from sklearn.preprocessing import MinMaxScaler
import plotly.graph_objects as go
import yfinance as yf
np.random.seed(4)
num_prediction = 5
look_back = 90
new_s_h5 = True # change it to False when you created model and want test on other past dates
df = yf.download(tickers="^GSPC", start='2018-05-06', end='2020-04-24', interval="1d")
data = df.filter(['Close', 'High', 'Low', 'Volume'])
# drop last N days to validate saved model on past
df.drop(df.tail(0).index, inplace=True)
print(df)
class EarlyStoppingCust(Callback):
    def __init__(self, patience=0, verbose=0, validation_sets=None, restore_best_weights=False):
        super(EarlyStoppingCust, self).__init__()
        self.patience = patience
        self.verbose = verbose
        self.wait = 0
        self.stopped_epoch = 0
        self.restore_best_weights = restore_best_weights
        self.best_weights = None
        self.validation_sets = validation_sets
    def on_train_begin(self, logs=None):
        self.wait = 0
        self.stopped_epoch = 0
        self.best_avg_loss = (np.Inf, 0)
    def on_epoch_end(self, epoch, logs=None):
        loss_ = 0
        for i, validation_set in enumerate(self.validation_sets):
            predicted = self.model.predict(validation_set[0])
            loss = self.model.evaluate(validation_set[0], validation_set[1], verbose = 0)
            loss_ += loss
            if self.verbose > 0:
                print('val' + str(i + 1) + '_loss: %.5f' % loss)
        avg_loss = loss_ / len(self.validation_sets)
        print('avg_loss: %.5f' % avg_loss)
        if self.best_avg_loss[0] > avg_loss:
            self.best_avg_loss = (avg_loss, epoch + 1)
            self.wait = 0
            if self.restore_best_weights:
                print('new best epoch = %d' % (epoch + 1))
                self.best_weights = self.model.get_weights()
        else:
            self.wait += 1
            if self.wait >= self.patience or self.params['epochs'] == epoch + 1:
                self.stopped_epoch = epoch
                self.model.stop_training = True
                if self.restore_best_weights:
                    if self.verbose > 0:
                        print('Restoring model weights from the end of the best epoch')
                    self.model.set_weights(self.best_weights)
    def on_train_end(self, logs=None):
        print('best_avg_loss: %.5f (#%d)' % (self.best_avg_loss[0], self.best_avg_loss[1]))
def multivariate_data(dataset, target, start_index, end_index, history_size, target_size, step, single_step=False):
    data = []
    labels = []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i-history_size, i, step)
        data.append(dataset[indices])
        if single_step:
            labels.append(target[i+target_size])
        else:
            labels.append(target[i:i+target_size])
    return np.array(data), np.array(labels)
def transform_predicted(pr):
    pr = pr.reshape(pr.shape[1], -1)
    z = np.zeros((pr.shape[0], x_train.shape[2] - 1), dtype=pr.dtype)
    pr = np.append(pr, z, axis=1)
    pr = scaler.inverse_transform(pr)
    pr = pr[:, 0]
    return pr
step = 1
# creating datasets with look back
scaler = MinMaxScaler()
df_normalized = scaler.fit_transform(df.values)
dataset = df_normalized[:-num_prediction]
x_train, y_train = multivariate_data(dataset, dataset[:, 0], 0,len(dataset) - num_prediction + 1, look_back, num_prediction, step)
indices = range(len(dataset)-look_back, len(dataset), step)
x_test = np.array(dataset[indices])
x_test = np.expand_dims(x_test, axis=0)
y_test = np.expand_dims(df_normalized[-num_prediction:, 0], axis=0)
# creating past datasets to validate with EarlyStoppingCust
number_validates = 50
step_past = 5
validation_sets = [(x_test, y_test)]
for i in range(1, number_validates * step_past + 1, step_past):
    indices = range(len(dataset)-look_back-i, len(dataset)-i, step)
    x_t = np.array(dataset[indices])
    x_t = np.expand_dims(x_t, axis=0)
    y_t = np.expand_dims(df_normalized[-num_prediction-i:len(df_normalized)-i, 0], axis=0)
    validation_sets.append((x_t, y_t))
if new_s_h5:
    model = Sequential()
    model.add(LSTM(32, return_sequences=False, activation = 'sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
    # model.add(Dropout(0.2))
    # model.add(BatchNormalization())
    # model.add(LSTM(units = 16))
    model.add(Dense(y_train.shape[1]))
    model.compile(optimizer = 'adam', loss = 'mse')
    # EarlyStoppingCust is custom callback to validate each validation_sets and get average
    # it takes epoch with best "best_avg" value
    # es = EarlyStoppingCust(patience = 3, restore_best_weights = True, validation_sets = validation_sets, verbose = 1)
    # or there is keras extension with built-in EarlyStopping, but it validates only 1 set that you pass through fit()
    es = EarlyStopping(monitor = 'val_loss', patience = 3, restore_best_weights = True)
    model.fit(x_train, y_train, batch_size = 64, epochs = 25, shuffle = True, validation_data = (x_test, y_test), callbacks = [es])
    model.save('s.h5')
else:
    model = load_model('s.h5')
predicted = model.predict(x_test)
predicted = transform_predicted(predicted)
print('predicted', predicted)
print('real', df.iloc[-num_prediction:, 0].values)
print('val_loss: %.5f' % (model.evaluate(x_test, y_test, verbose=0)))
fig = go.Figure()
fig.add_trace(go.Scatter(
x = df.index[-60:],
y = df.iloc[-60:,0],
mode='lines+markers',
name='real',
line=dict(color='#ff9800', width=1)
))
fig.add_trace(go.Scatter(
x = df.index[-num_prediction:],
y = predicted,
mode='lines+markers',
name='predict',
line=dict(color='#2196f3', width=1)
))
fig.update_layout(template='plotly_dark', hovermode='x', spikedistance=-1, hoverlabel=dict(font_size=16))
fig.update_xaxes(showspikes=True)
fig.update_yaxes(showspikes=True)
fig.show()

The OP postulates an interesting finding. Let me simplify the original question as follows.
If the model is trained on a particular time series, why can't the model reconstruct previous time series data, which it was already trained on?
Well, the answer is embedded in the training progress itself. Since EarlyStopping is used here to avoid overfitting, the best model is saved at epoch=5, where val_loss=0.0030 as mentioned by the OP. At this point, the training loss is equal to 0.0343, that is, the training RMSE is 0.185. Since the dataset is scaled using MinMaxScaler, we need to undo the scaling of the RMSE to understand what's going on.
The minimum and maximum values of the time sequence are found to be 2290 and 3380. Therefore, having 0.185 as the RMSE of training means that, even for the training set, the predicted values may differ from the ground truth values by approximately 0.185*(3380-2290), that is ~200 units on average.
This explains why there is a big difference when predicting the training data itself at a previous time step.
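The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it takes the loss value and the price range quoted in this answer as given, and treats the scaled RMSE as if it applied uniformly across that range.

```python
import math

train_mse = 0.0343                 # training loss (MSE) at the restored epoch
price_min, price_max = 2290, 3380  # range of the series quoted in the answer

# RMSE in the min-max scaled space (values between 0 and 1)
rmse_scaled = math.sqrt(train_mse)
# Undo the min-max scaling: multiply by the width of the original range
rmse_units = rmse_scaled * (price_max - price_min)

print(f"scaled RMSE: {rmse_scaled:.3f}")
print(f"RMSE in price units: {rmse_units:.0f}")
```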
What should I do to perfectly emulate training data?
I asked myself this question. The simple answer is: make the training loss approach 0, that is, overfit the model.
After some training, I realized that a model with only 1 LSTM layer that has 32 cells is not complex enough to reconstruct the training data. Therefore, I have added another LSTM layer as follows.
model = Sequential()
model.add(LSTM(32, return_sequences=True, activation = 'sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
# model.add(Dropout(0.2))
# model.add(BatchNormalization())
model.add(LSTM(units = 64, return_sequences=False,))
model.add(Dense(y_train.shape[1]))
model.compile(optimizer = 'adam', loss = 'mse')
And the model is trained for 1000 epochs without considering EarlyStopping.
model.fit(x_train, y_train, batch_size = 64, epochs = 1000, shuffle = True, validation_data = (x_test, y_test))
At the end of 1000th epoch we have a training loss of 0.00047 which is much lower than the training loss in your case. So we would expect the model to reconstruct the training data better. Following is the prediction plot for Apr 2-8.
A Final Note:
Training on a particular dataset does not necessarily mean that the model should be able to perfectly reconstruct the training data. Especially when methods such as early stopping, regularization and dropout are introduced to avoid overfitting, the model tends to generalize rather than memorize the training data.

As others have said already, you should not expect a lot from this.
Nevertheless, I found the following in your code:
You are re-fitting the scaler every time during training and testing. You need to save the scaler and only transform the data during testing; otherwise, the results will be slightly different:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
scaler_filename = "scaler.save"
if new_s_h5:
    scaler = MinMaxScaler()
    df_normalized = scaler.fit_transform(df.values)
    joblib.dump(scaler, scaler_filename)
else:
    scaler = joblib.load(scaler_filename)
    df_normalized = scaler.transform(df.values)
Set shuffle=False, as you need to keep the order of your dataset.
Set batch_size=1, as it will be less overfitting-prone; the learning will be noisier and the error less averaged.
Set epochs=50 or more.
With above-mentioned settings, the model achieved loss: 0.0037 - val_loss: 3.7329e-04.
Check the following samples of prediction:
From 17/04/2020 --> 23/04/2020:
From 02/04/2020 --> 08/04/2020:
From 25/03/2020 --> 31/03/2020:

Suspect #1 - Regularization
Neural networks are great at overfitting the training data. In fact, there is an experiment that replaces the CIFAR10 (image classification task) labels (y values) with random labels on the training dataset, and the network fits the random labels, resulting in almost zero loss.
On the left side we can see that, given enough epochs, random labels reach around 0 loss, a perfect score (from "Understanding deep learning requires rethinking generalization" by Zhang et al., 2016).
So why isn't it happening all the time? Regularization.
Regularization is (roughly) trying to solve a harder problem than the optimization problem (the loss) we defined for the model.
Some common regularization methods in neural networks:
early stopping
dropout
batch normalization
weight decay (e.g. l1 l2 norms)
data augmentation
adding random/gaussian noise
These methods help reduce overfitting and usually result in better validation and test performance, but result in lower train performance (which actually doesn't matter, as explained in the last paragraph).
Train data performance is usually not so important; that is what we use the validation set for.
Suspect #2 - Model Size
You are using a single LSTM layer with 32 units. That's pretty small.
Try increasing the size and even stacking two LSTM layers (or a bidirectional one), and I'm sure the model and the optimizer will overfit your data as long as you let them, i.e. remove the early stopping, restore_best_weights and any other regularization specified above.
Note on Problem Complexity
Trying to predict future stock prices just by looking at the history is not an easy task, and even if the model can (over)fit the training set perfectly, it probably won't do anything useful on the test set or in the real world.
ML is not black magic: the x samples need to be correlated in some way with the y tags, and we usually assume that (x, y) are drawn together from some distribution.
A more intuitive way to think about it: when you need to tag an image manually as dog/cat, that's pretty straightforward. But can you manually "tag" the stock price by looking at the history of that stock alone?
That gives some intuition on how hard this problem is.
Note on Overfitting
One should not chase higher training performance. It's almost useless to try to overfit the training data, as we usually want a model to perform well on new, unseen data with similar properties to the train data. The whole idea is to generalize and learn the properties of the data and their correlation with the target; that's what learning is :)

Why does model show wrong results on its own trained data? I trained data, it must remember how to predict data on this piece of set, but predicts wrong.
You want the model to learn the relationship between input and output instead of memorization. If a model memorizes the correct output for each input we can say it is over fitting the training data. Often you can force the model to overfit by using a small subset of the data, so if that's the behavior you want to see you can try that.

Basically, if you want better results on the training data, your training accuracy should be as high as possible. You should use a model better suited to the data you have, and for this purpose you should check your training accuracy regardless of the test accuracy. This is also called overfitting, which gives better accuracy on the training data than on the test data.
Early stopping might affect this scenario, since it keeps the best test/validation accuracy rather than the best training accuracy.

The short answer:
Set:
batch_size = 1
epochs = 200
shuffle = False
Intuition: You are describing the priority of high accuracy in the training data. This is describing overfitting. To do that, set the batch size to 1, the epochs high, and shuffling off.

It's under-fitting, and to improve that I think you need to add neurons to your hidden layers.
Another point: try the activation function 'relu'. Sigmoid doesn't give good results. Also you need to define 'softmax' in your output layer.

Why does model show wrong results on its own trained data? I trained data, it must
remember how to predict data on this piece of set, but predicts wrong.
Look at what you are doing:
Building a model with some layers
Training the model with the training_data
When you train the model, all trainable parameters get trained (i.e., the weights of the model get saved)
These weights now represent the relationship between input and output.
When you predict on the same training_data again, the trained model uses these weights to get the output.
The quality of your model now decides the predictions, and hence they differ from the original results even though the data is the same.

Different training accuracy for different models but same testing accuracy

I am working to develop a deep learning classifier- 2 classes. The dataset I am working with is imbalanced. I did down sampling to resolve the same. I then create a small sample of data of both classes and create a Deep Learning model as follows:
dl_model = Sequential()
n_cols = X_train.shape[1]
dl_model.add(Dense(1024, activation='relu', input_shape=(n_cols,)))
dl_model.add(Dense(512, activation='relu'))
dl_model.add(Dense(256, activation='relu'))
dl_model.add(Dense(256, activation='relu'))
dl_model.add(Dense(128, activation='relu'))
dl_model.add(Dense(64, activation='relu'))
dl_model.add(Dense(2, activation='softmax'))
adam= optimizers.Adam(lr=0.001)
dl_model.compile(optimizer=adam, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
early_stopping_monitor = EarlyStopping(patience=3)
dl_model.fit(X_train, y_train, epochs=10, validation_split=0.2, batch_size=1000,callbacks=[early_stopping_monitor], shuffle=True)
model_json = dl_model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
dl_model.save_weights("model.h5")
For different hyperparameter tuning, I get results like:
Model 1 - train_loss: 7.7971 - train_acc: 0.5160 - val_loss: 9.6992 - val_acc: 0.3982
Model 2 - train_loss: 2.8257 - train_acc: 0.8201 - val_loss: 2.9312 - val_acc: 0.8160
Model 3 - train_loss: 3.1887 - train_acc: 0.8002 - val_loss: 3.5195 - val_acc: 0.7808
I save each of these models and then load it in a different file where I apply the model to the whole dataset and calculate the metrics as follows:
sc = MinMaxScaler()
X = sc.fit_transform(X)
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
loaded_model.load_weights("model.h5")
loaded_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
score = loaded_model.evaluate(X, y, verbose=0)
print("Deep learning accuracy %s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))
All 3 of the above models give the same accuracy, and even the same confusion matrix. What could be the reason? Shouldn't the 3 models give different results, as they have different training accuracy/metrics?
Update:
When loading any of the models, I get the same accuracy of 97.82% and the confusion matrix:
[[143369 0]
[ 2958 0]]

The problem that you have here is that all the neural networks that you have trained are not able to properly learn the second class, the less well-represented one.
The accuracy that you get on the test set is the same because none of model_1, model_2 or model_3 is able to distinguish class 1 from class 2: all three of them recognise class 1 but fail to recognise class 2. In other words, when you evaluate on your test set, the results are the same, regardless of the differences that you see during training.
This observation can be easily inferred from the confusion matrix that you displayed there.
Assume you do not know the above observation. Let us do some simple math:
143369 + 2958 = 146327.
(143369/146327) * 100 = 97.98% (which is a bit higher than your reported accuracy, but in the same ballpark -- the minor difference stems from evaluate_score in keras)
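The same calculation, plus per-class recall, can be sketched in plain Python. The sketch assumes the confusion matrix follows the common convention of rows for true classes and columns for predicted classes; that orientation is an assumption, since the question does not state it.

```python
# Confusion matrix from the question (assumed: rows = true, columns = predicted)
confusion = [[143369, 0],
             [2958, 0]]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Recall per class: correct predictions divided by true members of the class
recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]

print(f"total samples: {total}")
print(f"accuracy: {accuracy:.4f}")
print(f"recall per class: {recalls}")
```

A recall of zero for the second class, despite high overall accuracy, is what signals that the minority class is never predicted.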
You can also infer from this (not only from seeing that you have no TP (true positives) for class 2) that you have a problem.
Let us proceed now to tackle this issue!
Given this observation, do the following (or some combination of these steps) to tackle the issue:
First of all, start with a lower learning rate (0.0001 is a much better starting choice).
Second of all, consult the following procedures in order to obtain a good model:
Remove the EarlyStopping(patience=3).
Save your best model according to a different metric than accuracy (F1-Score for example)
Reduce the learning_rate while training (ReduceLROnPlateau). You can use the following callback, which is much more suitable in your case than EarlyStopping: https://keras.io/callbacks/#reducelronplateau
Use dataset enrichment. The best way to tackle imbalanced datasets is to use oversampling. Rather than under-sampling the well represented class and thus reducing the variance of your dataset, you can balance the support of classes by adding more examples in your minority class.
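As a rough sketch of the oversampling idea in the last point, the following pure-Python helper duplicates randomly chosen minority samples until every class is as large as the biggest one. `random_oversample` is a hypothetical name and the data is a toy example; dedicated libraries offer more refined variants.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw (with replacement) as many extra samples as the class is missing
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Toy imbalanced data: five samples of class 0, one of class 1
samples = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = [0, 0, 0, 0, 0, 1]

balanced_x, balanced_y = random_oversample(samples, labels)
print(Counter(balanced_y))
```

Oversampling would normally be applied to the training split only, so that duplicated samples do not leak into the validation or test data.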

TensorFlow 2.0: Eager execution of training either returns bad results or doesn't learn at all

I am experimenting with TensorFlow 2.0 (alpha). I want to implement a simple feed forward Network with two output nodes for binary classification (it's a 2.0 version of this model).
This is a simplified version of the script. After I defined a simple Sequential() model, I set:
# import layers + dropout & activation
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.activations import elu, softmax
# Neural Network Architecture
n_input = X_train.shape[1]
n_hidden1 = 15
n_hidden2 = 10
n_output = y_train.shape[1]
model = tf.keras.models.Sequential([
Dense(n_input, input_shape = (n_input,), activation = elu), # Input layer
Dropout(0.2),
Dense(n_hidden1, activation = elu), # hidden layer 1
Dropout(0.2),
Dense(n_hidden2, activation = elu), # hidden layer 2
Dropout(0.2),
Dense(n_output, activation = softmax) # Output layer
])
# define loss and accuracy
bce_loss = tf.keras.losses.BinaryCrossentropy()
accuracy = tf.keras.metrics.BinaryAccuracy()
# define optimizer
optimizer = tf.optimizers.Adam(learning_rate = 0.001)
# save training progress in lists
loss_history = []
accuracy_history = []
# loop over 1000 epochs
for epoch in range(1000):
with tf.GradientTape() as tape:
# take binary cross-entropy (bce_loss)
current_loss = bce_loss(model(X_train), y_train)
# Update weights based on the gradient of the loss function
gradients = tape.gradient(current_loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
# save in history vectors
current_loss = current_loss.numpy()
loss_history.append(current_loss)
accuracy.update_state(model(X_train), y_train)
current_accuracy = accuracy.result().numpy()
accuracy_history.append(current_accuracy)
# print loss and accuracy scores each 100 epochs
if (epoch+1) % 100 == 0:
print(str(epoch+1) + '.\tTrain Loss: ' + str(current_loss) + ',\tAccuracy: ' + str(current_accuracy))
accuracy.reset_states()
print('\nTraining complete.')
Training goes without errors, however strange things happen:
Sometimes, the Network doesn't learn anything. All loss and accuracy scores are constant throughout all the epochs.
Other times, the network is learning, but very badly. Accuracy never went beyond 0.4 (while in TensorFlow 1.x I got an effortless 0.95+). Such low performance suggests to me that something went wrong in the training.
Other times, the accuracy is very slowly improving, while the loss remains constant all the time.
What can cause these problems? Please help me understand my mistakes.
UPDATE:
After some corrections, I can make the Network learn. However, its performance is extremely poor. After 1000 epochs, it reaches about 40% accuracy, which clearly means something is still wrong. Any help is appreciated.

The tf.GradientTape is recording every operation that happens inside its scope.
You don't want to record in the tape the gradient calculation, you only want to compute the loss forward.
with tf.GradientTape() as tape:
    # take binary cross-entropy (bce_loss); Keras losses expect (y_true, y_pred)
    current_loss = bce_loss(classification, model(df))
# End of tape scope
# Update weights based on the gradient of the loss function
gradients = tape.gradient(current_loss, model.trainable_variables)
# The tape is now consumed
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
More importantly, I don't see the loop on the training set, therefore I suppose the complete code looks like:
for epoch in range(n_epochs):
    for df, classification in dataset:
        # your code that computes loss and trains
Moreover, the usage of the metrics is wrong.
You want to accumulate, thus update the internal state of the accuracy operation, at every training step and measure the overall accuracy at the end of every epoch.
Thus you have to:
# Measure the accuracy inside the training loop; Keras metrics expect (y_true, y_pred)
accuracy.update_state(classification, model(df))
Call accuracy.result() only at the end of the epoch, when all the accuracy values have been accumulated in the metric.
Remember to call the .reset_states() method to clear the metric's state, resetting it to zero at the end of every epoch.
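The update/result/reset pattern can be illustrated without TensorFlow. `RunningAccuracy` below is a hypothetical toy class that only mimics the method names of a stateful metric; it is not the Keras implementation, and the batches are made-up labels and predictions.

```python
class RunningAccuracy:
    """Toy stateful metric: accumulate over batches, read once per epoch."""

    def __init__(self):
        self.correct = 0
        self.seen = 0

    def update_state(self, y_true, y_pred):
        # Add this batch's counts to the running totals
        self.correct += sum(1 for t, p in zip(y_true, y_pred) if t == p)
        self.seen += len(y_true)

    def result(self):
        return self.correct / self.seen if self.seen else 0.0

    def reset_states(self):
        self.correct = 0
        self.seen = 0

# Two hypothetical epochs, each a list of (labels, predictions) batches
epochs = [
    [([1, 0, 1], [1, 1, 1]), ([0, 0], [0, 1])],
    [([1, 0, 1], [1, 0, 1]), ([0, 0], [0, 0])],
]

accuracy = RunningAccuracy()
history = []
for batches in epochs:
    for y_true, y_pred in batches:
        accuracy.update_state(y_true, y_pred)  # inside the batch loop
    history.append(accuracy.result())          # once, at the end of the epoch
    accuracy.reset_states()                    # start the next epoch from zero
print(history)
```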
