Accuracy Difference in Deep Learning - epochs and final accuracy - python

I am trying to train a deep learning model for my project. The code for the same is as given below:
model = Sequential()
# Add an input layer
model.add(Dense(22, activation='relu', input_shape=(22,)))
# Add one hidden layer
model.add(Dense(12, activation='relu'))
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, gender_train, epochs=90, batch_size=1, verbose=1)
gender_pred = model.predict(x_test)
print(confusion_matrix(gender_test,gender_pred.round()))
print(classification_report(gender_test,gender_pred.round()))
print(accuracy_score(gender_pred.round(),gender_test))
The accuracy reported for each epoch is in the 90s (e.g. 0.97). But when I predict the gender on the test set and print the accuracy with accuracy_score, it only comes out in the 80s. How are these two accuracies related? I am new to machine learning; any help would be appreciated.

The training and test accuracy scores are used to identify over-fitting. A training accuracy that is far higher than test accuracy indicates over-fitting.
Accuracy can be misleading because of how it is calculated. In your case you are predicting gender, which has two possible values, 0 and 1. If the data is imbalanced (e.g. if a large portion of the test data belongs to one class), then always predicting that class gives a high accuracy score, but the classifier would not necessarily perform well on other data.
Accuracy = ((True Positive) + (True Negative)) / (Total Population)
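For example, here is a tiny sketch (with made-up labels) of how a do-nothing classifier can still score high accuracy on imbalanced data:
from sklearn.metrics import accuracy_score

# Hypothetical, imbalanced labels: 90 samples of class 0, 10 of class 1
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100                       # always predicts class 0
print(accuracy_score(y_true, y_pred))    # 0.9, even though the model learned nothing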
The Confusion Matrix enables an accurate assessment of the model performance.
True Positives (TP) − the case where both the actual and the predicted class of a data point are 1.
True Negatives (TN) − the case where both the actual and the predicted class of a data point are 0.
False Positives (FP) − the case where the actual class of a data point is 0 and the predicted class is 1.
False Negatives (FN) − the case where the actual class of a data point is 1 and the predicted class is 0.
Looking at your console output:
TP = 93,
FP = 27,
FN = 24,
TN = 101
Accuracy score : (93 + 101) / (93 + 27 + 24 + 101) = 0.7918367347
Precision : (TP) / (TP + FP) : 93 / (93 + 27) = 0.775
Recall : (TP) / (TP + FN) : 93 / (93 + 24) = 0.7948
Support is the number of samples whose true label falls in each class of the target values (here, 120 samples are class 0 and 125 are class 1).
F1-score : 2 * (Precision * Recall) / (Precision + Recall)
The F1 score is the harmonic mean of precision and recall. Its best value is 1 and its worst is 0.
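If you want to check these numbers yourself, here is a small sketch using the counts above (assuming the TP/FP/FN/TN assignment given earlier):
TP, FP, FN, TN = 93, 27, 24, 101

accuracy  = (TP + TN) / (TP + TN + FP + FN)     # 0.7918...
precision = TP / (TP + FP)                      # 0.775
recall    = TP / (TP + FN)                      # 0.7948...
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)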
Being able to interpret these variables will help you to better understand your model.
If you are only beginning your Machine Learning adventure, I'd highly recommend Andrew Ng's Machine Learning course on Coursera. He explains everything you need to understand quite simply without too much mathematics.

You are mixing up two things. In general there are two sets: the train set, used for training, where you got the 90+ accuracy, and the test set, where you got 79, which by the way is a clear indication that the model is overfitting.
Your confusion matrix, classification report and accuracy score are all computed on the test set. In your code,
model = Sequential()
# Add an input layer
model.add(Dense(22, activation='relu', input_shape=(22,)))
# Add one hidden layer
model.add(Dense(12, activation='relu'))
# Add an output layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, gender_train, epochs=90, batch_size=1, verbose=1)
This is the training part, and it fits the model to the training pairs, x_train -> gender_train. After training, we test the model on unseen data that was never used for training; that is what reveals the model's true performance. In your case, this unseen data is x_test.
gender_pred = model.predict(x_test)
print(confusion_matrix(gender_test,gender_pred.round()))
print(classification_report(gender_test,gender_pred.round()))
print(accuracy_score(gender_pred.round(),gender_test))
If performance on x_test (the unseen data) is low, or very low compared to the training data (x_train), that is called overfitting, and it is not what you want. In your case, your model is suffering from overfitting.
How to overcome it
To avoid overfitting, people generally:
keep a separate validation set for choosing hyper-parameters during training (a minimal sketch follows below)
use strong data augmentation
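A minimal sketch of the validation-set idea, reusing the model and the x_train/gender_train arrays from the question (the 20% split and the patience value are arbitrary choices, not tuned values):
from tensorflow.keras.callbacks import EarlyStopping

# Hold out 20% of the training data for validation and stop training
# once the validation loss stops improving.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = model.fit(x_train, gender_train,
                    epochs=90,
                    batch_size=1,
                    validation_split=0.2,
                    callbacks=[early_stop],
                    verbose=1)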
For more background, I would also recommend these two articles. Hope that helps.
Overfitting and Underfitting With Machine Learning Algorithms
Underfitting vs. Overfitting (vs. Best Fitting) in Machine Learning

Related

Overfitting on LSTM text classification using Keras

I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles with the aim of classifying them based on a political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy would look very off, like this:
I tried to play around with different DropOut probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the max number of words and max sequence length.
I have managed to get the graphs to align a bit more, however, that has led to the model having less accuracy with the training data (and the problem of overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the training accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64
tokenizer = Tokenizer(num_words=MAX_NB_WORDS,
                      filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
                      lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
X_train.view()
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros like this:
I also did a third attempt, which was just the second attempt with the number of epochs increased; it looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split=0.2,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt shows severe overfitting, with early divergence of your test and train loss. I would try a lower learning rate here (in addition to the regularisation you already added with the dropout layers). Using the default learning rate does not guarantee the best results.
Allow your model to find the global minimum rather than getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, it could be that your early stopping is too strict, i.e. you could increase its threshold. Also consider other optimisers, including SGD with a learning rate scheduler.
A network that is too large leads to overfitting on the training set and difficulty in generalisation. Too many neurons may cause the network to 'memorize' the training set and overfit. I would try 8, 16 or 24 units in your LSTM layer, for example.
Data preprocessing and cleaning. Check your pad_sequences call: it is currently padding the start of each text with zeros. I would pad at the end of the text instead (padding='post'), as in the sketch below.
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get a sizable amount of training text (empirically >= 1M words). I would also try feature engineering and improving data quality, for example with spell checks. Are the classes imbalanced? You may need to balance them by over- or undersampling.
Consider using transfer learning and incorporating a pretrained language model as your embedding layer instead of training one from scratch, e.g. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow
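To make a few of these points concrete, here is a minimal sketch that combines a smaller LSTM layer, an explicit lower learning rate and post-padding, reusing the variables from your preprocessing code (the 16 units and the 1e-4 learning rate are arbitrary starting points, not tuned values):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad/truncate at the end of each sequence instead of the start
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(16, dropout=0.2, recurrent_dropout=0.2))      # smaller LSTM layer
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=1e-4),            # explicit, lower learning rate
              metrics=['accuracy'])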

Selecting validation metric for `categorical_crossentropy` in Keras

I am looking at these two questions and documentation:
Whats the output for Keras categorical_accuracy metrics?
Categorical crossentropy need to use categorical_accuracy or accuracy as the metrics in keras?
https://keras.io/api/metrics/probabilistic_metrics/#categoricalcrossentropy-class
For classification of X-ray images (15 classes) I do:
# Compile the model
model1.compile(optimizer='adam', loss='categorical_crossentropy',
               metrics=['accuracy'])
# Fit the model
history1 = model1.fit_generator(train_generator, epochs=10,
                                steps_per_epoch=10, verbose=1,
                                validation_data=valid_generator)
My model works and I have an output:
But I am not sure how to add validation accuracy here to compare results and avoid over/underfitting.
I hope the following can help you:
The use of "categorical_crossentropy" tells me that your labels are a one hot encoding over different classes.
Let's say you have 15 classes; the correct label is then a vector with fourteen zeros and a one at the corresponding index. In this context "accuracy" will be very high, as your model will be correctly predicting mostly zeros everywhere, so the accuracy can easily be at least 13/15 = 0.86.
A more suitable metric would be "categorical_accuracy" which will give you 1 if the model predicts the correct index, and else 0.
If your validation "categorical_accuracy" is better than 1/15 = 0.067 (assuming your classes are balanced), your model is better than random.
You can find a list of metrics at keras metrics.
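For instance, the compile call from the question could list the metrics explicitly; this is just a sketch, and the top-k metric with k=5 is an arbitrary extra:
import tensorflow as tf

model1.compile(optimizer='adam',
               loss='categorical_crossentropy',
               metrics=['categorical_accuracy',
                        tf.keras.metrics.TopKCategoricalAccuracy(k=5)])
Since you already pass validation_data to fit, the corresponding val_categorical_accuracy will then show up in the epoch logs, which lets you compare training and validation performance directly.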

Keras neural network predicting the same output

I need to develop a neural network with Keras to predict a disease using genetic data. It is known, that predicting this disease is possible even with logistic regression (however the predictions, in this case, are of very poor quality). It's worth mentioning that my data is imbalanced, so I introduced class weights later.
I decided to start with the simplest way to predict it - with a network, analogous to a logistic regression - one hidden layer with one neuron and achieved a bad, yet at least some result - 0.12-0.14 F1 score. Then I tried with 2 hidden and 1 output layers with different amount of neurons in the first hidden layer - from 1 to 8.
It turns out that in some cases it learns something, and in some is predicting the same output for every sample. I displayed the accuracy and loss function over the epochs and this is what I get:
Network loss function by epoch. It's clear that the loss function has roughly the same value, for the training data.
Network accuracy by epoch. It's clear that the accuracy is not improving, but fluctuates from 0 to 1
I searched for similar questions and the suggestions were the following:
Make more neurons - I have to make it work with 1, 2 or more neurons in the first layer, so I can't add neurons to that one. I increased the number of neurons in the second hidden layer up to 20, but it then stopped predicting anything with any number of neurons in the first-layer configuration.
Make more layers - I tried adding one more layer, but still have the same problem
To introduce dropout and increase it - what dropout are we talking about if it can learn with just one layer and one neuron in it
Reduce learning rate - decreased it from the default 10^(-3) to 10^(-4)
Reduce batch size - varied it from 500 samples in a minibatch to 1 (stochastic gradient descent)
More epochs - isn't 20 to 50 epochs on a 500'000 sample dataset enough?
Here's the model:
def run_nn_class_weights(data, labels, model):
    n_iter = 20
    predicted = None
    true = None
    print('Splitting the data')
    x_train, x_valid, y_train, y_valid = train_test_split(data, labels, test_size=0.05)
    #model = create_model()
    early_stopping_monitor = EarlyStopping(patience=240)
    class_weights = class_weight.compute_class_weight('balanced',
                                                      np.unique(labels),
                                                      labels)
    class_weights = dict(enumerate(class_weights))
    hist = model.fit(x_train, y_train, validation_data=[x_valid, y_valid],
                     class_weight=class_weights, epochs=n_iter, batch_size=500,
                     shuffle=True, callbacks=[early_stopping_monitor], verbose=1)
    proba = model.predict(data)
    predicted = proba.flatten()
    true = labels
    return (model, proba, hist)

def old_model_n_pred(n_neurons_1st=1):
    model = Sequential()
    model.add(Dense(n_neurons_1st, activation='relu', input_shape=(7516,),
                    kernel_initializer='glorot_normal'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    #model.add(Flatten())
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
This is a small network that should be able to converge to something that is not an attractor (i.e. not getting stuck on a single output value).
I suggest taking a look at the weights of all the neurons with ReLU activation.
ReLUs are great because they are cheap to compute, but half of a ReLU's input range has a derivative of zero, which doesn't help with gradient descent. This might be your case.
My guess is that in your case the culprit is the first neuron.
In order to overcome this problem, I would try normalising the inputs (centring all samples around 0.5 and scaling by the standard deviation). If you feed such inputs to a ReLU, it will simply ignore everything that falls below zero.
If that does not fix at least part of the problem, switch to a different activation function in the first layer. A sigmoid will work very well, and it is not too expensive for just one neuron.
Also, take a close look at your input distribution. What your network actually does is a sigmoid-like classification in the first layer, and then it uses the 4 to 8 neurons of the next layer to "zoom in" on and correct the important parts of the function that the first transformation did not account for.
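A minimal sketch of those two suggestions, i.e. standardising the inputs (here with scikit-learn's StandardScaler, which centres at 0 rather than 0.5) and using a sigmoid in the first layer; the rest of the architecture is copied from the question:
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense

# Standardize each input feature (zero mean, unit variance)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

model = Sequential()
model.add(Dense(1, activation='sigmoid', input_shape=(7516,),
                kernel_initializer='glorot_normal'))   # sigmoid instead of ReLU in the first layer
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])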

Accuracy of TensorFlow model changes a lot each time i run model.fit

My project is to find out whether I can predict the gender of people speaking near a phone from gyroscope and accelerometer data. I have 315 examples (60 s each), and each example has 2997 lines, where each line is the magnitude of the vector built from the gyro/accelerometer x, y, z axes.
I shuffled the inputs and outputs with the same seed and normalized the input data. I split the data 60|20|20. In this test I use only the accelerometer and try to detect whether a male is speaking, so the output is binary.
When I train the current model, I sometimes get accuracy as high as 0.68 and as low as 0.36, while the loss is almost always around 0.69. Running it in a loop 10 times, the average is 0.5 accuracy and 0.69 loss.
First question: I tried multiple types of models, learning rates, optimization algorithms etc., but on average I wasn't too successful. Should I try recurrent NNs, and where can I learn about them?
Second question: if I train a model that reaches 68% accuracy, is it okay to say the model has 68% accuracy even though I know the average over runs is 50%?
model = tf.keras.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(2997,),
                       kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.5))
for j in range(10):
    model.add(layers.Dense(1024, activation='relu',
                           kernel_regularizer=regularizers.l2(0.001)))
    model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    0.001,
    decay_steps=200,
    decay_rate=1,
    staircase=True)

callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=20
    )
]

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr_schedule),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_vector_examples,
    train_vector_labels,
    validation_split=0.25,
    epochs=80,
    callbacks=callbacks,
    verbose=0,
    shuffle=False
)

loss1, accuracy = model.evaluate(test_vector_examples, test_vector_labels)
This is from my own experience working with different types of data: to get good answers to your questions, you should probably study the characteristics of your data closely before settling on any models/algorithms.
First question: generally speaking, RNNs are good for data with a time dependency, in other words for cases where the order of the inputs matters (e.g. time series, text). So I think RNNs may not be the best choice for your type of data, as I assume ordering does not matter in your dataset.
Second question: this really depends on the difficulty of the problem you are trying to solve, but in my opinion 68% is quite low, since 50% is basically the same as random choice. You probably want to improve the accuracy further.
Also, from your explanation, each gyro/accelerometer reading is built from three axes (x, y, z), so maybe you can try some CNN architectures and see how it goes.
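If you want to experiment with that, a minimal 1-D convolution sketch could look like the following, treating each 2997-value magnitude sequence as a single channel (all filter counts and kernel sizes here are arbitrary starting points, not tuned values):
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Reshape((2997, 1), input_shape=(2997,)),   # (timesteps, channels)
    layers.Conv1D(32, kernel_size=7, activation='relu'),
    layers.MaxPooling1D(4),
    layers.Conv1D(64, kernel_size=5, activation='relu'),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])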

R-squared results of test and validation differ by a huge margin

I am working on a regression problem with keras and tensorflow using a neural network. The data is split, so that 282774 datasets are for training, 70694 are for validation and 88367 are for testing. To evaluate my models I am printing out the mean squared error (MSE), the mean absolute error (MAE) and the R-squared score. These are some examples from the results I get:
            MSE           MAE           R-squared
Training    1.562072899   0.958128839   0.849787137
Validation  0.687871457   0.62066941    0.935365564
Test        0.683918759   0.618674863   -16.22829222
I do not understand the value for R-squared on test data. I know that R-squared can be negative, but how can it be that there is such a big difference between validation and test if both fall into the category of unseen data. Can someone give me a hint?
Some background information:
Since keras does not have the R-squared metric built in, I implemented it with some code I found on the web and which seems logical for me:
from keras import backend as K

def r2_keras(y_true, y_pred):
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
And if it helps: this is my model:
model = Sequential()
model.add(Dense(75, input_shape=(7,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))
adam = optimizers.Adam(lr=0.001)
model.compile(loss='mse',
              optimizer=adam,
              metrics=['mse', 'mae', r2_keras])
history = model.fit(x_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2)
score = model.evaluate(x_test, y_test, batch_size=32)
One strange thing I noticed is, that not all testing data seems to be considered. The console prints out the following:
86304/88367 [============================>.] - ETA: 0s-----
Maybe this leads to a miscalculation for R-squared?
I am thankful for any help/hint I can get on understanding this issue.
Update:
I checked for outliers, but could not find any significant one. Min and max-values for test and train are close by, considering the standard deviation. Also the histograms look very much alike.
So in the next step I let my model predict the values for test data again and used pandas + numpy to calculate the r2_score. This time I got a value which is approximately equal to the r2_score for validation.
Below is how I did it. Do you see any flaws in the way I performed the calculation? (I just want to be sure that the old r2_score for "test" was indeed a calculation error)
# "test" is a dataframe with input data and the real outputs
# "inputs" is a list of the input column names
# The real/true outputs are contained in the column "output"
test['output_pred'] = model.predict(x=np.array(test[inputs]))
output_mean = test['output'].mean() # Is this the correct mean value for r2 here?
test['SSres'] = np.square(test['output']-test['output_pred'])
test['SStot'] = np.square(test['output']-output_mean)
r2 = 1-(test['SSres'].sum()/(test['SStot'].sum()))
TensorFlow's built-in evaluate method evaluates your test set batch by batch, and hence calculates r2 on each batch; the metric reported by model.evaluate() is then the simple average of the r2 values over all batches. In model.fit(), on the other hand, r2 (and the other metrics on the validation set) is calculated per epoch rather than per batch and then averaged.
You can slice your output and output_pred into batches of the same batch size you used in model.evaluate() and calculate r2 on each batch. My guess is that the model produces a high r2 on batches with a high total sum of squares (SS_tot) and a bad r2 on batches with a low one, so the average comes out poor (whereas when r2 is calculated on the entire dataset, the samples with higher SS_tot usually dominate the result).
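You can check this yourself with something like the sketch below (it reuses the test dataframe from your update; sklearn's r2_score is used for convenience, and the batch size of 32 matches the one you passed to model.evaluate()):
import numpy as np
from sklearn.metrics import r2_score

y_true = test['output'].to_numpy()
y_pred = test['output_pred'].to_numpy()

# R-squared over the entire test set
print('full-set r2:', r2_score(y_true, y_pred))

# Mean of per-batch R-squared values, i.e. the batch-averaged aggregation
batch_size = 32
batch_r2 = [r2_score(y_true[i:i + batch_size], y_pred[i:i + batch_size])
            for i in range(0, len(y_true), batch_size)]
print('mean per-batch r2:', np.mean(batch_r2))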
