LSTM accuracy too low - python

I'm a new learner; I'm just trying to get the accuracy and validation accuracy using the code below:
model = Sequential()
model.add(LSTM(10, input_shape=(train_X.shape[1], train_X.shape[2])))
# model.add(Dropout(0.2))
# model.add(LSTM(30, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=120, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
print(history.history['acc'])
The loss value is very low (around 0.0136), but in spite of that I'm getting an accuracy of 6.9% and a validation accuracy of 2.3%, which is very low.

That is because accuracy is meaningful only for classification problems; for regression (i.e. numeric prediction) ones, such as yours, accuracy is meaningless.
What's more, Keras unfortunately will not "protect" you or any other user from putting such meaningless requests in your code: you will not get any error, or even a warning, that you are attempting something that does not make sense, such as requesting the accuracy in a regression setting. See my answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)? for more details and a practical demonstration (the argument is identical for MAE instead of MSE, since both loss functions signify regression problems).
In regression settings, the performance metric is usually the same as the loss (here MAE), so you should just remove the metrics=['accuracy'] argument from your model compilation and worry only about your loss (which, as you say, is indeed low).
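For example, a minimal sketch of the adjusted compilation, assuming the same data and model as above, just without the accuracy request:

model.compile(loss='mae', optimizer='adam')
history = model.fit(train_X, train_y, epochs=50, batch_size=120,
                    validation_data=(test_X, test_y), verbose=2, shuffle=False)
# The loss itself (MAE) is the quantity to monitor in a regression setting
print(history.history['loss'][-1], history.history['val_loss'][-1])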

Related

How to monitor accuracy in tensorflow (metric accuracy is not available)

I would like to monitor accuracy for my TensorFlow model; however, when I compile my model using metrics=['accuracy'] or metrics=[tf.keras.metrics.Accuracy()] and then train it, the following warning pops up:
WARNING:tensorflow: Early stopping conditioned on metric accuracy which is not available. Available metrics are: loss, val_loss
model.compile(optimizer='adam', loss='mean_squared_error', metrics=["tried both options i mentioned"])
callbacks = [EarlyStopping(monitor='accuracy', patience=1000)]
model.fit(x_train, y_train, epochs=5000, batch_size=100, validation_split=0.2, callbacks=callbacks)
Based on the link here:
Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy = (number of correct predictions) / (total number of predictions).
So, for other problems like regression you should use metrics other than accuracy, e.g. metrics=[tf.keras.metrics.MeanSquaredError()].
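For instance, a minimal sketch under that assumption, keeping your data and model but compiling with a regression metric and pointing EarlyStopping at something the model actually reports:

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=[tf.keras.metrics.MeanAbsoluteError(name='mae')])

# Monitor a metric that actually exists for this model: 'val_loss' or 'val_mae'
callbacks = [EarlyStopping(monitor='val_loss', patience=1000)]
model.fit(x_train, y_train, epochs=5000, batch_size=100,
          validation_split=0.2, callbacks=callbacks)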
In addition to Kaveh's answer, there are other metrics for regression problems. One that I think is quite useful is R², the coefficient of determination (https://en.wikipedia.org/wiki/Coefficient_of_determination), and it isn't included in Keras.
The TensorFlow Addons library (https://www.tensorflow.org/addons) implements it, and it can be used in an ANN with the following code:
import tensorflow_addons as tfa

model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01),
              loss="mean_squared_error",
              metrics=[tfa.metrics.RSquare(y_shape=(1,))])

Interpreting training loss/accuracy vs validation loss/accuracy

I have a few questions about interpreting the performance of certain optimizers on MNIST using a LeNet5 network, and about what the validation loss/accuracy vs. training loss/accuracy graphs tell us exactly.
Everything is done in Keras using a standard LeNet5 network, and it is run for 15 epochs with a batch size of 128.
There are two graphs, train acc vs. val acc and train loss vs. val loss. I made 4 graphs because I ran it twice, once with validation_split = 0.1 and once with validation_data = (x_test, y_test) in the model.fit parameters. Specifically, the difference is shown here:
train = model.fit(x_train, y_train, epochs=15, batch_size=128, validation_data=(x_test,y_test), verbose=1)
train = model.fit(x_train, y_train, epochs=15, batch_size=128, validation_split=0.1, verbose=1)
These are the graphs I produced, one set using validation_data=(x_test, y_test) and one using validation_split=0.1 (the plots themselves are not reproduced here).
So my two questions are:
1.) How do I interpret both the train acc vs. val acc and the train loss vs. val loss graphs? What do they tell me exactly, and why do different optimizers perform differently (i.e. the graphs differ as well)?
2.) Why do the graphs change when I use validation_split instead? Which one would be a better choice to use?
I will attempt to provide an answer
You can see that towards the end, training accuracy is slightly higher than validation accuracy and training loss is slightly lower than validation loss. This hints at overfitting, and if you train for more epochs the gap should widen.
Even if you use the same model with the same optimizer, you will notice slight differences between runs because weights are initialized randomly and because of randomness associated with the GPU implementation. You can look here for how to address this issue.
Different optimizers will usually produce different graphs because they update model parameters differently. For example, vanilla SGD updates all parameters at a constant rate at every training step, but if you add momentum the rate depends on previous updates and usually results in faster convergence, which means you can achieve the same accuracy as vanilla SGD in fewer iterations.
The graphs change because the training data changes if you split randomly. But for MNIST you should use the standard test split provided with the dataset.
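A minimal sketch of using that standard split, assuming a 28x28x1 input for the LeNet5 model above and one-hot labels (any further preprocessing is omitted):

from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Standard MNIST split: 60,000 training and 10,000 test images
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

train = model.fit(x_train, y_train, epochs=15, batch_size=128,
                  validation_data=(x_test, y_test), verbose=1)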

My LSTM model overfits over validation data

Here's my LSTM model to classify hand gestures. Initially, I had 1,960 training samples of shape (num_sequences, num_joints, 3) that I reshape to (num_sequences, num_joints*3).
Here's my model:
input_shape = (trainx.shape[1], trainx.shape[2])
print("Build LSTM RNN model ...")
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(171, 66)))
model.add(Bidirectional(LSTM(units=256, activation='tanh', return_sequences=True, input_shape=input_shape)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=True)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=False)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Dense(units=trainy.shape[1], activation="softmax"))
print("Compiling ...")
# Keras optimizer defaults:
# Adam : lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.
# RMSprop: lr=0.001, rho=0.9, epsilon=1e-8, decay=0.
# SGD : lr=0.01, momentum=0., decay=0.
opt = Adam()
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
I get 90% accuracy on the training set and 50% on the test set.
Overfitting is quite common in deep learning.
To circumvent overfitting with your LSTM architecture, try the following things, in this order:
1. Decrease the learning rate from 0.1 or 0.01 to 0.001, 0.0001, or 0.00001.
2. Reduce the number of epochs. You can plot the training and validation accuracy as a function of the number of epochs and see when the training accuracy becomes larger than the validation accuracy; that is the number of epochs you should use. Combine this with the first step, decreasing the learning rate.
3. Then you can try modifying the architecture of the LSTM. Here you have already added dropout (0.5, which is quite high); I would suggest trying 0.2 or 0.3. You have 3 LSTM cells, which is better than 2, and the number of units looks reasonable. What embedding dimension are you currently using? Since you are overfitting, it is worth trying to reduce the number of cells from 3 to 2 while keeping the same number of units.
4. The batch size might matter, as might the distribution of classes in your dataset. Is the dataset equally distributed and equally balanced between the training and validation sets? What I mean is that if one hand gesture is over-represented in the training set compared to the validation set, that might be a problem. A good strategy to overcome this is to keep some part of the data as a test set, then do a 5-fold train/validation cross-validation split using sklearn, train your architecture on each split separately (5 times), and compare the training and validation accuracy. If there is a big bias in the split or among the sets, you will be able to identify it this way (see the sketch after this list).
5. Last, you can try data augmentation, specifically rotation and horizontal/vertical flips. This library might help: https://github.com/aleju/imgaug
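A minimal sketch of the cross-validation check from point 4, assuming trainx/trainy as defined in the question, tf.keras 2.x metric names ('accuracy'/'val_accuracy'), and a hypothetical build_model() helper that returns a freshly compiled copy of the model above:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratify on the class labels (trainy is assumed to be one-hot encoded)
labels = np.argmax(trainy, axis=1)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (tr_idx, val_idx) in enumerate(skf.split(trainx, labels)):
    model = build_model()  # hypothetical helper rebuilding the compiled model above
    hist = model.fit(trainx[tr_idx], trainy[tr_idx],
                     validation_data=(trainx[val_idx], trainy[val_idx]),
                     epochs=30, batch_size=32, verbose=0)
    print(f"fold {fold}: train acc {hist.history['accuracy'][-1]:.3f}, "
          f"val acc {hist.history['val_accuracy'][-1]:.3f}")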
Hope this helps!
How do you know the network is overfitting versus having some kind of error in your data set? Does the validation loss improve initially up to some epoch and then plateau or start to increase? Then it is overfitting. If it starts at 50% and stays there, it is not an overfitting problem. With the amount of dropout you have, overfitting does not look very likely.
How did you select your validation set? Was it randomly selected from the overall data set, or did you do the selection yourself? It is always better to select the data randomly so that its probability distribution mirrors that of the training data. As said in the comments, please show your code for model.fit; there could be a problem there. How do you input the data? Did you use generators? A 50% validation accuracy leads me to suspect some error in how your validation data is provided to the network, or some error in the labeling of the validation data.
I would also recommend that you consider dynamically adjusting your learning rate based on monitoring of the validation loss. Keras has a callback for this called ReduceLROnPlateau; documentation is here. Set it up to monitor validation loss. I set the parameters patience=3 and factor=.5, which seems to work well. You can think of training as descending into a valley: as you descend, the valley gets narrower, and if the learning rate is too large and remains fixed, you won't be able to reach further down toward the minimum. This should improve your training accuracy, which should in turn improve validation accuracy. As I said, with the level of dropout you have I do not think it is overfitting, but if it is, you can also use Keras regularizers to help avoid overtraining. Documentation is here.
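A minimal sketch of that callback, assuming the compiled model from the question; valx/valy here are hypothetical placeholders for your validation arrays:

from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate when val_loss has not improved for 3 epochs
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=3, verbose=1, min_lr=1e-6)

model.fit(trainx, trainy, epochs=100, batch_size=32,
          validation_data=(valx, valy), callbacks=[reduce_lr])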

How can I get the weights to converge in a way that minimizes the MSE?

Here is my code:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras import backend as K
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

for _ in range(5):
    K.clear_session()
    model = Sequential()
    model.add(LSTM(256, input_shape=(None, 1)))
    model.add(Dropout(0.2))
    model.add(Dense(256))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
    hist = model.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0,
                     validation_data=(x_val, y_val))

    p = model.predict(x_test)
    print(mean_squared_error(y_test, p))
    plt.plot(y_test)
    plt.plot(p)
    plt.legend(['testY', 'p'], loc='upper right')
    plt.show()
Total params : 330,241
samples : 2264
I haven't changed anything; I only re-ran the for loop, and below is the result. As you can see in the picture, the resulting MSE is huge, even though I have just run the same loop again.
I think the fundamental reason for this problem is that the optimizer cannot find the global minimum; it finds a local minimum and converges there. After checking all the loss graphs, the loss is no longer reduced significantly after 20 epochs. So in order to solve this problem I have to find the global minimum. How should I do this?
I tried adjusting the batch_size and the number of epochs. I also tried changing the hidden layer size, the LSTM units, adding a kernel_initializer, changing the optimizer, etc., but could not get any meaningful result.
I wonder how I can solve this problem. Your valuable opinions and thoughts will be very much appreciated.
If you want to see the full source, here is the link: https://gist.github.com/Lay4U/e1fc7d036356575f4d0799cdcebed90e
From your example, the problem simply comes from the fact that you have over 100 times more parameters than you have samples. If you reduce the size of your model, you will see less variance.
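As a rough illustration of shrinking the model, a hedged sketch (the layer sizes here are just an example, not a tuned recommendation):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# Far fewer parameters than the ~330k of the original model
model = Sequential()
model.add(LSTM(16, input_shape=(None, 1)))
model.add(Dropout(0.2))
model.add(Dense(8, activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='rmsprop')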
The wider question you are asking is actually very interesting and usually isn't covered in tutorials. Nearly all machine learning models are stochastic by nature; the output predictions will change slightly every time you run them, which means you will always have to ask the question: which model do I deploy to production?
Off the top of my head there are two things you can do:
Choose the first model trained on all the data (after cross-validation, ...)
Build an ensemble of models that all have the same hyper-parameters and implement a simple voting strategy (a sketch follows below)
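A minimal sketch of the second option for a regression output (simple prediction averaging), assuming a hypothetical build_model() helper that returns a freshly compiled copy of the model above:

import numpy as np

# Train several identically configured models and average their predictions
models = []
for _ in range(5):
    m = build_model()  # hypothetical helper returning the compiled model above
    m.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0,
          validation_data=(x_val, y_val))
    models.append(m)

ensemble_pred = np.mean([m.predict(x_test) for m in models], axis=0)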
References:
https://machinelearningmastery.com/train-final-machine-learning-model/
https://machinelearningmastery.com/randomness-in-machine-learning/
If you want to always start from the same point, you should set a seed. You can do it like this if you use the TensorFlow backend in Keras:
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
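Note that set_random_seed comes from the TensorFlow 1.x API; if you are on TensorFlow 2.x, the equivalent (to the best of my knowledge) is:

import numpy as np
import tensorflow as tf

# Same idea as above, using the TF 2.x seeding functions
np.random.seed(1)
tf.random.set_seed(2)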
If you want to learn why you get different results in ML/DL models, I recommend this article.

Keras Binary Classification - Sigmoid activation function

I've implemented a basic MLP in Keras with TensorFlow and I'm trying to solve a binary classification problem. For binary classification, it seems that sigmoid is the recommended activation function, and I'm not quite understanding why, or how Keras deals with this.
I understand the sigmoid function will produce values in a range between 0 and 1. My understanding is that for classification problems using sigmoid, there will be a certain threshold used to determine the class of an input (typically 0.5). In Keras, I'm not seeing any way to specify this threshold, so I assume it's done implicitly in the back-end? If this is the case, how is Keras distinguishing between the use of sigmoid in a binary classification problem and a regression problem? With binary classification we want a binary value, but with regression a continuous value is needed. All I can see that could indicate this is the loss function. Is that what informs Keras how to handle the data?
Additionally, assuming Keras is implicitly applying a threshold, why does it output continuous values when I use my model to predict on new data?
For example:
y_pred = model.predict(x_test)
print(y_pred)
gives:
[7.4706882e-02] [8.3481872e-01] [2.9314638e-04] [5.2297767e-03]
[2.1608515e-01] ... [4.4894204e-03] [5.1120580e-05] [7.0263929e-04]
I can apply a threshold myself when predicting to get a binary output; however, surely Keras must be doing that anyway in order to classify correctly? Perhaps Keras applies a threshold when training the model, but when I use it to predict new values the threshold isn't used, since the loss function isn't used in prediction? Or is it not applying a threshold at all, and the continuous values output just happen to work well with my model? I've checked that this also happens on the Keras example for binary classification, so I don't think I've made any errors with my code, especially as it's predicting accurately.
If anyone could explain how this is working, I would greatly appreciate it.
Here's my model as a point of reference:
model = Sequential()
model.add(Dense(124, activation='relu', input_shape=(2,)))
model.add(Dropout(0.5))
model.add(Dense(124, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr=0.1, momentum=0.003),
              metrics=['acc'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

score = model.evaluate(x_test, y_test, verbose=0)
The output of a binary classification is the probability of a sample belonging to a class.
how is Keras distinguishing between the use of sigmoid in a binary classification problem, or a regression problem?
It does not need to. It uses the loss function to calculate the loss, then the derivatives, and then updates the weights.
In other words:
During training the framework minimizes the loss. The user must specify the loss function (provided by the framework) or supply their own. The network only cares about the scalar value this function outputs; its two arguments are the predicted ŷ and the actual y.
Each activation function implements the forward propagation and back-propagation functions. The framework is only interested in these 2 functions. It does not care what the function does exactly, as long as it is differentiable for gradient descent to work.
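So no threshold is applied during training or by predict(); if you need hard class labels, you threshold the probabilities yourself. A minimal sketch, assuming the model above and a 0.5 cut-off:

probs = model.predict(x_test)           # continuous probabilities in [0, 1]
labels = (probs > 0.5).astype('int32')  # hard 0/1 labels via an explicit threshold
print(labels[:10].ravel())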
You can assign the threshold explicitly in compile() by using
tf.keras.metrics.BinaryAccuracy(
name="binary_accuracy", dtype=None, threshold=0.5
)
like the following:
model.compile(optimizer='sgd',
              loss='mse',
              metrics=[tf.keras.metrics.BinaryAccuracy()])
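Note that, as far as I understand it, this threshold only affects how the BinaryAccuracy metric binarizes the outputs when computing accuracy; model.predict() still returns the raw sigmoid probabilities, so for hard labels you still apply a cut-off yourself as in the earlier sketch.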
