Keras RNN accuracy doesn't improve

Keras RNN accuracy doesn't improve - python

I'm trying to improve my model so it can become a bit more accurate. Right now I'm training the model and get this as my training and validation accuracy.
For every epoch I get an training accuracy of 0.0003 and an validation accuracy of 0. I know this isn't good but I don't know how I can fix this.
Data is normalized with the minmax scaler. 4 of the 8 features are normalized (other 4 are hour, day, day_of_week and month)
Update:
I've also tried to normalize the entire dataset and it doesn't make a differance
scaling = MinMaxScaler(feature_range=(0,1)).fit(df[cols])
df[[cols]] = scaling.transform(df[[cols]])
My model: The shape is (5351, 1, 8)
and the input_shape is (1, 8)
model = keras.Sequential()
model.add(keras.layers.Bidirectional(keras.layers.LSTM(2,input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True, activation='linear')))
model.add(keras.layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['acc'])
history = model.fit(
X_train, y_train,
epochs=200,
batch_size=24,
validation_split=0.35,
shuffle=False,
)
i tried using the answer of this question:
Keras model accuracy not improving
but it didn't work

A mean_sqared_error loss is for regression tasks while a acc metric is for classification problems. So it makes no sense to use them together.
If you work on a classification problem, use binary_crossentropy or categorical_crossentropy as loss and keep the metric parameter as you did.
If it is a regression tasks, change the metric to [mse] for mean squares error instead of [acc].
Your model "works" and you have applied the standard formula for backpropagation by using the mean squares error loss. But measuring the accuracy will make Keras check if your model's output is EXACTLY equals to the expected values. Since the loss function is for regression, it will hardly ever be equal.
Three last points because that little change won't correct everything.
Firstly, your last dense layer should have an activation function. (It's safier)
Secondly, I'm pretty sure a Bidirectional+LSTM layer placed before a Dense layer should have a return_sequences=False. A LSTM layer (with or without Bidirectional) can return thé full séquence of vector (like a matrix) but a dense layer takes vectors as input. But in this case it will work because of the third point.
The last point is about the shape of your data. You have 5351 examples of shape (1, 8) each which a vector of size 8. But a LSTM layer takes a sequence of vectors still thé size of your séquence is one. I don't know if it is relevent to use an RNN type layer here.

Related

Why do I need a very high learning rate for this model to converge?

I have a simple model in tensorflow which is being trained on the first 1000 images in the MNIST datset. From my previous experience the learning rates which I used were of the order of around 0.001, however for my model to converge the learning rate needs to be far heigher, at least larger than 1. The model is shown below.
def gen_model():
return tf.keras.models.Sequential([
tf.keras.Input(shape=(28,28,)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation='sigmoid'),
tf.keras.layers.Dense(10, activation='softmax')
])
model = gen_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5), loss='mean_squared_error')
model.summary()
model.fit(x_train, y_train, batch_size=1000, epochs=10000)
Is it expected for models of this form to require an extremely high learning rate, or is there something I have missed? When I use a learning rate of around 0.001 the loss changes incredibly slowly.
The dataset was created with the following code:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_train = x_train.reshape(60000,28,28)[:1000];
y_train = y_train[:1000];
y_train = tf.one_hot(y_train, 10)

Generally speaking, models that require learning rates larger than 1 raise a red flag for me. It seems like your model is a vanilla multilayer perceptron, so there's nothing overly complicated about that, but there are a couple things about your setup that stand out:
The output from your model uses a softmax, which is normally used to represent values from a categorical distribution (i.e., 1-of-k) -- this is typical for a classification model. But the loss you're using is typically used for optimizing Gaussian or regression outputs. You might want to try using a cross-entropy loss to see if that helps.
The output from your model is in probability space, so the values you get out from your model are in [0, 1]. The loss you're using is averaging the squared differences between the model output and the target 1-hot vector (whose values are in {0, 1}). The value you'll get for this loss is always smaller than 1, so with a learning rate less than 1, and multiplying by the existing model weights, the delta that you'll apply to your model weights is always going to be small. Sometimes that's a good thing, but my guess is that in this case -- and particularly at the start of training when the model weights aren't near their optimal values -- this is going to be quite slow.
Related to the above point, you might try initializing your model weights with a larger range of values than the default. This would help make the gradient values larger, but could also make the model more likely to diverge.
You could also try to replace your softmax output activation with a plain linear activation, in effect converting your model's output to (unnormalized) log-probability space. Then you'd need to change your dataset labels to also represent target log-probability values, which isn't possible exactly, but could get close with something like 1e8 * (1 - one_hot). But if you wanted to go this route, you'd effectively be implementing a cross-entropy loss yourself; see the first point.

Keras LSTM trained with masking and custom loss function breaks after first iteration

I am attempting to train an LSTM that reads a variable length input sequence and has a custom loss function applied to it. In order to be able to train on batches, I pad my inputs to all be the maxmimum length.
My input data is a float tensor of shape (7789, 491, 11) where the form is (num_samples, max_sequence_length, dimension).
Any sample that is shorter than the maximum length I pad with -float('inf'), so a sequence with 10 values would start with 481 sets of 11 '-inf' values followed by the real values at the end.
The way I am attempting to evaluate this model doesn't fit into any standard loss functions, so I had to make my own. I've tested it and it performs as expected on sample tensors. I don't believe this is the source of the issue so I won't go into details, but I could be wrong.
The problem I'm having comes from the model itself. Here is how I define and train it:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Masking(mask_value=-float('inf'),
input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(30),
kernel_initializer=tf.keras.initializers.zeros())
model.add(tf.keras.layers.Reshape((3, 10)))
model.compile(loss=batched_custom_loss, optimizer='rmsprop', run_eagerly=True)
model.fit(x=train_X, y=train_y, validation_data=val, epochs=5, batch_size=32)
No errors are thrown when I try to fit the model, but it only works on the first batch of training. As soon as the second batch starts, the loss becomes 'nan'. Upon closer inspection, it seems like the LSTM layer is outputting 'nan' after the first epoch of training.
My two guesses for what is going on are:
I set up the masking layer wrong, and it for some reason fails to mask out all of the -inf values after the first training iteration. Thus, -inf gets passed through the LSTM and it goes haywire.
I did something wrong with the format of my loss function, and the when the optimizer applies my calculated loss to the model it ruins the weights of the LSTM. For reference, my loss function outputs a 1D tensor with length equal to the number of samples in the batch. Each item in the output is a float with the loss of that sample.
I know that the math in my loss function is good since I've tested it on sample data, but maybe the output format is wrong even though it seems to match what I've found online.
Let me know if the problem is obvious from what I've shown or if you need more information.

Searching for an LSTM architecture to be used for regression

The lack of good intuition on LSTMs and how they work paired with
an awkward dataset and regression problem leaves me with questions on how to approach and solve my scenario.
I don't want any in depth answers, I seek just for intuition and suggestions.
My dataset consists of:
X flights, each flight has Y timesteps where each timestep has Z features. Every flight is characterized by K which is a 2 value vector (K_1, K_2) and that's the target of regression, predicting these 2 variables.
I've tried several different regression methods and they happen to perform really well. Because I have time dependent trajectories, the other methods calculated stats across each trajectory for the Z features and transformed each whole trajectory to just [Z*l,] - [K_1, K_2] supervised data, where l is just a factor that implies that we have new calculated features. (for example one of these features could be the mean of a feature across all trajectory).
Problem:
I want to implement an LSTM regression pipeline that takes as input the raw dataset (X,Y,Z) and after a dense layer outputs 2 values, K_1 and K_2 (with 1 model or 2 seperate models for each value) and backpropagate correctly the loss of K_1_target, K_2_target.
I've tried it and it seems that it performs really poorly and I don't know if it's a technical mistake or a theoritical mistake.
Below I provide the architecture I use at the moment.
samples, timesteps, features = x_train.shape[0], x_train.shape[1], x_train.shape[2]
model1 = Sequential()
model1.add(Masking(mask_value=-10.0))
model1.add(LSTM(hidden_units, return_sequences = True))
model1.add(Flatten())
model1.add(Dense(hidden_units, activation = "relu"))
model1.add(Dense(1, activation = "linear"))
model1.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model1.fit(x_train, y_train[:,0], validation_data=(x_test, y_test[:,0]), epochs=epochs, batch_size=batch, shuffle=False)
model2 = Sequential()
model2.add(Masking(mask_value=-10.0))
model2.add(LSTM(hidden_units, return_sequences=True))
model2.add(Flatten())
model2.add(Dense(hidden_units, activation = "relu"))
model2.add(Dense(1, activation = "linear"))
model2.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model2.fit(x_train, y_train[:,1], validation_data=(x_test, y_test[:,1]), epochs=epochs, batch_size=batch, shuffle=False)
In my mind it makes sense but it seems that it is not working well..
I'm not entirely sure how the trainable params of LSTM are learning on backpropagation using a dense layer after and backwarding the loss of 2 values which are the same again and again for the same trajectory.
Any kind of clarification, correction or intuition will help me a lot!
Lastly, I'll provide some real details.
K_1 takes discrete values from 0 to 100
K_2 takes 1 floating point precision values from 0 to 1
x_input is a subset of (6991, 527, 6) using k-fold CV, using k = 10, so about (6292, 527, 6) and y_input is of shape (6292, 2). Accordingly for testing.
I,ve used pre padding for even length of trajectories and a masking layer that ignores rows with no data.
I've normalized all my features and target values indepedently with MinMax normalization, and inversed transformed model's output and y_test for correct loss calculation.
The best result I've got till now is a MAE loss is:
K_1 (whose range is 0 - 100) = ~6.0 (Even lasso performs better, while in a non linear problem)
K_2 (whose range is 0 - 1) = ~0.003 (Pretty good)

Keras neural network predicting the same output

I need to develop a neural network with Keras to predict a disease using genetic data. It is known, that predicting this disease is possible even with logistic regression (however the predictions, in this case, are of very poor quality). It's worth mentioning that my data is imbalanced, so I introduced class weights later.
I decided to start with the simplest way to predict it - with a network, analogous to a logistic regression - one hidden layer with one neuron and achieved a bad, yet at least some result - 0.12-0.14 F1 score. Then I tried with 2 hidden and 1 output layers with different amount of neurons in the first hidden layer - from 1 to 8.
It turns out that in some cases it learns something, and in some is predicting the same output for every sample. I displayed the accuracy and loss function over the epochs and this is what I get:
Network loss function by epoch. It's clear that the loss function has roughly the same value, for the training data.
Network accuracy by epoch. It's clear that the accuracy is not improving, but fluctuates from 0 to 1
I searched for similar questions and the suggestions were the following:
Make more neurons - I just have to make it work with 1, 2 or more neurons in the first layer, so I can't add neurons to this one. I increased the amount of neurons in the second hidden layer up to 20, but it then stopped predicting anything with any number oh neurons in the first layer configuration.
Make more layers - I tried adding one more layer, but still have the same problem
To introduce dropout and increase it - what dropout are we talking about if it can learn with just one layer and one neuron in it
Reduce learning rate - decreased it from the default 10^(-3) to 10^(-4)
Reduce batch size - varied it from 500 samples in a minibatch to 1 (stochastic gradient descent)
More epochs - isn't 20 to 50 epochs on a 500'000 sample dataset enough?
Here's the model:
def run_nn_class_weights(data, labels, model):
n_iter = 20
predicted = None
true = None
print('Splitting the data')
x_train, x_valid, y_train, y_valid = train_test_split(data, labels, test_size = 0.05)
#model = create_model()
early_stopping_monitor=EarlyStopping(patience=240)
class_weights = class_weight.compute_class_weight('balanced',
np.unique(labels),
labels)
class_weights = dict(enumerate(class_weights))
hist = model.fit(x_train, y_train, validation_data=[x_valid, y_valid], class_weight=class_weights,
epochs=n_iter, batch_size=500, shuffle=True, callbacks=[early_stopping_monitor],verbose=1)
proba = model.predict(data)
predicted = proba.flatten()
true = labels
return(model, proba, hist)
def old_model_n_pred(n_neurons_1st = 1):
model = Sequential()
model.add(Dense(n_neurons_1st, activation='relu', input_shape=(7516,), kernel_initializer='glorot_normal'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#model.add(Flatten())
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
return model

This is a small network that should be able to converge to something that's not an atractor (getting stuck on a single value).
I suggest taking a look at the weights of all the neurons with ReLu activation.
ReLus are great because get quick calculations; but half of the relu has derivate of zero, which doesn't help with gradient descent. This might be your case.
In guess in yout case the enemy would be the first neuron.
In order to overcome this problem, I would try to do regularize inputs (to have all samples centered around 0.5 and scaled by the standard deviation). If you do this to a ReLU, you'll make it ignore anything under between [-inf, sd].
if that does not fix part of the problem, swich to a different activation function in the first layer. A sigmoid will work very good and it's not too expensive for just one neuron.
Also, take a close look at your input distribution. What your network actually does is doing a sigmoid-like classification, then using between 4 to 8 neurons to "zoom"/correct on the important parts of the function that the first transformation didn't account for.

Loss function for class imbalanced multi-class classifier in Keras

I am trying to apply deep learning to a multi-class classification problem with high class imbalance between target classes (10K, 500K, 90K, 30K). I want to write a custom loss function.
This is my current model:
model = Sequential()
model.add(LSTM(
units=10, # number of units returned by LSTM
return_sequences=True,
input_shape=(timestamps,nb_features),
dropout=0.2,
recurrent_dropout=0.2
)
)
model.add(TimeDistributed(Dense(1)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes,
activation='softmax'))
model.compile(loss="categorical_crossentropy",
metrics = ['accuracy'],
optimizer='adadelta')
Unfortunately, all predictions belong to class 1!!! The model always predicts 1 for any input...
Appreciate any pointers on how I can solve this task.
Update:
Dimensions of input data:
94981 train sequences
29494 test sequences
X_train shape: (94981, 20, 18)
X_test shape: (29494, 20, 18)
y_train shape: (94981, 4)
y_test shape: (29494, 4)
Basically in the train data I have 94981 samples. Each sample contains a sequence of 20 timestamps. There are 18 features.
The imbalance between target classes (10K, 500K, 90K, 30K) is just an example. I have similar proportions in my real dataset.

First of all, you have ~100k samples. Start with something smaller, like 100 samples and multiple epochs and see whether your model overfits to this smaller training dataset (if it can't, you either have an error in your code or the model is not capable to model the dependencies [I would go with the second case]). Seriously, start with this one. And remember about representing all of your classes in this small dataset.
Secondly, hidden size of LSTM may be too small, you have 18 features for each sequence and sequences have length of 20, while your hidden is only 10. And you apply dropout to top it off and regularize the network even further.
Furthermore, you may want to add some dense outputs units instead of merely returning a linear layer of size 10 x 1 for each timestamp.
Last but not least, you may want to upsample the underrepresented data. 0 class would have to be repeated say 50 times (or maybe 25), class 2 something around 4 times and your one around 10-15 times, so the network is trained on them.
Oh, and use cross-validation for your hyperparameters like the hidden size, number of dense units etc.
Plus I don't know for how many epochs you've been training this network, what is your test dataset (it is entirely possible it only constitutes of the first class if you haven't done stratification).
I think this will get you started, hit me up with any doubts in the comments.
EDIT: When it comes to metrics, you may want to check something different than mere accuracy; maybe F1 score and your loss monitoring + accuracy to see how it performs. There are other available choices, for inspiration you can check sklearn's documentation as they provide quite a few options.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.