Explosion in loss function, LSTM autoencoder - python

I am training an LSTM autoencoder, but the loss function randomly shoots up as in the picture below:
I tried multiple things to prevent this, such as adjusting the batch size and the number of neurons in my layers, but nothing seems to help. I checked my input data for null / infinity values, but there are none, and the data is normalized as well. Here is my code for reference:
from keras.models import Sequential
from keras.layers import Masking, LSTM, RepeatVector, TimeDistributed, Dense
from sklearn.model_selection import train_test_split

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(430, 3)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, activation='relu'))
model.add(RepeatVector(430))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(3)))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
context_paths = loadFile()
X_train, X_test = train_test_split(context_paths, test_size=0.20)
history = model.fit(X_train, X_train, epochs=1, batch_size=4, verbose=1, validation_data=(X_test, X_test))
The loss function explodes at random points in time, sometimes sooner, sometimes later. I read this thread about possible problems, but at this point, after trying multiple things, I am not sure what else to do to prevent the loss from skyrocketing at random. Any advice is appreciated. Apart from this, I can see that my accuracy is not increasing very much, so the problems may be interconnected.

Two main points:
First point, as highlighted by Daniel Möller:
Don't use 'relu' for LSTM, leave the standard activation which is 'tanh'.
Second point: one way to fix exploding gradients is to use clipnorm or clipvalue for the optimizer.
Try something like this for the last two lines.
For clipnorm:
opt = tf.keras.optimizers.Adam(clipnorm=1.0)
For clipvalue:
opt = tf.keras.optimizers.Adam(clipvalue=0.5)
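Either way, the clipped optimizer then replaces the 'adam' string in compile. A minimal sketch (using clipnorm here, with your model definition unchanged; metrics=['accuracy'] is dropped because accuracy is not meaningful for a reconstruction loss):
import tensorflow as tf

# Clip the global gradient norm to 1.0 before each update step
opt = tf.keras.optimizers.Adam(clipnorm=1.0)

model.compile(optimizer=opt, loss='mean_squared_error')
history = model.fit(X_train, X_train, epochs=1, batch_size=4, verbose=1,
                    validation_data=(X_test, X_test))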
See this post for help (previous version of TF):
How to apply gradient clipping in TensorFlow?
And this post for general explanation:
https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/

Two main issues:
Don't use 'relu' for LSTM; leave the standard activation, which is 'tanh'. Because LSTMs are "recurrent", it's very easy for them to accumulate growing or shrinking values to the point where the numbers become useless.
Check the range of your data X_train and X_test. Make sure they're not too big. Something between -4 and +4 is sort of good. You should consider normalizing your data if it's not normalized yet.
Notice that "accuracy" doesn't make any sense for problems that are not classificatino. (I notice your final activation is "linear", so you're not doing classification, right?)
Finally, if the two hints above don't work. Check whether you have an example that is all zeros, this might be creating a "full mask" sequence, and this "might" (I don't know) cause a bug.
(X_train == 0).all(axis=(1, 2)).any()  # should be False
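For reference, a minimal sketch of the autoencoder with point 1 applied (the activation='relu' arguments simply removed so the LSTMs fall back to their default tanh activation); this is just the suggestion above written out, not a tested fix:
from keras.models import Sequential
from keras.layers import Masking, LSTM, RepeatVector, TimeDistributed, Dense

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(430, 3)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))  # default tanh activation
model.add(RepeatVector(430))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(TimeDistributed(Dense(3)))
model.compile(optimizer='adam', loss='mean_squared_error')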

Related

What Activation Function is appropriate for input range (0,1) and output range (-∞,∞) for Regression Network in Keras

The input images are normalized to (0, 1),
and the output is float32 values with a pseudo-Gaussian distribution over (-∞, ∞).
When fitted, both train and validation accuracy read over 0.999,
but when I predict on the train and validation sets, the model does not reproduce the targets:
the predicted output shows only negative values (and a few identical positive values).
Is this problem caused by a wrong choice of activation function?
I have tried 'linear' and 'sigmoid' instead of 'relu', too.
The results were the same.
model = Sequential()
model.add(Convolution1D(filters=64, kernel_size=2, input_shape=(img_width, img_height)))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size=(2)))
model.add(Convolution1D(filters=32, kernel_size=2))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size=(2)))
model.add(Flatten())
model.add(Dense(256))
model.add(Activation("relu"))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer=optimizers.RMSprop(lr=0.0001), metrics=['accuracy'])
Fitting and prediction are done like this:
model.fit(x_train, y_train, epochs=2,
          validation_data=(x_valid, y_valid),
          batch_size=2048,
          shuffle='batch',
          use_multiprocessing=True)
# right after fitting
result = model.predict(x_train, use_multiprocessing=True)
First of all, it's extremely hard to design a model that outputs over such a big range; the error rate of the model will be extremely high.
I suggest you normalize your outputs in range (0., 1.) and use sigmoid in the last layer.
You can always use an inverse transform to reconstruct the original outputs.
mn = np.min(y_train)
mx = np.max(y_train)
y_train = (y_train - mn)/(mx - mn)
# ... train
# inverse transform
y_train_original = y_train*(mx-mn) + mn
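Alternatively, a minimal sketch of the same transform using scikit-learn's MinMaxScaler (assuming y_train is a 1-D NumPy array):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
y_train_scaled = scaler.fit_transform(y_train.reshape(-1, 1))  # scale targets to (0, 1)
# ... train on y_train_scaled ...
y_pred_original = scaler.inverse_transform(model.predict(x_train))  # back to the original range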
When fitted, both train and validation accuracy read over 0.999, but when I predict on the train and validation sets, the model does not reproduce the targets.
Reason: overfitting. Your data are impossible to learn with such a complex output distribution, so the model just blindly memorizes the training data without learning any patterns.
To avoid this:
Normalize the outputs.
model.add(Dense(256)) - reduce the number of neurons here; try 32 -> 64 -> 128.
Use dropout.
Also, Convolution1D is not the standard choice for images; I suggest Convolution2D.
Secondly, 'accuracy' is not the correct metric for a regression task; good choices are mean squared error (mse), mean absolute error (mae), and root mean squared error (rmse).
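A minimal sketch putting these suggestions together (it assumes the inputs can be reshaped to single-channel images for Conv2D and that the targets are normalized to (0, 1); the layer sizes are illustrative, not tuned):
import numpy as np
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout
from keras import optimizers

# Normalize the targets to (0, 1), as suggested above
mn, mx = np.min(y_train), np.max(y_train)
y_train_n = (y_train - mn) / (mx - mn)

# Add a channel axis so Conv2D can consume the images
x_train_img = x_train.reshape(-1, img_width, img_height, 1)

model = Sequential()
model.add(Convolution2D(filters=32, kernel_size=(3, 3), activation='relu',
                        input_shape=(img_width, img_height, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))    # fewer neurons than the original 256
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # matches the (0, 1) target range
model.compile(loss='mse', optimizer=optimizers.RMSprop(lr=0.0001), metrics=['mae'])
model.fit(x_train_img, y_train_n, epochs=2, batch_size=2048, validation_split=0.1)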
When fitted, both train and validation accuracy read over 0.999, but when I predict on the train and validation sets, the model does not reproduce the targets.
This suggests something is going wrong with your prediction code, which you have not included. Either something is wrong with your testing data or with the way you are predicting (not loading weights?).

Determine number of Nodes and Layers based on shape of the data

Is there a way to determine the number of nodes and hidden layers based on the shape of the data?
Also, is there a way to determine the best activation function based on the topic?
For example, I'm making a model for fake news prediction. My features are the number of words in the text, the number of words in the title, the number of questions, the number of capital letters, etc.
My dataset has 22 features and around 35000 rows. My output should be 0 or 1.
Based on that, how many layers and nodes should I use and what activation functions are the best for this?
This is my net:
model = Sequential()
model.add(Dense(100, input_dim = features.shape[1], activation = 'relu')) # input layer requires input_dim param
model.add(Dense(100, activation = 'relu'))
model.add(Dense(100, activation = 'relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid')) # sigmoid instead of relu for final probability between 0 and 1
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss="mean_squared_error", optimizer=sgd, metrics=['accuracy'])
# fit the model to the training data
model.fit(x_train, y_train, epochs = 10, shuffle = True, batch_size=32, validation_data=(x_test, y_test), verbose=1)
scores = model.evaluate(features, results)
print(model.metrics_names[1], scores[1]*100)
Selecting those requires prior experience; otherwise we wouldn't need so many ML engineers trying different architectures and writing papers.
But for a start, I would recommend you take a look at AutoKeras. It will help with your problem, as it's a fairly well-known one (text classification): you only need to structure your data as input (X and Y) and then feed it to their TextClassifier, which will try different models (you can specify how many) to choose the best fit for your case.
You can find more examples in the docs here:
https://autokeras.com/tutorial/text_classification/
import autokeras as ak
# Initialize the text classifier.
clf = ak.TextClassifier(max_trials=10) # It tries 10 different models
# Feed the text classifier with training data.
clf.fit(x_train, y_train)
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))
The answer is no and no.
These are also hyperparameters. You can select a bunch of candidates and try them all to get a rough idea of which gives you the best result. The same holds for the activation function as well.
You can use more layers than you need and then apply regularization to avoid producing an overfitted model. Conversely, if the network is too small, you can clearly see the underfitting behaviour from a loss curve that stays at a high training error.
There is no formula for determining all of this. You have to try different things based on the problem at hand, and you will see that some of them work better than others.
For the output, a softmax layer would be good, as it gives you prediction probabilities which you can easily convert to a one-hot encoding.
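A minimal sketch of that try-and-compare approach (a plain grid over the number of hidden layers and their width; the candidate values are illustrative, and it uses binary_crossentropy, the usual loss for a 0/1 target, instead of the mean_squared_error in your snippet):
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_model(n_layers, n_units, input_dim):
    # Build a small fully connected classifier with the given depth/width
    model = Sequential()
    model.add(Dense(n_units, input_dim=input_dim, activation='relu'))
    for _ in range(n_layers - 1):
        model.add(Dense(n_units, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

results = {}
for n_layers in [1, 2, 3]:
    for n_units in [32, 64, 128]:
        model = build_model(n_layers, n_units, features.shape[1])
        model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0,
                  validation_data=(x_test, y_test))
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        results[(n_layers, n_units)] = acc

best = max(results, key=results.get)
print('best (layers, units):', best, 'validation accuracy:', results[best])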

Experiment shows that LSTM does worse than Random Forest... Why?

LSTM is supposed to be the right tool to capture path-dependency in time-series data.
I decided to run a simple experiment (simulation) to assess the extent to which LSTM is better able to understand path-dependency.
The setting is very simple. I just simulate a bunch (N=100) of paths coming from 4 different data-generating processes. Two of these processes represent a real increase and a real decrease, while the other two represent fake trends that eventually revert to zero.
The following plot shows the simulated paths for each category:
The candidate machine learning algorithm will be given the first 8 values of the path ( t in [1,8] ) and will be trained to predict the subsequent movement over the last 2 steps.
In other words:
the feature vector is X = (p1, p2, p3, p4, p5, p6, p7, p8)
the target is y = p10 - p8
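(For concreteness, a minimal sketch of this split, assuming the simulated paths are stored as a NumPy array paths of shape (N, 10); the variable names here are mine, not from the linked code:)
import numpy as np

# paths: array of shape (N, 10), one simulated path per row
X = paths[:, :8]               # first 8 values, t in [1, 8]
y = paths[:, 9] - paths[:, 7]  # movement over the last 2 steps: p10 - p8

X_RF = X                       # Random Forest takes the flat feature vector
X_LS = X.reshape(-1, 8, 1)     # LSTM expects (samples, timesteps, features)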
I compared LSTM with a simple Random Forest model with 20 estimators. Here are the definitions and the training of the two models, using Keras and scikit-learn:
# LSTM
model = Sequential()
model.add(LSTM((1), batch_input_shape=(None, H, 1), return_sequences=True))
model.add(LSTM((1), return_sequences=False))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_X_LS, train_y_LS, epochs=100, validation_data=(vali_X_LS, vali_y_LS), verbose=0)
# Random Forest
RF = RandomForestRegressor(random_state=0, n_estimators=20)
RF.fit(train_X_RF, train_y_RF);
The out-of-sample results are summarized by the following scatter plots:
As you can see, the Random Forest model is clearly outperforming the LSTM. The latter seems to be not able to distinguish between the real and the fake trends.
Do you have any idea to explain why this is happening?
How would you modify the LSTM model to make it better at this problem?
Some remarks:
The data points are divided by 100 to make sure gradients do not explode
I tried to increase the sample size, but I noticed no differences
I tried to increase the number of epochs over which the LSTM is trained, but I noticed no differences (the loss becomes stagnant after a bunch of epochs)
You can find the code I used to run the experiment here
Update:
Thanks to SaTa's reply, I changed the model and obtained much better results:
# Updated LSTM Model
model = Sequential()
model.add(LSTM((8), batch_input_shape=(None, H, 1), return_sequences=False))
model.add(Dense(4))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
Still, the Random Forest model does better. The point is that RF seems to understand that, conditional on the class, a higher p8 predicts a lower outcome p10 - p8 and vice versa, because of the way the noise is added. The LSTM seems to fail at that, so it predicts the class rather well, but we still see that within-class downward-sloping pattern in the final scatter plot.
Any suggestion to improve on that?
I wouldn't expect LSTM to win all the battles against traditional methods, but I do expect it to perform well on the problem you have posed. Here are a couple of things you can try:
1) Increase the number of hidden units in the first layer.
model.add(LSTM((32), batch_input_shape=(None, H, 1), return_sequences=True))
2) The output activation of an LSTM layer is tanh by default, which limits the output to (-1, 1), as you can see in the right plot. I recommend either adding a Dense layer on top or using an LSTM with linear activation for the output. Like this:
model.add(LSTM((1), return_sequences=False, activation='linear'))
Or
model.add(LSTM((16), return_sequences=False))
model.add(Dense(1))
Try the above with the 10K samples that you have.
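Putting both suggestions together, a minimal end-to-end sketch (my own combination of the two options above, compiled with plain MSE since accuracy is not meaningful for regression):
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, batch_input_shape=(None, H, 1), return_sequences=True))  # wider first layer
model.add(LSTM(16, return_sequences=False))
model.add(Dense(1))  # linear output, no (-1, 1) squashing
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(train_X_LS, train_y_LS, epochs=100,
                    validation_data=(vali_X_LS, vali_y_LS), verbose=0)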

Python - Keras Model doesn't converge

I have a network with 32 input nodes, 20 hidden nodes and 65 output nodes. My network's input is actually a hash code of length 32, and the output is the word.
The input is the ASCII value of each character of the hash code. The output of the network is a binary representation I have made: say, for example, a is 00000, b is 00001, and so on. It only includes the alphabet and the space, which is why it's only 5 bits per character. I have a maximum limit of 13 characters in my training input, so my output has 13 * 5 = 65 nodes, and I'm expecting a binary output like 10101010101010101010101010101010101010101010101010101010101001011. The bit sequence can encode at most a 13-character word given a hash code of length 32 as input. Below is my current code:
scaler = MinMaxScaler(feature_range=(0,1))
scaled_train_samples = scaler.fit_transform((train_samples).reshape(-1, 32))
train_labels = train_labels.reshape(-1, 65)
model = Sequential([
    Dense(32, input_shape=(32,), activation='sigmoid'),
    BatchNormalization(),
    Dense(25, activation='tanh'),
    BatchNormalization(),
    Dense(65, input_shape=(65,), activation='sigmoid')
])
overfitCallback = EarlyStopping(monitor='loss', min_delta=0, patience = 1000)
model.summary()
model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_samples, train_labels, batch_size=1000, epochs=1000000, callbacks=[overfitCallback], shuffle = True, verbose=2)
I plan to overfit the model so that it can memorize all the hash codes of the words in the dictionary. As a start, my training set is only around 5,000 samples; I just wanted to see if it would learn from a small dataset. How can I make the network converge faster? It has been running for more than an hour, its loss is still around 0.5004, and the accuracy is 0.7301. The loss goes up and down, but when I check every 10 minutes or so I can see only a little improvement. How should I fine-tune it?
UPDATE:
The training has already stopped, but it didn't converge. Its loss is 0.4614 and accuracy is 0.7422.
There are some hyperparameters that I would suggest changing first.
Try 'relu' or LeakyReLU() as the activation function for the non-output layers. Basically, relu is the standard activation function for baseline models.
The standard optimizer (for most cases) currently is Adam; try using it. Tweak its learning rate when needed. You could get better results with SGD, but it often takes a lot of epochs and a lot of hyperparameter tuning. Adam is basically the quickest optimizer (in general) to reach a 'low' loss.
To prevent overfitting you might also want to add Dropout(0.5), where 0.5 is just an example value.
Once you have reached your lowest loss, you might start changing these hyperparameters even more to try and get an even lower loss.
Apart from this, the first thing I actually suggest is adding multiple hidden layers with different sizes, as sketched below. This might have a much larger impact than trying to optimize all the hyperparameters.
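A minimal sketch of the model with these suggestions applied (relu activations, an extra hidden layer, Dropout, and the Adam optimizer; the layer sizes are illustrative, not tuned):
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.optimizers import Adam

model = Sequential([
    Dense(128, input_shape=(32,), activation='relu'),
    BatchNormalization(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(65, activation='sigmoid')  # 65 independent bits, so sigmoid + binary_crossentropy
])
model.compile(Adam(lr=0.001), loss='binary_crossentropy', metrics=['accuracy'])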
Edit: Maybe you could post a screenshot of your training loss vs epochs for the train & val data? This might make things more clear for others.

How can I intentionally overfit a convolutional neural net in Keras to make sure the model is working?

I'm trying to diagnose what's causing low accuracies when training my model. At this point, I just want to be able to get to high training accuracies (I can worry about testing accuracy/overfitting problems later). How can I adjust the model to overindex on training accuracy? I want to do this to make sure I didn't make any mistakes in a preprocessing step (shuffling, splitting, normalizing, etc.).
#PARAMS
dropout_prob = 0.2
activation_function = 'relu'
loss_function = 'categorical_crossentropy'
verbose_level = 1
convolutional_batches = 32
convolutional_epochs = 5
inp_shape = X_train.shape[1:]
num_classes = 3
def train_convolutional_neural():
    y_train_cat = np_utils.to_categorical(y_train, 3)
    y_test_cat = np_utils.to_categorical(y_test, 3)

    model = Sequential()
    model.add(Conv2D(filters=16, kernel_size=(3, 3), input_shape=inp_shape))
    model.add(Conv2D(filters=32, kernel_size=(3, 3)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(rate=dropout_prob))
    model.add(Flatten())
    model.add(Dense(64, activation=activation_function))
    model.add(Dense(num_classes, activation='softmax'))
    model.summary()

    model.compile(loss=loss_function, optimizer="adam", metrics=['accuracy'])
    history = model.fit(X_train, y_train_cat, batch_size=convolutional_batches,
                        epochs=convolutional_epochs, verbose=verbose_level,
                        validation_data=(X_test, y_test_cat))
    model.save('./models/convolutional_model.h5')
You need to remove the Dropout layer. Here is a small checklist for intentional overfitting:
Remove any regularizations (Dropout, L1 and L2 regularization)
Make sure to set a slower learning rate (Adam is adaptive, so in your case it is fine)
You may want to not shuffle the training samples (e.g. all the first 100 samples are class A, the next 100 are class B, the last 100 are class C). Update: as pointed out by petezurich in the answer below, this should be considered with care, as it could lead to no training effect at all.
Now, if your model overfits easily, then it is a good sign of a strong model, capable of representing the data. Otherwise, you may consider a deeper/wider model, or you should take a good look at the data and ask the question: "Are there really any patterns? Is this trainable?"
In addition to the other valid answers: one very simple way to overfit is to use only a small subset of your data, e.g. only 1 or 2 samples.
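For example (a sketch reusing the variable names from your function; the epoch count is arbitrary):
# Train on just 2 samples for many epochs; training accuracy should reach
# ~1.0 quickly if the preprocessing and model pipeline are correct.
tiny_X = X_train[:2]
tiny_y = y_train_cat[:2]
model.fit(tiny_X, tiny_y, batch_size=2, epochs=200, verbose=1)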
See also this extremely helpful post regarding everything that you can check to make sure your model is working: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
