Avoid overfitting in a sequence-to-sequence problem using Keras (Python)

I'm having a problem with a model I want to train.
It's a typical seq-to-seq problem with an attention layer, where the input is a string, and the output is a substring from the submitted string.
e.g.
Input          Ground Truth
-----------------------------
helloimchuck   chuck
johnismyname   john
(This is just dummy data, not a real part of the dataset. ^^)
And the model looks like this:
model = Sequential()
model.add(Bidirectional(GRU(hidden_size, return_sequences=True), merge_mode='concat',
                        input_shape=(None, input_size)))  # Encoder
model.add(Attention())
model.add(RepeatVector(max_out_seq_len))
model.add(GRU(hidden_size * 2, return_sequences=True))  # Decoder
model.add(TimeDistributed(Dense(units=output_size, activation="softmax")))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])
The problem shows up in the training and validation curves: as you can see, there is overfitting.
I'm using an early-stopping criterion on the validation loss with patience=8.
self.Early_stop_criteria = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0,
                                                         patience=8, verbose=0, mode='auto')
And I'm using one-hot vectors to fit the model.
BATCH_SIZE = 64
HIDDEN_DIM = 128
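For context, the encoding is roughly like this (a minimal sketch; the alphabet and helper name are my own assumptions, not the asker's preprocessing code):
import numpy as np

# Hypothetical character vocabulary; the real dataset's alphabet may differ.
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_to_idx = {c: i for i, c in enumerate(alphabet)}
input_size = len(alphabet)

def one_hot_encode(text):
    # Encode a string as a (len(text), input_size) one-hot matrix.
    encoded = np.zeros((len(text), input_size), dtype=np.float32)
    for t, ch in enumerate(text):
        encoded[t, char_to_idx[ch]] = 1.0
    return encoded

x = one_hot_encode("helloimchuck")  # shape (12, 26)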
The thing is, I've tried other batch sizes, other hidden dimensions, and datasets of 10K, 15K, 25K, and now 50K rows. However, there is always overfitting, and I don't know why.
The test_size = 0.2 and the validation_split = 0.2 are the only parameters I haven't changed. I've also made sure that the dataset is properly built.
The only idea that I have is trying with another validation split, maybe 0.33 instead of 0.2.
I don't know if cross-validation would help.
Maybe someone has a better idea of what I could try. Thanks in advance.

As kvish proposed, dropout was a good solution.
I first tried with a dropout of 0.2.
model = Sequential()
model.add(Bidirectional(GRU(hidden_size, return_sequences=True, dropout=0.2), merge_mode='concat',
                        input_shape=(None, input_size)))  # Encoder
model.add(Attention())
model.add(RepeatVector(max_out_seq_len))
model.add(GRU(hidden_size * 2, return_sequences=True))  # Decoder
model.add(TimeDistributed(Dense(units=output_size, activation="softmax")))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])
With 50K rows it helped, but there was still some overfitting. So I tried a dropout of 0.33, and that worked perfectly.

Related

Predicting high values with keras and regression

I am trying to build a regression model, but the MSE and MAE are very high. I filter and normalize the data (both the input and the output, for both the train and test sets). I think the problem comes from having very high values in one column: the minimum is 1 and the maximum is 9,100,000 (before normalizing), and I actually need to predict these high values.
The model looks like this (I have 6 input columns and 800,000 rows). I have tried more neurons and layers, and changing the sigmoid activation, but the loss and the error stay around 0.8 for MSE and 0.3 for MAE. The predictions are also much lower than they should be, never reaching the high values.
model = Sequential()
model.add(Dense(7, input_dim=num_input, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='rmsprop', metrics=['mse', 'mae'])
history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(x_val, y_val))
A few remarks/pieces of advice:
RMSProp is generally not used with fully connected layers; I recommend switching to Adam or SGD.
If you have a skewed distribution with many large values, consider using the log of these values instead (see the sketch below).
First try a shallow model with few neurons. Then gradually increase the number of neurons in order to overfit the dataset; you should be able to reach a perfect score on the train set. At that point you can start decreasing the number of neurons and adding layers with dropout to improve generalisation.
As already mentioned in the comments, the output activation for regression should be "linear". Sigmoid is for binary classification.
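A minimal sketch of the last two points combined (log-transformed targets and a linear output); it reuses the asker's variable names (x_train, y_train, num_input, ...) and assumes a tf.keras import, so treat it as an illustration rather than a drop-in fix:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Compress the skewed target range (1 .. 9,100,000) with a log transform.
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)

model = keras.Sequential([
    layers.Dense(32, activation='relu', input_dim=num_input),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='linear'),  # linear output for regression
])
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.fit(x_train, y_train_log, epochs=epochs, batch_size=batch_size,
          validation_data=(x_val, y_val_log))

# Undo the log transform on the predictions.
y_pred = np.expm1(model.predict(x_val))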

Poor performance when using Batch Normalization with a larger batch size

I noticed a strange effect when training a Keras model, let's say this one:
model9 = Sequential([
    Dense(512, activation='tanh', input_shape=X_train[0].shape),
    Dense(512 // 2, activation='tanh'),
    tf.keras.layers.BatchNormalization(),
    Dense(512 // 4, activation='tanh'),
    Dense(512 // 8, activation='tanh'),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax')
])
model9.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['acc', 'mse'])
hist9 = model9.fit(X_train, y_train, epochs=350, validation_data=(X_test, y_test), verbose=2)
It performs okay on the validation set, i.e.
loss9, acc9, mse9 = model9.evaluate(X_test, y_test)
print(f"Loss is {loss9},\nAccuracy is {acc9 * 100},\nMSE is {mse9}")
And accuracy is 92%.
But as soon as I set batch_size=128 or some other batch size, my model's performance becomes very poor, i.e. accuracy drops to 70% or 60%. What is the reason? Thanks.
The default setting is batch_size=32. So even if you don't set that parameter, you still have batches.
After every batch, your weights get updated. If you increase your batch_size 4-fold (128), but keep the same number of epochs, you are effectively quartering the number of updates. This can explain the worse performance.
Try increasing the batch-size only a little bit (or even lowering it) and see what happens. Alternatively, increase the number of epochs.
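A rough sketch of the arithmetic behind this: the number of weight updates per epoch is ceil(N / batch_size), so a 4x larger batch gives roughly 4x fewer updates unless you also raise the number of epochs (the scaling below is just an illustration, not a rule):
import math

n_samples = len(X_train)
updates_per_epoch_32 = math.ceil(n_samples / 32)    # default batch size
updates_per_epoch_128 = math.ceil(n_samples / 128)  # roughly 4x fewer updates per epoch

# One way to compensate: scale the epochs with the batch size so the
# total number of updates stays roughly constant.
epochs = 350 * (128 // 32)
hist9 = model9.fit(X_train, y_train, epochs=epochs, batch_size=128,
                   validation_data=(X_test, y_test), verbose=2)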

How to improve prediction with keras and tensorflow

I am using TensorFlow with Keras to perform regression on some historical data. The data looks as follows:
id,timestamp,ratio
"santalucia","2018-07-04T16:55:59.020000",21.8
"santalucia","2018-07-04T16:50:58.043000",22.2
"santalucia","2018-07-04T16:45:56.912000",21.9
"santalucia","2018-07-04T16:40:56.572000",22.5
"santalucia","2018-07-04T16:35:56.133000",22.5
"santalucia","2018-07-04T16:30:55.767000",22.5
And I am reformulating it as a time-series problem (25 time steps) so that I can predict (regress) the next values of the series (the variance should not be high). I am also using sklearn.preprocessing.MinMaxScaler to scale the data to the range (-1, 1) or (0, 1), depending on whether I use LSTM or Dense layers (respectively); a sketch of that scaling is below.
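For reference, the scaling can look roughly like this (a sketch; the ratio_train / ratio_test names are placeholders for the asker's actual arrays):
from sklearn.preprocessing import MinMaxScaler

# (-1, 1) for the LSTM (tanh activations), (0, 1) for the Dense/ReLU model.
scaler = MinMaxScaler(feature_range=(-1, 1))

# Fit on the training portion only, then apply the same transform to the
# test data to avoid leaking test statistics into training.
ratio_train_scaled = scaler.fit_transform(ratio_train.reshape(-1, 1))
ratio_test_scaled = scaler.transform(ratio_test.reshape(-1, 1))

# Predictions can be mapped back to the original scale afterwards.
predictions = scaler.inverse_transform(predictions_scaled)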
I am training with two different architectures:
Dense is as follows:
def get_model(self, layers, activation='relu'):
    model = Sequential()
    # Input arrays of shape (*, layers[1])
    # Output = arrays of shape (*, layers[1] * 16)
    model.add(Dense(units=64, input_shape=(layers[1],), activation=activation))
    model.add(Dense(units=64, activation=activation))
    # model.add(Dropout(0.2))
    model.add(Dense(units=layers[3], activation='linear'))
    # opt = optimizers.Adagrad(lr=self.learning_rate, epsilon=None, decay=self.decay_lr)
    opt = optimizers.rmsprop(lr=0.001)
    model.compile(optimizer=opt, loss=self.loss_fn, metrics=['mae'])
    model.summary()
    return model
This more or less gives good results (it is the same architecture as in TensorFlow's tutorial for predicting house prices).
However, the LSTM is not giving good results; it usually ends up stuck around a value (for example, around 40: 40.0123123, 40.123123, 41.09090, ...) and I do not see why or how to improve it. The architecture is:
def get_model(self, layers, activation='tanh'):
    model = Sequential()
    # Shape = (Samples, Timesteps, Features)
    model.add(LSTM(units=128, input_shape=(layers[1], layers[2]),
                   return_sequences=True, activation=activation))
    model.add(LSTM(64, return_sequences=True, activation=activation))
    model.add(LSTM(layers[2], return_sequences=False, activation=activation))
    model.add(Dense(units=layers[3], activation='linear'))
    opt = optimizers.Adagrad(lr=0.001, decay=self.decay_lr)
    model.compile(optimizer=opt, loss='mean_squared_error', metrics=['accuracy'])
    model.summary()
    return model
I currently train with a batch size of 200 that increases by a factor of 1.5 every fit. Each fit consists of 50 epochs, and I use a Keras EarlyStopping callback with a patience of at least 20 epochs.
I have tried adding more layers and more units, reducing layers and units, and increasing and decreasing the learning rate, but every time it gets stuck around a value. Any reason for this?
Also, do you know of any good practices that can be applied to this problem?
Cheers
Have you tried holding back a validation set and checking how the model's performance on the training set tracks the validation set? This is often how I catch myself overfitting.
A simple function for doing this (adapted from here) can help:
import matplotlib.pyplot as plt

hist = model.fit_generator(...)

def gen_graph(history, title):
    # Plot training vs. validation accuracy over epochs
    plt.plot(history.history['categorical_accuracy'])
    plt.plot(history.history['val_categorical_accuracy'])
    plt.title(title)
    plt.show()

gen_graph(hist, "Accuracy, training vs. validation scores")
Also, do you have enough samples? If you're really sure that you have done as much as you can in terms of preprocessing and hyperparameter tuning, generating some synthetic data or doing some data augmentation has occasionally helped me (see the sketch below).
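As a concrete (and deliberately simple) example of what I mean by augmentation for numeric time-series windows, adding small Gaussian jitter to existing training samples; the array names are assumptions:
import numpy as np

def jitter(windows, sigma=0.01, copies=2):
    # Create noisy copies of (n_samples, timesteps, features) training windows.
    augmented = [windows]
    for _ in range(copies):
        augmented.append(windows + np.random.normal(0.0, sigma, size=windows.shape))
    return np.concatenate(augmented, axis=0)

X_train_aug = jitter(X_train)
y_train_aug = np.concatenate([y_train] * 3, axis=0)  # targets repeated: original + 2 copies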

Different model performance with Keras LSTM sample_weight when padding before or after the actual data

I am building a model in Keras with input data X of variable length, shaped (N_samples, 50, 128). Each sample has 50 time steps, and at each time step I have 128 features. However, I have used zero padding to generate the input X, because not all samples have 50 time steps.
There are two ways of padding with zeros:
For each sample, I feed the true data, say (20, 128), at the beginning, and pad the remaining (30, 128) with zeros.
Or I pad the first 30 rows with zeros and put the data in the last 20 rows.
I then use sample_weight to assign zero weight to the padded time steps (a sketch of both schemes follows the model below).
However, in these two settings, I get completely different AUC on the test set. What happens if zero-padded steps are fed before versus after the true data in an LSTM network with sample weights? Is it due to the initialization of the hidden state in the LSTM?
How would I know which is correct? Thank you.
My model is as below:
model = Sequential()
model.add(TimeDistributed(Dense(64, activation='sigmoid'), input_shape=(50, 128)))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(8, activation='sigmoid')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', sample_weight_mode='temporal', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=2, sample_weight=Sample_weight_train)
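For reference, a minimal sketch of the two padding schemes described above, together with the matching sample_weight masks; the sequences list and its lengths are placeholders, and pad_sequences is only one way to build them:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# sequences: list of arrays, each of shape (true_len_i, 128)
max_len = 50

# Scheme 1: true data first, zeros appended at the end ("post" padding).
X_post = pad_sequences(sequences, maxlen=max_len, dtype='float32', padding='post')

# Scheme 2: zeros first, true data at the end ("pre" padding, the Keras default).
X_pre = pad_sequences(sequences, maxlen=max_len, dtype='float32', padding='pre')

# Matching per-time-step weights: 1 on real steps, 0 on padded steps.
lengths = np.array([len(s) for s in sequences])
w_post = (np.arange(max_len)[None, :] < lengths[:, None]).astype('float32')
w_pre = (np.arange(max_len)[None, :] >= (max_len - lengths)[:, None]).astype('float32')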

Training Keras autoencoder without bottleneck does not return original data

I'm trying to make an autoencoder using Keras with a TensorFlow backend. In particular, I have data consisting of vectors of n_components (i.e. 200) sampled n_times (i.e. 20000). It is key that when I train on time t, I compare it only to time t. It appears that the sampling times are being shuffled. I removed the bottleneck and found that the network does a pretty bad job of predicting the n_components, instead representing something more like the mean of the input scaled by each component.
Here is my network with the bottleneck commented out:
model = keras.models.Sequential()
# Make a 7-layer autoencoder network
model.add(keras.layers.Dense(n_components, activation='relu', input_shape=(n_components,)))
model.add(keras.layers.Dense(n_components, activation='relu'))
# model.add(keras.layers.Dense(50, activation='relu'))
# model.add(keras.layers.Dense(3, activation='relu'))
# model.add(keras.layers.Dense(50, activation='relu'))
model.add(keras.layers.Dense(n_components, activation='relu'))
model.add(keras.layers.Dense(n_components, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
# act is a numpy matrix of size (n_components, n_times)
model.fit(act.T, act.T, epochs=15, batch_size=100, shuffle=False)
newact = model.predict(act.T).T
I have tested shuffling the second dimension of act (n_times) and passing it as model.fit(act.T, act_shuffled.T), and I see no difference from model.fit(act.T, act.T). Am I doing something wrong? How can I force it to learn from the specific time?
Many thanks,
Arthur
I believe that I have solved the problem, but more knowledgeable users of Keras might be able to correct me. I had tried many different values for the batch_size argument of fit, but I didn't try a value of 1. When I changed it to 1, it did a good job of reproducing the input data.
I believe that the batch size, even with shuffle set to False, allows the autoencoder to train one input time against an unrelated input time.
So, I have amended my code to:
model.fit(act.T, act.T, epochs=15, batch_size=1, shuffle=False)
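One way to sanity-check whether each time sample is reconstructed on its own (a quick sketch reusing the variables from the question):
import numpy as np

recon = model.predict(act.T)                       # shape (n_times, n_components)
per_sample_mse = np.mean((recon - act.T) ** 2, axis=1)

print("mean reconstruction MSE:", per_sample_mse.mean())
print("worst time sample:", per_sample_mse.argmax(), per_sample_mse.max())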
