Keras network producing inverse predictions

Keras network producing inverse predictions - python

I have a timeseries dataset and I am trying to train a network so that it overfits (obviously, that's just the first step, I will then battle the overfitting).
The network has two layers:
LSTM (32 neurons) and Dense (1 neuron, no activation)
Training/model has these parameters:
epochs: 20, steps_per_epoch: 100, loss: "mse", optimizer: "rmsprop".
TimeseriesGenerator produces the input series with: length: 1, sampling_rate: 1, batch_size: 1.
I would expect the network would just memorize such a small dataset (I have tried even much more complicated network to no avail) and the loss on training dataset would be pretty much zero. It is not and when I visualize the results on the training set like this:
y_pred = model.predict_generator(gen)
plot_points = 40
epochs = range(1, plot_points + 1)
pred_points = numpy.resize(y_pred[:plot_points], (plot_points,))
target_points = gen.targets[:plot_points]
plt.plot(epochs, pred_points, 'b', label='Predictions')
plt.plot(epochs, target_points, 'r', label='Targets')
plt.legend()
plt.show()
I get:
The predictions have somewhat smaller amplitude but are precisely inverse to the targets. Btw. this is not memorized, they are inversed even for the test dataset which the algorithm hasn't trained on at all.It appears that instead of memorizing the dataset, my network just learned to negate the input value and slightly scale it down. Any idea why this is happening? It doesn't seem like the solution the optimizer should have converged to (loss is pretty big).
EDIT (some relevant parts of my code):
train_gen = keras.preprocessing.sequence.TimeseriesGenerator(
x,
y,
length=1,
sampling_rate=1,
batch_size=1,
shuffle=False
)
model = Sequential()
model.add(LSTM(32, input_shape=(1, 1), return_sequences=False))
model.add(Dense(1, input_shape=(1, 1)))
model.compile(
loss="mse",
optimizer="rmsprop",
metrics=[keras.metrics.mean_squared_error]
)
history = model.fit_generator(
train_gen,
epochs=20,
steps_per_epoch=100
)
EDIT (different, randomly generated dataset):
I had to increase number of LSTM neurons to 256, with the previous setting (32 neurons), the blue line was pretty much flat. However, with the increase the same pattern arises - inverse predictions with somewhat smaller amplitude.
EDIT (targets shifted by +1):
Shifting the targets by one compared to predictions doesn't produce much better fit. Notice the highlighted parts where the graph isn't just alternating, it's more apparent there.
EDIT (increased length to 2 ... TimeseriesGenerator(length=2, ...)):
With length=2 the predictions stop tracking the targets so closely but the overall pattern of inversion still stands.

You say that your network "just learned to negate the input value and slightly scale it down". I don't think so. It is very likely that all you are seeing is the network performing poorly, and just predicting the previous value (but scaled as you say). This issue is something I've seen again and again. Here is another example, and another, of this issue. Also, remember it is very easy to fool yourself by shifting the data by one. It is very likely you are simply shifting the poor prediction back in time and getting an overlap.

EDIT: After author's comments I do not believe this is the correct answer but I will keep it posted for posterity.
Great question and the answer is due to how the Time_generator works! Apparently instead of grabbing x,y pairs with the same index (e.g input x[0] to output target y[0]) it grabs target with offset 1 (so x[0] to y[1]).
Thus plotting y with offset 1 will produce the desired fit.
Code to simulate:
import keras
import matplotlib.pyplot as plt
x=np.random.uniform(0,10,size=41).reshape(-1,1)
x[::2]*=-1
y=x[1:]
x=x[:-1]
train_gen = keras.preprocessing.sequence.TimeseriesGenerator(
x,
y,
length=1,
sampling_rate=1,
batch_size=1,
shuffle=False
)
model = keras.models.Sequential()
model.add(keras.layers.LSTM(100, input_shape=(1, 1), return_sequences=False))
model.add(keras.layers.Dense(1))
model.compile(
loss="mse",
optimizer="rmsprop",
metrics=[keras.metrics.mean_squared_error]
)
model.optimizer.lr/=.1
history = model.fit_generator(
train_gen,
epochs=20,
steps_per_epoch=100
)
Proper plotting:
y_pred = model.predict_generator(train_gen)
plot_points = 39
epochs = range(1, plot_points + 1)
pred_points = np.resize(y_pred[:plot_points], (plot_points,))
target_points = train_gen.targets[1:plot_points+1] #NOTICE DIFFERENT INDEXING HERE
plt.plot(epochs, pred_points, 'b', label='Predictions')
plt.plot(epochs, target_points, 'r', label='Targets')
plt.legend()
plt.show()
Output, Notice how the fit is no longer inverted and is mostly very accurate:
This is how it looks when the offset is incorrect:

Related

Overfitting on LSTM text classification using Keras

I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles with the aim of classifying them based on a political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy would look very off, like this:
I tried to play around with different DropOut probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the max number of words and max sequence length.
I have managed to get the graphs to align a bit more, however, that has led to the model having less accuracy with the training data (and the problem of overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the model accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
X_train.view()
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros like this:
I also did a third attempt that was just a second attempt with the number of epochs increased, it looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs,
batch_size=batch_size,validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!

Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt shows largely overfitting, with early divergence of your test & train loss. I would try a lower learning rate here (in addition to the steps you took for regularisation with dropout layers). Using the default rate does not guarantee best results.
Allowing your model to find the global mimima / not being stuck in a local minima. On the second attempt, it looks better. However, if the x-axis shows the number of epochs -- it could be that your early stopping is too strict? ie. increase the threshold. Consider other optimisers, including SGD with a learning rate scheduler.
Too large network leads to overfitting on the trainset and difficulty in generalisation. Too many neurons may cause the network to 'memorize' all you trainset and overfit. I would try out 8, 16 or 24 neurons in your LSTM layer for example.
Data preprocessing & cleaning. Check your padding_sequences. It is probably padding the start of each text with zeros. I would pad post text.
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get to a sizable amount of text of training (empirically >=1M words). I would also try several techniques including feature engineering / improving data quality such as, spell checks. Are the classes imbalanced? You may need to balance them out by over/undersampling.
Consider using transfer learning and incorporate trained language models as your embeddings layer instead of training one from scratch. ie. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow

Searching for an LSTM architecture to be used for regression

The lack of good intuition on LSTMs and how they work paired with
an awkward dataset and regression problem leaves me with questions on how to approach and solve my scenario.
I don't want any in depth answers, I seek just for intuition and suggestions.
My dataset consists of:
X flights, each flight has Y timesteps where each timestep has Z features. Every flight is characterized by K which is a 2 value vector (K_1, K_2) and that's the target of regression, predicting these 2 variables.
I've tried several different regression methods and they happen to perform really well. Because I have time dependent trajectories, the other methods calculated stats across each trajectory for the Z features and transformed each whole trajectory to just [Z*l,] - [K_1, K_2] supervised data, where l is just a factor that implies that we have new calculated features. (for example one of these features could be the mean of a feature across all trajectory).
Problem:
I want to implement an LSTM regression pipeline that takes as input the raw dataset (X,Y,Z) and after a dense layer outputs 2 values, K_1 and K_2 (with 1 model or 2 seperate models for each value) and backpropagate correctly the loss of K_1_target, K_2_target.
I've tried it and it seems that it performs really poorly and I don't know if it's a technical mistake or a theoritical mistake.
Below I provide the architecture I use at the moment.
samples, timesteps, features = x_train.shape[0], x_train.shape[1], x_train.shape[2]
model1 = Sequential()
model1.add(Masking(mask_value=-10.0))
model1.add(LSTM(hidden_units, return_sequences = True))
model1.add(Flatten())
model1.add(Dense(hidden_units, activation = "relu"))
model1.add(Dense(1, activation = "linear"))
model1.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model1.fit(x_train, y_train[:,0], validation_data=(x_test, y_test[:,0]), epochs=epochs, batch_size=batch, shuffle=False)
model2 = Sequential()
model2.add(Masking(mask_value=-10.0))
model2.add(LSTM(hidden_units, return_sequences=True))
model2.add(Flatten())
model2.add(Dense(hidden_units, activation = "relu"))
model2.add(Dense(1, activation = "linear"))
model2.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model2.fit(x_train, y_train[:,1], validation_data=(x_test, y_test[:,1]), epochs=epochs, batch_size=batch, shuffle=False)
In my mind it makes sense but it seems that it is not working well..
I'm not entirely sure how the trainable params of LSTM are learning on backpropagation using a dense layer after and backwarding the loss of 2 values which are the same again and again for the same trajectory.
Any kind of clarification, correction or intuition will help me a lot!
Lastly, I'll provide some real details.
K_1 takes discrete values from 0 to 100
K_2 takes 1 floating point precision values from 0 to 1
x_input is a subset of (6991, 527, 6) using k-fold CV, using k = 10, so about (6292, 527, 6) and y_input is of shape (6292, 2). Accordingly for testing.
I,ve used pre padding for even length of trajectories and a masking layer that ignores rows with no data.
I've normalized all my features and target values indepedently with MinMax normalization, and inversed transformed model's output and y_test for correct loss calculation.
The best result I've got till now is a MAE loss is:
K_1 (whose range is 0 - 100) = ~6.0 (Even lasso performs better, while in a non linear problem)
K_2 (whose range is 0 - 1) = ~0.003 (Pretty good)

Regression with LSTM - python and Keras

I am trying to use a LSTM network in Keras to make predictions of timeseries data one step into the future. The data I have is of 5 dimensions, and I am trying to use the previous 3 periods of readings to predict the a future value in the next period. I have normalised the data and removed all NaN etc, and this is the code I am trying to use to train the network:
def Network_ii(IN, OUT, TIME_PERIOD, EPOCHS, BATCH_SIZE, LTSM_SHAPE):
length = len(OUT)
train_x = IN[:int(0.9 * length)]
validation_x = IN[int(0.9 * length):]
train_y = OUT[:int(0.9 * length)]
validation_y = OUT[int(0.9 * length):]
# Define Network & callback:
train_x = train_x.reshape(train_x.shape[0],3, 5)
validation_x = validation_x.reshape(validation_x.shape[0],3, 5)
model = Sequential()
model.add(LSTM(units=128, return_sequences= True, input_shape=(train_x.shape[1],3)))
model.add(LSTM(units=128))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
train_y = np.asarray(train_y)
validation_y = np.asarray(validation_y)
history = model.fit(train_x, train_y, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(validation_x, validation_y))
# Score model
score = model.evaluate(validation_x, validation_y, verbose=0)
print('Test loss:', score)
# Save model
model.save(f"models/new_model")
I am attempting to roughly follow the steps outlined here- https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
However, no matter what adjustments I have made in terms of changing the number of dimensions used to train the network or the length of the time period I cannot get the output of the model to give predictions that are not either 1 or 0. This is even though the target data, in the array 'OUT' is made up of data continuous on [0,1].
I think there may be something wrong with how I am setting up the .Sequential() function, but I cannot see what to adjust. I am relatively new to this so any help would be greatly appreciated.

You are probably using a prediction function that is not the standard. Maybe you are using predict_classes?
The one that is well documented and the standard is model.predict.

Creating a neural network in keras to multiply two input integers

I am playing around with Keras v2.0.8 in Python v2.7 (Tensorflow backend) to create small neural networks that calculate simple arithmetic functions (add, subtract, multiply, etc.), and am a bit confused. The below code is my network which generates a random training dataset of integers with the corresponding labels (the two inputs added together):
def create_data(low, high, examples):
train_data = []
label_data = []
a = np.random.randint(low=low, high=high, size=examples, dtype='int')
b = np.random.randint(low=low, high=high, size=examples, dtype='int')
for i in range(0, examples):
train_data.append([a[i], b[i]])
label_data.append((a[i] + b[i]))
train_data = np.array(train_data)
label_data = np.array(label_data)
return train_data, label_data
X, y = create_data(0, 500, 10000)
model = Sequential()
model.add(Dense(3, input_dim=2))
model.add(Dense(5, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='relu'))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=10)
test_data, _ = create_data(0, 500, 10)
results = model.predict(test_data, batch_size=2)
sq_error = []
for i in range(0, len(test_data)):
print 'test value:', test_data[i], 'result:', results[i][0], 'error:',\
'%.2f' %(results[i][0] - (test_data[i][0] + test_data[i][1]))
sq_error.append((results[i][0] - (test_data[i][0] + test_data[i][1])))
print '\n total rmse error: ', sqrt(np.sum(np.array(sq_error)))
This trains perfectly well and produces no unexpected results. However, when I create the training data by multiplying the two inputs together the model's loss for each epoch stays around 7,000,000,000 and the model does not converge at all. The data creation function for this is as follows:
def create_data(low, high, examples):
train_data = []
label_data = []
a = np.random.randint(low=low, high=high, size=examples, dtype='int')
b = np.random.randint(low=low, high=high, size=examples, dtype='int')
for i in range(0, examples):
train_data.append([a[i], b[i]])
label_data.append((a[i] * b[i]))
train_data = np.array(train_data)
label_data = np.array(label_data)
return train_data, label_data
I also had the same problem when I had training data of a single input integer and created the label by squaring the input data. However, it worked fine when I only multiplied the single input by a constant value or added/subtracted by a constant.
I have two questions:
1) Why is this the case? I assume it has something to do with the fundamentals of neural networks, but I can't work it out.
2) How could I adapt this code to train a model that multiplies two input numbers together.
The network architecture (2 - 3 - 5 - 3 - 5 - 1) is fairly random right now. I've tried lots of different ones varying in layers and neurons, this one just happened to be on my screen as I write this and got an accuracy of 100% for adding two inputs.

It is due to large gradient updates caused by large numbers in training data. When using a neural network, you should first ensure that the training data falls in a small range (usually [-1,1] or [0,1]) to help the optimization process and prevent disruptive gradient updates. Therefore, you should first normalize data. In this case, one good candidate would be log-normalization.
Further, the 'accuracy' as a metric in Keras is used in case of a classification problem. In a regression problem, using it does not make sense, and instead it's better to use a relevant metric like "mean absolute error" or 'mae'.

For Keras LSTM, what is the difference in passing in lag features vs timesteps of features?

I'm getting acquainted with LSTMs and I need clarity on something. I'm modeling a time series using t-300:t-1 to predict t:t+60. My first approach was to set up an LSTM like this:
# fake dataset to put words into code:
X = [[1,2...299,300],[2,3,...300,301],...]
y = [[301,302...359,360],[302,303...360,361],...]
# LSTM requires (num_samples, timesteps, num_features)
X = X.reshape(X.shape[0],1,X.shape[1])
model = Sequential()
model.add(LSTM(n_neurons[0], batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(y.shape[1]))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=1, batch_size=1, verbose=1, shuffle=False)
With my real dataset, the results have been suboptimal, and on CPU it was able to train 1 epoch of around 400,000 samples in 20 minutes. The network converged quickly after a single epoch, and for any set of points I fed it, the same results would come out.
My latest change has been to reshape X in the following way:
X = X.reshape(X.shape[0],X.shape[1],1)
Training seems to be going slower (I have not tried on the full dataset), but it is noticably slower. It takes about 5 minutes to train over a single epoch of 2,800 samples. I toyed around with a smaller subset of my real data and a smaller number of epochs and it seems to be promising. I am not getting the same output for different inputs.
Can anyone help me understand what is happening here?

In Keras, timesteps in (num_samples, timesteps, num_features) determine how many steps BPTT will propagate the error back.
This, in turn, takes more time to do hence the slow down that you are observing.
X.reshape(X.shape[0], X.shape[1], 1) is the right thing to do in your case, since what you have is a single feature, with 300 timesteps.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.