LSTM predicting constant value throughout

LSTM predicting constant value throughout - python

I understand that it is a long post, but help in any of the sections is appreciated.
I have some queries about the prediction method of my LSTM model. Here is a general summary of my approach:
I used a dataset having 50 time series for training. They start with a value of 1.09 all the way up to 0.82, with each time series having between 570 to 2000 datapoints (i.e, each time series has a different length, but similar trend).
I converted them to the dataset accepted by keras' LSTM/Bi-LSTM layers in the format:
[1, 0.99, 0.98, 0.97] ==Output==> [0.96]
[0.99, 0.98, 0.97, 0.96] ==Output==> [0.95]
and so on..
Shapes of the input and output containers (arrays): input(39832, 5, 1) and output(39832, )
Error-free training
Prediction on an initial points of data (window) having shape (1, 5, 1). This has been taken from the actual data.
The predicted output is one value, which is appended to a separate list (for plotting), as well as appended to the window, and the first value of the window dropped out. This window is then fed as input to the model to generate the next prediction point.
Continue this until I get the whole curve for both models (LSTM and Bi-LSTM)
However, the prediction is not even close to the actual data. It flatlines to a fixed value, whereas it should be somewhat like the black curve (which is the actual data)
Figure:https://i.stack.imgur.com/Ofw7m.png
Model (similar code goes for Bi-LSTM model):
model_lstm = Sequential()
model_lstm.add(LSTM(128, input_shape=(timesteps, 1), return_sequences= True))
model_lstm.add(Dropout(0.2))
model_lstm.add(LSTM(128, return_sequences= False))
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(1))
model_lstm.compile(loss = 'mean_squared_error', optimizer = optimizers.Adam(0.001))
Curve prediction initialize:
start = cell_to_test[0:timesteps].reshape(1, timesteps, 1)
y_curve_lstm = list(start.flatten())
y_window = start
Curve prediction:
while len(y_curve_lstm) <= len(cell_to_test):
yhat = model_lstm.predict(y_window)
yhat = float(yhat)
y_curve_lstm.append(yhat)
y_window = list(y_window.flatten())
y_window.append(yhat)
y_window.remove(y_window[0])
y_window = np.array(y_window).reshape(1, timesteps, 1)
#print(yhat)
Model summary:
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_5 (LSTM) (None, 5, 128) 66560
_________________________________________________________________
dropout_5 (Dropout) (None, 5, 128) 0
_________________________________________________________________
lstm_6 (LSTM) (None, 128) 131584
_________________________________________________________________
dropout_6 (Dropout) (None, 128) 0
_________________________________________________________________
dense_5 (Dense) (None, 1) 129
=================================================================
Total params: 198,273
Trainable params: 198,273
Non-trainable params: 0
_________________________________________________________________
And in addition to diagnosing the problem, I am really trying to find the answers to the following questions (I looked up other sources, but in vain):
Is my data enough to train the LSTM model? I have been told that it requires thousands of data points, so I feel that my current dataset more than suffices the condition.
Is my model less/more complex than it needs to be?
Does increasing the number of epochs, layers, and the neurons per layer always lead to a 'better' model, or are there optimal values for the same? If the latter, then is there a method to find this optimal point, or is hit-and-trail the only way?
I trained with the number of epochs=25, which gave me a loss of 1.25 * 10e-4. Should the loss be lower for the model to predict the trend? (I am focused on getting the shape first, accuracy later, because the training takes too long with higher epochs)
In continuation to the previous question, does loss have the same unit as the data? The reason why I am asking this is because the data has a resolution of up to 10e-7.
Once again, I understand that it has been a long post, but help in any of the sections is appreciated.

Related

Keras LSTM - Many to many with embedding layer

I am training a LSTM to predict event occurences. For each day I have a vector like [1,0,1] to denote that the first and third event occured, whereas the second one did not.
I want to extend this problem to work for multiple people, where each person has a distinct agent_id. This means that somehow I need to present my model with the agent_id as a feature. Although I'm not sure if this is the best way, I made the first entry of my vector the agent_id, so it looks like for example [123456, 1, 0 ,1].
Now what the LSTM model does is for each event output a probability of it occuring on the next day. So how I see the input/output would be: [agent_id, did event 1 occur today?, did event 2 occur today? did event 3 occur today?] -> LSTM -> [probability of event 1 occuring tomorrow, probability of event 2 occuring tomorrow, probability of event 3 occuring tomorrow]
Now the input has a longer length than the output. As far as I understood, from the answer to this post https://stats.stackexchange.com/questions/305863/how-to-train-lstm-model-on-multiple-time-series-data , I need to have an embedding layer that can change the size of my input so that the LSTM gives me the desired output.
For this, I tried to do the following:
from keras.models import Sequential
from keras.layers import *
xin = Input(batch_shape=(batch_size, window_length), dtype='int32')
xemb = Embedding(x_traindict[123456].shape[2], x_traindict[123456].shape[2]-1)(xin) #from what I give in to what I want to get out # 3dim (batch,time,feat)
seq = LSTM(x_traindict[123456].shape[2]-1, return_sequences=True)(xemb)
mlp = TimeDistributed(Dense(y_traindict[123456].shape[1], activation='softmax'))(seq)
model = tf.keras.Model(inputs=xin, outputs=mlp)
model.compile(optimizer='Adam', loss='categorical_crossentropy')
print(f"batch size is {batch_size}, window_length = {window_length}, x_train.shape is {x_traindict[123456].shape} and y_train.shape is {y_traindictalt[123456].shape}")
model.summary()
model.fit(x_traindict[123456], y_traindict[123456], epochs=20)
------------------------------------------------------------------------------------------------
batch size is 358, window_length = 7, x_train.shape is (358, 7, 149) and y_train.shape is (358, 148)
Model: "model_7"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_12 (InputLayer) [(358, 7)] 0
embedding_14 (Embedding) (358, 7, 148) 22052
lstm_16 (LSTM) (358, 7, 148) 175824
time_distributed_11 (TimeDi (358, 7, 149) 22201
stributed)
=================================================================
Total params: 220,077
Trainable params: 220,077
Non-trainable params: 0
_________________________________________________________________
My idea was that the Embedding would take the input from x_train, including the agent_id , and would learn to encode it to an input of the size of y_train, which does not include the agent_id. The LSTM will then learn to deal with what it receives from the embedding to correctly predict y_train. However, the code above gives me the following error:
ValueError: Exception encountered when calling layer "model_7" (type Functional).
Input 0 of layer "lstm_16" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 7, 149, 148)
I do not understand this error.
What I would thus like to ask is:
Does my idea even make sense? By implementing the agent_id directly with the events, can the LSTM learn the time series prediction for different agents?
How can I fix the error in my code? If it helps, I basically filled in the template from the first answer to this post: https://github.com/keras-team/keras/issues/2654
EDIT:
I have tried changing xin to xin = Input(batch_shape=(window_length,), dtype='int32') but now get a
Value error in the line where I say seq = ...: Input 0 of layer "lstm_26" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (7, 133).
I also tried changing xin to xin = Input(batch_shape=(window_length,number_of_transactions+1), dtype='int32') but this produces ValueError: Input 0 of layer "model_11" is incompatible with the layer: expected shape=(None, 134), found shape=(None, 7, 134)
Note: I had to take a new sample today. The 134 replaces the 149 events from above

Yes, I think the idea is valid. By placing the agent-id as the first element in the sequence the RNN will encode this information in the state which is subsequently used to predict the probabilities for an event. One thing to watch out for is that the model will try to generate a prediction given just the first sequence element - the agent-id.
I think your issue is that you included the batch size in the input call, but batch size is implied and doesn't need to be defined, so
xin = Input(batch_shape=(batch_size, window_length), dtype='int32')
should become xin = Input(batch_shape=(window_length), dtype='int32')

ML wrong prediction on Japan Crossword puzzle

I’m trying to study machine learning in hands-on way. I found exercise for myself to create neural network that solves “Japan crosswords” for fixed size images (128*128).
Very simple example (4*4) demonstrates the conception: black & white picture encoded by top and left matrices. Number in matrix means continues length of black line. Easy to prove left and top matrix have dimension at max (N*(N/2)) and ((N/2)*N) correspondingly.
I have a python generator that creates random b&w images and 2 reduced matrices. Top and left matrices are fed as input (left is transposed to match top) and b&w as an expected output. Input is treated as 3-dim (128 * 64 * 2) where 2 – is top and left correspondingly.
Following is my current topology that try to build function (128 * 64 * 2) -> (128, 128, 1)
Model: "model"
Layer (type) Output Shape Param #
interlaced_reduce (InputLaye [(None, 128, 64, 2)] 0
small_conv (Conv2D) (None, 128, 64, 32) 288
leaky_re_lu (LeakyReLU) (None, 128, 64, 32) 0
medium_conv (Conv2D) (None, 128, 64, 64) 8256
leaky_re_lu_1 (LeakyReLU) (None, 128, 64, 64) 0
large_conv (Conv2D) (None, 128, 64, 128) 32896
leaky_re_lu_2 (LeakyReLU) (None, 128, 64, 128) 0
up_sampling2d (UpSampling2D) (None, 128, 128, 128) 0
dropout (Dropout) (None, 128, 128, 128) 0
dense (Dense) (None, 128, 128, 1) 129
Total params: 41,569
Trainable params: 41,569
Non-trainable params: 0
After train on 50 images I got the following statistic (please note, I tried to normalize input matrices to [0,1] without any success, current statistic demonstrate non-normalized case) :
...
Epoch 50/50 2/2 [==============================] - 1s 687ms/step -loss: 18427.2871 - mae: 124.9277
Then prediction produces following:
You can see left – expected random image and right – result of prediction. In prediction I intentionally use grey-scaled image to understand how close my result to target. But as you can see – the prediction is far from expected and is close to source form of top/left reduce matrices.
So my questions:
1) What layers I’m missing?
2) What should be improved in existing topology?
p.s. this is cross post from Cross Validated Stackexchange, because nobody even viewed question that site

So it's hard to say what model would work best without training and testing the actual model, but from the results you've gotten so far here's a few options you could try.
Try adding a fully connected hidden layer
From the model you posted, it seems that you have a few convolution layers, followed by an up-sampling and dropout layer, and finally a single dense layer for your output nodes. Potentially, adding additional dense layers (for e.g. 128 or more or less nodes) before your final output layer might help. While the multiple convolution layers help the neural net to build up a sort of hierarchical understanding of the image, the hypothesis class might not be complex enough. Adding one or more dense layers might help with this.
Try using a multilayer perceptron
Convolution layers are often used to process images because they help build up a hierarchical understanding of the image that is somewhat scale/shift/rotation invariant. However, considering the problem that you're solving, a global understanding of the input might be more beneficial than identifying shift-invariant features.
As such, one possible option would be to remove the convolution layers and to use a multilayer perceptron (MLP).
Let us think of the input as two matrices of numbers, and the output is a matrix of 1s and 0s that correspond to 'black' and 'white'. You could then try a model with the following layers:
A Flatten layer that takes in your two reduced matrices as inputs and flattens
them
A hidden dense layer, maybe with something like 128 nodes and relu activation. You should experiment with the number of layers, nodes, and activation.
An output dense layer with 16384 (128x128) nodes. You could apply a softmax activation to this layer which could help the optimiser during the training process. Then, when creating your final image, set values < 0.5 to 0 and values >= 0.5 to 1, and reshape and reformat the matrix into a square image.
Of course, no guarantees that an MLP would work well, but if often does especially when given sufficient amounts of data (perhaps in the 1000s or more number of training examples).
Try using a deterministic algorithm
Looking at the structure of this problem, it seems that it could be solved more appropriately with a deterministic algorithm, which would fall under more the branch of traditional artificial intelligence rather than deep learning. This is also another potential route to explore.

The model you build is a conventional model (seen by the use of Conv2D). This layer are good in analyzing something given its neighbors. Making them very powerful for image classification or segmentation.
In your case the result of a pixels is depending on the whole line and column.
Neural networks seems to be unsuited for your problem, but if you want to continue look in to replacing conv layers with Conv(1xN) and Conv(Nx1). It will still be very hard to make it work.
The hard way: These puzzle exist out of a strong recurrent process. Each step the correct spots get filled int with a zero or one. Based on those the next get filled in. So a recurrent neural network would make most sense to me. Where the convolution is used to have the prediction of the neighbors influence its current prediction

LSTM having a systematic offset between predictions and ground truth

Currently i think i'm experiencing a systematic offset in a LSTM model, between the predictions and the ground truth values. What's the best approach to continue further from now on?
The model architecture, along with the predictions & ground truth values are shown below. This is a regression problem where the historical data of the target plus 5 other correlated features X are used to predict the target y. Currently the input sequence n_input is of length 256, where the output sequence n_out is one. Simplified, the previous 256 points are used to predict the next target value.
X is normalized. The mean squared error is used as the loss function. Adam with a cosine annealing learning rate is used as the optimizer (min_lr=1e-7, max_lr=6e-2).
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
cu_dnnlstm_8 (CuDNNLSTM) (None, 256) 270336
_________________________________________________________________
batch_normalization_11 (Batc (None, 256) 1024
_________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 256) 0
_________________________________________________________________
dropout_11 (Dropout) (None, 256) 0
_________________________________________________________________
dense_11 (Dense) (None, 1) 257
=================================================================
Total params: 271,617
Trainable params: 271,105
Non-trainable params: 512
_________________________________________________________________
Increasing the node size in the LSTM layer, adding more LSTM layers (with return_sequences=True) or adding dense layers after the LSTM layer(s) only seems to lower the accuracy. Any advice would be appreciated.
Additional information on the image. The y-axis is a value, x-axis is the time (in days). NaNs have been replaced with zero, because the ground truth value in this case can never reach zero. That's why the odd outliers are in the data.
Edit:
I made some changes to the model, which increased accuracy. The architecture is the same, however the features used have changed. Currently only the historical data of the target sequence itself is used as a feature. Along with this, n_input got changed so 128. Switched Adam for SGD, mean squared error with the mean absolute error and finally the NaNs have been interpolated instead of being replaced with 0.
One step ahead predictions on the validation set look fine:
However, the offset on the validation set remains:
It might be worth noting that this offset also appears on the train set for x < ~430:

It looks like your model is overfitting and is simply always returning the value from the last timestep as a prediction. Your dataset is probably too small to have a model with this amount of parameters converge. You'll need to resort to techniques that combat overfitting: agressive dropout, adding more data, or try simpler, less overparameterized methods.
This phenomenon (LSTMs returning a shifted version of the input) has been a recurring theme in many stackoverflow questions. The answers there might contain some useful information:
LSTM Sequence Prediction in Keras just outputs last step in the input
LSTM model just repeats the past in forecasting time series
LSTM NN produces “shifted” forecast (low quality result)
Keras network producing inverse predictions
Stock price predictions of keras multilayer LSTM model converge to a constant value
Keras LSTM predicted timeseries squashed and shifted
LSTM Time series shifted predictions on stock market close price
Interesting results from LSTM RNN : lagged results for train and validation data
Finally, be aware that, depending on the nature of your dataset, there simply might be no pattern to be discovered in your data at all. You see this a lot with people trying to predict the stock market with LSTMs (there is a question on stackoverflow on how to predict the lottery numbers).

The answer is much simpler than we thought...
I saw multiple people saying this is due to overfitting and datasize. Some other people stated this is due to rescaling.
After several try, I found the solution: Try to do detrending before feed the data to RNN.
For example, you can do a simple degree-2 polynomial fitting of the data which will give you a polynomial formula. And it is possible to reduce the each data value corresponding to the formula value. Then we got a new dataset and we can feed it to the LSTM, after prediction we can just add the trend back to the result and the results should look better.

Simple stateful nn Keras, can fit model but not predict

I'm trying to create a simple stateful neural network in keras to wrap my head around how to connect Embedding layers and LSTM's. I have a piece of text where I have mapped every character to a integer and would like to send in one character at a time to predict the next character. I have done this earlier where I have sent in 8 characters at a time and got that to work well (using return_sequences=True and TimeDistributed(Dense)). But this time I want to only send in 1 character at a time and this is where my problem arises.
The code I use to set up my model:
n_fac = 32
vocab_size = len(chars)
n_hidden = 256
batch_size=64
model = Sequential()
model.add(Embedding(vocab_size,n_fac,input_length=1,batch_input_shape=(batch_size,1)))
model.add(BatchNormalization())
model.add(LSTM(n_hidden,stateful=True))
model.add(Dense(vocab_size,activation='softmax'))
model.summary() gives me the following:
Layer (type) Output Shape Param # Connected to
embedding_1 (Embedding) (64, 1, 32) 992 embedding_input_1[0][0]
batchnormalization_1 (BatchNorma (64, 1, 32) 128 embedding_1[0][0]
lstm_1 (LSTM) (64, 256) 295936 batchnormalization_1[0][0]
dense_1 (Dense) (64, 31) 7967 lstm_1[0][0]
Total params: 305,023
Trainable params: 304,959
Non-trainable params: 64
The code I use to set up my training data:
text = ... #Omitted for simplicity. Just setting text to some kind of literature work
text = text.lower() #Simple model, therefor only using lower case characters
idx2char = list(set(list(text)))
char2idx = {char:idx for idx,char in enumerate(idx2char)}
text_in_idx = [char2idx[char] for char in text]
x = text_idx[:-1]
y = text_idx[1:]
Compiling and training my network:
model.compile(optimizer=Adam(lr=1e-4),loss='sparse_categorical_crossentropy')
nb_epoch = 10
for i in range(nb_epoch):
model.reset_states()
model.fit(x,y,nb_epoch=1,batch_size=batch_size,shuffle=False)
Training works as it should, the loss is reduced with each epoch.
Now I want to try out my trained network but have no idea how to give it a character to predict the next. I start out by resetting its states and then want to start feeding it one char at a time.
I tried a couple of different inputs but all of them failed. These are not qualified guesses.
#The model uses integers for characters, therefor integers are sent as input
model.predict([1]) #Type error
model.predict(np.array([1])) #Value error
model.predict(np.array([1])[np.newaxis,:]) #Value error
model.predict(np.array([1])[:,np.newaxis]) #Value error
Am I forced to send in something of length batch_size or how am I supposed to send in data for the model to predict something?
The error text for Value error is very long and obscure so I omitted it. I can supply it if needed.
Using theano backend with keras.

How to train a network in Keras for varying output size

I have basic neural network created with Keras. I train the network successfully with vectors of data and corresponding output data that is a vector with two elements. It represents a coordinate (x, y). So in goes an array, out comes an array.
Problem is that I am unable to use training data where a single input vector should correspond to many coordinates. Effectively, I desire a vector of coordinates as output, without prior knowledge of the number of coordinates.
Network is created by
model = Sequential()
model.add(Dense(20, input_shape=(196608,)))
model.add(Dense(10))
model.add(Dense(2))
and model summary shows the output dimensions for each layer
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 20) 3932180
_________________________________________________________________
dense_2 (Dense) (None, 10) 210
_________________________________________________________________
dense_3 (Dense) (None, 2) 22
=================================================================
I realize the network structure only allows a length 2 vector as output. Dense layers also do not accept None as their size. How do I modify the network so that it can train on and output a vector of vectors (list of coordinates)?

A recurrent neural networks (RNNs) would be much more appropriate, this models are typicall called seq2seq, that is, sequence to sequence. Recurrent nets use layers like LSTM and GRU, and can input and output variable length sequences. Just look at things like Machine Translation done with RNNs.
This can be done directly with keras, and there are many examples lying around the internet, for example this one.

An rnn is not what you want for predicting coordinates. Instead, I would recommend using a model that predicts coordinates and associated confidences. So you would have 100 coordinate predictions for every forward pass through your model. Each of those predictions would have another associated prediction that determines if it is correct or not. Only predictions that are above a certain confidence threshold would count. That confidence threshold is what allows the model to choose how many points it wants to use each time (with a maximum number set by the number of outputs which in this example is 100).
r-cnn is a model that does just that. Here is the first keras implementaion I found on github https://github.com/yhenon/keras-frcnn.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.