How do I select train data for LSTM network training - python

I'm basically new to RNNs, but I'm trying to predict signals based on recordings.
I have two sets of data A and B - A is the raw data recording, and B is the binary labeled data marking '1' for every active event on A, both with shape (1895700,1)
Could you help me figure out what should be used as x_train and y_train?
I have been reading about this and understood that I should loop through A and extract x and y from it. I did this and got input shapes of x_train - (189555, 150, 1) and y_train - (189555, 150, 1), but I am getting an accuracy of 0.0000e+00 and a negative loss.
My other approach was using A as x_train and B as y_train with input shapes of (12638,150,1), but from the first step of epoch 1 I had an accuracy of 96 and a loss of around .10. They didn't vary much throughout training.
So I'm not really sure what data should be my input.
model:
model = Sequential()
# input_shape is only needed on the first layer
model.add(LSTM(128, dropout=0.5, input_shape=(ts, features), recurrent_dropout=0.4, return_sequences=True))
model.add(LSTM(128, dropout=0.5, recurrent_dropout=0.3, return_sequences=True))
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.3, return_sequences=True))
model.add(Dense(features, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Thanks in advance!

X_train is the data that represents your features, while Y_train is the data that represents the expected output for those features.
You can split your data by simply passing the validation_split parameter to the fit function:
model.fit(X_data, Y_data, batch_size=4, epochs=5, verbose=1, validation_split=0.2)
In this case it will hold out 20% of the data for validation.
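For the per-timestep labels described in the question, one way to build the inputs is to cut A and B into matching windows, which is what the (12638, 150, 1) shape of the second approach suggests. The following is only a sketch on small random stand-in data; `make_windows` is a hypothetical helper, and non-overlapping windows of 150 steps are an assumption.

```python
import numpy as np

def make_windows(signal, labels, timesteps):
    """Cut a signal and its per-sample labels into matching, non-overlapping windows."""
    signal = np.asarray(signal).reshape(-1)
    labels = np.asarray(labels).reshape(-1)
    n_windows = len(signal) // timesteps      # drop any incomplete trailing window
    usable = n_windows * timesteps
    x = signal[:usable].reshape(n_windows, timesteps, 1)
    y = labels[:usable].reshape(n_windows, timesteps, 1)
    return x, y

# Stand-in data: A is the raw recording, B the binary event labels (mostly zeros).
A = np.random.randn(1500, 1)
B = (np.random.rand(1500, 1) > 0.96).astype("float32")

x_train, y_train = make_windows(A, B, timesteps=150)
print(x_train.shape, y_train.shape)  # (10, 150, 1) (10, 150, 1)
```

Note that validation_split takes the validation samples from the end of the arrays, before any shuffling. Also, if most labels in B are 0, a model that always predicts 0 already reaches a high accuracy, which may be why the accuracy in the second approach starts high and barely moves.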

Related

How to pick data for LSTM model in Keras

I'm trying to create a model to predict the stock price from a sequence of data for a market pair. I'm just starting out with LSTM models and I'm not sure about the importance of the order of the data.
Consider this frame of my data:
df[55:65]
The columns are the features of each candle, ordered by time, plus some technical trading indicators I added as a test, and finally the value I want to predict.
I tried to train my model:
model = Sequential()
model.add(LSTM(
units=50,
return_sequences=True,
input_shape=(shape_examples, shape_features)
))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.25))
model.add(LSTM(units=50))
model.add(Dropout(0.25))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
I am creating windows of length 60 to feed the model:
def get_dataset(train_data):
    data = []
    for i in range(60, len(train_data)):
        # rows i-60 .. i-1, all columns except the last (the target)
        data.append(train_data[i-60:i, :-1])
    return np.array(data)
And I am splitting the data in two ways:
# first way
training_data_len = math.ceil(len(Y) * .75)
x_train = X[:training_data_len]
y_train = Y[:training_data_len]
x_test = X[training_data_len:-1]
y_test = Y[training_data_len:-1]
# second way
X_train_, X_test_, Y_train_, Y_test_ = train_test_split(
X, Y, test_size=0.33, random_state=42)
I fitted my model and I got these results:
First and second splitting respectively
The questions are:
Is the order of the data important in this example?
Why did I get such different results?
Is the second approach underfitting? Why was the test data easier to predict?
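One thing worth checking when comparing the two splits is how much consecutive windows overlap. The sketch below uses a toy series; `make_windows` is a hypothetical helper, and taking the target as the value right after each window is an assumption, since the question does not show how Y is built.

```python
import numpy as np

def make_windows(values, targets, window=60):
    """Sliding windows over `values`; the target is the value right after each window."""
    X = np.array([values[i - window:i] for i in range(window, len(values))])
    y = np.asarray(targets)[window:]
    return X, y

values = np.arange(100, dtype=float).reshape(-1, 1)   # toy feature column
targets = np.arange(100, dtype=float)                 # toy target column

X, y = make_windows(values, targets, window=60)

# Chronological split (the "first way"): the test windows come after the training windows.
split = int(len(X) * 0.75)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Consecutive windows share all but one row.
shared = np.intersect1d(X[0].ravel(), X[1].ravel()).size
print(X.shape, shared)  # (40, 60, 1) 59
```

Because neighbouring windows are nearly identical, a random split such as train_test_split can place almost the same window in both the training and the test set, so a test score obtained that way may look better than it would on genuinely unseen future data.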

Keras LSTM model

I cannot find a hands on tutorial on how to structure the data for use with keras LSTM.
Data
x_train = 7300 rows where each vector is length 64.
y_train = array of 7300 items either 0's or 1's (the class).
Model
model = Sequential()
model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2, input_shape = (1, 64)))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train,
epochs = 5,
batch_size = 32,
validation_split = 0.1,
callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
My question is simply, why doesn't this work? Why isn't it as simple as giving fit a 2D array of vectors and a y array of matching length?
Keras LSTM expects input of shape [batch_size, timesteps, features]. Your data is of shape [batch_size, features].
To add the timestep dimension (where number of timesteps is 1), do the following:
x_train = np.expand_dims(x_train, axis=1)
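A minimal sketch of that reshaping on random stand-in data with the shapes from the question (7300 rows of 64 features). The one-hot step is an addition made here, not part of the answer: Dense(2, activation='softmax') with categorical_crossentropy expects targets of shape (n, 2), whereas the question's y_train is a flat array of 0s and 1s.

```python
import numpy as np

# Stand-in data with the shapes described in the question.
x_train = np.random.rand(7300, 64)
y_train = np.random.randint(0, 2, size=7300)

# Add a timestep axis: (batch, features) -> (batch, timesteps=1, features).
x_train_3d = np.expand_dims(x_train, axis=1)
print(x_train_3d.shape)  # (7300, 1, 64)

# One-hot encode the integer labels for a 2-unit softmax output.
y_onehot = np.eye(2)[y_train]
print(y_onehot.shape)  # (7300, 2)
```

An alternative to one-hot encoding would be to keep the integer labels and compile with loss='sparse_categorical_crossentropy'.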

Is this the correct way to predict the next value in Keras?

Here is my code:
...
look_back = 20
train_size = int(len(data) * 0.80)
test_size = len(data) - train_size
train = data[0:train_size]
test = data[train_size:len(data)]
x_train, y_train = create_dataset(train, look_back)
x_test, y_test = create_dataset(test, look_back)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
y_train=np.repeat(y_train.reshape(-1,1), 20, axis=1).reshape(-1,20,1)
y_test=np.repeat(y_test.reshape(-1,1), 20, axis=1).reshape(-1,20,1)
...
model = Sequential()
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(1, return_sequences=True))
model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
model.summary()
model.fit(x_train, y_train, epochs=10, batch_size=64)
p = model.predict(x_test)
I want to predict the next value, so I run
predictions = model.predict(x_train), whose shape is (62796, 20, 1),
and then I wrote the following, based on the question "How to use the Keras model to forecast for future dates or events?":
future = []
currentStep = predictions[-20:, :, :] # -20 is last look_back number
for i in range(10):
    currentStep = model.predict(currentStep)
    future.append(currentStep)
With this code, the result in future is
but the result of p = model.predict(x_test)[:4000] is
The difference between the two results is very large.
Is this the right way to predict the next value?
I don't know where it went wrong or whether the code is wrong.
I would appreciate your opinion.
full source is https://gist.github.com/Lay4U/654f70bd1fb9c4f7d5bdb21ddcb588ab
According to your code, you are trying to predict the next value using an LSTM.
So you have to reshape your input data correctly to reflect the time steps and features.
Instead of this line:
model.add(LSTM(512, return_sequences=True))
you have to write:
model.add(LSTM(512, input_shape=(look_back,x)))
where x is the number of input features in your training data.
I guess this article will help to moderate your code and predict the future value:
enter link description here
This article will help you to understand more about how to predict future value:
enter link description here
Thank you
There are multiple methods you can try; there is no single right way at the moment. You can train a separate model for each of t+1, t+2 ... t+n, so that one LSTM model predicts t+1 while another predicts t+n. That is called the direct strategy; DIRMO is a variant that combines it with multi-output models.
Your strategy (the recursive strategy) is particularly risky because each prediction is fed back in as input, so errors can propagate across the forecast horizon.
You can find a good comparison of alternative strategies in this paper.
https://www.sciencedirect.com/science/article/pii/S0957417412000528?via%3Dihub
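The recursive strategy can be sketched as a loop that keeps a single window of the most recent values, predicts one step, and slides the window forward. This is only an illustration: `recursive_forecast` and `toy_predict` are hypothetical names, and `toy_predict` stands in for a trained model's predict call. Note that `predictions[-20:, :, :]` in the question selects the last 20 samples, not one window of 20 time steps.

```python
import numpy as np

def recursive_forecast(predict_next, last_window, n_steps):
    """Forecast n_steps ahead by feeding each prediction back in as input."""
    window = np.asarray(last_window, dtype=float).copy()
    future = []
    for _ in range(n_steps):
        batch = window.reshape(1, -1, 1)          # (batch, timesteps, features)
        # Take the last element in case the model returns a whole sequence.
        next_value = float(np.ravel(predict_next(batch))[-1])
        future.append(next_value)
        # Drop the oldest value and append the new prediction.
        window = np.append(window[1:], next_value)
    return np.array(future)

def toy_predict(batch):
    """Stand-in for model.predict: returns the last value of the window plus one."""
    return batch[0, -1, 0] + 1.0

forecast = recursive_forecast(toy_predict, np.arange(20), n_steps=3)
print(forecast)  # [20. 21. 22.]
```

With a real model, any error in one predicted value becomes part of the next input window, which is the error propagation the answer warns about.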

Keras LSTM Autoencoder time-series reconstruction

I am trying to reconstruct time series data with an LSTM Autoencoder (Keras).
For now I want to train the autoencoder on a small number of samples (5 samples, each 500 time-steps long with 1 dimension). I want to make sure that the model can reconstruct those 5 samples, and after that I will use all the data (6000 samples).
window_size = 500
features = 1
data = data.reshape(5, window_size, features)
model = Sequential()
model.add(LSTM(256, input_shape=(window_size, features),
return_sequences=True))
model.add(LSTM(128, input_shape=(window_size, features),
return_sequences=False))
model.add(RepeatVector(window_size))
model.add(LSTM(128, input_shape=(window_size, features),
return_sequences=True))
model.add(LSTM(256, input_shape=(window_size, features),
return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
model.fit(data, data, epochs=100, verbose=1)
Model
Training:
Epoch 1/100
5/5 [==============================] - 2s 384ms/step - loss: 0.1603
...
Epoch 100/100
5/5 [==============================] - 2s 388ms/step - loss: 0.0018
After training, I tried to reconstruct one of the 5 samples:
yhat = model.predict(np.expand_dims(data[1,:,:], axis=0), verbose=0)
Reconstruction: Blue
Input: Orange
Why is the reconstruction so bad when the loss is small? How can I make the model better? Thanks.
Update:
The answer below is based on an old version. According to the current LSTM documentation, the input should be shaped as [batch, timesteps, feature]!
See this: https://github.com/keras-team/keras/blob/b80dd12da9c0bc3f569eca3455e77762cf2ee8ef/keras/layers/rnn/lstm.py#L481
Old Answer:
It seems to me that a time series should be given to the LSTMs in this format:
(samples, features , window_size)
So if you change the format (here I simply swapped the two variables), you get these results:
[![enter image description here][1]][1]
Code for reproducing the result (I didn't change the names of the variables, so please don't be confused :)):
import numpy as np
import keras
from keras import Sequential
from keras.layers import Dense, RepeatVector, TimeDistributed
from keras.layers import LSTM
N = 10000
data = np.random.uniform(-0.1, 0.1, size=(N, 500))
data = data.cumsum(axis=1)
print(data.shape)
window_size = 1
features = 500
data = data.reshape(N, window_size, features)
model = Sequential()
model.add(LSTM(32, input_shape=
(window_size,features),
return_sequences=True))
model.add(LSTM(16, input_shape=(window_size,
features),
return_sequences=False))
model.add(RepeatVector(window_size))
model.add(LSTM(16, input_shape=(window_size,
features),
return_sequences=True))
model.add(LSTM(32, input_shape=(window_size,
features),
return_sequences=True))
model.add(TimeDistributed(Dense(500)))
model.compile(optimizer='adam', loss='mse')
model.fit(data, data, epochs=100, verbose=1)
yhat = model.predict(np.expand_dims(data[1,:,:], axis=0), verbose=0)
import matplotlib.pyplot as plt
plt.plot(np.arange(500), yhat[0,0,:])
plt.plot(np.arange(500), data[1,0,:])
plt.show()
Credit to sobe86: I used the proposed data by him/her.
[1]: https://i.stack.imgur.com/5JUDN.png
I tried running your code on the following data
data = np.random.uniform(-0.1, 0.1, size=(5, 500))
data = data.cumsum(axis=1)
so the data is just the cumulative sum of some random uniform noise. I ran for 1000 epochs, and my results are not as bad as yours. The LSTM seems to make some effort to follow the line, though it seems to just be hovering around the running mean (as one might expect).
Note that this is running the model on the TRAINING data (which you seem to imply you were doing in your question). If we look at performance on data that the model was not trained on, we can get bad results.
This is not surprising in the least: with such a small training set, we should fully expect the model to overfit and not generalise to new data.
One thing I learned from my experience trying to fit autoencoders is that they are not easy to fit. But I would check these elements:
LSTMs don't do well with non-stationary data. Instead of learning the variability in the data, the model tries to learn the trend. So de-trending is a good step to apply to your data beforehand. One easy way to do that is to take the difference between each value and the previous one, so that at each timestep you have x[i]-x[i-1] instead of x[i]. You can experiment with different orders of de-trending based on your data and its trend/seasonality. For example, if you expect the data to have weekly seasonality, another lag to check would be 7 days (if each timestep is a day), and your data would be x[i]-x[i-7].
Experiment with the architecture of the autoencoder. Depending on the sequence length, 32 hidden units might not be enough to encode the data properly and keep enough information.
Use Bidirectional layers. Sometimes I use Conv1D as well.
The architecture doesn't need to be symmetrical, so be creative.
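The differencing step from the first point can be sketched with NumPy as follows. `difference` and `undo_difference` are hypothetical helper names, and the small array is made-up example data.

```python
import numpy as np

def difference(series, lag=1):
    """Return x[i] - x[i - lag] for every i >= lag."""
    series = np.asarray(series, dtype=float)
    return series[lag:] - series[:-lag]

def undo_difference(diffed, first_values):
    """Invert `difference`; the lag equals len(first_values)."""
    out = list(first_values)
    for i, d in enumerate(diffed):
        out.append(out[i] + d)   # x[i + lag] = x[i] + diff[i]
    return np.array(out)

x = np.array([1.0, 3.0, 6.0, 10.0, 15.0])

print(difference(x))          # [2. 3. 4. 5.]
print(difference(x, lag=2))   # [5. 7. 9.]
print(undo_difference(difference(x, lag=2), x[:2]))  # [ 1.  3.  6. 10. 15.]
```

The model would then be trained on the differenced series, and its outputs mapped back to the original scale with the inverse step. For weekly seasonality on daily data the same helper would be called with lag=7.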

Different model performance using LSTM keras sample_weight, when padding before or after the actual data

I am building a model in Keras with input data X of shape (N_sample, 50, 128). Each sample has up to 50 time-steps, and at each time step I have 128 features. Because not all samples have 50 time-steps, I used zero-padding to generate the input X.
There are two ways of padding zeros:
For each sample, I put the true data, say (20,128), at the beginning and pad the remaining (30,128) with zeros.
I pad the first 30 rows with zeros and put the data in the last 20 rows.
I then use sample_weight to assign a zero weight to the padded time steps.
However, in these two settings I get completely different AUC on the test set. What happens if the zero-padded steps are fed before or after the true data in an LSTM network with sample_weights? Is it due to the initialization of the hidden state in the LSTM?
How would I know which is correct? Thank you.
My model is as below:
model = Sequential()
model.add(TimeDistributed(Dense(64, activation='sigmoid'), input_shape=(50, 128)))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(8, activation='sigmoid')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='rmsprop',sample_weight_mode='temporal', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=2, sample_weight=Sample_weight_train)
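The two padding layouts described above, together with their per-timestep weights, can be sketched as follows. `pad_sample` is a hypothetical helper, and the sample of 20 real time steps is stand-in data; the sketch assumes each sample has at most max_len steps.

```python
import numpy as np

def pad_sample(sample, max_len, where="post"):
    """Zero-pad a (steps, features) sample to max_len and build matching timestep weights."""
    sample = np.asarray(sample, dtype=float)
    n_steps, n_features = sample.shape
    padded = np.zeros((max_len, n_features))
    weights = np.zeros(max_len)
    if where == "post":            # real data first, zeros after
        padded[:n_steps] = sample
        weights[:n_steps] = 1.0
    elif where == "pre":           # zeros first, real data last
        padded[max_len - n_steps:] = sample
        weights[max_len - n_steps:] = 1.0
    else:
        raise ValueError("where must be 'pre' or 'post'")
    return padded, weights

sample = np.ones((20, 128))        # stand-in sample with 20 real time steps
post_x, post_w = pad_sample(sample, 50, where="post")
pre_x, pre_w = pad_sample(sample, 50, where="pre")
print(post_x.shape, int(post_w.sum()), int(pre_w.sum()))  # (50, 128) 20 20
```

Stacking the weight vectors of all samples gives an array of shape (N_sample, 50), which is the shape sample_weight needs when sample_weight_mode='temporal' is used. The weights only affect the loss; the LSTM still processes the padded steps, so with post-padding its state after the real data is updated by 30 zero inputs, while with pre-padding the zeros come before the real data.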
