How to pick data for LSTM model in Keras - python

I'm trying to build a model that predicts the price in a sequence of data for a market pair. I'm just starting out with LSTM models and I'm not sure how important the order of the data is.
Consider this frame of my data:
df[55:65]
These are the features of each candle, ordered by time, plus some technical trading indicators I added as a test. The last column is the value I want to predict.
I tried to train this model:
model = Sequential()
model.add(LSTM(
    units=50,
    return_sequences=True,
    input_shape=(shape_examples, shape_features)
))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.25))
model.add(LSTM(units=50))
model.add(Dropout(0.25))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
I am creating windows of length 60 to feed the model:
def get_dataset(train_data):
    data = []
    for i in range(60, len(train_data)):
        data.append(train_data[i-60:i, :-1])
    return np.array(data)
And I am splitting the data in two ways:
# first way
training_data_len = math.ceil(len(Y) * .75)
x_train = X[:training_data_len]
y_train = Y[:training_data_len]
x_test = X[training_data_len:-1]
y_test = Y[training_data_len:-1]
# second way
X_train_, X_test_, Y_train_, Y_test_ = train_test_split(
    X, Y, test_size=0.33, random_state=42)
I fitted my model and I got these results:
First and second splitting respectively
The questions are:
Is the order in the data important in this example?
Why did I get such different results?
Is the second approach underfitting? Why was the test data easier to predict?
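One way to probe the first question is to measure how many test windows share rows with training windows under each split. The sketch below is only an illustration: the series is synthetic, and `make_windows` and `shares_rows` are hypothetical helpers, not part of the original code. With a stride of 1, two windows of length 60 overlap whenever their end positions are fewer than 60 rows apart.

```python
import numpy as np

def make_windows(series, window=60):
    """Return overlapping windows and the row index each window ends before."""
    idx = np.arange(window, len(series))
    X = np.stack([series[i - window:i] for i in idx])
    return X, idx

def shares_rows(train_idx, test_idx, window=60):
    """Fraction of test windows that share at least one row with a training window."""
    train_idx = np.sort(train_idx)
    pos = np.searchsorted(train_idx, test_idx)
    last = len(train_idx) - 1
    # Distance to the nearest training window on either side.
    left = np.abs(test_idx - train_idx[np.clip(pos - 1, 0, last)])
    right = np.abs(test_idx - train_idx[np.clip(pos, 0, last)])
    return float(np.mean(np.minimum(left, right) < window))

series = np.arange(1000, dtype=float)  # synthetic stand-in for the feature rows
X, idx = make_windows(series)

# First way: chronological split (first 75% for training).
cut = int(np.ceil(len(idx) * 0.75))
chrono = shares_rows(idx[:cut], idx[cut:])

# Second way: shuffled split, which is what train_test_split does by default.
rng = np.random.default_rng(42)
perm = rng.permutation(len(idx))
n_test = int(len(idx) * 0.33)
shuffled = shares_rows(idx[perm[n_test:]], idx[perm[:n_test]])

print(f"chronological: {chrono:.2f}, shuffled: {shuffled:.2f}")
```

If the shuffled figure is close to 1, nearly every test window has a near-duplicate in the training set, which would make the test set look easier than genuinely unseen future data. Even the chronological split overlaps at the boundary, so leaving a gap of one window length between the two parts may be worth considering.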

Related

How to train on unlabeled data in machine learning?

I have data for almost 9000 entities. I want to train a model and detect anomalies in the data.
I tried a few things to get this working, and one thing I did is:
def create_sequences(values, time_steps=TIME_STEPS):
    output = []
    for i in range(len(values) - time_steps):
        output.append(values[i : (i + time_steps)])
    return np.stack(output)
Here I start building my training data:
x_train = create_sequences(data['HR'].values)
x_train = np.expand_dims(x_train,axis=2)
x_train = create_sequences(data['PULSE'].values)
x_train = np.expand_dims(x_train,axis=2)
x_train = create_sequences(data['SpO2'].values)
x_train = np.expand_dims(x_train,axis=2)
x_train = create_sequences(data['ABPDias'].values)
x_train = np.expand_dims(x_train,axis=2)
x_train = create_sequences(data['ABPMean'].values)
x_train = np.expand_dims(x_train,axis=2)
x_train = create_sequences(data['RESP'].values)
x_train = np.expand_dims(x_train,axis=2)
And here is my model for training:
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=7, padding="same", strides=2, input_shape=(x_train.shape[1],x_train.shape[2])))
model.add(MaxPooling1D(pool_size=1,padding="valid"))
model.add(Dropout(0.2))
model.add(Conv1D(filters=16, kernel_size=7, padding="same", strides=2))
model.add(LSTM(units=20, return_sequences=True))
model.add(Dropout(0.2))
model.add(Conv1DTranspose(filters=16, kernel_size=7, padding="same",strides=2))
model.add(Conv1D(filters=32, kernel_size=7, padding="same"))
model.add(MaxPooling1D(pool_size=2,padding="valid"))
model.add(Conv1DTranspose(filters=32, kernel_size=7, padding="same",strides=4,activation="relu"))
model.add(Conv1DTranspose(filters=1, kernel_size=7, padding="same"))
model.compile(optimizer="adam", loss="mse")
model.summary()
history = model.fit(
    x_train,
    x_train,
    epochs=150,
    batch_size=128,
    validation_split=0.1
)
But this took a lot of time. What am I missing? Can anyone guide me?
One more thing: should I use train_test_split for unlabeled data?
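One thing that stands out in the snippet above is that every assignment to x_train overwrites the previous one, so only the RESP windows reach the model. Here is a minimal sketch of stacking the six signals into a single (samples, time_steps, features) array instead, assuming the columns are aligned in time. The data is synthetic, and TIME_STEPS = 32 is a placeholder because the original value is not shown.

```python
import numpy as np

TIME_STEPS = 32  # placeholder; the value used in the question is not shown
COLUMNS = ["HR", "PULSE", "SpO2", "ABPDias", "ABPMean", "RESP"]

def create_sequences(values, time_steps=TIME_STEPS):
    """Slide a window of `time_steps` rows over `values` with stride 1."""
    output = []
    for i in range(len(values) - time_steps):
        output.append(values[i : i + time_steps])
    return np.stack(output)

# Synthetic stand-in for the real data: 500 rows per signal.
# With a pandas DataFrame, data[name] would be a column instead.
rng = np.random.default_rng(0)
data = {name: rng.normal(size=500) for name in COLUMNS}

# Put the six signals side by side first, then window once.
features = np.column_stack([np.asarray(data[name]) for name in COLUMNS])
x_train = create_sequences(features)

print(x_train.shape)  # (468, 32, 6)
```

With this layout the network's output must also have six channels to be compared against x_train, so the last layer would probably need filters=x_train.shape[2] rather than filters=1.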
You cannot do supervised learning without labeled data. It is not preferable to use the features both as input and as label. What you are looking for is clustering-based anomaly detection, which falls under the category of unsupervised learning. DBSCAN, which is available in scikit-learn, might be a good choice for this task.
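A minimal sketch of the DBSCAN suggestion, assuming scikit-learn is installed. The two-feature data set, the eps and min_samples values, and the three planted outliers are all invented for illustration; real vital-sign data would need its own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic "normal" readings for two features, plus three planted outliers.
normal = rng.normal(loc=[75.0, 97.0], scale=[5.0, 1.0], size=(300, 2))
odd = np.array([[160.0, 97.0], [75.0, 70.0], [20.0, 80.0]])
X = np.vstack([normal, odd])

# Scale first so that eps means the same thing for every feature.
X_scaled = StandardScaler().fit_transform(X)

# DBSCAN labels points that belong to no dense region as -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
anomaly_idx = np.flatnonzero(labels == -1)

print("rows flagged as anomalies:", anomaly_idx)
```

Points in the sparse tail of the normal cluster can also be labeled -1, so the flagged rows are candidates to inspect rather than confirmed anomalies.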
You use some layers to encode and then decode the data. The technique you applied is supervised machine learning (ML). Since your dataset is unlabeled, you need to employ unsupervised ML approaches.
Clustering is a technique for finding patterns in unlabelled data with many dimensions.
There are two different approaches to clustering-based anomaly detection.
1- Unsupervised clustering, where the anomaly detection model is trained on unlabelled data that contains both normal and attack traffic.
2- Semi-supervised clustering, where the model is trained on normal data only to build a profile of normal activity.

How do I select train data for LSTM network training

I'm basically new to RNNs, but I'm trying to predict signals based on recordings.
I have two sets of data, A and B. A is the raw recording, and B is the binary label data marking '1' for every active event in A. Both have shape (1895700, 1).
Could you help me figure out what should be used as x and y train?
I have been reading about this and understood that I should loop through A and extract x and y from it. I did this and got shapes of x_train (189555, 150, 1) and y_train (189555, 150, 1), but I am getting an accuracy of 0.0000e+00 and a negative loss.
My other approach was using A as x_train and B as y_train with input shapes of (12638, 150, 1), but from the first step of epoch 1 I had an accuracy of 96 and a loss of around .10. They didn't vary much throughout training.
So I'm not really sure what data should be my input.
model:
model = Sequential()
model.add(LSTM(128, dropout=0.5, input_shape=(ts,features), recurrent_dropout=0.4, return_sequences=True))
model.add(LSTM(128, dropout=0.5, input_shape=(ts,features), recurrent_dropout=0.3, return_sequences=True))
model.add(LSTM(64, dropout=0.5, input_shape=(ts,features), recurrent_dropout=0.3, return_sequences=True))
model.add(Dense(features, input_shape=(ts, features), activation="sigmoid"))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
Thanks in advance!
Your X_train is the data that represents your features, while Y_train is the data that represents the output for those features.
You can split your data by simply passing the validation_split parameter to the fit function:
model.fit(X_data, Y_data, batch_size=4, epochs=5, verbose=1, validation_split=0.2)
In this case it will hold out 20% of the data for validation.
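A minimal sketch of the second approach from the question, with A as input and B as target, both cut into the same non-overlapping windows of 150 steps. The arrays are synthetic stand-ins, and the 4% event rate is invented purely to show how a majority-class baseline can be checked.

```python
import numpy as np

ts = 150
n = 1895700  # length of A and B in the question

rng = np.random.default_rng(0)
A = rng.normal(size=(n, 1))                               # synthetic raw recording
B = (rng.random(size=(n, 1)) < 0.04).astype(np.float32)   # synthetic labels, about 4% ones

# Cut both arrays at the same positions so every x window keeps its own labels.
n_windows = n // ts
x_train = A[: n_windows * ts].reshape(n_windows, ts, 1)
y_train = B[: n_windows * ts].reshape(n_windows, ts, 1)

# Accuracy a model would get by predicting 0 at every time step.
baseline = 1.0 - float(y_train.mean())

print(x_train.shape, y_train.shape, round(baseline, 3))
```

If the real baseline turns out close to the accuracy reported in the question, the network may simply be predicting "no event" everywhere, and a metric such as precision or recall on the positive class would be more informative. Note also that Keras takes the validation_split fraction from the end of the arrays before shuffling, so for a recording this means the last part in time.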

RNN model predicting only one class?

I am trying to use GloVe embeddings to train an RNN model based on this article.
I have labeled data: text (tweets) in one column and labels in another (hate, offensive or neither).
However, the model seems to predict only one class.
This is the LSTM model:
model = Sequential()
hidden_layer = 3
gru_node = 32
# model embedding matrix here....
for i in range(0,hidden_layer):
    model.add(GRU(gru_node,return_sequences=True, recurrent_dropout=0.2))
    model.add(Dropout(dropout))
model.add(GRU(gru_node, recurrent_dropout=0.2))
model.add(Dropout(dropout))
model.add(Dense(64, activation='softmax'))
model.add(Dense(nclasses, activation='softmax'))
start=time.time()
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
fitting the model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
X_train_Glove,X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train, X_test)
model_RNN = Build_Model_RNN_Text(word_index,embeddings_index, 20)
model_RNN.fit(X_train_Glove,y_train,
              validation_data=(X_test_Glove, y_test),
              epochs=4,
              batch_size=128,
              verbose=2)
y_preds = model_RNN.predict_classes(X_test_Glove)
print(metrics.classification_report(y_test, y_preds))
Results:
classification report
Confusion matrix
Am I missing something here?
Update:
this is what the distribution looks like
and the model summary, more or less
What does the distribution of your data look like? The first suggestion is to stratify the train/test split (here is the link to the documentation).
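A minimal sketch of a stratified split with scikit-learn's train_test_split, assuming an imbalanced three-class label array. The class counts are invented and do not come from the question's data.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 6% class 0, 77% class 1, 17% class 2.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 770 + [2] * 170)

# stratify=y keeps the class proportions (almost) identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", train_counts)
print("test: ", test_counts)
```

Stratifying only fixes the split. If one class still dominates, the classification report will show it, and class weights or resampling may be worth trying.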
The second question is how much data you have compared with the complexity of the model. Maybe your model is so complex that it simply overfits. You can call model.summary() to see the number of trainable parameters.

Predict future values using Keras

I am pretty new to creating neural networks, and I am trying to use Keras in Python to create a model that will predict a sequence based on the first ~100 inputs.
I am using 11x10000 arrays to predict a single output value, and I have created a NN that fits that regression pretty well. However, I want to create another NN that will predict the inputs for the regression NN.
I have a very large dataset at my disposal for training, and I am trying to predict an 11x10000 array using the first few inputs of an array.
I can get it to predict the expected value at the current time step, but I want it to predict far-future values based on current and near-future values.
Here is some code:
names = getNames()
test = getNames()
for index in range(len(names)):
    print(index)
    X, Y = getData(names, index, False)
    X_test, Y_test = getData(test, index, True)
    NN = makeNN(X,Y, X_test, X_test)
def getModel():
    keras.backend.clear_session()
    model = Sequential()
    model.add(Dense(100, input_dim=11, kernel_initializer='normal', activation='relu'))
    model.add(Dense(10, init='uniform', activation='relu'))
    model.add(Dense(11, init='uniform', activation='relu'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
def makeNN(X, Y, X_test, Y_test):
    try:
        model = keras.models.load_model('./trajSweepNNVacc.h5')
        print("previous model loaded")
    except:
        model = getModel()
        print("new model created")
    X_train = X[0:]
    print('Training -----------')
    model.fit(X_train, X_train, epochs=2)
    print('\nTesting ------------')
    cost = model.evaluate(X_test, Y_test)
    print('test cost:', cost)
    W, b = model.layers[0].get_weights()
    model.save('trajSweepNNVacc.h5')
    return model
I want to get it to predict the future 10000 values, but currently I can only have it predict a current value

Is this the correct way to predict the next value in Keras?

here is my code
...
look_back = 20
train_size = int(len(data) * 0.80)
test_size = len(data) - train_size
train = data[0:train_size]
test = data[train_size:len(data)]
x_train, y_train = create_dataset(train, look_back)
x_test, y_test = create_dataset(test, look_back)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
y_train=np.repeat(y_train.reshape(-1,1), 20, axis=1).reshape(-1,20,1)
y_test=np.repeat(y_test.reshape(-1,1), 20, axis=1).reshape(-1,20,1)
...
model = Sequential()
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(1, return_sequences=True))
model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
model.summary()
model.fit(x_train, y_train, epochs=10, batch_size=64)
p = model.predict(x_test)
I want to predict the next value, so I ran
predictions = model.predict(x_train), and its shape is (62796, 20, 1).
I then wrote the following, based on the question "How to use the Keras model to forecast for future dates or events?":
future = []
currentStep = predictions[-20:, :, :] # -20 is last look_back number
for i in range(10):
    currentStep = model.predict(currentStep)
    future.append(currentStep)
With this code, the result in future is:
but the result of p = model.predict(x_test)[:4000] is:
The difference between the two results is very large.
Is this the right way to predict the next value?
I don't know whether my approach or my code is wrong.
I would appreciate your opinion.
The full source is at https://gist.github.com/Lay4U/654f70bd1fb9c4f7d5bdb21ddcb588ab
According to your code, you are trying to predict the next value using an LSTM.
So you have to reshape your input data correctly to reflect the time steps and features.
model.add(LSTM(512, return_sequences=True))
Instead of this code you have to write:
model.add(LSTM(512, input_shape=(look_back,x)))
where x is the number of input features in your training data.
I guess this article will help to moderate your code and predict the future value:
enter link description here
This article will help you to understand more about how to predict future value:
enter link description here
Thank you
There are multiple methods you can try; there is no single right way at the moment. You can train a separate model for each of t+1, t+2, ..., t+n: one LSTM model predicts t+1 while another predicts t+n. That is called the direct strategy (the related DIRMO strategy trains one multi-output model per block of horizons).
Your strategy (the recursive strategy) is particularly risky because errors in early predictions are fed back in as inputs and propagate across the forecast horizon.
You can find a good comparison of alternative strategies in this paper.
https://www.sciencedirect.com/science/article/pii/S0957417412000528?via%3Dihub
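The difference between the recursive approach and one-model-per-horizon can be sketched without Keras. The example below uses a least-squares linear model as a stand-in for any regressor, on a synthetic sine series; every function name here is hypothetical. Note that the recursive loop slides one window of the last look_back values forward, whereas predictions[-20:, :, :] in the question selects the last 20 samples of the prediction array.

```python
import numpy as np

def make_xy(series, look_back, horizon):
    """Pair each window of `look_back` values with the value `horizon` steps after it."""
    n = len(series) - look_back - horizon + 1
    X = np.stack([series[i:i + look_back] for i in range(n)])
    start = look_back + horizon - 1
    return X, series[start:start + n]

def fit_linear(X, y):
    """Least-squares weights with a bias term; a stand-in for any regressor."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, window):
    return float(np.append(window, 1.0) @ w)

def recursive_forecast(w1, last_window, steps):
    """Use one 1-step model repeatedly, feeding each prediction back as input."""
    window = np.array(last_window, dtype=float)
    out = []
    for _ in range(steps):
        nxt = predict_linear(w1, window)
        out.append(nxt)
        window = np.append(window[1:], nxt)  # drop oldest value, append prediction
    return np.array(out)

def direct_forecast(models, last_window):
    """Use a separate model for every horizon; no prediction is fed back."""
    return np.array([predict_linear(w, last_window) for w in models])

series = np.sin(np.arange(400) * 0.1)  # synthetic data
look_back, steps = 20, 10
train = series[:300]
last_window = train[-look_back:]

w1 = fit_linear(*make_xy(train, look_back, 1))
models = [fit_linear(*make_xy(train, look_back, h)) for h in range(1, steps + 1)]

rec = recursive_forecast(w1, last_window, steps)
direct = direct_forecast(models, last_window)
actual = series[300:300 + steps]
print(np.abs(rec - actual).max(), np.abs(direct - actual).max())
```

On a clean sine wave both strategies are nearly exact, so this only shows the mechanics. On noisy data the recursive errors tend to compound with each step, which is the risk described above.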
