Shape of data and LSTM Input for varying timesteps

For my master's thesis, I want to predict the price of a stock in the next hour using an LSTM model. My X data contains 30,000 rows with 6 dimensions (= 6 features); my Y data contains 30,000 rows and only 1 dimension (= target variable). For my first LSTM model, I reshaped the X data to (30000, 1, 6) and the Y data to (30000, 1), and defined the input like this:
input_nn = Input(shape=(1, 6))
I am not sure how to reshape the data and how to define the input shape for the model if I want to increase the number of timesteps. I still want to predict the stock price in the next hour, but include more previous time steps.
Do I have to add the data from previous timesteps to my X data in the second dimension?
Can you explain what the number of units of an LSTM refers to exactly? Should it be the same as the number of timesteps in my case?

You are on the right track, but you are confusing the number of units with the number of timesteps. units is a hyperparameter that controls the output dimension of the LSTM, i.e. the size of the LSTM output vector. If the input is (1, 6) and you have 32 units, you will get (32,): the LSTM traverses the single timestep and produces a vector of size 32.
Timesteps refers to the size of the history you want your LSTM to consider, so it isn't the same as units at all. Instead of processing the data yourself, you can use Keras's handy TimeseriesGenerator, which takes 2D data like yours and uses a sliding window of some timestep size to generate timeseries data. From the documentation:
from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np
data = np.array([[i] for i in range(50)])
targets = np.array([[i] for i in range(50)])
data_gen = TimeseriesGenerator(data, targets,
                               length=10, sampling_rate=2,
                               batch_size=2)
assert len(data_gen) == 20
batch_0 = data_gen[0]
x, y = batch_0
assert np.array_equal(x,
                      np.array([[[0], [2], [4], [6], [8]],
                                [[1], [3], [5], [7], [9]]]))
assert np.array_equal(y,
                      np.array([[10], [11]]))
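which you can pass directly to model.fit_generator(data_gen, ...) (in recent Keras versions, where fit_generator is deprecated, model.fit(data_gen, ...) accepts the generator as well). This gives you the option to try out different sampling rates, timesteps, etc. You should probably investigate these parameters and how they affect the result in your thesis.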
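To see what shape such a generator produces, here is a minimal NumPy-only sketch that mimics the windowing of the documentation example. `make_windows` is a hypothetical helper, not part of Keras; it assumes each sample consists of the `length` rows preceding the target row, taking every `sampling_rate`-th row.

```python
import numpy as np

def make_windows(data, targets, length, sampling_rate=1):
    """Hypothetical helper: build (samples, timesteps, features) windows."""
    # Row i is predicted from the `length` rows before it.
    rows = np.arange(length, len(data))
    x = np.array([data[i - length:i:sampling_rate] for i in rows])
    y = targets[rows]
    return x, y

data = np.array([[i] for i in range(50)])
targets = np.array([[i] for i in range(50)])
x, y = make_windows(data, targets, length=10, sampling_rate=2)

print(x.shape, y.shape)    # (40, 5, 1) (40, 1)
print(x[0].ravel(), y[0])  # [0 2 4 6 8] [10]
```

Applied to the stock data, the same idea would give an X of shape (samples, timesteps, 6), so the model input would become Input(shape=(timesteps, 6)).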

Update with code that is roughly 5 times quicker than the last one:
x = np.load(nn_input + "/EOAN" + "/EOAN_X" + ".npy")
y = np.load(nn_input + "/EOAN" + "/EOAN_Y" + ".npy")
num_features = x.shape[1]
num_time_steps = 500
for train_index, test_index in tscv.split(x):
    # Split into train and test set
    print("Fold:", fold_counter, "\n" + "Train Index:", train_index, "Test Index:", test_index)
    x_train_raw, y_train, x_test_raw, y_test = x[train_index], y[train_index], x[test_index], y[test_index]
    # Scaling the data
    scaler = StandardScaler()
    scaler.fit(x_train_raw)
    x_train_raw = scaler.transform(x_train_raw)
    x_test_raw = scaler.transform(x_test_raw)
    # Creating Input Data with variable timesteps
    x_train = np.zeros((x_train_raw.shape[0] - num_time_steps + 1, num_time_steps, num_features), dtype="float32")
    x_test = np.zeros((x_test_raw.shape[0] - num_time_steps + 1, num_time_steps, num_features), dtype="float32")
    for row in range(len(x_train)):
        for timestep in range(num_time_steps):
            x_train[row][timestep] = x_train_raw[row + timestep]
    for row in range(len(x_test)):
        for timestep in range(num_time_steps):
            x_test[row][timestep] = x_test_raw[row + timestep]
    y_train = y_train[num_time_steps - 1:]
    y_test = y_test[num_time_steps - 1:]
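The two nested loops can probably be replaced by a single vectorised call. The sketch below assumes NumPy 1.20 or newer (for `sliding_window_view`) and uses small random data as a stand-in for the EOAN arrays; it only checks that the result matches the loop version.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
x_raw = rng.normal(size=(20, 6)).astype("float32")  # stand-in for x_train_raw
num_time_steps = 5
num_features = x_raw.shape[1]

# Loop version, as in the code above
x_loop = np.zeros((x_raw.shape[0] - num_time_steps + 1, num_time_steps, num_features), dtype="float32")
for row in range(len(x_loop)):
    for timestep in range(num_time_steps):
        x_loop[row][timestep] = x_raw[row + timestep]

# Vectorised version: windows taken along axis 0 come back as
# (samples, features, timesteps), so the last two axes are swapped.
x_vec = sliding_window_view(x_raw, num_time_steps, axis=0).transpose(0, 2, 1)

print(x_vec.shape)                    # (16, 5, 6)
print(np.array_equal(x_loop, x_vec))  # True
```

The result is a read-only view on the original array, so call .copy() on it if the windows need to be modified afterwards.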

Related

Problem with dimensionality in Keras RNN - reshape isn't working?

Consider this random dataset, on which I want to train an RNN:
import random
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
from keras.optimizers import SGD
import numpy as np
df_train = random.sample(range(1, 100), 50)
I want to apply RNN with lag equal to 1. I'll use my own function:
def create_dataset(dataset, lags):
    dataX, dataY = [], []
    for i in range(lags):
        subdata = dataset[i:len(dataset) - lags + i]
        dataX.append(subdata)
    dataY.append(dataset[lags:len(dataset)])
    return np.array(dataX), np.array(dataY)
which trims the data according to the number of lags. It returns two numpy arrays: the first holds the independent variables and the second the dependent variable.
x_train, y_train = create_dataset(df_train, lags = 1)
But now, when I try to build and fit the model:
model = Sequential()
model.add(SimpleRNN(1, input_shape=(1, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=SGD(lr = 0.1))
history = model.fit(x_train, y_train, epochs=1000, batch_size=50, validation_split=0.2)
I get this error:
ValueError: Error when checking input: expected simple_rnn_18_input to have 3 dimensions, but got array with shape (1, 49)
I've read about it, and the suggested solution is to apply reshape:
x_train = np.reshape(x_train, (x_train.shape[0], 1, x_train.shape[1]))
but when I apply it, I get another error:
ValueError: Error when checking input: expected simple_rnn_19_input to have shape (1, 1) but got array with shape (1, 49)
and I'm not sure where the mistake is. Could you please tell me what I'm doing wrong?

What you are calling lags is called look back in the literature. This technique feeds the RNN more contextual data so that it can learn mid/long-range dependencies.
The error is telling you that you are feeding the layer (expected shape: 1x1) with the dataset (shape: 1x49).
There are 2 reasons behind the error:
The first is your create_dataset, which builds a stack of 1x(50 - lags) = 1x49 vectors, the opposite of what you want: vectors of size 1x(lags) = 1x1.
In particular, this line is responsible:
subdata = dataset[i:len(dataset) - lags + i]
# with lags = 1 you have just one
# iteration in range(1): i = 0
subdata = dataset[0:50 - 1 + 0]
subdata = dataset[0:49] # which is a 1x49 vector
# In order to create the right vector
# you need to change your function:
def create_dataset(dataset, lags = 1):
    dataX, dataY = [], []
    # iterate to a max of (50 - lags - 1) times
    # because we need "lags" element in each vector
    for i in range(len(dataset) - lags - 1):
        # get "lags" elements from the dataset
        subdata = dataset[i:i + lags]
        dataX.append(subdata)
        # get only the last label representing
        # the current element iteration
        dataY.append(dataset[i + lags])
    return np.array(dataX), np.array(dataY)
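The second reason is the input shape: if you use look back in your RNN, you also need to increase the input dimensions, because each sample now includes the preceding samples as well.
The network is indeed looking at more than one sample, because it needs to "look back" over several samples to capture mid/long-range dependencies.
With the corrected function, x_train has shape (samples, lags), so the reshape you already tried, np.reshape(x_train, (x_train.shape[0], 1, x_train.shape[1])), produces the (samples, 1, lags) array that the layer expects.
This is more conceptual than practical here: your code is fine as it is because lags = 1:
model.add(SimpleRNN(1, input_shape=(1, 1)))
# in general, you should use the number of lags in the input shape
model.add(SimpleRNN(1, input_shape=(1, LAGS)))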
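To check the shapes without training anything, here is a small NumPy-only sketch of the look-back construction followed by the reshape. It is an illustration under assumptions: it uses range(len(data) - lags), which keeps one more sample than the function above, and a simple 0..49 series instead of the random data.

```python
import numpy as np

def create_dataset(dataset, lags=1):
    """Build (samples, lags) inputs and the value that follows each window."""
    data = np.asarray(dataset, dtype="float32")
    x = np.array([data[i:i + lags] for i in range(len(data) - lags)])
    y = data[lags:]
    return x, y

series = np.arange(50)
lags = 3
x, y = create_dataset(series, lags)

# Reshape to 3-D (samples, 1, lags) to match input_shape=(1, lags)
x = x.reshape(x.shape[0], 1, lags)

print(x.shape, y.shape)  # (47, 1, 3) (47,)
print(x[0, 0], y[0])     # [0. 1. 2.] 3.0
```

Each row of x now holds the `lags` values that precede the corresponding entry of y, which is what a layer declared with input_shape=(1, lags) would receive.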

KERAS Classification only use some of the digits on Mnist dataset

# Creating a Sequential Model and adding the layers
input_img = Input(shape=(28, 28, 1))
#63 kernels - Conv of 3X3
conv_1 = Conv2D(63, kernel_size=(3,3),activation='relu', padding='same')(input_img)
#Then pooling of 2X2
encoded = MaxPooling2D((2, 2), padding='same')(conv_1)
#model.add(Dropout(0.2))
###Classification###
# Flattening the 2D arrays for fully connected layers
flatten = Flatten()(encoded)
# Adding dense layer
fc = Dense(1000, activation='relu')(flatten)
fc1 = (Dropout(0.2))(fc)
#A6 = model.add(Dropout(0.2),name = 'A6') #Combat Overfitting, drop random elements
#Softmax layer must have neurons = range of labels, 0-9 for this case
softmax = Dense(5, activation='softmax', name='classification')(fc1)
model = Model(inputs=input_img, outputs=[softmax])
When I call model.fit on this model, the following error occurs:
ValueError: Data cardinality is ambiguous:
x sizes: 30703
y sizes: 30703, 51660
Please provide data which shares the same first dimension.
What I am trying to achieve: I want to run a Keras classifier on the MNIST dataset after removing some of the digits, leaving only 0, 1, 2, 3 and 9, a total of 5 classes. I need to re-index the labels so that I can use a dense output layer with 5 units instead of 10 (which would be needed to cover label 9). I have done the following, but the error above occurs. Kindly advise, thanks.
# Transform y_train (and similarly y_test).
uniquetrain, index = np.unique(y_train, return_inverse=True)
y_train = np.arange(len(uniquetrain))[index]
# To get back the original labels, just index into the unique values.
unique[y_train]
# Transform y_train (and similarly y_test).
uniquetest, index1 = np.unique(y_test, return_inverse=True)
y_test = np.arange(len(uniquetest))[index1]
# To get back the original labels, just index into the unique values.
unique[y_test]

You can create a mask to keep the instances of the desired labels and also create a mapping from old labels to new ones. See the code below:
X_train = ... # Tensor of shape [ELEMS, 28, 28, 1]
y_train = ... # Tensor of shape [ELEMS]
X_test = ... # Tensor of shape [TEST_ELEMS, 28, 28, 1]
y_test = ... # Tensor of shape [TEST_ELEMS]
# Labels you want to keep
keep_labels = [0, 1, 2, 3, 9]
# Map old labels to new labels; for instance, label 9 becomes 4
# in the new set of labels
labels_to_index = {l: i for i, l in enumerate(keep_labels)}
# Masks selecting the training and test instances to keep (True keeps the instance).
# The builtin bool is used because the np.bool alias was removed in NumPy 1.24.
train_keep_mask = np.zeros(y_train.shape[0], dtype=bool)
test_keep_mask = np.zeros(y_test.shape[0], dtype=bool)
for l in keep_labels:
    train_keep_mask |= y_train == l
    test_keep_mask |= y_test == l
# Apply masks to filter the training and test instances
X_train = X_train[train_keep_mask]
y_train = y_train[train_keep_mask]
X_test = X_test[test_keep_mask]
y_test = y_test[test_keep_mask]
# From now on X_train, y_train, X_test, y_test only contain the desired labels
# which are defined in `keep_labels`
# Map old labels to new ones
new_y_train = np.array([labels_to_index[l] for l in y_train.tolist()])
new_y_test = np.array([labels_to_index[l] for l in y_test.tolist()])
# To invert the labels use `keep_labels[new_label]`
again_old_y_train = np.array([keep_labels[l] for l in new_y_train.tolist()])
again_old_y_test = np.array([keep_labels[l] for l in new_y_test.tolist()])
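The same filtering and relabelling can be sketched more compactly with np.isin and np.searchsorted. This is an alternative illustration, not the code of the answer above; it uses tiny synthetic arrays in place of MNIST and assumes keep_labels is sorted.

```python
import numpy as np

keep_labels = np.array([0, 1, 2, 3, 9])  # must be sorted for searchsorted

# Synthetic stand-ins for the MNIST arrays
y = np.array([5, 0, 9, 3, 7, 1, 9, 2])
X = np.arange(len(y) * 4).reshape(len(y), 2, 2, 1)

# One mask, applied to both X and y, keeps their first dimensions equal
mask = np.isin(y, keep_labels)
X_kept, y_kept = X[mask], y[mask]

# Position of each kept label inside keep_labels is its new class index
new_y = np.searchsorted(keep_labels, y_kept)
# Indexing keep_labels with the new labels restores the original ones
old_y = keep_labels[new_y]

print(X_kept.shape, new_y.shape)  # (6, 2, 2, 1) (6,)
print(new_y)                      # [0 4 3 1 4 2]
```

Applying the same mask to the images and the labels is what keeps the sample counts aligned, which is the condition the "Data cardinality is ambiguous" error complains about.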

I wonder if I was right about the implement of lstm layer using keras

Here is my model definition:
model = Sequential()
model.add(LSTM(i, input_shape=(None, 1), return_sequences=True))
model.add(Dropout(l))
model.add(LSTM(j))
model.add(Dropout(l))
model.add(Dense(k))
model.add(Dropout(l))
model.add(Dense(1))
and here is result
p = model.predict(x_test)
plt.plot(y_test)
plt.plot(p)
The input sequence represents the signal in the previous time steps, and the output is the predicted signal at the next time step. After splitting the data into training and test sets, I plotted the predictions on the test data:
[figure: test data and predictions plotted together]
The figure shows an almost perfect match between the true test data and the predictions. Is it possible to predict with such high accuracy?
I think something is wrong because there's no volatility, so I wonder whether it has been implemented properly.
If the implementation is correct, how can I get the following (next) value?
Is it right to do it like this?
a = x_test[-1]
b = model.predict(a)
c = model.predict(b)
...
To sum up the question:
Is the implementation right way?
I wonder how to get the value of the next data.
def create_dataset(signal_data, look_back=1):
    dataX, dataY = [], []
    for i in range(len(signal_data) - look_back):
        dataX.append(signal_data[i:(i + look_back), 0])
        dataY.append(signal_data[i + look_back, 0])
    return np.array(dataX), np.array(dataY)
train_size = int(len(signal_data) * 0.80)
test_size = len(signal_data) - train_size - int(len(signal_data) * 0.05)
val_size = len(signal_data) - train_size - test_size
train = signal_data[0:train_size]
val = signal_data[train_size:train_size+val_size]
test = signal_data[train_size+val_size:len(signal_data)]
x_train, y_train = create_dataset(train, look_back)
x_val, y_val = create_dataset(val, look_back)
x_test, y_test = create_dataset(test, look_back)
I use create_dataset with look_back=20.
signal_data is preprocessed with min-max normalisation MinMaxScaler(feature_range=(0, 1)).

Is the implementation right way?
Your code seems correct. I don't think these results are surprising. You should compare them with a baseline in which the next value is the last observed value plus a change sampled at random from the range of possible step-to-step changes. This way you can at least tell whether your model is doing better than random sampling.
# y_train and x_test are the arrays returned by create_dataset;
# x_test is assumed to still be 2-D here, with shape (samples, look_back)
delta_train = y_train[1:] - y_train[:-1]
delta_range_train = delta_train.max() - delta_train.min()
# generating the baseline based on the change range in training:
random_p = x_test[:, -1] + (np.random.rand(x_test.shape[0]) - 0.5) * delta_range_train
You can then check whether your results are better than the random sample random_p.
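A baseline comparison along these lines can be sketched as follows. Everything here is an assumption for illustration: the signal is a synthetic random walk rather than the real data, no train/test split is made, and the persistence baseline (predicting the last observed value) is an addition that the answer above does not mention.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=500))  # synthetic random walk, not real data
look_back = 20

# Windows and next-step targets, as create_dataset would build them
x = np.array([signal[i:i + look_back] for i in range(len(signal) - look_back)])
y = signal[look_back:]

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Random baseline: last value plus a change drawn from the observed range
delta = y[1:] - y[:-1]
delta_range = delta.max() - delta.min()
random_p = x[:, -1] + (rng.random(len(x)) - 0.5) * delta_range

# Persistence baseline (assumption, not in the answer): repeat the last value
persistence_p = x[:, -1]

print("random baseline RMSE:     ", rmse(y, random_p))
print("persistence baseline RMSE:", rmse(y, persistence_p))
```

If the model's RMSE on the test set is not clearly below that of such baselines, a near-perfect looking plot may simply mean that the network has learned to follow the previous value.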
I wonder how to get the value of the next data.
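this gives you the last window in the test set (the slice keeps the batch dimension):
a = x_test[-1:]
then, here you are predicting the next point:
b = model.predict(a)
with look_back > 1, you need to keep the most recent points of the window and append the prediction in order to predict the next-next point:
# assuming a has shape (1, look_back, 1) and b has shape (1, 1)
c = model.predict(np.concatenate([a[:, 1:], b.reshape(1, 1, 1)], axis=1))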
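Repeating that step gives a multi-step forecast. The sketch below is a hedged illustration: `forecast` and `dummy_predict` are hypothetical names, the window is assumed to have shape (1, look_back, 1), and a simple mean stands in for model.predict so that the example runs without Keras.

```python
import numpy as np

def forecast(predict_fn, last_window, n_steps):
    """Roll a one-step predictor forward n_steps times."""
    window = np.array(last_window, dtype="float32")  # shape (1, look_back, 1)
    preds = []
    for _ in range(n_steps):
        nxt = np.asarray(predict_fn(window)).reshape(1, 1, 1)
        preds.append(float(nxt[0, 0, 0]))
        # Drop the oldest value and append the new prediction
        window = np.concatenate([window[:, 1:], nxt], axis=1)
    return np.array(preds)

# Stand-in predictor: the mean of the window. Replace with model.predict.
def dummy_predict(window):
    return window.mean(axis=1)  # shape (1, 1)

last_window = np.arange(4, dtype="float32").reshape(1, 4, 1)  # values 0, 1, 2, 3
predictions = forecast(dummy_predict, last_window, n_steps=3)
print(predictions)
```

Because every step feeds earlier predictions back in as inputs, errors tend to accumulate, so forecasts far beyond the last observed point should be treated with caution.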

Can't interpret prediction with neural network use

I'm trying to use TensorFlow in Python to make predictions with cryptocurrency data. The problem is that the output of the prediction is a number between 0.1 and 0.9, whereas the cryptocurrency prices are in the 10000-10100 range, and I can't find a way to convert the 0.* number back to a real price.
I've tried to build a ratio by subtracting max - min of the predicted values and max - min of the test data and dividing one by the other, but when I multiply the prediction by this ratio the error is large (I get 14000 instead of something near 10000).
Here is some code:
train_start = 0
train_end = int(np.floor(0.7*n))
test_start = train_end
test_end = n
data_train = data[np.arange(train_start, train_end), :]
data_test = data[np.arange(test_start, test_end), :]
Scale data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
Build X and y:
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]
.
.
.
n_data = 10
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1
X = tf.compat.v1.placeholder(dtype=tf.compat.v1.float32, shape=[None, n_data])
Y = tf.compat.v1.placeholder(dtype=tf.compat.v1.float32, shape=[None])
Hidden layer
..
Output layer (must be transposed)
..
Cost function
..
Optimizer
..
Make Session:
sess = tf.compat.v1.Session()
Run initializer:
sess.run(tf.compat.v1.global_variables_initializer())
Setup interactive plot:
plt.ion()
fig = plt.figure()
ax1 = fig.add_subplot(111)
line1, = ax1.plot(y_test)
line2, = ax1.plot(y_test*0.5)
plt.show()
epochs = 10
batch_size = 256
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        sess.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 5) == 0:
            # Prediction
            pred = sess.run(out, feed_dict={X: X_test})
            # This pred var is the output of the prediction
I persist my results in a file, and this is what they look like:
2019-08-21 06-AM;15310.444858356934;0.50021994;
2019-08-21 12-PM;14287.717187390663;0.46680558;
2019-08-21 06-PM;14104.63871795706;0.46082407;
For example, the last prediction is 0.46, but when I try to convert it I get 14104, whereas it should be closer to 10000.
Does anyone have an idea how to convert those predictions?
Thanks!

You will have to use the inverse_transform method of MinMaxScaler to convert the output back from the 0-1 range. Note that the scaler was fitted on all columns of data_train, so inverse_transform expects an array with that same number of columns; either place the predictions in column 0 of such an array, or fit a separate scaler on the target column only.
You have not shown your model, but I believe you are solving a regression task with a few dense layers. You will have to keep minimizing your loss. If you are using mean squared error, the larger the loss, the more likely your output is far from the desired results.
If the loss is small and the results are good on the training samples but the predictions are bad on the test dataset, consider enlarging your training dataset so that more cases are covered. If that is not possible, consider reducing the number of neurons in your network so that it stops overfitting.
You can also do some postprocessing to restrict the output to a desired range.
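The arithmetic behind inverting a min-max scaling for the target column alone can be sketched in plain NumPy. The numbers below are made up for illustration; with a fitted scikit-learn MinMaxScaler, the per-column minimum and maximum used here should be available as scaler.data_min_[0] and scaler.data_max_[0].

```python
import numpy as np

# Synthetic stand-in: column 0 is the price (target), the rest are features
data_train = np.array([[10000.0, 1.0, 5.0],
                       [10050.0, 2.0, 7.0],
                       [10100.0, 3.0, 9.0]])

col_min = data_train.min(axis=0)
col_max = data_train.max(axis=0)

# Min-max scaling to the (0, 1) range, column by column
scaled = (data_train - col_min) / (col_max - col_min)

# Pretend these are model outputs in the scaled space
pred_scaled = np.array([0.0, 0.5, 1.0])

# Invert the scaling using the statistics of the target column only
pred_price = pred_scaled * (col_max[0] - col_min[0]) + col_min[0]
print(pred_price)  # [10000. 10050. 10100.]
```

Using the minimum and maximum of the training target column, rather than a ratio derived from the predictions themselves, is what maps the outputs back onto the original price scale.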

LSTM timeseries prediction with multiple outputs

I have a dataset with 3 features in a timeseries. The dimension of the dataset is 1000 x 3 (1000 timesteps and 3 features). Basically, 1000 rows and 3 columns
The data looks like this:
A B C
131 111 100
131 110 120
131 100 100
...
131 100 100
The problem is how to train on the first 25 steps and predict the next 25 steps, so that the output contains predictions for all 3 features (A, B and C). I successfully trained and predicted with a 1-D array (1 feature, A), but I have no idea how to predict the 3 features using the same dataset.
I got this error:
Error when checking target: expected dense_1 to have shape (None, 3) but got array with shape (21, 1)
The code is below:
# -*- coding: utf-8 -*-
import numpy as np
import numpy
import matplotlib.pyplot as plt
import pandas
import math
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        a = dataset[i:(i + look_back), :]
        dataX.append(a)
        dataY.append(dataset[i + look_back, :])
    return numpy.array(dataX), numpy.array(dataY)
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset
dataframe = pandas.read_csv('v77.csv', engine='python',skiprows=0)
dataset = dataframe.values
print(dataset)
# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = 10
test_size = 10
train, test = dataset[0:train_size, :], dataset[train_size:train_size+test_size, :]
print (train_size,test_size)
# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
print(trainX)
# reshape input to be [samples, time steps, features]
#trainX = numpy.reshape(trainX, (trainX.shape[0], look_back, 3))
#testX = numpy.reshape(testX, (testX.shape[0],look_back, 3))
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(32, input_shape=(3,3)))
model.add(Dense(3))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(trainX, trainY, validation_split=0.33, epochs=10, batch_size=16)
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# print testPredict
# print np.shape(testPredict)
# Get something which has as many features as dataset
trainPredict_extended = numpy.zeros((len(trainPredict),3))
print(trainPredict_extended)
print(np.shape(trainPredict_extended[:,2]))
print(np.shape(trainPredict[:,0]))
# Put the predictions there
trainPredict_extended[:,2] = trainPredict[:,0]
# Inverse transform it and select the 3rd column.
trainPredict = scaler.inverse_transform(trainPredict_extended)[:,2]
# print(trainPredict)
# Get something which has as many features as dataset
testPredict_extended = numpy.zeros((len(testPredict),3))
# Put the predictions there
testPredict_extended[:,2] = testPredict[:,0]
# Inverse transform it and select the 3rd column.
testPredict = scaler.inverse_transform(testPredict_extended)[:,2]
# print testPredict_extended
trainY_extended = numpy.zeros((len(trainY),3))
trainY_extended[:,2]=trainY
trainY=scaler.inverse_transform(trainY_extended)[:,2]
testY_extended = numpy.zeros((len(testY),3))
testY_extended[:,2]=testY
testY=scaler.inverse_transform(testY_extended)[:,2]
# print
# print testY
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY, trainPredict))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY, testPredict))
print('Test Score: %.2f RMSE' % (testScore))
Sample data:
v77.txt
Help Needed. Thanks

Your Y shape does not match the last layer in your model. Your Y has the shape (num_samples, 1), which means that every sample has a target vector of length 1.
Your last layer, however, is a Dense(3) layer, which outputs (num_samples, 3), which means that for every sample it outputs a vector of length 3.
Since the output of your neural network and your y data don't have the same shape, the neural network cannot train.
You can fix this in two ways:
1. Convert the output of your neural network to the shape of your y data by replacing Dense(3) with Dense(1):
model = Sequential()
model.add(LSTM(32, input_shape=(3,3)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(trainX, trainY, validation_split=0.33, epochs=10, batch_size=16)
2. Convert the shape of your y data to the output of your neural network by modifying your create_dataset() function so that all of the features are added to y instead of just one:
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        a = dataset[i:(i + look_back), :]
        dataX.append(a)
        dataY.append(dataset[i + look_back, :])
    return numpy.array(dataX), numpy.array(dataY)
Since you stated that you want to predict 3 features, you will most likely use the second option. Note that the second option does break the last part of your code, which extends y to 3 columns, but your model trains fine.
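The shapes produced by the second option can be checked with a short NumPy-only sketch. The data is a synthetic 10 x 3 array standing in for the v77 file, and the snake_case names are chosen here for illustration.

```python
import numpy as np

def create_dataset(dataset, look_back=1):
    """Windows of look_back rows as X, the full following row as Y."""
    data_x, data_y = [], []
    for i in range(len(dataset) - look_back - 1):
        data_x.append(dataset[i:i + look_back, :])
        data_y.append(dataset[i + look_back, :])
    return np.array(data_x), np.array(data_y)

dataset = np.arange(30, dtype="float32").reshape(10, 3)  # 10 timesteps, 3 features
train_x, train_y = create_dataset(dataset, look_back=3)

print(train_x.shape)  # (6, 3, 3) -> matches LSTM input_shape=(3, 3)
print(train_y.shape)  # (6, 3)    -> matches Dense(3)
```

With all three features in y, the predictions have as many columns as the scaled dataset, so it should be possible to pass them to scaler.inverse_transform directly instead of padding them into a zero-filled array first.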
